Prevention of Data Corruption in split brain Scenario for High Availability Clusters

IP.com Disclosure Number: IPCOM000247764D
Publication Date: 2016-Oct-06
Document File: 5 page(s) / 56K

Publishing Venue

The IP.com Prior Art Database


Disclosed is a method for preventing data corruption in split-brain scenarios in a cluster environment. The method handles situations in which an application's data could be accessed concurrently from both islands of a partitioned cluster. It identifies the machines that have become isolated from the cluster and shuts them down through the Hardware Management Console (HMC). The shutdown is issued from the remote island, that is, the island expected to bring the application online on the assumption that the other island is down. Treating data integrity as the priority, this approach ensures that in a partitioned cluster the application is brought up on only one island, without data corruption, and in less time than a planned move of the application would take.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

The cluster computing model is widely used across the computing industry for its high performance, high availability and lower cost. Another important requirement for a clustering model is to prevent corruption of the data of the application running on the machines while keeping the application highly available in all situations. In general, data corruption arises when more than one machine accesses the application data at the same time, and clustering technologies normally do not allow this. In many clustering technologies, however, the situation cannot be completely avoided during a cluster split.

In a high availability cluster, consider an application running on the primary machine when a split occurs between the primary and secondary machines: each machine assumes the other is down and tries to acquire the application resources and run the application. Many policies and methods exist for handling split scenarios. The problem addressed here is an I/O freeze on one partition that lasts for a long and unpredictable period of time. Such a freeze leads to a false machine-down detection, after which the other machine forcibly takes the application over from the source machine. When the source machine comes back, it tries to complete its pending events and accesses the same application disk, which leads to data corruption.
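The timing of this failure can be illustrated with a small sketch. It is not part of the disclosure: the function name, the machine labels and the fixed detection timeout are all assumptions made for illustration. It shows only that once a freeze outlasts the peer's down-detection timeout, both machines may end up writing to the shared disk.

```python
def disk_writers(t, freeze_start, freeze_length, detection_timeout):
    """Return the machines that may write to the shared disk at time t.

    Hypothetical model: the primary freezes at freeze_start for
    freeze_length seconds; the secondary declares it DOWN after
    detection_timeout seconds of silence and takes the application over.
    """
    writers = set()
    freeze_end = freeze_start + freeze_length
    takeover_time = freeze_start + detection_timeout

    # The primary can write whenever it is not frozen, including after it
    # resumes and tries to complete its pending events.
    if t < freeze_start or t >= freeze_end:
        writers.add("primary")

    # The secondary takes over only if the freeze outlasts the timeout,
    # and it keeps the application from the takeover onwards.
    if freeze_length > detection_timeout and t >= takeover_time:
        writers.add("secondary")

    return writers


# Freeze from t=10 to t=70 with a 30-second detection timeout.
for t in (5, 45, 75):
    print(t, sorted(disk_writers(t, 10, 60, 30)))
```

In this model the two-writer window opens as soon as the frozen primary resumes, which is the interval the method aims to close by powering the unresponsive machine off first.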

The following is an example of a split brain that cannot be avoided from inside the cluster machines.

This method prevents access to the storage that is in use by the application on the remote machine, where the application is brought up after the split.

The core idea of this method is to prevent the unresponsive machine from accessing the storage during a cluster split, and to make sure that the application running on the secondary machine is not interrupted.

This method looks up the state of each cluster machine from the Cluster Aware (CA) OS and collects the list of machines whose state is detected as DOWN. It passes this list to a utility that confirms from the Hardware Management Console (HMC) whether those machines are actually down, and then shuts each machine down completely from one of the available HMCs so that the unresponsive machine cannot access the storage.
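The flow described above might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the utility from the disclosure: `FakeHmc` is a hypothetical in-memory stand-in for an HMC, and the function names, the state strings and the rule that the application may start only when every DOWN machine is confirmed off are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeHmc:
    """Hypothetical in-memory stand-in for a Hardware Management Console."""
    name: str
    reachable: bool = True
    # machine name -> power state as seen by this HMC ("running" or "off")
    power_state: Dict[str, str] = field(default_factory=dict)

    def manages(self, machine: str) -> bool:
        return machine in self.power_state

    def is_powered_on(self, machine: str) -> bool:
        return self.power_state[machine] == "running"

    def power_off(self, machine: str) -> None:
        self.power_state[machine] = "off"


def machines_reported_down(cluster_state: Dict[str, str]) -> List[str]:
    """Collect the machines the cluster-aware layer reports as DOWN."""
    return sorted(m for m, state in cluster_state.items() if state == "DOWN")


def fence_down_machines(cluster_state: Dict[str, str],
                        hmcs: List[FakeHmc]) -> Dict[str, str]:
    """Check each DOWN machine on the first usable HMC and power it off."""
    outcome = {}
    for machine in machines_reported_down(cluster_state):
        outcome[machine] = "unconfirmed"  # no usable HMC found yet
        for hmc in hmcs:
            if not (hmc.reachable and hmc.manages(machine)):
                continue
            if hmc.is_powered_on(machine):
                # Reported DOWN but still running: unresponsive, so fence it.
                hmc.power_off(machine)
                outcome[machine] = "fenced"
            else:
                outcome[machine] = "already_off"
            break
    return outcome


def safe_to_start_application(outcome: Dict[str, str]) -> bool:
    """Assumed policy: start only if every DOWN machine is confirmed off."""
    return all(result != "unconfirmed" for result in outcome.values())


state = {"nodeA": "UP", "nodeB": "DOWN", "nodeC": "DOWN"}
hmc1 = FakeHmc("hmc1", reachable=False, power_state={"nodeB": "running"})
hmc2 = FakeHmc("hmc2", power_state={"nodeB": "running", "nodeC": "off"})
result = fence_down_machines(state, [hmc1, hmc2])
print(result, safe_to_start_application(result))
```

Trying several HMCs in turn mirrors the phrase "one of the available HMCs": an unreachable console is skipped rather than treated as confirmation that the machine is down.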

The Cluster Aware OS also reports whether a machine is down because of a genuine halt or shutdown, or because it is unable to exchange heartbeats with the other machines. In a genuine down, the Cluster Aware OS sends a last message just before dying, stating that it is going down. If instead the communication channels are down, the other machines simply miss the heartbeat. This helps to distinguish a cluster split from a genuine machine down.
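The distinction between the two cases can be sketched as a small classifier. The names `PeerRecord`, `DownCause` and `classify_peer`, and the single heartbeat timeout, are assumptions made for illustration; the disclosure does not describe how the Cluster Aware OS represents this state.

```python
from dataclasses import dataclass
from enum import Enum


class DownCause(Enum):
    ALIVE = "alive"
    GENUINE_SHUTDOWN = "genuine_shutdown"  # last message was received
    HEARTBEAT_LOSS = "heartbeat_loss"      # possible cluster split


@dataclass
class PeerRecord:
    """Hypothetical view one machine keeps of a peer."""
    last_heartbeat: float          # time of the last heartbeat, in seconds
    goodbye_received: bool = False  # peer announced it is going down


def classify_peer(peer: PeerRecord, now: float, timeout: float) -> DownCause:
    """Classify a peer as alive, genuinely down, or silent."""
    if peer.goodbye_received:
        # The peer sent its last message before dying: a genuine down.
        return DownCause.GENUINE_SHUTDOWN
    if now - peer.last_heartbeat > timeout:
        # No last message and no heartbeat: the peer may still be running.
        return DownCause.HEARTBEAT_LOSS
    return DownCause.ALIVE


print(classify_peer(PeerRecord(last_heartbeat=100.0), now=140.0, timeout=30.0))
```

Only the `HEARTBEAT_LOSS` case leaves open the possibility that the peer is still running and may touch the shared storage, so it is the case in which confirming the machine's state through the HMC matters.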

Unlike other methods which are dependent on the repository storage which may include disk storage device or data comm...