Determining High Availability State without Custom Hardware or Operating Systems

IP.com Disclosure Number: IPCOM000172772D
Original Publication Date: 2008-Jul-14
Included in the Prior Art Database: 2008-Jul-14
Document File: 4 page(s) / 268K

Publishing Venue

Motorola

Related People

Matthias Martin: INVENTOR [+2]

Abstract

In a high availability (HA) system utilizing 1 to 1 asymmetric failover, it is critical that one and only one controller attempt to assert control over the system at any one point in time. If more than one controller were to attempt to assert control, the system could be permanently placed in an inconsistent state. Many solutions to this problem involve custom hardware or operating systems in order to enforce this assertion. This paper presents a method for ensuring that only one controller attempts control, even in the presence of certain common hardware faults and in the presence of certain classes of operating system and application software faults.

This is the abbreviated version, containing approximately 41% of the total text.

By Matthias Martin, Jeremy Fixemer

Motorola, Inc.

Astro Systems Engineering

PROBLEM

HA systems, in general, need to decide which controller should be active based on each controller's ability to control the system. If communication between the controllers fails, each controller must assume that its peer is no longer capable of system control.
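The assumption described above is commonly implemented as a heartbeat timeout. The following is a minimal sketch of that idea, not the method of this paper; the class name `PeerMonitor` and the `timeout_s` parameter are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    """Tracks when the peer controller was last heard from (illustrative only)."""

    timeout_s: float            # hypothetical: silence longer than this means "peer failed"
    last_heard_s: float = 0.0   # timestamp of the most recent heartbeat from the peer

    def on_heartbeat(self, now_s: float) -> None:
        # Record the arrival time of a heartbeat from the peer.
        self.last_heard_s = now_s

    def peer_presumed_failed(self, now_s: float) -> bool:
        # The controller cannot tell a dead peer from a silent one;
        # after the timeout it must assume the peer cannot control the system.
        return (now_s - self.last_heard_s) > self.timeout_s


monitor = PeerMonitor(timeout_s=3.0)
monitor.on_heartbeat(10.0)
print(monitor.peer_presumed_failed(12.0))
print(monitor.peer_presumed_failed(13.5))
```

Note that the check only observes silence. It cannot tell a broken link from a peer that has reset or stalled.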

Several kinds of failure appear as a communications failure from the perspective of the other controller, even though all communications hardware is functioning normally.

The most common case is that the other controller encountered a critical software or hardware fault and reset.

A communications failure may also appear when the active controller stops functioning temporarily because a fault in a device driver or other high-priority third-party thread causes that thread to consume all available CPU. When the runaway thread stops consuming the CPU, the application may attempt to reassert control of the system, because its state before the thread took over was active. This results in two controllers being simultaneously active.
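The stall scenario can be reproduced with a small simulation. The sketch below illustrates only the failure mode, not the solution this paper proposes. The `Controller` class, the `TIMEOUT` value, and the tick-based clock are all assumptions made for the example.

```python
from dataclasses import dataclass

TIMEOUT = 3  # hypothetical: ticks of silence before the peer is presumed failed


@dataclass
class Controller:
    name: str
    active: bool
    last_heard: int = 0     # tick at which the peer's last heartbeat arrived
    stalled: bool = False   # True while a runaway thread starves the application

    def tick(self, now: int, peer: "Controller") -> None:
        if self.stalled:
            return  # no CPU: sends no heartbeat and re-evaluates nothing
        peer.last_heard = now  # heartbeat delivered to the peer
        # A standby controller takes over once the peer has been silent too long.
        if not self.active and now - self.last_heard > TIMEOUT:
            self.active = True
        # Note: an active controller never re-checks its own role here, so it
        # resumes as active after a stall. That is the hazard being shown.


def simulate(ticks: int = 10, stall: range = range(3, 9)) -> list[int]:
    """Return the ticks at which both controllers are running and active."""
    a = Controller("A", active=True)
    b = Controller("B", active=False)
    dual_active = []
    for now in range(1, ticks + 1):
        a.stalled = now in stall
        a.tick(now, b)
        b.tick(now, a)
        if a.active and b.active and not a.stalled:
            dual_active.append(now)
    return dual_active


print(simulate())                    # stall outlasts the timeout
print(simulate(stall=range(3, 5)))   # stall shorter than the timeout
```

With the default stall of six ticks, controller B takes over while A is starved, and A resumes still believing it is active. With a stall shorter than the timeout, no takeover occurs and no dual-active interval appears.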

Depending on the nature of the system, dual active controllers could cause many different problems. For instance, a client may perceive a connection to be in a different state than the controller with which it is communicating does. Even if the dual active situation is detected and one controller transitions to standby, long-term corruption of the system may result: other devices in the system may not be aware that messages which were sent and acknowledged were in effect lost, because they were sent to the controller that transitioned to standby.

Additionally, HA systems must switch controllers only when absolutely necessary, to minimize the downtime associated with clients reestablishing connectivity to the new active controller.

False switchovers can occur when a controller is reco...