Failover to a redundant controller during controller recovery

IP.com Disclosure Number: IPCOM000172777D
Original Publication Date: 2008-Jul-14
Included in the Prior Art Database: 2008-Jul-14

Publishing Venue

IBM

Abstract

The eCLipz high-end system control structure is spread across system controllers (SCs) and node controllers (NCs). The SCs provide higher-level function across the entire system and are the end communication point for the system hypervisor. The NCs have a node-wide view and are used to access hardware within a given node. Because of this structure, the NCs act as a proxy for communications between the SCs and the hypervisor. To provide compatibility across all of the eCLipz platforms, each SC must select a separate NC to proxy its communications to the hypervisor. In smaller configurations where there is only one set of NCs, this has the effect of eliminating redundancy at the SC level if there is a failure at the NC level.

The desire is to maintain SC redundancy as long as there is no failure at the SC level. This means leaving the backup SC in a state where it can become the primary, even if it is not actively communicating with the hypervisor. When an NC fails, the SC that was associated with that NC for hypervisor communications resets itself to clear out any in-flight messages. Upon recovering from the reset, the SC partially initializes itself and then waits, polling for an NC to proxy its communications to the hypervisor.

An NC can become available in one of three ways: the failed NC may recover on its own, the network communication between the SC and the NC may recover, or the primary SC may fail, freeing up the NC it was using to proxy its communications. In the case of a network failure and recovery, the backup SC relies on the primary SC to reinitialize the NC to allow communications to the hypervisor. In the case of a primary SC failure, the backup SC must first detect the primary SC failure, fail over to the primary role, reinitialize the NC for itself, and then complete the reset/reload recovery by communicating with the hypervisor.
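The recovery sequence described in the abstract can be read as a small state model. The Python sketch below is illustrative only: the class, method, and step names (BackupSC, on_nc_failure, on_trigger, and the logged strings) are hypothetical and do not come from the eCLipz firmware. It covers the two cases the abstract spells out, network recovery and primary SC failure. The case where the failed NC recovers on its own is left out because the abstract does not describe its steps.

```python
from enum import Enum, auto


class Role(Enum):
    PRIMARY = auto()
    BACKUP = auto()


class Trigger(Enum):
    """Events that can end the SC's wait for a proxy NC (hypothetical names)."""
    NETWORK_RECOVERED = auto()
    PRIMARY_SC_FAILED = auto()


class BackupSC:
    """Hypothetical model of a backup SC that has lost its proxy NC."""

    def __init__(self):
        self.role = Role.BACKUP
        self.log = []  # ordered record of the recovery steps taken

    def on_nc_failure(self):
        # Reset to clear in-flight messages, then partially initialize
        # and wait, polling for an NC to proxy hypervisor communications.
        self.log += ["reset", "partial_init", "poll_for_nc"]

    def on_trigger(self, trigger):
        if trigger is Trigger.NETWORK_RECOVERED:
            # The backup relies on the primary SC to reinitialize the NC.
            self.log.append("primary_reinitializes_nc")
        elif trigger is Trigger.PRIMARY_SC_FAILED:
            # The backup detects the failure, takes over the primary role,
            # reinitializes the NC for itself, and only then completes the
            # reset/reload recovery with the hypervisor.
            self.log.append("detect_primary_failure")
            self.role = Role.PRIMARY
            self.log += [
                "failover_to_primary",
                "reinit_nc",
                "complete_recovery_with_hypervisor",
            ]
        else:
            raise ValueError(f"unsupported trigger: {trigger!r}")


if __name__ == "__main__":
    sc = BackupSC()
    sc.on_nc_failure()
    sc.on_trigger(Trigger.PRIMARY_SC_FAILED)
    print(sc.role.name, sc.log)
```

The step log makes the ordering constraint visible: in the primary-failure path the SC reinitializes the NC only after it has taken over the primary role, whereas in the network-recovery path it stays in the backup role throughout.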