Method to detect and recover from failure during failovers of redundant micro controllers

IP.com Disclosure Number: IPCOM000166851D
Original Publication Date: 2008-Jan-25
Included in the Prior Art Database: 2008-Jan-25
Document File: 6 page(s) / 41K

Publishing Venue

IBM

Abstract

Method to detect and recover from failure during failovers of redundant micro controllers

This is the abbreviated version, containing approximately 41% of the total text.

In today's real-time, always-on business environment, servers must be highly reliable to prevent service interruptions. This high availability is achieved through complex computer systems with built-in hardware and software redundancy. The idea is to detect a failure, fail over to the redundant component so that service is not interrupted, and then repair the failed or failing component during a scheduled service window.

The focus of this document is on servers based on redundant micro-controllers. The micro-controller provides initialization, run-time and service functions to the server, while the hypervisor is the server firmware that communicates with the micro-controllers for various services. Each micro-controller is referred to as a service processor (SP). The SP that has ownership of the server is the primary service processor, and the redundant SP is the backup service processor. Any failure of the primary service processor can interrupt server operation and result in significant down time. Therefore, the redundant service processor is designed to take over through a fail-over and prevent an interruption of server operation when the primary service processor fails.

The fail-over operation in this setup refers to the action of transferring ownership of the server from the primary service processor to the backup service processor. This publication is relevant to systems that distinguish between two types of fail-over: an Administrative Fail-over (AFO) and a Dynamic Fail-over (DFO).

Administrative Fail-over refers to the operation where the system ownership is transferred from the primary service processor to the backup service processor in a controlled manner. An AFO is either user-initiated or initiated automatically by certain actions that require the system ownership to be transferred. AFOs are most commonly used during code load operations and repair scenarios.

Dynamic Fail-over refers to the operation where the system ownership is taken over by the backup service processor in the event of primary service processor failure.
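
The distinction between the two fail-over types can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the names ServiceProcessor, Role, administrative_failover and dynamic_failover are hypothetical, and the sketch assumes that an AFO is only started when both SPs are healthy and that a failed primary is simply demoted to the backup role after a DFO.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    BACKUP = "backup"


@dataclass
class ServiceProcessor:
    # Hypothetical model of one SP: a name, its current role, a health flag.
    name: str
    role: Role
    healthy: bool = True


def administrative_failover(primary, backup):
    """AFO: controlled hand-over of ownership between two SPs.

    Assumption (not from the source): both SPs must be healthy.
    """
    if not (primary.healthy and backup.healthy):
        raise RuntimeError("AFO requires both service processors to be healthy")
    primary.role, backup.role = Role.BACKUP, Role.PRIMARY
    return backup  # the new owner of the server


def dynamic_failover(primary, backup):
    """DFO: the backup takes ownership because the primary has failed."""
    if primary.healthy:
        raise RuntimeError("DFO is only triggered by a primary failure")
    if not backup.healthy:
        raise RuntimeError("no healthy service processor available")
    # Assumption: the failed SP is demoted to backup until it is repaired.
    primary.role, backup.role = Role.BACKUP, Role.PRIMARY
    return backup
```

In this model both operations end with the former backup owning the server; they differ only in the trigger, which is a request for an AFO and a detected failure for a DFO.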

The server is vulnerable to failures during an administrative fail-over operation because control of the server is being transferred between the SPs. A failure during this transfer can leave the system in an inconsistent state, which can in turn lead to a system state where user intervention is required.

This publication is pertinent to the current fail-over implementation in IBM i/p series e-Servers. It provides a mechanism to prevent abnormal system termination due to a fault (software or hardware) during the fail-over operation by falling back to the healthy SP if possible. The mechanism prevents a failure during an AFO (which could otherwise interrupt server operation) by causing the DFO mechanism to be invoked.
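
The fall-back idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation described in the full disclosure: the function afo_with_fallback, the FailoverFault exception, the health map and the list of hand-over steps are all hypothetical, and the choice to prefer the original owner when both SPs are still healthy is an assumption.

```python
class FailoverFault(Exception):
    """Hypothetical signal that a step of the hand-over has failed."""


def afo_with_fallback(health, owner, target, steps):
    """Attempt an AFO from `owner` to `target`, falling back on a fault.

    health: dict mapping SP name -> True if that SP is healthy
    steps:  callables run in order; each may raise FailoverFault
    Returns the name of the SP that owns the server afterwards.
    """
    try:
        for step in steps:
            step()
        return target  # controlled hand-over completed
    except FailoverFault:
        # A fault occurred mid-transfer. Rather than leaving ownership
        # undecided, give the server to an SP that is still healthy,
        # in the manner of a DFO. Assumption: the original owner is
        # preferred if it is still healthy.
        for candidate in (owner, target):
            if health.get(candidate, False):
                return candidate
        raise RuntimeError("no healthy SP; user intervention required")
```

For example, if a step fails because the target SP has faulted, the call returns the original owner; if the original owner is the one that faulted, the target takes ownership instead.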

The current AFO action...