Establishing a Configuration COORDINATOR for Highly Available Systems

IP.com Disclosure Number: IPCOM000046292D
Original Publication Date: 1983-Jun-01
Included in the Prior Art Database: 2005-Feb-07
Document File: 6 page(s) / 72K

Publishing Venue

IBM

Related People

Kim, W: AUTHOR

Abstract

A function that must be provided in a highly available system is that of coordinating the detection of and recovery from failures. This function is performed by distributed software subsystems, called auditors, each of which resides in a separate processor. An auditor is a collection of tasks that are responsible for (1) recording the failures reported by the operating system, database subsystems, and data communications subsystem, as well as failures the auditor itself detects by diagnosing these subsystems; (2) initiating and monitoring appropriate actions to reconfigure the system, thereby shielding the users from the effects of subsystem or processor failures; and (3) responding to system status queries and system reconfiguration requests from the operator.

This is the abbreviated version, containing approximately 21% of the total text.

An auditor resides in each processor, and one of the auditors is designated as the audit coordinator. The notion of the audit coordinator is central to the operation of the audit mechanism. The audit coordinator is responsible for periodically requesting status reports from all the other (subordinate) auditors, analyzing the reports, and initiating system reconfiguration, that is, the replacement of failed subsystems or processors with their backups and the (re)integration of repaired (or new) subsystems or processors.
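One coordinator cycle, as described above, can be sketched roughly as follows. This is an illustrative sketch only: the `StatusReport` fields, the `run_audit_round` function, the callables standing in for STATUS-REQUEST exchanges, and the `backups` table are all hypothetical names that do not come from the disclosure. Arbitration between conflicting reports is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StatusReport:
    """Hypothetical report a subordinate auditor returns to the coordinator."""
    auditor_id: str
    failed: List[str] = field(default_factory=list)    # subsystems reported as failed
    repaired: List[str] = field(default_factory=list)  # subsystems reported as repaired


def run_audit_round(
    request_status: Dict[str, Callable[[], StatusReport]],
    backups: Dict[str, str],
) -> List[str]:
    """One coordinator cycle: request reports, analyze them, plan reconfiguration."""
    # Step 1: request a status report from every subordinate auditor.
    reports = [ask() for ask in request_status.values()]

    # Step 2: analyze the reports; duplicates from several auditors collapse.
    failed = sorted({name for r in reports for name in r.failed})
    repaired = sorted({name for r in reports for name in r.repaired})

    # Step 3: plan the reconfiguration (replacement first, then reintegration).
    actions = []
    for name in failed:
        if name in backups:
            actions.append(f"replace {name} with {backups[name]}")
        else:
            actions.append(f"no backup for {name}")
    for name in repaired:
        actions.append(f"reintegrate {name}")
    return actions
```

Because all reports pass through a single function, two auditors reporting the same failure yield one replacement action rather than two competing ones.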

The audit coordinator can serve as the arbitrator of conflicting reports from different auditors. More importantly, the audit coordinator can take, in a coherent manner, corrective actions against concurrent multiple failures. Allowing each auditor to initiate reconfiguration may cause confusion or result in a less-than-optimal system configuration. Further, the audit coordinator is responsible for maintaining a stable configuration database which contains information about the status and physical location of each of the subsystems. The configuration database is stored in a single dual-disk system and is managed by a database subsystem upon request from the audit coordinator. It is also cached for access by each subordinate auditor.
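The relationship between the stable configuration database and the cached copies might look like the sketch below. The class names, the status strings, and in particular the version counter used to detect a stale cache are assumptions made for illustration; the disclosure does not say how the caches are kept current.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Entry:
    status: str    # e.g. "active" or "failed" (illustrative values)
    location: str  # processor on which the subsystem currently resides


class ConfigurationDatabase:
    """Stable copy; in this sketch only the audit coordinator calls update()."""

    def __init__(self) -> None:
        self.version = 0  # assumed mechanism for spotting stale caches
        self._entries: Dict[str, Entry] = {}

    def update(self, subsystem: str, status: str, location: str) -> None:
        self._entries[subsystem] = Entry(status, location)
        self.version += 1

    def snapshot(self) -> Dict[str, Entry]:
        return dict(self._entries)


class AuditorCache:
    """Read-only cached copy held by a subordinate auditor."""

    def __init__(self, db: ConfigurationDatabase) -> None:
        self._db = db
        self.version = -1
        self._entries: Dict[str, Entry] = {}
        self.refresh()

    def refresh(self) -> bool:
        """Reload the cache if the stable copy changed; return True if it did."""
        if self.version == self._db.version:
            return False
        self._entries = self._db.snapshot()
        self.version = self._db.version
        return True

    def lookup(self, subsystem: str) -> Entry:
        return self._entries[subsystem]
```

Keeping a single writer (the coordinator) means a subordinate's cache can lag behind the stable copy but can never disagree with it about who made a change.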

The design of the surveillance mechanisms in highly available systems is usually guided by the single-failure assumption; that is, such systems are guaranteed to be available only if no software or hardware component fails while another related component has failed. The auditor, by contrast, allows the system to function despite multiple concurrent failures of software and/or hardware components.

The Audit Coordination Protocol

An auditor is awakened either by a timeout or by the STATUS-REQUEST message from the audit coordinator. The audit coordinator is ordinarily awakened by a timeout.
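The two wake-up paths can be illustrated with a blocking read that carries a timeout. This is a minimal sketch under assumptions: the in-process `queue.Queue` standing in for the message channel, the `wait_for_wakeup` name, and the returned strings are hypothetical and not part of the disclosure.

```python
import queue

STATUS_REQUEST = "STATUS-REQUEST"


def wait_for_wakeup(inbox: "queue.Queue[str]", timeout_s: float) -> str:
    """Block until a message arrives or the timer expires.

    Returns "status-request", "timeout", or "other" (for any message that
    is not a STATUS-REQUEST) to say what woke the auditor.
    """
    try:
        message = inbox.get(timeout=timeout_s)
    except queue.Empty:
        # No message arrived within the interval: the timer woke the auditor.
        return "timeout"
    return "status-request" if message == STATUS_REQUEST else "other"
```

In this picture a subordinate auditor would normally return "status-request", whereas the audit coordinator, which ordinarily receives no such message, would return "timeout".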

Each auditor holds a copy of the line of succession, that is, the list of active auditors and their ranks in the order of succession to the role of audit coordinator. Each auditor also maintains its own current rank, which may deviate from the rank recorded for it in the line of succession. If the auditor does not receive the STATUS-REQUEST message from the...
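The per-auditor state described in this paragraph could be represented as in the sketch below. It covers only the data structure: the excerpt breaks off before explaining when or why an auditor's own rank changes, so no succession logic is shown. The `Auditor` class, its field names, and the convention that rank 0 denotes the audit coordinator are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Auditor:
    auditor_id: str
    # Active auditor ids in succession order; index 0 is taken to be the
    # audit coordinator (an assumed convention, not stated in the text).
    line_of_succession: List[str]
    # The auditor's own working rank, tracked separately from the list.
    rank: int

    def listed_rank(self) -> int:
        """Rank recorded for this auditor in the line of succession."""
        return self.line_of_succession.index(self.auditor_id)

    def rank_deviates(self) -> bool:
        """True when the working rank differs from the listed rank."""
        return self.rank != self.listed_rank()
```

For example, `Auditor("B", ["A", "B", "C"], rank=1)` starts out consistent with the list, and setting its `rank` to 0 makes `rank_deviates()` return True.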