Failure Detection in a Symmetric System

IP.com Disclosure Number: IPCOM000114622D
Original Publication Date: 1995-Jan-01
Included in the Prior Art Database: 2005-Mar-29
Document File: 4 page(s) / 67K

Publishing Venue

IBM

Related People

Carlson, WC: AUTHOR

Abstract

A means of using multiple communications links between two identical units in a symmetric redundant system to distinguish between failure of a unit and failure of a communication link is disclosed.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 72% of the total text.

      A means of using multiple communications links between two
identical units in a symmetric redundant system to distinguish
between failure of a unit and failure of a communication link is
disclosed.

      The units are connected with two or more independent
communication links.  Besides the normal traffic, regular "heartbeat"
transactions are processed on each link to assure failure detection
in the absence of other traffic.
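
      As a minimal sketch of per-link failure detection under these
assumptions: any traffic, normal or heartbeat, refreshes a "last seen"
timestamp, and a link is declared failed once it has been silent for
longer than a timeout.  The names LinkMonitor, timeout, traffic_seen
and poll are hypothetical and do not come from the disclosure, which
does not specify how a link failure is detected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LinkMonitor:
    """Hypothetical monitor for one communication link."""
    name: str
    timeout: float                     # assumed: longest tolerated silence, in seconds
    last_seen: float = 0.0             # time of the most recent traffic on this link
    failed_at: Optional[float] = None  # time the failure was first noticed, if any

    def traffic_seen(self, now: float) -> None:
        # Normal transactions and heartbeats both count as proof of life.
        self.last_seen = now
        self.failed_at = None

    def poll(self, now: float) -> bool:
        # Declare failure once the link has been silent longer than the timeout,
        # and remember when the failure was first observed.
        if self.failed_at is None and now - self.last_seen > self.timeout:
            self.failed_at = now
        return self.failed_at is not None
```

      With one such monitor per link, the heartbeat period bounds how
long a healthy link can stay silent, so the timeout would be chosen
somewhat longer than that period.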

      Failure of a single link indicates nothing beyond the link
failure.  Coincident failure of the two independent links indicates
that the monitored unit has failed (Fig. 1).

      Because of the possible asynchronous nature of the transaction
and heartbeat traffic on the independent links, failure of the
monitored unit is likely to cause failure indications on the two
links at different times.  A means of correlating the failures is
required in order to differentiate link failures caused by unit
failure from those with other causes.

      Such correlation is done by defining a time interval, within
which failures of both links are considered to be due to a common
cause (Fig. 2), and outside of which such failures are considered to
be due to independent causes (Fig. 3).  The interval is chosen to be
as small as possible, consistent with worst-case failure detection
and reporting times on the communication links.
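
      The interval rule can be illustrated with a short sketch that
compares failure timestamps directly.  It is an illustration under
stated assumptions, not the disclosed mechanism: the function names,
the result labels, and the assumption that each link reports a unit
failure somewhere between zero and its worst-case detection and
reporting delay are all hypothetical.  Under that assumption, two
reports caused by the same unit failure can differ by at most the
largest worst-case delay, which gives the smallest usable interval.

```python
from typing import Optional, Sequence


def min_window(worst_case_delays: Sequence[float]) -> float:
    # Assumed model: each link reports between 0 and its worst-case delay
    # after the unit fails, so two reports differ by at most the largest delay.
    return max(worst_case_delays)


def classify(fail_a: Optional[float], fail_b: Optional[float],
             now: float, window: float) -> str:
    """Classify the failure state of two links at time `now`."""
    if fail_a is None and fail_b is None:
        return "ok"
    if fail_a is None or fail_b is None:
        first = fail_a if fail_a is not None else fail_b
        # Until the interval has elapsed, the other link may still fail
        # and turn this into a unit failure.
        if now - first <= window:
            return "pending"
        return "link_failure"
    if abs(fail_a - fail_b) <= window:
        return "unit_failure"            # common cause (both inside the interval)
    return "independent_link_failures"   # failures too far apart to share a cause
```

      For example, with worst-case delays of 0.2 s and 0.5 s the
interval is 0.5 s; link failures observed at 1.0 s and 1.3 s would be
attributed to a unit failure, while failures at 1.0 s and 5.0 s would
be treated as independent.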

      Failure correlation is accomplished by using a timer wit...