Automatic Fault Management of Recurring Faults

IP.com Disclosure Number: IPCOM000111495D
Original Publication Date: 1994-Feb-01
Included in the Prior Art Database: 2005-Mar-26
Document File: 2 page(s) / 44K

Publishing Venue

IBM

Related People

Ferris, MM: AUTHOR [+2]

Abstract

Disclosed is a method for automatically managing fault recovery in a system. By interpreting the error logs, fault management code can make intelligent decisions about how to recover from recurring faults.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 87% of the total text.

      In a system where faults are recorded in a log, the log
provides a vital history of the system over time.  When each fault is
assigned to a particular component of the system, per-component
statistics can be derived from the fault history.  The fault
management code can then invoke different recovery actions for each
failing component based on its failure history.  Error thresholds can
be assigned to the components so that when a threshold is exceeded, a
more drastic recovery action is performed.
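
      As a minimal sketch of deriving per-component statistics from
a fault log, the Python fragment below counts logged faults by
component and lists the components whose count exceeds an assigned
threshold.  The log record layout, the component names, and the
threshold values are all assumptions made for illustration; the
disclosure does not specify any of them.

```python
from collections import Counter

# Hypothetical log: one (sequence, component, description) record per fault.
fault_log = [
    (1, "disk0", "read error"),
    (2, "adapter1", "timeout"),
    (3, "disk0", "read error"),
    (4, "disk0", "seek error"),
]

# Hypothetical per-component error thresholds (illustrative values only).
thresholds = {"disk0": 2, "adapter1": 5}


def fault_counts(log):
    """Count the logged faults assigned to each component."""
    return Counter(component for _, component, _ in log)


def over_threshold(log, limits):
    """Return, in sorted order, the components whose fault count exceeds their limit."""
    counts = fault_counts(log)
    return sorted(
        name for name, n in counts.items()
        if name in limits and n > limits[name]
    )
```

      With the sample data above, only the hypothetical component
"disk0" has more logged faults than its threshold allows, so it is
the one that would qualify for a more drastic recovery action.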

      An example of this concept is shown in the Figure.  According
to the Figure, the fault management code is notified when a fault
occurs.  Since the fault is assigned to a component, the fault
management code looks in the history file to determine the number of
errors already assigned to this component.  If the error threshold
has not been exceeded, the fault management code increments the error
count against the component and invokes a "normal" recovery action.
In this example, the results of this action are logged, and the fault
management code then waits for the next fault.  If
the error threshold has been exceeded, the fault management code
takes the right branch of the flow chart and executes a more...
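
      The flow described for the Figure can be sketched roughly as
follows.  This is an illustration under stated assumptions, not the
disclosed implementation: the function and parameter names are
hypothetical, the history file is modeled as an in-memory dictionary,
and "exceeded" is read as a count strictly greater than the
threshold.  The source text is cut off on the threshold-exceeded
branch, so the sketch only calls a more drastic action there, as the
earlier paragraph describes; logging its result and leaving the count
unchanged on that branch are assumptions.

```python
def handle_fault(component, history, thresholds,
                 normal_action, drastic_action, result_log):
    """Handle one fault assigned to `component` (all names hypothetical)."""
    # Look up the number of errors already assigned to this component.
    count = history.get(component, 0)

    if count <= thresholds[component]:
        # Threshold not exceeded: record the error, then recover normally.
        history[component] = count + 1
        result = normal_action(component)
    else:
        # Threshold exceeded: take the more drastic recovery action.
        # Anything further on this branch is not known from the source.
        result = drastic_action(component)

    # The normal branch logs its result per the text; doing the same
    # for the drastic branch is an assumption.
    result_log.append((component, result))
    return result


def run_example():
    """Feed four faults for one hypothetical component through the handler."""
    history, result_log = {}, []
    limits = {"disk0": 2}
    return [
        handle_fault("disk0", history, limits,
                     lambda c: "normal", lambda c: "drastic", result_log)
        for _ in range(4)
    ], history, result_log
```

      Under this reading, a threshold of 2 allows three normal
recoveries before the count (then 3) exceeds the threshold and the
fourth fault triggers the drastic action.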