Error Log Analysis in Serial Fixed Disk Sub-System

IP.com Disclosure Number: IPCOM000106832D
Original Publication Date: 1993-Dec-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 2 page(s) / 88K

Publishing Venue

IBM

Related People

Oldfield, C: AUTHOR [+3]

Abstract

Disclosed is a method of processing errors collected in a large disk system over a long period of time. The outcome presented to the Customer Engineer (CE) identifies precisely the cause of failure and the action to be taken. The analysis is quicker and more accurate than would otherwise be possible, given the quantity of information that needs to be analysed. The benefits are reduced time and effort in determining the cause of faults; the most relevant error is presented irrespective of order, and incorrect conclusions due to consecutive errors are eliminated. Repeated repair actions for a disk that has already been repaired are avoided, repair procedures are simplified and the predictive maintenance philosophy is extended.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      Customer Engineer maintenance of large disk subsystems can be
difficult.  In an installation with upwards of 40 individual hard
disks, it is imperative that the CE is guided through fault diagnosis
and replacement in an intelligent manner.  This intelligence is
provided by a suite of software programs that collect, process and
present information to the CE for the purpose of fault diagnosis.

      During use of the disk subsystem, periodic errors are
inevitably generated.  These can arise for a variety of reasons, but
fall into two main classes.  The first class is Recovered Errors,
which do not stop normal operation of the subsystem.  They are
reported solely for the purpose of predictive analysis, alerting the
CE or User to possible future problems with a particular disk or
group of disks.  The second class is Unrecovered Errors, which are
hard errors that stop normal operation.  They can be generated by
permanent hardware failure, disk media damage, or possibly
intermittent failures (e.g., in cabling or connectors).  Both classes
of errors are stored by the device driver in a file during normal
system use.  Each error message contains the following information:
Physical disk location, Recovered/Unrecovered flag, Error number,
Time stamp, and further error information.  This set of errors forms
the data for analysis by the Analysis software and is referred to as
the 'Error Log'.
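
      The error message fields listed above could be modelled roughly
as follows.  This is only an illustrative sketch: the field names, the
types, the "bay-NN" location strings and the split_by_class helper are
assumptions made for the example, since the disclosure does not give
the actual record layout used by the device driver.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ErrorRecord:
    """One Error Log entry (field names and types are assumed)."""
    disk_location: str    # physical disk location; string format is hypothetical
    recovered: bool       # True = Recovered Error, False = Unrecovered Error
    error_number: int
    timestamp: datetime
    info: str = ""        # further error information


def split_by_class(error_log):
    """Return (recovered, unrecovered) lists, preserving log order."""
    recovered = [e for e in error_log if e.recovered]
    unrecovered = [e for e in error_log if not e.recovered]
    return recovered, unrecovered


# A tiny, invented Error Log in time order.
error_log = [
    ErrorRecord("bay-03", True, 0x21, datetime(1993, 11, 2, 9, 15)),
    ErrorRecord("bay-07", False, 0x84, datetime(1993, 11, 2, 9, 40), "invented detail"),
    ErrorRecord("bay-03", True, 0x21, datetime(1993, 11, 3, 14, 5)),
]
recovered, unrecovered = split_by_class(error_log)
```

      Keeping the Recovered/Unrecovered flag on every record, rather
than using two separate files, matches the description of a single
time-ordered log that holds both classes of error.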

      The basis of analysis is to extract the most relevant error for
each particular disk in turn, and present this one error to the CE
for repair or action.  This is achieved by analysing the Error Log
twice.  The first pass finds the most relevant UNRECOVERED error
for a particular disk.  Since the errors saved in the log are
time-ordered, scanning backward from the latest error to the
e...
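
      A backward scan of a time-ordered log for unrecovered errors
might look like the sketch below.  The abbreviated text does not state
how "most relevant" is decided, nor what the second pass does, so the
example covers the first pass only and leaves relevance as a pluggable
function.  The default used here (the most recent unrecovered error
for each disk wins) is purely an assumption, as are the names LogEntry,
first_pass and is_more_relevant.

```python
from collections import namedtuple

# Simplified log entry; timestamps are arbitrary integer ticks (assumed).
LogEntry = namedtuple("LogEntry", "disk recovered error_number timestamp")


def first_pass(error_log, is_more_relevant=None):
    """Scan newest-to-oldest and keep one unrecovered error per disk.

    is_more_relevant(candidate, current) is a hypothetical hook for the
    relevance rule, which the source text does not specify.
    """
    if is_more_relevant is None:
        # Assumed default: never replace, so the latest error seen wins.
        def is_more_relevant(candidate, current):
            return False

    chosen = {}
    for entry in reversed(error_log):      # latest error first
        if entry.recovered:
            continue                       # this pass ignores Recovered Errors
        current = chosen.get(entry.disk)
        if current is None or is_more_relevant(entry, current):
            chosen[entry.disk] = entry
    return chosen


# Invented log, oldest entry first.
log = [
    LogEntry("disk-02", False, 17, 100),
    LogEntry("disk-05", True, 3, 110),
    LogEntry("disk-02", False, 42, 120),
    LogEntry("disk-05", False, 9, 130),
]
result = first_pass(log)
```

      With the assumed default rule, result holds error 42 for
disk-02 and error 9 for disk-05; a different is_more_relevant function
would change which entry is kept without changing the scan itself.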