Enhanced Software Recovery for Storage Errors

IP.com Disclosure Number: IPCOM000103940D
Original Publication Date: 1993-Feb-01
Included in the Prior Art Database: 2005-Mar-18
Document File: 4 page(s) / 115K

Publishing Venue

IBM

Related People

Daly, JC: AUTHOR [+3]

Abstract

Disclosed is a method for software processing of recurring storage-related errors presented by the hardware.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 48% of the total text.

      Disclosed is a method for software processing of recurring
storage-related errors presented by the hardware.  This method:

o   is more effective in limiting the impact of such multiple
    occurrences

o   increases the probability that the source of the error will be
    removed

o   decreases the probability that the error will take the system
    down

      Computing systems typically contain mechanisms for detecting
and acting upon deviations from expected or desired processing (that
is, errors in the processing).  For many errors, the recovery action
(by hardware, software, or both) on an initial occurrence is
sufficient to prevent any recurrence while still allowing the
system's operation to continue.

      However, despite the recovery action, an error will sometimes
recur, so it is equally important that a system be able to react to
multiple occurrences of the same or similar errors - an indication
that action(s) taken in response to an earlier occurrence did not
resolve or eliminate the cause.

      One example of such a scenario is the System/370*
Architecture's presentation of an IPD (Instruction Processing Damage)
machine check when the hardware detects an uncorrectable
storage-related error, together with MVS' processing of this type of
machine check.

      For any given occurrence of a storage error, MVS attempts to
stop all usage of the affected storage and this, in turn, usually
requires that one or more units of work be terminated.  Because of
the way the system uses some areas of storage, this approach is often
ineffective and a particular error can keep recurring until the
system eventually goes down.

      This method for processing a series of uncorrected
storage-related errors is more effective than previous methods in
limiting the disruption to the entire system.  The major features of
this new method are:

1.  The introduction of a technique whereby a program can determine
    whether a given occurrence of a storage error represents a new
    error or a recurrence of an earlier error.

2.  Maintaining and acting on multiple thresholds for one type of
    error rather than accumulating errors that affect different users
    or that occur on different areas of storage and acting on them
    under a single threshold.  The new method has three such
    thresholds, each measured against a time interval:

    o   the number of storage-related errors experienced by a single
        user on a given storage area

    o   the number of separate users that experience an error on a
        given storage area

    o   the number of separate storage areas that experience errors

3.  Application of a hierarchy of increasingly severe recovery
    actions to storage-related errors.  Thus when a persistent error
    occurs on a given area of storage, the first recovery acti...