Early Detection and Recovery of Permanent Control Unit Errors

IP.com Disclosure Number: IPCOM000105672D
Original Publication Date: 1993-Aug-01
Included in the Prior Art Database: 2005-Mar-20
Document File: 2 page(s) / 51K

Publishing Venue

IBM

Related People

Cook, T: AUTHOR [+3]

Abstract

Disclosed is a method for determining whether a permanent error signal from one device is indicative of future permanent error signals from many other devices. Also described is a method to determine the set of devices affected by the permanent error and a method to eliminate the unnecessary redundant error recovery associated with the recovery of each device affected by the permanent error. Application of these methods enables an operating system to minimize the disruption and recovery time for I/O errors.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 58% of the total text.

Certain types of control unit errors surface to the operating system
only when it attempts to access devices on the control unit interface
in error.  Once such an access surfaces an error, the operating system
begins a recovery action.
In many cases, the recovery action involves repetitive retries to the
failing device on the path that surfaced the error.  After a
threshold of retries has been reached, the error is deemed permanent,
and the failing path through the control unit to the device is
removed.  With ESCON*, 64 devices can be connected to a control unit.
If the retry threshold were set at 10 retries per device path, and 64
devices were connected to a control unit with a bad path, it could
take 640 failed I/O operations (10 retries x 64 devices) before all
device paths in error have been removed.  Su...
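
The retry arithmetic above can be reproduced with a minimal sketch.  It
is an illustration only, not the disclosed method, whose details are
not included in this abbreviated text.  The function names are
hypothetical.  The second function assumes, purely for contrast, that
the first permanent error is enough to remove the bad path for every
affected device.

```python
def failed_ios_per_device_recovery(num_devices: int, retry_threshold: int) -> int:
    """Count failed I/O operations when each device path is retried independently."""
    failed = 0
    for _device in range(num_devices):
        # Every retry on the bad path fails.  Once the threshold is reached,
        # the error is deemed permanent and that one device path is removed.
        for _attempt in range(retry_threshold):
            failed += 1
    return failed


def failed_ios_shared_recovery(num_devices: int, retry_threshold: int) -> int:
    """Hypothetical contrast case (an assumption, not taken from the source):
    the first permanent error removes the bad path for all affected devices,
    so only one device pays the retry cost."""
    if num_devices == 0:
        return 0
    return retry_threshold


print(failed_ios_per_device_recovery(64, 10))  # 640
print(failed_ios_shared_recovery(64, 10))      # 10
```

Under these assumptions, the per-device case gives the 640 failed
operations cited above, while the contrast case needs only the 10
retries of the first device to fail.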