Extending Enhanced I/O Error Handling (EEH) Framework in AIX to Handle Long Recovery Times

IP.com Disclosure Number: IPCOM000030383D
Original Publication Date: 2004-Aug-09
Included in the Prior Art Database: 2004-Aug-09
Document File: 4 page(s) / 69K

Publishing Venue

IBM

Abstract

Disclosed is a method of extending EEH (Enhanced I/O Error Handling) recovery time for any device driver that follows AIX's multifunction programming model with respect to EEH. The current state machine for EEH recovery assumes that each recovery step executes instantaneously. This poses a problem for device drivers that must perform a cleanup task during EEH recovery when that task can take up to several minutes. The EEH state machine is extended so that it allows longer recovery times. Specifically, four new states are added to the state machine that permit waiting from the interrupt environment, combined with a retry protocol. The retry protocol requires storing and restoring some context information before each retry. Thus, the invention is the next step forward in expanding the scope of EEH recovery in AIX.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 36% of the total text.

The AIX operating system has a basic methodology for recovering from an I/O error that coordinates the recovery steps among independent device drivers. These drivers belong to the same physical adapter card (for example, a four-port Ethernet card or a dual-channel SCSI card) but control different functions on the card, and they are unaware of each other's existence. The I/O error recovery process requires some coordination among these drivers so that the state of a slot does not end up inconsistent and unrecoverable. The recovery procedure consists of a set of steps, each guided by the Operating System and carried out by the device drivers. However, the current solution in AIX assumes that each device driver will carry out a single recovery step instantaneously. By instantaneous, we mean that a driver will carry out a recovery step completely, without blocking the overall progress of the system for an extended period (a few minutes), and then return control to the Operating System. As a result of this design, the recovery procedure takes only a few seconds at worst.

While this assumption is valid for most adapter drivers, there is a class of drivers that need a longer time to process some recovery steps. The current design does not accommodate such drivers and hence needs to be extended. Disclosed is a method by which a device driver can take as much time as needed to finish a given recovery step without causing performance degradation or deadlock in the system. The method is also transparent to the drivers (i.e., contained entirely inside the Operating System kernel), so that there is very little impact on the device driver implementation and no impact on the system hardware and firmware.
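The idea of letting a step take as long as it needs without blocking the system can be illustrated with a small retry sketch. This is not AIX kernel code: every name here (step_result, retry_ctx, try_step, slow_cleanup, run_step_with_retries) is hypothetical, and the loop merely stands in for timer-driven re-entry. It assumes a step handler that reports "busy" instead of blocking, and a small context record that is kept between attempts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for one recovery step (not AIX definitions). */
typedef enum { STEP_DONE, STEP_BUSY } step_result;

/* Hypothetical context kept between attempts so a step can be resumed. */
typedef struct {
    int step;     /* index of the recovery step in progress */
    int retries;  /* failed attempts so far for the current step */
} retry_ctx;

/* A driver's step handler: returns STEP_BUSY while its cleanup is running. */
typedef step_result (*step_fn)(void *driver_data);

/*
 * Attempt the current step once. On STEP_BUSY the context is updated and
 * control returns to the caller, which can try again later instead of
 * blocking. Returns 1 when the step finished, 0 when a retry is needed.
 */
int try_step(retry_ctx *ctx, step_fn fn, void *driver_data)
{
    assert(ctx != NULL && fn != NULL);
    if (fn(driver_data) == STEP_DONE) {
        ctx->retries = 0;
        ctx->step++;
        return 1;
    }
    ctx->retries++;
    return 0;
}

/* Example handler whose cleanup needs several polls before it completes. */
step_result slow_cleanup(void *driver_data)
{
    int *polls_left = driver_data;
    if (*polls_left > 0) {
        (*polls_left)--;
        return STEP_BUSY;
    }
    return STEP_DONE;
}

/*
 * Stand-in for timer-driven retries. Returns the number of attempts made,
 * or -1 if max_attempts is exhausted before the step completes.
 */
int run_step_with_retries(retry_ctx *ctx, step_fn fn, void *driver_data,
                          int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (try_step(ctx, fn, driver_data))
            return attempt;
        /* A real kernel would return here and be re-entered from a timer. */
    }
    return -1;
}
```

In this sketch the driver-side handler never sleeps; all waiting is expressed as "return and try again", which is one way to keep the waiting logic outside the driver.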

The key idea behind allowing unlimited time for each recovery step is simply to extend the I/O error handling automaton (or state machine). The core I/O error handling automaton consists of a set of states that guide independent device drivers through the error recovery procedure, on the assumption that each driver will finish a recovery step instantaneously. The new design adds more states to the automaton so that a recovery step can be given a longer time. The new states are added in a way that does not affect the normal operation of the core automaton. Even after the new states are added, the automaton is still deterministic, that is, unambiguous.
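A toy transition table can make the two properties above concrete: the added state leaves the original path untouched, and each (state, event) pair has at most one successor. The states and events below are hypothetical and much simpler than the EEH state machine; they are not taken from the disclosure's figures.

```c
#include <assert.h>

/* Hypothetical states: three "core" states plus one added wait state. */
enum state { ST_NORMAL, ST_RECOVER, ST_RECOVER_WAIT, ST_RESUMED, NUM_STATES };

/* Hypothetical events: EV_STEP_BUSY and EV_TIMER exist only for the wait state. */
enum event { EV_ERROR, EV_STEP_DONE, EV_STEP_BUSY, EV_TIMER, NUM_EVENTS };

#define INVALID (-1)

/*
 * One entry per (state, event) pair, so the automaton is deterministic by
 * construction. The core entries (ERROR and STEP_DONE columns) are the same
 * as they would be without the RECOVER_WAIT row and its two events.
 */
static const int next_state[NUM_STATES][NUM_EVENTS] = {
    /*                    ERROR       STEP_DONE   STEP_BUSY        TIMER      */
    /* NORMAL       */ { ST_RECOVER, INVALID,    INVALID,         INVALID    },
    /* RECOVER      */ { INVALID,    ST_RESUMED, ST_RECOVER_WAIT, INVALID    },
    /* RECOVER_WAIT */ { INVALID,    INVALID,    INVALID,         ST_RECOVER },
    /* RESUMED      */ { INVALID,    INVALID,    INVALID,         INVALID    },
};

/* Feed a sequence of events to the table; returns the final state or INVALID. */
int run_events(int start, const int *events, int count)
{
    int s = start;
    for (int i = 0; i < count; i++) {
        assert(events[i] >= 0 && events[i] < NUM_EVENTS);
        s = next_state[s][events[i]];
        if (s == INVALID)
            return INVALID;
    }
    return s;
}
```

A driver that finishes immediately follows ERROR then STEP_DONE and never enters the added state. A slow driver alternates STEP_BUSY and TIMER any number of times before STEP_DONE, and both sequences end in the same final state.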

[Figure 1: Multifunction Error Recovery State Machine. The diagram was not preserved in extraction. Labels recovered from it: Normal, Suspend, Debug, Activate, Deactivate, Dead; EEH_DD_SUSPEND, EEH_DD_DEBUG, EEH_DD_ACTIVE, EEH_DD_DEACTIVE (100 ms trb), EEH_DD_RESUME, EEH_DD_DEAD; ibm,read-slot-reset-state, ibm,slot-error-detail, ibm,set-eeh-option (PIO), ibm,configure-bridge; skip Debug.]

Figure 1 shows the core Enhanced I/O Er...