Heuristic limitation of device error recovery

IP.com Disclosure Number: IPCOM000126201D
Original Publication Date: 2005-Jul-06
Included in the Prior Art Database: 2005-Jul-06
Document File: 2 page(s) / 44K

Publishing Venue

IBM

Abstract

This disclosure describes a simple method of limiting I/O device error recovery actions based on a combination of time, the number of prior retry attempts, and the complexity of the error recovery.

This is the abbreviated version, containing approximately 52% of the total text.

The technique described here achieves the simplicity of limiting error recovery to a sensible number of recovery attempts, while limiting more complex error recovery actions so that they do not take an excessive amount of time.

     Earlier techniques limited error recovery by specifying a number of recovery actions, usually simple retries, to attempt. Once that number was reached, error recovery halted and the error was considered unrecoverable. Unfortunately, some error recovery actions, such as resetting a device, take far longer than others, so limiting only the number of attempts can still allow excessively long recovery. Another technique limited the time allowed for recovery actions. The problem with this approach is that an obviously doomed, but simple, recovery attempt will repeat many times until the maximum time allowed is reached.

     A common error-handling technique is to use a table look-up to decide what action to take next based on the current error condition. In addition to the information such a table already holds, this technique adds a "recovery unit cost" for each action. The cost assigned to something simple, like retrying a simple data transfer, might be "1". The cost assigned to a device reset might be "50". Before the next recovery step is taken, a check is made to see whether taking this step would increase the total recovery cost beyond the allowed limit. If it would not, the total cost is advanced and the recovery action is initiated.
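
The cost check described above can be sketched as follows. This is only an illustration under stated assumptions: the action names, the `RecoveryBudget` class, and the limit of 60 are hypothetical and do not come from the disclosure; only the example costs "1" and "50" are taken from the text.

```python
from dataclasses import dataclass

# Hypothetical action table: action name -> recovery unit cost.
# The costs 1 and 50 are the examples from the text; the names are illustrative.
ACTION_COST = {"retry": 1, "reset_device": 50}


@dataclass
class RecoveryBudget:
    limit: int      # maximum total recovery cost allowed (assumed value)
    spent: int = 0  # total cost of the recovery actions taken so far

    def try_charge(self, action: str) -> bool:
        """Charge the action's cost if the total stays within the limit."""
        cost = ACTION_COST[action]
        if self.spent + cost > self.limit:
            return False      # this step would push the total past the limit
        self.spent += cost    # advance the total; the caller starts the action
        return True


budget = RecoveryBudget(limit=60)
print(budget.try_charge("reset_device"))  # True, total cost is now 50
print(budget.try_charge("reset_device"))  # False, 50 + 50 would exceed 60
print(budget.try_charge("retry"))         # True, total cost is now 51
```

With one shared total, a single reset consumes most of the allowance, while cheap retries can continue until the same limit is reached.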

     The first point that differentiates this technique from the prior art is that it does more than simply check whether performing the next action would exceed the error recovery cost limit. It adds the ability to modify the next recovery actions based on the current total recovery cost. For example, consider the case where the next recovery step is to reset the device and then retry the failing operation, but, because of the actions already taken, the recovery cost of these operations would exceed the maximum allowed. In this case the technique chooses to skip the expensive reset operation and just do a retry. In this way the recovery attempt continues, but it is limited to what can be done while remaining within the total recovery effort limit.
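
One possible reading of this step-trimming behavior is sketched below. The disclosure does not say how actions are selected, so the in-order greedy rule used here is an assumption, as are the function name `trim_step`, the action names, and the limit of 60.

```python
# Hypothetical unit costs; 1 and 50 are the example values from the text.
UNIT_COST = {"reset_device": 50, "retry": 1}


def trim_step(planned, spent, limit):
    """Return (actions_to_run, new_spent) for one recovery step.

    Assumed rule: walk the planned actions in order and skip any action
    whose cost would push the running total past the limit.
    """
    kept = []
    for action in planned:
        cost = UNIT_COST[action]
        if spent + cost <= limit:
            kept.append(action)
            spent += cost
        # Otherwise the action is dropped and the cheaper ones still run.
    return kept, spent


step = ["reset_device", "retry"]
print(trim_step(step, spent=0, limit=60))   # (['reset_device', 'retry'], 51)
print(trim_step(step, spent=51, limit=60))  # (['retry'], 52)
print(trim_step(step, spent=60, limit=60))  # ([], 60)
```

In the second call the reset no longer fits, so only the retry is kept. The third call returns an empty list, which a caller could treat as the point where the error is declared unrecoverable.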

     The s...