A method for minimizing error recovery actions in a parallel processing environment

IP.com Disclosure Number: IPCOM000022262D
Original Publication Date: 2004-Mar-03
Included in the Prior Art Database: 2004-Mar-03
Document File: 2 page(s) / 46K

Publishing Venue

IBM

Abstract

Disclosed is a method for minimizing the quantity of error recovery actions in a multi-processing environment. The goal is to have one error recovery step that will recover multiple I/O operations in parallel, instead of each I/O operation performing individual error recovery actions.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

Within a multi-processing environment, the operating system dispatches I/O operations to the hardware (e.g., a controller or disk unit). Communication between the operating system and the hardware consists of building I/O requests and passing them back and forth through the architected interfaces between the two. Parallelism is achieved by having multiple I/O requests outstanding to the hardware simultaneously.

     The problem arises during the steps taken to recover from error conditions. Every I/O operation potentially has a different architected timeout. When an operation times out, the operating system is supposed to initiate a device/controller reset to recover the failing piece of hardware. Reset processing can take up to several minutes, because dump information is collected. If the hardware stops responding while the operating system has multiple operations outstanding, it is undesirable to issue more than one reset, since a single reset can recover all of the outstanding I/O operations at the same time. Multiple resets would lead to multiple hardware dumps to analyze, only one of which contains the true defect. The problem to be solved is how to coordinate the independent I/O operations so that at most one hardware reset is performed in this environment.
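     The uncoordinated behavior described above can be illustrated with a small Python model. This is only a sketch of the problem, not of the disclosed method: the names IoOperation and naive_reset_times, and the timeout values, are hypothetical, and the model assumes the hardware has already stopped responding, so every outstanding operation eventually reaches its own deadline and independently issues a reset.

```python
from dataclasses import dataclass


@dataclass
class IoOperation:
    """Hypothetical model of one outstanding I/O request."""
    name: str
    start: float    # seconds: when the request was dispatched
    timeout: float  # seconds: architected timeout for this request


def naive_reset_times(outstanding):
    """Times at which resets are issued when each timed-out operation
    resets the hardware on its own (assumes the hardware is hung, so
    no outstanding operation ever completes)."""
    return sorted(op.start + op.timeout for op in outstanding)


# Three independent operations, each with a different timeout (assumed values).
outstanding = [
    IoOperation("read", start=0.0, timeout=30.0),
    IoOperation("write", start=5.0, timeout=60.0),
    IoOperation("format", start=10.0, timeout=300.0),
]

reset_times = naive_reset_times(outstanding)
# One hardware failure produces one reset (and one dump) per operation.
print(len(reset_times), "resets at", reset_times)
```

     In this model a single hardware failure yields as many resets as there are outstanding operations, which is the situation the coordination is meant to avoid.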

     An existing solution is to throttle the number of resets that can occur within a particular period of time (e.g., 4 per hour). The problem with this mechanism is that it can still capture an excessive number of dumps, especially with parallel, independent operations.
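     A throttle of this kind might look like the following Python sketch. It is an illustration under stated assumptions, not the implementation of any particular operating system: the class name ResetThrottle, the sliding-window policy, and the timestamps are hypothetical, and only the limit of 4 per hour comes from the example above.

```python
from collections import deque


class ResetThrottle:
    """Hypothetical sliding-window throttle: allow at most max_resets
    resets within any window_seconds interval."""

    def __init__(self, max_resets=4, window_seconds=3600.0):
        self.max_resets = max_resets
        self.window_seconds = window_seconds
        self._granted = deque()  # timestamps of resets that were allowed

    def allow_reset(self, now):
        """Return True if a reset requested at time `now` may proceed."""
        # Forget resets that have aged out of the window.
        while self._granted and now - self._granted[0] >= self.window_seconds:
            self._granted.popleft()
        if len(self._granted) < self.max_resets:
            self._granted.append(now)
            return True
        return False


# Six operations time out after a single hardware hang (assumed times, in seconds).
throttle = ResetThrottle()
timeouts = [30.0, 45.0, 60.0, 90.0, 120.0, 300.0]
decisions = [throttle.allow_reset(t) for t in timeouts]
print(decisions)
```

     With these assumed inputs the throttle grants the first four requests and refuses the last two, so one hardware failure still produces four resets and four dumps, which shows why throttling alone does not reach the goal of at most one reset.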

     Another existing solution is to cancel all outstanding I/O operations when the first timeout occurs, before allowing any new operations to be initiated. The drawback of this kind of solution is the underlying assumption that if one operation fails, so too will all the other outstanding operations. If this assumption i...