A Method to Prevent a Box Down Situation Due to Repeated Host I/O Access Requests that Cause a Failure
Publication Date: 2013-Jan-31
The IP.com Prior Art Database
Disclosed is a method to prevent a box down situation due to repeated host access requests to a failed component. This invention prevents a resource from becoming fenced by sending a message to the host that a problem exists and needs intervention. The method introduces new descriptive metadata and a new module that decides whether a new input/output (I/O) can be executed immediately or prevented from causing a loss of access or box down situation.
Repeated hardware errors can lead to outage situations. For example, a hardware or microcode failure detected at the Host Adapter (HA) level can cause the same input/output (I/O) error to occur multiple times. The failure eventually leads to the fencing of the HA resource. When the host determines that it cannot use the failed resource because the resource has been fenced, it tries the next available path to access the storage controller. When the host tries the next logical path, meaning the next HA, the same problem occurs. Eventually, that HA is also fenced. Although fencing an adapter prevents an error from constantly being presented to the host through a path using that adapter, the current approach does not take into consideration that a host might try the next available path. A host that retries the I/O on every path causes all HAs to be fenced. At that point, there is a complete Loss of Access (LOA), not only for the host with the failed I/O, but also for all hosts accessing the storage controller through the same HAs (which are now fenced).
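The cascade described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class HostAdapter, the function host_retry_all_paths, and the fence threshold of three errors are hypothetical names and values chosen for illustration, not part of the disclosure.

```python
from dataclasses import dataclass

# Assumed value: number of errors tolerated before an adapter is fenced.
FENCE_THRESHOLD = 3


@dataclass
class HostAdapter:
    """Hypothetical model of a Host Adapter (HA) that fences itself on repeated errors."""
    name: str
    error_count: int = 0
    fenced: bool = False

    def submit(self, io_is_faulty: bool) -> bool:
        """Return True if the I/O succeeds; count errors and fence at the threshold."""
        if self.fenced:
            return False
        if io_is_faulty:
            self.error_count += 1
            if self.error_count >= FENCE_THRESHOLD:
                self.fenced = True
            return False
        return True


def host_retry_all_paths(adapters, io_is_faulty=True):
    """Model a host that retries on one path until it is fenced, then tries the next."""
    for ha in adapters:
        while not ha.fenced:
            if ha.submit(io_is_faulty):
                return True
    return False


adapters = [HostAdapter(f"HA{i}") for i in range(4)]
completed = host_retry_all_paths(adapters)
fenced_names = [ha.name for ha in adapters if ha.fenced]
print(completed, fenced_names)
```

In this model a single I/O that always fails leaves every adapter fenced, so any other host sharing those adapters loses access as well.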
The invented solution is a method to prevent a box down situation due to repeated host access requests to a failed component. This invention prevents a resource from becoming fenced by sending a message to the host that a problem exists and needs intervention. The method introduces new descriptive metadata and a new module that decides whether a new I/O can be executed immediately or prevented from causing a loss of access or box down situation.
The method maintains a history of errors in metadata that contains levels and frequency of failures in accessing components of a storage subsystem. This new descriptive metadata contains information about the failures and errors detected by each component in a storage subsystem. The component can be a microcode component, hardware component, or a hybrid of the two. The metadata consists of the following elements:
• A global flag or a bit indicating that a high severity error condition exists
• An index to the location of the failed component and its environment
• Subsequent elements describing:
- the failing component
- the number of times the error has occurred
- the severity of the error and the action to take
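The elements listed above could be laid out roughly as follows. This is a minimal sketch, not the disclosed implementation: the names ErrorMetadata, ComponentErrorRecord, Severity, and record_error, the three severity levels, and the action strings are all assumptions made for illustration, and the index is modeled as a dictionary key rather than a memory location.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, Optional


class Severity(IntEnum):
    """Assumed severity scale; the disclosure does not enumerate the levels."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class ComponentErrorRecord:
    """Per-component element: failing component, error count, severity, action."""
    component_id: str
    error_count: int = 0
    severity: Severity = Severity.LOW
    action: str = "none"  # hypothetical action label


@dataclass
class ErrorMetadata:
    """Global flag, index to the failed component, and per-component records."""
    high_severity_flag: bool = False
    failed_index: Optional[str] = None
    records: Dict[str, ComponentErrorRecord] = field(default_factory=dict)

    def record_error(self, component_id: str, severity: Severity, action: str):
        """Update the component's record; raise the global flag on high severity."""
        rec = self.records.setdefault(
            component_id, ComponentErrorRecord(component_id)
        )
        rec.error_count += 1
        rec.severity = severity
        rec.action = action
        if severity == Severity.HIGH:
            self.high_severity_flag = True
            self.failed_index = component_id
        return rec


meta = ErrorMetadata()
meta.record_error("HA0", Severity.LOW, "log")
meta.record_error("HA0", Severity.HIGH, "notify host")
print(meta.high_severity_flag, meta.failed_index, meta.records["HA0"].error_count)
```

Keeping the flag and index at the top level means a reader only has to check one field before deciding whether to look up the detailed record.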
Each component populates the descriptive metadata and determines its own severity level based on predetermined conditions. A flag bit marks a high severity condition so that incoming I/O can quickly recognize a potential problem and act on it accordingly (described below as the gatekeeper). The flag bit is set only when an error condition reaches high severity. Next to the flag bit is an index to the location in the metadata of the failed component and its environment. That location contains the identification of the component, the type of failure, the severity of the failure, and an act...