Controlled Error Detection and Recovery of Paths to Data, while systems remain in normal operation

IP.com Disclosure Number: IPCOM000237047D
Publication Date: 2014-May-28
Document File: 4 page(s) / 305K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is an approach for controlled error detection and recovery of paths to data for high capacity, high performance storage systems. The approach is active while systems remain in normal operation, systematically checking alternate paths using high levels of stress to simulate customer workloads, and then executing fixes before the system is compromised.

This is the abbreviated version, containing approximately 51% of the total text.

A high capacity, high performance storage system has at least two paths to data during normal operation. If one path goes down, then the alternate path is used. However, if the alternate path is not operational, then this can cause a loss of access to customer data.

The solution is a system that systematically checks the alternate paths under high levels of stress, simulating customer workloads, to ensure that they are fully operational. If an error is detected on an alternate path, the system communicates the problem and takes action to resolve any latent issues. This is particularly important for systems in which an older integrated circuit card is exposed to transient errors at a rate higher than the microchip standard. Once an error is detected on an alternate path, actions can be taken to resolve it (e.g., resetting, fencing hardware, or calling home to allow resolution of the issue) before a second error occurs.
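The check-and-recover flow described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names AlternatePath, stress_check, and validate_alternate_path are hypothetical, the "stress" is simply a burst of test I/Os, and the escalation order (reset and retest first, then fence and call home) is an assumption. The disclosure lists these actions but does not specify their order.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Action(Enum):
    RESET = "reset"
    FENCE = "fence"
    CALL_HOME = "call_home"


@dataclass
class AlternatePath:
    """Hypothetical stand-in for an alternate path to data."""
    name: str
    io: Callable[[int], bool]  # returns True when test I/O number i succeeds
    fenced: bool = False
    log: List[Action] = field(default_factory=list)

    def reset(self) -> None:
        # A real system would reset hardware here; the sketch only records it.
        self.log.append(Action.RESET)

    def fence(self) -> None:
        self.fenced = True
        self.log.append(Action.FENCE)

    def call_home(self) -> None:
        self.log.append(Action.CALL_HOME)


def stress_check(path: AlternatePath, io_count: int = 1000) -> bool:
    """Issue up to io_count test I/Os; stop at the first failure."""
    return all(path.io(i) for i in range(io_count))


def validate_alternate_path(path: AlternatePath, io_count: int = 1000) -> bool:
    """Check the path under stress and recover if an error is found.

    Assumed escalation: reset and retest once; if the error persists,
    fence the path and call home.
    """
    if stress_check(path, io_count):
        return True
    path.reset()
    if stress_check(path, io_count):
        return True  # transient error cleared by the reset
    path.fence()
    path.call_home()
    return False
```

Retesting after the reset is a design choice in this sketch: it distinguishes a transient error, which the reset clears, from a persistent fault that needs fencing and service.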

The advantage is that an alternate path that is rarely used, but critical when it is needed, is validated before it has to carry the workload.

In an example embodiment, a storage system's device adapter (DA) is used to implement a non-affinity adapter heartbeat (HB) with high stress (i.e., the unused adapter is periodically validated so that, when it is needed in an error situation, it can perform its duties).

If the heartbeat discovers an error, recovery action is taken. This includes actions such as resetting the bay to clear any latent transient errors, calling home, and/or fencing hardware. The DA microcode runs this heartbeat periodically to ensure that the machine is operational at critical times. The heartbeat may also be run on demand to validate the alternate path to data at key times, such as prior to a code load. The heartbeat can be synchronized across multiple storage systems. This includes items such as:

• Specific solution for the non-affinity adapter heartbeat
• Cover HB and High Stre...
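The combination of a periodic heartbeat and an on-demand run (for example, before a code load) might look like the sketch below. It is illustrative only: HeartbeatScheduler, tick, run_on_demand, and safe_to_load_code are hypothetical names, the check is any callable that returns True when the alternate path is healthy, and time is passed in explicitly so the logic can be followed without real timers. Restarting the periodic interval after an on-demand run is an assumption, not something the disclosure states.

```python
from typing import Callable, List, Tuple


class HeartbeatScheduler:
    """Runs a path check at a fixed interval and also on demand (sketch)."""

    def __init__(self, check: Callable[[], bool], interval_s: float) -> None:
        self.check = check
        self.interval_s = interval_s
        self.next_due = 0.0  # first periodic run is due immediately
        self.history: List[Tuple[float, str, bool]] = []

    def _run(self, now: float, reason: str) -> bool:
        ok = self.check()
        self.history.append((now, reason, ok))
        # Assumption: any run restarts the periodic interval.
        self.next_due = now + self.interval_s
        return ok

    def tick(self, now: float) -> None:
        """Call regularly; runs the heartbeat only when the interval has elapsed."""
        if now >= self.next_due:
            self._run(now, "periodic")

    def run_on_demand(self, now: float, reason: str) -> bool:
        """Run the heartbeat immediately, e.g. before a code load."""
        return self._run(now, reason)


def safe_to_load_code(scheduler: HeartbeatScheduler, now: float) -> bool:
    """Gate a code load on a fresh validation of the alternate path."""
    return scheduler.run_on_demand(now, "pre-code-load")
```

A caller would invoke tick from its main loop and call safe_to_load_code immediately before starting a code load, proceeding only if it returns True.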