Isolate Faulty Components in a Clustered Storage System with Random Redistribution of Errors in User Data

IP.com Disclosure Number: IPCOM000245495D
Publication Date: 2016-Mar-12
Document File: 3 page(s) / 50K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method to isolate faulty components, or a system-wide fault, in a clustered storage system and take the affected components offline or shut them down.

In a storage system, software (SW) and hardware (HW) bugs and malfunctions (e.g., of compression objects and compression hardware, or slices/dices and a multipoint control unit (MCU)) can lead to errors in the user data written to the storage device, either before or after processing (e.g., compressing/uncompressing). A clustered storage system spreads the processed objects, such as compressed objects, equally and fairly across the storage system. When a storage component fails, the system redistributes its objects to the remaining storage components, and it may be assumed that this redistribution is random. Errors may be detected during an input/output (I/O) operation on previously written data (i.e., a read/modify). Some errors can be expected to go undetected before redistribution, so the erroneous data is assumed to move with the objects according to the new distribution.
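
As an illustration only, the following sketch shows the assumed random reassignment of a failed component's objects to the surviving components; the function name and data layout are assumptions for this sketch, not part of the disclosure. The point is that undetected erroneous objects are scattered across the survivors with the same uniform randomness as healthy objects.

    import random

    def redistribute(objects_by_component, failed):
        """Reassign each object of the failed component to a randomly chosen survivor."""
        survivors = [c for c in objects_by_component if c != failed]
        for obj in objects_by_component.pop(failed):
            # Uniform random placement: erroneous objects spread out evenly as well.
            objects_by_component[random.choice(survivors)].append(obj)
        return objects_by_component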

When errors are detected in the stored data, it is beneficial to isolate the faulty component causing the error and take it offline or shut it down before more user data is compromised. In some cases, the cause of the error may be global to the system (e.g., a SW bug affecting all objects); in that case, the entire storage cluster should be shut down or taken offline.

The novel contribution is a method to isolate faulty components, or a system-wide fault, in a clustered storage system and take those components offline or shut them down.

The manager of the storage system collects error statistics for the objects (e.g., compression objects or slices/dices) as errors are detected during I/O operations. The manager maintains a global error count as well as statistical information about the errors of each component. The statistical information is used to identify a statistically significant divergence from the average, i.e., a component that stands out compared to the other components, which indicates a high probability of a faulty state. In such a case, the manager can take that component offline. The manager can also evaluate the error statistics for the entire system and identify a global error condition, in which case it may take the entire storage system offline or shut it down.
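
A minimal sketch of this bookkeeping follows, assuming a simple z-score test of each component's error rate against the cluster average with illustrative thresholds; the class name, the thresholds, and the choice of test are assumptions, since the disclosure only specifies per-component and global error statistics plus a significance check.

    from collections import defaultdict
    import math

    class ErrorStatsManager:
        """Illustrative manager-side bookkeeping of per-component and global error statistics."""

        def __init__(self, z_threshold=3.0, global_rate_limit=0.01):
            self.errors = defaultdict(int)      # detected errors per component
            self.ios = defaultdict(int)         # I/O operations per component
            self.z_threshold = z_threshold      # divergence considered statistically significant
            self.global_rate_limit = global_rate_limit  # cluster-wide error-rate limit

        def record_io(self, component, error_detected):
            """Called for every I/O operation; counts the operation and any detected error."""
            self.ios[component] += 1
            if error_detected:
                self.errors[component] += 1

        def faulty_components(self):
            """Components whose error rate diverges significantly from the cluster average."""
            rates = {c: self.errors[c] / n for c, n in self.ios.items() if n}
            if len(rates) < 2:
                return []
            mean = sum(rates.values()) / len(rates)
            std = math.sqrt(sum((r - mean) ** 2 for r in rates.values()) / (len(rates) - 1))
            if std == 0:
                return []
            return [c for c, r in rates.items() if (r - mean) / std > self.z_threshold]

        def global_fault(self):
            """True when the cluster-wide error rate suggests a system-wide cause (e.g., a SW bug)."""
            total_ios = sum(self.ios.values())
            return total_ios > 0 and sum(self.errors.values()) / total_ios > self.global_rate_limit

In this sketch, the manager would take any component returned by faulty_components() offline and would take the whole cluster offline or shut it down when global_fault() returns true.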

The advantage of this method is that, if undetected errors are redistributed with the objects, those errors are assumed to be redistributed randomly, which keeps the statistical distribution of the divergence unchanged. Only when the statistics indicate a significant divergence is it assumed...