
A method and system to alleviate double disk failure impact on raid level

IP.com Disclosure Number: IPCOM000222569D
Publication Date: 2012-Oct-18

Publishing Venue

The IP.com Prior Art Database

Abstract

This invention provides a new method of finding faulty disks at the RAID level. It measures the health of RAID-member disks from a collective perspective, so that "sub-healthy" disks are kicked out of the RAID before they turn "unhealthy" at a critical time.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.



Background & Problem with existing technology

Data loss and downtime due to disk failures cost businesses billions of dollars each year. RAID technology provides a certain level of protection, but not complete avoidance. For example, single-parity RAID 5 does not protect against double failure events, i.e. two concurrent disk failures. In practice, this often happens in the following scenario: one disk fails and is kicked out of the array; then, during the array rebuild/reconstruction, another disk fails, leaving the array offline with data loss.

Furthermore, although disk drive densities continue to increase, reliability has not improved at the same rate. Businesses today have already adopted TB-level Serial ATA (SATA) drives to reduce storage costs, while still eagerly awaiting even larger drives. In the meantime, however, they face a substantially increased risk of double disk failures, or of unrecoverable media errors during reconstruction (MEDR), that result in data loss.

How can double failure incidents be reduced? Consider how a double failure happens. Modern disk drives have their own failure prediction mechanism, a.k.a. SMART, a monitoring system for computer hard disk drives that detects and reports various reliability indicators in the hope of anticipating failures. Contemporary enterprise storage systems also embed disk error monitoring and failure prediction in various components, e.g. disk enclosure firmware, RAID adapter firmware, the RAS (Reliability, Availability and Serviceability) component, etc. Most of them use a threshold mechanism: if a certain type of disk error reaches a predefined threshold value, an alert or notification is sent out, and the disk is usually kicked out of the array. Let's call this the "first failure". A hot spare is then taken in and the array starts rebuilding.

Obviously the rejected drive has some type of error beyond the threshold, but what about the rest of the drives in the same array? Generally speaking, drives of the same array are of the same type and from the same batch. The same physical (same environment) and logical (similar data access pattern) working conditions will probably make these drives age in the same way. So there is a high probability that the remaining drives exhibit the same type of error, and some may even have accumulated a count near, but just below, the threshold. During the rebuild, these drives are at risk of failing; if one does, a "second failure" occurs.
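The threshold mechanism described above can be sketched as follows. This is a minimal illustration, not the disclosure's implementation; all names (`DiskStats`, `MEDIA_ERROR_THRESHOLD`) and the numbers are invented for the example.

```python
from dataclasses import dataclass

MEDIA_ERROR_THRESHOLD = 50  # hypothetical predefined threshold for one error type


@dataclass
class DiskStats:
    disk_id: str
    media_errors: int  # accumulated count of one type of disk error


def should_reject(disk: DiskStats) -> bool:
    """Classic per-disk check: reject a disk once its own count crosses the threshold."""
    return disk.media_errors >= MEDIA_ERROR_THRESHOLD


array = [
    DiskStats("disk0", 52),  # the "first failure": beyond threshold, rejected
    DiskStats("disk1", 47),  # near-threshold siblings stay in the array...
    DiskStats("disk2", 45),  # ...and may fail during rebuild (the "second failure")
    DiskStats("disk3", 3),
]
rejected = [d.disk_id for d in array if should_reject(d)]
print(rejected)  # ['disk0']
```

Note how disk1 and disk2, with counts just below the threshold, are invisible to this per-disk check even though they are the likeliest "second failure" candidates.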

Current error monitoring and failure prediction methods, whether SMART within the disk drive itself or the upper-layer subsystems of storage products, all focus on a single drive, i.e. thresholding on one disk's certain error types. So they work well only to predict the "first failure". This disclosure describes a metho...
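Although the extract truncates here, the abstract's idea of assessing the array "from a collective perspective" might be sketched as follows: once one drive crosses the threshold, flag sibling drives whose accumulated counts are close to it as "sub-healthy". The `SUB_HEALTHY_RATIO` parameter is an assumption invented for this sketch, not a value from the disclosure.

```python
MEDIA_ERROR_THRESHOLD = 50  # hypothetical per-disk rejection threshold
SUB_HEALTHY_RATIO = 0.8     # assumed: >= 80% of threshold counts as "sub-healthy"


def sub_healthy(counts: dict) -> list:
    """Return drives whose error counts are near (but below) the threshold.

    Such drives pass the per-disk check yet are likely to fail during a
    rebuild, so they are candidates for proactive replacement.
    """
    limit = SUB_HEALTHY_RATIO * MEDIA_ERROR_THRESHOLD
    return sorted(d for d, c in counts.items() if limit <= c < MEDIA_ERROR_THRESHOLD)


counts = {"disk0": 52, "disk1": 47, "disk2": 45, "disk3": 3}
print(sub_healthy(counts))  # ['disk1', 'disk2']
```

Under this sketch, disk1 and disk2 would be scheduled for replacement before the rebuild rather than being trusted through it.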