A smart sparing system for declustered RAID storage

IP.com Disclosure Number: IPCOM000245964D
Publication Date: 2016-Apr-21
Document File: 8 page(s) / 353K

Publishing Venue

The IP.com Prior Art Database

Abstract

Declustered RAID technology is used to solve the long rebuild time issue. However, as the array set width grows, the probability of multiple drive failures in the same array also increases. This disclosure describes a sparing system designed to preemptively rebuild potentially bad drives in the background, without having to force the rejection of a bad drive. By the time a rejection is forced the drive is usually already in very poor condition, and a rejection at an inappropriate time causes more impact to the system.

This is the abbreviated version, containing approximately 30% of the total text.

1 Background

As drive capacity keeps increasing quickly while drive bandwidth does not, array rebuilds take longer. Declustered RAID (referred to as DRAID in this article) technology is used to solve the long rebuild time issue. However, as the array set width grows (a DRAID system can typically use over 40, or even 100+, drives to create one DRAID array), the probability of multiple drive failures in the same array also increases.

1.1 DRAID terminology:

A DRAID array has both an array width and a set width

- The width of the array is the total number of strips in the array stride

e.g. a 5+P+Q RAID-6 array has a width of 7

- The width of the set is the number of drives that the array is distributed across

- The set width may be any value greater than or equal to the array width

Optionally a DRAID array may include distributed hot spares

- A distributed spare only provides protection for a single array

- Without a distributed spare, the rebuild time is no different from that of a traditional RAID array
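
The width terms above can be captured in a small data structure. This is only an illustrative sketch: the class name `DraidGeometry` and its field names are hypothetical and do not come from the disclosure. It encodes the two rules stated above: the array width is the total number of strips in a stride, and the set width must be greater than or equal to the array width.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraidGeometry:
    """Hypothetical container for the DRAID width terms (names are assumptions)."""

    data_strips: int    # e.g. 5 in a 5+P+Q layout
    parity_strips: int  # e.g. 2 (P and Q) for RAID-6
    set_width: int      # number of drives the array is distributed across

    def __post_init__(self):
        if self.data_strips < 1 or self.parity_strips < 0:
            raise ValueError("invalid strip counts")
        # The set width may be any value >= the array width.
        if self.set_width < self.array_width:
            raise ValueError("set width must be >= array width")

    @property
    def array_width(self) -> int:
        # Total number of strips in one array stride.
        return self.data_strips + self.parity_strips


# The 5+P+Q RAID-6 example from the text, spread over 64 drives.
geometry = DraidGeometry(data_strips=5, parity_strips=2, set_width=64)
print(geometry.array_width)  # 7
```

A set width equal to the array width (7 drives here) is still accepted, while anything smaller is rejected.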

1.2 Some traits of a DRAID array:

- DRAID can be implemented on top of RAID5, RAID6, Reed-Solomon code or erasure code, so a DRAID array can support one or more drives rebuilding at the same time.

- A DRAID array should contain a number of distributed spares equal to the maximum number of concurrent rebuilds of the DRAID.

e.g., a DRAID implementing RAID6 should contain 2 distributed spares. A DRAID array implementing an erasure code that can tolerate the loss of 3 disks should have 3 distributed spares.

- DRAID rebuild time is much faster than that of a traditional array; however, it is still limited by storage controller bandwidth.

For a DRAID array under IO stress, the rebuild time is determined by (the max storage controller bandwidth) - (current IO bandwidth). So for an array with high IO stress, the rebuild time will be longer than the best case. Likewise, for an array with multiple drives rebuilding, the rebuild time will be longer than the best case.

- A distributed spare is actually spare space distributed across all disks in the array set.
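
Two of the traits above lend themselves to a short worked example: the spare-count rule and the bandwidth limit on rebuild time. The sketch below is a simplification under stated assumptions. The function names, the MB and MB/s units, and the sample figures are hypothetical, and dividing the data volume by the leftover bandwidth is an assumed model; the text only says the rebuild time is determined by the difference between the two bandwidths.

```python
def required_distributed_spares(max_concurrent_rebuilds: int) -> int:
    """One distributed spare per drive the array can rebuild concurrently."""
    if max_concurrent_rebuilds < 1:
        raise ValueError("the array must tolerate at least one rebuild")
    return max_concurrent_rebuilds


def estimated_rebuild_seconds(data_to_rebuild_mb: float,
                              controller_max_mb_s: float,
                              current_io_mb_s: float) -> float:
    """Rough rebuild time, assuming the rebuild gets whatever controller
    bandwidth the current IO load leaves free (an assumed model).

    data_to_rebuild_mb is the total across all drives being rebuilt, so
    several concurrent rebuilds lengthen the estimate.
    """
    available = controller_max_mb_s - current_io_mb_s
    if available <= 0:
        raise ValueError("no controller bandwidth left for the rebuild")
    return data_to_rebuild_mb / available


# RAID6-based DRAID: 2 concurrent rebuilds, hence 2 distributed spares.
print(required_distributed_spares(2))

# Illustrative numbers only: 4,000,000 MB to rebuild, 4000 MB/s controller.
idle = estimated_rebuild_seconds(4_000_000, 4000, 0)     # best case
busy = estimated_rebuild_seconds(4_000_000, 4000, 3000)  # heavy IO load
print(idle, busy)
```

With these made-up figures the loaded rebuild takes four times as long as the idle one, which matches the qualitative point that IO stress stretches the rebuild window.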

1.3 Some typical scenarios of multiple drive failure:

- One or multiple drives are rebuilding in one DRAID array, and another drive has media errors

This will cause data loss on the array.

- One or multiple drives are rebuilding in one DRAID array, and another drive has a hardware error and falls out of the array

This will lose the entire array.

- Multiple drives have hardware failures or fall out of the array

This will lose the entire array; we may be able to bring the array back during recovery if the drives are not too badly damaged.

1.4 Summary of the problem:

- The probability of multiple drive failures in the same array is several times higher than in a traditional array

e.g., for a DRAID array with set width = 64, where a traditional array would normally have width = 8, the probability of multiple drive failures in the same array in theory increases by 8 times.

- The rejection of a drive leads to reb...