
Method for preventing catastrophic failures in RAID volumes using dynamic reconfiguration

IP.com Disclosure Number: IPCOM000125746D
Publication Date: 2005-Jun-15
Document File: 5 page(s) / 52K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method for preventing catastrophic failures in redundant array of independent disks (RAID) volumes using dynamic reconfiguration. Benefits include improved functionality and reliability.

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 42% of the total text.

Background

      RAID systems are conventionally used for their high throughput and low cost. However, they have the inherent disadvantage of lower reliability. The mean time between failures (MTBF) of a RAID system without a data protection mechanism can be expressed as F/n, where n is the number of disk drives and F is the MTBF of each drive. As a result, the MTBF of the RAID system is proportional to the MTBF of the individual disks and inversely proportional to the number of disks: it falls as the MTBF of the individual disks decreases and as the number of disks increases. The eventual result can be data loss.
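As a quick check of the F/n relationship, the following Python sketch assumes independent drives with a constant failure rate (exponentially distributed lifetimes), which is the usual assumption under which the formula holds. The helper name `array_mtbf` and the hour values are illustrative, not taken from the disclosure.

```python
def array_mtbf(drive_mtbf_hours: float, n_drives: int) -> float:
    """MTBF of a RAID volume with no check disks, assuming independent drives
    with a constant failure rate: any single drive failure loses data, so the
    array MTBF is F / n."""
    if n_drives < 1:
        raise ValueError("a volume needs at least one drive")
    return drive_mtbf_hours / n_drives


# Illustrative figures only: doubling the drive count halves the array MTBF,
# and so does halving the per-drive MTBF.
for f, n in [(1_000_000, 5), (1_000_000, 10), (500_000, 10)]:
    print(f"F={f:>9,} h  n={n:>2}  array MTBF={array_mtbf(f, n):>9,.0f} h")
```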

      The configuration of a RAID volume is static and is maintained throughout the life of the volume. As the individual drives age, bad sectors develop and recovery operations become more frequent. The probability that at least one drive fails increases as the number of drives in a RAID volume increases.
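To make the last point concrete: if each drive independently fails within some period with probability p, the chance that at least one of n drives fails in that period is 1 - (1 - p)^n, which grows with n. The sketch below assumes independent failures; the function name and the probability value are illustrative, not from the disclosure.

```python
def prob_any_failure(p_drive: float, n_drives: int) -> float:
    """Probability that at least one of n independent drives fails within a
    period, given a per-drive failure probability p_drive for that period."""
    if not 0.0 <= p_drive <= 1.0:
        raise ValueError("p_drive must be a probability")
    return 1.0 - (1.0 - p_drive) ** n_drives


# Illustrative per-drive probability; older drives would have a larger value.
for n in (1, 4, 8, 16):
    print(n, round(prob_any_failure(0.03, n), 4))
```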

      To improve the reliability of RAID systems, check bits derived from the stored data are used for data recovery. Several algorithms, such as exclusive OR (XOR) parity-based or Reed-Solomon (R-S) code-based algorithms, are used to generate and maintain the check bits. With these algorithms, the number of simultaneous disk failures that can be tolerated is the same as the number of check disks.
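For the single-parity case, the XOR check block is the bytewise XOR of the data blocks, and any one missing block equals the XOR of the parity with the surviving blocks. The Python sketch below illustrates only that single-failure case; the block contents and function names are illustrative assumptions, and multi-failure codes such as Reed-Solomon are not shown.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Bytewise XOR of equally sized byte blocks (the XOR parity check block)."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("blocks must be non-empty and equally sized")
    return bytes(reduce(xor, column) for column in zip(*blocks))


def rebuild_missing(surviving, parity):
    """Recover the one lost data block: XOR of the parity with every surviving block."""
    return xor_blocks(list(surviving) + [parity])


data = [b"RAID", b"disk", b"XOR!"]       # three data disks (illustrative)
parity = xor_blocks(data)                 # one check disk
lost = 1                                  # simulate failure of data disk 1
survivors = [blk for i, blk in enumerate(data) if i != lost]
print(rebuild_missing(survivors, parity))  # recovers b'disk'
```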

      For example, a dual XOR-based RAID system includes seven disks. Five disks store data (data disks), and two disks are check disks that store the check bits (the horizontal and diagonal parity). The RAID system can recover from the simultaneous failure of up to two disks (see Figure 1).
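The disclosure does not name its dual XOR code. The seven-disk layout described, five data disks plus horizontal and diagonal parity, matches the published EVENODD code with prime p = 5, so the Python sketch below assumes EVENODD. It computes both check columns and rebuilds any two failed data disks; for brevity it solves the parity equations by Gaussian elimination over GF(2) rather than using EVENODD's dedicated decoding procedure. Each disk holds p - 1 = 4 one-byte cells per stripe, an illustrative simplification, and all names are hypothetical.

```python
import random
from functools import reduce
from operator import xor

P = 5         # EVENODD prime (assumed): P data disks + 2 check disks = 7 disks
ROWS = P - 1  # cells per disk in one stripe


def cell(data, r, j):
    """Data cell (row r, disk j); row P-1 is EVENODD's imaginary all-zero row."""
    return 0 if r == P - 1 else data[j][r]


def encode(data):
    """Return the horizontal and diagonal parity columns for P data columns."""
    horiz = [reduce(xor, (data[j][r] for j in range(P)), 0) for r in range(ROWS)]
    # S: XOR of the special diagonal r + j = P - 1, folded into every diagonal parity.
    s = reduce(xor, (cell(data, P - 1 - j, j) for j in range(1, P)), 0)
    diag = [s ^ reduce(xor, (cell(data, (d - j) % P, j) for j in range(P)), 0)
            for d in range(ROWS)]
    return horiz, diag


def recover_two(data, lost, horiz, diag):
    """Rebuild two lost data disks from the surviving disks and both check columns."""
    unknowns = [(r, j) for j in lost for r in range(ROWS)]
    index = {u: k for k, u in enumerate(unknowns)}

    def equation(cells, rhs):
        # Express "XOR of cells == rhs" as (bitmask of unknown cells, known constant).
        mask, value = 0, rhs
        for r, j in cells:
            if r == P - 1:
                continue                      # imaginary zero row
            if (r, j) in index:
                mask ^= 1 << index[(r, j)]
            else:
                value ^= data[j][r]
        return mask, value

    special = [(P - 1 - j, j) for j in range(1, P)]
    eqs = [equation([(r, j) for j in range(P)], horiz[r]) for r in range(ROWS)]
    eqs += [equation([((d - j) % P, j) for j in range(P)] + special, diag[d])
            for d in range(ROWS)]

    # Gauss-Jordan elimination over GF(2); constants combine with XOR.
    pivots = {}
    for col in range(len(unknowns)):
        k = next(i for i, (m, _) in enumerate(eqs) if (m >> col) & 1)
        pm, pv = eqs.pop(k)

        def eliminate(m, v):
            return (m ^ pm, v ^ pv) if (m >> col) & 1 else (m, v)

        eqs = [eliminate(m, v) for m, v in eqs]
        pivots = {c: eliminate(m, v) for c, (m, v) in pivots.items()}
        pivots[col] = (pm, pv)

    rebuilt = {j: [0] * ROWS for j in lost}
    for col, (_, v) in pivots.items():
        r, j = unknowns[col]
        rebuilt[j][r] = v
    return rebuilt


# Usage: random one-byte cells, then every possible pair of failed data disks.
random.seed(0)
data = [[random.randrange(256) for _ in range(ROWS)] for _ in range(P)]
horiz, diag = encode(data)
for a in range(P):
    for b in range(a + 1, P):
        damaged = [[0] * ROWS if j in (a, b) else c for j, c in enumerate(data)]
        rebuilt = recover_two(damaged, (a, b), horiz, diag)
        assert rebuilt[a] == data[a] and rebuilt[b] == data[b]
print("all two-disk failures recovered")
```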

      The MTBF of the RAID system is inversely proportional to the number of disks in the volume. When a RAID system includes a large number of data disks, the number of check disks should be increased proportionately to maintain the MTBF level. The number of check disks in a RAID system depends on the RAID configuration. In conventional systems, the RAID level is set once, maintained statically, and cannot be changed dynamically. If the number of check disks is large while few disk failures occur, the storage space reserved for check bits is largely wasted. When the system is old, more check disks are required; because a static configuration cannot add them, reliability is compromised when the individual disks in the RAID system are old.
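The trade-off in this paragraph can be quantified under simplifying assumptions that are not part of the disclosure: disks fail independently with probability p during a window in which no rebuild completes, and a volume with n = k + m disks (k data disks, m check disks) survives any m simultaneous failures, as stated for the XOR and R-S codes. Data is lost only when more than m disks fail, and the check disks consume a fraction m/n of the raw capacity:

```latex
P_{\text{loss}}(n, m, p) = \sum_{i=m+1}^{n} \binom{n}{i}\, p^{i} (1-p)^{n-i},
\qquad
\text{check overhead} = \frac{m}{n} = \frac{m}{k+m}
```

For the seven-disk example with two check disks, the overhead is 2/7, about 28.6% of raw capacity. Aging raises p, which raises P_loss for a fixed m; for a fixed number of disks, raising m lowers P_loss at the cost of a larger overhead.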

      Conventionally, RAID volumes are not reconfigured. When a hard drive fails, RAID recovery is used to restore the volume data.

General description

      The disclosed method optimizes the RAID system storage space used for check bits and prevents the possibility of catastrophic failure.

      The key elements of the disclosed method include the following:

•             Reconfiguration is based on the current reliability of the RAID system.

•           ...
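The reconfiguration rules themselves are not given in this excerpt, so the following is only one possible reading of the first element, sketched under stated assumptions. The Python below implements a hypothetical policy: given an estimate of the current per-disk failure probability, it picks the smallest number of check disks whose data-loss probability meets a target. It assumes independent disk failures, no rebuild during the window, and a code in which m check disks tolerate m failures. The function names, the target value, and the probabilities are illustrative.

```python
from math import comb


def loss_probability(n_disks: int, n_check: int, p_fail: float) -> float:
    """Probability that more than n_check of n_disks independent disks fail
    in the window (data loss for a code that tolerates n_check failures)."""
    return sum(comb(n_disks, i) * p_fail**i * (1 - p_fail) ** (n_disks - i)
               for i in range(n_check + 1, n_disks + 1))


def choose_check_disks(n_disks: int, p_fail: float, target: float) -> int:
    """Hypothetical reconfiguration policy: the fewest check disks that keep the
    data-loss probability at or below `target` for the current p_fail."""
    for n_check in range(n_disks):        # keep at least one data disk
        if loss_probability(n_disks, n_check, p_fail) <= target:
            return n_check
    raise ValueError("target cannot be met with this many disks")


# Illustrative: as the estimated per-disk failure probability rises with age,
# the policy reserves more of the seven disks for check bits.
for p in (0.01, 0.05):
    m = choose_check_disks(7, p, target=1e-4)
    print(f"p={p}: {m} check disk(s), {7 - m} data disk(s)")
```

A real implementation would also have to restripe the existing data whenever the number of check disks changes; that step is not shown here.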