A method to recover from a RAID rebuild failure on the fly

IP.com Disclosure Number: IPCOM000234059D
Publication Date: 2014-Jan-09
Document File: 8 page(s) / 102K

Publishing Venue

The IP.com Prior Art Database

Abstract

Summary of Invention: This invention restores data on the fly during the array rebuilding process in the dual-failure scenario described in the background. It requires that a remote copy service function exist, either synchronous or asynchronous. During an array rebuild, when a media error occurs that would normally cause data loss, the rebuilding process records the failing sector address and continues rebuilding. A separate process translates the sector address to a volume track, reads that track's data from the remote storage subsystem, extracts the sector data from the track, and notifies the rebuilding process that there is a write to the bad sector. The rebuilding process overwrites the bad sector with that data and rebuilds the associated sectors that depend on it. If all steps finish successfully, the rebuilding process removes the data loss mark and the data are restored.
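The flow summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sector size, the track geometry, and every name (`Rebuild`, `RemoteCopy`, `sector_to_track`, `repair`) are hypothetical and do not come from the disclosure, and the step of rebuilding the associated sectors in the same stripe is omitted for brevity.

```python
from dataclasses import dataclass, field

SECTOR_SIZE = 512          # assumed sector size in bytes
SECTORS_PER_TRACK = 128    # assumed track geometry (hypothetical)


@dataclass
class RemoteCopy:
    """Stand-in for the remote storage subsystem: track number -> track bytes."""
    tracks: dict

    def read_track(self, track: int) -> bytes:
        return self.tracks[track]


@dataclass
class Rebuild:
    """Stand-in for the rebuilding process and its data-loss marks."""
    sectors: dict                                  # sector address -> bytes, None = unreadable
    data_loss_marks: set = field(default_factory=set)

    def run(self) -> None:
        # On a media error, record the sector address and keep rebuilding.
        for lba, data in self.sectors.items():
            if data is None:
                self.data_loss_marks.add(lba)

    def write_sector(self, lba: int, data: bytes) -> None:
        # Overwrite the bad sector, then remove its data-loss mark.
        self.sectors[lba] = data
        self.data_loss_marks.discard(lba)


def sector_to_track(lba: int) -> tuple:
    """Translate a sector address to (track number, sector index in the track)."""
    return divmod(lba, SECTORS_PER_TRACK)


def repair(rebuild: Rebuild, remote: RemoteCopy) -> None:
    """Separate process: fetch each lost sector from the remote copy and write it back."""
    for lba in sorted(rebuild.data_loss_marks):
        track, index = sector_to_track(lba)
        track_data = remote.read_track(track)
        start = index * SECTOR_SIZE
        rebuild.write_sector(lba, track_data[start:start + SECTOR_SIZE])
```

In this sketch the rebuild never stops on a media error; it only leaves a mark, and the mark disappears once the remote copy has supplied the sector.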

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 48% of the total text.

Background Information

It is not unusual to see data loss during a RAID5 rebuild due to double failures. For example, in a 7+P RAID5 array, one disk is rejected from the array due to a drive failure and the array starts rebuilding. During the rebuild, another disk hits a KCQ=3/11/0 Media Error, which results in data loss.
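To see why a second failure is fatal, recall that RAID5 parity is the byte-wise XOR of the data blocks in a stripe, so one missing member can be recomputed from the others, but two cannot. The following minimal sketch illustrates this for a 7+P stripe; the function names are hypothetical, and `None` is used here simply to represent an unreadable block.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_block(stripe, missing):
    """Recompute stripe[missing] from the other members of the stripe.

    Raises IOError if any other member is unreadable (None), which is the
    double-failure case: parity alone cannot recover two lost blocks.
    """
    survivors = [b for i, b in enumerate(stripe) if i != missing]
    if any(b is None for b in survivors):
        raise IOError("double failure: stripe cannot be rebuilt from parity alone")
    return xor_blocks(survivors)


# 7 data blocks plus 1 parity block (7+P).
data = [bytes([i] * 4) for i in range(1, 8)]
stripe = data + [xor_blocks(data)]

lost = stripe[2]
stripe[2] = None                      # first failure: the rejected disk
recovered = rebuild_block(stripe, 2)  # succeeds while all other members are readable
```

If a second member of the same stripe becomes unreadable before the rebuild reaches it, `rebuild_block` raises instead of returning data, which corresponds to the data loss described above.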

This type of data loss happens on one or more logically-bad blocks.

When such a bad sector is accessed by a read, the RAID adapter usually fails the transaction by returning Medium Error in the Result word and reporting the failing LBA in the Status.

The record of the failure is erased the next time the bad sectors are written.
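This behaviour (reads fail, and a write clears the record) can be modelled as follows. This is a minimal sketch; `LogicalArray`, `MediumError`, and the in-memory bookkeeping are hypothetical stand-ins and not a real RAID adapter interface.

```python
class MediumError(Exception):
    """Raised on a read of a logically bad block; carries the failing LBA."""

    def __init__(self, lba):
        super().__init__(f"Medium Error at LBA {lba}")
        self.lba = lba


class LogicalArray:
    """Toy model of an array that remembers logically bad blocks."""

    def __init__(self):
        self.blocks = {}       # LBA -> data
        self.bad_lbas = set()  # record of blocks lost during rebuild

    def mark_bad(self, lba):
        self.bad_lbas.add(lba)

    def read(self, lba):
        # A read of a recorded bad block fails and reports the failing LBA.
        if lba in self.bad_lbas:
            raise MediumError(lba)
        return self.blocks.get(lba)

    def write(self, lba, data):
        # A write supplies fresh data, so the failure record is erased.
        self.blocks[lba] = data
        self.bad_lbas.discard(lba)
```

The invention relies on exactly this property: once correct data is written to the bad sector, the array no longer needs to treat it as lost.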

This usually causes host data loss, and the user has to restore the data from backup. Currently there are several ways to deal with this situation:

1). The user restores data from the backup. This requires considerable effort and human intervention and has some drawbacks, e.g.

a). The backup might not be the most recent, so the latest data cannot be recovered.

b). An incorrect restore procedure can make the impact much worse.

c). Restoring data usually takes a long time and has a large impact on the user's business resilience.


2). If the user has a disaster recovery solution in place, for example a synchronous remote mirroring system, the user can restore the data from the remote storage system. Usually one of two recovery procedures is involved:

a). When the data loss occurs, the synchronous remote mirroring pairs are suspended. The lost data can be rewritten from the remote storage system to the local system.

After the data are recovered, the copy service relationship can be restored to its original configuration and host IO can be resumed.

b). If the user has auto-recovery software, host IO can be switched automatically to the remote system and operation continues. At a later time the user can issue a failover/failback process to allow the remote storage system to rewrite the lost data to the primary storage system.

There are still drawbacks in each of the above recovery methods:

When using the first recovery procedure, the host application has already been impacted by the primary data loss and has to be stopped for the data restore.

When using the second, auto-recovery solution, hosts need extra links, either SAN or IP network, to the remote storage subsystem. Most of the time these links are not used, so the bandwidth is wasted.

Abbreviations


RAID: Redundant Array of Independent Disks

RAID is a storage technology that combines multiple disk drive components into a logical unit for the purposes of data redundancy and performance improvement. Data is distributed across the drives in one of several ways, referred to as RAID levels, depending on the specific level of redundancy and performance required.

KCQ: Key Code Qualifier is an error-code returned by a SCSI device...
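As an illustration of the notation, a KCQ string such as the 3/11/0 mentioned in the background can be split into its three fields: sense key, additional sense code (ASC), and additional sense code qualifier (ASCQ). The sketch below assumes the fields are written in hexadecimal and separated by slashes; `parse_kcq` and the one-entry lookup table are hypothetical helpers, and only the sense key used in this disclosure is listed.

```python
# Only the sense key that appears in this disclosure; not a complete table.
SENSE_KEYS = {0x3: "MEDIUM ERROR"}


def parse_kcq(text):
    """Split 'K/C/Q' (hexadecimal fields) into (sense key, ASC, ASCQ) integers."""
    key, asc, ascq = (int(part, 16) for part in text.split("/"))
    return key, asc, ascq


key, asc, ascq = parse_kcq("3/11/0")
description = SENSE_KEYS.get(key, "unknown sense key")
```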