Original Publication Date: 2002-Sep-01
Introduction: In RAID controllers the goal is to minimize the customer's exposure to data loss due to the failure of any component. To protect against disk failures, parity striping (RAID 5) and mirroring (RAID 1) are performed on the user's data. To protect against controller failures, i.e. firmware errors or hardware problems, redundant controllers are used. These actions protect the user's data when a single component fails. The problem is that while a system is in a degraded mode, a second component failure can cause loss of user data. This disclosure provides a method to help limit the customer's exposure to data loss due to multiple disk failures.

Problem Description: In today’s RAID controllers, when a disk fails, a rebuild is started to recreate the data that was written to the failed disk. Until the rebuild completes, the RAID group is not redundant, and a second disk failure can cause data loss. When the rebuild completes, the RAID group is fully redundant again and can once more withstand a single disk failure. The goal of this disclosure is to limit the time between when the failure takes place and when the data on the spare drive is fully regenerated. To accomplish this, the firmware takes advantage of information received from the drive, for example the number of sectors that have needed to be reassigned, or the drive's predicted-fault information. With this information the firmware can determine that a drive is likely to suffer a hard failure in the near future. The goal is to have the data on that drive copied onto a spare drive before the failure actually takes place. In that case, if the drive does fail, the controller does not need to perform a rebuild and there is no exposure to a second drive failure.

Problem Solution: When a drive exceeds a predefined threshold for the number of failures or problems it has reported, the controller marks the drive as a background copy candidate. This means that a fatal error is expected to occur on the drive at some point in the near future.
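The threshold check described above can be sketched as follows. This is a minimal illustration in Python, not the controller firmware: the `DriveHealth` record, its field names, and the threshold value of 50 reassigned sectors are assumptions, since the disclosure names the kinds of information used but gives no specific values.

```python
from dataclasses import dataclass

# Hypothetical threshold; the disclosure does not specify a value.
REASSIGNED_SECTOR_THRESHOLD = 50


@dataclass
class DriveHealth:
    """Health information reported by a drive (field names are illustrative)."""
    drive_id: str
    reassigned_sectors: int   # sectors the drive has needed to reassign
    predicted_fault: bool     # the drive's own predicted-fault indication


def is_background_copy_candidate(
    health: DriveHealth,
    threshold: int = REASSIGNED_SECTOR_THRESHOLD,
) -> bool:
    """Return True if the drive should be marked as a background copy candidate.

    A drive qualifies if it reports a predicted fault or if its reassigned
    sector count exceeds the threshold.
    """
    return health.predicted_fault or health.reassigned_sectors > threshold


def select_candidates(drives: list[DriveHealth]) -> list[str]:
    """Return the IDs of all drives that exceed the threshold."""
    return [d.drive_id for d in drives if is_background_copy_candidate(d)]
```

A controller would presumably run such a check whenever it polls drive status; how often to poll and which indicators to weigh are design choices the disclosure leaves open.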
A hot spare that is available in the system is used as a write mirror for the failing drive: every disk write directed to the failing drive is also sent to the hot spare. In addition, as a background task, all of the data on the failing drive is copied onto the hot spare. Once all the data has been copied, the hot spare replaces the failing drive in the RAID group. At this point the RAID group should consist only of drives that are not exhibiting any signs of imminent failure. If the failing drive suffers a true hard failure before the copy completes and must be removed from the RAID group, the rebuild will in effect already have been started, which limits the exposure time to a second drive failure. The hot spare replaces the failed drive, and the rebuild continues from the point at which the copy operation was halted by the drive failure.
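The mirrored writes, background copy, and rebuild resume point can be sketched as follows. This is a minimal simulation in Python under stated assumptions: drives are modeled as in-memory lists of blocks, the copy proceeds sequentially from block 0, and the class and method names (`ProactiveCopy`, `copy_step`, `blocks_to_rebuild`) are hypothetical. Regenerating blocks from parity or from a mirror is outside the scope of the sketch.

```python
class ProactiveCopy:
    """Tracks a background copy from a failing drive to a hot spare."""

    def __init__(self, failing: list, spare: list):
        assert len(failing) == len(spare)
        self.failing = failing
        self.spare = spare
        # Watermark: every block below this index is already on the spare.
        self.next_block = 0

    def write(self, block: int, data) -> None:
        """Host write: goes to the failing drive and is mirrored to the spare."""
        self.failing[block] = data
        self.spare[block] = data

    def copy_step(self, count: int = 1) -> bool:
        """Copy up to `count` blocks in the background; return True when done."""
        end = min(self.next_block + count, len(self.failing))
        for b in range(self.next_block, end):
            self.spare[b] = self.failing[b]
        self.next_block = end
        return self.done

    @property
    def done(self) -> bool:
        """True once the spare holds a full copy and can replace the drive."""
        return self.next_block >= len(self.failing)

    def blocks_to_rebuild(self) -> range:
        """If the failing drive dies now, only these blocks still need a rebuild."""
        return range(self.next_block, len(self.failing))
```

In this sketch, a hard failure partway through the copy leaves only the blocks at or above the watermark to be regenerated, which corresponds to the rebuild continuing from the point where the copy was halted. Rebuilding a block that a mirrored write has already placed on the spare is redundant but harmless, since it regenerates the same data.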