System and method to improve performance with active data protection for distributed array

IP.com Disclosure Number: IPCOM000245609D
Publication Date: 2016-Mar-22

Publishing Venue

The IP.com Prior Art Database

Abstract

With this method, the rebuild process can be avoided entirely or the rebuild time reduced. The method identifies drives in the array that are on the verge of failure before they actually go offline, then migrates data from the failing drive to a new drive while handling IO to ensure data integrity.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 49% of the total text.

   In a conventional disk array, reliability is ensured by data backup and redundancy technologies. Once a disk fails, the system can regain the data through recovery or rebuilding. However, these methods take a long time to restore data.

   When the raid is initially created, all of its drives are identical. After a period of use, however, the drives usually have different remaining lifetimes. When a drive's efficiency drops, especially when the drive is close to breaking down, the efficiency of the whole raid is badly impaired. If one of the disks fails, the raid needs to be rebuilt, a process that always requires a large amount of I/O. Worse, another storage device may fail during the recovery window, which can cause even more serious data loss.

   To address these two drawbacks of the traditional approaches, low efficiency and the risk of data loss, we propose a new way to deal with disks in the raid that are on the verge of failure, before the real failure occurs.

   If the raid can predict a failure by monitoring disk efficiency and proactively migrate data before the failure happens, then data rebuilding and recovery can be mostly avoided, which improves the efficiency of the raid remarkably.

   With a bad-disk learning algorithm, the efficiency of each disk in the raid can be monitored. Traditionally, when the detected efficiency drops below 2/3, an acknowledged value at which a bad disk should be replaced, a new disk is plugged in. Then there is a long wait while the array rebuilds.
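The traditional replacement check described above can be sketched in a few lines of Python. This is a minimal illustration, not the bad-disk learning algorithm itself (which the disclosure does not specify); the disk names and efficiency values are hypothetical.

```python
# Illustrative sketch: flag disks whose measured efficiency has fallen
# below the conventional 2/3 replacement baseline. Disk IDs and the
# efficiency figures here are made up for demonstration.

REPLACEMENT_BASELINE = 2.0 / 3.0

def disks_needing_replacement(efficiencies):
    """Return the IDs of disks whose efficiency is below the 2/3 baseline."""
    return [disk for disk, eff in efficiencies.items()
            if eff < REPLACEMENT_BASELINE]

efficiencies = {"A": 1.1, "B": 1.0, "C": 0.9, "D": 0.6}
print(disks_needing_replacement(efficiencies))  # → ['D']
```

In the traditional scheme, a disk appearing in this list would simply be swapped out, after which the array rebuilds; the method below acts earlier to avoid that rebuild.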

To avoid the rebuilding process after a disk is replaced, the data on the aging disk can be migrated early, before the disk breaks down completely.

   A threshold value for disk efficiency is defined by the users themselves. Migration starts when the detected efficiency falls below this threshold, and migrated data is marked with a label. When migration is completed, a new disk can be plugged in, and the labeled data on the other disks can then be migrated to the newly inserted disk. Naturally, the threshold value must be higher than 2/3. In addition, our invention handles the lowest-efficiency disk on the basis of a declustered array, which distributes the hot-spare space across the disks of the array.
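The threshold-triggered early migration with labeled data could look roughly like the following Python sketch. The class and field names (`Extent`, `migrated`, `Migrator`) are assumptions for illustration; the actual copy to distributed spare space is elided.

```python
from dataclasses import dataclass

@dataclass
class Extent:
    """A unit of data on a disk (hypothetical representation)."""
    disk: str
    lba: int
    migrated: bool = False  # the "label" marking data already copied to spare

class Migrator:
    """Sketch of threshold-triggered early migration (assumed API).

    The user-defined threshold must stay above the 2/3 replacement
    baseline so migration starts while the disk is still usable.
    """
    def __init__(self, threshold):
        if threshold <= 2.0 / 3.0:
            raise ValueError("threshold must exceed the 2/3 baseline")
        self.threshold = threshold

    def should_migrate(self, efficiency):
        # migration starts once detected efficiency falls below the threshold
        return efficiency < self.threshold

    def migrate(self, extents):
        # copy each extent to the distributed spare space (elided),
        # then label it as migrated
        for ext in extents:
            ext.migrated = True
        return extents
```

Once all extents of the failing disk carry the label, a replacement disk can be inserted and the labeled data moved onto it without a full rebuild.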

   Take six disks in a distributed raid as an example, with efficiencies of 1.1, 1.0, 0.9, 0.95, 1.05, and 0.75 for disks A, B, C, D, E, and F respectively. The bad-disk learning algorithm determines that disk F has the lowest efficiency, 0.75, so F can be predicted to be the first to break down. If the threshold value is set to 0.7, migration begins once the detected value falls below that threshold: F's data is migrated to the standby disks in order, read IO is redirected to the standby disk whenever the related data has already been migrated, and write IO is duplicated to both disk F and the standby disk until migration completes. Finally, disk F is replaced with the standby disk after integrity verification is completed for...
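The IO handling during migration described above can be sketched as a small router. This is a simplified illustration under assumed names ("F" for the failing disk, a single "spare" target); a real declustered array would spread the spare space across several disks.

```python
class MigrationIORouter:
    """Sketch of IO routing while a failing disk migrates to spare space.

    Reads of already-migrated blocks are redirected to the spare;
    writes are duplicated to both the failing disk and the spare until
    migration completes, so data integrity is preserved either way.
    """
    def __init__(self, failing="F", spare="spare"):
        self.failing = failing
        self.spare = spare
        self.migrated_blocks = set()  # blocks already copied to the spare

    def migrate_block(self, block):
        # record that this block's data now also lives on the spare
        self.migrated_blocks.add(block)

    def route_read(self, block):
        # redirect the read to the spare once the block has been copied
        return self.spare if block in self.migrated_blocks else self.failing

    def route_write(self, block):
        # duplicate the write to both disks until migration completes
        return [self.failing, self.spare]
```

For example, before block 1 is copied, `route_read(1)` targets disk F; after `migrate_block(1)`, the same read goes to the spare, while writes always hit both disks during the migration window.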