Smart selection of array rebuild type in storage system

IP.com Disclosure Number: IPCOM000246193D
Publication Date: 2016-May-16
Document File: 6 page(s) / 169K

Publishing Venue

The IP.com Prior Art Database

Abstract

This article introduces a systematic method for making a smart decision about the array rebuild type when a disk fails. It significantly decreases the impact of a rebuild on I/O performance and decreases the likelihood of impacting availability or reliability when a disk has failed. It incorporates machine learning and big data processing to enhance the service level of critical business, whether deployed in the cloud or on premises.

This is the abbreviated version, containing approximately 38% of the total text.


1. Background: What is the problem solved by your invention?

In RAID arrays used in systems, especially storage systems, two array rebuild methods are commonly used when a disk is failing. The first method is disk rejection followed by a full array rebuild, as depicted in Figure 1. This method is referred to as full rebuild in this article. In this method, the failing disk is completely removed, and its data is reconstructed from the other members of the array and written to a new spare. In the second method, the failing disk's data is read out and copied to a new spare, as depicted in Figure 2. This method is referred to as smart rebuild in this article. In this method, the failing disk continues to serve I/O, both host I/O and the read I/O issued by the smart rebuild.

Figure 1 Array full rebuild

Figure 2 Array smart rebuild
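
The difference between the two methods can be illustrated with a minimal sketch. It assumes a single-parity (RAID 5 style) stripe in which the parity block is the XOR of the data blocks; the article does not state the RAID level, and the function names full_rebuild and smart_rebuild are hypothetical, chosen only for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def full_rebuild(surviving_blocks):
    """Reconstruct the lost block from all surviving members (data and parity).

    Assumption: single-parity layout, so the lost block is the XOR of the rest.
    """
    return xor_blocks(surviving_blocks)

def smart_rebuild(failing_disk_block):
    """Copy the block directly from the failing disk, which is still readable."""
    return bytes(failing_disk_block)

# One stripe: three data blocks plus one parity block.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\xf0"
parity = xor_blocks([d0, d1, d2])

# Suppose the disk holding d1 is the one that fails.
rebuilt_full = full_rebuild([d0, d2, parity])   # reads every other member
rebuilt_smart = smart_rebuild(d1)               # reads only the failing disk
```

Both paths yield the same data for the spare. The full rebuild has to read every surviving member, whereas the smart rebuild reads only the failing disk, which is why it depends on that disk still being able to service I/O.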

A disk can fail in many ways. Once it fails, it might be totally broken so that it can no longer service any commands. However, the disk might also be only tentatively bad, likely to fail at some future time but currently still sustaining commands well with the help of error recovery mechanisms. For example, if a customer enables an aggressive mode, such as marking a disk as failing once a single command hangs for 5 seconds, the disk will be considered "bad" even though it might still sustain I/O for a long time.

In the broken-disk case described above, a full rebuild should be triggered when the disk fails. In cases of less severe disk errors, where the disk can still sustain I/O, a smart rebuild can be triggered instead. The full rebuild should be triggered by whichever layer first detects the fatal error. The smart rebuild is usually triggered by the host, which keeps records of various error types at various logical or physical boundaries and thus has a good understanding of the severity of the disk's failure symptoms. Figure 3 depicts the modules responsible for selecting the rebuild method.

Figure 3 Rebuild method selection
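
The selection guidance above can be sketched as a small rule. This is only an illustration of the baseline logic described in the text, not the system's actual rule set; the DiskStatus fields and the select_rebuild function are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RebuildType(Enum):
    FULL = "full"
    SMART = "smart"

@dataclass
class DiskStatus:
    # Hypothetical fields, introduced only for this sketch.
    responds_to_commands: bool  # False: the disk cannot service any command
    marked_failing: bool        # True: a failure policy has flagged the disk

def select_rebuild(status: DiskStatus) -> Optional[RebuildType]:
    """Return the rebuild type for a disk, or None if no rebuild is needed."""
    if not status.responds_to_commands:
        # Broken disk: its data can only be reconstructed from other members.
        return RebuildType.FULL
    if status.marked_failing:
        # Less severe errors: the disk can still be read and copied to a spare.
        return RebuildType.SMART
    return None
```

For example, select_rebuild(DiskStatus(responds_to_commands=True, marked_failing=True)) returns RebuildType.SMART, which corresponds to the aggressive-mode case where a flagged disk can still sustain I/O.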

Currently, the system takes a conservative attitude towards smart rebuild and triggers it only under a set of very strict rules. With the advent of ever larger capacity drives, a full rebuild not only exposes the system to potentially high-impact issues such as loss of access or even data loss, which are intolerable for critical business, but also inevitably takes proportionally longer to complete, and I/O performance is degraded throughout that period. For example, when the disk capacity is 6 TB, 8 TB or more, the full rebuild time increases to days or even weeks, during which customer I/O response time increases by 20% or more.
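
The proportional relationship between capacity and rebuild time can be made concrete with a rough calculation. The effective rebuild rates below are purely hypothetical values chosen for illustration; the article does not state any throughput figures, and real rates depend on host load and array configuration.

```python
def rebuild_hours(capacity_tb, rate_mb_per_s):
    """Estimate rebuild duration in hours for a disk of the given capacity.

    capacity_tb uses decimal terabytes (10**12 bytes); rate_mb_per_s is an
    assumed sustained rebuild rate in decimal megabytes per second.
    """
    capacity_bytes = capacity_tb * 10**12
    seconds = capacity_bytes / (rate_mb_per_s * 10**6)
    return seconds / 3600

# Hypothetical rates: rebuild I/O is throttled while host I/O is being served.
for capacity_tb in (6, 8):
    for rate in (50, 10):
        hours = rebuild_hours(capacity_tb, rate)
        print(f"{capacity_tb} TB at {rate} MB/s: {hours:.1f} h ({hours / 24:.1f} days)")
```

Under these assumed rates the estimate scales linearly with capacity, so every increase in drive size lengthens the window during which the array runs degraded.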

This invention provides a systematic method for selecting the array rebuild method smartly. The method uses machine learning, specifically logistic regression, to predict the severity of a disk failure error and selects the array rebuild type based on the predicted result.
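
As a minimal sketch of how a logistic regression prediction could drive the selection, the following example scores a disk from error counts and picks a rebuild type. The feature names, weights, bias and threshold are all hypothetical placeholders, not values from this disclosure; in practice the weights would be learned from historical disk failure data.

```python
import math

# Hypothetical model parameters, for illustration only.
WEIGHTS = {"media_errors": 0.8, "command_timeouts": 1.2, "recovered_errors": 0.1}
BIAS = -4.0
THRESHOLD = 0.5  # assumed cut-off between "severe" and "less severe"

def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def severe_failure_probability(features):
    """Predicted probability that the disk failure is severe (fatal)."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return sigmoid(z)

def select_rebuild_type(features):
    """Choose full rebuild for a predicted severe failure, else smart rebuild."""
    if severe_failure_probability(features) >= THRESHOLD:
        return "full"
    return "smart"

healthy_enough = {"recovered_errors": 3}
badly_failing = {"media_errors": 5, "command_timeouts": 3}
choice_a = select_rebuild_type(healthy_enough)
choice_b = select_rebuild_type(badly_failing)
```

With these placeholder weights, the disk showing only a few recovered errors is routed to smart rebuild, while the disk with many media errors and timeouts is routed to full rebuild. Moving THRESHOLD changes how conservative the selection is.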

Logistic regression is chosen as the method here because...