Pre-failure identification of disk drives by comparative analysis of their operating characteristics
Publication Date: 2012-Aug-29
Through various vendor and model changes, our appliances encounter drives that operate well for a far shorter period than expected. These failures are often related to manufacturing deficiencies that manifest after some combination of age, activity, and environmental factors (e.g. operating temperature). The invention described here identifies drives that are headed toward failure but have not yet failed.

This disclosure describes a system that acquires operational drive data through vendor-implemented standard programming interfaces (e.g. SCSI standards) for a plurality of operating attributes (i.e. per-I/O-operation data counts, error correction algorithm invocation counts, on-drive controller activity logs, and on-drive resources consumed). Software processes on the (local or remote) monitoring server periodically monitor, record, and evaluate the collected data. The resulting database is analyzed to identify drives that exhibit uncommon patterns of activity over recent intervals and/or over the life of the drive, such as:

+ Ratio of "Errors Corrected with Possible Delay" to "Errors Corrected without delay"
+ Ratio of "Errors Corrected without Delay" to "Total Bytes"
+ On-device log activity: nature of activity, location (head, cylinder, track), real-time of event (mapped from device power-on-time)
+ Volume of I/O operations (applicable in a large set of drives with similar usage characteristics.)
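
As an illustration of the first two ratios, the following is a minimal sketch that assumes the monitoring server stores cumulative counters per drive at each polling time. The `CounterSnapshot` layout and all function names are hypothetical and are not part of the disclosure. Subtracting two snapshots gives the counts for a recent interval; the latest snapshot alone gives the lifetime values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CounterSnapshot:
    """Cumulative counters read from one drive at one polling time (hypothetical layout)."""
    corrected_with_delay: int
    corrected_without_delay: int
    total_bytes: int


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def interval_delta(older: CounterSnapshot, newer: CounterSnapshot) -> CounterSnapshot:
    """Counts accumulated between two snapshots of the same drive."""
    return CounterSnapshot(
        newer.corrected_with_delay - older.corrected_with_delay,
        newer.corrected_without_delay - older.corrected_without_delay,
        newer.total_bytes - older.total_bytes,
    )


def ratios(c: CounterSnapshot) -> dict:
    """The two error ratios listed above, for whatever span the counters cover."""
    return {
        "delayed_to_undelayed": safe_ratio(c.corrected_with_delay, c.corrected_without_delay),
        "undelayed_to_bytes": safe_ratio(c.corrected_without_delay, c.total_bytes),
    }


# Made-up counter values for two consecutive polls of one drive.
previous = CounterSnapshot(10, 1000, 10**9)
latest = CounterSnapshot(40, 1500, 2 * 10**9)

lifetime_ratios = ratios(latest)
recent_ratios = ratios(interval_delta(previous, latest))
```

Returning `None` for a zero denominator keeps a drive with no recorded activity from raising an error or being mistaken for a ratio of zero.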

The ratios calculated for the recent interval(s) and for the lifetime of the disk drive are compared to those of other units of the same drive model; log entries and warnings are generated for units that are statistical outliers. (Though the magnitude of the ratio change between normal and abnormal activity will vary by drive family, the models we have been working with show a normal ratio of 10**5 or greater, which decreases by two or more orders of magnitude when the drive is misbehaving.)
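
One possible way to express this comparison is sketched below. It assumes the ratio is oriented so that larger values are healthier (consistent with the example of a normal ratio of 10**5 or greater), and it uses the median of the same-model population as the baseline. The median baseline, the function name, and the sample values are assumptions made for illustration; the disclosure does not prescribe a particular statistic.

```python
import math
from statistics import median


def flag_outliers(ratio_by_drive: dict, min_orders: float = 2.0) -> list:
    """Return ids of drives whose ratio is at least `min_orders` orders of
    magnitude below the median ratio of the same-model population."""
    # Work in log10 space so "orders of magnitude" becomes a simple difference.
    logs = {d: math.log10(r) for d, r in ratio_by_drive.items() if r > 0}
    if len(logs) < 3:
        return []  # too few peers for a meaningful comparison
    baseline = median(logs.values())
    return sorted(d for d, v in logs.items() if baseline - v >= min_orders)


# Made-up ratios for five drives of one model.
fleet = {"sda": 2.0e5, "sdb": 1.5e5, "sdc": 3.0e5, "sdd": 9.0e2, "sde": 1.2e5}
suspects = flag_outliers(fleet)
```

Because the magnitude of the change varies by drive family, `min_orders` would need to be tuned per model rather than fixed at 2.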

The log page entries are examined for the distribution of errors in both the geographic and time domains. When the activity density in either the time or geographic (head, cylinder, track) domain indicates abnormal activity, these statistical outliers are also logged and warnings are generated.
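
A density check of this kind could be sketched as follows, assuming each log entry has already been parsed into a (head, cylinder, power-on hour) tuple. The tuple layout, the `dense_buckets` helper, and the thresholds are hypothetical. Mapping power-on hours to wall-clock time is omitted here.

```python
from collections import Counter
from statistics import median


def dense_buckets(keys, factor: float = 5.0, min_count: int = 3) -> list:
    """Return buckets whose event count is at least `factor` times the median
    count over all buckets that contain at least one event."""
    counts = Counter(keys)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(k for k, n in counts.items() if n >= min_count and n >= factor * typical)


# Made-up log entries: (head, cylinder, power_on_hour).
events = [(0, 120, 5000), (1, 88, 5100), (2, 301, 5200), (3, 17, 5300)]
events += [(2, 300 + i, 6000 + i % 2) for i in range(10)]  # burst on head 2

hot_heads = dense_buckets(head for head, _, _ in events)        # geographic domain
hot_days = dense_buckets(hour // 24 for _, _, hour in events)   # time domain, 24-hour buckets
```

The same helper serves both domains because only the bucketing key changes: a head number for the geographic view and a day index derived from power-on hours for the time view.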

The software system may then generate a pre-failure advisory for the identified drives, allowing the system operator to schedule replacement during a maintenance window (or triggering an automated recovery mechanism, i.e. 5,727,144 James Thomas Brady, prior...