Method for failure prediction for computer systems

IP.com Disclosure Number: IPCOM000022122D
Publication Date: 2004-Feb-25
Document File: 6 page(s) / 100K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method for failure prediction for computer systems. Benefits include improved availability.

This is the abbreviated version, containing approximately 29% of the total text.

Background

Most crashes experienced by computing systems are caused by intermittent and transient faults. Early detection of failure-prone circuits or subsystems significantly improves the availability of computing systems. Isolating a failing component before a crash occurs enables scheduling of preventive maintenance, seamless activation of a spare, or graceful degradation (if spares are not available).

Conventional failure prediction mechanisms rely on counting the errors that occur within a component or a subsystem under normal operating conditions, such as nominal voltage and temperature. A failure is considered imminent when the number of errors reaches a predetermined threshold over a given period of time. The component experiencing errors is then isolated and further action is taken. For example, a spare is activated, followed by replacement of the failing part.
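The conventional threshold rule described above can be illustrated with a short sketch. The class name, the sliding-window interpretation of "a given period of time", and the numeric values are assumptions made for illustration; the disclosure does not specify them.

```python
from collections import deque


class ErrorWindowCounter:
    """Hypothetical helper: counts errors inside a sliding time window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold            # errors needed to predict failure
        self.window_seconds = window_seconds  # length of the observation period
        self._timestamps = deque()            # timestamps of recent errors

    def record_error(self, timestamp):
        """Record one error; return True if a failure is considered imminent."""
        self._timestamps.append(timestamp)
        # Discard errors that are older than the observation window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


# Assumed example: three errors within 60 seconds trigger a prediction.
counter = ErrorWindowCounter(threshold=3, window_seconds=60.0)
flags = [counter.record_error(t) for t in (0.0, 10.0, 100.0, 110.0, 120.0)]
```

In this example only the last error raises the flag, because the first two errors have aged out of the window by the time the later ones arrive.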

General description

The disclosed method predicts failures in computer systems. It uses test programs designed to stress the processor more than a regular application does. A test program may also detect and report errors, such as wrong computational results. Errors may also be reported by the processor's own error detection mechanisms, such as parity, error correcting codes (ECC), and cyclic redundancy codes (CRC).
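As an illustration of a test program that detects wrong computational results, the following minimal sketch runs a deterministic workload and compares its checksum against a reference captured on known-good hardware. The function names, the choice of workload, and the use of a CRC-32 checksum are assumptions; the disclosure does not describe the test programs in this detail, and a real stress test would be native code targeting specific processor units.

```python
import zlib


def workload_crc(iterations):
    """Run a deterministic integer workload and return a CRC-32 of its results."""
    state = 1
    crc = 0
    for _ in range(iterations):
        # Linear congruential step: fully deterministic, so any deviation
        # in the final checksum indicates a computational error.
        state = (state * 1103515245 + 12345) % 2**31
        crc = zlib.crc32(state.to_bytes(4, "little"), crc)
    return crc


def run_self_checking_test(iterations, reference_crc):
    """Return True if the workload result differs from the reference (error detected)."""
    return workload_crc(iterations) != reference_crc


# Assumption: the reference is captured once on hardware known to be good.
reference = workload_crc(1000)
error_detected = run_self_checking_test(1000, reference)
```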

The main steps for predicting failures of a processor include the following:

1. Complete all tasks running on the processor or migrate them, under the control of the operating system, to another processor.

2. Execute a sequence of test programs at nominal voltage and temperature and/or at higher voltages and temperatures and/or at lower voltages and temperatures.

3. Log all detected errors.

4. Predict failure, based on the number of errors detected over a period of time.

5. Return the processor to normal operation under the control of the operating system, if no failure is predicted.

6. Schedule preventive maintenance, activation of a spare, or graceful degradation if a failure of the processor is predicted.
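The six steps above can be sketched as a single control routine. This is a minimal illustration, not the disclosed implementation: every callable passed in (draining tasks, setting voltage and temperature, returning to service, scheduling an action) is a hypothetical hook standing in for operating system and platform facilities, and treating one test session as the "period of time" in step 4 is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# (voltage scale, temperature in Celsius); illustrative values only.
OperatingPoint = Tuple[float, float]


@dataclass
class PredictionResult:
    errors_logged: List[Tuple[OperatingPoint, str]] = field(default_factory=list)
    failure_predicted: bool = False


def predict_processor_failure(
    drain_tasks: Callable[[], None],
    set_operating_point: Callable[[OperatingPoint], None],
    tests: Sequence[Tuple[str, Callable[[], bool]]],
    operating_points: Sequence[OperatingPoint],
    error_threshold: int,
    return_to_service: Callable[[], None],
    schedule_action: Callable[[], None],
) -> PredictionResult:
    result = PredictionResult()
    drain_tasks()  # Step 1: complete or migrate running tasks.
    for point in operating_points:  # Step 2: nominal, higher, and/or lower points.
        set_operating_point(point)
        for name, test in tests:
            if test():  # A test returns True when it detects an error.
                result.errors_logged.append((point, name))  # Step 3: log it.
    # Step 4: predict failure from the error count of this session (assumed rule).
    result.failure_predicted = len(result.errors_logged) >= error_threshold
    if result.failure_predicted:
        schedule_action()  # Step 6: maintenance, spare, or graceful degradation.
    else:
        # Step 5: the hook is assumed to restore nominal conditions as well.
        return_to_service()
    return result
```

A caller would supply the hooks for its platform and a list of operating points such as nominal, elevated, and reduced voltage and temperature; the returned log shows at which operating point each error appeared.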

Advantages

Some implementations of the disclosed structure and method provide one or more of the following advantages:

• Improved functionality through a better failure prediction technique, based on the observation that intermittent faults tend to be activated at voltages and temperatures higher or lower than nominal operating values

• Improved system availability due to preventing unexpected system crashes

• Improved system availability due to enabling the scheduling of preventive maintenance, seamless activation of a spare, or graceful degradation (if spares are not available)

Detailed description

The disclosed method is a failure prediction technique based on the observation that intermittent faults tend to be act...