Predictive Maintenance for Prevention of Uncorrectable Multiple BIT Errors in MEMORY

IP.com Disclosure Number: IPCOM000035958D
Original Publication Date: 1989-Aug-01
Included in the Prior Art Database: 2005-Jan-28
Document File: 5 page(s) / 57K

Publishing Venue

IBM

Related People

Ahrens, GH: AUTHOR [+4]

Abstract

The methodology described in this paper addresses two areas of impact in processor main storage. The first area of concern is to significantly reduce the probability of compromising the integrity of customer data in memory by preventing the occurrence of odd multiple bit errors (MBEs) which have a low probability of detection. The second area of concern is the reduction of unscheduled interruptions to the customer arising from these occurrences, leading to reduced processor availability.

This is the abbreviated version, containing approximately 26% of the total text.

This invention uses both hardware and microcode. The objective is to detect the accumulation of array errors and to avoid a data integrity exposure by initiating a maintenance action (deferred, or immediate if required) that either removes the failure entirely or takes the most significant contributor to the failure out of use.

Each extended error correction code (XECC) operation requires extra time to correct the error (when correction is possible), which could degrade performance if the XECC operation is performed repeatedly. Also, if array errors are allowed to accumulate until an uncorrectable error condition arises, the system is required to perform maintenance operations at that point. If predictive analysis is used to project potential uncorrectable error conditions, maintenance can be deferred to a time that is more convenient for the customer. This could result in higher system availability, because maintenance is performed when the system is not in use rather than taking the system away from the customer when an error occurs.

The data integrity problem (passing bad data to the customer as though it were good) arises from the error correction code (ECC) used to detect and correct data errors in the main storage arrays. The Hamming code matrix used as a 39/32 Single Error Correct/Double Error Detect (SEC/DED) code detects 100% of single and double bit errors. However, it detects 40% or less of odd multiple bit errors (those other than single bit errors, SBEs). On a triple (or quintuple, etc.) bit error, there is therefore at most a 40% chance of detecting the error. The remaining 60% or more of the time, the data will be used as though it contained no errors.
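The weakness of SEC/DED against triple bit errors can be illustrated with a small sketch. It assumes a textbook extended Hamming layout (bit 0 as overall parity, bits 1 to 38 addressed by their position number); the actual code matrix used in the hardware is not given in this text, so the detection rate this sketch measures need not match the figure quoted above. Because the code is linear, only the error pattern matters, not the stored data.

```python
from itertools import combinations

# Assumed layout (not from the disclosure): 39-bit word, bit 0 is the
# overall parity bit, bits 1..38 are Hamming positions (check bits at
# powers of two, 32 data bits at the remaining positions).
N_BITS = 39
MAX_POS = N_BITS - 1

def decode(error_positions):
    """Classify how a SEC/DED decoder reacts to a set of flipped bits."""
    syndrome = 0
    for pos in error_positions:
        syndrome ^= pos              # bit 0 contributes nothing to the syndrome
    odd_parity = len(error_positions) % 2 == 1
    if not odd_parity:
        if syndrome != 0:
            return "detected"        # even error count, non-zero syndrome
        return "clean" if not error_positions else "undetected"
    # Odd overall parity: the decoder assumes a single-bit error at `syndrome`.
    if syndrome > MAX_POS:
        return "detected"            # syndrome points at a non-existent bit
    residual = set(error_positions) ^ {syndrome}
    return "corrected" if not residual else "undetected"

def tally(n_errors):
    """Count decoder outcomes over every pattern of n_errors flipped bits."""
    counts = {}
    for pattern in combinations(range(N_BITS), n_errors):
        outcome = decode(pattern)
        counts[outcome] = counts.get(outcome, 0) + 1
    return counts

if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, tally(n))
```

Under this assumed layout every single bit error is corrected and every double bit error is detected, while most triple bit errors look like a correctable single bit error to the decoder and are "corrected" into bad data.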

The purpose of the invention is to detect massive failures on the array cards, each of which affects only a single bit position within the ECC word but could line up with other errors to create a triple bit error, and to report them to the support processor so that corrective action can be taken before that error condition occurs.

Depending on the failures involved, the correction could be performed by invoking the Bit Steering algorithm or might require replacing an array card.
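The idea of spotting failures that could line up, and choosing between deferred and immediate maintenance, might be sketched as follows. Everything here is hypothetical: the `MassiveFailure` record, the address-range model, and the escalation policy in `assess` are illustrative assumptions, not the hardware or microcode of the invention.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MassiveFailure:
    """Hypothetical record of a failure confined to one ECC bit position."""
    bit_position: int   # bit position within the ECC word
    start: int          # first affected address (inclusive)
    end: int            # last affected address (inclusive)

def overlaps(a, b):
    """True if the two failures share at least one address."""
    return a.start <= b.end and b.start <= a.end

def assess(failures):
    """Assumed policy: return 'none', 'deferred', or 'immediate'.

    Two failures in different bit positions that share addresses already
    produce double bit errors there, so one more failing bit would give a
    triple bit error; this sketch treats that case as 'immediate'.
    """
    for i, a in enumerate(failures):
        for b in failures[i + 1:]:
            if a.bit_position != b.bit_position and overlaps(a, b):
                return "immediate"
    return "deferred" if failures else "none"

if __name__ == "__main__":
    log = [MassiveFailure(5, 0x0000, 0x0FFF), MassiveFailure(17, 0x0800, 0x17FF)]
    print(assess(log))
```

A real implementation would also have to weigh isolated cell failures and decide whether Bit Steering can remove a contributor before card replacement is needed; the text available here does not give those rules.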

The hardware portion of the invention resides on the memory card support modules....