Detection of Temporary Single Data Bit Memory Errors

IP.com Disclosure Number: IPCOM000106617D
Original Publication Date: 1993-Dec-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 4 page(s) / 98K

Publishing Venue

IBM

Related People

Chen, CL: AUTHOR [+5]

Abstract

Disclosed is a means to detect and identify temporary (soft) single bit errors in the ECC data word of a computer's central storage. Also disclosed is a means to have this information add special significance to the sparing criteria for the faulty chip.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      A memory chip can exhibit failures caused by weakened control
within the chip.  This condition can result in lost data in multiple
locations within the chip.  Because the failures are due to weakened
hardware, they are intermittent and recur over time.  They appear as
soft errors to the computer system using the central storage.  The
persistence of intermittent errors from one chip greatly increases
the probability of alignment with a soft error from another chip to
form a two-bit soft/soft error in the ECC data word.  Since an error
of this kind is uncorrectable by either system ECC or double
complementing algorithms, a system-detected failure would occur.

      Therefore, to avoid a system-detected failure, it is important
to identify, ahead of time, the chip with multiple intermittent
errors and to activate chip sparing based on a threshold of soft
errors that is lower than the threshold set for permanent (hard)
errors, so as to reflect the severity of the situation.

      Soft error identification requires an algorithm that involves
four fetch and store operations in sequence, denoted (F1S1F2S2).  The
ECC status from the first and second fetches (F1 and F2) of the
conventional process, which corrects soft single-bit data errors and
records hard data errors, is used together with the hard error
compare result (Comp) to detect the soft error.  Then the single-bit
ECC error syndrome from either the first or second fetch is used to
identify the bit position of the soft error within the ECC word.  The
table shows the algorithm, in which three cases pertain to soft
errors.
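
      The way a single-bit syndrome identifies a bit position can be
illustrated with a small sketch.  The code below uses a textbook
Hamming(7,4) code purely as a stand-in; the disclosure does not name
the ECC code used by the central storage, and the function names
(encode, syndrome) are hypothetical.  In this toy code the syndrome
of a word with one flipped bit equals the 1-based position of that
bit, and a syndrome of zero means no error.

```python
def encode(data_bits):
    """Encode 4 data bits into a Hamming(7,4) codeword (positions 1..7).

    Illustrative stand-in only; not the ECC code of the disclosure.
    """
    word = [0] * 8  # index 0 unused so that list index == bit position
    for pos, bit in zip((3, 5, 6, 7), data_bits):
        word[pos] = bit
    # Each parity bit p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        parity = 0
        for i in range(1, 8):
            if i != p and (i & p):
                parity ^= word[i]
        word[p] = parity
    return word[1:]


def syndrome(codeword):
    """XOR of the 1-based positions of all set bits.

    Returns 0 for a valid codeword, or the position of a single
    flipped bit.
    """
    s = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            s ^= pos
    return s


word = encode([1, 0, 1, 1])
clean = syndrome(word)      # valid codeword
word[5 - 1] ^= 1            # simulate a single-bit error at position 5
flipped = syndrome(word)    # syndrome now points at the flipped bit
```

A production SEC-DED code over a wider ECC word would use a different
check matrix, but the principle of mapping a nonzero single-bit
syndrome to one bit position is the same.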

      To eliminate the need for additional counters dedicated to soft
errors, the counters used to record the number of hard errors can be
used for both purposes.  Once the soft error bit is identified, the
hard error counter for that bit position will be incremented by a
weighted amount.  Therefore, the same threshold value used for hard
errors can be used for soft errors.  Thus, a small quantity of soft
errors on the same chip could be magnified by, say, 16 times, to make
them exceed the sparing threshold and to invoke the automatic chip
sparing process.
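
      A minimal sketch of the shared-counter scheme follows, assuming
one counter per bit position as the text describes.  The soft weight
of 16 is the example value given above; the class name, the threshold
of 64, and the 72-bit word width are hypothetical values chosen only
for illustration.

```python
HARD_WEIGHT = 1          # a hard error adds one count
SOFT_WEIGHT = 16         # example weight from the text ("say, 16 times")
SPARING_THRESHOLD = 64   # hypothetical threshold, shared by both error types


class ErrorCounters:
    """One counter per ECC bit position, shared by hard and soft errors."""

    def __init__(self, n_bits, threshold=SPARING_THRESHOLD):
        self.counts = [0] * n_bits
        self.threshold = threshold

    def record(self, bit_position, soft):
        """Add a weighted count; return True once the threshold is exceeded."""
        self.counts[bit_position] += SOFT_WEIGHT if soft else HARD_WEIGHT
        return self.counts[bit_position] > self.threshold


counters = ErrorCounters(n_bits=72)  # 72-bit ECC word is an assumption
spare_needed = [counters.record(3, soft=True) for _ in range(5)]
```

With these assumed values, five soft errors at one bit position push
its counter past the threshold, whereas 65 hard errors would be needed
to do the same, so no separate soft error counter or threshold is
required.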

 F1   F2  COMP ERROR TYPES
---- ---- ---- ------------------
 NE   NE    0  no error
 NE   CE    0  xxx
 NE   UE    0  xxx
 CE   NE    0  xxx
 CE   CE    0  1 soft                        <--- case 1 (note)
 CE   UE    0  xxx
 UE   NE    0  x...