Concurrent On Line Array Chip Sparing From Scrub Assisted Fault Data

IP.com Disclosure Number: IPCOM000122356D
Original Publication Date: 1991-Dec-01
Included in the Prior Art Database: 2005-Apr-04
Document File: 4 page(s) / 157K

Publishing Venue

IBM

Related People

Fasano, LT: AUTHOR [+4]

Abstract

On-Line Array Chip Sparing consists of a technique for identifying faulty storage chips and substituting good chips during normal system operation. The sparing algorithm operates as a background task which identifies both stuck faults and intermittent faults in a storage array. Spare chips are initialized and inserted into the array without disrupting system operation.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 43% of the total text.

      On-Line Array Chip Sparing consists of a technique for
identifying faulty storage chips and substituting good chips during
normal system operation.  The sparing algorithm operates as a
background task which identifies both stuck faults and intermittent
faults in a storage array. Spare chips are initialized and inserted
into the array without disrupting system operation.

      Two types of storage array problems need to be addressed:
stuck faults and intermittent, or "soft", failures. A large number of
intermittent errors from one chip may also indicate that the chip is
bad.  For example, a solid address decoder failure would be detected
as an intermittent error.

      In earlier systems, storage had to be taken off-line for
analysis.  Test patterns were applied to the array to determine the
quality of each array chip.  Off-line tests effectively identify
stuck faults.  Some methods use an ECC (Error Correcting Code) to
detect and substitute for faulty chips that contribute to a UE
(uncorrectable error) during system operation.  These methods also
work well on stuck faults, but take effect only after a system
problem has occurred. Further, the substitution of a new chip
introduces a large number of "soft" errors into the array, which can
cause many more UEs.

      Large systems have used a procedure called scrubbing to correct
soft errors before they accumulate over time. Scrubbing consists of a
background operation which successively fetches, corrects, and stores
the contents of storage.  Large systems have also used the
complement/recomplement procedure to correct certain types of
ECC-"uncorrectable" errors:  complement/recomplement works when
errors caused by stuck faults align in the same ECC word with
intermittent, or soft, errors.
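
      The idea behind complement/recomplement can be illustrated with
a minimal sketch.  It assumes a toy 8-bit word and a hypothetical
StuckCell model (neither is from the original text): a stuck bit
reads back the same value even after its complement is stored, so the
bits that fail to toggle mark the stuck positions.  The ECC
correction step is left out.

```python
WIDTH = 8                  # toy word width in bits (assumption)
MASK = (1 << WIDTH) - 1


class StuckCell:
    """Toy memory word whose stuck bits ignore writes (hypothetical model)."""

    def __init__(self, stuck_mask, stuck_value):
        self.stuck_mask = stuck_mask                 # 1 = bit position is stuck
        self.stuck_value = stuck_value & stuck_mask  # value the stuck bits hold
        self.bits = self.stuck_value

    def store(self, value):
        # Healthy bits take the new value; stuck bits keep theirs.
        self.bits = (value & ~self.stuck_mask & MASK) | self.stuck_value

    def fetch(self):
        return self.bits


def find_stuck_bits(cell):
    """Return a mask of the bits that fail to toggle under complementing."""
    first = cell.fetch()
    cell.store(~first & MASK)        # store the complement
    second = cell.fetch()
    cell.store(~second & MASK)       # recomplement: healthy bits are restored
    return ~(first ^ second) & MASK  # bits that did not change are stuck


cell = StuckCell(stuck_mask=0b00010000, stuck_value=0b00010000)
cell.store(0b10100101)
print(bin(find_stuck_bits(cell)))
```

      In this model the second store puts the healthy bits back to
their original values, so the test leaves the stored data as it found
it.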

      The technique described here uses the normal system data as a
test pattern.  The background "scrubbing" operation processes the
data, and collects counts of both stuck faults and intermittent
errors for each chip in the storage array.

      For On-Line Sparing, we enhanced the scrub procedure to perform
a complement/recomplement algorithm on the data. The complemented
data is compared in a unique way with the corrected data to identify
stuck faults in the array. Intermittent errors are also counted by
chip on each pass through the array.  If the total count of errors
from a given chip exceeds a threshold, the chip is scheduled for
sparing.
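
      The per-chip bookkeeping can be sketched as follows.  This is
an illustration only: the chip ids, the event lists, and the
threshold value are hypothetical, since the text gives neither a
threshold nor the format of the counters.

```python
from collections import Counter

SPARE_THRESHOLD = 3  # hypothetical value; the source does not state one


def chips_to_spare(stuck_events, intermittent_events,
                   threshold=SPARE_THRESHOLD):
    """Return the chip ids whose total error count exceeds the threshold.

    Each argument is a list of chip ids, one entry per error observed
    during a scrub pass.
    """
    totals = Counter(stuck_events)
    totals.update(intermittent_events)   # add intermittent counts per chip
    return sorted(chip for chip, count in totals.items() if count > threshold)


stuck = [5, 5, 5, 9]          # chip 5 saw three stuck faults, chip 9 one
intermittent = [5, 2, 9, 9]   # one intermittent error each on 5 and 2, two on 9
print(chips_to_spare(stuck, intermittent))
```

      The comparison is strict, matching "exceeds a threshold": a
chip whose total equals the threshold is not yet scheduled.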

      The figure shows the on-line sparing data flow.  The enhanced
scrubbing procedure consists of the following steps:
   1) Fetch one line (eight quadwords) from memory.
       a) correct the data using ECC
       b) if a UE is detected, mark the failing ECC-word in a
          UE-vector
       c) save the data in both the A- and B- buffers
   2) Store the complement of the data in the A-buffer into memory.
   3) Fetch the line from memory...
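
      The steps listed so far can be sketched as below.  Several
things here are assumptions, not part of the original text: memory is
modeled as a list of integers, a quadword is taken to be 128 bits
with one ECC word per quadword, and ecc_correct is a hypothetical
caller-supplied function returning (corrected_data, is_ue).  The
sketch stops at the fetch in step 3 and does not model anything that
follows it.

```python
QUADWORDS_PER_LINE = 8
QW_MASK = (1 << 128) - 1   # quadword assumed to be 128 bits


def scrub_steps_1_to_3(memory, line_addr, ecc_correct):
    """Run steps 1-3 on one line.

    Returns (a_buf, b_buf, ue_vector, refetched).
    """
    base = line_addr * QUADWORDS_PER_LINE
    a_buf, b_buf, ue_vector = [], [], []

    # Step 1: fetch the line, correct each word, note UEs, fill both buffers.
    for i in range(QUADWORDS_PER_LINE):
        data, is_ue = ecc_correct(memory[base + i])
        ue_vector.append(is_ue)
        a_buf.append(data)
        b_buf.append(data)

    # Step 2: store the complement of the A-buffer into memory.
    for i, data in enumerate(a_buf):
        memory[base + i] = ~data & QW_MASK

    # Step 3: fetch the line from memory again.
    refetched = [memory[base + i] for i in range(QUADWORDS_PER_LINE)]
    return a_buf, b_buf, ue_vector, refetched


def no_errors(word):
    """Stand-in ECC (hypothetical): passes data through, never reports a UE."""
    return word, False


memory = list(range(16))   # two lines of eight quadwords
a_buf, b_buf, ue_vector, refetched = scrub_steps_1_to_3(memory, 1, no_errors)
print(ue_vector)
```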