Sequence Timer for Isolating Errors among Many Synchronous Chips

IP.com Disclosure Number: IPCOM000114689D
Original Publication Date: 1995-Jan-01
Included in the Prior Art Database: 2005-Mar-29
Document File: 2 page(s) / 105K

Publishing Venue

IBM

Related People

Lewis, DO: AUTHOR [+2]

Abstract

Disclosed is a method of determining the first chip to detect an error in a system that allows errors to propagate through chips to other chips. Knowing the first chip to detect an error allows better determination of the source of errors and selection of repair actions.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      Disclosed is a method of determining the first chip to detect
an error in a system that allows errors to propagate through chips to
other chips.  Knowing the first chip to detect an error allows better
determination of the source of errors and selection of repair
actions.

      When an error occurs within a chip or between groups of chips,
the error will often propagate from the chip that first detected it
to another chip, and perhaps back to the first chip again, setting a
different error indication.  The code or hardware that examines the
errors then faces multiple error indications pointing to different
chips as failing, with no way to tell which indication was raised
first.

      The first chip to detect an error is the place to start looking
for the cause of an error in a computer system.  That chip may or may
not be the source of the error, but it probably contains information
identifying the source of the error.  In previous systems, errors were
not allowed to propagate, so the first chip to detect an error was
the only chip to detect an error (unless several detected an error at
the same time from a common bus).

      As systems get faster, it is no longer feasible to prevent the
propagation of errors.  If a chip observes an error, it may well
cause an error in other chips as bad data or invalid sequences are
passed.  Potentially many chips report errors, but most are cascaded
from the original error and are unlikely to help with isolating the
source of the error.

      The problem is solved by placing in each chip a sequence timer
that continually increments from its initial value (zero) through its
implementation-dependent maximum and then starts over at zero.  Each
chip in a system of chips running from the same clock must contain a
timer of the same length; all timers must be reset at the same time
and must start counting at the same time.  If a chip detects an
error, it must stop the sequence timer or capture its value.  After
an error, the service function retrieves the sequence timer values
(or captured sequence timer values) from all chips reporting errors.
The values are processed to determine which chip saw the error first.
After the first chip is determined, its error log data is examined to
determine the source of the error for problem determination and to
identify the hardware that must be replaced (if needed).  See
"Determining the First Chip to See An Error" for how the service
processor determines the first chip to see the error from the
captured sequence timer values.  The sequence timer must be large
enough to guarantee that the highest sequence timer bits will not
make a transition from 00 to 10 or 11 from the time the error is
detected until all of the sequence timers are stopped or exam...
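
      The procedure the disclosure itself uses to pick the first chip
is in a section not reproduced in this abbreviated text.  As an
illustration only, the Python sketch below shows one way a service
function might order captured values from a wrapping timer.  It
assumes that all captures fall within half of one timer period, which
is consistent with the sizing requirement above but is not taken from
the source.  The function name, the timer_bits parameter, and the chip
names are hypothetical.

```python
def first_chips_to_see_error(captured, timer_bits):
    """Return the names of the chips with the earliest captured value.

    captured   -- dict mapping a chip name to its captured timer value
    timer_bits -- width of the sequence timer in bits (assumed parameter)

    Assumption: every capture lies within half of one timer period,
    so a wrap from the maximum back to zero can be undone unambiguously.
    """
    period = 1 << timer_bits
    half = period // 2
    # Any one capture can serve as the reference point.
    reference = next(iter(captured.values()))

    def offset(value):
        # Signed distance from the reference, in the range [-half, half).
        return ((value - reference + half) % period) - half

    earliest = min(offset(v) for v in captured.values())
    # Several chips may tie, e.g. when they detect the error
    # at the same time from a common bus.
    return sorted(chip for chip, v in captured.items()
                  if offset(v) == earliest)


# Example with an 8-bit timer that wrapped between captures:
# chip A stopped at 0xFE, just before the wrap, so it saw the error first.
print(first_chips_to_see_error({"A": 0xFE, "B": 0x02, "C": 0x05}, 8))
```

      A plain numeric minimum would wrongly pick chip B in this
example, because its value 0x02 was captured after the timer wrapped.
Measuring each value as a signed offset from a common reference
avoids that mistake as long as the half-period assumption holds.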