Checkstop Architecture for a Multiprocessor System

IP.com Disclosure Number: IPCOM000119334D
Original Publication Date: 1991-Jan-01
Included in the Prior Art Database: 2005-Apr-01
Document File: 3 page(s) / 110K

Publishing Venue

IBM

Related People

Jaber, TK: AUTHOR

Abstract

In multiprocessing environments where many processors are running various applications in parallel, it is essential that the reliability of the entire system does not depend on the reliability of a single processing element (PE). An error condition on one of the processors must not bring down the entire system. This article describes a multiprocessor error handling 'checkstop' architecture that solves the problem of a single processing element causing an error condition in a multiprocessing system.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      In a multiprocessing environment, it is highly desirable to
prevent an error condition that occurs in one processing element
from propagating to the other processing elements and bringing down
the entire system.  It is often taken for granted that continuous
reliability is a realistic goal in MP systems.  To achieve it, a
failing processing element must be immediately identified, isolated
and withdrawn from the rest of the system.  How well this is achieved
depends on the system error handling architecture and on the error
handling and fault tolerance logic implemented on every processing
element.  Because an error that originates on a particular PE must
not spread and cause errors on other PEs in the system, the speed
with which the error is detected and isolated is crucial.  A
processing element can be a single chip or a number of chips packaged
on an MCM (Multi-Chip Module) or a card.

      The checkstop architecture described in this disclosure is
shown in Fig. 1.  It was proposed for a multiprocessing system
consisting of 4 processors (PEs).  Each processor consisted of 7 VLSI
chips: one instruction fetch unit (ICU chip), 2 instruction execution
units of one chip each (FXU, FPU) and a data cache unit (DCU)
consisting of 4 chips.  A Shadow Directory unit (SDU chip) guaranteed
cache coherency across the 4 processors and between the processor
caches and memories.  A set of 8 chips constituted the Memory Buffer
Unit (MBU), which acted as a switch between processors and memories
and as I/O data buffers.
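
      The chip inventory described above can be tallied in a short
sketch.  This is only an illustration, not part of the disclosure:
the names PE_UNITS, SHARED_UNITS, ProcessingElement, build_system and
total_chips are hypothetical, the SDU is assumed to be a single chip
(the text calls it "SDU chip"), and only the chips named in the text
are counted.

```python
from dataclasses import dataclass, field

# Chip counts per unit, as given in the text. All names here are hypothetical.
PE_UNITS = {"ICU": 1, "FXU": 1, "FPU": 1, "DCU": 4}  # chips inside one PE
SHARED_UNITS = {"SDU": 1, "MBU": 8}  # assumption: the SDU is a single chip
NUM_PES = 4


@dataclass
class ProcessingElement:
    """One processor (PE) and the chips it is built from."""
    pe_id: int
    units: dict = field(default_factory=lambda: dict(PE_UNITS))

    def chip_count(self) -> int:
        # Sum the chips of every unit in this PE.
        return sum(self.units.values())


def build_system() -> list:
    """Create the 4 PEs of the system described in the text."""
    return [ProcessingElement(pe_id) for pe_id in range(NUM_PES)]


def total_chips(pes: list) -> int:
    """Count the chips named in the text: all PE chips plus SDU and MBU."""
    return sum(pe.chip_count() for pe in pes) + sum(SHARED_UNITS.values())


if __name__ == "__main__":
    pes = build_system()
    print("chips per PE:", pes[0].chip_count())
    print("named chips in system:", total_chips(pes))
```

      Keeping the per-PE units separate from the shared SDU and MBU
mirrors the distinction the text draws between the chips that belong
to one processor and those that serve all four.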

      In this checkstop architecture each processor ge...