Cycle Counting to Determine Critical Machine Check

IP.com Disclosure Number: IPCOM000102546D
Original Publication Date: 1990-Nov-01
Included in the Prior Art Database: 2005-Mar-17
Document File: 2 page(s) / 76K

Publishing Venue

IBM

Related People

Geer, CP: AUTHOR

Abstract

Described is a method for determining which processor in a multiprocessor environment was the first to detect an error. In many cases, when one processor detects an error, it inadvertently causes errors on the other processors. For failure isolation, it is important to determine which processor detected the error first.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      In a multiprocessor, the Storage Control Unit (SCU) of each
processor may be tightly coupled to each of the other processors.  In
this case, each SCU must track the data traffic between every
processor and memory.  In particular, if the scheme used does not
rely on a master/slave arrangement to control bus usage, each SCU
must know the state of each of the others.

      Since all SCUs run together, the clocks on all of the SCUs
must be started and stopped together.  The method described here
also depends on this.  A 16-bit binary counter is implemented in
each SCU and increments every cycle.  During the Initial
Program Load (IPL), each of the chips is flushed and the clocks are
started simultaneously, so all of the counters hold the same count.
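
      The lock-step behavior of the counters can be illustrated with
a minimal Python sketch.  The class name ScuCounter, the helper
functions ipl and run_cycles, and the assumption that the IPL flush
clears each counter to zero are hypothetical and are not taken from
the disclosure; only the 16-bit width and the simultaneous start come
from the text.

```python
COUNTER_BITS = 16                        # counter width given in the text
COUNTER_MASK = (1 << COUNTER_BITS) - 1   # 0xFFFF


class ScuCounter:
    """Hypothetical model of the 16-bit cycle counter in one SCU."""

    def __init__(self):
        self.count = 0
        self.running = False

    def flush(self):
        # Assumption: the IPL flush leaves the counter at zero.
        self.count = 0

    def tick(self):
        # Advance one clock cycle; the count wraps from 0xFFFF to 0.
        if self.running:
            self.count = (self.count + 1) & COUNTER_MASK


def ipl(counters):
    """Flush every counter, then start all clocks together."""
    for c in counters:
        c.flush()
    for c in counters:
        c.running = True


def run_cycles(counters, n):
    """Clock all SCUs in lock step for n cycles."""
    for _ in range(n):
        for c in counters:
            c.tick()


scus = [ScuCounter() for _ in range(4)]
ipl(scus)
run_cycles(scus, 70000)   # more than 65536 cycles, so every counter wraps
counts = [c.count for c in scus]
```

      Because every counter starts from the same value and sees the
same clock, all entries of counts are equal even after the counters
wrap.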

      When the SCU reports that an error has occurred, the operating
system branches to the code that handles the error.  When the
operating system determines that it was the SCU that signaled the
machine check condition, it logs some information and reports it
to the service processor.  When the SCU reports the machine check, it
also stops its counter.  Therefore, as each SCU detects a machine
check condition, its counter is stopped in...
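
      The text is cut off at this point, so the comparison step is
not shown.  One plausible reading, consistent with the abstract, is
that the stopped counts are compared and the SCU with the earliest
count is taken as the first detector.  The Python sketch below
illustrates that reading only.  The function names, the SCU ids, the
sample counts, and the assumption that all counters stop within
32768 cycles of one another (so that wraparound can be resolved) are
hypothetical and are not taken from the disclosure.

```python
COUNTER_MASK = 0xFFFF   # 16-bit counter, as stated in the text
HALF_RANGE = 0x8000     # assumed maximum spread between stops


def cycles_before(a, b):
    """Return True if stopped count a is earlier than b, modulo 2**16.

    Assumption: the two counters stopped fewer than 32768 cycles apart.
    """
    diff = (b - a) & COUNTER_MASK
    return 0 < diff < HALF_RANGE


def first_detector(frozen_counts):
    """Return the id of the SCU whose counter stopped earliest.

    frozen_counts maps a hypothetical SCU id to its stopped count.
    Equal counts resolve to the SCU seen first in iteration order.
    """
    items = iter(frozen_counts.items())
    best_id, best_count = next(items)
    for scu_id, count in items:
        if cycles_before(count, best_count):
            best_id, best_count = scu_id, count
    return best_id


# SCU 2 stops at 0xFFFE, just before the counters wrap; the rest stop after.
frozen = {0: 0x0003, 1: 0x0001, 2: 0xFFFE, 3: 0x0010}
first = first_detector(frozen)
```

      A plain minimum over the stopped counts would pick SCU 1 in
this example; the modular comparison picks SCU 2, whose counter
stopped before the wrap.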