
Software Diagnostic Routine for Failing Chip Isolation

IP.com Disclosure Number: IPCOM000039150D
Original Publication Date: 1987-Apr-01
Included in the Prior Art Database: 2005-Feb-01
Document File: 2 page(s) / 14K

Publishing Venue

IBM

Related People

DeBellis, RS: AUTHOR [+3]

Abstract

Failing chips are identified in a defective field replaceable unit (FRU) by comparing the failing FRU with an identical good FRU. This can be done in any environment where the two FRUs can be plugged into identical processors and run in synchronization, such as the two processors in a dyadic machine. The comparison is done at machine speed, thereby ensuring valid AC testing and fault isolation. Each FRU contains many hardware-designed error checkers which check logic in their respective domains. These domains may overlap or cross FRU boundaries, and there may be logic which does not fall within any domain (unchecked logic). AC faults may occur only at machine cycle time and are dependent on the particular data being processed at the time.

This is the abbreviated version, containing approximately 51% of the total text.


Failing chips are identified in a defective field replaceable unit (FRU) by comparing the failing FRU with an identical good FRU. This can be done in any environment where the two FRUs can be plugged into identical processors and run in synchronization, such as the two processors in a dyadic machine. The comparison is done at machine speed, thereby ensuring valid AC testing and fault isolation.
Each FRU contains many hardware-designed error checkers which check logic in their respective domains. These domains may overlap or cross FRU boundaries, and there may be logic which does not fall within any domain (unchecked logic). AC faults may occur only at machine cycle time and are dependent on the particular data being processed at the time. Using error checkers, programs analyze which checkers came on when an error is detected and determine from those which chips within the failing FRU to call as potentially defective. However, the detection is only as good as the number and type of hardware-designed error checkers and the correctness of the database that maps the checkers which came on to the chips to call. Current methods may not do an adequate job of chip isolation, indicating a large number of chips for replacement. When many chips are to be replaced, there is a high probability that errors are introduced during the repair.
A cycle-by-cycle comparison of the failing processor element's LSSD (level-sensitive scan design) ring data with that produced by an identical good element, while both execute the same instructions and data in identical environments, will reduce the number of chips that are identified as failed. Faults are detected as a mismatch of like facilities between the two elements. All facilities in each scan ring are compared; therefore, all facilities in error are detected. The machine used for the test is a "golden" dyadic machine, i.e., known good hardware.
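The comparison of like facilities described above can be illustrated with a minimal software sketch. It is not the routine from the disclosure: the facility names, the chip labels, the facility-to-chip table, and the representation of a scan ring as a dictionary are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumption: one scan-ring snapshot is modelled as facility name -> latched value.
Snapshot = Dict[str, int]

# Hypothetical table saying which chip holds each facility.
FACILITY_TO_CHIP = {"ADDR_REG": "chip_A", "ALU_OUT": "chip_B", "STATUS": "chip_C"}


def mismatched_facilities(failing: Snapshot, good: Snapshot) -> List[str]:
    """Return the facilities whose values differ between the two elements."""
    return sorted(name for name in good if failing.get(name) != good[name])


def first_divergence(
    failing_trace: List[Snapshot], good_trace: List[Snapshot]
) -> Optional[Tuple[int, List[str]]]:
    """Compare cycle by cycle; return (cycle, facilities) at the first mismatch."""
    for cycle, (failing, good) in enumerate(zip(failing_trace, good_trace)):
        diff = mismatched_facilities(failing, good)
        if diff:
            return cycle, diff
    return None  # the two elements never diverged


def chips_to_call(facilities: List[str], table: Dict[str, str]) -> Set[str]:
    """Map the facilities in error to the chips that contain them."""
    return {table[name] for name in facilities}


# Made-up ring data for three machine cycles of the good element.
good_trace = [
    {"ADDR_REG": 0x100, "ALU_OUT": 0x00, "STATUS": 0x8},
    {"ADDR_REG": 0x104, "ALU_OUT": 0x2A, "STATUS": 0x8},
    {"ADDR_REG": 0x108, "ALU_OUT": 0x54, "STATUS": 0x8},
]
# The failing element matches until cycle 2, where one bit of ALU_OUT is dropped.
failing_trace = [dict(snapshot) for snapshot in good_trace]
failing_trace[2]["ALU_OUT"] = 0x50

result = first_divergence(failing_trace, good_trace)
cycle, facilities = result
chips = chips_to_call(facilities, FACILITY_TO_CHIP)
print(cycle, facilities, chips)
```

Because every facility in the ring is compared, the call-out in this sketch does not depend on a hardware checker covering the failing logic; it depends only on the assumed facility-to-chip table being correct.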
The test starts by placing the defective FRU into the first central processor (CP1) of the dyadic machine and the good FRU into the other central processor (CP2). CP1 is IPLed and placed in "hardstop" mode. In this mode the machine stops on the clock cycle after an error is detected. A functional exerciser is loaded and executed. Execution continues until the fault occurs and the machine check-stops. The instruction op-code that was executing when the failure occurred and the operand data are saved externally to the CPUs. It is this data which uniquely allows the AC...
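The capture step, running an exerciser until the check stop and saving the op-code and operand data, can be sketched as a small simulation. This covers only the steps stated in the text; the instruction names, the `FailureRecord` type, and the `faulty_cp1` fault model are hypothetical stand-ins, and no real hardware interface is implied.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple

Instruction = Tuple[str, Tuple[int, ...]]  # (op-code, operand data)


@dataclass(frozen=True)
class FailureRecord:
    """What is saved externally to the CPUs when the machine stops."""
    opcode: str
    operands: Tuple[int, ...]


def run_until_check_stop(
    exerciser: Iterable[Instruction],
    executes_correctly: Callable[[str, Tuple[int, ...]], bool],
) -> Optional[FailureRecord]:
    """Execute instructions until one fails, then return its op-code and operands."""
    for opcode, operands in exerciser:
        if not executes_correctly(opcode, operands):
            return FailureRecord(opcode, operands)  # hardstop: halt at the failure
    return None  # the exerciser finished without a fault


def faulty_cp1(opcode: str, operands: Tuple[int, ...]) -> bool:
    """Hypothetical data-dependent fault: ADD fails only when the sum exceeds 0xFF."""
    return not (opcode == "ADD" and sum(operands) > 0xFF)


# Made-up exerciser stream; only the third instruction hits the fault condition.
exerciser = [
    ("LOAD", (0x10,)),
    ("ADD", (0x01, 0x02)),
    ("ADD", (0xF0, 0x20)),
    ("STORE", (0x10,)),
]

record = run_until_check_stop(exerciser, faulty_cp1)
print(record)
```

The fault model is deliberately data dependent: the same ADD op-code passes with small operands and fails with large ones, which is why the sketch saves the operand data together with the op-code.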