Fault-tolerant startup in a multiprocessor system.

IP.com Disclosure Number: IPCOM000019024D
Original Publication Date: 2003-Aug-27
Included in the Prior Art Database: 2003-Aug-27
Document File: 1 page(s) / 40K

Publishing Venue

IBM

Abstract

In computing systems that incorporate more than one processor, the whole system may fail if any one of the processors fails. A simple technique allows the processors to monitor one another and to detect when one of them fails. Detecting the failure allows recovery actions to be initiated, so that a total system failure is avoided.

This is the abbreviated version, containing approximately 70% of the total text.

Disclosed is a method for recovering from the failure of a processor in a multiprocessor system.

    In computing systems that incorporate more than one processor, the entire system may fail if any one of the processors fails (e.g. stops executing instructions). A common solution is to detect the failure of a processor using associated timing logic in the system and to apply a reset to the failing component; this will often succeed if the cause of the original failure was a logical fault. However, in cases where the original failure also prevents the processor from responding to a reset, the system cannot be recovered.

    The invention described here allows other "good" processors to identify the "failed" processor, bring it back into operation, and so recover the system.

    The invention requires that each processor in the system executes the same (or a very similar) string of instructions during initial startup. These instructions alter some common part of the system (e.g. a memory location or a register) in such a way as to uniquely identify the processor that made the alteration. Each processor then checks the common area for the other processors' signatures, and can thus detect which other processors, if any, did not write their signatures in the common area. At this point there is an opportunity for a "good" processor to execute some error-recovery function on behalf of a failed processor, and thus ensure that the whole system does...
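
The signature scheme described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the disclosed implementation: the disclosure does not say how signatures are encoded, so one bit per processor in a shared word is assumed here, and the names `SharedArea`, `missing_processors`, `startup`, and the `recover` callback are all hypothetical. Processors are simulated sequentially; real hardware would need an atomic update of the common area and some wait before the check.

```python
from typing import Callable, Iterable, List


class SharedArea:
    """Stands in for the common memory location or register (hypothetical)."""

    def __init__(self) -> None:
        self.signatures = 0

    def write_signature(self, cpu_id: int) -> None:
        # Assumed encoding: each processor sets the bit matching its own id.
        # On real hardware this would have to be an atomic OR.
        self.signatures |= 1 << cpu_id


def missing_processors(area: SharedArea, cpu_count: int) -> List[int]:
    """Return the ids of processors whose signature is absent."""
    expected = (1 << cpu_count) - 1
    absent = expected & ~area.signatures
    return [cpu for cpu in range(cpu_count) if absent & (1 << cpu)]


def startup(
    cpu_count: int,
    failed: Iterable[int],
    recover: Callable[[int, int], None],
) -> List[int]:
    """Simulate startup; `failed` lists processors that never run the code."""
    failed_set = set(failed)
    area = SharedArea()

    # Every working processor runs the same startup code and signs in.
    good = [cpu for cpu in range(cpu_count) if cpu not in failed_set]
    for cpu in good:
        area.write_signature(cpu)

    # A working processor checks the common area for missing signatures
    # and runs the recovery action on behalf of each absent processor.
    missing = missing_processors(area, cpu_count)
    if good:
        for target in missing:
            recover(good[0], target)
    return missing
```

For example, calling `startup(4, {2}, recover)` with four simulated processors, of which processor 2 never signs in, invokes `recover(0, 2)` once and returns `[2]`. What the recovery action actually does is left to the callback, since the text above does not specify it.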