Browse Prior Art Database

Pre-Processor for Inherently Fault-Tolerant Code

IP.com Disclosure Number: IPCOM000101447D
Original Publication Date: 1990-Aug-01
Included in the Prior Art Database: 2005-Mar-16
Document File: 4 page(s) / 168K

Publishing Venue

IBM

Related People

Katz, S: AUTHOR [+2]

Abstract

This article describes a pre-processor that transforms an ordinary distributed program into one that is tolerant of transient failures in the computers that execute it. The class of programs covered by this invention are those designed for a distributed computing system in which N processes (named O through N - 1, each executing on its own processor) are connected by a unidirectional ring communication network (a physical topology in which process i is connected by a FIFO channel directly only to process i + 1 mod (N)). The IBM Token Ring is an example of such a network.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 42% of the total text.

Pre-Processor for Inherently Fault-Tolerant Code

       This article describes a pre-processor that transforms an
ordinary distributed program into one that is tolerant of transient
failures in the computers that execute it.  The class of programs
covered by this invention are those designed for a distributed
computing system in which N processes (named O through N - 1, each
executing on its own processor) are connected by a unidirectional
ring communication network (a physical topology in which process i is
connected by a FIFO channel directly only to process i + 1 mod (N)).
The IBM Token Ring is an example of such a network.

      Failures may result in arbitrary values being stored in any
mutable hardware device (e.g., read/write memory, disks, location
counters, communication channels, etc.) and are not necessarily
detectable.  We only assume that the program's code is stored in an
immutable read- only memory.  This article details how a program may
be transformed such that, following a failure, the program eventually
resumes its intended behavior (up until the point of the next
failure).

      To illustrate the difficulty of the problem, note that a
failure may create a program state inconsistent with normal program
execution.  For example, the location counter can be pointing to an
instruction immediately following an assignment of O to variable x,
but x's memory location can contain a 1.

      Similarly, the control and memory of two processes may give the
appearance that each has just sent a query to the other when, in
fact, no such query was sent.  This can introduce the possibility of
each forever waiting for a response, i.e., deadlock.

      Transient failures are characterized by a brief period of
faulty operation followed by long periods of correct operation.  Such
faults occur, for example, when the operating environment experiences
temporary deviations in acceptable levels of heat, power, radiation,
etc.  In important applications where repair or human intervention is
not possible (e.g., a deep-space probe), the tolerance of such faults
is critical.  This invention is based on the research area known as
the Theory of Self-Stabilizing Programs (1); this invention is a
means for automatically making an ordinary program self-stabilizing.

      Notation: Each program in this invention is expressed as a
sequence of statements (written as separated by the symbol.  The
statement

      in the program of process i says that command c may be executed
by process i if the state of the local variables of i satisfies the
predicate (called the guard) g.  Process i's behavior is to
repeatedly (a) evaluate the guards of each statement in its program,
(b) select one statement with a true guard (if any), and (c) execute
it.  If no guard is true, then process i executes a no-op.  The
variables in c must be local to process i.

      To be able to syntactically distinguish between messages of
diff...