Browse Prior Art Database

Pre-Processor for Inherently Fault-Tolerant Code

IP.com Disclosure Number: IPCOM000102542D
Original Publication Date: 1990-Nov-01
Included in the Prior Art Database: 2005-Mar-17
Document File: 5 page(s) / 203K

Publishing Venue

IBM

Related People

Katz, S: AUTHOR [+2]

Abstract

This invention describes a pre-processor that transforms an ordinary distributed program into one that is tolerant of transient failures in the computers that execute it. The class of programs covered by this invention are those designed for a distributed computing system in which N processes (named 0 through N - 1, each executing on its own processor) are connected by a bidirectional ring communication network (a physical topology in which process i is connected by a FIFO channel directly only to processes i - 1 mod N and i + 1 mod N).

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 40% of the total text.

Pre-Processor for Inherently Fault-Tolerant Code

       This invention describes a pre-processor that transforms
an ordinary distributed program into one that is tolerant of
transient failures in the computers that execute it.  The class of
programs covered by this invention are those designed for a
distributed computing system in which N processes (named 0 through N
- 1, each executing on its own processor) are connected by a
bidirectional ring communication network (a physical topology in
which process i is connected by a FIFO channel directly  only to
processes i - 1 mod N and i + 1 mod N).

      Failures may result in arbitrary values being stored in any
mutable hardware device (e.g., read/write memory, disks, location
counters, communication channels, etc.) and are not necessarily
detectable.  We only assume that the program's code is stored in an
immutable read-only memory.  This invention details how a program may
be transformed such that, following a failure, the program eventually
resumes its intended behavior (up until the point of the next
failure).

      To illustrate the difficulty of the problem, note that a
failure may create a program state inconsistent with normal program
execution.  For example, the location counter can be pointing to an
instruction immediately following an assignment of O to variable x,
but x's memory location can contain a 1.  Similarly, the control and
memory of two processes may give the appearance that each has just
sent a query to the other when, in fact, no such query was sent. This
can introduce the possibility of each forever waiting for a response,
i.e., deadlock.

      Transient failures are characterized by a brief period of
faulty operation followed by long periods of correct operation.  Such
faults occur, for example, when the operating environment experiences
temporary deviations in acceptable levels of heat, power, radiation,
etc.  In important applications where repair or human intervention is
not possible (e.g., a deep-space probe), the tolerance of such faults
is critical.  This invention is based on the research area known as
the Theory of Self-Stabilizing Programs [1]; this invention is a
means for automatically making an ordinary program self-stabilizing.

      Notation Each program in this invention is expressed as a
sequence of statements (written as       ) separated by the symbol.
The statement          in the program of process i says that command
c may be executed by process i if the state of the local variables of
i satisfies the predicate (called the guard) g.  Process i's behavior
is to repeatedly (a) evaluate the guards of each statement in its
program, (b) select one statement with a true guard (if any), and (c)
execute it.  If no guard is true, then process i executes a no-op.
The variables in c must be local to process i.

      Messages are sent by process i to processes i - 1 mod N and
i + 1 mod N by the sendleft and...