Dismiss
InnovationQ will be updated on Sunday, Oct. 22, from 10am ET - noon. You may experience brief service interruptions during that time.
Browse Prior Art Database

Synchronizing Clocks in a System Resilient to a Wide Class of Failures

IP.com Disclosure Number: IPCOM000105347D
Original Publication Date: 1993-Jul-01
Included in the Prior Art Database: 2005-Mar-19
Document File: 2 page(s) / 93K

Publishing Venue

IBM

Related People

Garay, JA: AUTHOR [+2]

Abstract

A common technique for constructing computer systems that are able to tolerate faults is to replicate the computation on several processors. In order to apply this technique, two tools are necessary: a fault-tolerant Broadcast method for broadcasting data to all processors (such data may originate either from one of the processors or as the output of a sensor) and a fault-tolerant Consensus method allowing the processors to reach consensus on the computation's result, so that they may take a common action (e.g., activating an actuator). These two methods have been studied in the abstract and are known respectively as the Byzantine Generals Problem and the Consensus Problem.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

Synchronizing Clocks in a System Resilient to a Wide Class of Failures

      A common technique for constructing computer systems that are
able to tolerate faults is to replicate the computation on several
processors.  In order to apply this technique, two tools are
necessary:  a fault-tolerant Broadcast method for broadcasting data
to all processors (such data may originate either from one of the
processors or as the output of a sensor) and a fault-tolerant
Consensus method allowing the processors to reach consensus on the
computation's result, so that they may take a common action (e.g.,
activating an actuator).  These two methods have been studied in the
abstract and are known respectively as the Byzantine Generals Problem
and the Consensus Problem.

      Previously methods have been disclosed for Broadcast and
Consensus, as well as a means for combining the methods to implement
a fault-tolerant computing system.  The salient feature of these
methods was the ability of the n processors to overcome up to t
processor failures where processors may fail either by crashing or
acting arbitrarily, subject to no more than b processors acting
arbitrarily.  However, there was the implicit assumption that the
hardware clocks of all processors were synchronized within some given
tolerance.

      In real life, hardware clocks drift, e.g., they advance
approximately 1 second after one second of true time.  For example, a
clock may have a drift rate of plus or minus one percent, meaning
that after one second of true time, a fast clock can advance 1.01
seconds whereas a slow clock may only advance .99 seconds.  Thus,
after one second of true time, clocks that are initially perfectly
synchronized may differ in their readings by as much as .02.  The
difference in readings gets worse over time:  after an additional
second of true time, the fast clock can read 2.02 while the slow
clock reads 1.98.  After enough real time has passed, the readings
may differ by so much as to render the clocks useless.  Therefore, a
software method is necessary to re-synchronize the clocks.  A method
is disclosed that is tolerant of crash and arbitrary failures of the
processors (and their clocks).

      The disclosed method involves the processors maintaining a
software clock, i.e., a variable that is periodically incremented and
used in place of the hardware clock.  The method works as follows.
Once the hardware clock of a processor has advanced a certain amount,
the processor sends messages to all other processors.  Only after a
processor has received a sufficient number of certain messages from
the other processors doe...