Communication Support for Reliable Distributed Computing
Original Publication Date: 1986-May-31
Included in the Prior Art Database: 2007-Mar-29
Software Patent Institute
Birman, Kenneth P.: AUTHOR [+3]
Communication Support for Reliable Distributed Computing*
Thomas A. Joseph
Department of Computer Science Cornell University
Ithaca, NY 14853
* This work was supported by the Defense Advanced Research Projects Agency (DoD) under
ARPA order 5378, Contract MDA903-85-C-0124, and by the National Science Foundation under grant DCR-8412582.
The views, opinions and findings contained in this report are those
of the authors and should not be construed as an official Department of Defense position, policy, or decision.
COMMUNICATION SUPPORT FOR RELIABLE
DISTRIBUTED COMPUTING
Kenneth P. Birman and Thomas A. Joseph
Department of Computer Science, Cornell University, Ithaca, New York
We describe a collection of communication primitives integrated with a mechanism for handling process failure and recovery. These primitives facilitate the implementation of
process groups, which can be used to provide distributed services in an environment subject to non-malicious crash failures.
At Cornell, we recently completed a prototype of the ISIS system, which transforms abstract type specifications into fault-tolerant distributed implementations, while
insulating users from the
mechanisms by which fault-tolerance is achieved [Birman-a].
A wide range of reliable communication primitives have been proposed in the literature, and we became convinced that by using such
primitives when building the ISIS system, complexity could be avoided. Unfortunately, the existing protocols, which range from reliable and atomic broadcast [-] [Cristian] [Schneider] to Byzantine agreement [Strong], either do not satisfy the ordering constraints required for many fault-tolerant applications or satisfy a stronger constraint than necessary at too high a cost. In particular, these protocols have not attempted to minimize the latency (delay) before message delivery can occur.
In ISIS, latency appears to be a major factor that limits performance.
Fault-tolerant distributed systems also need a way to detect failures and recoveries consistently, and we found that this could be integrated into the communication layer in a manner that reduces the synchronization burden on higher level algorithms. These observations motivated the development
of a new collection of primitives, which we present below.