On communication support for fault tolerant process groups (RFC0992)
Original Publication Date: 1986-Nov-01
Included in the Prior Art Database: 2019-Feb-14
Internet Society Requests For Comment (RFCs)
K.P. Birman: AUTHOR [+1]
This memo describes a collection of multicast communication primitives integrated with a mechanism for handling process failure and recovery. These primitives facilitate the implementation of fault-tolerant process groups, which can be used to provide distributed services in an environment subject to non-malicious crash failures.
Network Working Group                             K. P. Birman (Cornell)
Request for Comments: 992                         T. A. Joseph (Cornell)
                                                  November 1986
On Communication Support for Fault Tolerant Process Groups
K. P. Birman and T. A. Joseph
Dept. of Computer Science, Cornell University
Ithaca, N.Y. 14853
607-255-9199
1. Status of this Memo.
This memo describes a collection of multicast communication primitives integrated with a mechanism for handling process failure and recovery. These primitives facilitate the implementation of fault-tolerant process groups, which can be used to provide distributed services in an environment subject to non-malicious crash failures. Unlike other process group approaches, such as Cheriton’s "host groups" (RFCs 966, 988, [Cheriton]), our approach provides powerful guarantees about the behavior of the communication subsystem when process group membership is changing dynamically, for example due to process or site failures, recoveries, or migration of a process from one site to another. Our approach also addresses delivery ordering issues that arise when multiple clients communicate with a process group concurrently, or when a single client transmits multiple multicast messages to a group without pausing to wait until each is received. Moreover, the cost of the approach is low. An implementation is being undertaken at Cornell as part of the ISIS project.
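To make the single-client ordering issue concrete, the following is a minimal sketch of a receiver that holds back messages arriving out of order and delivers each sender's multicasts in the order they were sent. It is an illustration only, not the protocol described in this memo or implemented in ISIS: the class name FifoReceiver, the per-sender sequence numbers, and the hold-back buffer are all assumptions introduced for the example, and failures and membership changes are ignored entirely.

```python
from collections import defaultdict


class FifoReceiver:
    """Hypothetical group member that delivers each sender's messages in send order."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # next sequence number expected per sender
        self.pending = defaultdict(dict)   # sender -> {seq: payload} held back
        self.delivered = []                # (sender, payload) in delivery order

    def receive(self, sender, seq, payload):
        """Accept a message from the network, which may arrive out of order."""
        if seq < self.next_seq[sender]:
            return  # already delivered; ignore the duplicate
        self.pending[sender][seq] = payload
        # Deliver every message that is now contiguous with what was delivered.
        while self.next_seq[sender] in self.pending[sender]:
            n = self.next_seq[sender]
            self.delivered.append((sender, self.pending[sender].pop(n)))
            self.next_seq[sender] = n + 1


# A client sends three multicasts back to back; the network reorders the first two.
r = FifoReceiver()
r.receive("client", 1, "second")
r.receive("client", 0, "first")
r.receive("client", 2, "third")
print(r.delivered)
```

The sketch covers only per-sender ordering at a single receiver. Ordering among concurrent clients, and consistent behavior while group membership changes, are the harder problems the memo's primitives are meant to address.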
Here, we argue that the form of "best effort" reliability provided by host groups may not address the requirements of those researchers who are building fault tolerant software. Our basic premise is that reliable handling of failures, recoveries, and dynamic process migration are important aspects of programming in distributed environments, and that communication support that provides unpredictable behavior in the presence of such events places an unacceptable burden of complexity on higher level application software. This complexity does not arise when using the fault-tolerant process group alternative.
This memo summarizes our approach and briefly contrasts it with other process group approaches. For a detailed discussion, together with figures that clarify the details of the approach, readers are referred to the papers cited below.
Distribution of this memo is unlimited.
This memo was adapted from a paper presented at the Asilomar workshop on fault-tolerant distributed computing, March 1986, and summarizes material from a technical report that was issued by Cornell University, Dept. of Computer Science, in August 1985, which will appear in ACM Transactions on Computer Systems in February 1987 [Birman-b]. Copies of these papers, and other relevant papers, are available on request from the author: Dept. of Computer Science, Cornell University, Ithaca, New York 14853 (email@example.com). The ISIS project also maintains a mailing list. To be added to this list, contact M. Schmizzi (firstname.lastname@example.org).