Browse Prior Art Database

A Replication Method to Reduce Latency and Communications Costs

IP.com Disclosure Number: IPCOM000123388D
Original Publication Date: 1998-Oct-01
Included in the Prior Art Database: 2005-Apr-04
Document File: 3 page(s) / 134K

Publishing Venue

IBM

Related People

Dixit, A: AUTHOR [+6]

Abstract

This disclosure relates to a replication method for use in the Encina (*) Publish/Subscribe system. The replication method uses a weighted voting algorithm to give one replica authority to assign sequence numbers on behalf of all. Also disclosed are details of the method in dealing with failures.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 41% of the total text.

A Replication Method to Reduce Latency and Communications Costs

   This disclosure relates to a replication method for use in
the Encina (*) Publish/Subscribe system.  The replication method
uses a weighted voting algorithm to give one replica authority to
assign sequence numbers on behalf of all.  Also disclosed are
details of the method in dealing with failures.

   The primary goal of the replication method is to provide
high availability of the Publish/Subscribe system while delivering
messages in a consistent order to subscribers.  Essentially, the
replication method assigns a sequence number to each message as it
is published; the rest of the system uses the sequence number to
order messages.  In database parlance, the sequence number is a "hot
spot," so the replication method is optimized to minimize the
latency of sequence number assignment.

   The replication method uses a "voting"  algorithm.  The
basic voting principle is that when an item is written, multiple
copies are written, and when an item is read, multiple copies are
read.  The size of the read set (quorum) and the size of the write
quorum are chosen so that, if a quorum can be formed with the
available replicas, the quorums are guaranteed to overlap.  For
example, with 3 replicas both the read and write quorums could be 2
replicas.  (Such a configuration would continue to function even if 1
replica failed.)  The classic reference for voting algorithms is:
David Gifford, Weighted Voting for Replicated Data, Proceedings of
the 7th Symposium on Operating System Principles, Pacific Grove, CA,
1979.

   In voting algorithms, a write is frequently transactional:
either all replicas in the write quorum perform the write, or none of
them do.  A straightforward voting-based transaction for the sequence
number would be multi-phase:
  1.  Assign a sequence number greater than any previously
      assigned.
  2.  Store the new message with that sequence number.
  3.  Commit the transaction using two-phase commit.

   The fixed nature of the sequence-number-assignment
transaction suggests that the commit could profitably be combined
with other work.  For example, storing the message could be part of
the commit.

   Optimizing the sequence number assignment is difficult,
however.  For latency reasons it would be desirable to assign the
sequence number without an explicit round-trip exchange among the
replicas.

   One option would be to use a time-based scheme
that would permit each replica to assign sequence numbers
independently.  Unfortunately, time-based sequence numbers would
tend to be very sparse which would render them almost unusable for
detecting message loss in the remainder of the system.

   Instead, one replica can be given the authority
(represented by a "token") to assign sequence numbers.  Since this
replica becomes a single point of failure, a voting scheme is used to
reassign or regenerate the token in the case of failure.

   Give...