Method and System for Safe and Efficient Checkpointing Using Multiple Signatures

IP.com Disclosure Number: IPCOM000014039D
Original Publication Date: 1999-Nov-01
Included in the Prior Art Database: 2003-Jun-19
Document File: 4 page(s) / 36K

Publishing Venue

IBM

Related People

Elmootazbellah Elnozahy: AUTHOR

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 41% of the total text.

Disclosed herein is a method for performing safe and efficient process
checkpointing. The novelty lies in using several signature functions to
detect the changes that occur to a process's state between consecutive
checkpoints. The resulting benefits include a reduction in the amount of
state that must be saved at each checkpoint, independence from the
hardware and operating system, and efficiency.
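
As a rough illustration of the idea (the full description of the signature scheme is not included in this abbreviated text), the sketch below splits a state buffer into fixed-size blocks, keeps several independent signatures per block, and marks a block as changed when any signature differs from the one recorded at the previous checkpoint. The block size, the choice of CRC-32, Adler-32 and a truncated BLAKE2b digest as signature functions, and all function names are assumptions made for illustration, not details taken from the disclosure.

```python
import hashlib
import zlib

BLOCK_SIZE = 64  # bytes per block; an assumed value chosen for illustration


def signatures(block: bytes) -> tuple:
    """Return several independent signatures of one block (assumed functions)."""
    return (
        zlib.crc32(block),
        zlib.adler32(block),
        hashlib.blake2b(block, digest_size=8).digest(),
    )


def block_signatures(state: bytes) -> list:
    """Compute the signature tuple of every block in the state."""
    return [
        signatures(state[i:i + BLOCK_SIZE])
        for i in range(0, len(state), BLOCK_SIZE)
    ]


def changed_blocks(old_sigs: list, state: bytes) -> tuple:
    """Return (indices of blocks whose signatures differ, new signature table)."""
    new_sigs = block_signatures(state)
    changed = [
        i for i, sig in enumerate(new_sigs)
        if i >= len(old_sigs) or sig != old_sigs[i]
    ]
    return changed, new_sigs


# Usage: only the signature table is kept between checkpoints,
# not a copy of the previous state.
state = bytearray(4 * BLOCK_SIZE)
sigs = block_signatures(bytes(state))
state[130] = 0xFF  # modify one byte, which falls in block 2
changed, sigs = changed_blocks(sigs, bytes(state))
print(changed)  # [2]
```

One plausible reading of "safe" in the title is that combining several signature functions lowers the chance of a real change going undetected because two different block contents collide under a single function. That reading is an interpretation, not a statement from the text.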

      Rollback-recovery is an established method
for achieving high availability and reliability in
database systems and other applications. The
principal idea in this style of fault tolerance is
to periodically save on stable storage a checkpoint
that includes the state of a process, a set of
cooperating processes, or a database, depending on
the application at hand. If a failure occurs, the
system restarts from a saved checkpoint and resumes
computation. Checkpointing on stable storage incurs
performance and storage overheads, so reducing
these overheads is important in any implementation.

      Incremental checkpointing is an established
technique for reducing the storage overhead of
checkpointing. It records only the changes that
occur to the state of a process between two
consecutive checkpoints, instead of recording the
entire state at each checkpoint. A kernel-level
implementation of incremental checkpointing can use
the "dirty" bits in the memory management hardware
to detect the memory pages that change between
consecutive checkpoints. A user-level
implementation can use the protection bits
available in the memory management hardware to
emulate the effects of dirty bits, with additional
overhead. Either way, relying on the memory
management hardware is not portable and may often
be too cumbersome for databases. Moreover, the
granularity of change detection is too coarse: if a
single word changes within a page, the
checkpointing system must save the entire page
because it cannot determine which words have
actually changed. These methods therefore may not
make efficient use of the storage available for the
checkpoint, which increases both the performance
and the storage overhead.
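
The granularity cost can be made concrete with a small calculation. The sketch below assumes a 4096-byte page and a 4-byte word (both illustrative values, not taken from the text) and compares the number of bytes a page-granularity scheme writes with the number of bytes that actually changed.

```python
PAGE_SIZE = 4096  # assumed page size in bytes
WORD_SIZE = 4     # assumed word size in bytes


def bytes_saved_per_page(dirty_offsets):
    """Bytes written when every page containing a dirty byte is saved whole."""
    dirty_pages = {off // PAGE_SIZE for off in dirty_offsets}
    return len(dirty_pages) * PAGE_SIZE


def bytes_saved_per_word(dirty_offsets):
    """Bytes written if only the modified words were saved."""
    dirty_words = {off // WORD_SIZE for off in dirty_offsets}
    return len(dirty_words) * WORD_SIZE


# One modified word in each of three different pages.
offsets = [16, 5000, 9000]
print(bytes_saved_per_page(offsets))  # 12288
print(bytes_saved_per_word(offsets))  # 12
```

A real word-level scheme would also have to record where each saved word belongs, so the practical gap is somewhat smaller than this raw ratio suggests.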

      When storage overhead or portability is
important, however, there are other alternatives.
For example, Plank et al. propose a portable
technique that computes the changes in software
[*]. It works by comparing the contents of each
page or block with the corresponding values in the
previous checkpoint. This technique is not very
efficient because of the comparison overhead and
the requirement that the old pages be available in
memory. Another way to compute the changes is to
use the compiler. This technique detects changes
at the word level with no overhead at runtime, but
it works only when compile-time analysis can
predict memory usage patterns, which is seldom
possible in practice.
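
The software comparison approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Plank et al.'s implementation: the block size and the function names are hypothetical. Note that the full contents of the previous checkpoint must stay in memory for the comparison, which is the cost the text points out.

```python
BLOCK_SIZE = 64  # assumed block size in bytes


def diff_blocks(previous: bytes, current: bytes) -> dict:
    """Return {block index: new contents} for blocks that differ
    from the previous checkpoint."""
    changed = {}
    for start in range(0, len(current), BLOCK_SIZE):
        new_block = current[start:start + BLOCK_SIZE]
        old_block = previous[start:start + BLOCK_SIZE]
        if new_block != old_block:  # direct comparison of block contents
            changed[start // BLOCK_SIZE] = new_block
    return changed


def apply_blocks(base: bytes, changed: dict) -> bytes:
    """Rebuild a state by applying saved blocks on top of an earlier checkpoint."""
    state = bytearray(base)
    for index, block in changed.items():
        start = index * BLOCK_SIZE
        state[start:start + len(block)] = block
    return bytes(state)


previous = bytes(256)   # full copy kept from the last checkpoint
current = bytearray(previous)
current[70] = 1         # touches block 1
current[200] = 2        # touches block 3
delta = diff_blocks(previous, bytes(current))
print(sorted(delta))    # [1, 3]
```

Only the two changed blocks are written to the incremental checkpoint, and `apply_blocks(previous, delta)` reproduces the current state during recovery.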

1. Checkpointing Using Signatures
Checkpointing using signatures (also called
probabilist...