A method to add fault tolerance and reliability to software system

IP.com Disclosure Number: IPCOM000032852D
Original Publication Date: 2004-Nov-16
Included in the Prior Art Database: 2004-Nov-16
Document File: 6 page(s) / 114K

Publishing Venue

IBM

Abstract

This memo describes an approach to designing a software system so that it can handle transient problems such as network interruptions, power-offs, and software crashes, thereby increasing the reliability of the system.

This is the abbreviated version, containing approximately 47% of the total text.


This memo describes an approach to designing a software system so that it can handle transient problems such as network interruptions, database errors, power-offs, and software crashes, thereby increasing the reliability of the system.

    The proposed approach combines a checkpoint-and-restart technique for handling these transient failures with a finite state machine that manages controlled retries and the overall application flow.

    In the checkpoint-and-restart technique, the state of the application is periodically checkpointed. If an error is detected, the process is rolled back to the last checkpointed state (or recovered, if a rollback is not possible) and then restarted.
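    As an illustration of the checkpoint-and-restart idea only, the following Java sketch persists the index of the last completed step to a file and resumes after that step on the next run. The class names FileCheckpointStore and CheckpointedJob, the file format, and the failAt crash simulation are assumptions made for this example; they are not taken from the APM implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical file-backed store holding the index of the last completed step. */
class FileCheckpointStore {
    private final Path file;

    FileCheckpointStore(Path file) { this.file = file; }

    /** Convenience factory for the example: a store backed by a fresh temporary file. */
    static FileCheckpointStore temporary() {
        try {
            return new FileCheckpointStore(Files.createTempFile("checkpoint", ".txt"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the last saved step index, or -1 when no checkpoint has been written yet. */
    int load() {
        try {
            if (!Files.exists(file)) return -1;
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            return text.isEmpty() ? -1 : Integer.parseInt(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Records that every step up to and including completedStep has finished. */
    void save(int completedStep) {
        try {
            Files.write(file, Integer.toString(completedStep).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

class CheckpointedJob {
    /**
     * Runs steps [0, totalSteps) in order, skipping those already checkpointed.
     * failAt simulates a crash just before the given step (-1 means no crash).
     * Returns the number of steps executed during this run.
     */
    static int run(FileCheckpointStore store, int totalSteps, int failAt) {
        int executed = 0;
        for (int step = store.load() + 1; step < totalSteps; step++) {
            if (step == failAt) {
                throw new IllegalStateException("simulated crash before step " + step);
            }
            // The real work for this step would go here.
            store.save(step); // checkpoint only after the step has completed
            executed++;
        }
        return executed;
    }
}
```

    Calling CheckpointedJob.run(store, 5, 3) fails before step 3 and leaves the checkpoint at step 2; a second call with failAt set to -1 resumes at step 3 instead of starting over. A production store would probably write to a temporary file and rename it atomically so that a crash during save cannot corrupt the checkpoint.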

    The technique described here has been successfully embedded in the APM application, giving it the ability to recover from unexpected JVM crashes and power-offs and to handle transient database and network errors dynamically.

Architecture of the system

    The class diagram below shows how the system is designed. The yellow objects represent the classes that make up the framework, while the pink objects are the application-specific classes that plug into it. The yellow part is entirely generic and can be reused across applications; only the pink classes need to be rewritten for a new application that uses this system.

[Class diagram not reproduced. Elements recoverable from the extracted text:]

    «interface» RestartableCommand
        + recover ( [in] ctx : Context ) : State
        + execute ( [in] ctx : Context ) : State

    «interface» Context
        + beginTransaction ( ) : void
        + rollback ( ) : void
        + commit ( ) : void

    «interface» State

    CheckpointManager
        + saveCheckPoint ( )
        + getLastSavedCheckPoint ( )
        + CheckpointManager ( [in] Con...
        + execute ( [in] Context : Conte...

    Operations truncated in the extraction: getCo..., getNe..., clean..., init (..., getIn...

    Application-specific classes: ActionCmd1 (....) ActionCmdn (more actions can be created), ConcreteSt..., Applicatio...
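    The interfaces named in the class diagram can be sketched in Java roughly as follows. The method signatures of RestartableCommand and Context follow the diagram text; everything else is an assumption made for illustration: the StepState enum, the RecordingContext test double, and the way ActionCmd1 wraps its work in a transaction and maps a failure to a retry state.

```java
import java.util.ArrayList;
import java.util.List;

/** Marker interface; the diagram text shows no operations on State. */
interface State { }

interface Context {
    void beginTransaction();
    void rollback();
    void commit();
}

interface RestartableCommand {
    State recover(Context ctx);
    State execute(Context ctx);
}

/** Hypothetical concrete states returned by a command. */
enum StepState implements State { DONE, RETRY }

/** Hypothetical in-memory Context that records the calls it receives. */
class RecordingContext implements Context {
    final List<String> calls = new ArrayList<>();
    public void beginTransaction() { calls.add("begin"); }
    public void rollback() { calls.add("rollback"); }
    public void commit() { calls.add("commit"); }
}

/** Hypothetical application-specific command plugged into the framework. */
class ActionCmd1 implements RestartableCommand {
    private final Runnable work;

    ActionCmd1(Runnable work) { this.work = work; }

    /** Runs the work in a transaction; a failure is rolled back and reported as RETRY. */
    public State execute(Context ctx) {
        ctx.beginTransaction();
        try {
            work.run();
            ctx.commit();
            return StepState.DONE;
        } catch (RuntimeException transientFailure) {
            ctx.rollback();
            return StepState.RETRY;
        }
    }

    /** After a crash: discard any half-finished work, then run the step again. */
    public State recover(Context ctx) {
        ctx.rollback();
        return execute(ctx);
    }
}
```

    With this shape, the generic part of the framework only ever sees RestartableCommand, Context, and State, so a new application would supply its own commands and states without touching the generic classes.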

    The following sequence diagrams clarify the interactions between the above classes:

[Sequence diagrams not reproduced. Messages recoverable from the extracted text:]

RECOVER (participants: CheckpointManager, CommandHandler, ActionCmdn)

    1 : getLastSavedCheckPoint ( )
    2 : [lastSavedCheckPoint != null] getNextCommand ( s )
    3 : create currentCommand : ActionCmdn

EXECUTE (participants: Application, CheckpointManager, CommandHandler)

    1 : CheckpointManager ( Context , CommandHandler )
    2 : create : CheckpointManager
    3 : execute ( Context , ApplicationContext )
    4 : getLastSavedCheckPoint ( )
    5 : [lastCheckPoint == null] getInitialCommand ( )
        (These steps will be different if a recovery is in place; see the above box for details.)
    6 : create currentCommand
    7 : execute ( ctx )
    [Con...
    Repeat this block wh...
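    The message flow in the sequence diagrams can be approximated by the Java sketch below. Several points are assumptions, because the extracted text is incomplete: the checkpoint is taken to be the last State returned by a command, getNextCommand is assumed to return null when the flow is finished, recover is assumed to be called on the first command after a restart, execute takes no arguments for brevity, and the checkpoint is held in memory as a stand-in for durable storage. StepDone, NoOpContext, and LinearFlow are hypothetical helpers.

```java
import java.util.ArrayList;
import java.util.List;

interface State { }
interface Context { void beginTransaction(); void rollback(); void commit(); }
interface RestartableCommand { State recover(Context ctx); State execute(Context ctx); }

interface CommandHandler {
    RestartableCommand getInitialCommand();
    RestartableCommand getNextCommand(State s); // assumed to return null when finished
}

class CheckpointManager {
    private final Context ctx;
    private final CommandHandler handler;
    private State current;
    private State saved; // in-memory stand-in for a durable checkpoint

    CheckpointManager(Context ctx, CommandHandler handler) {
        this.ctx = ctx;
        this.handler = handler;
    }

    State getLastSavedCheckPoint() { return saved; }
    void saveCheckPoint() { saved = current; }

    /** Starts from the initial command, or resumes after the last checkpoint if one exists. */
    void execute() {
        State last = getLastSavedCheckPoint();
        boolean recovering = last != null;
        RestartableCommand cmd = recovering ? handler.getNextCommand(last)
                                            : handler.getInitialCommand();
        while (cmd != null) {
            current = recovering ? cmd.recover(ctx) : cmd.execute(ctx);
            recovering = false;  // only the first command after a restart is recovered
            saveCheckPoint();
            cmd = handler.getNextCommand(current); // state-machine transition
        }
    }
}

/** Hypothetical state: index of the step that just completed. */
final class StepDone implements State {
    final int index;
    StepDone(int index) { this.index = index; }
}

/** Hypothetical Context that does nothing. */
class NoOpContext implements Context {
    public void beginTransaction() { }
    public void rollback() { }
    public void commit() { }
}

/** Hypothetical linear flow of commands; step crashAt throws on its first execute(). */
class LinearFlow implements CommandHandler {
    final List<String> log = new ArrayList<>();
    private final int steps;
    private int crashAt;

    LinearFlow(int steps, int crashAt) { this.steps = steps; this.crashAt = crashAt; }

    public RestartableCommand getInitialCommand() { return command(0); }

    public RestartableCommand getNextCommand(State s) {
        int next = ((StepDone) s).index + 1;
        return next < steps ? command(next) : null;
    }

    private RestartableCommand command(final int i) {
        return new RestartableCommand() {
            public State execute(Context ctx) {
                if (i == crashAt) {
                    crashAt = -1; // crash only once
                    throw new IllegalStateException("simulated crash in step " + i);
                }
                log.add("execute " + i);
                return new StepDone(i);
            }
            public State recover(Context ctx) {
                log.add("recover " + i);
                return new StepDone(i);
            }
        };
    }
}
```

    In this sketch a restart is simulated by calling execute() a second time on the same CheckpointManager. With new LinearFlow(3, 1), the first call runs step 0 and then fails in step 1; the second call finds the checkpoint left by step 0, recovers step 1, and executes step 2 normally.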