Fault Isolation and Job Restart

IP.com Disclosure Number: IPCOM000078967D
Original Publication Date: 1973-Apr-01
Included in the Prior Art Database: 2005-Feb-26
Document File: 1 page(s) / 12K

Publishing Venue

IBM

Related People

Benes, RE: AUTHOR

Abstract

Problem programs for operating computers are conventionally provided with checkpoints at which enough information is recorded to continue the program if a failure later occurs and the program must be restarted. These checkpoints can be located only at certain logical boundaries in the problem program. Hardware-operated checkpoints may also be provided to record, for example, the status of each latch in the system. A hardware checkpoint can be located at any machine cycle time without regard to the status of the problem program. One advantageous use of hardware checkpoints is to locate them at more frequent intervals than the usual program checkpoints.

This is the abbreviated version, containing approximately 70% of the total text.

Problem programs for operating computers are conventionally provided with checkpoints at which enough information is recorded to continue the program if a failure later occurs and the program must be restarted. These checkpoints can be located only at certain logical boundaries in the problem program. Hardware-operated checkpoints may also be provided to record, for example, the status of each latch in the system. A hardware checkpoint can be located at any machine cycle time without regard to the status of the problem program. One advantageous use of hardware checkpoints is to locate them at more frequent intervals than the usual program checkpoints. When a hardware failure occurs, the problem program can then be restarted from the last hardware checkpoint on a different machine that does not have the failure.
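The restart scheme described above can be illustrated with a small software simulation. This is only a sketch under stated assumptions, not the disclosed hardware: the Machine class, its "acc" latch, the faulty_cycle parameter, and run_job are all hypothetical names invented for the illustration, and the "latches" are modeled as a plain dictionary.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


class HardwareFault(Exception):
    """Raised by the simulated machine when its (hypothetical) fault fires."""


@dataclass
class Machine:
    name: str
    faulty_cycle: Optional[int] = None  # assumption: fault fires at one fixed cycle
    cycle: int = 0
    latches: dict = field(default_factory=lambda: {"acc": 0})

    def step(self):
        # The fault is checked before any state changes, so a failed
        # cycle leaves the machine state untouched in this model.
        if self.faulty_cycle is not None and self.cycle == self.faulty_cycle:
            raise HardwareFault(f"{self.name} failed at cycle {self.cycle}")
        self.latches["acc"] += self.cycle  # stand-in for real work
        self.cycle += 1

    def snapshot(self):
        # Hardware checkpoint: record the cycle count and every latch.
        return (self.cycle, copy.deepcopy(self.latches))

    def load(self, snap):
        self.cycle, latches = snap
        self.latches = copy.deepcopy(latches)


def run_job(primary, spare, total_cycles, checkpoint_interval):
    """Run on `primary`; on a fault, resume on `spare` from the last checkpoint."""
    machine = primary
    checkpoint = machine.snapshot()
    while machine.cycle < total_cycles:
        if machine.cycle % checkpoint_interval == 0:
            checkpoint = machine.snapshot()
        try:
            machine.step()
        except HardwareFault:
            spare.load(checkpoint)  # restart from the last hardware checkpoint
            machine = spare
    return machine.name, machine.latches["acc"]


result = run_job(Machine("A", faulty_cycle=7), Machine("B"), 10, 4)
```

In this model only the cycles since the last hardware checkpoint are repeated on the spare machine, which is the benefit of placing hardware checkpoints more densely than program checkpoints.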

If the program is to be rerun on the failing computer, the hardware checkpoints can be set at more frequent intervals between the last checkpoint before the failure and the point of failure. On successive runs of the problem program, with the hardware checkpoints at successively closer intervals, the machine cycle of the failure can be identified.
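The narrowing procedure in the preceding paragraph can be sketched as a simulation. The sketch assumes a deterministic fault that always appears at the same machine cycle, and it assumes a checkpoint taken at the failing cycle itself is still recorded before the fault fires; the function names, the shrink factor, and the interval values are hypothetical choices for the illustration and do not come from the disclosure.

```python
def run_with_checkpoints(start, end, interval, fail_cycle):
    """Simulate one rerun from cycle `start`.

    Checkpoints are taken every `interval` cycles. Returns the cycle of
    the last checkpoint recorded before the failure at `fail_cycle`.
    """
    last_good = start
    cycle = start
    while cycle + interval <= end and cycle + interval <= fail_cycle:
        cycle += interval
        last_good = cycle
    return last_good


def locate_failing_cycle(start, end, initial_interval, fail_cycle, shrink=10):
    """Rerun with successively closer checkpoints until the interval is 1.

    Returns (failing_cycle, number_of_runs). After each run the failure
    is known to lie in the window [lo, hi).
    """
    lo, hi = start, end
    interval = initial_interval
    runs = 0
    while True:
        last_good = run_with_checkpoints(lo, hi, interval, fail_cycle)
        runs += 1
        lo = last_good
        hi = min(hi, last_good + interval)
        if interval == 1:
            return lo, runs
        # Tighten the checkpoint spacing for the next run.
        interval = max(1, interval // shrink)


located, runs = locate_failing_cycle(0, 1000, 100, fail_cycle=637)
```

Each rerun restarts from the last good checkpoint of the previous run, so only the shrinking window between that checkpoint and the point of failure is re-executed.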

In the early stages of development of a computer, hardware failures may be found frequently. With integrated circuits, there is a significant time lag in correcting such a hardware failure. The failing machine cycle is identified by the proced...