
Method for Dynamic Handling of MACHINE Checks

IP.com Disclosure Number: IPCOM000036689D
Original Publication Date: 1989-Oct-01
Included in the Prior Art Database: 2005-Jan-29
Document File: 3 page(s) / 44K

Publishing Venue

IBM

Related People

Hester, RL: AUTHOR [+2]

Abstract

A machine check is considered to be a hardware failure.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 53% of the total text.


A machine check is considered to be a hardware failure.

A hardware failure is normally a reason to stop a system or, at least, to interrupt its normal clocking in order to scan out and/or reset hardware, log machine failure data, and then resume normal clocking. This period can be longer than time-dependent devices on the channels can withstand, so those devices in turn time out and raise Interface Control Checks (IFCCs). The operating system's error recovery routines must then handle the IFCCs, which are considered soft errors if they are infrequent.
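
To make the timing argument concrete, the following Python sketch checks which channel-attached devices would time out for a given clock-stop window. The device names and timeout values are hypothetical and chosen only for illustration; real tolerances depend on the device and the channel protocol.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    timeout_us: float  # hypothetical: longest stall this device tolerates


def devices_raising_ifcc(devices, clock_stop_us):
    """Return the names of devices whose tolerance is shorter than the clock-stop window."""
    return [d.name for d in devices if clock_stop_us > d.timeout_us]


devices = [
    Device("tape", timeout_us=500.0),
    Device("disk", timeout_us=2_000.0),
    Device("comm", timeout_us=100.0),
]
print(devices_raising_ifcc(devices, clock_stop_us=1_000.0))  # ['tape', 'comm']
```

With dynamic handling the clock-stop window is effectively zero for the channels that did not fail, so none of their devices appear in this list.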

With dynamic handling of machine checks (where possible), the clocks need not be stopped, and recovery is confined to the failing area. An area can use this method if it can be isolated so that it does not affect other areas of the system. A channel is one such area.

If Channel 1 has an internal intermittent hardware failure that causes a machine check, there is no reason to stop the clocks for the entire subsystem and thereby cause overruns on the other channels.
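
A minimal Python sketch of the difference, assuming a hypothetical four-channel subsystem: conventional handling stops every channel, while dynamic handling parks only the failing channel. The names `State`, `handle_check_globally`, `handle_check_dynamically`, and `disrupted_bystanders` are illustrative and are not part of the original design.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    NULL = "null"          # parked: does nothing, clocks still running
    STOPPED = "stopped"    # clocks stopped for scan-out (conventional handling)


def handle_check_globally(channels):
    """Conventional handling: a machine check anywhere stops clocks for every channel."""
    return {ch: State.STOPPED for ch in channels}


def handle_check_dynamically(channels, failing):
    """Dynamic handling: only the failing channel enters the Null-state."""
    return {ch: State.NULL if ch == failing else State.RUNNING for ch in channels}


def disrupted_bystanders(states, failing):
    """Channels other than the failing one that can no longer run (risking overruns)."""
    return sorted(ch for ch, s in states.items() if ch != failing and s is not State.RUNNING)


channels = [0, 1, 2, 3]
print(disrupted_bystanders(handle_check_globally(channels), failing=1))                # [0, 2, 3]
print(disrupted_bystanders(handle_check_dynamically(channels, failing=1), failing=1))  # []
```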

With dynamic handling, the machine check causes that channel to enter a Null-state, corrects the parity of the failing register or control lines, signals the error to the hardware related to Channel 1 without propagating the error condition further, and signals an error to the service processor. The service processor interrogates the machine check latches and logs the error codes for later service records or inquiry. It then signals the hardware to reset the Channel 1 machine checks and to report a Channel 1 control check to the operating system.
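
The parity correction described above can be sketched as regenerating the parity bit from the register's current contents. This sketch assumes odd parity over an 8-bit register, which is an illustrative assumption rather than a detail from the disclosure. The data bits are left unchanged, so the register is amended rather than corrected, and the parity check stops firing. The function names are hypothetical.

```python
def odd_parity_bit(byte: int) -> int:
    """Parity bit that makes the total count of 1s (data plus parity) odd."""
    return 0 if bin(byte & 0xFF).count("1") % 2 == 1 else 1


def parity_ok(byte: int, parity_bit: int) -> bool:
    """True if data plus parity bit has an odd number of 1s (assumed odd parity)."""
    return (bin(byte & 0xFF).count("1") + parity_bit) % 2 == 1


def amend_parity(register):
    """Regenerate parity for a (data, parity) register; the data itself is NOT repaired."""
    data, _bad_parity = register
    return (data, odd_parity_bit(data))


failing = (0xB0, 1)                  # three data 1s plus parity 1 -> even -> parity error
print(parity_ok(*failing))           # False
print(parity_ok(*amend_parity(failing)))  # True
```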

WHEN A MACHINE CHECK HAPPENS:
1. Stop operation but don't stop clocks.
2. Don't propagate machine check.
3. Notify related hardware.
4. Notify service processor.
5. Wait for SP response.
6. Reset machine check latch.
7. Reset hardware.
8. Notify operating system.
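
The eight steps can be sketched as a sequential handler in Python. `Channel`, `ServiceProcessor`, `handle_machine_check`, and the error code string are hypothetical stand-ins for channel latches and service processor code. Step 5 is modeled as a synchronous call, whereas real hardware would wait for an asynchronous signal from the service processor.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    number: int
    state: str = "running"
    check_latch: bool = False      # machine check latch for this channel only


@dataclass
class ServiceProcessor:
    error_log: list = field(default_factory=list)

    def interrogate(self, channel: Channel, error_code: str) -> None:
        # Record the error code and the channel state seen at interrogation time.
        self.error_log.append((channel.number, error_code, channel.state))


def handle_machine_check(ch: Channel, sp: ServiceProcessor, error_code: str) -> list:
    """Run the eight steps in order and return a trace of what was done."""
    trace = []
    ch.check_latch = True
    ch.state = "null"                       # 1. stop operation; clocks keep running
    trace.append("1 stop operation")
    trace.append("2 contain check")         # 2. error is not forwarded to other channels
    trace.append("3 notify related hardware")
    trace.append("4 notify service processor")
    sp.interrogate(ch, error_code)          # 5. SP logs while the channel sits in Null-state
    trace.append("5 service processor responded")
    ch.check_latch = False                  # 6. reset machine check latch
    trace.append("6 reset latch")
    ch.state = "ready"                      # 7. reset hardware; forced out of Null-state
    trace.append("7 reset hardware")
    trace.append(f"8 channel {ch.number} control check reported to OS")
    return trace


ch1 = Channel(number=1)
sp = ServiceProcessor()
for step in handle_machine_check(ch1, sp, "REG-PARITY"):
    print(step)
```

Logging happens while the channel is still in the Null-state, and the latch is reset only after the service processor has responded, which mirrors the ordering of steps 5 and 6.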

1. STOP OPERATION BUT DON'T STOP CLOCKS: There are various ways of doing this, but the end result must be that the hardware associated with the machine check no longer attempts any useful operation until cleanup and logging are complete. For a pico-code-driven engine, this could be a pico-code word that branches to itself. For hardware sequencing, it would be a sequence that does nothing, or one that forces a register to zeros so that no operation is attempted. The essential point is to keep the clocks running and to remain in the Null-state until forced back into operation. Null-state means doing nothing and examining nothing until forced into another state.
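
The pico-code approach can be sketched as a toy sequencer in which the Null-state word branches to itself. The control store layout, the address 0x3F, and the class and method names are hypothetical. The cycle counter keeps advancing in the Null-state, showing that the clocks are not stopped, while the count of useful operations stays fixed.

```python
# Hypothetical control store: each word is (operation, next_address).
CONTROL_STORE = {
    0: ("fetch", 1),
    1: ("transfer", 0),
    0x3F: ("null", 0x3F),   # Null-state word: branches to itself and does nothing
}
NULL_ADDR = 0x3F


class Engine:
    def __init__(self):
        self.pc = 0
        self.cycles = 0        # clocks keep running in every state
        self.useful_ops = 0

    def clock(self):
        """Execute one control word per clock cycle."""
        op, nxt = CONTROL_STORE[self.pc]
        self.cycles += 1
        if op != "null":
            self.useful_ops += 1
        self.pc = nxt

    def machine_check(self):
        self.pc = NULL_ADDR    # enter the Null-state; clocks are NOT stopped

    def force_reset(self):
        self.pc = 0            # only an external force leaves the Null-state


e = Engine()
for _ in range(4):
    e.clock()
e.machine_check()
for _ in range(10):
    e.clock()                  # spins on the self-branching word
print(e.pc == NULL_ADDR, e.cycles, e.useful_ops)  # True 14 4
```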


2. DON'T PROPAGATE MACHINE CHECK: Whatever caused the machine check should be amended (not necessarily corrected) to prevent further machine checks. One way to do this is to pre...