Alternate CPU Recovery

IP.com Disclosure Number: IPCOM000080311D
Original Publication Date: 1973-Nov-01
Included in the Prior Art Database: 2005-Feb-27
Document File: 5 page(s) / 22K

Publishing Venue

IBM

Related People

Casey, DP: AUTHOR [+2]

Abstract

Alternate CPU Recovery (ACR) is that process which is invoked when a CPU, in a tightly coupled multiprocessing environment, can no longer function. The invocation of ACR is the result of a "signal" that is sent by the dying CPU before it enters a permanent wait or stopped state. This signal can be either hardware or software generated.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 22% of the total text.

Alternate CPU Recovery

Alternate CPU Recovery (ACR) is that process which is invoked when a CPU, in a tightly coupled multiprocessing environment, can no longer function. The invocation of ACR is the result of a "signal" that is sent by the dying CPU before it enters a permanent wait or stopped state. This signal can be either hardware or software generated.

The hardware-generated signal is known as a Malfunction Alert (MFA). An MFA is generated under two conditions. The first is when the hardware determines that it is damaged to such an extent that it cannot even generate a machine check interrupt. The second is when a machine check occurs while the CPU is disabled for machine check interrupts. (Note that in MVM, the only time the CPU is disabled for machine checks is during the first part of MCH (machine check handler) processing.) Under either of these circumstances, after generating the MFA, the CPU enters a hard-stop "red-light" state (the "check stop" state).
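The two MFA conditions can be read as a simple decision. The Python sketch below is an illustrative model only, not actual hardware or system code; the function name `hardware_response` and its two boolean parameters are hypothetical names introduced here.

```python
from enum import Enum, auto


class HardwareResponse(Enum):
    MACHINE_CHECK_INTERRUPT = auto()  # normal path: MCH receives control
    MALFUNCTION_ALERT = auto()        # MFA is sent, then the CPU check-stops


def hardware_response(can_generate_machine_check: bool,
                      enabled_for_machine_checks: bool) -> HardwareResponse:
    """Model how a faulting CPU reports an error (hypothetical helper)."""
    # Condition 1: the damage is so severe that not even a machine
    # check interrupt can be generated.
    if not can_generate_machine_check:
        return HardwareResponse.MALFUNCTION_ALERT
    # Condition 2: a machine check occurs while the CPU is disabled for
    # machine check interrupts (the first part of MCH processing).
    if not enabled_for_machine_checks:
        return HardwareResponse.MALFUNCTION_ALERT
    # Otherwise the error is presented as an ordinary machine check interrupt.
    return HardwareResponse.MACHINE_CHECK_INTERRUPT
```

In this model, a machine check that arrives while MCH is still in its disabled first part maps to `MALFUNCTION_ALERT`, matching the second condition above.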

The software-generated signal is known as an Emergency Signal (EMS). It is generated by MCH when, after analyzing a machine check, it determines that the CPU can no longer function properly. This determination is based either on an analysis done by the MCH software or on the occurrence of another machine check during the portion of MCH processing that is enabled for machine check interrupts. After generating the emergency signal, MCH loads a disabled wait state PSW (program status word).
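As a rough illustration of the software path, the sketch below lists the actions MCH takes once it decides the CPU is unusable. It is a simplified model under stated assumptions: the function `mch_actions`, its parameters, and the action strings are hypothetical, and what MCH does when the CPU remains usable is outside this text, so the sketch returns an empty list in that case.

```python
from __future__ import annotations


def mch_actions(analysis_finds_cpu_unusable: bool,
                second_check_while_enabled: bool) -> list[str]:
    """Return the ordered actions MCH takes in this simplified model."""
    # Either trigger is sufficient: the MCH software analysis, or another
    # machine check during the enabled portion of MCH processing.
    if analysis_finds_cpu_unusable or second_check_while_enabled:
        return [
            "generate emergency signal (EMS)",
            "load disabled wait state PSW",
        ]
    # Assumption: recovery of a still-usable CPU is not modeled here.
    return []
```

The ordering in the returned list reflects the text: the signal is sent first, and only then does the failing CPU enter the disabled wait state.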

The ACR process is invoked on a "good" CPU when it receives either an MFA or the described EMS.
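The invocation rule can be sketched as a small dispatch on the receiving CPU. This is a minimal model, not the actual ACR entry logic: the `System` class, its fields, the `OTHER` signal kind, and the step that drops the failed CPU from an online set are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum, auto


class Signal(Enum):
    MALFUNCTION_ALERT = auto()  # hardware generated (MFA)
    EMERGENCY_SIGNAL = auto()   # software generated by MCH (EMS)
    OTHER = auto()              # hypothetical: any unrelated signal


@dataclass
class System:
    online_cpus: set[int]
    # (receiver, failed_cpu) pairs, recorded each time ACR is invoked
    acr_invocations: list[tuple[int, int]] = field(default_factory=list)

    def receive(self, receiver: int, sender: int, signal: Signal) -> bool:
        """Return True if ACR is invoked on the receiving CPU."""
        if signal not in (Signal.MALFUNCTION_ALERT, Signal.EMERGENCY_SIGNAL):
            return False
        # Only a "good" CPU other than the sender can run ACR.
        if receiver == sender or receiver not in self.online_cpus:
            return False
        # Assumption: the system continues without the failed CPU.
        self.online_cpus.discard(sender)
        self.acr_invocations.append((receiver, sender))
        return True
```

For example, with CPUs 0 and 1 online, `System({0, 1}).receive(0, 1, Signal.MALFUNCTION_ALERT)` returns `True` and leaves only CPU 0 in `online_cpus`.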

The objective of ACR is to enable the system to continue without the use of the "dead" CPU. In a loose sense, the objective of ACR is to avoid "sympathy sickness", that is, to keep the failure of one CPU from bringing down the rest of the system.

When the system resumes normal operation, it will do so in a degraded fashion. Obviously, the reduction of available CPU power contributes to this degradation. Also, jobs that require the dead CPU in order to execute (for an emulator feature, or for one or more devices available only to the dead CPU) will fail if in progress, or will not be permitted to run. If a significant number of jobs are affected (in the extreme case, all of them), the meaningfulness of continued system operation is questionable. Note, however, that ACR will not attempt to pass judgment on the meaningfulness of system operation. ACR will enable the system to continue; the decision to terminate the system is left to the system operator.

Design Objective

The primary design objective for ACR is to bring the system to a point at which it can resume normal operation. To do this, the CPU that receives the MFA or EMS takes responsibility for the work that was in progress on the dead CPU at the time of its failure. ACR will cause the cleanup of that work through any functional recovery routines (FRRs) that were established prior to the malfunction, and will monitor that cleanup, and its interactions with work that was in progress in the good CPU, until a poin...