
Method to prevent machine checkstop during I/O subsystem failure analysis from machine check interrupt handler and greatly enhance machine failure recovery

IP.com Disclosure Number: IPCOM000016490D
Original Publication Date: 2003-Jun-25
Included in the Prior Art Database: 2003-Jun-25
Document File: 3 page(s) / 60K

Publishing Venue

IBM

Abstract

In the current generation of GP, GQ, and GR processors, I/O load failures may be signaled to the processor by a machine check interrupt. Under certain conditions, a second machine check can occur while the first is still being processed, which causes a checkstop and crashes the entire machine. This is an extremely undesirable event.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 51% of the total text.


In the current generation of GP, GQ, and GR processors, I/O load failures may be signaled to the processor by a machine check interrupt. This interrupt invokes the machine check interrupt handler at a high exception processing priority, and the handler is typically used to analyze system failures. Some I/O adapters must rely on this operation to receive error notification in the pSeries* hardware. If one of these processors is executing in the machine check state, a second machine check causes a checkstop, which crashes the entire machine. This is an extremely undesirable event.
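The nesting rule described above can be illustrated with a small state model. This is only a sketch under stated assumptions: the names `CpuState` and `ToyProcessor` are hypothetical and do not come from the disclosure, and real processors have far more state than the three values shown here.

```python
from enum import Enum, auto


class CpuState(Enum):
    NORMAL = auto()         # ordinary execution
    MACHINE_CHECK = auto()  # running the machine check interrupt handler
    CHECKSTOP = auto()      # processor halted; the entire machine is down


class ToyProcessor:
    """Hypothetical model of the nesting rule, not real hardware behavior."""

    def __init__(self):
        self.state = CpuState.NORMAL

    def raise_machine_check(self):
        if self.state is CpuState.NORMAL:
            # First machine check: enter the handler.
            self.state = CpuState.MACHINE_CHECK
        elif self.state is CpuState.MACHINE_CHECK:
            # Second machine check while the first is still being handled.
            self.state = CpuState.CHECKSTOP
        # Already checkstopped: nothing further can happen.
        return self.state

    def return_from_handler(self):
        # Leaving the handler makes a later machine check survivable again.
        if self.state is CpuState.MACHINE_CHECK:
            self.state = CpuState.NORMAL
        return self.state
```

In this model, the only way to avoid the checkstop is to make sure that nothing the handler does can trigger a second machine check before it returns.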

Shown in Figure 1 is a simplified block diagram of a multiprocessor system with N processors, Memory cards and an I/O book.

The processors, memory, and I/O book are connected via the Host GX Bus. The I/O drawers (not shown) are connected via RIO cables through the I/O book.

Each processor may have multiple concurrent I/O streams, consisting of MMIO load/store accesses to the I/O adapters (IOAs). MMIO accesses may be used to:

     Move data directly to/from the IOAs.
     Start and stop DMA activity from the IOAs to memory.
     Collect DMA status from the IOAs.

Each IOA in the system may also have multiple concurrent I/O streams, consisting of DMA activity to/from memory.

All I/O access, either MMIO or DMA, flows through the pathway shown in Figure 2.

In normal operation, the PCI IOA is the target of MMIO load/store activity and the source of DMA read/write activity. Each PCI Host Bridge (PHB) is shared by up to four PCI IOAs. If a PCI IOA incurs an error, the error may be propagated to the PHB. An error propagated to the PHB will "freeze" the PHB, causing all subsequent MMIO and DMA operations to fail.
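The freeze behavior can be sketched as a toy model. Everything named here (`ToyPHB`, `PHBFrozenError`, `propagate_ioa_error`, `access`) is hypothetical and chosen for illustration. The model also assumes that the IOA error is in fact propagated to the PHB, which the text says only "may" happen.

```python
class PHBFrozenError(Exception):
    """Raised when an access reaches a frozen PHB (hypothetical name)."""


class ToyPHB:
    MAX_IOAS = 4  # per the text, a PHB is shared by up to four PCI IOAs

    def __init__(self, slot_owner):
        # slot_owner maps an IOA slot number to the partition that owns it.
        if len(slot_owner) > self.MAX_IOAS:
            raise ValueError("a PHB is shared by at most four PCI IOAs")
        self.slot_owner = dict(slot_owner)
        self.frozen = False

    def propagate_ioa_error(self, slot):
        # Assumption: the error on this slot is propagated to the PHB.
        if slot not in self.slot_owner:
            raise KeyError(slot)
        self.frozen = True

    def access(self, slot, kind):
        # kind is "mmio" or "dma"; both flow through the same PHB.
        if kind not in ("mmio", "dma"):
            raise ValueError(kind)
        if slot not in self.slot_owner:
            raise KeyError(slot)
        if self.frozen:
            raise PHBFrozenError(f"{kind} access on slot {slot}: PHB is frozen")
        return "ok"
```

Note that once `frozen` is set, an access on any slot fails, not only on the slot whose IOA reported the error. That is what makes a shared PHB a shared point of failure.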

Each PCI IOA slot may be assigned to a different, independent partition. If an IOA belonging to Partition 1 freezes the shared PHB as a result of an error, every partition sharing that PHB will fail on its next access, whether MMIO or DMA. This may result in the failure of up to four partitions.

The PHB can be unfrozen and recovered through a series of complex software manipulations; however, this action must wait until all partitions using the PHB have failed. If the PHB is unfrozen before all of those partitions have failed, a partition may access the PHB during recovery operations in such a way that the PHB is refrozen. This may result in a machine checkstop and failure of the entire machine.
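The ordering constraint in the preceding paragraph can be expressed as a simple gate. This is a minimal sketch of the constraint as described in this background text, not of the method the disclosure goes on to propose. The names `ToyPHBRecovery`, `note_partition_failed`, `pending`, and `safe_to_unfreeze` are hypothetical.

```python
class ToyPHBRecovery:
    """Tracks which partitions sharing a frozen PHB have already failed."""

    def __init__(self, sharing_partitions):
        self.sharing = frozenset(sharing_partitions)
        self.failed = set()

    def note_partition_failed(self, partition):
        # Record that a partition has hit the frozen PHB and failed.
        if partition not in self.sharing:
            raise ValueError(f"{partition!r} does not share this PHB")
        self.failed.add(partition)

    def pending(self):
        # Partitions that could still access the PHB during recovery.
        return self.sharing - self.failed

    def safe_to_unfreeze(self):
        # Unfreezing earlier risks a refreeze, and possibly a checkstop.
        return not self.pending()
```

A recovery routine built on this model would poll `safe_to_unfreeze()` and begin the unfreeze sequence only once it returns `True`.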


[Figure 1: Block diagram. Host CEC (microprocessors and memory) containing Processor[0] through Processor[N] and Memory Card[0] through Memory Card[m]; Host GX Bus; I/O Book containing Host Bus to RIO Bus Bridge[0] through Bridge[p], with Ports 0 through 3; RIO cables to/from the I/O drawers.]

Shown in Figure 2 is a...