
Method to prevent termination of multiple partitions as a result of I/O error propagation in shared hardware LPAR systems.

IP.com Disclosure Number: IPCOM000013213D
Original Publication Date: 2003-Jun-18
Included in the Prior Art Database: 2003-Jun-18
Document File: 3 page(s) / 60K

Publishing Venue

IBM

Abstract

In a Logically Partitioned (LPAR) system where the I/O subsystem is a shared resource, with I/O resources assigned to partitions at the slot level, a single I/O Adapter error can propagate across what are designed to be independent Logical Partitions.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 51% of the total text.


   This article describes a software solution for recovering from the errors caused when a single PCI adapter fails and propagates errors into hardware shared by multiple partitions in an LPAR system. In this system multiple slots share a PCI to PCI Host Bridge (PHB). In the event of an error on any slot not using Extended Error Handling (EEH), the error is propagated to the PHB above it, which is contained within the RIO to PCI Bridge. As a result, this PHB enters a "freeze" mode that causes all future accesses to any slot below this PHB to fail until the error is cleared. Because this is a multiprocessor, multiple-partition system, the error cannot be cleared without a complete system shutdown, and this results in the eventual termination of any partition using the "frozen" PHB. In the current design, a single I/O card failure can therefore terminate up to four (4) partitions. Figure 1 below shows the topology of one branch of this hierarchy; in an actual system there can be up to 16 branches.
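
The failure mode described above can be illustrated with a toy model. This is only a sketch, not the hardware behavior in full: the `Phb` class, its method names, and the assumption of four slots with one partition per slot are hypothetical and chosen to match the "up to four partitions" figure quoted above.

```python
from dataclasses import dataclass


@dataclass
class Phb:
    """Toy model of one PHB and the slots below it (hypothetical names)."""
    slots: dict            # slot number -> name of the owning partition
    frozen: bool = False   # True once the PHB has entered "freeze" mode

    def access(self, slot, adapter_fails=False):
        """Return True if an access to the slot succeeds, False otherwise."""
        if slot not in self.slots:
            raise KeyError(slot)
        if self.frozen:
            # Every access below a frozen PHB fails, whichever slot is targeted.
            return False
        if adapter_fails:
            # Error on a non-EEH slot propagates upward and freezes the PHB.
            self.frozen = True
            return False
        return True

    def affected_partitions(self):
        """Partitions that lose I/O because they share the frozen PHB."""
        return sorted(set(self.slots.values())) if self.frozen else []
```

With slots 1 to 4 owned by four different partitions, a single failing access to slot 2 leaves the PHB frozen, so a later access to slot 1 also fails and all four partitions appear in `affected_partitions()`.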

The actual operations are described below:

Figure 1 (image not reproduced; labels recovered from the extracted text): Host Bus; Host Bus to RIO Bus Bridge; RIO Bus; RIO to PCI Bridge containing PHB0, PHB1 and PHB2; PCI Bus 0, PCI Bus 1 and PCI Bus 2; three PCI to PCI Bridges; PCI Slots 1 - 4, PCI Slots 5 - 8 and PCI Slots 9 - 12.

Referring to Figure 1 above:

1. A PIO access is made to a PCI slot by device driver code executing within a partition. This access is made via the Host Bus through the various bridges to the PCI slot. Each of the slots
2. The target address is used to determine the slot.
3. A selector algorithm is used to determine which PHB is being accessed, and the accessor must obtain the lock for that PHB. The lock method is unique in that a table is used to determine which lock is accessed, and in that the table is consulted on each and every pass into the lock. This table is dynamically modifiable to give the required properties and is described later in detail.
4. If the lock is not in use, it is obtained and the I/O access takes place.
4A. If the lock is not obtained, the accessor waits in the spin lock loop.

5. If the I/O is successful, the lock is released. This ends the processing phase: processing is complete and the results are normal.
5A. If an I/O error occurs on the adapter, a Machine Check Interrupt (MCI) occurs and processor execution vectors to the interrupt handler. The lock remains unavailable until the MCI code completes the interrupt-level code consisting of steps 6 - 17, returns, and releases the lock.
6. The MCI handler begins execution and determines an I/O error has occurred.
7. The PCI bridge and PHB are determined from the error registers.
8. The PCI Bridge arbitration is disabled, preventing further IOA DMA errors.
9. The PHB error state is released, "unfreezing" the...
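
The steps recovered above can be sketched as a small single-process model. It is an illustration under stated assumptions, not the disclosed implementation: the address layout (`SLOT_WINDOW`), the slot-to-PHB mapping, the one-bridge-per-PHB indexing, the `max_spins` test aid, and every function name are hypothetical, and an `IOError` stands in for the Machine Check Interrupt. Only steps 1 - 9 are modeled, because the remaining steps are not part of the abbreviated text.

```python
import threading

NUM_PHBS = 3            # PHB0..PHB2, as labeled in Figure 1
SLOTS_PER_PHB = 4       # assumption: three groups of four slots
SLOT_WINDOW = 0x1000    # hypothetical size of each slot's address window

locks = [threading.Lock() for _ in range(NUM_PHBS)]
lock_table = list(range(NUM_PHBS))        # PHB index -> lock index, modifiable at run time
arbitration_enabled = [True] * NUM_PHBS   # assumption: one PCI bridge per PHB
phb_frozen = [False] * NUM_PHBS


def slot_for_address(addr):
    """Step 2: derive the 1-based slot number from the target address."""
    return addr // SLOT_WINDOW + 1


def phb_for_slot(slot):
    """Step 3: selector mapping a slot to the index of the PHB above it."""
    return (slot - 1) // SLOTS_PER_PHB


def acquire(phb, max_spins=None):
    """Steps 3 - 4A: spin until the lock named by the table is obtained.

    The table is re-read on every pass, so a change to it takes effect
    while a caller is still spinning. max_spins is a test aid only.
    """
    spins = 0
    while max_spins is None or spins < max_spins:
        lock = locks[lock_table[phb]]
        if lock.acquire(blocking=False):
            return lock
        spins += 1
    return None


def handle_io_error(phb):
    """Steps 6 - 9 only; later steps are absent from the abbreviated text."""
    arbitration_enabled[phb] = False   # step 8: stop further adapter DMA
    phb_frozen[phb] = False            # step 9: release the PHB error state


def pio_access(addr, device_io):
    """Steps 1 - 5A: return the I/O result, or None if the adapter failed."""
    slot = slot_for_address(addr)
    phb = phb_for_slot(slot)
    lock = acquire(phb)
    try:
        return device_io(slot)         # step 4: the I/O access itself
    except IOError:
        phb_frozen[phb] = True         # the hardware freezes the PHB
        handle_io_error(phb)           # the lock is still held here (step 5A)
        return None
    finally:
        lock.release()                 # step 5, or the end of the error path
```

Holding the lock across `handle_io_error` is the point of the design as described: other accessors that target the same PHB stay in the spin loop until the error handling has finished, rather than issuing accesses to a frozen PHB.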