Method for Enhanced CAPI I/O Error Handling
Disclosure Number: IPCOM000249312D
Publication Date: 2017-Feb-16
Document File: 2 page(s) / 67K

Publishing Venue

The Prior Art Database


Disclosed is a method for enhancing Coherent Accelerator POWER Interface (CAPI) error handling by placing all active contexts on a garbage list. This solution increases application availability and tolerance of device errors, which is preferable to rebooting the operating system.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.


Method for Enhanced CAPI I/O Error Handling

Coherent Accelerator POWER Interface (CAPI) accelerators are fundamentally input/output (I/O) devices. Like any I/O device, a CAPI device may experience various types of errors (e.g., a CAPI bus link going down, a CAPI host bridge freeze, or errors internal to the coherency unit on the Central Processing Unit (CPU)). Because CAPI devices have direct access to application memory, such errors can cause application outages. If the CAPI hardware contexts are used inside the kernel, such an error may even crash the operating system (OS). More robust automatic error recovery can substantially increase application availability in the face of CAPI hardware errors.

When a CAPI hardware error occurs, it must be detected, reported, and recovered from. Applications and the kernel can detect CAPI hardware errors in a variety of ways (e.g., commands to the accelerator may time out without finishing, memory-mapped I/O loads may return all Fs instead of valid values, or the hardware may send an accelerator function error interrupt). Both the application and the CAPI device driver therefore have means of detecting the error. The first entity (application or device driver) to detect the error may confirm it and kick off the recovery. Confirmation consists of checking the CAPI device state. Once the error is confirmed, the device driver takes over the recovery and quiesces the device. Quiescing the device mainly consists of purging the active contexts.
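
The detect-then-confirm sequence described above can be illustrated with a minimal sketch in C. It is not taken from the disclosure or from any real CAPI driver: the names `capi_symptoms`, `capi_dev_state`, `capi_error_suspected`, and `capi_should_recover` are hypothetical, and the sketch assumes 64-bit memory-mapped I/O registers so that "all Fs" corresponds to `UINT64_MAX`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device states; the disclosure does not name them. */
enum capi_dev_state { CAPI_DEV_OK, CAPI_DEV_ERROR };

/* Hypothetical symptoms an application or the driver might observe. */
struct capi_symptoms {
    uint64_t mmio_value;  /* value returned by a memory-mapped I/O load */
    bool cmd_timed_out;   /* an accelerator command did not finish */
    bool afu_error_irq;   /* accelerator function error interrupt seen */
};

/* Assumption: registers are 64 bits wide, so "all Fs" is UINT64_MAX. */
#define CAPI_MMIO_ALL_FS UINT64_MAX

/* Any single symptom is only a hint that the device may have failed. */
static bool capi_error_suspected(const struct capi_symptoms *s)
{
    assert(s != NULL);
    return s->mmio_value == CAPI_MMIO_ALL_FS ||
           s->cmd_timed_out ||
           s->afu_error_irq;
}

/* Recovery starts only when the device state confirms the suspicion. */
static bool capi_should_recover(const struct capi_symptoms *s,
                                enum capi_dev_state state)
{
    return capi_error_suspected(s) && state == CAPI_DEV_ERROR;
}
```

Keeping suspicion and confirmation separate matters because a symptom alone can be ambiguous: a register may legitimately hold all Fs, and a command may time out for reasons unrelated to a device error.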

The novel contribution is a method to place all active contexts on a garbage list.
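
As a rough illustration of the garbage-list idea, the following C sketch moves every active context onto a garbage list without freeing anything, and releases the memory in a separate later step. This is a sketch under stated assumptions, not the disclosed implementation: the structures `capi_ctx` and `capi_dev` and the three functions are hypothetical, singly linked lists stand in for whatever bookkeeping a real driver uses, and the locking that real kernel code would need is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical context record; the fields are illustrative only. */
struct capi_ctx {
    int id;
    struct capi_ctx *next;
};

/* Hypothetical per-device bookkeeping: active contexts and garbage. */
struct capi_dev {
    struct capi_ctx *active;
    struct capi_ctx *garbage;
};

/* Register a new active context; returns NULL if allocation fails. */
static struct capi_ctx *capi_ctx_add(struct capi_dev *dev, int id)
{
    assert(dev != NULL);
    struct capi_ctx *ctx = malloc(sizeof(*ctx));
    if (ctx == NULL)
        return NULL;
    ctx->id = id;
    ctx->next = dev->active;
    dev->active = ctx;
    return ctx;
}

/* Error path: move every active context onto the garbage list.
 * Only pointers change; nothing is freed here.
 * Returns the number of contexts moved. */
static size_t capi_purge_active(struct capi_dev *dev)
{
    assert(dev != NULL);
    size_t moved = 0;
    while (dev->active != NULL) {
        struct capi_ctx *ctx = dev->active;
        dev->active = ctx->next;
        ctx->next = dev->garbage;
        dev->garbage = ctx;
        moved++;
    }
    return moved;
}

/* Later, where deallocation is allowed: free everything on the
 * garbage list. Returns the number of contexts freed. */
static size_t capi_collect_garbage(struct capi_dev *dev)
{
    assert(dev != NULL);
    size_t freed = 0;
    while (dev->garbage != NULL) {
        struct capi_ctx *ctx = dev->garbage;
        dev->garbage = ctx->next;
        free(ctx);
        freed++;
    }
    return freed;
}
```

In this sketch the purge step does only pointer manipulation, so it never calls the allocator; the cost of freeing is paid entirely in `capi_collect_garbage`.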

Because error recovery runs in an interrupt environment, where garbage collection cannot be done, the recovery code in the kernel defers deallocation of certain resources at the time of purging. However, the recovery purges any pending page faults, masks or clears interrupts from the I/O chipset (i.e., the CAPI host bridge), purges any page table entry for the device's memory-mapped I/O, and begins emulating memory-mapped I/O by returning all Fs for loads and ignoring stores (a behavior requ...