Collection of Error Information after Catastrophic Adapter Failure Requiring Reset

IP.com Disclosure Number: IPCOM000124325D
Original Publication Date: 2005-Apr-15
Included in the Prior Art Database: 2005-Apr-15
Document File: 2 page(s) / 25K

Publishing Venue

IBM

Abstract

A method of collecting error information after a catastrophic adapter failure which requires an adapter to be reset.

Disclosed is a method that takes advantage of hardware that refreshes DRAM during a PCI reset, so that the contents of DRAM survive the reset. After the reset, FLASH code copies the key data needed to identify the failure to a known location where it can be retrieved by the server. This solution provides the following features:

Data is collected from the adapter with the adapter in a known state. The hardware is placed in the best possible state by a reset, and uncorrupted microcode is running from FLASH. Multiple images of the FLASH code exist with LRC checking.

The rest of the system is in a known condition and less likely to cause additional errors. The adapter reset can be done earlier to get the system running I/O sooner.
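
The post-reset flow described above can be sketched in C as follows. This is only an illustrative sketch, not the disclosed implementation: the record layout, the names (dump_record, flash_image, lrc8, select_flash_image, capture_failure_data), the sizes, and the choice of a byte-wise XOR as the LRC are all assumptions made for the example. It shows FLASH code choosing an image whose LRC checks out, then copying failure data preserved in DRAM into a record at a location the server could read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical values: none of these come from the disclosure. */
#define DUMP_MAGIC 0x44554D50u   /* ASCII "DUMP" */
#define DUMP_BYTES 64u           /* assumed size of the server-readable area */

/* Hypothetical record placed at the known location read by the server. */
typedef struct {
    uint32_t magic;               /* marks the record as filled in */
    uint32_t length;              /* number of payload bytes copied */
    uint8_t  lrc;                 /* LRC over the payload */
    uint8_t  payload[DUMP_BYTES]; /* failure data copied out of DRAM */
} dump_record;

/* Hypothetical descriptor for one copy of the FLASH code. */
typedef struct {
    const uint8_t *code;
    size_t         len;
    uint8_t        stored_lrc;
} flash_image;

/* Longitudinal redundancy check, assumed here to be the XOR of all bytes. */
uint8_t lrc8(const uint8_t *data, size_t len)
{
    uint8_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc ^= data[i];
    return acc;
}

/* Return the index of the first image whose LRC matches, or -1 if none does. */
int select_flash_image(const flash_image *images, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (lrc8(images[i].code, images[i].len) == images[i].stored_lrc)
            return (int)i;
    }
    return -1;
}

/* Copy failure data that survived the reset in DRAM into the record. */
void capture_failure_data(const uint8_t *dram, size_t len, dump_record *out)
{
    assert(dram != NULL && out != NULL);
    if (len > DUMP_BYTES)
        len = DUMP_BYTES;         /* truncate rather than overrun the area */
    memset(out, 0, sizeof *out);
    memcpy(out->payload, dram, len);
    out->length = (uint32_t)len;
    out->lrc    = lrc8(out->payload, len);
    out->magic  = DUMP_MAGIC;
}
```

In such a sketch the server side would check the magic value and recompute the LRC over the payload before trusting the record. Real adapter firmware would also need to address details this example leaves out, such as where the record physically lives and how writes to it are ordered.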

When a serious error occurs on an adapter, it is sometimes necessary to reset the adapter to recover or to protect the system. The reset is needed because malfunctioning hardware may cause additional system errors, and bad microcode may corrupt external data structures or cause additional system hardware errors. Once the adapter is reset, however, very little information is normally available to diagnose the problem, because the contents of memory are destroyed. This disclosure recovers the most information with almost no risk to the system. To collect information on a failure, a number of approaches have been taken in the industry:

The adapter places information in a location that can be read by the server. The big drawback is that accessing information on the adapter may cause additional system errors. A number of variations of this data collection exist, but they all control risk by limiting what information they collect and how much. Minimum risk means collecting very little information. Additional complications can occur when server remotely accesses hardwar...