Use of Platform Management Interrupt (PMI) Duty-Cycle to Detect Memory Chip-Kill
Original Publication Date: 1999-Nov-01
Included in the Prior Art Database: 2003-Jun-18
Summary and Problem Statement

During a memory chip-kill event, a memory device on a DIMM fails and causes an ECC error in the data on each access to that DIMM. Because server systems are usually designed to interrupt the processor when a memory ECC error occurs, the frequency of PMI in IA-64 systems ("SMI" in IA-32 systems) correlates linearly with the ECC error rate in a DIMM. When a chip-kill occurs, the failing DIMM produces ECC errors at an extremely high rate compared to random soft or hard errors. For example, if a program is running from a DIMM that has experienced a chip-kill, ECC errors, and hence PMIs, are generated continuously whenever the program is loaded or accesses the failed memory.

Usually, one ECC error per cache line can be detected in hardware and correlated to a DIMM. However, the execution time of the SMI or PMI firmware determines the actual rate at which errors can be counted, since these handlers run on the order of milliseconds while memory accesses occur in nanoseconds to microseconds. Because of this inherently low-frequency sampling of high-frequency errors, a chip-kill event may look the same as other ECC error distributions to the SMI or PMI firmware. A method is needed to easily distinguish chip-kill from other ECC error distributions.

Some IA-32 systems detect a chip-kill by counting the SMIs caused by ECC errors over a very short time interval. A chip-kill is inferred when a threshold count is reached; a message is then sent to the system error log, and the failed DIMM is deactivated on the next system boot. In essence, this prior technique establishes that the SMI handler code is saturating the processor by observing that a system-dependent, lower-level interrupt is precluded by the higher-priority SMI occurrences. This system dependence becomes a problem as systems evolve from 32-bit to 64-bit, since it is possible for the OS to mask PMI in IA-64 systems.
This masking makes it impossible to guarantee that PMI is really monopolizing the system, as the OS may throttle PMI if it chooses. Also, the presence of the necessary lower-level interrupts is not guaranteed in future systems, as the prerequisite "legacy" EBDA storage area is retired.

This disclosure describes a method of detecting chip-kill that instead measures the average duty-cycle of PMI events over a short time and infers a chip-kill when the average PMI duty-cycle exceeds a threshold value. (Duty-cycle is defined as the time duration of a PMI divided by the time between two consecutive PMIs.) The threshold duty-cycle can be adjusted to accommodate any interference by the OS and still reliably detect chip-kill scenarios. This method is superior because it does not simply count errors and does not rely on system-dependent mechanisms, which allows it to be reused as IA-64 systems evolve.
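
The duty-cycle test can be illustrated with a short sketch. This is not the disclosed firmware, only one possible reading of it in Python: it assumes each ECC-triggered PMI is logged with a start timestamp and a handler duration, takes "the time between two PMIs" to mean the interval between the starts of consecutive PMIs, and averages the per-interval ratios. The names `PmiEvent`, `average_duty_cycle`, and `is_chip_kill`, and the default threshold of 0.8, are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PmiEvent:
    """One ECC-triggered PMI, as a hypothetical firmware log might record it."""
    start_us: float     # timestamp at handler entry, in microseconds
    duration_us: float  # time spent inside the handler, in microseconds


def average_duty_cycle(events: Sequence[PmiEvent]) -> float:
    """Mean of (PMI duration / time until the next PMI starts)."""
    if len(events) < 2:
        return 0.0  # a single PMI has no period to measure against
    ratios = []
    for current, following in zip(events, events[1:]):
        period_us = following.start_us - current.start_us
        if period_us <= 0:
            raise ValueError("events must be in strictly increasing time order")
        # Clamp at 1.0: a handler cannot occupy more than the whole period.
        ratios.append(min(current.duration_us / period_us, 1.0))
    return sum(ratios) / len(ratios)


def is_chip_kill(events: Sequence[PmiEvent], threshold: float = 0.8) -> bool:
    """Infer chip-kill when the average duty-cycle exceeds the threshold.

    The 0.8 default is an assumed value; the disclosure only says the
    threshold can be adjusted to allow for OS interference.
    """
    return average_duty_cycle(events) > threshold


# A 1 ms handler re-entered every 1.05 ms: the handler nearly saturates the CPU.
saturated = [PmiEvent(i * 1050.0, 1000.0) for i in range(10)]
# The same 1 ms handler, but errors arrive only once per second.
sparse = [PmiEvent(i * 1_000_000.0, 1000.0) for i in range(10)]

print(is_chip_kill(saturated), is_chip_kill(sparse))
```

In this sketch, lowering `threshold` is the knob that would accommodate an OS that masks or throttles PMI, since such interference stretches the interval between PMIs and lowers the measured duty-cycle.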