
Software Mechanism to Measure Memory Single-Bit Error Rates for Fault Detection

IP.com Disclosure Number: IPCOM000124001D
Original Publication Date: 1999-Sep-01
Included in the Prior Art Database: 2005-Apr-05
Document File: 2 page(s) / 86K

Publishing Venue

IBM

Related People

Cerbini, CD: AUTHOR

Abstract

This disclosure is a software mechanism that measures the correctable single-bit error rates in a memory subsystem for the purpose of Fault Detection (FD) and Predictive Failure Analysis (PFA). By using a special 55 millisecond timer (hereafter called the "Timer"), whose value is located in the Extended BIOS Data Area (EBDA), we can compute actual error rates, which are required for accurate tracking of memory component aberrations and the resulting alert to system management software.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.


Problem

   This disclosure is a software mechanism that measures
the correctable single-bit error rates in a memory subsystem for the
purpose of Fault Detection (FD) and Predictive Failure Analysis
(PFA).  By using a special 55 millisecond timer (hereafter called
the "Timer"), whose value is located in the Extended BIOS Data Area
(EBDA), we can compute actual error rates, which are required for
accurate tracking of memory component aberrations and the resulting
alert to system management software.

   Another important problem solved by this invention
concerns the response of the system during extremely high error
rates.  Correctable single-bit error detection is often a
hardware-driven event that invokes the System Management Interrupt
(SMI) Handler firmware and consequently interrupts the operating
system.  Unfortunately, there is often no throttling mechanism
available, and under conditions of high densities of memory errors
(i.e., at least one error per 32 bytes), system performance is
catastrophically impacted by the repeated SMIs.  Many prior
solutions to this problem still compromised system performance and
retained an undesirably high degree of complexity.  This solution can
identify when memory errors adversely affect system performance and
can automatically adapt the error detection hardware to prevent
degradation of system performance.

Solution

   Upon detection of the first correctable single-bit memory
error, the SMI Handler firmware captures the Start Time from the
Timer.  When a subsequent error occurs, the current Timer value is
used to perform an error rate computation, since both an elapsed time
and an error count are known.  When this rate exceeds a design
parameter threshold, a message is sent to the system management
software, iden...
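
   The rate computation described above can be sketched in C as
follows.  This is a minimal illustration under stated assumptions,
not the disclosed firmware: the structure sbe_tracker_t, the function
sbe_on_error, the errors-per-hour unit, and the threshold argument
are hypothetical names and choices.  Only the 55 millisecond tick
period and the capture of a Start Time on the first error come from
the text.  The sketch also assumes a 32-bit tick counter, counts only
the errors seen after the Start Time, and rounds an interval shorter
than one tick up to one tick.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tick period of the Timer, as stated in the disclosure. */
#define TIMER_TICK_MS 55u
#define MS_PER_HOUR   3600000u

/* Hypothetical state kept by the SMI Handler (not from the source). */
typedef struct {
    bool     started;             /* true once the first error was seen */
    uint32_t start_tick;          /* Start Time captured from the Timer */
    uint32_t error_count;         /* errors seen after the Start Time   */
    uint64_t last_rate_per_hour;  /* most recently computed error rate  */
} sbe_tracker_t;

/*
 * Record one correctable single-bit error observed at Timer value
 * now_tick. Returns true when the computed rate exceeds the
 * (assumed) threshold, i.e. when an alert would be raised.
 */
static bool sbe_on_error(sbe_tracker_t *t, uint32_t now_tick,
                         uint64_t threshold_per_hour)
{
    if (!t->started) {
        /* First error: capture the Start Time; no rate is known yet. */
        t->started = true;
        t->start_tick = now_tick;
        t->error_count = 0;
        t->last_rate_per_hour = 0;
        return false;
    }

    t->error_count++;

    /* Unsigned subtraction stays correct across counter wraparound. */
    uint32_t elapsed_ticks = now_tick - t->start_tick;
    if (elapsed_ticks == 0)
        elapsed_ticks = 1;  /* assumption: round up to one tick */

    uint64_t elapsed_ms = (uint64_t)elapsed_ticks * TIMER_TICK_MS;
    t->last_rate_per_hour =
        (uint64_t)t->error_count * MS_PER_HOUR / elapsed_ms;

    return t->last_rate_per_hour > threshold_per_hour;
}
```

   With a threshold of 10 errors per hour, a second error arriving
100 ticks (5.5 seconds) after the first would exceed the threshold in
this sketch, while one arriving 70000 ticks (about 64 minutes) later
would not.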