Method of Determining System Memory Failure Mechanisms Through the Tracking of Correctable Errors
Original Publication Date: 2002-Sep-24
Included in the Prior Art Database: 2003-Jun-21
Disclosed is a method for determining system memory failure mechanisms through the tracking of inbound data correctable errors (CEs). In certain memory systems, errors originating from hardware internal to the memory subsystem can be detected and corrected. Error correcting code (ECC) is a common industry practice by which simple data errors can be detected and corrected. For example, when data is stored to memory, ECC check bits are generated from the data itself and stored along with it. When a memory controller later reads this data back, it recalculates the check bits from the data it reads and compares them to the stored value. If the values are not equal, the mismatch (the syndrome) is decoded to identify the data bit that is in error, and that bit is corrected. Many conditions can lead to corrupt data bits, including manufacturing defects, early-life failures, noise on power or control signals, temperature extremes, marginal timing, and the like.

There are two main types of data correctable errors: inbound and outbound. Inbound data correctable errors refer to bits that are checked and corrected on a memory fetch, that is, when data is read from a particular dynamic random access memory (DRAM) and sent to the memory controller for further processing. Outbound data correctable errors refer to bits that are checked and corrected on a memory store, that is, when data is sent by another computer subsystem to the memory controller for further processing (for example, storing the data in a DRAM).

Because inbound data correctable errors are intermittent, it is necessary to collect as much data as possible about them. This data can then be used to determine whether any memory component has exceeded a threshold for data correctable errors. It can also be given to memory component vendors so that they know which part failed and how.
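The detect-and-correct step described above can be illustrated with a minimal sketch. It uses a toy Hamming(7,4) code, which is an assumption chosen for brevity and is not the code used by the disclosed system; real memory ECC typically protects much wider words. The function names are hypothetical.

```python
# Toy single-error-correcting Hamming(7,4) code (illustrative assumption only).
# Bit positions are numbered 1..7; positions 1, 2 and 4 hold check bits.
DATA_POSITIONS = (3, 5, 6, 7)
CHECK_POSITIONS = (1, 2, 4)


def syndrome(codeword):
    """XOR together the 1-based positions of every set bit."""
    s = 0
    for pos in range(1, 8):
        if codeword[pos]:
            s ^= pos
    return s


def encode(nibble):
    """Store path: place 4 data bits and generate the check bits."""
    codeword = [0] * 8  # index 0 is unused so indices match bit positions
    for i, pos in enumerate(DATA_POSITIONS):
        codeword[pos] = (nibble >> i) & 1
    s = syndrome(codeword)
    for pos in CHECK_POSITIONS:
        codeword[pos] = 1 if s & pos else 0  # forces the overall syndrome to 0
    return codeword


def read_and_correct(codeword):
    """Fetch path: recompute the syndrome and fix a single flipped bit.

    Returns (data, syndrome). A nonzero syndrome is the position of the
    bit that was corrected, i.e. an inbound correctable error.
    """
    codeword = list(codeword)
    s = syndrome(codeword)
    if s:
        codeword[s] ^= 1
    nibble = 0
    for i, pos in enumerate(DATA_POSITIONS):
        nibble |= codeword[pos] << i
    return nibble, s


stored = encode(0b1011)
stored[6] ^= 1  # simulate a single-bit fault inside the memory
data, failing_bit = read_and_correct(stored)
```

In this sketch the nonzero syndrome doubles as the failing bit position, which is the kind of detail a tracking scheme could log for each correctable error.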
Collecting this data allows marginal parts to be found and replaced, makes data available for vendor analysis, and yields savings for the system manufacturer, who can state precisely how these components are defective. This document provides a method and system for tracking correctable errors on memory cards. More importantly, it can track the individual errors on parts as those parts move from one system to another. This is done in order to determine failure characteristics and to prevent marginal parts from compromising the stability of customers' systems.
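A minimal sketch of such tracking follows. The disclosure does not say how the error history is stored or how a part is identified, so keying the log by a part serial number, the threshold value, and every name here (`CELog`, `record`, `over_threshold`, `report`) are assumptions made for illustration.

```python
from collections import Counter, defaultdict


class CELog:
    """Correctable-error history keyed by part serial number (assumed key),
    so the history follows a part when it moves to another system."""

    def __init__(self, threshold=2):  # threshold value is a hypothetical example
        self.threshold = threshold
        self.events = defaultdict(list)  # serial -> [(system_id, failing_bit)]

    def record(self, part_serial, system_id, failing_bit):
        """Log one inbound correctable error against a part."""
        self.events[part_serial].append((system_id, failing_bit))

    def over_threshold(self):
        """Serial numbers of parts whose CE count has exceeded the threshold."""
        return sorted(
            serial
            for serial, events in self.events.items()
            if len(events) > self.threshold
        )

    def report(self, part_serial):
        """Summarize which part failed and how, e.g. for vendor analysis."""
        events = self.events.get(part_serial, [])
        return {
            "total": len(events),
            "systems": sorted({system for system, _ in events}),
            "bits": Counter(bit for _, bit in events),
        }


log = CELog(threshold=2)
log.record("DIMM-A", "system-1", failing_bit=5)
log.record("DIMM-A", "system-1", failing_bit=5)
log.record("DIMM-A", "system-2", failing_bit=9)  # same part, after being moved
log.record("DIMM-B", "system-1", failing_bit=0)

marginal_parts = log.over_threshold()
summary = log.report("DIMM-A")
```

Because the history is attached to the part rather than to the system, errors seen in both systems count toward the same threshold, and repeated failures on one bit stand out in the summary.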