Method of Determining System Memory Failure Mechanisms Through the Tracking of Correctable Errors

IP.com Disclosure Number: IPCOM000016230D
Original Publication Date: 2002-Sep-24
Included in the Prior Art Database: 2003-Jun-21
Document File: 2 page(s) / 40K

Publishing Venue

IBM

Abstract

Disclosed is a method for determining system memory failure mechanisms through the tracking of inbound data correctable errors (CEs). In certain memory systems, errors originating from hardware internal to the memory subsystem can be detected and corrected. Error correcting code (ECC) is a common industry practice by which simple data errors can be detected and corrected. For example, when data is stored to memory, a syndrome is generated from the data itself and stored with it. When a memory controller later reads this data back, it recalculates the syndrome from the data it reads and compares it to the stored value. If the values are not equal, the syndrome is decoded and the data bit that does not match is corrected. Many causes can lead to corrupt data bits, including manufacturing defects, early life fails, noise on power or control signals, temperature extremes, and marginal timing.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

There are two main types of data correctable errors: inbound and outbound. Inbound data correctable errors refer to bits that are checked and corrected on a memory fetch, that is, when data is read from a particular dynamic random access memory (DRAM) and sent to the memory controller for further processing. Outbound data correctable errors refer to bits that are checked and corrected on a memory store, that is, when data is sent by another computer subsystem to the memory controller for further processing (i.e., storing the data in a DRAM).
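The check-and-correct step performed on a memory fetch can be illustrated with a toy single-error-correcting code. The sketch below is not the disclosed implementation: it assumes a 4-bit data word protected by 3 Hamming check bits (real memory systems use much wider codes), and the names `check_bits`, `store`, and `fetch` are hypothetical. It follows the sequence described above: check bits are generated on store, recalculated on fetch, and the difference is decoded to locate the failing bit.

```python
# Toy Hamming(7,4) model of ECC on a memory fetch (illustrative assumption,
# not the code used by any particular memory controller).

# Hamming codeword positions (1..7) that hold data bits d0..d3.
# Positions 1, 2 and 4 hold the check bits.
DATA_POSITIONS = (3, 5, 6, 7)


def check_bits(data):
    """Return the 3 check bits (packed as an int 0..7) for a 4-bit value.

    XOR-ing the positions of all set data bits yields the three parity
    bits of the Hamming code in one step.
    """
    check = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            check ^= pos
    return check


def store(data):
    """Model a memory store: keep the data together with its check bits."""
    return data, check_bits(data)


def fetch(raw_data, stored_check):
    """Model a memory fetch: return (corrected_data, was_corrected)."""
    # Recalculate the check bits from the data read and compare them
    # with the stored value.
    syndrome = check_bits(raw_data) ^ stored_check
    if syndrome == 0:
        return raw_data, False  # no error detected
    if syndrome in DATA_POSITIONS:
        # The syndrome decodes to the position of the failing data bit.
        raw_data ^= 1 << DATA_POSITIONS.index(syndrome)
    # Otherwise a check bit was corrupted and the data is already good.
    return raw_data, True


if __name__ == "__main__":
    data, check = store(0b1011)
    corrupted = data ^ 0b0100          # single-bit error inside the DRAM
    print(fetch(corrupted, check))     # corrected data plus a CE indication
```

The boolean returned by `fetch` is the kind of event an inbound correctable error log would record. This toy code corrects only single-bit errors and cannot reliably detect multi-bit errors.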

Because of the intermittent nature of inbound data correctable errors, it is necessary to collect as much data as possible on these types of errors. This data can then be used to determine if any memory component has exceeded a threshold for dat...
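
The idea of collecting inbound correctable errors and comparing them against a threshold can be sketched as a simple per-component counter. Everything specific here is an assumption made for illustration: the class name `CorrectableErrorTracker`, the component labels, and the threshold value are hypothetical, and the available text does not say how the disclosed method counts or groups errors.

```python
from collections import Counter


class CorrectableErrorTracker:
    """Count inbound correctable errors per memory component (hypothetical)."""

    def __init__(self, threshold):
        # Threshold value is an assumption; the source does not give one.
        self.threshold = threshold
        self.counts = Counter()

    def record(self, component, direction="inbound"):
        """Log one correctable error; only inbound errors are tracked."""
        if direction == "inbound":
            self.counts[component] += 1

    def over_threshold(self):
        """Return the components whose count has exceeded the threshold."""
        return sorted(c for c, n in self.counts.items() if n > self.threshold)


if __name__ == "__main__":
    tracker = CorrectableErrorTracker(threshold=2)
    for component in ["DIMM0", "DIMM1", "DIMM0", "DIMM0"]:
        tracker.record(component)
    tracker.record("DIMM1", direction="outbound")  # ignored by this sketch
    print(tracker.over_threshold())
```

Keeping a count per component, rather than one global count, is what lets intermittent errors be attributed to a specific part over time.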