Method of Predicting a Bad DIMM out of a Failing DIMM Set upon an Uncorrectable ECC Error

IP.com Disclosure Number: IPCOM000146916D
Original Publication Date: 2007-Feb-27
Included in the Prior Art Database: 2007-Feb-27
Document File: 1 page(s) / 22K

Publishing Venue

IBM

Abstract

In most computer systems, the data path to memory is more than one DIMM wide. ECC is calculated across the entire data path to ensure the integrity of the data read from memory. ECC with chipkill coverage can typically detect and correct an error in any one DRAM. However, if an error occurs in multiple DRAMs, ECC can only detect it and cannot identify the failing DRAMs. Because the ECC is calculated across the entire data path, the error can be isolated only to the set of DIMMs forming the data path, even if all failing DRAMs are in a single DIMM.

This is the abbreviated version, containing approximately 69% of the total text.

Software can predict the bad DIMM in a failing DIMM set upon an uncorrectable ECC error, based on the statistics of past correctable errors on each DIMM. Because ECC with chipkill coverage can identify a failing DRAM when the error is correctable, correctable-error statistics can be collected for each DRAM in each DIMM of the DIMM set across which the ECC is calculated. Upon an uncorrectable error, for which the ECC cannot identify the failing DRAMs, software can examine the history of correctable errors in each DIMM. If most of the correctable errors were detected in multiple DRAMs of a single DIMM, it is very likely that the uncorrectable error was caused by more than one DRAM failing at the same time in that DIMM. Software can therefore use the statistical history of correctable errors to predict the failing DIMM when the error cannot be isolated to a single DIMM based on ECC alone.
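The prediction step described above can be sketched roughly as follows. This is a minimal illustration, not the disclosed implementation: the class name `CorrectableErrorHistory`, its methods, and the thresholds `min_drams` and `min_share` are all assumptions introduced here. The source only says that "most" correctable errors should fall in "multiple DRAMs" of one DIMM, which this sketch interprets as more than half of all logged errors spread over at least two DRAMs.

```python
from collections import defaultdict


class CorrectableErrorHistory:
    """Per-DRAM correctable-error counts for one DIMM set (hypothetical structure)."""

    def __init__(self):
        # counts[dimm][dram] -> number of correctable errors logged for that DRAM
        self.counts = defaultdict(lambda: defaultdict(int))

    def record(self, dimm, dram):
        """Log one correctable error that chipkill ECC isolated to (dimm, dram)."""
        self.counts[dimm][dram] += 1

    def predict_bad_dimm(self, min_drams=2, min_share=0.5):
        """Return the DIMM most likely behind an uncorrectable error, or None.

        Assumed rule: a DIMM is a suspect if it accounts for more than
        `min_share` of all correctable errors and those errors were seen
        in at least `min_drams` distinct DRAMs of that DIMM.
        """
        total = sum(sum(drams.values()) for drams in self.counts.values())
        if total == 0:
            return None  # no history, no prediction
        best, best_errors = None, 0
        for dimm, drams in self.counts.items():
            errors = sum(drams.values())
            if (len(drams) >= min_drams
                    and errors / total > min_share
                    and errors > best_errors):
                best, best_errors = dimm, errors
        return best


# Usage: four correctable errors across three DRAMs of DIMM0, one on DIMM1.
history = CorrectableErrorHistory()
for dram in (3, 3, 7, 11):
    history.record("DIMM0", dram)
history.record("DIMM1", 5)
print(history.predict_bad_dimm())  # prints DIMM0
```

Returning `None` when no DIMM meets the rule keeps the sketch from guessing when the history is empty or inconclusive; a real system would choose its own thresholds and fallback policy.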

     Each DIMM with ECC storage is typically built from 18 x4 DRAMs or 9 x8 DRAMs, giving a 72-bit-wide path: 64 data bits plus 8 bits for ECC. Two such DIMMs typically form a DIMM pair, making a 144-bit path with 128 data bits and 16 bits for ECC. ECC with chipkill coverage is typically calculated across a multiple of 128 data bits, so the ECC can identify and correct the error if any one DRAM within a DIMM pair is failing. However, the ECC can only detect an uncorrectable error if more than one DRAM is failing at the same time in a...
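
The bit widths quoted for a DIMM and a DIMM pair can be checked with a few lines of arithmetic. The helper name `dimm_width` and the constants are illustrative only and do not come from the disclosure.

```python
DATA_BITS_PER_DIMM = 64  # data bits per ECC DIMM, as stated in the text
ECC_BITS_PER_DIMM = 8    # ECC bits per DIMM, as stated in the text


def dimm_width(num_drams, bits_per_dram):
    """Total width in bits of a DIMM built from identical DRAMs."""
    return num_drams * bits_per_dram


x4_dimm = dimm_width(18, 4)  # 18 x4 DRAMs
x8_dimm = dimm_width(9, 8)   # 9 x8 DRAMs

# Both organizations give 64 data bits + 8 ECC bits = 72 bits.
assert x4_dimm == x8_dimm == DATA_BITS_PER_DIMM + ECC_BITS_PER_DIMM

# A DIMM pair doubles everything: 128 data bits + 16 ECC bits = 144 bits.
pair_width = 2 * x4_dimm
assert pair_width == 2 * DATA_BITS_PER_DIMM + 2 * ECC_BITS_PER_DIMM == 144
```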