Comparative Memory Predictive Failure Analysis and Memory Architecture Analysis

IP.com Disclosure Number: IPCOM000168591D
Original Publication Date: 2008-Mar-17
Included in the Prior Art Database: 2008-Mar-17
Document File: 1 page(s) / 23K

Publishing Venue

IBM

Abstract

There is a deficiency in the state of the art in effectively detecting that a memory device is likely to fail and in understanding whether underlying architectural concerns may be causing the failure. Counting the total number of single-bit errors can show that a DIMM is failing; what is needed, however, is a memory reliability metric that also allows analysis of the underlying memory architecture. By tracking the number of failures on a DIMM relative to the other DIMMs in the system, and counting the number of times this offset exceeds a threshold value, we can determine whether failures are due to faulty memory or to a poorly designed memory architecture.

Comparative Memory Predictive Failure Analysis and Memory Architecture Analysis

In an example implementation, a memory single-bit error increments the error counter of the DIMM on which it occurs if and only if that DIMM is not the least-failing DIMM. If a single-bit error occurs in the least-failing DIMM, the counters of all other DIMMs are decremented instead, so each counter acts as an error offset relative to the least-failing DIMM. Once a DIMM's error offset exceeds a slot threshold, the offset is reset, the DIMM can be flagged as faulty, and a corresponding DIMM slot failure counter can be incremented.

     In one embodiment, replacing a DIMM will also cause the DIMM error offset to be reset.
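
The counting scheme described above might be sketched as follows. This is only an illustrative sketch, not the disclosed implementation: the class name, method names, and threshold value are hypothetical, and two behaviors are assumptions the disclosure does not specify. First, a DIMM is treated as "the least failing" only when it is the sole DIMM with an offset of zero. Second, replacing a DIMM also clears its faulty flag.

```python
class ComparativeErrorTracker:
    """Tracks each DIMM's error offset relative to the least-failing DIMM.

    All names here are hypothetical; the disclosure names no API.
    """

    def __init__(self, num_slots, threshold):
        self.threshold = threshold              # slot threshold (assumed value)
        self.offsets = [0] * num_slots          # per-DIMM error offset
        self.flagged = [False] * num_slots      # DIMM flagged as faulty
        self.slot_failures = [0] * num_slots    # per-slot failure counter

    def record_single_bit_error(self, slot):
        others = [s for s in range(len(self.offsets)) if s != slot]
        # Assumption: the erroring DIMM is "the least failing" only if it is
        # the sole DIMM with a zero offset. Ties are not covered by the text.
        is_least = self.offsets[slot] == 0 and all(
            self.offsets[s] > 0 for s in others
        )
        if is_least:
            # An error in the least-failing DIMM decrements every other counter.
            for s in others:
                self.offsets[s] -= 1
            return
        self.offsets[slot] += 1
        if self.offsets[slot] > self.threshold:
            # Offset exceeded the slot threshold: reset it, flag the DIMM,
            # and charge the failure to the slot.
            self.offsets[slot] = 0
            self.flagged[slot] = True
            self.slot_failures[slot] += 1

    def replace_dimm(self, slot):
        # Replacing a DIMM resets its error offset. Clearing the flag is an
        # assumption. The slot failure counter is kept, so that failures
        # accumulate per slot across replacements.
        self.offsets[slot] = 0
        self.flagged[slot] = False
```

Because the slot failure counter is left untouched by `replace_dimm`, a slot that keeps producing flagged DIMMs accumulates a history that outlives any single DIMM.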

     This process reveals whether a specific DIMM is failing more often than the others. It also reveals whether a given slot in the system is more prone to failure than the others, which implicates an underlying problem with the system design.
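
One way to act on the per-slot failure counters is sketched below. The disclosure does not state a decision rule, so the rule used here (a slot is suspect when its failure count exceeds the median across all slots by at least a margin) is an assumption, as are the function name and the default margin.

```python
from statistics import median


def suspect_slots(slot_failures, margin=2):
    """Return indices of slots that look more failure-prone than the rest.

    slot_failures: per-slot failure counts, assumed to persist across
    DIMM replacements. The median-plus-margin rule is a hypothetical
    choice, not taken from the disclosure.
    """
    if not slot_failures:
        return []
    baseline = median(slot_failures)
    return [
        slot
        for slot, count in enumerate(slot_failures)
        if count - baseline >= margin
    ]
```

A slot that appears in the result after several different DIMMs have occupied it points toward the slot or the surrounding design rather than toward any one DIMM.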

     It is contemplated that failure rates across multiple disks and other devices could be similarly tracked.
