Method to Isolate Multiple Bit Memory Errors at the time of failure.

IP.com Disclosure Number: IPCOM000012921D
Original Publication Date: 2003-Jun-10
Included in the Prior Art Database: 2003-Jun-10
Document File: 4 page(s) / 50K

Publishing Venue

IBM

Abstract

Disclosed is a method to isolate uncorrectable memory errors at the time of failure in systems protected by Error Correction Code (ECC). Because memory errors can be hard to recreate, due to the many modes of failure, and are hard to isolate, the best time to identify the failing Dual Inline Memory Module (DIMM) is at the time of failure. On systems where the data bus covers more than one DIMM, all DIMMs in the chip select group would need to be replaced without better isolation.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 33% of the total text.

    There are different ways to protect data in memory with ECC. The most common is to be able to correct a single bit in the data width covered by the ECC code. More aggressive systems can handle a chip kill (a DRAM that will not respond). The number of data bits that one DRAM supplies to an ECC word is called a packet. When memory is protected by ECC and an error involves more than one packet, the ECC cannot isolate the error to any particular DRAM. The memory controller will capture the chip select group within its error registers. Since a chip select group can include anywhere from 2 to 8 DIMMs and the failure itself can come from a single DIMM, DIMMs that don't have any problems would otherwise end up being replaced to solve the problem.
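
    The packet definition above can be illustrated with a minimal sketch. The word width (72 bits), the packet width (4 bits per DRAM), and the contiguous bit layout are assumptions made for illustration only; the disclosure does not state them. The error mask is also hypothetical: for an uncorrectable error the hardware does not know which bits flipped, so this only shows what "an error involving more than one packet" means.

```python
PACKET_BITS = 4   # assumed: each DRAM supplies 4 bits to an ECC word
WORD_BITS = 72    # assumed: 64 data bits plus 8 check bits


def packets_in_error(error_mask, packet_bits=PACKET_BITS, word_bits=WORD_BITS):
    """Return the indices of the packets that contain at least one bad bit.

    error_mask has a 1 in every bit position of the ECC word that is wrong.
    """
    packet_mask = (1 << packet_bits) - 1
    return [
        p
        for p in range(word_bits // packet_bits)
        if (error_mask >> (p * packet_bits)) & packet_mask
    ]


def spans_multiple_packets(error_mask):
    """True when the bad bits fall in more than one packet (more than one DRAM)."""
    return len(packets_in_error(error_mask)) > 1
```

    Under these assumptions, two bad bits inside the same 4-bit packet stay within one DRAM, while two bad bits in neighbouring packets involve two DRAMs, which is the case the ECC cannot attribute to a particular part.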

    The different modes of failure in DRAMs include: a single cell that has lost its state because of some external event (such as an alpha particle); a single cell that is always high or low; a data line with a bad driver (weak, open, or shorted); a control line not responding to commands; an address line not responding; a completely dead DRAM; shorted data lines; and other types.

    The disclosure describes how to use the service processor, together with the memory address trapped by the memory controller during the ECC error, to interrogate the memory at the failing address, find a bit that is stuck in the high or low position, and then direct the service action to that DIMM or DRAM only.
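
    The last step named above, directing the service action to one DIMM or DRAM, needs a mapping from a bit position on the data bus to a physical part. The sketch below shows one way such a mapping could look. The layout is hypothetical: it assumes each DIMM contributes 72 contiguous bus bits and each DRAM contributes 4 contiguous bits, and the names `locate_bit`, `BITS_PER_DIMM`, and `BITS_PER_DRAM` are invented for this example. A real system would take the mapping from its board wiring.

```python
from typing import NamedTuple

BITS_PER_DIMM = 72  # assumed: bus bits supplied by one DIMM
BITS_PER_DRAM = 4   # assumed: bus bits supplied by one DRAM on that DIMM


class Location(NamedTuple):
    dimm: int  # index of the DIMM within the chip select group
    dram: int  # index of the DRAM on that DIMM


def locate_bit(bus_bit):
    """Map a data bus bit position to a (DIMM, DRAM) pair under the assumed layout."""
    dimm, bit_in_dimm = divmod(bus_bit, BITS_PER_DIMM)
    return Location(dimm=dimm, dram=bit_in_dimm // BITS_PER_DRAM)
```

    With this layout, bus bit 75 falls on the second DIMM (index 1), in its first DRAM (index 0), so only that DIMM would be called out for replacement.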

    The problem is solved by service processor code that reads 128 bytes of data, including the ECC, at the memory address where the error occurred, complements the data, writes it back, and reads it again. It then does the same thing at another address close to the first one. The data read can then be used to find a bit that is stuck at either '1' or '0': a stuck bit returns the same value before and after the complemented write. Since a majority of multiple bit errors are caused by one bit that has a stuck fault combined with another that is just a soft error, replacing the DIMM with the stuck bit will resolve the fatal multiple bit error, because the remaining single bit error is correctable by ECC.
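
    The read, complement, write, and read-back sequence can be sketched against a simulated memory. This is only an illustration under stated assumptions, not the service processor code itself: `SimulatedMemory`, its per-address stuck masks, and `find_stuck_bits` are hypothetical names, and a 128-byte line is modelled as a single Python integer.

```python
LINE_BITS = 128 * 8               # 128 bytes, as in the text
LINE_MASK = (1 << LINE_BITS) - 1


class SimulatedMemory:
    """Toy memory in which chosen bits at chosen addresses are stuck.

    stuck maps an address to (stuck_at_0_mask, stuck_at_1_mask).
    """

    def __init__(self, stuck=None):
        self.cells = {}
        self.stuck = stuck or {}

    def write(self, addr, value):
        self.cells[addr] = value & LINE_MASK

    def read(self, addr):
        stuck0, stuck1 = self.stuck.get(addr, (0, 0))
        raw = self.cells.get(addr, 0)
        # Stuck-at-0 bits always read 0; stuck-at-1 bits always read 1.
        return ((raw & ~stuck0) | stuck1) & LINE_MASK


def find_stuck_bits(mem, addr):
    """Return (stuck_at_0_mask, stuck_at_1_mask) observed at addr."""
    first = mem.read(addr)
    mem.write(addr, ~first & LINE_MASK)     # write the complement
    second = mem.read(addr)
    unchanged = ~(first ^ second) & LINE_MASK  # bits that failed to flip
    return unchanged & ~first & LINE_MASK, unchanged & first


def bit_positions(mask):
    """List the bit indices set in mask."""
    return [i for i in range(LINE_BITS) if (mask >> i) & 1]


if __name__ == "__main__":
    mem = SimulatedMemory(stuck={0x1000: (1 << 5, 1 << 77)})
    for addr in (0x1000, 0x1080):   # failing address, then one close by
        mem.write(addr, 0xDEADBEEF)
        stuck0, stuck1 = find_stuck_bits(mem, addr)
        print(hex(addr), bit_positions(stuck0), bit_positions(stuck1))
```

    The sketch leaves the complemented data in memory and does not interpret the second probe. One plausible reading, which the abbreviated text does not confirm, is that a bit stuck at both addresses points to a line or driver fault, while a bit stuck at only the failing address points to a single cell.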

    This disclosure describes a method to isolate an uncorrectable memory error to a failing memory chip (DRAM). Computer memory has become smaller in physical size for a given amount of memory. Access speeds have increased greatly. To reduce heat and power, the voltage used to run the memory has dropped. The amount of memory in a computer system has increased significantly. All of these factors...