
Run-time Failure Analysis Circuit for Autonomic Computing Systems

IP.com Disclosure Number: IPCOM000030759D
Original Publication Date: 2004-Aug-25
Included in the Prior Art Database: 2004-Aug-25
Document File: 3 page(s) / 113K

Publishing Venue

IBM

Abstract

A method is shown by which memory failures may be diagnosed more accurately and efficiently. This technique allows either software or hardware to map failures in a novel way that is efficient yet retains comprehensive failure information. The information can in turn be used to aid failure diagnostics, array repair during manufacturing, or array repair in field applications.

This is the abbreviated version, containing approximately 43% of the total text.



In any memory array, it has historically been desirable to have a bit map that shows the extent of failures, both for diagnosis and to facilitate repair of defective elements. Simpler monitoring schemes can mislead. For example, a monitor running in a controller that simply counts errors can be fooled by a single-cell defect that fails often and is detected frequently, leading to a misleadingly high failure count. The high fail count may trigger the invocation of redundancy, but the resource is wasted because the failing area is small, and it would be better to reserve the redundant resource for a larger fail that may come later. Conversely, a "scrub" sequencer that "steals" memory cycles to walk through memory to fetch, detect, correct and count errors (but ignores errors occurring during normal operation) can very accurately determine how many cells were bad on each pass through the memory space, but it may not detect failure mechanisms that require a particular timing sequence, address sequence, read/write sequence, or data switching pattern in order to expose themselves. As a result, failures could be occurring that the scrub sequencer does not detect; the redundancy resource should be applied but is not, which may lead to a system failure later on.

     This concept provides an improved means of removing some of the ambiguity about the nature of failures detected within arrays, whether under test or diagnostic routines or in systems running at customer sites. Additionally, the nature of the invention is such that it can be added to many memory structures with little effort. It is also applicable as a diagnostic aid, providing the effect of a failure bit map where little or no data existed previously.

     This invention compares the first failure occurrence with subsequent failures, making it possible to diagnose the type of failure (single cell, wordline, bitline, chip kill, etc.) with minimal additional logic in hardware or with a very small register set in software, as opposed to using a complete bit-for-bit memory map. This is accomplished by saving the address of the first failure and taking the 'exclusive or' (xor) of that address with each additional failing address; the result is saved in a register that acts as an address map. The xor indicates which address bits have failed at both 1 and 0 values, and once an xor bit is turned on, it remains on for all subsequent address failures for that particular data bit (or symbol) as determined by the error correcting code (ECC). To handle multiple failures, more than one copy of the register bits for this function can be implemented. However, needing more than two or three copies for a given interface or array is unlikely, so further copies would not be necessary. In addition to an address trap xor register (which acts as an address map), it is necessary to save the syndrome of the fail (to indicate which devices, symbols, or data bits are failing),...
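
     The address trap described above can be sketched in software as follows. This is only an illustrative sketch, not the disclosed circuit: the names AddressTrap, record and classify are hypothetical, the 16-bit address split (high 8 bits select the row/wordline, low 8 bits select the column/bitline) is an assumption, and matching a later fail to the trap by exact syndrome equality is likewise an assumption, since the full text is not available here.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed address layout (not from the disclosure): 16-bit address,
# high byte = row (wordline) select, low byte = column (bitline) select.
ROW_MASK = 0xFF00
COL_MASK = 0x00FF


@dataclass
class AddressTrap:
    """One trap: first failing address, its syndrome, and a sticky xor mask."""
    first_addr: Optional[int] = None
    syndrome: Optional[int] = None
    xor_mask: int = 0

    def record(self, addr: int, syndrome: int) -> bool:
        """Record a fail. Returns False if the syndrome does not match this trap."""
        if self.first_addr is None:
            # First failure: save its address and syndrome.
            self.first_addr = addr
            self.syndrome = syndrome
            return True
        if syndrome != self.syndrome:
            # Assumption: a different syndrome belongs to another trap copy.
            return False
        # Bits that differ from the first failing address are turned on
        # and stay on (the OR makes them sticky).
        self.xor_mask |= self.first_addr ^ addr
        return True


def classify(xor_mask: int, row_mask: int = ROW_MASK, col_mask: int = COL_MASK) -> str:
    """Guess the failure type from which address fields varied across fails."""
    rows_vary = bool(xor_mask & row_mask)
    cols_vary = bool(xor_mask & col_mask)
    if not rows_vary and not cols_vary:
        return "single cell"   # every fail hit the same address
    if cols_vary and not rows_vary:
        return "wordline"      # same row, different columns
    if rows_vary and not cols_vary:
        return "bitline"       # same column, different rows
    return "multiple rows and columns"


# Usage: three fails on the same row with the same syndrome.
trap = AddressTrap()
for failing_addr in (0x1200, 0x1201, 0x12F0):
    trap.record(failing_addr, syndrome=0x5)
print(classify(trap.xor_mask))
```

     Note that the mask records only which address bits varied across fails, not how many distinct addresses failed, so two isolated cells on one row produce the same mask as a wordline fail. That loss of detail is the trade made for using a handful of register bits rather than a full bit map.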