Context-aware Uncorrected Memory Error (UE) Handling

IP.com Disclosure Number: IPCOM000249163D
Publication Date: 2017-Feb-08
Document File: 4 page(s) / 37K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a technique that considers the location of a UE and the context in which the UE occurred to decide whether the system should crash. Context includes, but is not limited to, one or more of the following: time, user, the actual data that was corrupted, the workload running, and the status of the system. The proposed technique is based on the observation that, depending on the context in which the UE occurred, the resulting data corruption is in many situations either acceptable or harmless to the overall integrity of the system.

This is the abbreviated version, containing approximately 26% of the total text.

Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern computer systems. A memory error is an event that corrupts one or more bits in memory. Memory errors can be caused by electrical or magnetic interference (e.g., due to cosmic rays), by problems with the hardware (e.g., a permanently damaged bit), or by corruption along the data path between the memory and the processing elements.

Most enterprise systems employ several mechanisms to recover from these errors, implemented either in hardware or in software. At the hardware level, Error Correcting Codes (ECC) are used to recover from single-bit errors, and other techniques are used to recover from multi-bit errors. However, hardware cannot recover from every kind of memory error; for example, it cannot recover when the number of affected bits exceeds what ECC can correct. Memory errors that are automatically detected and corrected by hardware are called Corrected Errors (CEs), and memory errors that are detected by hardware but cannot be corrected are called Uncorrected Errors (UEs). UEs are passed on to the software (firmware, kernel) through a non-maskable interrupt. The software employs different methods to recover from a UE depending on its location; however, not all UEs can be recovered at the software level. Because a UE leads to data corruption, the firmware/OS panics whenever an unrecoverable UE is encountered, resulting in a system crash. Handling UEs is therefore important from a system availability standpoint.
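The distinction between corrected and uncorrected errors can be illustrated with a minimal sketch. It assumes a hypothetical ECC scheme that corrects at most `correctable_bits` flipped bits per protected word and that every error beyond that limit is still detected; the function name, the enum, and the default limit of one bit are illustrative assumptions and are not taken from the disclosure.

```python
from enum import Enum
from typing import Optional


class MemoryErrorKind(Enum):
    CORRECTED = "CE"      # detected and repaired by hardware
    UNCORRECTED = "UE"    # detected, but beyond what ECC can repair


def classify_memory_error(flipped_bits: int,
                          correctable_bits: int = 1) -> Optional[MemoryErrorKind]:
    """Classify an error by comparing the flipped bits to the ECC limit.

    Assumption: the error is always detected. Real ECC schemes can miss
    some multi-bit patterns, which this sketch does not model.
    """
    if flipped_bits < 0:
        raise ValueError("flipped_bits must be non-negative")
    if flipped_bits == 0:
        return None  # no error occurred
    if flipped_bits <= correctable_bits:
        return MemoryErrorKind.CORRECTED
    return MemoryErrorKind.UNCORRECTED
```

With the default limit, a single flipped bit is reported as a CE, while two or more flipped bits are reported as a UE and would be handed to the firmware or kernel.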

Proposed is a technique that avoids a system crash even in the case of a UE. The proposed solution takes into consideration the location of the UE and the context in which the UE occurred to decide whether the system should crash. Context includes, but is not limited to, one or more of the following: time, user, the actual data that was corrupted, the workload running, and the status of the system. Although UEs corrupt data, depending on the context in which the UE occurred, such corruption is in many situations either acceptable or harmless to the overall integrity of the system. A system or application crash can therefore be avoided in such cases.
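A decision of this kind might be sketched as follows. The abbreviated text does not give the actual decision rules, so everything here is a hypothetical illustration: the `UEContext` fields, the `"kernel"`/`"user"` location labels, the `tolerant_workloads` set, and the rules themselves are assumptions. Other context items named above, such as time and user, are omitted for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class UEContext:
    """Hypothetical description of where and when a UE occurred."""
    location: str            # assumed labels: "kernel" or "user"
    workload: str            # name of the workload that owns the data
    data_is_critical: bool   # whether the corrupted data must be exact


def should_crash(ctx: UEContext,
                 tolerant_workloads: FrozenSet[str] = frozenset()) -> bool:
    """Return True if the system should crash, False if it may continue.

    Illustrative policy only: crash by default, and continue only when
    the context indicates that the corruption is acceptable.
    """
    # Assumed rule: corruption in kernel memory is never tolerated.
    if ctx.location == "kernel":
        return True
    # Assumed rule: non-critical data may be corrupted without harm.
    if not ctx.data_is_critical:
        return False
    # Assumed rule: some workloads declare that they tolerate corruption.
    if ctx.workload in tolerant_workloads:
        return False
    return True
```

For example, under these assumed rules a UE in user memory holding non-critical data lets the system continue, whereas the same UE in kernel memory still leads to a crash. Defaulting to a crash keeps the sketch conservative: the system continues only when the context positively indicates that the corruption is acceptable.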

Errors in dynamic random access memory (DRAM) devices have long been a major concern. In many production environments, a single UE is considered serious enough to replace the DIMM that caused it. Memory errors are therefore costly, both in the system failures they cause and in the associated repair costs. According to the study, memory errors are one of the most common hardware problems that lead to machine crashes. There is also a concern that increasing densities in DRAM technology might lead to more memory errors, exacerbating this problem in the future.

The proposed technique is to avoid syste...