Memory Error Detection Using Light-Weight Software Approach
Original Publication Date: 2005-Aug-16
Included in the Prior Art Database: 2005-Aug-16
Problem: Memory error detection and logging need the ability to apply a threshold to certain errors. The threshold acts as a filter so that critical errors can be predicted while occasional errors are not treated as critical.
In existing implementations, the memory controller is configured to invoke a system service either on every error or once a specified number of errors has occurred. If the service is invoked for each error, software keeps a running count by incrementing a total on each invocation. Alternatively, if the memory controller supports it, the system service is invoked only after a specified number of errors have occurred. By noting the span of time over which the errors occurred, the system service can decide whether the error is critical. Frequent errors suggest that the memory module is likely to experience a critical uncorrectable error. Infrequent errors are considered correctable and do not constitute a potential problem.
Both approaches require the memory controller to invoke a system service upon detecting one or more memory errors. If the memory controller lacks this ability, there is no way to invoke a system service as a result of a memory error, so some form of polling must be used. Typically this takes the form of a periodic system service that checks the error status to determine whether any errors have occurred since its last invocation.
With any polling implementation a trade-off must be made between latency and overhead. If response latency is the priority, a high polling rate is required, which results in high service overhead. If the polling period is too short, user applications may experience reduced performance, and errors may be introduced in the form of system time skew. At the other extreme, if the period is too long, memory errors take longer to detect, and smaller bursts of errors may go undetected.
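The time-window check used by the existing interrupt-driven approaches can be illustrated with a minimal C sketch. It is not taken from any particular implementation: the names error_window and record_error, the WINDOW_SLOTS size, and the tick-based time unit are all assumptions made for illustration. The function records one error and reports whether a given number of errors, including the new one, fall within a given time span.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_SLOTS 8  /* hypothetical: error timestamps remembered per module */

/* Hypothetical per-module record of recent error timestamps. */
typedef struct {
    uint64_t stamps[WINDOW_SLOTS]; /* ring buffer of error times, in ticks */
    size_t   next;                 /* slot the next error will overwrite */
    size_t   used;                 /* number of valid entries */
} error_window;

/* Record one error at time `now` (assumed non-decreasing) and report whether
 * `threshold` errors, including this one, occurred within `span` ticks. */
static bool record_error(error_window *w, uint64_t now,
                         size_t threshold, uint64_t span)
{
    assert(w != NULL && threshold >= 1 && threshold <= WINDOW_SLOTS);

    w->stamps[w->next] = now;
    w->next = (w->next + 1) % WINDOW_SLOTS;
    if (w->used < WINDOW_SLOTS)
        w->used++;

    if (w->used < threshold)
        return false;               /* too few errors seen so far */

    /* The threshold-th most recent error sits `threshold` slots behind `next`. */
    size_t oldest = (w->next + WINDOW_SLOTS - threshold) % WINDOW_SLOTS;
    return now - w->stamps[oldest] <= span;
}
```

With a threshold of 3 and a span of 100 ticks, errors that arrive thousands of ticks apart are treated as occasional, while three errors within 100 ticks are flagged as potentially critical.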
Contribution: Light-weight software method for detecting and characterizing memory errors in systems without full hardware support for detection and reporting of such errors.
Benefit 1: Eliminates need for intrusive high polling rates. This improves system performance.
Benefit 2: Detects critical errors within a single polling interval. This improves reliability of the system.
Our approach improves the timeliness of error detection in system service
routines that use polling in lieu of interrupt generation. If an error is detected during the
periodic system service, a focused characterization of the affected memory is performed. This
characterization allows potentially critical errors to be detected within a single polling
event. The method is intended for systems that can detect memory
errors but cannot invoke a system service to characterize the severity of the
error. The system is assumed to be able to capture the address of at least one of the
memory accesses that caused the detected memory error, for example the address
of the first or last error. A high-level description of the steps is listed below.
1) Periodic invocation of system service.
2) System service checks for the presence of an error.
3) If an error exists, the system service obtains the address of at least one of the errors.
4) Clear error status.
5) A tight loop is executed to access the failing address. Following each access, the cache line containing the failing address is flushed using the CLFLUSH instruction or an equivalent, so that the next access is served from memory rather than from the cache.
6) Following each cache line flush, check for the presence of a memory error and increment the error count as required.
7) Clear error status.
8) If the number of errors is greater than an acceptable th...
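Steps 2 through 7 above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation. The memerr_ops hooks, the memerr_poll function, the probes parameter, and the sim_ctrl simulated controller are all hypothetical names introduced here. On real hardware the hooks would be backed by memory-controller registers, and on x86 the flush hook might wrap the _mm_clflush intrinsic from <emmintrin.h>. The sketch also assumes a sticky error status bit, so it clears the status after each counted error inside the loop. It returns the count and leaves any comparison against a threshold to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware-access hooks; none of these names come from the source. */
typedef struct {
    bool      (*error_pending)(void *ctx);              /* is the error status set? */
    uintptr_t (*error_address)(void *ctx);              /* captured failing address */
    void      (*clear_status)(void *ctx);               /* clear the error status */
    void      (*touch)(void *ctx, uintptr_t addr);      /* access the address */
    void      (*flush_line)(void *ctx, uintptr_t addr); /* CLFLUSH or equivalent */
} memerr_ops;

/* One invocation of the periodic service. Returns the number of errors seen
 * over `probes` accesses, or -1 if no error was pending on entry. */
static int memerr_poll(const memerr_ops *ops, void *ctx, int probes)
{
    assert(ops != NULL && probes >= 0);

    if (!ops->error_pending(ctx))              /* step 2 */
        return -1;
    uintptr_t addr = ops->error_address(ctx);  /* step 3 */
    ops->clear_status(ctx);                    /* step 4 */

    int errors = 0;
    for (int i = 0; i < probes; i++) {
        ops->touch(ctx, addr);                 /* step 5: access */
        ops->flush_line(ctx, addr);            /* step 5: flush the cache line */
        if (ops->error_pending(ctx)) {         /* step 6 */
            errors++;
            ops->clear_status(ctx);            /* assumption: status bit is sticky */
        }
    }
    ops->clear_status(ctx);                    /* step 7 */
    return errors;
}

/* Simulated controller for illustration only: every `fail_every`-th read
 * that reaches memory at `bad_addr` sets the error status. */
typedef struct {
    uintptr_t bad_addr;
    int       fail_every;
    int       reads;   /* reads that reached memory */
    bool      status;  /* simulated error status bit */
    bool      cached;  /* is the line currently in the cache? */
} sim_ctrl;

static bool sim_pending(void *c) { return ((sim_ctrl *)c)->status; }
static uintptr_t sim_address(void *c) { return ((sim_ctrl *)c)->bad_addr; }
static void sim_clear(void *c) { ((sim_ctrl *)c)->status = false; }
static void sim_flush(void *c, uintptr_t a) { (void)a; ((sim_ctrl *)c)->cached = false; }

static void sim_touch(void *c, uintptr_t a)
{
    sim_ctrl *s = c;
    if (s->cached)
        return;                    /* cache hit: memory is not read */
    s->cached = true;
    if (a == s->bad_addr) {
        s->reads++;
        if (s->reads % s->fail_every == 0)
            s->status = true;
    }
}

static const memerr_ops sim_ops = {
    sim_pending, sim_address, sim_clear, sim_touch, sim_flush
};
```

In the simulation, a line that is still cached on the first access produces no memory read, which is why each access is followed by a flush: without it, repeated accesses would hit the cache and never exercise the failing memory location.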