Method and Apparatus for Using TTA Vectors to Avoid System Failures
Disclosure Number: IPCOM000244481D
Publication Date: 2015-Dec-15
Document File: 3 page(s) / 53K

Publishing Venue

The Prior Art Database


Disclosed is a method to minimize the main memory failure rate by closely and actively monitoring the temperature, traffic, and air flow vectors of the memory subsystem. This data set is periodically evaluated against pre-characterized data to enable graceful data movement or other suitable mitigation in both predicted and actual fault scenarios.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

Method and Apparatus for Using TTA Vectors to Avoid System Failures

Today's systems offer many techniques for fault management, but most of them are applied only after a component has failed. Some of the features used today are the following:

"DRAM sparing" after a DRAM module becomes unusable

"Rank sparing" if a rank has become faulty such that its errors have grown beyond what ECC can correct

A DIMM is called out and guarded, based on the occurrence of an uncorrectable error (UE)

Throttling of performance when the buffer/DRAM temperature reaches a threshold (which adversely affects performance)

"Chip mark" of faulty

    All of these approaches are reactive, and once applied, each either degrades RAS or costs performance. Currently, when errors or failures occur due to excessive DRAM temperature, the DIMM is called out only when the number of correctable errors reaches a specific threshold. In some cases errors occur continually but not often enough to trip the limit, so the count is reset and the cycle repeats until an unrecoverable failure occurs. This impacts the client's workload, sometimes catastrophically (i.e., a system crash). Handling of such situations can be improved to avoid DIMM call-out and, therefore, sustain system performance.
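
    The gap in the threshold-and-reset approach can be illustrated with a minimal Python sketch. The function name, the per-window error counts, and the threshold value are hypothetical and chosen only for illustration; the disclosure does not specify them.

```python
def reactive_callout(ce_per_window, threshold):
    """Return the index of the first window whose correctable-error (CE)
    count reaches the threshold, or None if the DIMM is never called out.

    Assumption: the CE counter is reset at the start of every window,
    so each window is judged on its own count only.
    """
    for index, count in enumerate(ce_per_window):
        if count >= threshold:
            return index
    return None


# Hypothetical CE counts for a DIMM that runs hot: errors are continual,
# but each window stays just under the limit.
steady_errors = [7, 8, 9, 8, 9, 9]

print(reactive_callout(steady_errors, threshold=10))  # None
print(sum(steady_errors))                             # 50
```

    In this made-up trace the DIMM accumulates 50 correctable errors without ever being flagged, which is the situation the paragraph above describes.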

    A thermal controller (TC) periodically checks the health check (HC) status of the memory subsystem components (controller, buffer, and DRAM) as follows:
a) Collect the temperature of every DIMM in the system and the associated traffic volume (GB/s) to each DIMM
b) Compute "Temperature/Traffic" with respect to "Air flow" (call this the TTA vector) for each DIMM module over a time window
c) Compare the TTA vector with pre-characterized data and evaluate it against a predetermined safe threshold, say N. If one or more DIMMs have a TTA above N, proceed to steps (1) through (3). Otherwise, return to step (a).

(1) Run the HC process more frequently on the affected DIMM modules
(2) If step (1) shows that the TTA above N has settled at some value or keeps increasing, the TC informs the hypervisor to initiate the control action
(3) The hypervisor requests the memory controller(s) to mark the logical memory blocks (LMBs) belonging to the affected DIMM(s) and performs the following steps:

3a) Start gracefully migrating applications of the affected LMBs to another portion of memory, if memory...
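
    The monitoring flow in steps (a) through (c) and (1) through (2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation. The text does not give a formula for the TTA vector, so the sketch assumes it is the mean of (temperature / traffic) / air flow over the window. The names Sample, tta, next_action, and CONFIRM_CHECKS are hypothetical, as is the rule that three consecutive non-decreasing readings above N count as "settled or increasing".

```python
from dataclasses import dataclass

CONFIRM_CHECKS = 3  # assumed number of consecutive checks above N


@dataclass
class Sample:
    temp_c: float        # DIMM temperature
    traffic_gbps: float  # traffic volume to the DIMM (GB/s)
    airflow: float       # air flow reading (units are an assumption)


def tta(samples):
    """Assumed TTA metric: mean of (temperature / traffic) / air flow."""
    values = [
        (s.temp_c / s.traffic_gbps) / s.airflow
        for s in samples
        if s.traffic_gbps > 0 and s.airflow > 0  # skip idle or invalid samples
    ]
    return sum(values) / len(values) if values else 0.0


def next_action(tta_history, n):
    """Decide what the thermal controller does for one DIMM.

    tta_history holds TTA values from successive health checks, oldest first.
    Returns "normal" (back to step a), "watch" (step 1: check more often),
    or "migrate" (step 2: inform the hypervisor).
    """
    if not tta_history or tta_history[-1] <= n:
        return "normal"
    recent = tta_history[-CONFIRM_CHECKS:]
    persistent = len(recent) == CONFIRM_CHECKS and all(v > n for v in recent)
    not_falling = all(b >= a for a, b in zip(recent, recent[1:]))
    return "migrate" if persistent and not_falling else "watch"


window = [Sample(80.0, 4.0, 10.0), Sample(90.0, 3.0, 10.0)]
print(tta(window))                          # 2.5
print(next_action([1.0, 2.5], n=2.0))       # watch
print(next_action([2.5, 2.5, 2.6], n=2.0))  # migrate
```

    Keeping the decision in a pure function of the TTA history means the same logic can be driven by either the normal or the accelerated HC interval. The threshold N and the pre-characterized data would come from system characterization and are not modeled here.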