
Recovery of core in case of persistent cache error
Disclosure Number: IPCOM000240409D
Publication Date: 2015-Jan-29
Document File: 3 page(s) / 38K

Publishing Venue

The Prior Art Database


Disclosed is a method to mitigate the loss of a processor core due to a flood of errors anywhere in its cache hierarchy. Often in an SMP system, if a core consistently reports cache errors, it is marked as bad. Core and caches are tightly coupled for optimal performance, and this tight coupling prevents a core with a faulty cache from functioning reliably. Using the method below, it is possible to keep using such a core for specific kinds of applications. Once firmware error diagnosis concludes that the core is faulty, it should disable that core. Hypervisor and hardware must then ensure that: 1. any access to the core fails gracefully; 2. all other caches in the core start acting as private caches of the given core; 3. the core operates on an exclusive copy of main memory; 4. the core is allocated tasks which are not memory intensive.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.


Recovery of core in case of persistent cache error

For optimal performance, SMP operation needs support from the underlying hardware. Any modern processor intended for an SMP system invariably has the following two attributes:

1.1 Multiple Cores

1.2 Multiple levels of cache.

Caches are intended to bridge the operational speed gap between processor and memory.

A multi-level cache is intended to maximize the benefit obtainable from caching. This is achieved by placing an L1 cache next to the core, an L2 cache close to the core, and an L3 cache at a greater distance from the core. Multi-level caching helps achieve high efficiency without overcomplicating the design or making it prohibitively expensive.
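The benefit of stacking cache levels can be illustrated with the standard average-access-time calculation. The sketch below is illustrative only: the function name, the latencies, and the hit rates are hypothetical and do not come from the disclosure. It also assumes, as a simplification, that each level's hit rate stays the same when another level is removed.

```python
def average_access_time(levels, memory_latency):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_latency, hit_rate) tuples, ordered from the
            cache nearest the core outward.
    memory_latency: cost of an access that misses every cache level.
    """
    total = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for hit_latency, hit_rate in levels:
        total += reach * hit_latency
        reach *= 1.0 - hit_rate
    # Whatever misses every level pays the main-memory latency.
    total += reach * memory_latency
    return total


# Hypothetical figures in core cycles, chosen only for illustration.
L1 = (4, 0.90)
L2 = (12, 0.80)
L3 = (40, 0.75)
MEMORY = 200

full = average_access_time([L1, L2, L3], MEMORY)
without_l2 = average_access_time([L1, L3], MEMORY)
no_cache = average_access_time([], MEMORY)
```

With these made-up numbers the model gives about 7 cycles per access with all three levels, about 13 with L2 removed, and 200 with no cache at all, which is why a core that loses part of its cache hierarchy is best given work that is not memory intensive.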

These caches are prone to various kinds of read and write errors. To improve overall performance, caches are very tightly coupled with the core. As a result, when any level of cache persistently reports errors, operation of the core is severely impacted. For reliable and smooth operation, workloads from such a core are often moved to some other healthy core. Cores reporting these cache errors are permanently marked as bad.
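One way firmware might distinguish a persistent error from an isolated one is to count errors inside a sliding time window. The disclosure does not say how persistence is decided, so the following is only a sketch under that assumption; the class name, the threshold, and the window length are all hypothetical.

```python
from collections import deque


class CacheErrorMonitor:
    """Flags a cache as persistently failing when too many errors
    arrive within a sliding time window (assumed policy)."""

    def __init__(self, threshold, window):
        self.threshold = threshold  # hypothetical: errors tolerated per window
        self.window = window        # hypothetical: window length in seconds
        self.timestamps = deque()

    def record_error(self, now):
        """Record one error at time `now`; return True if the cache
        should now be treated as persistently failing."""
        self.timestamps.append(now)
        # Drop errors that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
        return len(self.timestamps) > self.threshold


monitor = CacheErrorMonitor(threshold=3, window=10.0)
# Widely spaced errors never accumulate...
isolated = [monitor.record_error(t) for t in (0.0, 20.0, 40.0)]
# ...while a burst of errors crosses the threshold.
burst = [monitor.record_error(t) for t in (50.0, 51.0, 52.0, 53.0)]
```

In this example the three isolated errors are tolerated, and only the fourth error of the burst trips the monitor.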

Cores are a critical and expensive hardware resource. For a small low-end system, the loss of a core can be significant. It can be avoided by

- making the design more immune to errors

- identifying ways in which the core can still be useful.

The idea below focuses on identifying ways in which such a core can still be useful.

For cores to operate efficiently, the current trend is to promote tight coupling between core and cache. As a result, failure of one unit can often make the other useless.

There are situations in which firmware detects a persistent cache error that cannot be fixed through the available means of repair. Let us call such a situation CEF. Usually, in CEF, the cache is considered bad. Since cores are closely associated with caches, the core is marked bad as well. As a result, we lose an expensive computational resource primarily because of a bad part which is relatively cheaper and relatively unintelligent.

To avoid this loss of a resource due to a bad cache, firmware shall configure the core to operate in a trim-cache mode or a cache-less mode. In this mode, the core shall rely either on a higher level of cache private to the core or on a chunk of main memory with exclusive access. This can be done by following all the steps below:
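The choice between the two modes can be sketched as a small decision function. This is a rough illustration, not the disclosed firmware logic: the names `CoreConfig` and `plan_core_mode`, the three-level hierarchy, and the rule that every cache level not reported as failed stays usable are assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed hierarchy; a real processor may have more or fewer levels.
CACHE_LEVELS = ("L1", "L2", "L3")


@dataclass(frozen=True)
class CoreConfig:
    mode: str             # "normal", "trim-cache" or "cache-less"
    usable_caches: tuple  # cache levels the core may still rely on


def plan_core_mode(failed_levels):
    """Pick an operating mode for a core given its failed cache levels.

    "trim-cache": at least one healthy level remains and is used as a
    cache private to the core.
    "cache-less": no healthy level remains, so the core works directly
    on a chunk of main memory to which it has exclusive access.
    """
    failed = set(failed_levels)
    unknown = failed - set(CACHE_LEVELS)
    if unknown:
        raise ValueError(f"unknown cache level(s): {sorted(unknown)}")
    usable = tuple(lvl for lvl in CACHE_LEVELS if lvl not in failed)
    if not failed:
        return CoreConfig("normal", usable)
    if usable:
        return CoreConfig("trim-cache", usable)
    return CoreConfig("cache-less", ())


l2_failed = plan_core_mode(["L2"])
all_failed = plan_core_mode(["L1", "L2", "L3"])
```

Here a core with only L2 reported as bad would continue in trim-cache mode with L1 and L3, while a core with no healthy cache left would fall back to cache-less mode.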

Step No Description



1    Say core C1 is consistently reporting an L2 cache error.

2    Service processor firmware handles the cache error and takes corrective action. The corrective action can be, but is not limited to, one from the following list:

3.1 just ignore it for some time