Method to Prioritize and Recommend FRU Replacement Based on Workload Needs

IP.com Disclosure Number: IPCOM000242592D
Publication Date: 2015-Jul-28
Document File: 3 page(s) / 31K

Publishing Venue

The IP.com Prior Art Database

Abstract

Described is a method to prioritize and recommend FRU replacement based on workload needs.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.

Servers are built with a number of field replaceable units (FRUs). As part of the Reliability, Availability and Serviceability (RAS) strategy for the servers, these FRUs can be isolated when errors are detected. FRU isolation prevents the workload from using faulty parts, which could lead to further failures. These FRUs can be isolated at runtime when a fatal error that does not bring the system down is encountered with the FRU, or when recoverable errors on the FRU reach an error threshold. To isolate the FRU, the service processor requests the hypervisor to stop the workloads from using the identified resource. Once the resource is no longer in use, the service processor marks it as unusable, thereby isolating the resource. Some of these resources can instead be marked as to-be-isolated when the system boots the next time. In this case, the current workload continues to use the resource until the next boot cycle. The distinction between these two use-cases is made based on the severity of the errors detected on these FRUs.
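The runtime and next-boot isolation paths described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the class names, the threshold of 3, and the `defer` flag (standing in for the severity-based choice, whose rules the text does not give) are all assumptions.

```python
class Hypervisor:
    """Hypothetical stand-in that tracks which FRUs workloads are using."""

    def __init__(self, in_use):
        self.in_use = set(in_use)

    def stop_using(self, fru):
        # Move workloads off the resource so it can be isolated.
        self.in_use.discard(fru)


class ServiceProcessor:
    """Hypothetical model of the isolation decisions described in the text."""

    def __init__(self, hypervisor, threshold=3):
        self.hypervisor = hypervisor
        self.threshold = threshold         # assumed recoverable-error limit
        self.recoverable_counts = {}       # FRU name -> recoverable errors seen
        self.unusable = set()              # FRUs isolated right now
        self.isolate_at_next_boot = set()  # FRUs still in use until reboot

    def report_error(self, fru, fatal, defer=False):
        """Return 'monitor', 'deferred', or 'isolated' for one error report."""
        if not fatal:
            count = self.recoverable_counts.get(fru, 0) + 1
            self.recoverable_counts[fru] = count
            if count < self.threshold:
                return "monitor"           # below threshold: keep using the FRU
        if defer:
            # Lower-severity case: workload keeps the FRU until the next boot.
            self.isolate_at_next_boot.add(fru)
            return "deferred"
        # Runtime isolation: free the resource first, then mark it unusable.
        self.hypervisor.stop_using(fru)
        self.unusable.add(fru)
        return "isolated"

    def boot(self):
        """Apply deferred isolations at the start of the next boot cycle."""
        for fru in self.isolate_at_next_boot:
            self.hypervisor.stop_using(fru)
        self.unusable |= self.isolate_at_next_boot
        self.isolate_at_next_boot.clear()
```

In this sketch, a recoverable error on the same FRU is only monitored until the count reaches the threshold, while a fatal error triggers isolation on the first report. A deferred FRU remains in the hypervisor's in-use set until `boot()` is called.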

    Further, the isolation of these FRUs can be marked as persistent, which means the resource will stay isolated across any number of server boot cycles until the resource is physically replaced. In some cases, the isolation can be done only for the current boot cycle, and it is left for the diagnostics to check the resource during the next boot cycle for occurrence of similar errors. This distinction is also made by the service processor depending upon the errors detected on these FRUs.
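The difference between persistent and current-boot-only isolation can be illustrated with a small sketch. The `IsolationTable` class and the `passes_diagnostics` callback are hypothetical names, and keeping a FRU isolated when diagnostics fail at boot is an assumption; the text only says that diagnostics re-check the resource.

```python
class IsolationTable:
    """Tracks isolated FRUs and whether each isolation is persistent."""

    def __init__(self):
        self._persistent = {}  # FRU name -> True if isolation is persistent

    def isolate(self, fru, persistent):
        self._persistent[fru] = persistent

    def isolated(self):
        return set(self._persistent)

    def on_boot(self, passes_diagnostics):
        """Re-check current-boot-only isolations; persistent ones are kept."""
        for fru, persistent in list(self._persistent.items()):
            if persistent:
                continue  # stays isolated until physically replaced
            if passes_diagnostics(fru):
                del self._persistent[fru]  # no similar errors: usable again
            # Assumption: a FRU that fails diagnostics stays isolated.

    def on_physical_replacement(self, fru):
        """A replaced FRU is no longer isolated, persistent or not."""
        self._persistent.pop(fru, None)
```

With one persistent and one non-persistent entry, a boot in which diagnostics pass releases only the non-persistent FRU; the persistent one is released only by `on_physical_replacement`.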

    When a certain resource/FRU gets isolated, any other resource wired behind this isolated FRU also cannot be accessed. Hence, these FRUs are also isolated. This is referred to as isolated-by-association. For example, the figure below shows a typical multi-node server design. Suppose Socket-1 encounters an error. In this case, the DIMMs behind this socket will not be accessible and, hence, they are isolated-by-association along with the socket itself. On the other hand, if only the DIMMs had encountered an error, only the specific DIMM will be isolated.

[Figure: typical multi-node server design. Node 0 contains Socket 1 and Socket 2; Node 1 contains Socket 3 and Socket 4. Each socket has DIMMs 1-8, an Accelerator and PCIe Lanes. The figure also shows A-Bus links and a Service Processor with Flash and DRAM.]
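Isolation-by-association amounts to walking the wiring topology from the failed FRU to everything behind it. The sketch below assumes a simplified, hypothetical topology map (two DIMMs per socket, with names such as `Socket-1/DIMM-1` invented for illustration); it is not the actual design from the figure.

```python
def isolate_with_association(failed_fru, wired_behind):
    """Return {fru: reason} for the failed FRU and everything behind it."""
    reasons = {failed_fru: "error"}
    pending = list(wired_behind.get(failed_fru, []))
    while pending:
        fru = pending.pop()
        if fru not in reasons:
            # Unreachable once its parent is isolated, so isolate it too.
            reasons[fru] = "by-association"
            pending.extend(wired_behind.get(fru, []))
    return reasons


# Hypothetical, simplified topology: parent FRU -> FRUs wired behind it.
WIRED_BEHIND = {
    "Socket-1": ["Socket-1/DIMM-1", "Socket-1/DIMM-2"],
    "Socket-2": ["Socket-2/DIMM-1", "Socket-2/DIMM-2"],
}

socket_failure = isolate_with_association("Socket-1", WIRED_BEHIND)
dimm_failure = isolate_with_association("Socket-1/DIMM-1", WIRED_BEHIND)
```

Here `socket_failure` covers Socket-1 and both of its DIMMs, while `dimm_failure` contains only the single DIMM, matching the two cases in the example above.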

    All these errors are reported using serviceable error logs to a management console from where repair actions are initiated. Currently, the repair actions are handled either as per discretion of the service engineer or on a first-come, first-serve basis where the first reported problem is repaired first. This repair strate...