Method to Prioritize and Recommend FRU Replacement Based on Maintenance Window and Time Taken to Replace Parts

IP.com Disclosure Number: IPCOM000242590D
Publication Date: 2015-Jul-28
Document File: 3 page(s) / 33K

Publishing Venue

The IP.com Prior Art Database

Abstract

Described is a method to prioritize and recommend FRU replacement based on maintenance window and time taken to replace parts.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 48% of the total text.

Method to Prioritize and Recommend FRU Replacement Based on Maintenance Window and Time Taken to Replace Parts

Servers are built with a number of field replaceable units (FRUs). As part of the Reliability, Availability and Serviceability (RAS) strategy for the servers, these FRUs can be isolated when errors are detected. FRU isolation prevents the workload from using faulty parts, which could lead to further failures. An FRU can be isolated at runtime when it encounters a fatal error that does not bring the system down, or when recoverable errors on it reach an error threshold. To isolate the FRU, the service processor requests the hypervisor to stop the workloads from using the identified resource. Once the resource is no longer in use, the service processor marks it as unusable, thereby isolating the resource. Some of these resources can instead be marked as to-be-isolated when the system boots the next time. In this case, the current workload continues to use the resource until the next boot cycle. The distinction between these two use cases is made based on the severity of the errors detected on these FRUs.
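
The runtime isolation flow described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `Fru` class, the function names, and the threshold value of 3 are assumptions, since the text does not give a threshold or an interface.

```python
from dataclasses import dataclass

RECOVERABLE_THRESHOLD = 3  # assumed value; the source does not specify one


@dataclass
class Fru:
    name: str
    recoverable_errors: int = 0
    in_use: bool = True   # a workload is currently using the resource
    usable: bool = True   # cleared by the service processor on isolation


def record_error(fru: Fru, fatal: bool) -> bool:
    """Record one error and return True if the FRU should be isolated at runtime.

    A fatal (but non-crashing) error isolates immediately; a recoverable
    error isolates only once the error threshold is reached.
    """
    if fatal:
        return True
    fru.recoverable_errors += 1
    return fru.recoverable_errors >= RECOVERABLE_THRESHOLD


def isolate_at_runtime(fru: Fru) -> None:
    # Step 1: the hypervisor stops workloads from using the resource.
    fru.in_use = False
    # Step 2: once it is unused, the service processor marks it unusable.
    fru.usable = False


dimm = Fru("DIMM-3")  # hypothetical FRU name
decisions = [record_error(dimm, fatal=False) for _ in range(3)]
if decisions[-1]:
    isolate_at_runtime(dimm)
```

In this sketch the third recoverable error reaches the assumed threshold, so the DIMM is isolated only after that error.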

    Further, the isolation of these FRUs can be marked as persistent which means the resource will stay isolated across any number of server boot cycles until the resource is physically replaced. In some cases, the isolation can be done only for the current boot cycle, and it is left for the diagnostics to check the resource during the next boot cycle for occurrence of similar errors. This distinction is also done by the service processor depending upon the errors detected on these FRUs.
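
The difference between persistent and current-boot-only isolation can be illustrated with a small sketch. The `IsolationRecord` type and the `records_after_reboot` helper are hypothetical names introduced here for illustration; the source does not describe how the service processor stores this state.

```python
from dataclasses import dataclass


@dataclass
class IsolationRecord:
    fru: str
    persistent: bool  # True: stays isolated until the part is physically replaced


def records_after_reboot(records, replaced=frozenset()):
    """Return the isolation records that survive one boot cycle.

    Non-persistent records are dropped so that diagnostics can re-check
    the resource; persistent records are dropped only after replacement.
    """
    return [r for r in records if r.persistent and r.fru not in replaced]


records = [
    IsolationRecord("Socket-1", persistent=True),
    IsolationRecord("DIMM-5", persistent=False),
]
after_boot = records_after_reboot(records)
after_repair = records_after_reboot(after_boot, replaced={"Socket-1"})
```

Here `Socket-1` remains isolated across the reboot and is released only once it is reported as replaced, while `DIMM-5` is handed back to diagnostics at the next boot.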

    When a certain resource/FRU gets isolated, any other resource behind this isolated FRU cannot be accessed. Hence, these FRUs are also isolated. This is referred to as isolated-by-association. For example, the figure below shows a typical multi-node server design. Suppose Socket-1 encounters an error. In this case, the DIMMs behind this socket will not be accessible and, hence, they are isolated-by-association along with the socket itself. On the other hand, if only the DIMMs had encountered an error, only the specific DIMM will be isolated.


[Figure: a typical multi-node server design. Labels recoverable from the figure: Node 0 with Socket 1 and Socket 2; Node 1 with Socket 3 and Socket 4; DIMM groups numbered 1 to 8; Accelerator; PCIe Lanes; A-Bus; Service Processor; Flash; DRAM.]
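
Isolation-by-association amounts to collecting everything reachable only through the failed FRU. The sketch below assumes a simple parent-to-children map; the `CHILDREN` topology, the FRU names, and the `isolation_set` function are illustrative and are not taken from the disclosure.

```python
# Hypothetical topology: parent FRU -> resources reachable only through it.
CHILDREN = {
    "Node-0": ["Socket-1", "Socket-2"],
    "Socket-1": ["S1-DIMM-1", "S1-DIMM-2"],
    "Socket-2": ["S2-DIMM-1", "S2-DIMM-2"],
}


def isolation_set(failed: str, children=CHILDREN) -> list:
    """Return the failed FRU followed by everything isolated by association."""
    result, stack = [], [failed]
    while stack:
        fru = stack.pop()
        result.append(fru)
        # reversed() keeps the output in the declared child order
        stack.extend(reversed(children.get(fru, [])))
    return result


socket_failure = isolation_set("Socket-1")
dimm_failure = isolation_set("S1-DIMM-1")
```

A socket error takes the DIMMs behind that socket with it, whereas a DIMM error isolates only that DIMM, matching the two cases in the example above.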

    All these errors are reported using serviceable error logs to a management console from where repair actions are initiated. Currently, the repair actions are handled either at the discretion of the service engineer or on a first-come, first-served basis where the first reported problem is repaired first. Additionally, the service engineer considers the cost of the FRUs to be replaced, thereby replacing the less expensive FRU first.
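
The two existing orderings, first-come first-served and least-expensive-first, can be sketched as simple sorts over pending repair requests. The `RepairRequest` fields, FRU names, and cost figures below are made up for illustration only.

```python
from dataclasses import dataclass


@dataclass
class RepairRequest:
    fru: str
    reported_order: int  # 1 = first error log to reach the management console
    part_cost: float     # illustrative cost units, not real prices


def first_come_first_served(requests):
    """Repair in the order the problems were reported."""
    return sorted(requests, key=lambda r: r.reported_order)


def cheapest_first(requests):
    """Repair the less expensive FRU first."""
    return sorted(requests, key=lambda r: r.part_cost)


requests = [
    RepairRequest("Socket-1", reported_order=1, part_cost=900.0),
    RepairRequest("DIMM-3", reported_order=2, part_cost=120.0),
    RepairRequest("Accelerator-2", reported_order=3, part_cost=400.0),
]
fcfs = [r.fru for r in first_come_first_served(requests)]
by_cost = [r.fru for r in cheapest_first(requests)]
```

Neither ordering takes the length of the maintenance window or the time needed to replace each part into account.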

This repair strategy may not be optimum for...