
Method to Prioritize and Recommend FRU Replacement Based on Amount of Resource That Will be Gained by FRU Replacement.

IP.com Disclosure Number: IPCOM000242589D
Publication Date: 2015-Jul-28
Document File: 3 page(s) / 36K

Publishing Venue

The IP.com Prior Art Database

Abstract

Method to Prioritize and Recommend FRU Replacement Based on Amount of Resource That Will be Gained by FRU Replacement.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.

Page 01 of 3

Method to Prioritize and Recommend FRU Replacement Based on Amount of Resource That Will be Gained by FRU Replacement.

Servers are built with a number of field replaceable units (FRUs). As part of the Reliability, Availability and Serviceability (RAS) strategy for the servers, these FRUs can be isolated when errors are detected. FRU isolation prevents the workload from using faulty parts, which could lead to further failures. These FRUs can be isolated at runtime when a fatal error is encountered with the FRU that does not bring the system down, or when a recoverable error is encountered upon reaching an error threshold. To isolate the FRU, the service processor requests the hypervisor to stop the workloads from using the identified resource. Once the resource is no longer in use, the service processor marks it as unusable, thereby isolating the resource. Some of these resources can instead be marked as to-be-isolated when the system boots the next time. In this case, the current workload continues to use the resource until the next boot cycle. The distinction between these two use-cases is made based on the severity of the errors detected on these FRUs.
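The runtime isolation triggers described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `FruMonitor`, `Severity`, and `report`, and the threshold value of 3, are hypothetical and do not come from the disclosure, and the hypervisor handshake is reduced to a comment.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    RECOVERABLE = "recoverable"
    FATAL = "fatal"  # fatal for the FRU, but the system stays up


@dataclass
class FruMonitor:
    """Hypothetical stand-in for the service processor's error bookkeeping."""
    threshold: int = 3  # assumed value; the disclosure gives no number
    counts: dict = field(default_factory=dict)
    isolated: set = field(default_factory=set)

    def report(self, fru: str, severity: Severity) -> bool:
        """Record one error; return True if this report isolates the FRU."""
        if fru in self.isolated:
            return False  # already isolated, nothing more to do
        if severity is Severity.FATAL:
            self._isolate(fru)
            return True
        # Recoverable errors isolate the FRU only once the threshold is reached.
        self.counts[fru] = self.counts.get(fru, 0) + 1
        if self.counts[fru] >= self.threshold:
            self._isolate(fru)
            return True
        return False

    def _isolate(self, fru: str) -> None:
        # In the described flow, the hypervisor is first asked to stop
        # workloads from using the resource; only then is it marked unusable.
        self.isolated.add(fru)
```

With a threshold of 3, the first two recoverable errors on a DIMM leave it in service and the third isolates it, while a single fatal error isolates an FRU immediately.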

    Further, the isolation of these FRUs can be marked as persistent, which means the resource will stay isolated across any number of server boot cycles until the resource is physically replaced. In some cases, the isolation applies only to the current boot cycle, and it is left to the diagnostics to check the resource during the next boot cycle for occurrence of similar errors. This distinction is also made by the service processor depending upon the errors detected on these FRUs.
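The difference between persistent and current-boot isolation can be illustrated as follows. This is a minimal sketch, assuming a simple list of isolation records; `IsolationRecord`, `records_after_reboot`, and `records_after_replacement` are hypothetical names, not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass
class IsolationRecord:
    fru: str
    persistent: bool  # True: survives reboots until the FRU is replaced


def records_after_reboot(records):
    """Keep persistent isolations; boot-scoped ones are dropped so that
    diagnostics can re-check those resources during the next boot."""
    return [r for r in records if r.persistent]


def records_after_replacement(records, replaced_fru):
    """Physically replacing an FRU clears its isolation record."""
    return [r for r in records if r.fru != replaced_fru]
```

For example, given a persistent record for a socket and a boot-scoped record for a DIMM, only the socket record remains after a reboot, and it disappears once the socket is replaced.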

    When a certain resource/FRU gets isolated, any other resource wired behind this isolated FRU also cannot be accessed. Hence, these FRUs also need to be isolated, which is referred to as isolated-by-association. For example, the figure below shows a typical multi-node server design. Suppose Socket-1 encounters an error. In this case, the DIMMs behind this socket will not be accessible, and hence they are isolated-by-association along with the socket itself. On the other hand, if only a DIMM has encountered an error, only that specific DIMM will be isolated. All these errors are reported using serviceable error logs to a management console from where repair actions are initiated. Currently, the repair actions are handled either at the discretion of the service engineer or on a first-come, first-served basis where the first reported problem is repaired first.
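Isolation-by-association amounts to walking the wiring topology downward from the faulty FRU. The sketch below assumes a small hypothetical topology (the FRU names and the two-DIMM-per-socket layout are invented for illustration). The ranking function reflects the idea in the title, prioritizing by resource gained, but the abbreviated text does not describe the actual method, so counting the resources that come back is only one assumed measure.

```python
from collections import deque

# Hypothetical topology: each FRU maps to the resources wired behind it.
TOPOLOGY = {
    "Node-0": ["Socket-1", "Socket-2"],
    "Socket-1": ["S1-DIMM-1", "S1-DIMM-2"],
    "Socket-2": ["S2-DIMM-1", "S2-DIMM-2"],
}


def isolated_with(faulty, topology):
    """Return the faulty FRU followed by everything isolated-by-association."""
    result = [faulty]
    queue = deque([faulty])
    while queue:
        current = queue.popleft()
        for child in topology.get(current, []):
            if child not in result:
                result.append(child)
                queue.append(child)
    return result


def rank_by_resources_regained(faulty_frus, topology):
    """Order faulty FRUs so that the replacement returning the most
    resources comes first (assumed metric: number of resources regained)."""
    return sorted(
        faulty_frus,
        key=lambda fru: len(isolated_with(fru, topology)),
        reverse=True,
    )
```

Under this assumed topology, a fault on `Socket-1` takes its two DIMMs with it, while a fault on a single DIMM isolates only that DIMM, so replacing the socket would be ranked ahead of replacing the DIMM regardless of which error was reported first.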


[Figure: typical multi-node server design. Labels recovered from the figure: Node 0, Socket 1, Socket 2, two sets of DIMMs numbered 1 to 8, Flash, DRAM, Service Processor, PCIe Lanes, A-Bus, Node 1, Socket 3, Socket 4, DIM...]