Remote server identification for error correlation

IP.com Disclosure Number: IPCOM000237561D
Publication Date: 2014-Jun-24
Document File: 4 page(s) / 218K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method to assist service personnel in a multiple-server environment with identifying which server is associated with a particular transmitted error. The method for remote server identification for error correlation provides the server number and the error indications to the service personnel to allow efficient problem resolution.

This is the abbreviated version, containing approximately 51% of the total text.

In an installation environment, at first power on, errors can be identified on multiple physical servers. This is particularly a problem in high-performance computing cluster installations, which have racks of servers and, in some cases, thousands of compute servers. When a service person goes on site and reviews the cluster, amber error indicator light emitting diodes (LEDs) might be illuminated on multiple servers in multiple racks. Normally, errors are serviced one at a time using a blue identification LED, which the management entity sets from a central site. However, even with the blue identification LED, it is difficult for a person who is physically at the servers to determine which error is associated with which server.

Figure 1: Service personnel's dilemma determining which server to repair

A method is needed to assist service personnel with matching error indications to the problem server. The novel contribution is a method in which the on-site service personnel initiate the repair action and are provided with the server number and the error indications.

The service person presses a button on one of the servers that has an error indication. The button press is reported to a central management entity, such as a cluster administration toolkit: the baseboard management entity sends an unsolicited response to the cluster administration toolkit, indicating the server number and the error indications. The operator at the toolkit console then tells the person who is physically at the server the type of error and the repair action required to remedy the problem for that specific server.
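
The message flow described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the class names (ErrorNotification, ClusterToolkit, BaseboardManager), the method names, and the error strings are all hypothetical, and the network transport between the server and the toolkit is replaced by a direct method call.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ErrorNotification:
    """Hypothetical unsolicited message: server number plus its error indications."""
    server_number: int
    error_indications: tuple


@dataclass
class ClusterToolkit:
    """Hypothetical stand-in for the central cluster administration toolkit."""
    received: list = field(default_factory=list)

    def on_notification(self, note):
        # Record the message and build the line shown to the console operator.
        self.received.append(note)
        return f"Server {note.server_number}: " + "; ".join(note.error_indications)


@dataclass
class BaseboardManager:
    """Hypothetical stand-in for one server's baseboard management entity."""
    server_number: int
    active_errors: list
    toolkit: ClusterToolkit

    def press_identify_button(self):
        # Assumption: a press on a server with no active errors sends nothing.
        if not self.active_errors:
            return None
        note = ErrorNotification(self.server_number, tuple(self.active_errors))
        # A direct call stands in for the unsolicited message to the toolkit.
        return self.toolkit.on_notification(note)


if __name__ == "__main__":
    toolkit = ClusterToolkit()
    # Server number and error strings are made up for illustration.
    bmc = BaseboardManager(17, ["fan fault", "memory fault"], toolkit)
    print(bmc.press_identify_button())
```

In this sketch the operator's console line is simply the return value of the button press. In a real deployment the message would travel over the management network, and the operator would relay the error type and repair action to the person at the server.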

In the installation of a cluster of compute nodes in a high-performance computing configuration, a number of errors can occur. When querying the errors, the remote management information technology (IT) person sees a display of error indications (Figure 2).

Figure 2: Snapshot of error indications on a high-performance computing cluster

In this example, three compute nodes have a number of errors. Normally, the remote management IT person use...