
Automated failure detection, diagnosis and repair mechanism for OnDemand autonomic self-healing systems
Disclosure Number: IPCOM000022531D
Original Publication Date: 2004-Mar-19
Included in the Prior Art Database: 2004-Mar-19
Document File: 6 page(s) / 103K

Publishing Venue



Autonomic computing is a key feature of the OnDemand environment, requiring systems to be self-healing and thereby providing resource capacity that can be used based on demand. Existing failure detection and repair mechanisms designed for self-healing systems require customers to notify Service and Support personnel about failure occurrences; the Service and Support personnel then diagnose the failure and provide customers with the necessary fixes. This process requires multiple manual interactions and increases the mean time taken to fix the failures. In addition, the diagnostic and systems management tools used by customers to detect failures are not integrated with Service and Support tools. The lack of integration between these two sets of tools also results in additional manual interactions. An increased mean time to fix directly increases warranty costs. The invention described in this disclosure provides an automated mechanism to detect, diagnose and repair system failures. This mechanism also integrates the diagnostic and systems management tools used for failure detection with those used by Service and Support personnel. This approach improves the self-healing characteristics of the targeted systems, such as Servers (including Blade Servers), thereby providing Autonomic computing capabilities for OnDemand environments. For the remainder of this disclosure, the invention is referred to as Service Manager.

This is the abbreviated version, containing approximately 47% of the total text.



Service Manager reduces the time spent on problem determination and the mean time to repair. This is achieved by monitoring targeted systems, capturing failure data without requiring system restarts, placing automated service calls to Service and Support personnel, and applying automated failure fix updates. Service Manager provides a closed-loop failure tracking mechanism, which results in a direct reduction of warranty costs.

Service Manager provides an automated mechanism to integrate failure detection with Service and Support. It is designed to function either as a stand-alone tool or integrated into a systems management application (e.g., IBM Director). The design is based on the following assumptions:

- Systems being monitored by Service Manager have an out-of-band interface (e.g., Service Processors) that can be used to capture failure data without requiring system restarts.
- Software for interacting with the above out-of-band interface is available for use by Service Manager.



Service Manager runs on a management server and is notified when failures occur. Service Manager can also be deployed as a stand-alone tool. If hardware or software failures occur, Service Manager captures the failure data from the failed systems and stores this data in a central repository. Additional information, such as hardware or software inventory and system logs, is also collected by Service Manager. The failure data is combined with this additional information and a problem report is created. This problem report can be customized based on the customer's Service and Support agreement. The problem report is then formatted and sent by Service Manager to the appropriate Service and Support personnel interface.
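The report-building step described above can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the names ProblemReport, build_problem_report, format_problem_report and the "sections" parameter are hypothetical, the plain-text layout is an assumed format, and an in-memory list stands in for the central repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ProblemReport:
    """Failure data plus the additional information collected for it."""
    system_id: str
    failure_data: Dict[str, str]
    inventory: Dict[str, str] = field(default_factory=dict)
    system_logs: List[str] = field(default_factory=list)


def build_problem_report(system_id: str,
                         failure_data: Dict[str, str],
                         inventory: Dict[str, str],
                         system_logs: List[str],
                         sections: Set[str]) -> ProblemReport:
    """Combine failure data with inventory and logs.

    `sections` is a hypothetical customization knob standing in for the
    customer's Service and Support agreement: only the named optional
    sections are included in the report.
    """
    return ProblemReport(
        system_id=system_id,
        failure_data=dict(failure_data),
        inventory=dict(inventory) if "inventory" in sections else {},
        system_logs=list(system_logs) if "system_logs" in sections else [],
    )


def format_problem_report(report: ProblemReport) -> str:
    """Render the report as plain text (the layout is an assumption)."""
    lines = [f"SYSTEM: {report.system_id}"]
    lines += [f"FAILURE {k}={v}" for k, v in sorted(report.failure_data.items())]
    lines += [f"INVENTORY {k}={v}" for k, v in sorted(report.inventory.items())]
    lines += [f"LOG {entry}" for entry in report.system_logs]
    return "\n".join(lines)


# In-memory stand-in for the central repository.
central_repository: List[ProblemReport] = []

report = build_problem_report(
    system_id="server-01",
    failure_data={"component": "memory", "event": "uncorrectable error"},
    inventory={"os": "Windows Server 2003"},
    system_logs=["example log entry"],
    sections={"inventory"},  # this agreement omits system logs
)
central_repository.append(report)
print(format_problem_report(report))
```

Keeping the build and format steps separate means the same stored report could be rendered differently for different Service and Support interfaces.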

Service Manager activates tracking for each reported problem, whereby the status of each problem is tracked as it is processed by Service and Support. Customers can monitor the progress of reported problems using Service Manager's tracking mechanism.
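As a rough illustration of such a tracking mechanism, the sketch below keeps a status history per reported problem. The status names and the ProblemTracker class are assumptions made for this example; the disclosure does not specify which states Service and Support reports.

```python
from enum import Enum
from typing import Dict, List


class Status(Enum):
    # Hypothetical states; the disclosure does not enumerate them.
    REPORTED = "reported"
    IN_PROGRESS = "in progress"
    FIX_AVAILABLE = "fix available"
    CLOSED = "closed"


class ProblemTracker:
    """Keeps the status history of each reported problem, oldest first."""

    def __init__(self) -> None:
        self._history: Dict[str, List[Status]] = {}

    def activate(self, problem_id: str) -> None:
        """Start tracking a newly reported problem."""
        self._history[problem_id] = [Status.REPORTED]

    def update(self, problem_id: str, status: Status) -> None:
        """Record a status change received from Service and Support."""
        if problem_id not in self._history:
            raise KeyError(f"tracking not activated for {problem_id}")
        self._history[problem_id].append(status)

    def current(self, problem_id: str) -> Status:
        """Latest known status, as a customer would see it."""
        return self._history[problem_id][-1]

    def history(self, problem_id: str) -> List[Status]:
        """Copy of the full status history for the problem."""
        return list(self._history[problem_id])


tracker = ProblemTracker()
tracker.activate("PR-1")
tracker.update("PR-1", Status.IN_PROGRESS)
print(tracker.current("PR-1").value)  # prints "in progress"
```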

When fixes or patches for the reported problems are available from Service and Support, Service Manager is notified, obtains the required fixes, and makes them available in a central repository. For problems requiring hardware fixes or updates, Service Manager tracks only the status of the problem. The actual hardware fix or update is out of scope for Service Manager and is not described in this invention disclosure. However, the potential fix for a hardware problem may be described in the tracking information.
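The distinction between software and hardware fixes can be sketched as follows. This is an illustrative assumption rather than the disclosed implementation: FixNotification, handle_fix_notification, and the two dictionaries standing in for the central repository and the tracking information are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FixNotification:
    """Hypothetical message from Service and Support about a reported problem."""
    problem_id: str
    kind: str                        # "software" or "hardware"
    payload: Optional[bytes] = None  # fix or patch contents (software only)
    note: str = ""                   # free-text description of a potential fix


def handle_fix_notification(notification: FixNotification,
                            fix_repository: Dict[str, bytes],
                            tracking_info: Dict[str, str]) -> None:
    """Store software fixes centrally; for hardware, update tracking only."""
    if notification.kind == "software":
        if notification.payload is None:
            raise ValueError("software fix notification carries no payload")
        fix_repository[notification.problem_id] = notification.payload
        tracking_info[notification.problem_id] = "fix available in central repository"
    elif notification.kind == "hardware":
        # The hardware repair itself is out of scope; only status is recorded.
        tracking_info[notification.problem_id] = (
            notification.note or "hardware fix pending"
        )
    else:
        raise ValueError(f"unknown fix kind: {notification.kind}")


fix_repository: Dict[str, bytes] = {}
tracking_info: Dict[str, str] = {}

handle_fix_notification(
    FixNotification("PR-1", "software", payload=b"example patch contents"),
    fix_repository, tracking_info)
handle_fix_notification(
    FixNotification("PR-2", "hardware", note="replace memory module"),
    fix_repository, tracking_info)
print(tracking_info)
```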



Figure 1 - Service Manager Architecture and High-level Design

Service Manager Prototype

The current prototype is based on IBM Director and the IBM eServer xSeries x445 server with the Remote Supervisor Adapter (RSA I) card. The RSA serves as the Service Processor described above. The targeted system runs Windows Server 2003. Extensions to the IBM Director Console and Server components were made fo...