SCALABLE, EVENT BASED, BOTTOM-UP, DISTRIBUTED FAULT MANAGEMENT FOR USE IN DENSE COMPUTING SYSTEMS (MICRO SERVERS) AND TRADITIONAL RACK AND BLADE SERVERS

IP.com Disclosure Number: IPCOM000241310D
Publication Date: 2015-Apr-15

Publishing Venue

The IP.com Prior Art Database

Related People

Sayantan Bhattacharyya: AUTHOR

Abstract

A fault engine and multi-controller model is presented that achieves efficient data acquisition, assimilation, aggregation, and reporting in a multi-hierarchical hardware setup of a dense computing system. With this model, a protocol is established between multiple managed entities to simplify fault reporting so that it scales to any number of nodes. In this architecture, each upper layer management controller detects only its local faults. It is not burdened with the complex fault detection job for its lower layer controllers, since that task is offloaded to the respective fault engine of each lower layer controller. Upper layer controllers only collect the fault data available from each lower layer controller and present it to an end user.

This is the abbreviated version, containing approximately 15% of the total text.

AUTHORS: Sayantan Bhattacharyya, Yogindar Das, Yasodhar Krishnan

CISCO SYSTEMS, INC.

DETAILED DESCRIPTION

    Fault detection and management is a critical subsystem in computing devices. A micro server faces challenges in managing faults across the complete system. Similar challenges arise in traditional rack server deployments, where a large number of rack servers are managed by a single management controller.

    In a typical Unified Computing System (UCS) micro server, there are multiple cartridges, and each cartridge can contain multiple servers. Thus, a small chassis unit holds many servers. For example, one UCS micro server chassis model has 8 cartridges and yields 16 servers per chassis. Typically, each of these servers has a dedicated Baseboard Management Controller (BMC). In standalone deployment mode, all these servers are managed by one centralized controller called the Chassis Management Controller (CMC), which is the management end point of the complete box. Each server is locally managed by an Intelligent Platform Management Interface (IPMI) stack running in its dedicated BMC. The CMC manages all the BMCs.

Copyright 2015 Cisco Systems, Inc.

    In standalone deployment mode, tracking faults and events for all the server nodes in a centralized place, such as the CMC, requires the CMC to keep track of all the IPMI sensors spread across all the servers, as well as any non-IPMI sources of faults/events (if they exist). Aggregating all the BMC sensors in the CMC would exceed the IPMI maximum number of sensors (which is only 255), so this is not a scalable option. Similarly, for non-IPMI sources of faults/events, the CMC needs to collect all the data in order to detect and report faults for them.
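The scaling problem can be made concrete with a little arithmetic. The following sketch uses the chassis figures given above (8 cartridges, 16 servers) and the 255-sensor limit quoted in the text. The per-server sensor count of 40 is purely a hypothetical figure chosen for illustration, and all constant and function names are assumptions rather than anything defined in the disclosure.

```python
# Figures taken from the text above.
CARTRIDGES_PER_CHASSIS = 8
SERVERS_PER_CHASSIS = 16
IPMI_MAX_SENSORS = 255  # limit quoted in the text

# ASSUMPTION: hypothetical number of IPMI sensors exposed by each BMC.
SENSORS_PER_SERVER = 40


def flat_sensor_count(servers: int, sensors_per_server: int) -> int:
    """Sensors the CMC would hold if it aggregated every BMC sensor itself."""
    return servers * sensors_per_server


def max_sensors_per_server(servers: int, limit: int = IPMI_MAX_SENSORS) -> int:
    """Largest per-server sensor count that still fits under the limit."""
    return limit // servers


total = flat_sensor_count(SERVERS_PER_CHASSIS, SENSORS_PER_SERVER)
print(f"servers per cartridge: {SERVERS_PER_CHASSIS // CARTRIDGES_PER_CHASSIS}")
print(f"aggregated sensors: {total} (limit {IPMI_MAX_SENSORS})")
print(f"exceeds limit: {total > IPMI_MAX_SENSORS}")
print(f"per-server budget under the limit: {max_sensors_per_server(SERVERS_PER_CHASSIS)}")
```

Under these assumptions the flat aggregation would need 640 sensor slots, and each of the 16 servers could expose at most 15 sensors before the limit is reached.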

    One of the major ways to detect faults is to collect data periodically and compare it against previous data to detect state changes of various devices. This calls for some kind...
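The periodic-comparison approach can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `LocalFaultEngine` and `UpperController` classes, the device names, and the state strings are all hypothetical. It shows a lower layer engine comparing each new snapshot against the previous one and recording only state changes, while the upper layer merely collects the recorded events, in line with the model described in the abstract.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# (device, old_state, new_state); all names here are hypothetical.
Event = Tuple[str, str, str]


@dataclass
class LocalFaultEngine:
    """Runs next to one lower layer controller and detects its local faults."""
    node: str
    previous: Dict[str, str] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)

    def poll(self, snapshot: Dict[str, str]) -> List[Event]:
        """Compare a new snapshot with the previous one; record state changes."""
        changes = [
            (device, self.previous[device], state)
            for device, state in snapshot.items()
            if device in self.previous and self.previous[device] != state
        ]
        self.previous = dict(snapshot)  # the new data becomes the baseline
        self.events.extend(changes)
        return changes


@dataclass
class UpperController:
    """Collects already-detected events; does no detection for lower layers."""
    engines: List[LocalFaultEngine]

    def collect(self) -> Dict[str, List[Event]]:
        return {e.node: list(e.events) for e in self.engines if e.events}


engine = LocalFaultEngine(node="server-1")
engine.poll({"fan0": "ok", "psu0": "ok"})      # first poll sets the baseline
engine.poll({"fan0": "failed", "psu0": "ok"})  # fan0 changed state
report = UpperController([engine]).collect()
print(report)
```

In this sketch the first poll only establishes a baseline, so a device seen for the first time produces no event. The upper controller never inspects raw device state; it reads only the events each engine has already recorded.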