Method and System for Fault Identification through Elimination of Non-Faulty Resource Clusters
Publication Date: 2010-Dec-15
The IP.com Prior Art Database
A method and system for fault identification through elimination of non-faulty resource clusters is disclosed. The method includes an elimination-based fault isolation approach that leverages resources shared among applications.
Disclosed is a method and system for fault identification through elimination of non-faulty resource clusters. Fig. 1 illustrates a system overview of the components involved. The method includes an elimination-based fault localization approach that leverages resources shared among applications. Shared resources are used as 'Readily Available Probes' to find the real-time state of applications. These probes are then used to eliminate non-faulty resources, leaving a minimal subset of resources that are likely to be the faulty components.
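The elimination idea can be illustrated with a minimal sketch. It assumes that a healthy application proves every resource it depends on is working, which mirrors the role of the 'Readily Available Probes'. The function name, the application names and the resource names are hypothetical and do not come from the disclosure.

```python
def eliminate_non_faulty(app_resources, healthy_apps):
    """Return the resources that remain suspect after elimination.

    app_resources: dict mapping an application name to the set of
                   resources it depends on (hypothetical layout).
    healthy_apps:  set of application names currently observed to work.
    """
    suspects = set()  # resources used by at least one failing application
    cleared = set()   # resources vouched for by a healthy application
    for app, resources in app_resources.items():
        if app in healthy_apps:
            cleared.update(resources)
        else:
            suspects.update(resources)
    # Anything a healthy application relies on is assumed to be working.
    return suspects - cleared


# Hypothetical example: app1 fails, app2 works, and both share db1 and san1.
app_resources = {
    "app1": {"web1", "db1", "san1"},
    "app2": {"web2", "db1", "san1"},
}
remaining = eliminate_non_faulty(app_resources, healthy_apps={"app2"})
print(remaining)  # {'web1'}
```

Because app2 works and shares db1 and san1 with app1, only web1 remains as a likely faulty component.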
A dependency graph of the resource-sharing applications in large-scale data centers is used to build overlapping clusters of resources for different applications at various levels. These overlapping clusters are then used to find the cluster or clusters that contain the problem by identifying which applications are failing.
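One way to derive overlapping resource clusters from a dependency graph is sketched below. It assumes the graph is stored as two plain dictionaries, one for application-to-resource dependencies and one for resource-to-resource dependencies. This representation, the function name and the sample names are illustrative assumptions, not part of the disclosure.

```python
from collections import deque


def resource_cluster(app, app_deps, resource_deps):
    """Collect every resource an application reaches, directly or transitively."""
    seen = set()
    queue = deque(app_deps.get(app, ()))
    while queue:
        resource = queue.popleft()
        if resource in seen:
            continue
        seen.add(resource)
        # Follow dependencies among resources (e.g. a database on a storage array).
        queue.extend(resource_deps.get(resource, ()))
    return seen


# Hypothetical dependency graph.
app_deps = {"app1": ["web1"], "app2": ["web2"]}
resource_deps = {"web1": ["db1"], "web2": ["db1"], "db1": ["san1"]}

clusters = {app: resource_cluster(app, app_deps, resource_deps) for app in app_deps}
overlap = clusters["app1"] & clusters["app2"]  # resources shared by both clusters
```

Here the two clusters overlap on db1 and san1, which is the kind of sharing the method relies on to rule resources in or out.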
To find the applications that are failing, the method first identifies the resource dependencies of applications and the dependencies among resources. After identifying the dependencies, groups of applications are created such that each application in a group has at least one resource dependency in common with at least one other application in the group. For each group, a reduced (minimal) subset of resource clusters is then created. This reduction step is performed iteratively until each cluster has the minimum resources assigned to it. Test-cases are then associated with each cluster. A test-case that runs successfully implies that all resources in its cluster are working. Thus, when a fault event occurs, the test-cases associated with each cluster are run automatically until a faulty subset is reached. The above steps are repeated until two minimally reduced clusters with complementary resources are identified such that the test-case for one passes and the test-case for the other fails. Once such clusters are identified, the faulty resources are reported.
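The grouping and narrowing steps can be sketched as follows. The disclosure does not spell out how clusters are reduced, so this sketch uses plain bisection as a stand-in: a failing cluster is split into two complementary halves and the failing half is kept. It also assumes a test can be run against an arbitrary subset of resources and that a single resource is faulty. All function names and sample data are hypothetical.

```python
def group_applications(app_resources):
    """Group applications that share at least one resource, directly or via a chain."""
    groups = []  # list of (application set, pooled resource set) pairs
    for app, resources in app_resources.items():
        apps, pool = {app}, set(resources)
        untouched = []
        for group_apps, group_pool in groups:
            if group_pool & pool:
                # Shared resource found: merge the existing group into this one.
                apps |= group_apps
                pool |= group_pool
            else:
                untouched.append((group_apps, group_pool))
        groups = untouched + [(apps, pool)]
    return [apps for apps, _ in groups]


def narrow_fault(resources, test_passes):
    """Bisect a failing cluster until one complementary half passes and the other fails.

    resources:   ordered list of resources in a cluster whose test-case failed.
    test_passes: callable taking a list of resources and returning True when the
                 test-case for that subset succeeds (assumed to exist).
    """
    suspects = list(resources)
    while len(suspects) > 1:
        mid = len(suspects) // 2
        left, right = suspects[:mid], suspects[mid:]
        if not test_passes(left):
            suspects = left
        elif not test_passes(right):
            suspects = right
        else:
            break  # both halves pass on their own; no further reduction possible
    return suspects


# Hypothetical data: app1 and app2 share db1, app3 shares nothing with them.
app_resources = {
    "app1": {"web1", "db1"},
    "app2": {"web2", "db1"},
    "app3": {"web3", "db2"},
}
groups = group_applications(app_resources)

faulty = "san1"  # simulated fault, used only to drive the fake test-case
located = narrow_fault(
    ["db1", "san1", "web1", "web2"],
    test_passes=lambda subset: faulty not in subset,
)
```

In this sketch an application that shares no resource with any other ends up in a group of its own, and `narrow_fault` stops as soon as the suspect list cannot be split further.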
For example, consider the application groups shown in Fig. 2, wherein App-A, App-B and App-D form one group and App-C...