Method and System for Detecting Faulty Nodes in Large-Scale Enterprise Systems
Publication Date: 2010-Dec-14
The IP.com Prior Art Database
A method and system for detecting faulty nodes in large-scale enterprise systems is disclosed. The method and system use the eigen space of the covariance matrix of each monitored node to identify faulty nodes in large-scale enterprise systems, such as an enterprise cluster environment. The method and system adapt well to the dynamic conditions of such systems, are simple to operate, and automatically produce a list of faulty nodes when an anomaly occurs.
Typically, in large-scale enterprise systems, replicated nodes do not fail in isolation. Even though such systems are designed for high availability, performance and reliability, the failure of a subset of nodes affects other, healthy nodes in the cluster. A further complication is that workload is not uniformly distributed across nodes. Consequently, even when several nodes fail due to the same fault, the magnitude of the shift in the affected variables may differ from node to node. Because faulty and healthy nodes are grouped in the same cluster, it is difficult to apply a standard clustering algorithm to the nodes' variables to separate faulty nodes from healthy ones.
To overcome this problem, a method and system are disclosed for detecting faulty nodes in large-scale enterprise systems. More precisely, the method is an eigen space based method that computes a covariance matrix for each monitored server. The eigen space of the computed covariance matrix is then used to determine the health of each server.
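The per-server covariance computation can be illustrated with a minimal sketch. It is not the disclosed implementation: the function name covariance_matrix, the node names, the choice of metrics (arrival rate, response time, CPU utilization) and all sample values are hypothetical, and the sample (n - 1) normalization is an assumption.

```python
def covariance_matrix(samples):
    """Return the sample covariance matrix of a list of observations.

    samples: one row per time step, one column per monitored metric.
    Assumption: the unbiased (n - 1) normalization is used.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two observations")
    columns = list(zip(*samples))                 # one tuple per metric
    means = [sum(col) / n for col in columns]
    centered = [[x - m for x in col] for col, m in zip(columns, means)]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
        for ci in centered
    ]


# Hypothetical observations per node:
# (arrival rate, response time in ms, CPU utilization in %)
node_metrics = {
    "node-a": [(10, 100, 20), (20, 200, 40), (30, 300, 60)],
    "node-b": [(10, 100, 20), (20, 900, 95), (30, 150, 30)],
}

# One covariance matrix per monitored node.
node_cov = {name: covariance_matrix(rows) for name, rows in node_metrics.items()}
```

Each entry of node_cov is a square matrix with one row and one column per metric, which is the input to the eigen space analysis.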
To understand the dynamics of the eigen space based method, it helps to analyze the behavior of the eigen values and eigen vectors of the covariance matrix of a monitored node. The covariance matrix is symmetric and positive semi-definite, so all of its eigen values are real and non-negative (strictly positive when no metric is an exact linear combination of the others). When the covariance matrix is decomposed into eigen values and eigen vectors, the eigen vectors can be chosen to be mutually orthogonal. Each eigen vector represents an axis along which the data is spread, and the corresponding eigen value gives the variance of the data along that axis.
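These properties can be checked numerically. The sketch below decomposes a small symmetric matrix with the classical cyclic Jacobi rotation method, chosen here only because it needs nothing beyond the standard library; the disclosure does not say which decomposition routine is used. The function name jacobi_eigen, its parameters and the example matrix are hypothetical.

```python
import math


def jacobi_eigen(matrix, sweeps=50, tol=1e-20):
    """Eigen-decompose a symmetric matrix with cyclic Jacobi rotations.

    Returns (eigenvalues, eigenvectors) sorted by decreasing eigenvalue;
    eigenvectors[k] is the unit vector paired with eigenvalues[k].
    """
    n = len(matrix)
    a = [[float(x) for x in row] for row in matrix]
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    scale = sum(x * x for row in a for x in row)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(i + 1, n))
        if off <= tol * scale:          # off-diagonal part is negligible
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                # Rotation angle that zeroes the entry a[p][q].
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):      # A <- A J (columns p and q)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):      # A <- J^T A (rows p and q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
                for k in range(n):      # V <- V J accumulates eigenvectors
                    vkp, vkq = v[k][p], v[k][q]
                    v[k][p] = c * vkp - s * vkq
                    v[k][q] = s * vkp + c * vkq
    pairs = sorted(
        ((a[i][i], [v[k][i] for k in range(n)]) for i in range(n)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [val for val, _ in pairs], [vec for _, vec in pairs]


# Hypothetical covariance matrix of three metrics on one node.
cov = [[4.0, 2.0, 0.5],
       [2.0, 3.0, 0.4],
       [0.5, 0.4, 1.0]]

eigen_values, eigen_vectors = jacobi_eigen(cov)


def dot(u, w):
    return sum(x * y for x, y in zip(u, w))


# Properties discussed above: no negative eigen values, and eigen vectors
# that are orthogonal to each other (dot product close to zero).
all_non_negative = all(val >= 0.0 for val in eigen_values)
max_cross_dot = max(
    abs(dot(eigen_vectors[i], eigen_vectors[j]))
    for i in range(len(cov)) for j in range(i + 1, len(cov))
)
```

Because the pairs are sorted, eigen_vectors[0] is the axis with the largest spread and eigen_values[0] is the variance along it. In practice a tested numerical library routine for symmetric matrices would normally replace this hand-written loop.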
In order to identify faulty nodes, only the principal eigen vector is considered, because it is the axis along which the data shows the highest variability. When a system is in a normal state, the data is distributed according to the current correlation among the different metrics captured in the covariance matrix. The principal eigen vector of this covariance matrix is normalized so that its magnitude is one. When the system enters an anomalous state, the natural correlation among these metrics changes considerably, and some metrics become highly volatile. For example, the response time seen at a node may increase significantly relative to the current arrival rate. In such situations, the spread of the data shifts away from its normal state, and hence the direction, that is, the coordinates, of the principal eigen vector also changes. In the normalized principal eigen vector, the coordinates with the largest magnitude correspond to the metrics that show the highest variability. As such, most volatile metrics on each of...