
Unified and Comprehensive Monitoring Topology for Clusters

IP.com Disclosure Number: IPCOM000012923D
Original Publication Date: 2003-Jun-11
Included in the Prior Art Database: 2003-Jun-11
Document File: 3 page(s) / 67K

Publishing Venue

IBM

Abstract

Disclosed is a process for monitoring multi-homed computer nodes and network paths between computer nodes.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 45% of the total text.


The Cluster Node & Network Monitoring Problem

Solving the monitoring problem in such an environment has the following objectives:

Detect node loss: This enables node failover (cluster reformation).

Detect network path loss: This enables network failover and/or cluster reformation.

Provide reliable and redundant data points for both the node loss and network loss: This helps prevent false-positive loss detection and helps validate the loss using the redundant data point.

Enable fault resolution for quick repair: Distinguishing between the various faults enables quick resolution as well as repair.
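
As a rough illustration of how these objectives relate, the following Python sketch classifies a peer node from heartbeat observations on several independent paths. The names (`Verdict`, `classify_peer`, the path labels) and the rule that node loss is declared only when every monitored path is silent are assumptions made for illustration; the disclosure does not specify them.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    PATH_LOSS = "network path loss"
    NODE_LOSS = "node loss"


def classify_peer(heartbeat_seen):
    """Classify a peer from per-path heartbeat observations.

    heartbeat_seen maps a (hypothetical) path name to True if a heartbeat
    arrived on that path within the timeout window, else False.
    Returns a (verdict, silent_paths) tuple.
    """
    if not heartbeat_seen:
        raise ValueError("at least one monitored path is required")

    silent = sorted(path for path, seen in heartbeat_seen.items() if not seen)

    if not silent:
        return Verdict.HEALTHY, silent
    if len(silent) == len(heartbeat_seen):
        # Every redundant data point agrees, so treat this as node loss.
        return Verdict.NODE_LOSS, silent
    # At least one path still carries heartbeats: the node is alive and
    # only the silent paths are suspect.
    return Verdict.PATH_LOSS, silent
```

With a single monitored path, this rule cannot tell node loss from path loss, which is one way to see why the redundant data points matter. The list of silent paths is returned alongside the verdict so that a repair procedure could be pointed at the specific faulty path.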

Known Solutions & Drawbacks

Node Loss Detection Solutions

Many of the existing solutions to the monitoring problem in high-availability research projects and products have one or more of the following problems and limitations:

Not Comprehensive: These solutions solve only the node loss detection problem (examples: any typical cluster management solution, and high-availability solutions such as Starfire (RSF-1) and Legato) and typically do not solve the network loss detection problem in a way that enables network failover.

Closed & Inflexible Architecture: They use a network path for network heartbeats. Many of them cannot use a second network path for heartbeats when one is available. Some, however, can use a disk heartbeat as a redundant heartbeat channel.

Hardware Inflexibility: In many cases, they need special serial communication capability to provide redundant data points (example: Starfire). Serial communication support is sometimes a hindrance and sometimes not possible, since all available serial ports may already be in use. Furthermore, serial heartbeats monitor only the node, not the communication medium for the cluster.

Static Configuration: These solutions often can neither deal with nor take advantage of different kinds of network environments. Network topologies and component utilization can be diverse, leading to single points of failure that cannot be resolved by single (or sometimes even dual) network heartbeats. The solutions quite often assume a backplane network and cannot reliably deal with the cluster network being the same as the client network. Client networks tend to stress the network heartbeat systems more, owing to network saturation.
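
To make the contrast with a fixed, single-channel heartbeat concrete, here is a minimal Python sketch of a tracker that accepts any number of heartbeat channels (for example two network paths and a disk heartbeat) for one peer node. The class name `HeartbeatTracker`, the channel labels, and the timeout handling are hypothetical and are not taken from the disclosure or from any of the products named above.

```python
class HeartbeatTracker:
    """Tracks the last heartbeat arrival per channel for one peer node."""

    def __init__(self, channels, timeout):
        if not channels:
            raise ValueError("at least one channel is required")
        self.timeout = timeout
        # None means no heartbeat has been observed on that channel yet.
        self.last_seen = {name: None for name in channels}

    def record(self, channel, now):
        """Note that a heartbeat arrived on `channel` at time `now` (seconds)."""
        if channel not in self.last_seen:
            raise KeyError(f"unknown channel: {channel}")
        self.last_seen[channel] = now

    def stale_channels(self, now):
        """Return the channels with no heartbeat in the last `timeout` seconds."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if seen is None or now - seen > self.timeout
        )
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic. Because the set of channels is supplied at construction time, the same code serves a node with one network path or with several, which is the flexibility the solutions described above are said to lack.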

Network Loss Detection Solutions

The solutions that solve the network loss problem do not present a solution to the larger high-availability problem. These solutions have one or more of the following problems and limitations:

They deal with only network loss and network failover and do not integrate node loss detection mechanisms. Examples here are: AIX* virtual internet protocol address (VIPA), Linux bonding driver, Ethernet driver bonding and failover features. Many of these solutions use network link loss detection methods, which are not reliable if the network component that is faulty is separated...