Precise Method for Isolating Hang Conditions in Multi-Node Systems

IP.com Disclosure Number: IPCOM000249308D
Publication Date: 2017-Feb-16
Document File: 3 page(s) / 129K

Publishing Venue

The IP.com Prior Art Database


A method for isolating hang conditions in multi-node systems is disclosed.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.


Disclosed is a method for isolating hang conditions in multi-node systems.

As computer systems grow larger, it has become increasingly difficult to isolate the failing components of a system. These large systems typically consist of multiple nodes, which are essentially systems within systems. Adding to the complexity, the components often contain large processing units, each made up of many units that implement a variety of functions.

In these complex multi-node systems, it is very challenging to determine which component caused a system hang. The unit that issues a command and waits for data detects the hang condition and flags it in a Fault Isolation Register (FIR). When the system eventually check stops (that is, comes to a complete halt), the monitoring software scans the system for the unit that triggered the check stop. However, more often than not, the unit that detects the time-out condition is not the root cause of the hang. Because the monitoring software calls out the component containing the fail-detecting unit as the Field Replaceable Unit (FRU), system components are incorrectly discarded. These unnecessary discards can be costly for businesses and data warehouses.
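The misdiagnosis described above can be illustrated with a small sketch. It is not the monitoring software itself: the `Unit` record, the `naive_callout` function, and all unit and component names are hypothetical, chosen only to show why a callout based purely on the FIR of the detecting unit can point at the wrong component.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str          # hypothetical unit identifier
    component: str     # hypothetical replaceable component that contains the unit
    fir_timeout: bool  # True if this unit's FIR recorded a time-out


def naive_callout(units):
    """Return the component of the first unit whose FIR shows a time-out.

    This mirrors the behavior described in the text: the component holding
    the detecting unit is called out, whether or not it caused the hang.
    """
    for unit in units:
        if unit.fir_timeout:
            return unit.component
    return None


# Assumed scenario: the requester timed out waiting for data, while the
# unit that actually hung recorded nothing in its own FIR.
units = [
    Unit("requester", "node0_cpu", fir_timeout=True),
    Unit("fabric_link", "node0_fabric", fir_timeout=False),
    Unit("responder", "node1_memctl", fir_timeout=False),
]

called_out = naive_callout(units)
print(called_out)
```

In this assumed scenario the callout names the requester's component, even though the hung unit sits in a different component on another node.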

This disclosure introduces a method for precisely isolating the unit or component that is the root cause of a hang in a large multi-node system. Each system node has at least one very small Local Time-out Engine (LTE), which builds a system routing table sophisticated enough to determine the path a given operation trave...
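The abbreviated text does not show how the LTE builds or uses its routing table, so the following is only a minimal sketch of the general idea of recovering an operation's path from such a table. The table layout (a mapping from a current unit and destination to the next hop), the `trace_path` function, the `max_hops` guard, and every unit name are assumptions made for illustration, not details from the disclosure.

```python
def trace_path(routing_table, source, destination, max_hops=16):
    """Follow next-hop entries from source to destination.

    routing_table maps (current_unit, destination) -> next_unit.
    This layout is an assumption; the disclosure's actual table format
    is not given in the abbreviated text.
    """
    path = [source]
    current = source
    while current != destination:
        if len(path) > max_hops:
            raise ValueError("routing loop or path longer than max_hops")
        # A missing entry raises KeyError, meaning no route is known.
        current = routing_table[(current, destination)]
        path.append(current)
    return path


# Hypothetical two-node system: a core on node 0 reads memory on node 1.
routing_table = {
    ("node0.core", "node1.mem"): "node0.fabric",
    ("node0.fabric", "node1.mem"): "node1.fabric",
    ("node1.fabric", "node1.mem"): "node1.mem",
}

path = trace_path(routing_table, "node0.core", "node1.mem")
print(" -> ".join(path))
```

With a path like this available, every unit an operation passed through becomes a candidate for inspection, not only the unit that detected the time-out.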