Method And Apparatus For Intelligent Grid Error Call Home

IP.com Disclosure Number: IPCOM000200535D
Publication Date: 2010-Oct-18
Document File: 3 page(s) / 32K

Publishing Venue

The IP.com Prior Art Database

Abstract

In the existing Peer-to-Peer environment for Virtual Tape Systems (VTS), Call Homes are filtered for redundancy by passing a list of Call Home errors around the Peers. Each individual VTS then consults this list before calling home. This method is simply a redundancy filter: no intelligence is applied to creating or interpreting the messages passed around by the various machines in the Peer group. In a grid environment, however, there is a network of communication paths between nodes. In most cases only one node should report a problem; in other cases a subset of nodes should; and in extreme instances every node in the grid should call home. A new method is needed to allow each machine to decide independently, given an error message, what its response should be and whether other nodes need to be informed of the failure.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 36% of the total text.

When a failure occurs on one node in a grid configuration, multiple Call Home events could occur if the failure affects multiple machines. This is the case, for example, when a communication path goes down. The failure affects at least two nodes in the grid, but needs to be reported only once. For these 'single failure single call home' cases, a message needs to be sent out to other nodes in the grid to prevent them from calling home. In the case of a communication failure, however, the node that first sees the failure will not be able to send such a message to the node with which communication has been lost. Given such limitations, the node that sees the failure will need to broadcast the failure message to every available node, each of which will rebroadcast the message to each node available to it, thereby propagating the message indirectly to the node on the other end of the broken communication link. Using this method, the nodes in the grid can work together to report grid level errors intelligently.
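
The rebroadcast scheme described above can be illustrated with a minimal sketch. It assumes a hypothetical three-node grid (A, B, C) in which the A-B link has failed, and it adds a per-node set of already-seen error identifiers so that rebroadcasting terminates. The class name, method names, and error identifier are illustrative only and do not come from the disclosure.

```python
class Node:
    """One grid member (hypothetical model, not the disclosed implementation)."""

    def __init__(self, name):
        self.name = name
        self.peers = []           # nodes reachable over links that still work
        self.seen = set()         # error ids already handled; stops rebroadcast loops
        self.called_home = False

    def detect(self, error_id):
        """This node saw the failure first: call home once, then inform the grid."""
        self.seen.add(error_id)
        self.called_home = True
        self._rebroadcast(error_id)

    def receive(self, error_id):
        """Another node already reported the error: do not call home, pass it on."""
        if error_id in self.seen:
            return                # duplicate delivery, nothing more to do
        self.seen.add(error_id)
        self._rebroadcast(error_id)

    def _rebroadcast(self, error_id):
        for peer in self.peers:
            peer.receive(error_id)


def connect(x, y):
    """Record a working, bidirectional link between two nodes."""
    x.peers.append(y)
    y.peers.append(x)


a, b, c = Node("A"), Node("B"), Node("C")
connect(a, c)
connect(b, c)                     # the A-B link is down, so it is not connected
a.detect("LINK_DOWN:A-B")         # assumed error id format

reporters = [n.name for n in (a, b, c) if n.called_home]
```

In this sketch B learns of the failure through C even though it cannot hear from A directly, and only A calls home. The sketch does not address the case where A and B both detect the failure at the same moment; resolving that race would need an additional rule that the text available here does not specify.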

This invention is an error message propagation method for reporting 'domain level' errors intelligently within a grid architecture. When an error occurs that is predefined as 'domain level', i.e., its impact is not limited to a single node in the grid, the node can send an error message to a specific node in the grid, to a subset of nodes, or to all available nodes, informing them of the error. Within the error message the node can include further instructions for the other nodes: do not call home for the same error, call home with specific logs, forward the message on to another node, or forward the message on to every other available node. Each node individually has the intelligence to know how to handle a given error message from another node. If the first node does not know which specific node needs to receive the message, for instance, the error message can simply contain instructions to rebroadcast to all other nodes, ensuring that even with communication failures every reachable node is informed. Thus, the error reporting of the entire grid is coordinated without the use of a centralized error reporting mechanism.
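
As a rough illustration of the four instructions listed above, the following sketch models an error message and the decision a receiving node might make. The `Action` names, the `ErrorMessage` fields, and the `handle` function are assumptions made for this example; the disclosure does not define a message format.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Action(Enum):
    """Instruction carried in the message (names are hypothetical)."""
    SUPPRESS = auto()      # do not call home for the same error
    SEND_LOGS = auto()     # call home with specific logs
    FORWARD = auto()       # forward the message to one named node
    REBROADCAST = auto()   # forward the message to every other available node


@dataclass(frozen=True)
class ErrorMessage:
    error_id: str
    origin: str
    action: Action
    logs: tuple = ()       # log names requested when action is SEND_LOGS
    forward_to: str = ""   # target node when action is FORWARD


def handle(node, peers, msg, seen):
    """Return the steps `node` would take for `msg` as (kind, target, payload) tuples."""
    if msg.error_id in seen:
        return []                                  # already handled once
    seen.add(msg.error_id)
    if msg.action is Action.SUPPRESS:
        return [("suppress", node, msg.error_id)]
    if msg.action is Action.SEND_LOGS:
        return [("call_home", node, msg.logs)]
    if msg.action is Action.FORWARD:
        return [("send", msg.forward_to, msg)]
    # REBROADCAST: stay quiet locally and pass the message to every other peer.
    sends = [("send", peer, msg) for peer in peers if peer != msg.origin]
    return [("suppress", node, msg.error_id)] + sends


msg = ErrorMessage("E42", "A", Action.REBROADCAST)
seen_by_c = set()
first = handle("C", ["A", "B"], msg, seen_by_c)
second = handle("C", ["A", "B"], msg, seen_by_c)   # duplicate delivery
```

Because each node decides locally from the message contents, no central coordinator appears anywhere in the sketch; the `seen` set is an added assumption that keeps a rebroadcast from circulating indefinitely.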

The layer of code that provides the error coordination intelligence can also be updated remotely. We are able to continuously update how errors are classified as the release matures on a given machine. For instance, if we learn in the Support Center that a given error needs to have logs sent from additional machines in the grid, the Support Center will be able to update that error and broadcast the new information to all of the machines at that given code level or higher. This will ensure that the next time a machine hits this error, the grid call home feature is invoked instead of the single-machine call home from the original configuration. This can also work in the reverse manner. If an error first is realized as a g...
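
The remote update described in this paragraph, applied only to machines "at that given code level or higher", might be sketched as follows. The policy table layout, the dotted code level format, and the field names are all assumptions made for illustration.

```python
def parse_level(text):
    """Turn a dotted code level such as '8.10.1' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def apply_update(policy, machine_level, update):
    """Return the policy after an update; machines below the minimum level keep theirs."""
    if parse_level(machine_level) < parse_level(update["min_code_level"]):
        return policy
    new_policy = dict(policy)                 # leave the caller's table untouched
    new_policy[update["error_id"]] = update["scope"]
    return new_policy


# Hypothetical starting policy: error E42 triggers a single-machine call home.
policy = {"E42": "single"}
update = {"error_id": "E42", "scope": "grid", "min_code_level": "8.6.0"}

newer = apply_update(policy, "8.10.1", update)   # at or above the minimum level
older = apply_update(policy, "8.5.3", update)    # below the minimum level
```

Comparing levels as integer tuples rather than strings avoids ranking "8.10.1" below "8.6.0", which a plain string comparison would do.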