Concurrent Error Detection and Retry Mechanism for Interconnection Networks for Parallel Computers

IP.com Disclosure Number: IPCOM000039106D
Original Publication Date: 1987-Apr-01
Included in the Prior Art Database: 2005-Feb-01
Document File: 3 page(s) / 74K

Publishing Venue

IBM

Related People

Johnson, AM: AUTHOR [+2]

Abstract

In order to avoid severe degradation of performance and availability of a parallel computer system, it is very important that the system continue to operate in the presence of faults. Since most of the communication between resources (e.g., processors and memories) of the system is usually done across a multistage interconnection network, prompt error detection at each switch level of the network is very important, in order to reduce fault distribution and system contamination. Further, it has been observed that the frequency of transient errors on links of the network is much higher than that of permanent errors. Therefore, it is essential to recover from these transient errors to avoid severe performance degradation. Disclosed in this article is a concurrent fault detection and recovery method that meets these requirements.

This is the abbreviated version, containing approximately 52% of the total text.


In this disclosure, "concurrent" means that fault detection and recovery are carried out without halting normal system operation. Fig. 1 shows a simple example of the network, which consists of stages of switches interconnected by links. Each link is a collection of wires that connects two switches, or connects the network to external resources of the system. Each switch has the switching logic needed to route data from its input ports to its output port(s), as required.

(Image Omitted)

To describe the present method, three assumptions are made: (1) transients in the network cause only unidirectional errors, if any, on the wires of a link during any network clock cycle; (2) in two consecutive network clock cycles, transients in the network do not drive a link wire erroneously in opposite directions; and (3) there is one common clock for the whole network.

The switches of the network transmit the required information across their links as a sequence of words. No constraint is placed on the width of these words or on the length of the sequence. To detect multiple permanent and transient errors on the links, the present method transmits each word in two consecutive network cycles: in the first cycle the word is sent in its true form, and in the following cycle its complemented (inverted) form is sent. This complementing is done within the transmitting logic of the switches; a high-level circuit for it is shown in Fig. 2.

Because the network is assumed to be synchronous, all the switches execute these true and complement word-transfer cycles at the same time, so the receiving logic of a switch knows which form each received word is in. A high-level circuit for the receiving logic of the switches is shown in Fig. 3. The receiving logic of the switches receives both these words, inverts the complemented word, and then compare...
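The true/complement transfer described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the circuits of Fig. 2 and Fig. 3: the 8-wire link width, the function names `transmit`, `receive`, and `stuck_at`, and the reading that a mismatch after re-inversion signals an error are all hypothetical, since the available text breaks off at the comparison step and does not say what follows it.

```python
WIDTH = 8                  # hypothetical link width; the source puts no constraint on it
MASK = (1 << WIDTH) - 1    # keeps values within WIDTH wires


def transmit(word: int) -> tuple[int, int]:
    """Return the values driven onto the link in two consecutive cycles."""
    word &= MASK
    # Cycle 1 carries the true form, cycle 2 the complemented form.
    return word, ~word & MASK


def receive(first: int, second: int) -> tuple[bool, int]:
    """Invert the second (complemented) word and compare it with the first.

    Returns (ok, word). Treating a mismatch as a detected error is an
    assumption of this sketch; the source text is truncated at this point.
    """
    recovered = ~second & MASK
    return first == recovered, first


def stuck_at(value: int, wires: int, level: int) -> int:
    """Model a unidirectional fault: force the wires in `wires` to `level`."""
    if level:
        return (value | wires) & MASK
    return value & ~wires & MASK
```

In this model, a wire that is forced to the same level in both cycles reads identically in the true and complement cycles, so the re-inverted second word must differ from the first on that wire regardless of the data. A wire disturbed in only one of the two cycles also produces a mismatch. The one case the comparison would miss is a wire driven erroneously in opposite directions in the two cycles, which is the case excluded by assumption (2).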