Concurrent Error Detection and Retry Mechanism for Interconnection Networks for Parallel Computers
Original Publication Date: 1987-Apr-01
Included in the Prior Art Database: 2005-Feb-01
To avoid severe degradation of the performance and availability of a parallel computer system, the system must continue to operate in the presence of faults. Because most communication between the resources of the system (e.g., processors and memories) usually crosses a multistage interconnection network, prompt error detection at each switch level of the network is needed to limit the spread of faults and contamination of the system. Further, transient errors on the links of the network have been observed to occur much more frequently than permanent errors, so recovering from transient errors is essential to avoid severe performance degradation. This article discloses a concurrent fault detection and recovery method that meets these requirements.