
Reliability and Performance Modelling of Hypercube based Multiprocessors

IP.com Disclosure Number: IPCOM000128672D
Original Publication Date: 1987-Dec-31
Included in the Prior Art Database: 2005-Sep-16
Document File: 13 page(s) / 43K

Publishing Venue

Software Patent Institute

Related People

Walid Najjar: AUTHOR

Abstract

Large-scale multiprocessor systems are now made possible by higher levels of integration. However, as the number of Processing Elements (PEs) in a multiprocessor system increases, so does the rate of failure. System reliability therefore becomes increasingly important. In traditional fault-tolerant architectures, the objective is to ensure failure-free operation over a long period of time, and to achieve this objective a heavy reliance is placed on hardware replication and redundancy. In large-scale parallel computing with homogeneous processors, on the other hand, the redundancy needed for fault tolerance is inherent in the design of the system. The objective thus becomes to allow the system to degrade gracefully under failures, down to the lowest acceptable performance level. While recovery schemes exist which allow safe degradation upon failure of a single PE, they do not usually protect against a failure which would isolate two portions of the system. In such an environment, the goal of this paper is to determine how several measures of reliability vary with the size of the system for a commonly used multiprocessor network topology, the hypercube. This topology has received a large amount of attention in recent times because its high degree of connectivity and its scalability make these architectures very attractive to a large class of scientific and numerical applications. Some of these applications utilize the specific topological properties of the hypercube for the mapping of the application, while others merely make use of the high level of spatial locality. In Section 2, several principles of fault tolerance in multiprocessor systems are recalled. Section 3 analyzes the probability of occurrence of an isolation (also called a disconnection). Analytical approximations of this probability are confirmed by a Monte-Carlo simulation approach. Disconnection probability measures of hypercubes of different sizes, as well as of other topologies (mesh and random graphs), are compared. The influence of the disconnection probability on reliability and on performance measures is presented in Section 4. New performance measures geared towards computationally oriented applications are also introduced in Section 4.3. Concluding remarks and directions for future research are presented in Section 5.

* This paper is based upon research supported by the National Science Foundation under Grant No. CCR-8603772 (USC/Department of Engineering - Systems), and by the Office of Naval Research, Arlington, VA under Contract No. N00014-86-K-0311 (USC/Information Sciences Institute).

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 9% of the total text.



Reliability and Performance Modelling of Hypercube-Based Multiprocessors*

Walid Najjar and Jean-Luc Gaudiot†

Reprinted from the 2nd International Workshop on Applied Mathematics and Performance/Reliability Models of Computer/Communication Systems, held in Rome, Italy, May 25-29, 1987.

USC Information Sciences Institute, Marina del Rey, California

† Computer Research Institute, Department of Electrical Engineering - Systems, University of Southern California, Los Angeles, California

1 Introduction

Large-scale multiprocessor systems are now made possible by higher levels of integration. However, as the number of Processing Elements (PEs) in a multiprocessor system increases, so does the rate of failure. System reliability therefore becomes increasingly important. In traditional fault-tolerant architectures, the objective is to ensure failure-free operation over a long period of time, and to achieve this objective a heavy reliance is placed on hardware replication and redundancy. In large-scale parallel computing with homogeneous processors, on the other hand, the redundancy needed for fault tolerance is inherent in the design of the system. The objective thus becomes to allow the system to degrade gracefully under failures, down to the lowest acceptable performance level. While recovery schemes exist which allow safe degradation upon failure of a single PE, they do not usually protect against a failure which would isolate two portions of the system. In such an environment, the goal of this paper is to determine how several measures of reliability vary with the size of the system for a commonly used multiprocessor network topology, the hypercube. This topology has received a large amount of attention in recent times because its high degree of connectivity and its scalability make these architectures very attractive to a large class of scientific and numerical applications. Some of these applications utilize the specific topological properties of the hypercube for the mapping of the application, while others merely make use of the high level of spatial locality.

In Section 2, several principles of fault tolerance in multiprocessor systems are recalled. Section 3 analyzes the probability of occurrence of an isolation (also called a disconnection). Analytical approximations of this probability are confirmed by a Monte-Carlo simulation approach.
Disconnection probability measures of hypercubes of different sizes as well as othe...

* This paper is based upon research supported by the National Science Foundation under Grant No. CCR-8603772 (USC/Department of Engineering - Systems), and by the Office of Naval Research, Arlington, VA under Contract No. N00014-86-K-0311 (USC/Information Sciences Institute).
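
The Monte-Carlo estimation of a disconnection probability mentioned in the introduction can be illustrated with a minimal sketch. The excerpt does not give the failure model or the simulation procedure used in the paper, so everything below is an assumption made for illustration only: PEs fail independently with a fixed probability `p_fail`, links never fail, and the system counts as disconnected when the surviving PEs form more than one connected component. The function names `is_disconnected` and `estimate_disconnection` are hypothetical.

```python
import random


def is_disconnected(n, alive):
    """Return True if the surviving nodes of an n-cube form more than one component.

    `alive` is a list of 2**n booleans indexed by node address.
    """
    survivors = [v for v in range(1 << n) if alive[v]]
    if len(survivors) <= 1:
        return False  # zero or one survivor: no portion can be isolated from another
    start = survivors[0]
    seen = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for bit in range(n):
            w = v ^ (1 << bit)  # hypercube neighbours differ in exactly one address bit
            if alive[w] and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) < len(survivors)


def estimate_disconnection(n, p_fail, trials, seed=0):
    """Monte-Carlo estimate of the disconnection probability (assumed failure model)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Assumption: each PE fails independently with probability p_fail.
        alive = [rng.random() >= p_fail for _ in range(1 << n)]
        if is_disconnected(n, alive):
            hits += 1
    return hits / trials
```

Calling, for example, `estimate_disconnection(4, 0.2, 10000)` returns the fraction of trials in which the assumed 16-node hypercube was split. Repeating the call for increasing `n` gives one way of observing how such a measure varies with system size, under the assumptions stated above.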