
Using VLSI to Reduce Serialization and Memory Traffic in Shared Memory Parallel Computers

IP.com Disclosure Number: IPCOM000128241D
Original Publication Date: 1986-Dec-31
Included in the Prior Art Database: 2005-Sep-15
Document File: 11 page(s) / 42K

Publishing Venue

Software Patent Institute

Related People

Susan Dickey: AUTHOR

Abstract

The NYU Ultracomputer is an architecture for a large scale MIMD (Multiple Instruction stream, Multiple Data stream) shared memory parallel computer that may be viewed as a column of processors and a column of memory modules connected by a rectangular network of enhanced two by two buffered crossbars. The primary novelty of the design is the ability of the network to combine multiple requests directed at the same memory location, including a new coordination request, fetch-and-add. This permits task coordination to be achieved in a highly parallel manner. For example, if an arbitrary number of tasks simultaneously issue inserts or deletes to a single shared queue that is neither empty nor full, then all these inserts and deletes are accomplished in essentially the same time as would be required for a single insert or delete.

This report reviews the Ultracomputer architecture and system design and describes the VLSI enhanced buffered crossbars that are the key to the highly parallel behavior mentioned above.

Consider a powerful machine composed of thousands of processors and gigabytes of memory. With 10-20 MIPS (including floating point) and a megabyte of memory soon to be available on a dozen chips, such a configuration could be built to yield significantly more performance than current supercomputers with roughly the same component count. Moreover, due to replication of parts, the number of distinct components would be quite small.

This work was supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under contract number DE-AC02-76ER03077, and in part by the National Science Foundation, under grant number DCR-8413359. This paper will appear in Advanced Research In VLSI: Proceedings of the Fourth MIT Conference, Charles E. Leiserson, editor.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 9% of the total text.

Using VLSI to Reduce Serialization and Memory Traffic in Shared Memory Parallel Computers

Susan Dickey, Allan Gottlieb, Richard Kenner

Ultracomputer Research Laboratory Courant Institute of Mathematical Sciences New York University 251 Mercer Street New York, NY 10012

Ultracomputer Note #94.

January, 1986

ABSTRACT

The NYU Ultracomputer is an architecture for a large scale MIMD (Multiple Instruction stream, Multiple Data stream) shared memory parallel computer that may be viewed as a column of processors and a column of memory modules connected by a rectangular network of enhanced two by two buffered crossbars. The primary novelty of the design is the ability of the network to combine multiple requests directed at the same memory location, including a new coordination request, fetch-and-add. This permits task coordination to be achieved in a highly parallel manner. For example, if an arbitrary number of tasks simultaneously issue inserts or deletes to a single shared queue that is neither empty nor full, then all these inserts and deletes are accomplished in essentially the same time as would be required for a single insert or delete.
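As a rough illustration of why fetch-and-add suits queue coordination, the sketch below lets several tasks each claim a distinct queue slot with one fetch-and-add on a shared tail index. It assumes the usual definition of the operation (return the old value, store old value plus increment). The names FetchAndAddCell and claim_slots are hypothetical, and the lock only emulates atomicity in software; it says nothing about how the Ultracomputer hardware achieves it.

```python
import threading


class FetchAndAddCell:
    """Hypothetical software stand-in for a memory word with fetch-and-add."""

    def __init__(self, value=0):
        self._value = value
        # The lock only emulates atomicity; real hardware would not use one.
        self._lock = threading.Lock()

    def fetch_and_add(self, increment):
        """Return the old value and replace it with old value + increment."""
        with self._lock:
            old = self._value
            self._value = old + increment
            return old


def claim_slots(num_tasks):
    """Each task claims one queue slot with a single fetch-and-add on the tail."""
    tail = FetchAndAddCell(0)
    claimed = [None] * num_tasks

    def task(i):
        # No task waits for another to finish its insert; each simply
        # receives a distinct slot index.
        claimed[i] = tail.fetch_and_add(1)

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # fetch_and_add(0) reads the current tail without changing it.
    return sorted(claimed), tail.fetch_and_add(0)
```

Whatever order the tasks run in, every slot index is handed out exactly once, which is the property a shared queue needs from its insert pointer. Bounds checks for a full or empty queue are deliberately left out of this sketch.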

This report reviews the Ultracomputer architecture and system design and describes the VLSI enhanced buffered crossbars that are the key to the highly parallel behavior mentioned above.
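The combining idea can be sketched in a few lines. The excerpt does not spell out the switch logic, so the following assumes the commonly described scheme: when two fetch-and-add requests for the same address meet in a switch, the switch forwards one request carrying the sum of the increments, remembers the first increment, and splits the single reply on the way back. All function names here are hypothetical.

```python
def combine(e1, e2):
    """Merge two fetch-and-add increments aimed at the same address.

    Returns the increment forwarded toward memory and the value the
    switch keeps so it can later split the reply.
    """
    return e1 + e2, e1


def decombine(reply, saved):
    """Split the single reply from memory into replies for both requesters."""
    return reply, reply + saved


def memory_fetch_and_add(memory, address, increment):
    """The memory module's side: return the old value, store the sum."""
    old = memory[address]
    memory[address] = old + increment
    return old


def two_requests_through_switch(memory, address, e1, e2):
    """Two simultaneous requests cost the memory module one access."""
    forwarded, saved = combine(e1, e2)
    reply = memory_fetch_and_add(memory, address, forwarded)
    return decombine(reply, saved)
```

With memory word 0 holding 10, increments 3 and 4 produce the replies 10 and 13 and leave 17 in memory, exactly as if the two requests had been served one after the other, yet the memory module sees only a single request.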

Consider a powerful machine composed of thousands of processors and gigabytes of memory. With 10-20 MIPS (including floating point) and a megabyte of memory soon to be available on a dozen chips, such a configuration could be built to yield significantly more performance than current supercomputers with roughly the same component count. Moreover, due to replication of parts, the number of distinct components would be quite small. Hardware design and assembly of a multiprocessor with a very high degree of parallelism therefore poses no new problems. However, actually using all the processing power that can theoretically be generated presents a two-fold challenge. First, several thousand processors must be coordinated in such a way that their aggregate power is applied to useful computation. Serial procedures in which one processor works while the others wait become bottlenecks that drastically reduce the power obtained. The cost of serial bottlenecks rises linearly with the number of processors involved; in any highly parallel architecture, they must be eliminated. Second, the machine must be

This work was supported in part by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U.S. Department of Energy, under contract number DE-AC02-76ER03077, and in part by the National Science Foundation, under grant number DCR-8413359. This paper will appear in Advanced Research In VLSI: Proceedings of the Fourth MIT Conference, Charles E. Leiserson, editor.
