Browse Prior Art Database

PARALLEL EQUI-JOIN ALGORITHM for LARGE RELATIONAL DATA BASE Operations

IP.com Disclosure Number: IPCOM000040297D
Original Publication Date: 1987-Oct-01
Included in the Prior Art Database: 2005-Feb-02
Document File: 4 page(s) / 19K

Publishing Venue

IBM

Related People

Pfister, G: AUTHOR

Abstract

A technique is described whereby an algorithm minimizes the processing time of large relational data base operations in multiprocessing computer systems. The algorithm follows straightforward hash-table base techniques for equi-join, by providing virtually unbounded parallelism. The algorithm assumes implementation on highly parallel processors which rely on three characteristics: 1. The input/output (I/O) operations are performed concurrently with the computations. 2. There is a large main memory which is accessible by all processors equally. 3. There is an adequate implementation of the synchronization primitive fetch-and-add. As in similar equi-join algorithms the performance of this concept assumes that the output relations are approximately the same size as the input relations.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 32% of the total text.

Page 1 of 4

PARALLEL EQUI-JOIN ALGORITHM for LARGE RELATIONAL DATA BASE Operations

A technique is described whereby an algorithm minimizes the processing time of large relational data base operations in multiprocessing computer systems. The algorithm follows straightforward hash-table base techniques for equi-join, by providing virtually unbounded parallelism. The algorithm assumes implementation on highly parallel processors which rely on three characteristics: 1. The input/output (I/O) operations are performed concurrently with the computations. 2. There is a large main memory which is accessible by all processors equally. 3. There is an adequate implementation of the synchronization primitive fetch-and-add. As in similar equi-join algorithms the performance of this concept assumes that the output relations are approximately the same size as the input relations. In the worst case, the join is the product of the sizes of the inputs. The concept is designed to produce the correct result in the worst case. The processing performed, while reading the first relation, will go up non-linearly with the size of that relation. For the equi-join, assume two relations R1 and R2, with a column C1 of R1 and C2 of R2. The relation Ro has rows that are the concatenation of rows of R1 and R2. Specifically, Ro contains all those pairs of rows from R1 and R2 that have the contents of C1 equal to the contents of C2. C1 and C2 must contain data of equivalent type. For the fetch-and-add (FAA), use:V = FAA(A,X), where V is an integer variable, X is an integer, and A is an integer variable, in that A is implicitly the address of a location containing an integer. The equation will enable V to contain A's original contents and A contains the sum of X and A's original contents. When multiple processors execute FAA simultaneously, FAA appears indivisible, such that the result is the same as if the multiple processors executed FAA in some undefined sequence. Also, the time required for any number of processors to execute FAA simultaneously is equal to the time required for one processor to execute FAA. Standard double buffering techniques are assumed, allowing I/O to proceed concurrently with the computation. Also, an extension to triple or N-buffer operation is assumed and the first relation R1 is read into a linked list of buffers and held in memory. R1 is chosen to be the smaller of the two input relations.

For the rest of the algorithm, the type of operation provided assumes that all processors assigned to the problem execute the same program, taking different paths, as in: a) single program multiple data (SPMD) mode and b) fetch-and-add based implementation of a simple indexed parallel loop that contains no serial sections, as in the following: FORALL i IN 1...n;

...body of loop...

ENDFORALL The qualifier "shared" will be added to declarations of data that are globally shared by all processors

1

Page 2 of 4

assigned to the problem. The default is for data to not b...