Methods for Improving the Efficiency of Parallel Sort Merge Joins in the Presence of Data Skew

IP.com Disclosure Number: IPCOM000119860D
Original Publication Date: 1991-Mar-01
Included in the Prior Art Database: 2005-Apr-02
Document File: 5 page(s) / 238K

Publishing Venue

IBM

Related People

Dias, DM: AUTHOR [+4]

Abstract

Methods are disclosed for gathering information during the sort step of a parallel sort merge join. This information can be used to improve the efficiency of the scheduling and join steps that follow, by providing better estimates and reduced I/O and communication. This disclosure improves upon the parallel sort merge join algorithm invented by Wolf, Dias and Yu (*).

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 25% of the total text.

      Following is a brief description of the original algorithm in
(*):  The method assumes that, initially, the two relations to be
joined, R1 and R2, are partitioned among the P processors.  In the
sort phase of the algorithm, each of the processors sorts its
partition of each of relations R1 and R2 (using a conventional
external sorting method such as a tournament tree sort and the
merging of sorted runs). The next and crucial phase of the method is
the scheduling phase.  Essentially, in this phase each of the
relations R1 and R2 is partitioned according to ranges of join
column values, and each such partition is assigned to a task or set
of tasks that will perform the join of values in the corresponding
range during the join phase.  Further, the scheduling
phase allocates the tasks to processors so as to balance the
processing time of the join phase among the P processors.  In the
transfer phase (for the data partitioning architecture only) ranges
of the two relations corresponding to the allocation of tasks are
shipped to the appropriate processors.  Finally, in the join phase
the allocated tasks carry out the join operation of the corresponding
ranges of the relations.
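For concreteness, the per-task work of the join phase amounts to an ordinary merge join over two sorted runs. The following is a hypothetical Python sketch, not code from the disclosure; tuples are modeled as (join_key, payload) pairs, already sorted on the join key by the sort phase:

```python
def merge_join(r1, r2):
    """Join two lists of (key, payload) tuples, each sorted by key."""
    out = []
    i = j = 0
    while i < len(r1) and j < len(r2):
        k1, k2 = r1[i][0], r2[j][0]
        if k1 < k2:
            i += 1
        elif k1 > k2:
            j += 1
        else:
            # Gather the run of equal keys on each side and
            # emit the cross product of the two runs.
            i2 = i
            while i2 < len(r1) and r1[i2][0] == k1:
                i2 += 1
            j2 = j
            while j2 < len(r2) and r2[j2][0] == k1:
                j2 += 1
            for a in r1[i:i2]:
                for b in r2[j:j2]:
                    out.append((k1, a[1], b[1]))
            i, j = i2, j2
    return out
```

Because each task is restricted to one range (or one value) of the join column, the cost of such a merge join can be balanced across processors by the scheduling phase.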

      The scheduling phase partitions the relations into ranges as
follows:  Start with the entire relations, each processor having
already sorted its own portion of each relation.  Each processor
determines the median join column value of its partition of each
relation and associates with it a weight equal to the number of
tuples in that partition.  The values
from all the processors are shipped to a coordinator and the weighted
median M of these medians is obtained (i.e., if the medians are
sorted, then the sum of the weights of medians less than or equal to
the value of M is greater than or equal to half the sum of all
weights, but this is not true if the weight of median M is excluded).
The value M is broadcast to all processors, each of which splits the
tuples of the two relations they have into three regions, the first
with values less than M, the second with values equal to M, and the
third with values greater than M.  Each of these ranges is associated
with a task, and the time to join each of the tasks created is
estimated.  A task is said to be of type 1 if it corresponds to a
range of join column values, and of type 2 if it corresponds to a
single join column value.  The largest type 1 task is then split into
three sub-tasks using the same method as above (finding median of
medians o...
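The weighted median-of-medians step described above can be sketched as follows. This is a hypothetical Python illustration, not the disclosure's code: each processor reports a (median, weight) pair, the coordinator computes M, and each processor then splits its sorted tuples into the three ranges (less than M, equal to M, greater than M):

```python
def weighted_median(medians_weights):
    """Given (median, weight) pairs from all processors, return the
    value M such that the total weight of medians <= M is at least
    half the total weight, but falls below half if M's own weight
    is excluded."""
    pairs = sorted(medians_weights)        # sort by median value
    total = sum(w for _, w in pairs)
    running = 0
    for value, weight in pairs:
        running += weight
        if running >= total / 2:
            return value
    return pairs[-1][0]

def split_by_median(tuples, key, m):
    """Split one processor's tuples into the three ranges
    (< M, == M, > M) that become join tasks."""
    less  = [t for t in tuples if key(t) <  m]
    equal = [t for t in tuples if key(t) == m]
    more  = [t for t in tuples if key(t) >  m]
    return less, equal, more
```

The range equal to M becomes a type 2 task (a single join column value), while the ranges below and above M become type 1 tasks that can be split again by the same procedure.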