Browse Prior Art Database

Improved Estimation Techniques for Parallel Hash Joins

IP.com Disclosure Number: IPCOM000107654D
Original Publication Date: 1992-Mar-01
Included in the Prior Art Database: 2005-Mar-22
Document File: 2 page(s) / 103K

Publishing Venue

IBM

Related People

Dias, DM: AUTHOR [+3]

Abstract

Disclosed is an improvement to the parallel hash join algorithm of (*). The original algorithm was designed to effectively handle parallel hash joins in the presence of data skew. The improvement involves two new techniques. First, multiple fine hash functions are used, and their intersection is used as an indication of matches. Second, more detailed statistics pertaining to the highest skew hash classes of the first relation are obtained when scanning the second relation.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

Improved Estimation Techniques for Parallel Hash Joins

       Disclosed is an improvement to the parallel hash join
algorithm of (*).  The original algorithm was designed to effectively
handle parallel hash joins in the  presence of data skew.  The
improvement involves two new techniques. First, multiple fine hash
functions are used, and their intersection is used as an indication
of matches.  Second, more detailed statistics pertaining to the
highest skew hash classes of the first relation are obtained when
scanning the second relation.

      The basic method is as follows:  The first relation R1 is
scanned, multiple fine hash functions are applied, and statistics
(number of elements falling in each composite hash class) are
gathered in memory for each fine hash function.  For each coarse hash
bucket, the composite hash function resulting in the lowest skew
statistics is selected for use in the scan of R2.  (Several possible
measures  of skew can be used in selecting the composite hash
function, for example, the smallest average cardinality of the
largest K bins.)  Note that the composite hash function selected can
be different for each coarse hash bucket.  With the selected
composite (and, therefore, fine) hash function for each coarse hash
bucket, the top cardinality N bins are selected, where N is chosen to
allow these statistics to be kept in memory.

      In the hash phase for relation R2, only the selected fine hash
function is used for each coarse hash bucket.  For the selected (high
skew) composite hash buckets, complete statistics  are  gathered on
the values and their cardinalities that fall in  these composite hash
buckets. Thus, after hashing both relations, detailed  statistics  of
the selected high skew composite hash classes are available for the
second relation, but not for the first.

      The join output cardinalities are then estimated as follows.
In the non-selected (i.e., lower skew) hash classes, the estimation
is the same as in (*).  For the selected high skew composite hash
classes, the top cardinality values are examined in turn.  If any of
the corresponding buckets in R1 are not high skew, then that element
is assumed not to  correspond  to  the  high  skew element that
flagged the composite hash class as a high skew hash class in R1.  In
this case, the next highest cardinality value of the composite hash
class in R2 is examined, and so on; th...