
Implementing Colocation in a Distributed Database

IP.com Disclosure Number: IPCOM000110289D
Original Publication Date: 1992-Nov-01
Included in the Prior Art Database: 2005-Mar-25
Document File: 3 page(s) / 147K

Publishing Venue

IBM

Related People

Hargis, SD: AUTHOR [+2]

Abstract

Disclosed is a software method for implementing the colocation of tables in a distributed database using a Bucket Index. In a distributed database environment, where the data may be stored on more than one node, it can be advantageous for the database administrator to be able to specify that certain tuples from two or more different tables should be stored on the same node; we call this colocating tuples. It is particularly advantageous when the database administrator knows that particular tables will be joined frequently; the tables can then be colocated on the join key value(s). Thus a method of implementing tuple colocation is needed.


Implementing Colocation in a Distributed Database

      The result of hashing the key of a database tuple can be used
as an index to the storage location for that tuple.  The collection
of these hash results, each paired with a node identifier for its
storage location, is called a bucket map.  Bucket maps map ranges of
hashed tuple keys onto the available storage nodes.
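
      As one illustration only (the names bucket_map_t, NUM_BUCKETS and
bucket_map_lookup are hypothetical, not taken from this disclosure), a
bucket map can be pictured as a simple array holding one node
identifier per hash bucket, sketched here in C:

  #include <stdint.h>

  /* Hypothetical sketch of a bucket map: one storage node identifier
   * per hash bucket.  NUM_BUCKETS is the size of the hash result
   * range. */
  #define NUM_BUCKETS 4096

  typedef struct {
      uint16_t node_id[NUM_BUCKETS];  /* storage node for each bucket */
  } bucket_map_t;

  /* Return the node that stores tuples whose hashed key falls in
   * 'bucket'. */
  static uint16_t bucket_map_lookup(const bucket_map_t *map,
                                    uint32_t bucket)
  {
      return map->node_id[bucket % NUM_BUCKETS];
  }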

      The hash function must accept any data type as input.  This is
accomplished by requiring that the starting location and length of
the tuple key be passed as parameters rather than passing a single,
fixed data type.  Thus the hash function views its input as a string
of bits; this also allows us to configure the number of bits used in
the hash operation at any one time.  Additionally, this strategy
allows control of the range of hashed tuple key values by configuring
the number of bits in the hash result.  Our hash function has been
shown to produce results that are uniformly distributed when the
input is normally distributed.
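
      The disclosure does not give the hash computation itself, so the
following is only a rough sketch of the interface described above: a
function that takes the starting location and length of the tuple key,
treats the key as a string of bits, and masks the result to a
configurable number of bits.  The FNV-1a mixing step is a stand-in
chosen for the sketch, not the hash used by the authors.

  #include <stddef.h>
  #include <stdint.h>

  /* Hash an arbitrary tuple key viewed as a string of bytes.
   * 'key' and 'len' give the starting location and length of the key;
   * 'result_bits' (assumed < 32 here) controls the range of the hash
   * result, e.g. 12 bits gives results in 0..4095.
   * The FNV-1a constants below are illustrative; the disclosure does
   * not specify the actual mixing function. */
  uint32_t hash_tuple_key(const void *key, size_t len,
                          unsigned result_bits)
  {
      const uint8_t *p = key;
      uint32_t h = 2166136261u;              /* FNV offset basis */

      for (size_t i = 0; i < len; i++) {
          h ^= p[i];
          h *= 16777619u;                    /* FNV prime */
      }
      return h & ((1u << result_bits) - 1u); /* keep result_bits bits */
  }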

      It is possible for the bucket maps to contain either a physical
node address or a logical node address that is easily transformed
into a physical address.  Maximum flexibility is achieved when the
range of hashed tuple keys is much greater than the number of
possible storage locations (e.g., 0 <= hashed tuple key <= 4095, with
256 possible nodes).
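
      A small, self-contained illustration of that ratio and of the
logical-to-physical indirection follows; the initial assignment and
the address computation are assumptions made for the sketch, not part
of the disclosure.

  #include <stdint.h>
  #include <stdio.h>

  #define NUM_BUCKETS 4096   /* 0 <= hashed tuple key <= 4095 */
  #define NUM_NODES    256   /* possible storage nodes */

  /* Logical node identifier per bucket.  Because there are many more
   * buckets than nodes, individual buckets can later be reassigned to
   * other nodes without changing the hash function. */
  static uint8_t bucket_to_logical_node[NUM_BUCKETS];

  /* Hypothetical logical-to-physical translation; in practice this
   * might index a table of network addresses for the storage nodes. */
  static uint32_t physical_address_of(uint8_t logical_node)
  {
      return 0x0A000000u + logical_node;     /* illustrative only */
  }

  int main(void)
  {
      /* One possible initial assignment: spread buckets round-robin
       * across the nodes. */
      for (int b = 0; b < NUM_BUCKETS; b++)
          bucket_to_logical_node[b] = (uint8_t)(b % NUM_NODES);

      uint32_t bucket = 1234;                /* a hashed tuple key */
      printf("bucket %u -> node %u -> address 0x%08x\n",
             (unsigned)bucket,
             (unsigned)bucket_to_logical_node[bucket],
             (unsigned)physical_address_of(
                 bucket_to_logical_node[bucket]));
      return 0;
  }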

      Bucket is the term used to refer to the collection of tuples
whose hashed key values, and thus storage nodes, are the same.  It is
possible, indeed planned, that several tuples will be stored in the
same bucket.  Of course, multiple buckets may be stored on the same
node.

      The solution utilizes a Bucket Index mechanism to specify which
tables should be colocated.  Accessing data in a database usually
requires knowing a table name and something about the tuple; ideally
the tuple key is known since tuple keys are unique per table.  In a
distributed database environment the data may be stored on more than
one node.  It is the responsibility of the database system to know
where each tuple is stored; this is often accomplished by hashing the
tuple key and transforming the hashed tuple key into a physical node
address.  Tuple key values whose hash results yield the same number
are said to be in the same bucket.
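
      Tying the pieces together, and reusing the hypothetical
hash_tuple_key, bucket_map_t and bucket_map_lookup sketches above, a
tuple would be located roughly as follows; the table names in the
comments are illustrative.

  /* Locate the storage node of a tuple.  Builds on the earlier
   * sketches and is illustrative only. */
  uint16_t node_for_tuple(const bucket_map_t *map,
                          const void *key, size_t key_len)
  {
      /* 12-bit hash result: buckets 0..4095. */
      uint32_t bucket = hash_tuple_key(key, key_len, 12);
      return bucket_map_lookup(map, bucket);
  }

  /* Colocation: if, say, an ORDERS table and an ORDER_LINES table are
   * both partitioned on the order-number join key and share the same
   * bucket map, then for any given order number both tables' tuples
   * hash to the same bucket and therefore reside on the same node, so
   * the join can be performed locally on that node. */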

   ...