Browse Prior Art Database

Hash Index for Locating Tuples in a Distributed or Parallel Database

IP.com Disclosure Number: IPCOM000104632D
Original Publication Date: 1993-May-01
Included in the Prior Art Database: 2005-Mar-19
Document File: 4 page(s) / 118K

Publishing Venue

IBM

Related People

Copeland, GP: AUTHOR [+3]

Abstract

Disclosed is a method for determining tuple storage locations in a distributed or parallel database environment. The tuples are partitioned across multiple storage nodes which do not share disks or memory. The method uses an efficient hash index facility. Other variations of hash indexes have been proposed or used for the purpose of locating data in parallel and distributed database systems. This disclosure describes a particular implementation of a hash index, which:

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 48% of the total text.

Hash Index for Locating Tuples in a Distributed or Parallel Database

      Disclosed is a method for determining tuple storage locations
in a distributed or parallel database environment.  The tuples are
partitioned across multiple storage nodes which do not share disks or
memory.  The method uses an efficient hash index facility.  Other
variations of hash indexes have been proposed or used for the purpose
of locating data in parallel and distributed database systems.  This
disclosure describes a particular implementation of a hash index,
which:

1.  uses no special hardware to accomplish the routing of data and
    requests (in contrast to the Teradata database machine), allowing
    it to be used with arbitrary networks of computers;

2.  is used by sending-nodes to route requests, versus by
    receiving-nodes to filter requests, thereby reducing overall
    communications and data processing overhead;

3.  allows the flexibility of spreading a table's data over a
    variable number of nodes (one, some, or all); and

4.  allows data to be easily redistributed by moving specific buckets
    between nodes (as opposed to unloading and reloading entire
    tables), for the purpose of balancing storage consumption or data
    access across nodes, or adding and removing nodes from the
    system.

      A particular architecture for parallel and distributed database
systems, known in the industry as a "shared-nothing" architecture,
uses a set of computers to collectively manage a database; the set of
computers do not share memory or disks.  In this architecture, the
files or tables are partitioned into disjoint subsets; each subset is
stored on a different computer, or "node".  Since the nodes do not
share access to any common memory or disks, they use communication
networks to send data or database requests between them.

      In such an environment, a facility is required to determine the
node which will store each tuple in a database table.  When new
tuples are added to the database, the facility is consulted to find
out where each tuple should be stored.  In requests that retrieve or
update tuples, the facility is consulted to find out where each tuple
has been stored.  This facility is important in achieving maximum
throughput of database requests:  the facility allows a large class
of requests to be sent to only those computers which are known will
contribute to the request's answer, avoiding unnecessary disk
input/output, communications, and cpu processing on those nodes that
cannot contribute to the answer.  Without such a facility, every
request would have to be broadcast to every node for processing,
since the location of relevant tuples could not be determined a
priori.

      To solve this problem, we use a hash-index to determine the
location of tuples on nodes in the distributed database system.  Each
database table is assumed to consist of tuples that are all of the
same ty...