Dismiss
InnovationQ will be updated on Sunday, Oct. 22, from 10am ET - noon. You may experience brief service interruptions during that time.
Browse Prior Art Database

Workload Monitoring in a Distributed Database Environment

IP.com Disclosure Number: IPCOM000110310D
Original Publication Date: 1992-Nov-01
Included in the Prior Art Database: 2005-Mar-25
Document File: 3 page(s) / 145K

Publishing Venue

IBM

Related People

Hargis, SD: AUTHOR

Abstract

Disclosed is a software method for determining the most efficient division of work among n nodes in a distributed database. In a distributed database environment where the data may be stored on more than one node, it can be advantageous for the nodes to share the workload. Workload sharing is particularly advantageous when queries are divided into multiple sections that may not be processed by the same node. To minimize response time, each node should finish processing at about the same time; otherwise, some nodes may be left idle while waiting for other nodes to complete. Thus, a method of determining the most efficient division of work among n nodes is necessary.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 50% of the total text.

Workload Monitoring in a Distributed Database Environment

       Disclosed is a software method for determining the most
efficient division of work among n nodes in a distributed database.
In a distributed database environment where the data may be stored on
more than one node, it can be advantageous for the nodes to share the
workload.  Workload sharing is particularly advantageous when queries
are divided into multiple sections that may not be processed by the
same node.  To minimize response time, each node should finish
processing at about the same time; otherwise, some nodes may be left
idle while waiting for other nodes to complete.  Thus, a method of
determining the most efficient division of work among n nodes is
necessary.

      In accessing tables, disk access time is most often the
bottle-neck factor; CPU processing time may also be significant.
User (person or machine) query patterns also affect disk access time.
Based on these assumptions, we will pursue a strategy that is based
on table access statistics.  Since table accesses are directly
affected by query patterns, our strategy includes tracking query
patterns and deriving workload statistics from them.

      The result of hashing the key of a database tuple can be used
as an index to the storage location for that tuple.  The collection
of these hash results, and a node identifier for the storage
location, is called a bucket map.  Bucket maps map ranges of hashed
tuple keys into the available storage nodes.  Note that there is only
one hash function; this contrasts with various other solutions which
use different hash functions depending on the type of hashing result
(e.g., normal, uniform, bimodal, etc. distribution) that is desired.

      It is possible for the bucket maps to contain either a physical
node address or a logical node address that is easily transformed
into a physical address.  Maximum flexibility is achieved when the
range of hashed tuple keys is much greater than the number of
possible storage locations (e.g., 0 <= hash tuple keys <= 4095; with
256 possible nodes).

      "Bucket" is the term used to refer to the collection of tuples
whose hashed key values, thus storage nodes, are the same.  It is
possible, indeed planned, that several tuples will be stored in the
same bucket.  Of course, multiple buckets may be stored on the same
node.

      The solution utilizes a Bucket Index mechanism to track user
query patterns on a set of tuples collectively referred to as a
bucket.  These bucket statistics can be used to formulate workload
models that enable us to balance the workload among n nodes.

      Accessing data in a database usually requires knowing a table
name and something about the tuple; ideally the tuple key is known
since tuple keys are unique per table.  In a distributed database
environment the data may be stored on more than one node.  It is the
responsibility of the database system to know where each...