Technique for Duplicate Key Elimination During Data Loading

IP.com Disclosure Number: IPCOM000123711D
Original Publication Date: 1999-Mar-01
Included in the Prior Art Database: 2005-Apr-05
Document File: 3 page(s) / 108K

Publishing Venue

IBM

Related People

Lightstone, S: AUTHOR

Abstract

This invention deals with how database record IDs (RIDs) can be sorted in order of first table appearance, even though the record identifier may not contain information regarding sequential page numbers. This is particularly important in the context of data loading, a process whereby user data is fed into a database at high speeds. In such processing, table data, including RIDs, may need to be completely sorted. If duplicate entries appear in the data, and if such duplicates are invalid for the target table, then the loading process will need to determine which records should be left in the table and which do not belong. This requires the loader to be able to ascertain which records existed in the table prior to the loading attempt, by using the technique described in this disclosure.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

   This invention deals with how database record IDs (RIDs)
can be sorted in order of first table appearance, even though the
record identifier may not contain information regarding sequential
page numbers.  This is particularly important in the context of data
loading, a process whereby user data is fed into a database at high
speeds.  In such processing, table data, including RIDs, may need to
be completely sorted.  If duplicate entries appear in the data, and if
such duplicates are invalid for the target table, then the loading
process will need to determine which records should be left in the
table and which do not belong.  This requires the loader to be able
to ascertain which records existed in the table prior to the loading
attempt, by using the technique described in this disclosure.

   In database processing, sorting of record identifiers is
a standard operation.  In DB2, and many other Relational Database
Management Systems, the record identifiers are stored as 4 byte
quantities, which typically include three bytes representing a page
identifier, and a single byte representing a "page slot".  A page
slot is an index into a list of page offsets.  The DB2 load
utility adds user data (as a set of records) to a table in three
phases.  The three phases are: LOAD, BUILD and DELETE.  In the first
phase all user records are loaded, indiscriminately.  In the second
phase, table indexes are built.  During the BUILD phase the load
utility may discover that some records loaded during the LOAD phase
were actually in violation of a uniqueness constraint on the table,
and should not be present in the table at all.  The row identifiers
(RIDs) for these records are stored, and during the DELETE phase, all
such records are deleted.
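
   The three phases can be sketched in a few lines of Python.  This
is an illustration only, not the DB2 implementation: the function
name load_with_unique_key, the in-memory list standing in for the
table, and the use of a list position as the RID are all assumptions
made for this sketch.  Because a list position grows with insertion
order, the record that was in the table first is always the one kept
here; the disclosure concerns the harder case in which the RID alone
does not reveal that order.

```python
def load_with_unique_key(table, new_records, key):
    """Sketch of a LOAD / BUILD / DELETE cycle (hypothetical helper).

    table       -- list of records; the list position plays the role of a RID
    new_records -- records supplied by the user
    key         -- function returning the unique-key value of a record
    Returns (index, rejected_rids).
    """
    # LOAD phase: append every user record without any checking.
    table.extend(new_records)

    # BUILD phase: build the unique index and note violating RIDs.
    index = {}           # key value -> RID of the record that is kept
    rejected_rids = []   # RIDs found to violate the uniqueness constraint
    for rid, record in enumerate(table):
        if record is None:       # slot emptied by an earlier DELETE phase
            continue
        k = key(record)
        if k in index:
            rejected_rids.append(rid)
        else:
            index[k] = rid

    # DELETE phase: remove every record whose RID was stored during BUILD.
    for rid in rejected_rids:
        table[rid] = None
    return index, rejected_rids


table = [{"id": 1, "name": "existing"}]
incoming = [{"id": 2, "name": "new"}, {"id": 1, "name": "duplicate"}]
index, rejected = load_with_unique_key(table, incoming, key=lambda r: r["id"])
print(rejected)  # RIDs removed in the DELETE phase
```

   In this run the pre-existing record with id 1 stays in the table
and the incoming record with the same id is the one deleted.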

   Consider a RID being a 4 byte quantity consisting of a 3
byte I/O space page number and a 1 byte slot.  The idea of a 3 byte
space page can be explained most easily with a trivial example.
Consider every page on disk as being assigned a number that maps to a
location.  Now consider any particular page in this scheme, such as
page number 2762196.  This page number is represented in hexadecimal
as x2A25D4, which requires three bytes for representation in
computer memory: x2A, x25 and xD4.
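
   The arithmetic above can be checked with a short Python sketch.
The helper names (page_to_bytes, pack_rid, unpack_rid) are
hypothetical, and two layout details are assumptions made for
illustration rather than statements about DB2 internals: the page
number is written most significant byte first, and it occupies the
high three bytes of the RID with the slot in the low byte.

```python
def page_to_bytes(page):
    """Return the 3-byte form of a page number, most significant byte first."""
    return page.to_bytes(3, "big")


def pack_rid(page, slot):
    """Combine a 3-byte page number and a 1-byte slot into a 4-byte RID."""
    if not 0 <= page < (1 << 24):
        raise ValueError("page number does not fit in 3 bytes")
    if not 0 <= slot < (1 << 8):
        raise ValueError("slot does not fit in 1 byte")
    return (page << 8) | slot  # assumed layout: page high, slot low


def unpack_rid(rid):
    """Split a 4-byte RID back into (page, slot)."""
    return rid >> 8, rid & 0xFF


page = 2762196
print(hex(page))                        # 0x2a25d4
print(page_to_bytes(page).hex(" "))     # 2a 25 d4
rid = pack_rid(page, slot=7)
print(unpack_rid(rid))                  # (2762196, 7)
```

   Under this assumed layout, comparing two RIDs as plain integers
orders them by page number first and by slot second.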

   Similarly, pages in the database table are represented
as 3 byte quantities.  Consider t...