Full Participation Hashing/Compression Algorithm

IP.com Disclosure Number: IPCOM000114139D
Original Publication Date: 1994-Nov-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 4 page(s) / 121K

Publishing Venue

IBM

Related People

Hargis, SD: AUTHOR [+2]

Abstract

Our algorithm solves the problem of compressing m bits of information into n bits of information, where m>n, when it is not necessary to be able to restore the original m bits. Our algorithm is fast, achieves full storage utilization, and has been shown to yield a uniform distribution across the range.

Full Participation Hashing/Compression Algorithm

      Our algorithm solves the problem of compressing m bits of
information into n bits of information, where m>n, when it is not
necessary to be able to restore the original m bits.  Our algorithm
is fast, achieves full storage utilization, and has been shown to
yield a uniform distribution across the range.

      Data compression algorithms sometimes require that the original
information be retrievable by reversing the compression algorithm;
however, it can be useful to compress information without needing to
retrieve the original form.  This is usually because the full
information is stored elsewhere and an abbreviated form that retains
its uniqueness when compared to other information is needed.  Hashing
functions are a good example.  Hashing functions can be used when the
original information is too voluminous to work with efficiently, thus
the information is "summarized" into a suitable form.  It is
important that the transformation used to summarize the information
generates unique summaries given unique information, which is one
reason why all of the data should participate in generating
the result.  The summary, or transformation, can be evaluated on the
following characteristics:  execution speed, storage utilization,
number of collisions/number of probes, and loading factor.  For our
product the loading factor is not relevant because storage (the table
size) is well beyond the largest expected size.
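
      To make the idea of full participation concrete, the sketch
below shows a hash in which every byte of the input influences the
32-bit result.  It is only an illustration in C under that
assumption; it is not the algorithm of this disclosure.

    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative only: every input byte participates in the result. */
    static uint32_t full_participation_hash(const unsigned char *data,
                                            size_t len)
    {
        uint32_t h = 0;
        for (size_t i = 0; i < len; i++) {
            /* Rotate the accumulator, then mix in the next byte, so
             * that no part of the input is ignored in the result.   */
            h = (h << 5) | (h >> 27);
            h ^= data[i];
        }
        return h;
    }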

      Our algorithm was invented to compress a 32-bit hash key into a
24-bit key, although more generalized applications are given.  One
can view this as double hashing or as a hash and a compression, hence
the name of the disclosure.  The evaluation of our algorithm is
excellent on all three criteria referenced above:  execution speed is
44 microseconds per element on a machine with a SPECmark of 52.4,
full storage utilization is achieved, and the algorithm yielded a
uniform distribution from randomly generated data items (indicating
that the number of collisions is minimal given the range).  The
algorithm generates more collisions if sequential input is given.
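
      The abbreviated text does not reproduce the transformation
itself, so the following sketch is only an assumption: it shows one
common way to compress a 32-bit value into 24 bits, XOR-folding the
high-order byte into the low-order 24 bits so that all 32 input bits
participate in the result.

    #include <stdint.h>

    /* Assumed illustration, not the disclosure's transformation:
     * fold the high-order 8 bits into the low-order 24 bits so that
     * every bit of the 32-bit key participates in the 24-bit result. */
    static uint32_t fold_32_to_24(uint32_t key32)
    {
        uint32_t low24 = key32 & 0x00FFFFFFu;   /* low-order 24 bits */
        uint32_t high8 = key32 >> 24;           /* high-order 8 bits */
        return low24 ^ high8;                   /* fits in 24 bits   */
    }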

      Our product needs to store information in a database;
efficient retrieval of information from the database dictates the use
of an index.  However, the primary key is a variable-length character
field that can be up to 2048 characters long; databases do not
usually allow indexing on such a field.  To maintain the uniqueness
of the key and to allow efficient retrieval from the database, we
decided to hash the variable-length field and use the hash result (a
32-bit integer) as the key (thus the range of the hash result is
0 to (2**32)-1).  The desired method of handling collisions is to
always have the 8 low-order bits of the hash result be zeroes;
collisions can then be resolved by incrementing the low-order byte
(thus 255 collisions are possible).  This limits the hash result to the...
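
      A minimal sketch of the collision-handling scheme just
described, assuming a hypothetical "taken" predicate that stands in
for a lookup against the database index:

    #include <stdint.h>

    /* Sketch under the stated assumptions: the 8 low-order bits of the
     * 32-bit hash result start out as zeroes, and each collision is
     * resolved by incrementing the low-order byte (up to 255 times).
     * The "taken" callback is hypothetical; it stands in for a lookup
     * against the database index.                                     */
    static int assign_key(uint32_t hash32, int (*taken)(uint32_t),
                          uint32_t *key_out)
    {
        uint32_t base = hash32 & 0xFFFFFF00u;    /* force low byte to zero  */
        for (uint32_t probe = 0u; probe <= 255u; probe++) {
            uint32_t candidate = base | probe;   /* bump the low-order byte */
            if (!taken(candidate)) {
                *key_out = candidate;
                return 0;                        /* unique key assigned */
            }
        }
        return -1;   /* every value for this hash prefix is occupied */
    }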