Browse Prior Art Database

Sort Order Preserving Encoding for Data Compression

IP.com Disclosure Number: IPCOM000122019D
Original Publication Date: 1991-Oct-01
Included in the Prior Art Database: 2005-Apr-04
Document File: 2 page(s) / 50K

Publishing Venue

IBM

Related People

Hauser, DL: AUTHOR [+2]

Abstract

Database management systems use considerable amounts of disk storage and so, a reduction of disk storage could reduce the total system price. Data compression through statistical techniques is one method to reduce the disk storage requirement. Complex queries often require the execution of range predicates against the data. These predicates cannot be applied against data compressed using conventional compression algorithms, because the algorithms do not preserve sort order. An algorithm that does compression as well as preserve sort order therefore is needed and would enable the application of range predicates on compressed data.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 63% of the total text.

Sort Order Preserving Encoding for Data Compression

      Database management systems use considerable amounts of
disk storage and so, a reduction of disk storage could reduce the
total system price.  Data compression through statistical techniques
is one method to reduce the disk storage requirement.  Complex
queries often require the execution of range predicates against the
data.  These predicates cannot be applied against data compressed
using conventional compression algorithms, because the algorithms do
not preserve sort order.  An algorithm that does compression as well
as preserve sort order therefore is needed and would enable the
application of range predicates on compressed data.

      Data may be thought of as comprised of "subunits" that we may
refer to as symbols.  For example, under the EBCDIC storage scheme,
each subunit is a byte -- there are 256 symbols in the EBCDIC
alphabet.  In practice, analysis of large data collections show that
a few symbols occupy a large portion of the space in the data
collection.

      The Order Preserving Compression Code is as follows:
      1.   A leaf node is created for each symbol.
      2.   Order the leaf nodes according to each symbol's collating
order.
      3.   Label each node by the frequency of occurrence of the
symbol it represents.
      4.   Locate the two "adjacent nodes" with the lowest combined
frequency where "adjacent nodes" are nodes located next to each othe...