
Method and Mechanism for Storing the Consecutive Space Chars in Data Compression Dictionary for RDBMS

IP.com Disclosure Number: IPCOM000212624D
Publication Date: 2011-Nov-21

Publishing Venue

The IP.com Prior Art Database

Abstract

This disclosure applies run-length encoding (RLE) to the dictionary data used by dictionary-based RDBMS compression. Dictionary-based compression is commonly used in RDBMS to save storage space and to reduce I/O, which improves performance. Highly repeated values are selected as patterns and kept in a dictionary, which may be maintained at the row level or the table level. In such a dictionary, the most frequent patterns are often runs of consecutive space chars, because the fixed-length string data type, CHAR(), is commonly used by database applications. These runs of space chars consume a considerable amount of dictionary space. The idea of the disclosure is to replace each run of consecutive space chars with a byte pair, the SPACE CHAR (0x20) followed by the length of the run, and thereby save space.
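The byte-pair idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the disclosure's actual implementation: the function names are hypothetical, the run length is assumed to fit in a single byte (longer runs are split into several pairs), and every space, even a single one, is assumed to be written as a pair so that decoding stays unambiguous.

```python
SPACE = 0x20
MAX_RUN = 255  # assumption: the run length is stored in one byte


def encode_spaces(entry: bytes) -> bytes:
    """Replace each run of spaces with the pair (0x20, run length)."""
    out = bytearray()
    i = 0
    while i < len(entry):
        if entry[i] == SPACE:
            # Measure the run, capped at what one length byte can hold.
            run = 1
            while (i + run < len(entry)
                   and entry[i + run] == SPACE
                   and run < MAX_RUN):
                run += 1
            out += bytes((SPACE, run))
            i += run
        else:
            out.append(entry[i])
            i += 1
    return bytes(out)


def decode_spaces(encoded: bytes) -> bytes:
    """Inverse of encode_spaces: expand each (0x20, length) pair."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == SPACE:
            out += b" " * encoded[i + 1]  # the next byte is the run length
            i += 2
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)
```

Under these assumptions, a dictionary entry such as `b"ABC"` followed by 17 spaces shrinks from 20 bytes to 5, while an isolated single space grows from 1 byte to 2. A real design would have to weigh that trade-off.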

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 25% of the total text.



1.1 Data Compression in RDBMS

Data compression is a commonly used technique in many RDBMS products. One advantage of compressing data is the reduced cost of storing it on storage media. Another is increased I/O and transmission efficiency, because less data has to be sent and received between computing entities or to and from storage devices.

Current RDBMS products often use lossless compression to store text information. One commonly used subset of lossless data compression methods is binary-string/symbol substitution, which exploits the redundancy of byte-strings repeated within a text value. Compression is accomplished by replacing frequently occurring byte-strings with shorter identifiers or placeholders, referred to hereinafter as symbols. In this method, a static dictionary is created that contains the frequently occurring byte-strings and their corresponding symbols, and each occurrence of such a byte-string in the data is replaced with its symbol.
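A minimal Python sketch of static-dictionary substitution follows. It is an illustration only. The function names are hypothetical, and it assumes that the text is 7-bit ASCII so that the byte values 0x80 to 0xFF are free to serve as one-byte symbols. Real RDBMS products use their own symbol formats.

```python
def build_dictionary(patterns):
    """Map each frequent byte-string to a one-byte symbol (0x80 and up).

    Assumption: the data is 7-bit ASCII, so symbols never collide with text.
    """
    if len(patterns) > 128:
        raise ValueError("only 128 one-byte symbols are available")
    for p in patterns:
        if not p or any(b >= 0x80 for b in p):
            raise ValueError("patterns must be non-empty 7-bit ASCII")
    return {p: bytes((0x80 + i,)) for i, p in enumerate(patterns)}


def compress(value: bytes, dictionary) -> bytes:
    """Replace every dictionary pattern in value with its symbol."""
    # Substitute longer patterns first so they are not broken up
    # by shorter patterns that overlap them.
    for pattern in sorted(dictionary, key=len, reverse=True):
        value = value.replace(pattern, dictionary[pattern])
    return value


def decompress(value: bytes, dictionary) -> bytes:
    """Expand each symbol back into its byte-string."""
    reverse = {symbol[0]: pattern for pattern, symbol in dictionary.items()}
    out = bytearray()
    for b in value:
        out += reverse.get(b, bytes((b,)))
    return bytes(out)
```

For example, with the patterns `b"Customer#"` and a run of ten spaces in the dictionary, the 28-byte value `b"Customer#000000001"` padded with ten spaces compresses to 11 bytes: two symbols plus the nine digits.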

The most common granularity of data compression used by RDBMS is the page, which is the smallest data storage unit. Figure 1 shows a sample of page-level compression. Each page that belongs to a table with the compression attribute has a dictionary that contains all repeated values, and a symbol substitutes for each such value in the row data.


Figure 1. Non-compressed block vs. compressed block
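The page-level scheme described in the text can be approximated as follows. This is a hedged sketch, not a reproduction of Figure 1: the function names, the row layout (tuples of column values), and the `("sym", index)` / `("raw", value)` tagging are all assumptions made for illustration.

```python
from collections import Counter


def compress_page(rows):
    """Build a per-page dictionary of repeated values and substitute symbols.

    rows: list of tuples of column values (bytes).
    Returns (dictionary, compressed_rows), where the dictionary is a list
    and a symbol is simply an index into that list.
    """
    counts = Counter(value for row in rows for value in row)
    # Only values that occur more than once in the page enter the dictionary.
    dictionary = [value for value, n in counts.items() if n > 1]
    index = {value: i for i, value in enumerate(dictionary)}
    compressed = [
        tuple(("sym", index[v]) if v in index else ("raw", v) for v in row)
        for row in rows
    ]
    return dictionary, compressed


def decompress_page(dictionary, compressed):
    """Rebuild the original rows from the dictionary and the tagged rows."""
    return [
        tuple(dictionary[x] if kind == "sym" else x for kind, x in row)
        for row in compressed
    ]
```

In this sketch a padded CHAR value such as `b"FRANCE    "` that appears in several rows is stored once in the page dictionary, trailing spaces included.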


1.2 Consecutive Space Char in Dictionary

There are several major data types used in RDBMS, including character strings, integer, decimal, and date.

Character Strings:

Fixed-length character strings. All values in a fixed-length string column have the same length, which is determined by the length attribute of the column.

Variable-length character strings. The maximum length that the column can contain is specified as a column attribute.

Others:

Integer

Decimal

Date

In database application scenarios, both fixed-length and variable-length character strings are widely used. Figure 2 shows the table definitions of the standard TPC-H benchmark, which measures OLAP performance. Fixed-length and variable-length character strings are used in every table. The same is true of the TPC-C benchmark for OLTP.

(Note: for Figure 2, the TPC-H tables, refer to the TPC benchmark specification at http://www.tpc.org/tpch/spec/tpch2.14.2.pdf)


Figure 2. The TPC-H Schema


1.3 Massive consecutive space chars in compression dictionary

For columns of the fixed-length character string data type, if an input value is shorter than the fixed length specified for the column, space chars are appended to fill out the rest of the value. For page-level compression circumstan...