
Automatically Create Compression Scheme for Data with Different Distribution Characteristics

IP.com Disclosure Number: IPCOM000238890D
Publication Date: 2014-Sep-24
Document File: 9 page(s) / 236K

Publishing Venue

The IP.com Prior Art Database

Abstract

Data compression is widely used in column-store databases and other areas to reduce storage requirements and increase I/O and memory usage efficiency. The key idea is to encode the data using symbols that have fewer bits than the original representation. To achieve the maximum compression ratio, optimized encoding algorithms, e.g. Huffman coding, are often used to build the compression scheme, in which the symbol used to replace an original datum is determined by how frequently that datum occurs: a more frequently occurring datum is replaced by a shorter symbol, and vice versa. Currently, for one set of data there is only one optimal compression scheme. The scheme is built by analyzing the initial set of data available; once built, all incoming data are compressed using the same scheme, and the scheme can be refined only when refinement is triggered. The problem is that, for one application, the data do not always show the same distribution. Data generated in different time ranges, or from different terminals, may have very different distributions. In this article, we introduce a system and method to automatically build a new compression scheme when recently arrived data have a significantly different distribution from the previous data. This can be detected by monitoring the compression ratio achieved with the currently available compression scheme(s): if the ratio stays below some threshold for a predefined period, we know the previous scheme(s) no longer work well because the data distribution has changed, and the process of building a new compression scheme can be enabled. This may be an iterative process as long as new data keep arriving.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 30% of the total text.

Page 01 of 9



1.1 Field of this invention

This invention concerns adaptively creating compression schemes to improve the effectiveness of data compression in column-store database systems.


1.2 Background

Data compression is widely used in column-store databases and other areas to reduce storage requirements and increase I/O and memory usage efficiency. The key idea is to encode the data using symbols that have fewer bits than the original representation. Data are loaded and persisted in units of pages. To achieve the maximum compression ratio, optimized encoding algorithms, e.g. Huffman coding, are often used to build the compression scheme, in which the symbol used to replace an original datum is determined by how frequently that datum occurs: a more frequently occurring datum is replaced by a shorter symbol, and vice versa.

In the simplified example below, a database management system has built an optimal compression scheme for a given data distribution.

Original Data (Integer) | Frequency | Original Data Length | Symbol | Symbol Length
0                       | 1580900   | 32-bits              | 0      | 1-bit
10                      | 1245830   | 32-bits              | 1      | 1-bit
100                     | 81880     | 32-bits              | 00     | 2-bits
1000                    | 72230     | 32-bits              | 11     | 2-bits
10000                   | 41220     | 32-bits              | 10     | 2-bits

During data compression, the database looks up the symbol table (which contains only the original-data/symbol pairs) and replaces each datum with its symbol. Decompression works the other way around.
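As a concrete illustration of the frequency-driven construction described above, the following is a minimal Huffman sketch (an assumed implementation, not taken from the disclosure). It builds a prefix-free code from a value-to-frequency map, so the exact bit patterns may differ from the illustrative symbols in the table above, but the lengths follow the same rule: more frequent values receive shorter codes.

```python
import heapq
from itertools import count

def build_huffman(freqs):
    """Build a prefix-free Huffman code from a {value: frequency} map."""
    tie = count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {v: ""}) for v, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)   # two least-frequent subtrees
        f2, _, b = heapq.heappop(heap)
        merged = {v: "0" + c for v, c in a.items()}
        merged.update({v: "1" + c for v, c in b.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def compress(data, table):
    """Replace each datum with its symbol, as a bit string."""
    return "".join(table[v] for v in data)

def decompress(bits, table):
    """Walk the bit string; prefix-free codes make the first match unambiguous."""
    rev = {code: v for v, code in table.items()}
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in rev:
            out.append(rev[cur])
            cur = ""
    return out

# Frequencies from the example table above
freqs = {0: 1580900, 10: 1245830, 100: 81880, 1000: 72230, 10000: 41220}
table = build_huffman(freqs)
```

With these frequencies, the most frequent value (0) gets a 1-bit code while the rarest values get the longest codes, matching the "more frequent, fewer bits" rule in the text.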






1.3 The problem

Currently, for one set of data there is only one arguably optimal compression scheme. The scheme is built by analyzing the initial set of data available. Once the scheme is built, all incoming data are compressed using the same scheme; the scheme can be refined only when refinement is triggered.

The problem is that, for one application, the data do not always show the same distribution. Data generated in different time ranges, or from different terminals, may have very different distributions. With only one compression scheme, built from the current data distribution, the compression ratio degrades when data with a very different distribution arrive, even though the scheme can refine itself to some extent. Basically, no single compression scheme works well for two totally different data distributions; even where one works after a fashion, the compression ratio will be poor.

The simplified example below shows how the data distribution changes between time 0 and time 1. The compression scheme that is optimal at time 0 is no longer optimal at time 1.

Original Data (Integer)            | Frequency at time 0 | Data arriving after time 0 | Frequency at time 1
0                                  | 1580900             | 33100                      | 1614000
10                                 | 1245830             | 21050                      | 1266880
100                                | 81880               | 3001110                    | 3082990
1000                               | 72230               | 5042840                    | 5115070
10000                              | 41220               | 10020                      | 51240
Compression Ratio (using scheme A) | 30 / 1              |                            | 18.4 / 1
Compression Ratio (using scheme B) | 16.4 / 1            |                            | 25.32 / 1

Compression scheme (Optimal at time 0)...