
An Enhanced Sampling Approach for DBMS Table Compression

IP.com Disclosure Number: IPCOM000223307D
Publication Date: 2012-Nov-15
Document File: 5 page(s) / 115K

Publishing Venue

The IP.com Prior Art Database

Abstract

This disclosure describes a uniform sampling approach for skewed tables. Unlike most sampling approaches, it targets a particular scenario that is very common for DBMS tables: sampling a skewed table. The core idea of the approach is to scan the table with varying step lengths (powers of 2), which satisfies all the requirements of the scenario: effective, uniform, no omission, and no duplication. With this approach there is a significant improvement in I/O and CPU cost without impacting the statistical quality of the sample.



An Enhanced Sampling Approach for DBMS Table Compression

Background

Table compression is an important feature for a DBMS. In general it is based on LZ77/LZ78 or a similar algorithm, and it follows 3 steps (illustrated below):

1. Sample data from the table

2. Build dictionary from sampled data

3. Compress data by the dictionary
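
As a rough illustration only (not the DBMS's actual implementation), the C fragment below mirrors the three steps with zlib, assuming zlib is available: a few rows are "sampled", the sampled bytes are used as a preset deflate dictionary, and a further row is compressed with that dictionary. The row values and buffer sizes are made up for the example, and error checks are omitted for brevity.

    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    int main(void)
    {
        const char *rows[] = { "customer=alice;city=london",
                               "customer=bob;city=london",
                               "customer=carol;city=paris" };

        /* 1. Sample data from the table (here: the first two rows). */
        char sample[256] = "";
        strcat(sample, rows[0]);
        strcat(sample, rows[1]);

        /* 2. Build the dictionary: prime deflate with the sampled bytes. */
        z_stream zs;
        memset(&zs, 0, sizeof zs);
        deflateInit(&zs, Z_BEST_COMPRESSION);
        deflateSetDictionary(&zs, (const Bytef *)sample, (uInt)strlen(sample));

        /* 3. Compress further data using the dictionary. */
        unsigned char out[256];
        zs.next_in   = (Bytef *)rows[2];
        zs.avail_in  = (uInt)strlen(rows[2]);
        zs.next_out  = out;
        zs.avail_out = sizeof out;
        deflate(&zs, Z_FINISH);
        printf("%lu bytes in, %lu bytes out\n",
               (unsigned long)strlen(rows[2]), (unsigned long)zs.total_out);
        deflateEnd(&zs);
        return 0;
    }

The point is only to show how a dictionary built from sampled data is reused for the rest of the table; the quality of that sample is what the rest of this disclosure is about.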

The impacts of table compression are:

Significantly reduced storage requirements

Significantly reduced I/O cost

Increased CPU cost

Because of these benefits, table compression is widely used for OLTP and other I/O-heavy systems.

Problem & Goal

The particular scenario we need to handle is how to sample data from the table. There are specific requirements:

Sample data from a table to build a dictionary for compression: sampling is done by page (a more complicated case of a skewed table). The sampled data must be statistically effective in covering the table, since this affects the compression ratio.

The table might be full, partly full or empty due to previous Insert/Update/Delete operations: we don't know the layout.

The table might be very big (TB size or bigger): we cannot simply scan the entire table.

The sampling approach should be uniform at any point it is stopped: to assure the quality of the dictionary.

A result must be obtained even for a very skewed table: a very big table may still contain a lot of data.


No duplication

The most important feature is that the table is skewed, i.e. it might be full, partly full or empty, and we don't know the layout at the time. Most known sampling approaches CANNOT be applied to a skewed table without impacting the statistical quality of the sample.

There are two existing approaches that are generally used for this scenario:

Reorganize the skewed table into a full table
Read all the pages first, remove the invalid pages, and sample the new table

Shortcoming: extra storage is required; bad performance, especially for very big tables

Read-forward sampling (a short sketch follows this list)
Read all the pages in order, and ignore the invalid pages
When the sampling buffer is full, remove every 2nd page in the buffer

Shortcoming: bad performance, especially for very big tables
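
As a simplified reading of the read-forward approach (an illustration, not code from the disclosure), the C fragment below scans page numbers in order, skips invalid pages, and drops every 2nd entry of the sampling buffer whenever it fills up. The page_is_valid predicate and the buffer size are made up for the example.

    #include <stdio.h>
    #include <stdbool.h>

    #define BUF_CAP 8                 /* illustrative sampling-buffer capacity */

    /* Hypothetical validity check; a real DBMS would inspect the page header. */
    static bool page_is_valid(int page) { return page % 3 != 0; }

    int main(void)
    {
        int buf[BUF_CAP];
        int used = 0;
        int n = 40;                   /* table with pages 0..39 */

        for (int page = 0; page < n; page++) {
            if (!page_is_valid(page))
                continue;             /* ignore the invalid pages */
            if (used == BUF_CAP) {    /* buffer full: remove every 2nd page */
                int kept = 0;
                for (int i = 0; i < used; i += 2)
                    buf[kept++] = buf[i];
                used = kept;
            }
            buf[used++] = page;       /* keep this page as a sample */
        }

        for (int i = 0; i < used; i++)
            printf("%d ", buf[i]);
        printf("\n");
        return 0;
    }

Even in this toy form the weakness is visible: every page of the table is still read, which is why the disclosure calls the performance bad for very big tables.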

Key ideas

The approach is described step by step (a short C sketch follows these steps):

1. Given a table with n pages: 0, 1, 2, ..., n-1.

2. Let l = pow(2, (int)log2(n)). In other words, let l be the biggest power of 2 that does not exceed n.

3. Scan the array with a step of l: read page #l.

4. Scan the array with a step of l/2, but ignore any page whose number is exactly divisible by l: read page #(l * (1/2))...

5. Scan...