Browse Prior Art Database

A method to locally cluster data while populating tables in a modern columnar database

IP.com Disclosure Number: IPCOM000247576D
Publication Date: 2016-Sep-18
Document File: 5 page(s) / 37K

Publishing Venue

The IP.com Prior Art Database

Abstract

This disclosure is about perform local data clustering while populating data into columnar database.With the proposed method of local clustering, later queries could could take better advantage of storage index for filtering data scanned and hence speed up query processing time.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 41% of the total text.

Page 01 of 5

A method to locally cluster data while populating tables in a modern columnar database

this invention proposed a method to locally cluster data while populating data into a columnar database. and thus improve efficiency of data filtering in predicate evaluation and query processing. in particular it greatly improves efficiency of storage index.
the core idea is to add a local sort phase in a typical data load processing for a columnar database. the phase is to be added fits nicely into the existing pipelined and parallel data load processing with careful performance considerations so that we enjoy the benefit of data clustering but not breaking the existing high speed parallel data load processing.

Overview

This disclosure is about perform local data clustering while populating data into columnar database.

With the proposed method of local clustering, later queries could could take better advantage of storage index for filtering data scanned and hence speed up query processing time.

Background

Columnar Data Engine (CDE)Table

A term used to describe tables in columnar database.

Storage index

Storage index is a popular technique adopted in modern column database implementations. It is an additional meta data object which records value ranges of portions of a table in the database for purpose of data filtering when running queries against the table. There are many flavors of implementation for storage index. In particular, we will assume it is implemented as a collection of internal tables. There is one such internal table per base table in the database. One row in the internal table records all the per column max and min value of a row batch in the base table where a row batch is a set of 1000 consecutive rows. Thus the # of columns in the intenal table is 2* that of the base table. Here 1000 is named row/tuple batch size and is configurable. The internal storage index tables are stored as CDE tables and represented in the system catalogs, but are not otherwise accessible by users.

The aim of storage index is to improve query performance by pre-qualifying on meta data that is recorded in the storage index to avoid reading and further processing of portions of CDE tables for queries.

Data clustering and efficiency of storage index

A storage index becomes more efficient for data filtering when the data is clustered. Though in general, it is very hard to ensure the data is globally clustered. In an ideal case where the data is globally sorted, say the base table row has 1 integer column C with the followingdistribution

1



Page 02 of 5

Row batch # (1000 rows per batch)

Column Max ( C )

Column Min ( C )

1

1000

0

2

2000

1000

3

3000

2000

….

4000

3000

N

N*1000

(N-1)*1000

In this case, to process any query with a range predicate over column C, we can examine the storage index first to tell which batches of rows in the base table could be skipped when scanning the base table, e.g. we could tell that we could skip all batches except batch 1 f...