Browse Prior Art Database

Method of tracking data ballooning in MPP environments

IP.com Disclosure Number: IPCOM000240724D
Publication Date: 2015-Feb-23
Document File: 7 page(s) / 79K

Publishing Venue

The IP.com Prior Art Database

Abstract

Article describes the way of detection of data ballooning during JOIN within two particular tables in the data warehouse/database world, especially for cases where unique constraints are not maintained.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 34% of the total text.

Page 01 of 7

Method of tracking data ballooning in MPP environments

Data ballooning is the process of increasing number of rows being result of join, due to the joining on none unique keys. Information about size of the join product is very important, is being used by:

1. Optimizer, to produce the best query plan (e.g. move joins ballooning data to the end of the query)

2. DB Administrators, to find issues in data (e.g. loaded twice) and find potential sources of performance drop

In OLAP MPP environments, not having UNIQUE indexes, the only way to find and track none unique keys is to look at the table statistics, especially at the column dispersion. Unfortunately calculating column dispersion is very resource consuming (system needs to scan whole table and group by desired column). What is more dispersion of the columns get outdated one table data is modified. Every INSERT, DELETE or UPDATE makes dispersion out of data and stats must be generated from the scratch in order to have up to date dispersion values. This is very ineffective, thus dispersion changes leading to data ballooning is not tracked.

In the below article the problem is resolved/mitigated by the idea proposed.

The proposed solution contains following parts:
- functionality of detection of data ballooning in Massive Parallel Processing (MPP)

   data warehouse environment
- said functionality is based on statistics gathered and maintained on the level of

   MPP processing units
- said statistics are based on query feedback

Detection of data ballooning during JOIN within two particular tables can be further used by optimizer to produce best query execution plan.

The article addresses problem of tracking data ballooning. The idea is to keep on each MPP node pairs TABLE, COLUMN NAME and vectors storing ballooning factors assigned to each par. Ballooning factor would be calculated from query feedback as a proportion of expected numbers of rows being result of join, to real number of rows in

join result table. Each time join is performed, the table storing ballooning factors for specific join keys used in the join would be updated. Factor at the level of 1 means no ballooning, factors bigger than 2 means: join was made on none unique keys (the bigger number the more rows, as a result of data ball, we got from the join). Once can imagine background process or action on demand which would go through tables of factors and prepare report about health of the data in terms on unique keys. Such a report could be a good base for finding performance issues, periodical data cleaning/management and improving queries.

Apart from data ballooning keeping and analyzing factors would also help in finding

1


Page 02 of 7

processing data skew (situation when more data is processed on one of the MPP nodes, comparing to others).

The idea is to store, on every MPP node, following structure

[TABLE][JOIN_KEY]{BALLOONING FACTORS LIST}

where

[TABLE] - table name
[JOIN_KEY] - each column name taking part into join key
{...