
Data redistribution adjustment during query processing based on disks scan speed ratio

IP.com Disclosure Number: IPCOM000246233D
Publication Date: 2016-May-18
Document File: 4 page(s) / 95K

Publishing Venue

The IP.com Prior Art Database

Abstract

One of the important factors in data warehouse performance is hard disk scan speed; the disks are usually the bottleneck of the whole system. Because modern data warehouses contain hundreds of disks per system, a quite common problem encountered by end users is that one or a few of the disks start to read data more slowly than the rest of the storage. From the user's perspective this means that most queries run slower: one slow disk means one node processing more slowly than the others, and in parallel systems the query execution time depends directly on the slowest processing node. Such problems are raised by customers very frequently in support and cause frustration (and additional S&S costs). Our method allows the performance gap to be compensated smoothly, to a level unnoticeable to the end user.




This article relates to the domain of data warehouses based on a massively parallel processing (MPP) architecture. It can also be implemented in a Hadoop environment; however, the significance of the problem is much higher in the data warehouse area.

1. Computerized method for partitioning data for a query operation on one table t1, based on attribute col1 of t1:
- estimating the value distribution (e.g. a density function) of the attribute based on the value distribution of col1;
- estimating the time of reading the required data from disk for every processing unit;
- determining boundaries of the partitioning ranges for the attribute based on the estimated value distribution, each partitioning range corresponding to a number of rows which is inversely proportional to the data reading time (a sketch of this step follows the claim);
- partitioning table t1 across processing nodes based on the determined partitioning ranges.
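The following Python sketch illustrates how the boundary-determination step of claim 1 could be realized, under the assumption that the value distribution is available as a histogram. The names partition_boundaries, bin_edges, bin_counts and read_times are illustrative assumptions, not terms from the disclosure.

from bisect import bisect_left
from typing import List

def partition_boundaries(bin_edges: List[float],
                         bin_counts: List[int],
                         read_times: List[float]) -> List[float]:
    """Determine range boundaries for col1 so that the number of rows
    assigned to each processing unit is inversely proportional to its
    estimated disk read time."""
    total_rows = sum(bin_counts)
    # Weight of node i ~ 1/t_i: slower disks receive fewer rows.
    inv = [1.0 / t for t in read_times]
    shares = [w / sum(inv) for w in inv]

    # Cumulative row counts at every bin edge (a step-wise CDF).
    cum = [0]
    for c in bin_counts:
        cum.append(cum[-1] + c)

    boundaries = [bin_edges[0]]
    target = 0.0
    for share in shares[:-1]:
        target += share * total_rows
        # Locate the bin containing the target cumulative count
        # and interpolate linearly inside it.
        i = bisect_left(cum, target) - 1
        i = max(0, min(i, len(bin_counts) - 1))
        within = (target - cum[i]) / max(bin_counts[i], 1)
        width = bin_edges[i + 1] - bin_edges[i]
        boundaries.append(bin_edges[i] + within * width)
    boundaries.append(bin_edges[-1])
    return boundaries

On a healthy system all read times are equal and the computation reduces to an even row split; a degraded disk raises its node's read time, so that node's value range shrinks until all nodes finish scanning at roughly the same moment.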

2. Computerized method for partitioning data for an operation on two tables t1 and t2, the operation being based on a common attribute having values in col1 of t1 and in col2 of t2:
- estimating the value distribution (e.g. a density function) of the common attribute based on the value distributions of col1 and col2;
- estimating the time of reading the required data of both tables from disk for every processing unit;
- determining boundaries of the partitioning ranges for the common attribute based on the estimated value distribution, each partitioning range corresponding to a number of rows which is inversely proportional to the data reading time;
- partitioning tables t1 and t2 (and/or the result table) across processing nodes based on the determined partitioning ranges (a usage sketch follows the claim).
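For the two-table case of claim 2 the only changes are that the per-node read-time estimate covers both tables and that the boundaries are computed on the combined distribution of the common attribute. A hypothetical usage of the partition_boundaries sketch above, with invented figures:

# Hypothetical inputs: combine per-table histograms of the join attribute
# (same bin edges assumed) and per-node read times covering both tables.
edges = [0, 10, 20, 30]
rows_t1 = [500, 300, 200]          # histogram of t1.col1
rows_t2 = [100, 400, 500]          # histogram of t2.col2
combined = [a + b for a, b in zip(rows_t1, rows_t2)]

read_times = [1.0, 1.0, 1.8]       # node 3 has a degraded disk

print(partition_boundaries(edges, combined, read_times))
# Node 3 receives a proportionally narrower value range (~1/1.8 of the rows
# assigned to each healthy node), so all three nodes finish at about the
# same time.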

Background

To begin, let us define the density function, which will be used to describe the idea. Please keep in mind that this is only one of the ways in which the data spread within a column can be represented.

A density function f is an integrable function (or set of functions) defined on the whole set of values between the minimum and the maximum value of a particular column in the table, for which

\int_{a}^{b} f(x)\,dx

is an estimate of the number of rows with values between a and b.

Comment: the density function is only one way of representing the number of rows in a particular interval. A histogram or any other kind of data-spread representation can easily be transformed into a density function and back into the original form.
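To make that correspondence concrete, the following minimal Python sketch builds a piecewise-constant density function from a histogram and integrates it numerically; the function names are illustrative only, not from the disclosure.

def density_from_histogram(bin_edges, bin_counts):
    """Return a piecewise-constant density f such that the integral of f
    from a to b estimates the number of rows with values in [a, b]."""
    def f(x):
        for lo, hi, count in zip(bin_edges, bin_edges[1:], bin_counts):
            if lo <= x < hi:
                return count / (hi - lo)   # rows per unit of value
        return 0.0
    return f

def estimated_rows(f, a, b, steps=10_000):
    """Numerically integrate f over [a, b] (midpoint rule)."""
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

f = density_from_histogram([0, 10, 20, 30], [100, 300, 600])
print(round(estimated_rows(f, 5, 25)))   # ~650 rows: 50 + 300 + 300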

We assume that the system collects and maintains data-spread information about the tables and columns we are interested in.


