Parallel multidimensional query selectivity estimation using an integrable function
Publication Date: 2014-Apr-29
The IP.com Prior Art Database
Method of multidimensional selectivity estimation using integrable function interpolation.
Most database engines prepare an execution plan before executing a query, optimized from a time and resources perspective. Such optimization is based on several pieces of information; the distribution of data in particular table columns is one of them.
The problem of one-dimensional selectivity estimation is well known, and several approaches are implemented in database engines.
The problem of multidimensional selectivity is much more complex, and an assumption of column independence is usually made. This may result in very poor estimates and, as a consequence, suboptimal query execution plans.
The main goal of this work is to describe approaches that can be taken to keep this information up to date and as accurate as possible, using an integrable function for selectivity estimation.
The following methods are widely used on the market:
- assumption that data is spread evenly across all values
- assumption that columns are independent, so the one-dimensional distribution of data (for each column separately) determines the final result
- building a multidimensional histogram of the data
- reducing the multidimensional case to one dimension, e.g. using a Hilbert curve.
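The independence assumption listed above can be sketched in a few lines; the function name and numbers below are illustrative, not part of the original disclosure. Per-column selectivities are simply multiplied, which is exact only when the columns are actually uncorrelated.

```python
def independent_selectivity(per_column_selectivities):
    """Combine single-column selectivities under the independence assumption."""
    result = 1.0
    for s in per_column_selectivities:
        result *= s
    return result

# Suppose 10% of rows match the predicate on column A and 20% match on B.
combined = independent_selectivity([0.10, 0.20])
# Under independence the combined selectivity is 0.02 (2% of rows);
# on correlated columns (e.g. city and zip code) this can be far off.
```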
The first two methods are highly imprecise. The disadvantage of the third and fourth is that they require complicated maintenance to remain precise, and the amount of data required to obtain the selectivity estimate is of medium size. We strongly believe that the following idea addresses some of these disadvantages while preserving what is good about them.
The complexity of the problem and the exponentially growing amount of computation made the problem too hard to solve for previous generations of database systems.
Currently we work with parallel data warehouses with huge computational power. We strongly believe there is a need for an easily parallelizable method of multidimensional selectivity estimation that does not oversimplify the data model.
The proposed method consists of the following steps:
1. [offline] split the domain of the table's columns into N disjoint parts (in a way that utilizes all processing units), where N is the number of processing units in the system
2. [offline] each processing unit calculates an integrable multidimensional function that reflects the data distribution within its assigned domain sub-cube (two different methods of calculating such a function are described below)
3. [during query execution] based on the query conditions, a set of intervals is calculated, and the integral of the functions from the steps above over those intervals yields the estimated number of rows in the condition cube.
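The three steps above can be sketched as follows. This is a minimal illustration, assuming a two-dimensional table, a 2x2 grid of sub-cubes (N = 4), and a constant density per sub-cube as the (trivially) integrable function; the disclosure itself leaves the choice of function open, and all names here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
rows = rng.uniform(0.0, 1.0, size=(10_000, 2))   # toy table with two columns

# Step 1 [offline]: split the domain into N disjoint sub-cubes,
# one per processing unit (here a 2x2 grid over the unit square).
splits = [0.0, 0.5, 1.0]

# Step 2 [offline]: each unit fits an integrable density on its sub-cube.
# With a constant density the fit is just rows_in_cube / cube_volume.
# The loop body is independent per sub-cube, hence easy to parallelize.
densities = {}
for i in range(2):
    for j in range(2):
        lo = np.array([splits[i], splits[j]])
        hi = np.array([splits[i + 1], splits[j + 1]])
        inside = np.all((rows >= lo) & (rows < hi), axis=1)
        densities[(i, j)] = inside.sum() / np.prod(hi - lo)

# Step 3 [query time]: integrate each sub-cube's function over its
# intersection with the query box and sum the partial results.
def estimate_rows(q_lo, q_hi):
    total = 0.0
    for (i, j), dens in densities.items():
        lo = np.array([splits[i], splits[j]])
        hi = np.array([splits[i + 1], splits[j + 1]])
        overlap = np.minimum(hi, q_hi) - np.maximum(lo, q_lo)
        if np.all(overlap > 0):
            total += dens * np.prod(overlap)  # integral of a constant density
    return total

est = estimate_rows(np.array([0.2, 0.2]), np.array([0.7, 0.7]))
# For uniform data the query box covers 25% of the domain,
# so est should be close to 2500 rows.
```

A richer integrable function (e.g. a fitted polynomial per sub-cube) would replace the constant density; the per-unit fitting and the per-cube integration in step 3 remain independently parallelizable.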
Background and definitions
By the domain of data within a given set of columns we understand an n-dimensional cube, each edge of which is the domain of one of the columns (an interval [a,b] such that all values of the column belong to this interval and the table contains at least one row with value a and at least one row with value b).
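The definition above can be illustrated with a short sketch that derives the per-column intervals [a,b] from a row-oriented table; the helper name is hypothetical.

```python
def column_domains(table):
    """Return the [a, b] interval for each column of a row-oriented table.

    The intervals together form the edges of the n-dimensional domain cube:
    every value falls inside its interval, and both endpoints occur in the data.
    """
    columns = list(zip(*table))
    return [(min(col), max(col)) for col in columns]

table = [(3, 10.5), (7, 2.0), (5, 8.25)]
domains = column_domains(table)
# domains == [(3, 7), (2.0, 10.5)]: the 2-D domain cube [3,7] x [2.0,10.5]
```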