Improved Estimation of Cardinalities for Grouping Expressions
Original Publication Date: 2005-Dec-15
Included in the Prior Art Database: 2005-Dec-15
Disclosed is a method for an improved estimation of the number of groups when a grouping operation (group-by in SQL) is done on an expression in a database system. This is important for an optimizer that has to come up with an optimal query plan to use based on the cost of different query plans. The accuracy of selecting an optimal plan depends on the costs which in turn depend on the cardinality of the query. Therefore, it is necessary to have as accurate cardinality estimates for intermediate results as possible. Traditionally these cardinality estimates for grouping operations involving complex expressions have often involved hard coding selectivity. The method disclosed here suggests to use the maximum of the number of distinct values for any (with some exceptions) column appearing in the expression instead of hard coding the selectivity. The exceptions are the columns appearing in the conditions.
Improved Estimation of Cardinalities for Grouping Expressions Improved Estimation of Cardinalities for Grouping ExpressionsImproved Estimation of Cardinalities for Grouping Expressions Improved Estimation of Cardinalities for Grouping Expressions
This publication proposes to improve the accuracy of cardinality estimation of the Group-By operation if the grouping column is an expression.
This publication proposes to use the maximum distinct value of a column appearing in the grouping expression to calculate the final selectivity of a Group -By operation. This gives much better estimate as compared to hard coding the selectivity for grouping expressions to an arbitrary value.
Let's assume that t is a table in a relational database system, and let c 1,...,cn be the columns of table t. Let E1(c1,...,cn), ..., Em(c1,...,cn) be expression over a subset of the columns c1 to cn possibly having some constants. t could be either a base table or a view.
For each SQL query, the optimizer generates all possible execution plans and chooses the best plan (ie. a plan with a low cost). The query plan determines how the query gets executed. Each query plan consists of one or more database operations (e.g. scans, joins, etc). Each operation yields an intermediate result for the computation of a query. For choosing a query plan, an optimizer has to calculate the execution cost of a query. For this cost calculation the optimizer needs to know the cardinality of (intermediate) results of a query plan execution. The basis for these cardinality estimates are statistics collected for the tables used in a query . There are many well known methods for calculating the cardinalities of different database operations.
This invention will focus on improving the cardinality estimate for grouping operations. A group operation puts the inputs rows according to some criteria into disjoint groups. Often some aggregation function is performed on each group .
Based on the above definitions a query for which our inventions could be applied would have the following form:
GROUP BY E1(c1,...,cn),
E1(c1,...,cn), ... Er(c1,...,cn) are non-aggregation SQL expression, E[r+1](c1,...,cn), ... Em(c1,...,cn) are aggregating SQL expressions.
Ci should be the subset of columns which actually appear in Ei (c1,...,cn). Di should be a subset Ci. It contains all columns which appear in Ei(c1,...,cn) not only in the conditions of conditional SQL statements (such as case statements). Di can
be defined recursively over the structure of a SQL expression in the following way:
If Ei is an atomic expression which is not a column name then Di is the empty set .
If Ei is an atomic expression which is the column name cj then Di is the set with one element cj.
If Ei is a composite expression with operator o and subexpressions E 1, ..., Ep, and o is not a condi...