Browse Prior Art Database

Filter Factor Estimation for Non-numeric Datatypes

IP.com Disclosure Number: IPCOM000123411D
Original Publication Date: 1998-Nov-01
Included in the Prior Art Database: 2005-Apr-04
Document File: 6 page(s) / 184K

Publishing Venue

IBM

Related People

Yan, WP: AUTHOR [+2]

Abstract

Problem In a relational database management system, a query optimizer determines the query execution plan that will be used to retrieve data requested. This plan consists of a set of primitive operations, e.g., join; a sequence in which the operations will be performed, e.g., join order; specific methods of performing the operations, e.g., sort-merge join; and access methods to obtain records from the base relations, e.g., index scan. In a cost-based query optimizer both estimates of I/O and CPU resource consumption are used to select the most efficient query execution plan. Both of these estimates depend heavily on the number of rows that need to be processed at various stages in the query execution plan.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 32% of the total text.

Filter Factor Estimation for Non-numeric Datatypes

   Problem

   In a relational database management system, a query
optimizer determines the query execution plan that will be used to
retrieve data requested.  This plan consists of a set of primitive
operations, e.g., join; a sequence in which the operations will be
performed, e.g., join order; specific methods of performing the
operations, e.g., sort-merge join; and access methods to obtain
records from the base relations, e.g., index scan.  In a cost-based
query optimizer both estimates of I/O and CPU resource consumption
are used to select the most efficient query execution plan.  Both of
these estimates depend heavily on the number of rows that need to be
processed at various stages in the query execution plan.

   The number of rows that need to be processed is determined by
filter factor, which represents the selectivity of predicates.  For a
range predicate, the filter factor is calculated as the portion of
the rows between the statistics LOW2KEY and HIGH2KEY.  If non-uniform
statistics is not available in the database,  the optimizer assumes a
uniform and continuous distribution of the rows over that range.  So,
for example, for a range predicate on column C, "C < b", where b is a
value in the domain of C, by assuming uniform distribution, the
filter factor is computed as:
               float(HIGH2KEY) - float(b)
      ff =  --------------------------------
            float(HIGH2KEY) - float(LOW2KEY)
  where, HIGH2KEY and LOW2KEY are the second highest value and
second lowest value of C.  Some other database systems use HIGHKEY
and LOWKEY.  The discussion of why HIGH2KEY and LOW2KEY are better
than HIGHKEY and LOWKEY is outside the scope of this invention and
will not be discussed here.  The number of distinct values for this
column can then be found by multiplying the filter factor and the
column cardinality.  Better RDBMS, like DB2, also exploits
non-uniform information to interpolate the distribution of the column
and obtain a better estimation of filter factor.  All these
techniques require numeric (float) values for the concerned columns.
However, many RDBMS system provides non-numeric data types, such as
CHAR (character string), date, time and time stamp.  Therefore we are
facing  the challenge of correctly converting non-numeric values into
continuous numeric values.  This is the problem that this invention
solves.  The data types covered in this invention include:
CHAR(VARCHAR), DATE, TIME, and TIMESTAMP.  VARCHAR is a CHAR with
variable length.  In the rest of this invention, we will talk about
CHAR in the rest of this invention, while we will actually be
referring to either one of CHAR and VARCHAR.  The objective is to
convert this value into a floating point number as follows:

   CHAR TYPE

   Typically, a value of type CHAR is very long (e.g., 33 or
longer bytes).  The prior art to convert it into a floating poi...