Dynamic Statistics in RDBMS based on Query Feedback.
Original Publication Date: 2009-Jan-16
Included in the Prior Art Database: 2009-Jan-16
Statistics that accurately describe the distribution of data values in the columns of relational tables are essential for effective query optimization in a database management system. Manually maintaining such statistics in the face of changing data is difficult and can lead to suboptimal query performance and high administration costs. In this paper, we describe a method and prototype implementation for automatically maintaining high quality single-column statistics, as used by the optimizer in IBM Informix Dynamic Server (IDS). Our method both refines and extends the ISOMER algorithm of Srivastava et al. for maintaining a multidimensional histogram based on query feedback (QF). Like ISOMER, our new method is based on the maximum entropy (ME) principle, and therefore incorporates information about the data distribution in a principled and consistent manner. However, because IDS only needs to maintain one-dimensional histograms, we can simplify the ISOMER algorithm in several ways, significantly speeding up performance. First, we replace the expensive STHoles data structure used by ISOMER with a simple binning scheme, using a sweep-line algorithm to determine bin boundaries. Next, we use an efficient method for incorporating new QF into the histogram; the idea is to aggregate, prior to the ME computation, those bins that do not overlap with the new feedback records. Finally, we introduce a fast pruning method to ensure that the number of bins in the frequency distribution stays below a specified upper bound. Besides refining ISOMER to deal efficiently with one-dimensional histograms, we extend previous work by combining the reactive QF approach with a proactive sampling approach. Sampling is triggered whenever (as determined from QF records) actual and estimated selectivities diverge to an unacceptably large degree. Our combined proactive/reactive approach greatly improves the robustness of the estimation mechanism, ensuring very high quality selectivity estimates for queries falling inside the range of available feedback while guaranteeing reasonably good estimates for queries outside of the range. By automatically updating statistics, query execution is improved due to better selectivity estimates, and the total cost of ownership (TCO) is reduced since the database administrator need not update statistics manually for monitored columns.
Dynamic Statistics in RDBMS based on Query Feedback .
IBM Informix Dynamic Server (IDS) is a light-weight relational database management system that focuses on online transaction processing. It uses a uses a cost-based optimizer for choosing an efficient physical execution plan. Currently, only single-column statistics are used for selectivity estimation. These statistics are gathered proactively by sampling. So, we wish to enhance selectivity estimates by the use of QF. Since only single-column statistics are currently used we expect a big impact by improving their accuracy.
Distributions in Informix Dynamic Server
Updating Statistics in Informix Dynamic Server
Apart from data distributions IDS gathers other statistical information on tables, columns and indexes which are useful to the IDS optimizer when choosing an execution plan. The "UPDATE STATISTICS" command is used to update statistics. This command offers three basic modes of operation: UPDATE STATISTICS [LOW-MEDIUM-HIGH] We add a new mode of UPDATE STATISTICS for enabling the feedback based updating of statistics described in this work:
FBS (feedback-based statistics)
1. monitor queries on specified column gathering estimated and actual cardinalities
2. create/refine maximum-entropy distribution at given condition (e.g. update-deleteinsert counter or the number of feedback records reach a certain threshold)
3. use maximum-entropy distribution for selectivity estimation in optimizer
Feedback-Based Statistics Maintenance in IDS
Adding Feedback to the ME Distribution Efficiently
Assume there is an existing ME distribution with k bins that we wish to refine with n QF records. The straightforward way to achieve this is to start the process "from scratch" with
n + k intervals (see figure 7) . However, since the cost of the process grows exponentially
with an increasing number of intervals (see 5.1) a more efficient way for refinement is desirable. An important observation is that in real-life scenarios feedback covers only some
parts of the column domain. Thus, some parts of the existing maximum entropy distribution
are "independent" of the new feedback (they are not really independent but we can
apply the principle of maximum entropy in a simpler fashion, see below). We exploit this to reduce the total number of intervals, saving costs.
Differences between creating a ME distribution from scratch and refining an existing one
Legend for efficient refinement illustrations
Aggregating Bins: In figure above we see that some parts of the existing ME distribution do not overlap with any new QF interval. We say these bins are "not directly affected" by
the new QF. To reduce the total number of intervals we can aggregate not directly aff...