Computing Quantiles In Streaming Mode Using A Union-All Synopsis Table For An Asymmetric Parallel Processing System
Publication Date: 2014-Jan-09
The IP.com Prior Art Database
Disclosed is a method of assembling data in each computing node in such a way that a streaming OUTPUT operation can effectively be performed in the central processing node by using a synopsis table followed by an N-way Merge-Sort operation. The disclosed method introduces the concept of an optimal and a single synopsis table.
Page 01 of 4
Computing Quantiles In Streaming Mode Using A Union - Asymmetric Parallel Processing System
Most proprietary relational databases are based on a mathematical cost-based query optimizer for executing Structured Query Language (SQL) queries in an optimal manner. The cost-based query optimizer prepares optimal plans based on the inputs fed to the cost-model. These inputs are various statistical and distribution information
which are computed for relations (tables) and attributes (columns) in the relational database. More accurate and recent statistics help the cost-based query optimizer compute an optimal query execution plan.
One of the statistics that is collected is quantile, or commonly known as height-balanced histograms. Quantiles are extremely helpful for estimating the selectivity of range predicates and in some cases equality predicates.
The MRL99 algorithm is suitable for computing quantiles for the cost-based query optimizer in an Asymmetric Massively Parallel Processing (AMPP) Database Appliance for an infinitely large set of input data sequence. MRL99 cites three operations for computing quantiles:
• INPUT: Allocate, Load, or Reuse an existing buffer with column values
• COLLAPSE: Collapse the filled buffers into a single buffer. This operation can occur multiple times depending on the input stream.
• OUTPUT: Final operation for deriving quantiles which involves at the most two buffers
The output of a last COLLAPSE operation is the input for the OUTPUT operation. In an
AMPP system, the OUTPUT operation needs to gather data from all processing nodes into a central processing node and compute quantiles. However, AMPP systems tend
to have a large number of columns and tera-bytes of data in each of the data partitions.
After initial processing on each computing node, all of this data has to be organized and sent to the central processing node in such a way that the quantile values for each column can be computed in a streaming mode. If this is not done, then data from all of the computing nodes has to be hosted on the central processing node before any processing can be done.
A method is herein disclosed to assemble data in each computing node in such a way that a streaming OUTPUT operation can effectively be performed in the central processing node by using a synopsis table followed by an N-way Merge-Sort operation.
The disclosed method provides an optimal and a single synopsis table, which hosts the
data from all of the different relational data-types. This synopsis table requires only one Merge-Sort operation for the final OUTPUT operation to compute quantiles for all columns in the table, irrespective of its data type. At the same time, this synopsis table can also provide an intermediate storage for some of the legacy stats such as MIN, MAX, COUNT, NULLS, MAXLEN, SUMLEN, etc.
-All Synopsis Table For An
All Synopsis Table For An
Page 02 of 4
With a combination of the MRL99 algorithm and the synopsis table, a rel...