
Efficient Method Of Discovering Correlations In An Asymmetric Massively Parallel Processing Environment

IP.com Disclosure Number: IPCOM000237956D
Publication Date: 2014-Jul-23
Document File: 4 page(s) / 75K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a method for efficiently discovering column groups in an Asymmetric Massively Parallel Processing (AMPP) system. The method assembles a reservoir sample from across the AMPP system, uses this reservoir for column group discovery, splits the column group discovery analysis so that the work can be distributed across all nodes of the system, and then performs the actual discovery and merges the results on the central node.

This is the abbreviated version, containing approximately 42% of the total text.


Column relations are a key statistic for a database. If the optimizer knows that a relation exists among a group of columns (a pair, a triad, etc.), it uses the collected column group statistics to compute the selectivity of a predicate involving those columns. Without this information, the optimizer assumes that the columns are independent and computes predicate selectivity accordingly. When the columns are in fact correlated, the independence assumption underestimates the selectivity of conjunctive predicates, which leads to suboptimal query execution plans. Therefore, identifying column group correlations in a database is essential.
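The underestimation can be illustrated with a small worked example. The table contents, column meanings, and function names below are hypothetical and chosen only for illustration; the sketch compares the estimate obtained under the independence assumption with the true selectivity of a conjunctive predicate on two perfectly correlated columns.

```python
# Hypothetical table of 1,000 rows with two perfectly correlated columns:
# the first column (city) fully determines the second column (state).
rows = [("Boston", "MA")] * 100 + [("Austin", "TX")] * 900


def selectivity(predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for row in rows if predicate(row)) / len(rows)


# Single-column selectivities, as an optimizer without column group
# statistics would see them.
sel_city = selectivity(lambda r: r[0] == "Boston")
sel_state = selectivity(lambda r: r[1] == "MA")

# Independence assumption: multiply the single-column selectivities.
independent_estimate = sel_city * sel_state

# True selectivity of the conjunctive predicate, which column group
# statistics would capture.
actual = selectivity(lambda r: r[0] == "Boston" and r[1] == "MA")

estimated_rows = round(independent_estimate * len(rows))
actual_rows = round(actual * len(rows))
print(estimated_rows, actual_rows)
```

In this illustrative data set, the independence assumption predicts about 10 matching rows, while 100 rows actually match, so the estimate is low by a factor of ten.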

Large warehouse databases have a large number of tables, and each table can have a high number of columns. As the number of columns in a table increases, the number of column combinations grows exponentially. Analyzing these combinations to discover column groups can be a Central Processing Unit (CPU)-intensive task, affecting performance and resources for other concurrent queries in the system.

Current published work describes how to discover correlated column pairs in a table. There is no current implementation of algorithms that discover column groups by analyzing the table data.

Adapting existing knowledge, the approach discussed herein presents a method for efficiently discovering column groups in an Asymmetric Massively Parallel Processing (AMPP) system. The approach focuses on three parts:

1. Bringing a reservoir sample together in an AMPP system, and then using this reservoir for column group discovery

2. Splitting the column group discovery analysis across an AMPP system such that the work can be distributed across all nodes of an AMPP system

3. Performing the actual discovery (based on the existing work) and merging the results on the central node

Collecting the Reservoir

An existing algorithm works on a 4K sample. In an AMPP environment, the data is split across different nodes. The applied algorithm runs on each node, walks through that node's data, and at the end builds a reservoir in each node based on the amount of data the node has. A sparsely populated node therefore contributes less to the reservoir than a densely populated node. At the end, each node sends its part of the reservoir to the central node for consolidation.
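The collection step can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes each node runs the classic single-pass reservoir algorithm (Algorithm R) and that a node's contribution is its proportional share of the total row count, rounded down. The function names, the parameter k, and the in-memory lists standing in for nodes are all hypothetical.

```python
import random


def reservoir_sample(rows, k, rng):
    """Algorithm R: one pass over rows, keeping a uniform sample of at most k rows.

    Returns the sample and the number of rows seen on this node.
    """
    reservoir = []
    seen = 0
    for row in rows:
        seen += 1
        if len(reservoir) < k:
            reservoir.append(row)
        else:
            j = rng.randrange(seen)  # uniform integer in [0, seen)
            if j < k:
                reservoir[j] = row
    return reservoir, seen


def collect_reservoir(node_data, k, seed=0):
    """Build per-node reservoirs, then consolidate them on the 'central node'.

    Assumption: each node contributes floor(k * node_rows / total_rows) rows,
    so sparsely populated nodes contribute less than densely populated ones.
    """
    rng = random.Random(seed)
    local = [reservoir_sample(rows, k, rng) for rows in node_data]
    total_rows = sum(seen for _, seen in local)

    merged = []
    for sample, seen in local:
        share = (k * seen) // total_rows if total_rows else 0
        # A uniform subsample of a uniform sample is still uniform.
        merged.extend(rng.sample(sample, min(share, len(sample))))
    return merged


# Two hypothetical nodes: one sparse (100 rows), one dense (900 rows).
nodes = [[("n0", i) for i in range(100)], [("n1", i) for i in range(900)]]
sample = collect_reservoir(nodes, k=50)
print(len(sample))
```

With these inputs the sparse node supplies 5 of the 50 sampled rows and the dense node supplies 45, which mirrors the proportional contribution described above.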


Figure 1: Collecting the reservoir

Breaking up Column Sets across Nodes

Given a table T1 with five columns, C1, C2, C3, C4, and C5, the number of column groups (just considering pairs of columns) that require analysis is:

Table 1: Column Groups

which is 5C2, or 10. Adding triads to the mix increases the number further.
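The count can be checked with a short sketch that enumerates the candidate column groups for T1. The helper name group_count is hypothetical; the enumeration itself uses only the Python standard library.

```python
from itertools import combinations
from math import comb

columns = ["C1", "C2", "C3", "C4", "C5"]

# Every unordered pair and triad of columns in T1.
pairs = list(combinations(columns, 2))
triads = list(combinations(columns, 3))


def group_count(num_columns, max_size):
    """Number of column groups of size 2 up to max_size (hypothetical helper)."""
    return sum(comb(num_columns, size) for size in range(2, max_size + 1))


print(len(pairs), len(triads), group_count(len(columns), 3))
```

For five columns this yields 10 pairs and 10 triads, or 20 candidate groups when both sizes are considered.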

This number can become very large on a table with 500 or 1000 columns, which is common in a warehouse schema. A table with 500 columns has 124,750 pairs to analyze. A table with 1000 columns has 499,500...