Browse Prior Art Database

Method of fast and low-host-involvement sorting in AMPP datawarehouse using adaptive hash function

IP.com Disclosure Number: IPCOM000233579D
Publication Date: 2013-Dec-16
Document File: 6 page(s) / 134K

Publishing Venue

The IP.com Prior Art Database

Abstract

One of the approaches that's can be taken to avoid overloading the host with sorting operation in the AMPP datawarehouse architecture is described. It leverages the AMPP features and allows to avoid situation where host become the bottleneck for the whole system.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 34% of the total text.

Page 01 of 6

Method of fast and low-host-involvement sorting in AMPP datawarehouse using adaptive hash function

One of the several tasks which can be a part of query executed against database is sorting of final or partial result of the query. In AMPP datawarehouses (Asymmetric Massive Parallel Processing) this operation is considered to be heavy from host load perspective - all nodes are returning not sorted, or partially sorted results to host, which have to do the final sort and merge. In many cases host become bottleneck for the whole system.

Main goal of this work is to describe one of the approaches that's can be taken to avoid overloading the host with sorting operation - it leverages the AMPP architecture and avoids situation where host become the bottleneck of whole system.

Article describes following features:

1. Method for minimizing SORT operation processing time in data warehouse AMPP environment based on efficient data redistribution within processing units and minimizing host involvement in sorting process.

2. Method for minimizing LIMIT operation processing time in data warehouse AMPP environment based on efficient SORT operation (from claim #1) and limiting amount of data send between host and processing unit.

3. Said efficient data redistribution from claim #1 is done by even split of processed
data within processing units using adaptive hash function and minimizing host role by splitting data within processing units in already ordered way

4. Said minimizing host involvement from claim #1 is done by sorting disjunctive portions of data in parallel on processing units and limitation of host effort to streaming the results.

5. In preferred embodiment, said adaptive hash function from claim #2 is calculated based on so called density function, which is representation of data spread within table's column.

The main idea behind this article is to improve SORT operation in following way:

1. split the table before the sort operation to N almost equal chunks, disjunctive from each others and covering whole domain (this is most complex part - and in fact whole idea is placed here)


2. redistribute them within N processing units based on newly defined hash function


3. sort them in parallel on processing units

4. send the chunks back to host - it does not need to merge the results - it is sufficient to stream the chunks to end user in proper order.

Improvement that can be taken from above for LIMIT operation is quite simple: when system have already sorted data on processing units, system can send back to host only rows which meet LIMIT criteria.

Background

At the beginning lets define density function, which is crucial to understand idea described in this article.

Density function f is a integrable function (or set of functions) defined on whole set of values between minimum and maximum value of particular column in the table:

For which

1


Page 02 of 6

is estimation of number of rows between 'a' and 'b'.

Lets assume there is a table A whi...