Browse Prior Art Database

Method to decide on the best partitioning columns and best range values for creating the range partition tables

IP.com Disclosure Number: IPCOM000200914D
Publication Date: 2010-Oct-29
Document File: 6 page(s) / 71K

Publishing Venue

The IP.com Prior Art Database

Abstract

Range partitioned tables[RPT] uses a data organization scheme in which table data is divided across multiple storage objects, called data partitions or ranges. These ranges are created according to values based on one or more table partitioning key columns of the table. DBAs are required to decide the column & the ranges on which to partition RPT tables. Failures to determine correct partition column/ranges will result in bad query performances. In this idea , we propose a method to determine the best partition columns and ranges while creating a RPT tables to make the maximum utilization of the feature. The idea analyses the load file and depending on its contents, the clusters are constructed using Hierarchical Clustering method. These clusters in turn help in deciding the best partition column and ranges for a given set of data. Using this method to determine partition columns/ranges will improve query performance by minimizing on the IO operations. This also ensures that there are no performance issues because of any human errors as the data is analyzed by the algorithm.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 29% of the total text.

Page 01 of 6

Method to decide on the best partitioning columns and best range values for creating the range partition tables

Background:

      Range Partitioning (or table partitioning) is a feature in DB2 that allows us to create tables where the data is placed in multiple partitions according to data ranges that is specified. It is the most common type of partitioning. The DB2 optimizer is aware of the partitions and if the query needs to scan many rows to get the result set and the predicate (where clause) uses ranges that is defined. It only scans the partitions that have rows that satisfy the query instead of searching the entire table. This is called "partition elimination" and can greatly reduce query elapsed times.

Problem:

      If the range partitioned tables are not partitioned with the right ranges, then the best possible query performance is not obtained as per the Range partition table feature design. Disadvantages of not getting the ranges right would be as follows:
1) If the range of the partitions are such that some partitions are really huge and others are small then the purpose of creating a partition is lost, i.e the query performance on that large partition will be less and does not make the best use of the feature.
2) The partition elimination might not happen and the query might end up accessing more than one partition.
3) Increase in I/O as more number of pages needs to be 'read in' to the page buffers, hence less efficient.

Difficulties in designing a range partition table:
1. In data warehouses, the size of the tables poses challenges to the administrators. These tables routinely contain hundreds of billions of rows. Manually analyzing the data and getting the right ranges to get the maximum benefit of the feature is something that needs a lot of effort, efficiency, time and experience.
2. Any kind of human errors which might end up in deciding wrong range partition can lead into the low performance of the query. Example for human errors would be like like analyzing only 5 columns when there were 6 columns considered for partitioning

Problem Solved:

      The problem on the performance of the range partition tables for partitioning on the wrong ranges can be avoided by using an analyzing algorithm that will help in getting the best options for the key design decisions needed while creating the range partition tables. The algorithm will analyze the datafile or the load file and analyze the the partitioning column and the best possible ranges (consisting of some interesting patterns) using the hierarchical cluster method.

The results of the algorithm will be the column on which the table needs to be partitioned on and the best range values for creating the range partition tables.

The algorithm decides on the ranges in such a way that
1) the partitions are almost of same size so that the query performance is at its best as I/ O is reduced considerably, since lesser number of pages need to be 'read in' to the page buffers

2) Most of the releva...