Automatic learning of efficient data distribution between nodes on Netezza's parallel architecture
Publication Date: 2014-Jun-06
The IP.com Prior Art Database
Disclosed is a method for distributing data within a storage system based on record similarity, as measured by machine learning techniques. The resulting distribution keeps similar records together, which makes later data exploration and analytics more efficient.
Distributed storage systems that keep data on several nodes need a method for distributing the data between those nodes. Common solutions to this problem are:
* random distribution,
* distribution based on a selected attribute,
* distribution based on hash value of a record.
These solutions suffer from several drawbacks:
* similar data are often located on different nodes, which may cause redistribution to be necessary in certain use cases,
* a single attribute is often not enough to measure similarity between records,
* the user is often unable to determine which attribute is the best choice as a basis of distribution.
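The three common strategies listed above can be sketched as follows. This is only an illustration and not code from any particular system: the function names, the dictionary record format, and the choice of SHA-256 as a stable hash are assumptions made for the example, as is the decision to map the selected attribute to a node by hashing its value.

```python
import hashlib
import random


def stable_hash(value):
    """Process-independent integer hash (Python's built-in hash() is salted for str)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def random_node(num_nodes, rng):
    """Random distribution: any node, regardless of record content."""
    return rng.randrange(num_nodes)


def attribute_node(record, attribute, num_nodes):
    """Distribution on one selected attribute (assumed here: hash of its value)."""
    return stable_hash(record[attribute]) % num_nodes


def record_hash_node(record, num_nodes):
    """Distribution on a hash of the whole record (all attribute/value pairs)."""
    return stable_hash(sorted(record.items())) % num_nodes


if __name__ == "__main__":
    rng = random.Random(0)
    a = {"customer": "C1", "amount": 10.0}
    b = {"customer": "C1", "amount": 10.5}  # nearly identical to a
    for rec in (a, b):
        print(random_node(4, rng),
              attribute_node(rec, "customer", 4),
              record_hash_node(rec, 4))
```

In this sketch only the attribute-based variant is guaranteed to co-locate the two records, and only because they happen to share the chosen attribute value. The random and whole-record variants place them independently of how similar they are, which is the drawback described above.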
Our idea is a data distribution based on a machine-learning technique that measures the similarity between records. Records identified as similar are stored on the same node of the system. This method of data distribution avoids the need for later redistribution when the data is put into use.
The goal is to find a function that assigns each row of a table to a particular node in such a way that rows that will be used together in later queries are stored on the same node. This implies that rows that are similar with respect to particular features should be located on the same node. To achieve this, we propose the use of machine-learning techniques, in particular unsupervised techniques such as K-Means or hierarchical clustering. Two distinct scenarios are possible:
* on-line distribution: row distribution is calculated on the fly using incoming data (e.g. by performing on-line K-Means clustering),
* off-line distribution: row distribution is calculated based on rows already present in the system, then the redistribution step is performed.
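Both scenarios can be illustrated with a minimal K-Means sketch in which each cluster index is treated as a node number. The names `offline_kmeans` and `OnlineDistributor` are hypothetical, rows are assumed to be tuples of numeric features, and the on-line variant uses a running-mean centroid update (sequential K-Means), which is one possible choice rather than the method prescribed by this disclosure.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centroids, row):
    """Index of the centroid closest to row; the index doubles as the node id."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(centroids[i], row))


def offline_kmeans(rows, num_nodes, iterations=20, seed=0):
    """Off-line: cluster rows already in the system, one cluster per node."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, num_nodes)]
    for _ in range(iterations):
        groups = [[] for _ in range(num_nodes)]
        for row in rows:
            groups[nearest(centroids, row)].append(row)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    assignment = [nearest(centroids, row) for row in rows]
    return centroids, assignment


class OnlineDistributor:
    """On-line: assign each incoming row, then nudge the chosen centroid."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [1] * len(self.centroids)

    def assign(self, row):
        i = nearest(self.centroids, row)
        self.counts[i] += 1
        rate = 1.0 / self.counts[i]  # running-mean step size
        self.centroids[i] = [c + rate * (x - c)
                             for c, x in zip(self.centroids[i], row)]
        return i


if __name__ == "__main__":
    rows = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, assignment = offline_kmeans(rows, num_nodes=2)
    print(assignment)  # node id per row; similar rows share a node
    dist = OnlineDistributor(centroids)
    print(dist.assign((0.5, 0.5)), dist.assign((9, 9)))
```

In the off-line case the returned assignment would drive the redistribution step. In the on-line case the centroids could be seeded from an off-line run, as in the usage example, or from the first incoming rows.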
A typical usage scenario of our idea (for the off-line distribution) is the following:
1. The user chooses a subset of columns with respect to which the redistribution will take place. This may be the set of all columns in the table.
2. The user chooses the number of nodes between which the data will be redistributed.
3. An unsupervised clusterin...