Optimal distributed linear regression for parallel shared nothing architecture.
Publication Date: 2014-May-22
The IP.com Prior Art Database
We present scalable implementation of the Linear Regression algorithm on parallel shared nothing architecture. Parallel shared nothing architecture consists of number of nodes coordinated by a host. Each node consists of processing unit, operating memory and hard disk. We assume that observation matrix is uniformly distributed between the nodes.
Page 01 of 2
Optimal distributed linear regression for parallel shared nothing architecture .
We invent an optimal and scalable implementation of the classical Linear Regression algorithm in parallel shared nothing architecture. Linear Regression is one of the most popular data mining algorithms. So, the fast and scalable implementation is critical for our customer. Classical Linear Regression problem is stated in the following way. We are given an input m x n-matrix X of observations and target m-dimensional vector Y,
where m is number of rows and n is a number of columns in matrix X. Each observation
is a row in the input matrix X. Our goal is to compute n-dimensional vector B of coefficients that minimizes Euclidean distance between X*B and our target vector Y. Once, vector B is computed we can use it to predict value of the target y for a new observation. More precisely, y = x * B where x is 1 x n-vector of a new observation.
In parallel shared nothing architecture input data are distributed among the nodes. Each node consists of storage (hard disk), processing unit and operating memory. Work of all nodes is coordinated by host. Host can collect and process output of the nodes.
We show how to solve Linear Regression problem when rows of the input matrix X and target vector Y are distributed between the nodes. Our implementation is scalable with the number of nodes. That is twice larger number of nodes results in nearly twice faster execution assuming uniform data distribution between nodes. Additionally, memory utilization in our implementation depends only on the number of input columns and does not depend on the number of rows in the input matrix.
In particular, our implementation can be efficiently applied to the architecture of Pure Data System for Analytics machines. Our preliminary tests show that the new implementation of the distributed Linear Regression idea is about 25 times faster then the classical implementation. A prototype has been tested on a TwinFin-12 Netezza appliance. We show how classical Linear Regression algorithm can be efficiently distributed in parallel shared nothing architecture. Assuming that rows of input matrix X and target vector Y are uniformly distributed among the nodes we show that the heaviest matrix operations of the classical Linear Regression algorithm can be distr...