Browse Prior Art Database

Algorithm for parallel estimation of coefficients in batch of regression models in the share nothing parallel database

IP.com Disclosure Number: IPCOM000229324D
Publication Date: 2013-Jul-22
Document File: 3 page(s) / 106K

Publishing Venue

The IP.com Prior Art Database

Abstract

We are proposing an algorithm for a scheduler for a parallel share nothing database. Traditionally, subtasks relating to a first task have been all performed before starting to perform subtasks relating to a second task. Our scheduler uses a queue with a list of subtasks to be performed and information about the type of the subtasks. It interleaves subtasks from multiple tasks to achieve performance gain. The algorithm is suitable for parallel estimation of coefficients in batch of regression models.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

Page 01 of 3

Algorithm for parallel estimation of coefficients in batch of regression models in the share nothing parallel database

The problem that has been solved is how to speed up the computation of regression modeling by dividing a batch of tasks into a set of sub tasks that may be distributed in some specific fashion among different computational nodes.

In the regression-like modeling problems (like generalized regression modeling, mixed models modeling, Gaussian regression modeling) the computational task is to estimate a set of coefficients by iterating the following formula some number of times.

(1)

--

Where is a vector of coefficients that are estimated, c is some constant, g stands for gradient while H stands for hessian matrix.

The calculations defined by formula (1) are repeated some number of times. Each iteration is composed of two sub-steps:


A. calculation of gradient and hessian

B. calculation of new .

Assume that data is distributed onto k independent nodes.

The step A may be done in parallel. Since both gradient and hessian are additive, thus on each node the gradient and hessian for data-chunk may be computed. Then these partial sums are sent to a single node and the final hessian and gradient is calculated.

The step B has to be done on a single node (with multiple threads but with single shared memory).

We have proposed an algorithm for scheduler for parallel share nothing database. Such scheduler optimizes plan in which steps B's for different modeling tasks are processed in

parallel. When there are large number of models to be estimated the proper scheduling of steps A and B leads to faster calculations.

We did not found description of such algorithm anywhere in literature.

We have proposed an algorithm for scheduler for parallel share nothing database. We did not found description of such algorithm anywhere in literature.

Having p models to estimate, for each model the formula (1) has to be repeated a number of times (say n_i). In each iteration, steps A and B have to be calculated.

The scheduler uses a queue with a list of subtasks to be performed. The number of subtasks in the queue is not larger than the number of computational nodes. The subtasks are labeled A or B (see the definition in the ,,background'' section). When all tasks A are finished then tasks B are performed in parallel. After finishing a task of type A, a new task of B...