
Optimal Linear Regression with nominal attributes for Data Warehouses and data base systems

IP.com Disclosure Number: IPCOM000240451D
Publication Date: 2015-Jan-30
Document File: 5 page(s) / 53K

Publishing Venue

The IP.com Prior Art Database

Abstract

We show how to implement Linear Regression in database systems in the case where the observation table contains nominal columns. Our method requires only one data scan, which is optimal and makes our implementation about 3x faster than the classical one that uses an intermediate numerical table. Our method can also be used to implement an optimal model update operation, where the complexity of the update depends only on the number of new observations. Our method guarantees perfect scalability on data warehouses relying on Massively Parallel Processing.




Linear Regression is one of the most popular data mining algorithms used in database systems and data warehouses. The input for the Linear Regression problem is a matrix of observations X, n x m, and a target vector Y, n x 1, where m is the number of columns and n is the number of observations. The goal is to find the vector of coefficients B, m x 1, that minimizes the L2 distance between the target vector Y and X*B. More precisely, we are seeking the coefficient vector B that minimizes the residual sum of squares (RSS):

RSS(B) = (Y - X*B)^{T} * (Y - X*B)

The classical Linear Regression algorithm works in the following way:
1) Compute the covariance matrix XTX = X^{T} * X, where X^{T} is the transpose of X.
2) Compute B = XTX^{-1} * (X^{T} * Y), where XTX^{-1} is the inverse of XTX.
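As a minimal illustration of these two steps, here is a NumPy sketch; the function name solve_normal_equations and the use of numpy.linalg.solve instead of an explicit inverse are our choices, not part of the disclosure:

    import numpy as np

    def solve_normal_equations(X, Y):
        # Step 1: the covariance matrix XTX = X^{T} * X. With n >> m,
        # this single pass over the n observations dominates the cost.
        XTX = X.T @ X
        # Step 2: B = XTX^{-1} * (X^{T} * Y). Solving the m x m linear
        # system is equivalent to applying the explicit inverse.
        return np.linalg.solve(XTX, X.T @ Y)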

The most expensive part of the algorithm is computing the covariance matrix XTX. Computing the inverse of XTX is fast, since XTX is an m x m matrix, where m is the number of predictors, which is never larger than 1000 in practical applications.

In database systems the input matrix X of observations is represented as a table whose columns are predictors and whose rows are observations. If the input table contains nominal columns, it must be converted so that all columns in the new table are numerical. The conversion of the original input table to the numerical table works in the following way. Each nominal column is decomposed into a number of new columns, one for each unique item in the original nominal column. A binary representation is then used to mark the presence of an item in a row (a small code sketch of this conversion follows table T2 below). Consider the following example to understand the conversion. This is the original input table T1:

| id | col1 |
-------------
| 1  | a    |
| 2  | b    |
| 3  | c    |

The original table T1 has to be converted to the intermediate numerical representation T2.

| id | col1_a | col1_b | col1_c |
--------------------------------
| 1  | 1      | 0      | 0      |
| 2  | 0      | 1      | 0      |
| 3  | 0      | 0      | 1      |
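For illustration only, the decomposition of a nominal column into binary indicator columns can be sketched in Python as follows; the helper one_hot and the sorted category order are hypothetical choices, since the disclosure performs this conversion inside the database:

    def one_hot(values):
        # One new indicator column per unique item; sorted order is an
        # arbitrary illustration choice (a, b, c in the example above).
        categories = sorted(set(values))
        # Each row gets a 1 in the column of its item and 0 elsewhere.
        return [[1 if v == c else 0 for c in categories] for v in values]

    # Reproduces the body of table T2 from column col1 of table T1:
    # one_hot(["a", "b", "c"]) -> [[1, 0, 0], [0, 1, 0], [0, 0, 1]]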




We can see that the number of new columns in a table after conversion may grow significantly (it is equal to the number of unique values in the original column). Having converted the input table to its numerical representation, we are ready to solve the Linear Regression problem. The main drawback of this classical approach is that we create a new intermediate table, which has to be written and then scanned to solve the Linear Regression. Since disk operations are the bottleneck of the whole algorithm, execution with nominal attributes using an intermediate table is about 3x slower than execution that requires only one data scan.

We present a new optimal method of solving Linear Regression in case the input matrix contains nominal columns. The advantages of our approach are:
1) Our method requires exactly one data scan, which is optimal, and avoids creating an intermediate numerical table. Therefore, our method is about 3x faster than the classical approach (a sketch of the single-scan idea follows this list).
2) Our method supports an optimal model update operation, where the complexity of the update depends only on the number of new observations.
3) Our method scales perfectly on data warehouses relying on Massively Parallel Processing.
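To make the single-scan idea concrete, the following Python sketch accumulates XTX and X^{T} * Y directly from the original rows, expanding nominal values into virtual indicator columns on the fly instead of materializing table T2. This is only our illustration of one possible realization, not the disclosed implementation: the function fit_one_scan, the dictionary-based category-to-column maps, and the assumption that the categories of each nominal column are known in advance are all ours.

    import numpy as np

    def fit_one_scan(rows, targets, num_numeric, categories_per_nominal):
        # Width m of the virtual design matrix after one-hot expansion.
        m = num_numeric + sum(len(cats) for cats in categories_per_nominal)
        XTX = np.zeros((m, m))
        XTY = np.zeros(m)
        # Category -> virtual column index, for each nominal column.
        maps, base = [], num_numeric
        for cats in categories_per_nominal:
            maps.append({c: base + i for i, c in enumerate(cats)})
            base += len(cats)
        # The single data scan: expand each row on the fly and update
        # the accumulators; no intermediate table is ever written.
        for row, y in zip(rows, targets):
            x = np.zeros(m)
            x[:num_numeric] = row[:num_numeric]        # numeric columns as-is
            for j, v in enumerate(row[num_numeric:]):  # nominal columns
                x[maps[j][v]] = 1.0                    # virtual indicator
            XTX += np.outer(x, x)                      # accumulate X^{T} * X
            XTY += y * x                               # accumulate X^{T} * Y
        return np.linalg.solve(XTX, XTY)

Because XTX and XTY are plain sums over observations, they can be stored alongside the model and extended later: folding in k new rows re-runs only the loop body for those k rows, consistent with an update cost that depends only on the number of new observations. The same additivity presumably underlies the MPP scalability claim, since each node can accumulate partial sums over its local rows and combine them in a single reduction step.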