
A method of efficiently computing the multivariate statistics among massive variables in big data

IP.com Disclosure Number: IPCOM000237359D
Publication Date: 2014-Jun-16
Document File: 8 page(s) / 155K

Publishing Venue

The IP.com Prior Art Database

Abstract

In big data analysis, finding the correlation relationship among the individual observed variables is much more valuable than the causal relationship. In the aspect of correlation relationship, multivariate statistics is a kind of important and common statistics to help people understand the relationship between variables. Hadoop MapReduce framework can be used to handle the data depth problem easily. However, for data width problem, it is a big challenge. This article introduces an efficient method to compute the multivariate statistics among massive observed variables in big data scenario.



A method of efficiently computing the multivariate statistics among massive variables in big data


In big data analysis, finding the correlation relationships among the individual observed variables is a very important topic, often even more valuable than finding causal relationships. For correlation relationships, multivariate statistics are an important and common class of statistics that help people understand the relationships between variables and their relevance to the actual problem being studied.

To compute the multivariate statistics among massive observed variables in a big data scenario, both the data depth (massive records) and the data width (massive individual observed variables) are big challenges. People can take advantage of the Hadoop MapReduce framework to handle the data depth problem easily, because the Hadoop framework scales naturally with the number of records. But for the data width problem, people need to think carefully when applying the MapReduce framework: once the number of observed variables becomes very large, the load on each mapper or reducer node becomes very heavy and the memory cost grows quadratically. Taking bivariate statistics (which describe the relationship between two individual variables) as an example, if the number of individual variables is n and people want the bivariate statistics of every pair of variables, they have to compute n(n-1)/2 pairs. Assuming each pair takes the same memory size m, the total memory cost will be m * n(n-1)/2. Usually, people can use one MapReduce job to get the bivariate statistics, as sketched after the following steps:


- In the mapper, compute all bivariate pairs against each data block. Each mapper handles n(n-1)/2 pairs and its memory cost is m * n(n-1)/2.
- In the reducer, the bivariate statistics for each pair are merged separately.
- The server gets the final results.
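
For illustration, below is a minimal sketch of such a single job using Hadoop's Java MapReduce API. The class names, the comma-separated record layout, and the text encoding of a crosstab cell key are assumptions made for this example rather than part of the disclosure; the in-memory map held by each mapper is what produces the m * n(n-1)/2 memory cost described above.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Sketch: one MapReduce job that accumulates crosstab counts for every variable pair. */
public class AllPairsCrosstab {

    /** Mapper: builds the crosstab cells of all n*(n-1)/2 pairs for its data block. */
    public static class PairMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        // In-memory cells for all pairs of this block; the source of the quadratic memory cost.
        private final Map<String, Integer> cells = new HashMap<>();

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            String[] v = line.toString().split(",");   // one record = comma-separated categories
            for (int i = 0; i < v.length; i++) {
                for (int j = i + 1; j < v.length; j++) {
                    // Cell key: variable indices plus the observed category values.
                    String key = i + "," + j + "," + v[i] + "," + v[j];
                    cells.merge(key, 1, Integer::sum);
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit the partial crosstab cells of this data block once, at the end of the split.
            for (Map.Entry<String, Integer> e : cells.entrySet()) {
                context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
            }
        }
    }

    /** Reducer: merges the partial counts of each crosstab cell coming from different blocks. */
    public static class CellSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text cell, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(cell, new IntWritable(sum));
        }
    }
}
```

Because merging cell counts is a simple associative sum, the same reducer class could also be registered as a combiner to reduce shuffle traffic.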




The memory cost during the computation of multivariate statistics can always be calculated in advance from the metadata of the involved variables, such as the number of categories of a categorical variable. Thus, the per-pair memory size m is a known value which can be calculated before the computation starts.
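
As an illustration, such an estimate can be derived directly from the category counts. The sketch below uses hypothetical class and method names and assumes each categorical-by-categorical pair is stored as an R x C table of 8-byte doubles:

```java
/** Illustrative estimate of crosstab memory cost from variable metadata (category counts). */
public final class CrosstabMemoryEstimator {

    private static final long BYTES_PER_CELL = 8L;  // each crosstab cell is a double

    /** Memory size m of one categorical-by-categorical pair: R x C cells of 8 bytes. */
    public static long pairBytes(int categoriesA, int categoriesB) {
        return (long) categoriesA * categoriesB * BYTES_PER_CELL;
    }

    /** Total memory of all n*(n-1)/2 pairs, given the category count of every variable. */
    public static long totalBytes(int[] categories) {
        long total = 0L;
        for (int i = 0; i < categories.length; i++) {
            for (int j = i + 1; j < categories.length; j++) {
                total += pairBytes(categories[i], categories[j]);
            }
        }
        return total;
    }
}
```

For example, 1,000 variables with 20 categories each give 1,000 * 999 / 2 = 499,500 pairs of 20 * 20 * 8 = 3,200 Bytes each, about 1.6 GB of crosstab cells on a single node.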

To make the problem more specific, say all the observed individual variables are categorical with 20 categories each. The bivariate statistics of a categorical-by-categorical variable pair might include:


- Crosstabulation table
- Pearson Chi-Square
- Cramer's V
- Spearman Correlation
- Kappa

The key statistic among the above is the Crosstabulation table. The Crosstabulation table is an R x C matrix, in which R and C are the numbers of categories of the two variables in the pair. All other statistics are derived from the Crosstabulation table, as sketched below. Thus, if each individual variable in the data has 20 categories, then the memory usage of each pair is 20 x 20 x 8 = 3,200 Bytes (each value in the matrix is stored as a double, taking 8 Bytes).
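
To illustrate how the other statistics are conducted from the Crosstabulation table, below is a minimal sketch for Pearson Chi-Square and Cramer's V (class and method names are hypothetical; Spearman Correlation and Kappa follow the same pattern of a single pass over the table):

```java
/** Illustrative derivation of Pearson Chi-Square and Cramer's V from an R x C crosstab. */
public final class CrosstabStats {

    /** Pearson Chi-Square: sum over all cells of (observed - expected)^2 / expected. */
    public static double pearsonChiSquare(double[][] table) {
        int rows = table.length, cols = table[0].length;
        double[] rowTotal = new double[rows];
        double[] colTotal = new double[cols];
        double grandTotal = 0.0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                rowTotal[r] += table[r][c];
                colTotal[c] += table[r][c];
                grandTotal += table[r][c];
            }
        }
        double chiSquare = 0.0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double expected = rowTotal[r] * colTotal[c] / grandTotal;
                if (expected > 0) {
                    double diff = table[r][c] - expected;
                    chiSquare += diff * diff / expected;
                }
            }
        }
        return chiSquare;
    }

    /** Cramer's V: sqrt(chi2 / (N * (min(R, C) - 1))). */
    public static double cramersV(double[][] table) {
        double n = 0.0;
        for (double[] row : table) {
            for (double cell : row) {
                n += cell;
            }
        }
        int minDim = Math.min(table.length, table[0].length);
        return Math.sqrt(pearsonChiSquare(table) / (n * (minDim - 1)));
    }
}
```

Since each of these statistics only needs the R x C table of counts, they can be computed after the reduce step, once the final crosstab of each pair has been merged.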

Thus, if there are massive individual var...