Method for Generating Secure and Highly Available Data for Hadoop Processing

IP.com Disclosure Number: IPCOM000238663D
Publication Date: 2014-Sep-10
Document File: 5 page(s) / 190K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a scheme to quantify security and availability of an input split when the secret shares generated from the input split are placed in a Hadoop cluster.



Hadoop* is a framework developed to process high volumes of data by running a large number of machines in a cluster, which run in parallel and share no memory or disk. The Hadoop Distributed File System (HDFS) runs on a cluster of nodes that spreads across many racks. When a client loads a data file into a Hadoop cluster, HDFS breaks the data into smaller blocks called input splits, spreads the input splits across a number of machines (nodes) throughout the cluster, and tracks where the data resides. Because multiple copies are stored in the cluster, data can be automatically replicated in case of a node failure. The standard setting for Hadoop is to produce and store three copies of each input split on multiple nodes in the cluster (one copy on a local node and two copies on nodes in a different rack), and the number of copies can be configured through a parameter. Although Hadoop's redundancy scheme makes it possible to retrieve data blocks from other locations when a node fails and thus achieves fault tolerance, Hadoop does not have a solid scheme for maintaining secure and highly available data in the cluster.
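For reference, the replication parameter mentioned above is the dfs.replication property of HDFS, set in hdfs-site.xml; the fragment below (an illustrative configuration sketch) shows the default value of three copies described above.

```xml
<!-- hdfs-site.xml: number of copies HDFS keeps of each block.
     The default of 3 matches the one-local-copy-plus-two-remote
     placement described above. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```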

The novel contribution is a scheme to quantify the security and availability of an input split when the secret shares generated from the input split are placed in a Hadoop cluster. This scheme determines the parameters, such as the redundancy factor and the number of secret shares, that are used to generate secret shares with a secret sharing scheme and an information dispersal algorithm (IDA), while taking into account the tradeoff between availability and security. It also proposes a scheme to decide how many shares can be stored within a single rack to reduce cross-rack communication while still maintaining manageable rack-level input split availability. The proposed scheme achieves a redundancy factor lower than HDFS's default redundancy factor and places the shares amongst the nodes in the cluster in a more secure way, while maintaining high availability of the input split compared with HDFS's placement scheme.
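The availability side of this tradeoff can be quantified with a simple probabilistic model (a sketch of the general idea, not the disclosure's exact formulation): if the n shares of an (n,m)-threshold split are placed on independent nodes that are each up with probability p, the input split remains available as long as at least m shares are reachable.

```python
from math import comb

def split_availability(n: int, m: int, p: float) -> float:
    """Probability that at least m of n independently stored shares
    are reachable, given per-node availability p (binomial model)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

def redundancy_factor(n: int, m: int) -> float:
    """Storage blow-up of an (n,m) split: each share is S/m in size,
    so n shares cost n/m times the original split size S."""
    return n / m

# Example: a (5,3) split stores only 5/3 of the original data, versus
# 3x for HDFS's default replication, yet stays highly available.
print(redundancy_factor(5, 3))
print(split_availability(5, 3, 0.99))
```

Comparing such curves for different (n,m) pairs against the 3x cost of default HDFS replication is one way to pick parameters on the availability/security/storage frontier.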

HDFS, by default, divides the input file into 64 MB input splits so that huge input data can be processed quickly and in parallel; an input split is the amount of data that goes into one map task. By default, the input splits are copied and placed randomly across the data nodes.

Figure 1: Method for placing secret shares on a Hadoop cluster


A secret sharing scheme is a method for distributing a secret amongst a group of parties such that the secret can be reconstructed only when enough shares are combined. The scheme is called an (n,m)-threshold scheme because the secret can be recovered from 'm' or more shares (availability condition), while fewer than 'm' shares reveal no information about the secret (security condition). The size of each share is (S/m), where 'S' is the size of the original secret, so the total size blows up from 'S' to S*(n/m).
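As a concrete illustration, a minimal Shamir-style (n,m)-threshold scheme can be sketched in Python (a toy over a small prime field for illustration only, not the disclosure's implementation and not production-grade): the secret is embedded as the constant term of a random degree-(m-1) polynomial, each share is one point on that polynomial, and any m shares recover the secret by Lagrange interpolation at x = 0.

```python
import random

PRIME = 2**13 - 1  # small prime field, for illustration only

def make_shares(secret: int, n: int, m: int) -> list[tuple[int, int]]:
    """Split `secret` into n shares; any m of them reconstruct it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(m - 1)]
    def poly(x: int) -> int:
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, poly(x)) for x in range(1, n + 1)]

def recover(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x=0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = make_shares(1234, n=5, m=3)
print(recover(shares[:3]))  # prints 1234: any 3 shares suffice
```

Note that in plain Shamir sharing each share is as large as the secret; the (S/m) share size quoted above comes from combining threshold sharing with an information dispersal algorithm, which spreads the data itself across the shares.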