Browse Prior Art Database

Improve Data update performance among multi-replicas in HDFS

IP.com Disclosure Number: IPCOM000238218D
Publication Date: 2014-Aug-11
Document File: 4 page(s) / 79K

Publishing Venue

The IP.com Prior Art Database

Abstract

When importing data into HDFS with multiple replicas, the client writes data to the datanodes one by one, like a pipeline. Copying along the pipeline leads to long write delays and degrades performance. To improve data import performance in HDFS, this disclosure proposes a system based on an improved data synchronization method between replicas, containing three modules: 1) a data synchronization module, which synchronizes data between machines; 2) a replica observation module, which monitors replica configuration information and obtains replica locations; 3) a parallel update module, which updates data in parallel among the replicas. The disclosure also provides a method to synchronize data between machines, as follows: 1) monitor the machines that are performing the same task; 2) select a machine as the synchronizing machine according to the monitoring information; 3) send the updated data of all machines to the synchronizing machine. A method to update data in parallel among replicas is also provided: 1) get the number of replicas in the system; 2) monitor replica configuration information and obtain the replica locations; 3) the synchronizing machine initiates connections with the replicas; 4) write the updated data to disk from the synchronizing machine. Performing data synchronization and parallel updates on the application side reduces the cost of keeping data consistent among replicas.



Improve Data update performance among multi-replicas in HDFS

Core idea: In HDFS, there are multiple replicas to guarantee the reliability of data. When writing or updating data, the data is copied along a pipeline between the replicas, which leads to long write delays and degrades performance. This disclosure proposes a system which consists of three modules:

•Data synchronization module: synchronizes data between machines;

•Replica observation module: monitors replica configuration information and obtains replica locations;

•Parallel update module: updates data in parallel among the replicas.
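The three modules above can be sketched as follows. This is a minimal, illustrative Python sketch, not an HDFS API: the class and method names are hypothetical, and the "replica locations" and "updates" are plain in-memory dictionaries standing in for real datanode state.

```python
from concurrent.futures import ThreadPoolExecutor


class DataSyncModule:
    """Synchronizes updated data between machines (simulated)."""

    def collect(self, updates):
        # Merge per-machine update dicts into one combined view.
        merged = {}
        for machine, data in updates.items():
            merged.update(data)
        return merged


class ReplicaObservationModule:
    """Monitors replica configuration and reports replica locations."""

    def __init__(self, replica_locations):
        # replica_locations: block id -> list of datanode addresses
        self.replica_locations = replica_locations

    def locations(self, block_id):
        return self.replica_locations[block_id]


class ParallelUpdateModule:
    """Pushes updated data to all replicas in parallel (simulated)."""

    def update(self, replicas, data):
        # Fan the same update out to every replica concurrently,
        # rather than forwarding it along a pipeline.
        with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
            results = list(pool.map(lambda r: (r, dict(data)), replicas))
        return dict(results)
```

In this sketch the parallel update returns a map from each replica to the data it received, mimicking the fan-out that replaces the pipeline copy.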

-A method to synchronize data between machines

•Get the number of replicas in the system

•Monitor the machines that are performing the same task, and select as many synchronizing machines as there are replicas.

•Send the updated data of all machines to a synchronizing machine, and swap data among the synchronizing machines.
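The synchronization steps above can be sketched as follows. The selection policy (preferring the least-loaded machines) is an assumption for illustration, since the disclosure only states that synchronizing machines are chosen from the machines running the same task based on monitoring information; the function names and data shapes are hypothetical.

```python
def select_sync_machines(task_machines, num_replicas):
    """Pick as many synchronizing machines as there are replicas,
    preferring the least-loaded ones (assumed selection policy)."""
    ranked = sorted(task_machines, key=lambda m: m["load"])
    return [m["host"] for m in ranked[:num_replicas]]


def gather_updates(all_updates, sync_machines):
    """Send every machine's updated data to a synchronizing machine,
    then swap, so each synchronizing machine holds the full update set."""
    combined = {}
    for updates in all_updates.values():
        combined.update(updates)
    # After the swap step, every synchronizing machine has the same data.
    return {host: dict(combined) for host in sync_machines}
```

After the swap, each synchronizing machine holds a complete copy of the updates, so each one can later serve a different replica independently.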

-A method to update data in parallel among replicas

•Monitor replica configuration information and obtain the replica locations.

•Determine the correspondence between replicas and synchronizing machines.

•Write the updated data to disk from the synchronizing machines.
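The parallel update steps above can be sketched as follows, assuming each synchronizing machine is paired one-to-one with a replica and writes to it concurrently. The disk write is simulated with an in-memory dictionary; a real implementation would write to each datanode's local storage. All names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_replicas(sync_machines, replica_locations):
    """Pair synchronizing machines with replica locations one-to-one."""
    return dict(zip(sync_machines, replica_locations))


def parallel_update(assignment, data):
    """Each synchronizing machine writes the updated data to its replica
    in parallel, instead of forwarding it along a pipeline."""
    disks = {}  # replica address -> data written (simulated disk)

    def write(pair):
        machine, replica = pair
        disks[replica] = dict(data)  # simulated disk write
        return replica

    with ThreadPoolExecutor(max_workers=len(assignment)) as pool:
        list(pool.map(write, assignment.items()))
    return disks
```

Because every synchronizing machine already holds the full update set after the swap step, the writes are independent, and the update latency is bounded by the slowest single write rather than the sum of pipeline hops.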
