
Improve data query performance through distributed cache in Hadoop-based big data platform

IP.com Disclosure Number: IPCOM000234139D
Publication Date: 2014-Jan-14
Document File: 7 page(s) / 116K

Publishing Venue

The IP.com Prior Art Database

Abstract

Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Hadoop is an open-source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. This article proposes a distributed cache system for a Hadoop-based big data platform so as to improve data query performance.


Improve data query performance through distributed cache in Hadoop-based big data platform

1. Motivation


In a Hadoop-based big data platform under an iterative computing environment, the inputs are read from HDFS once per computation iteration, which introduces long latency. For example, when data is scattered across multiple HDFS blocks, it takes time to gather the blocks for each iteration, and reading data from HDFS involves disk access and even network transmission. In addition, some computation processes are repeatedly invoked for the same data set, which increases the system workload and application response time.

2. Core idea


This article proposes a distributed cache system that 1) stores hot raw data and frequently used computation results in memory, and 2) synchronizes cached data with the database. The distributed cache system consists of: a Master cache, which stores and manages frequently used computation results; a Slave cache, which stores and manages hot raw data; and an Update record module, which maintains database update records. This article also proposes two methods, one to efficiently synchronize data among HDFS clusters and one to timely and efficiently synchronize data between the cache and the database. The proposed cache system and methods can use a limited amount of memory to store hot data so as to improve query performance, and can timely and efficiently synchronize data between the cache and the database so as to ensure data freshness. An illustrative sketch of the three components follows.
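
The following Java sketch is an illustration only and is not taken from the original disclosure; it outlines one minimal way the three components could be modeled in memory. All class and method names (MasterCache, SlaveCache, UpdateRecordModule, etc.) are assumptions made for illustration.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Master cache: in-memory store for frequently used computation results.
    class MasterCache {
        private final Map<String, byte[]> results = new ConcurrentHashMap<>();

        byte[] getResult(String queryKey) { return results.get(queryKey); }
        void putResult(String queryKey, byte[] value) { results.put(queryKey, value); }
        void invalidate(String queryKey) { results.remove(queryKey); }
    }

    // Slave cache: in-memory store for hot raw data (e.g., frequently read HDFS blocks).
    class SlaveCache {
        private final Map<String, byte[]> rawData = new ConcurrentHashMap<>();

        byte[] getBlock(String blockId) { return rawData.get(blockId); }
        void putBlock(String blockId, byte[] data) { rawData.put(blockId, data); }
        void invalidate(String blockId) { rawData.remove(blockId); }
    }

    // Update record module: keeps a log of database updates so the caches
    // can later be notified and refreshed.
    class UpdateRecordModule {
        private final List<String> updatedKeys = new ArrayList<>();

        synchronized void recordUpdate(String key) { updatedKeys.add(key); }

        // Return and clear the pending update records.
        synchronized List<String> drainUpdates() {
            List<String> snapshot = new ArrayList<>(updatedKeys);
            updatedKeys.clear();
            return snapshot;
        }
    }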


3. System diagram


Master cache: an in-memory cache that stores and manages historical computation results

Slave cache: an in-memory cache that stores and manages raw data

Update record module: maintains database update records

Figure 1 System diagram


Figure 2 System workflow

Currently, there are multiple kinds of cache systems. Table 1 summarizes the differences among these cache systems from the perspective of cache content and update mechanism.

Table 1 Comparison of different cache systems


4. Data synchronization method between cache and database


Since the cache only maintains hot data, when these data are updated in the database, there must be a mechanism to notify the cache and update its content. However, the size of c...
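
The remainder of this section is truncated in the extracted text. As a hedged sketch only, assuming an update-record based notification scheme consistent with the Update record module described above and reusing the illustrative classes from the earlier sketch, the synchronization step might look like the following; all names and the scheduling policy are assumptions, not the disclosure's stated method.

    import java.util.List;

    // Illustrative synchronizer: drains pending database update records and
    // invalidates the matching cache entries, so stale hot data is reloaded
    // from the database on the next read.
    class CacheSynchronizer implements Runnable {
        private final UpdateRecordModule updates;
        private final MasterCache masterCache;
        private final SlaveCache slaveCache;

        CacheSynchronizer(UpdateRecordModule updates, MasterCache masterCache, SlaveCache slaveCache) {
            this.updates = updates;
            this.masterCache = masterCache;
            this.slaveCache = slaveCache;
        }

        @Override
        public void run() {
            // Collect the keys updated in the database since the last pass.
            List<String> changedKeys = updates.drainUpdates();
            for (String key : changedKeys) {
                slaveCache.invalidate(key);   // hot raw data tied to this key
                masterCache.invalidate(key);  // computation results derived from it
            }
        }
    }

In such a setup, a java.util.concurrent.ScheduledExecutorService could run this task periodically to keep cache content fresh while only the update records, rather than the full data set, are exchanged on each pass.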