Browse Prior Art Database

A Method to recommend spark RDD caching Disclosure Number: IPCOM000244639D
Publication Date: 2016-Jan-04
Document File: 7 page(s) / 135K

Publishing Venue

The Prior Art Database


Spark is an open source in-memory application framework for distributed framework for distributed data processing and iterative analyasis on massive data volumes. Resilient Distributed Dataset (RDD) is the basic in-memory data abstraction in spark, when data is loaded into spark, it is represented by RDD. Since spark is an in-memory framework, data caching is an outstanding Spark capability, but Spark relies on developer to identify which RDD should be cached, so Spark application developer needs to pay attention on caching at the correct places according to code logic, this is critical to performance, but this distract spark developer from real business logic and also this is difficult in complex algorithm(MLlib) The disclosure introduces a method which can give recommendation on which RDD should be cached for spark application

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.

Page 01 of 7

A Method to recommend spark RDD caching

In general, when one RDD is reused, this RDD should be cached to prevent from scanning original data many times. But caching RDD sometimes is a tradeoff, the more RDD caching in memory, the less memory left for computation. So we need to cache the key RDD instead of any one reused.

In this disclosure,all stages in DAG of a spark appliation are analyzed, output is the RDDs which are good to be cached in the order of priority .

This disclosure follows steps:

Step 1: Initialize analyzing forest with first job DAG stage tree

Step 2:Take next job DAG, merge this graph into overall graph in a way that if there are duplicate edges, add 1 to edge in overall graph, else add new tree or branch to overall graph

Step 3:Merge all DAG following step2

Step 4:For final overall DAG, pick up all subgraphs with edge value>1, identify the bottom node of each subgraph, sort all these bottom nodes by edge value

Take DAG of KMeans application as an example to elaborate the method:

If user runs KMeans application, he would get the following DAGs(there are many more stages, only list first 5 here)


Page 02 of 7


Page 03 of 7

Applying above method, the following graphs for each step is generated:


Page 04 of 7


Page 05 of 7

For the merged forest(a tree in this example), after analyzing, RDD3, RDD6 and RDD10 are reused, user can choose to cache these RDDs base on the consideration of memory size.


Page 06 of 7


Page 07 of 7

Performance validation:

Take a spark application(KMean MLlib application) as example, it takes 3.8 minutes when no caching in code, it takes 59s after applying

the recommendation generated following the method in this disclosure. An obvious performance improvement is shown.

No cache

after applying recommendation


On the other hand, this result cannot prove proper RDD are re...