A new methodology for tracing the data source origin in a distributed big data processing system

IP.com Disclosure Number: IPCOM000248838D
Publication Date: 2017-Jan-17
Document File: 2 page(s) / 15K

Publishing Venue

The IP.com Prior Art Database

Abstract

This disclosure describes a new methodology for debugging data processing in big data applications. It is based on Spark: to instrument the tracing stub, it extends the existing RDD (resilient distributed dataset) implementation and then captures intermediate data in the production environment, so that developers can debug against real data. This makes debugging large volumes of data considerably more convenient.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 63% of the total text.

Big data system development currently faces a problem of tracing data at scale. Developers usually develop against a limited sample of data, but exceptions often occur when the processing is applied to the real, much larger data set. Locating the problem data is tough and time-consuming: the developer can only add verification code that checks for the issue based on experience-driven assumptions or guesses about the likely data.

This disclosure describes a new methodology for tracing data provenance in a distributed big data processing system. It includes the following main ideas:

1> Customize the dataset unit extension in the distributed data processing platform to record the data processing path and the parent data in each step.
2> Combine a logical trace plan with a physical node tracing plan to locate the exact data piece.
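The second idea, locating the exact data piece, can be illustrated with a minimal sketch in plain Scala collections. It is not the disclosure's implementation and does not use Spark: `Origin`, `Traced`, and `OriginTagging` are hypothetical names introduced here, and a nested `Seq` stands in for the partitions held on physical nodes. Each record is tagged with the partition and offset it came from, so a failing transformation reports where its input lives.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical model: where a record was read from (partition index and offset within it).
final case class Origin(partition: Int, offset: Int)

// A record paired with its origin.
final case class Traced[A](value: A, origin: Origin)

object OriginTagging {
  // Tag every input record with its source partition and its offset in that partition.
  def tag[A](partitions: Seq[Seq[A]]): Seq[Traced[A]] =
    for {
      (rows, p) <- partitions.zipWithIndex
      (row, i) <- rows.zipWithIndex
    } yield Traced(row, Origin(p, i))

  // Apply f to each record. Successful results keep their origin;
  // failures are returned separately together with the origin of the offending input.
  def mapTraced[A, B](in: Seq[Traced[A]])(f: A => B): (Seq[Traced[B]], Seq[(Origin, Throwable)]) = {
    val attempts = in.map(t => (t.origin, Try(f(t.value))))
    val ok = attempts.collect { case (o, Success(v)) => Traced(v, o) }
    val failed = attempts.collect { case (o, Failure(e)) => (o, e) }
    (ok, failed)
  }
}
```

For example, parsing the partitions `Seq(Seq("1", "2"), Seq("x", "4"))` with `_.toInt` yields one failure, reported at partition 1, offset 0, instead of an exception with no indication of which row caused it.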

Advantages:

1> The data origin path makes the developer aware of the processing details.
2> Data piece locating lets the developer easily and quickly check the problem data rows.

Details:

1> Extend the dataset unit implementation to travel lightly through the data transformation process flow at stage boundaries.

    class TrackingRDD[T] extends RDD[T] {
      def traceBack(): TrackingRDD[T]
      def trackForward(): TrackingRDD[T]

      def compute(...) { -- insert impl here -- }
    }

    val resultDF = myDF.sortBy(_._1).collect
    resultDF.traceBack()
    .........
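The parent-recording behaviour described in step 1 can be modelled without Spark. The sketch below is an assumption-laden stand-in, not the disclosure's `TrackingRDD`: `TrackedDataset` is a hypothetical class over an in-memory `Vector`, each transformation takes an explicit step label, and only `traceBack` is modelled. Here it returns the list of step names rather than a dataset, which is a simplification chosen for illustration.

```scala
// Hypothetical stand-in for a tracking dataset; plain Scala collections, no Spark.
final class TrackedDataset[A](
    val name: String,                     // label of the step that produced this dataset
    val rows: Vector[A],                  // the data held at this step
    val parent: Option[TrackedDataset[_]] // the dataset this one was derived from, if any
) {
  // Derive a child dataset and record this dataset as its parent.
  def map[B](step: String)(f: A => B): TrackedDataset[B] =
    new TrackedDataset[B](step, rows.map(f), Some(this))

  // Sort by a key, again recording this dataset as the parent.
  def sortBy[K: Ordering](step: String)(key: A => K): TrackedDataset[A] =
    new TrackedDataset[A](step, rows.sortBy(key), Some(this))

  // Walk the parent links back to the source; the most recent step comes first.
  def traceBack(): List[String] =
    name :: parent.map(_.traceBack()).getOrElse(Nil)
}
```

Starting from a source dataset named `"source"` and applying `sortBy("sortBy(_._1)")(_._1)`, calling `traceBack()` on the result lists the sort step followed by the source. Because every intermediate dataset stays reachable through `parent`, its rows can be inspected at any step of the path.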

2> address the workflow instrumenting after the customization of t...