
Framework To Support Versioning Of A Run Of Big Data Applications
Disclosure Number: IPCOM000234049D
Publication Date: 2014-Jan-09
Document File: 5 page(s) / 67K

Publishing Venue

The Prior Art Database


Disclosed is a framework, applicable to Big Data applications, that supports versioning of runs by using the notion of "scenarios". This framework provides a new Application Programming Interface (API) to Big Data applications.




Big Data applications perform complex analytics processing on very large datasets. The output data such applications generate may feed business decision-making processes. The quality of that output data is therefore extremely important, and users need high confidence in it before adopting the applications for ongoing operations.

Many factors affect the output data generated by Big Data applications. One factor is the metadata specified for running the applications. Metadata provides configuration information that controls processing logic. Application developers usually provide a definition for the metadata, as well as sample data to populate it for various use cases and industries. Data scientists, who have domain expertise and an in-depth understanding of the data under analysis, usually augment that metadata. The processing logic of Big Data applications also affects output. Finally, output is affected by the input datasets processed and by the interplay between the application logic and the metadata used for processing particular datasets.

A run of the applications performs the analytics processing that transforms input datasets into output data of value to the user. Each run has associated metadata: a run represents a logical version of an execution of the applications, with its associated metadata, performed by a particular user at a particular time.

A run can be a "sample" run or a "production" run. In a "sample" run, the user is experimenting with a small subset of the input datasets and is using a particular variation of the metadata. A "production" run is typically run on larger input datasets of interest. The run may involve "standalone" or "chained" applications.
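As an illustration only, the notion of a run described above can be captured as a simple record that pins the metadata, the user, and the time of execution. The field names below are assumptions for the sketch, not the disclosed framework's API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Run:
    """A logical version of one execution of the applications."""
    run_id: str
    kind: str                  # "sample" or "production"
    input_datasets: list       # datasets processed by this run
    metadata: dict             # configuration metadata controlling processing logic
    user: str                  # who performed the run
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

# A "sample" run experiments with a small subset of the input datasets
# and a particular variation of the metadata.
run = Run("run-001", "sample", ["hdfs://data/q1_sample.csv"],
          {"min_score": 0.5}, user="data-scientist-1")
print(run.kind)  # → sample
```

A "production" run would differ only in its `kind` and in pointing at the larger input datasets of interest; the record shape stays the same, which is what makes runs comparable and versionable.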

A "standalone" application can be construed as an application which, given certain input datasets and metadata, has all the logic to transform the datasets to output data of value to the user.

For a given use case, one or more Big Data applications may be involved, so the applications are granular. In addition, since some processing steps may need variants of applications, one or more Big Data applications may be involved in sample runs and subsequent production runs. Applications run together in this way are referred to as "chained" applications. Chained applications effectively use the same metadata and have flow dependencies, such that the output of one application can be the input of another. The solution described herein applies to both "standalone" and "chained" applications.
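The flow dependency between chained applications can be sketched minimally: each application transforms its input, all applications share the same metadata, and the output of one becomes the input of the next. The application names and metadata keys here are illustrative, not part of the disclosure:

```python
def clean(records, metadata):
    # First app in the chain: drop records below a configured quality threshold.
    return [r for r in records if r["score"] >= metadata["min_score"]]

def aggregate(records, metadata):
    # Second app: sum values grouped by the configured key.
    totals = {}
    for r in records:
        key = r[metadata["group_key"]]
        totals[key] = totals.get(key, 0) + r["value"]
    return totals

def run_chain(apps, datasets, metadata):
    # Flow dependency: every app receives the previous app's output,
    # and every app sees the same metadata.
    data = datasets
    for app in apps:
        data = app(data, metadata)
    return data

metadata = {"min_score": 0.5, "group_key": "region"}
records = [{"region": "EU", "score": 0.9, "value": 10},
           {"region": "EU", "score": 0.2, "value": 99},
           {"region": "US", "score": 0.7, "value": 5}]
print(run_chain([clean, aggregate], records, metadata))
# → {'EU': 10, 'US': 5}
```

A "standalone" application is simply the degenerate case of a chain of length one.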

"Scenario" refers to the versioning of a run. A "scenario" is a unique run of a "standalone" or several "chained" Big Data applications that performs analytics to process specified input datasets using specified metadata and produces output data t...
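Since a scenario is a unique run defined by its applications, input datasets, and metadata, one plausible way to version runs is to derive a stable identifier from exactly those three ingredients. This is a sketch under that assumption; the function name and identifier scheme are not taken from the disclosure:

```python
import hashlib
import json

def scenario_id(apps, datasets, metadata):
    # Serialize everything that defines the run deterministically,
    # then hash it to obtain a stable scenario identifier.
    payload = json.dumps(
        {"apps": apps, "datasets": datasets, "metadata": metadata},
        sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

sid = scenario_id(["clean", "aggregate"],
                  ["hdfs://data/q1.csv"],
                  {"min_score": 0.5, "group_key": "region"})
print(len(sid))  # → 12
```

The same applications, datasets, and metadata always yield the same identifier, so repeated executions of one scenario can be recognized, while any change to the metadata or inputs produces a new scenario version.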