System and method to support specification of mapping of raw data (generated by multiple sources) to a normalized format and to perform the format conversion for the data such that the data can be processed by downstream big data analytics applications

IP.com Disclosure Number: IPCOM000234634D
Publication Date: 2014-Jan-23
Document File: 6 page(s) / 70K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a system and method to support specification of mapping of raw data (generated by multiple sources) to a normalized format and to perform the format conversion for the data such that the data can be processed by downstream big data analytics applications.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 18% of the total text.

Big data analytics applications performing complex analytics need to consume raw data that can have a variety of formats. An example of such an application is one that determines organizational and product buzz and sentiment by analyzing data generated by social media. The raw input data for such applications can come from a variety of sources, such as social media sites and various other systems. The data can use different representational schemes such as JSON, XML, delimiter-separated files, etc. In addition, subsets of data from the raw input need to be extracted, transformed, and used as input for the analytics applications. However, the analytics applications expect the data in a certain format, referred to in this invention as "normalized input data"; the raw data is referred to as "raw input data". The raw input data can be either "data at rest" or "data in motion", and the volume of input data can be large. There can also be a proliferation of the types of data sources that such analytics applications need to consume.

The "raw input" and the "normalized input" having different formats can mean the following. Additional details will be provided in the publication.


1. Certain "top-level" entities in the "raw input" correspond to records in the "normalized input". The field values that end up in a record in the "normalized input" can come from various nested elements within the "top-level" entity in the "raw input".

For example, a certain XML element within the "root" tag in an XML format file, or a JSON object in the top-level JSON array in a JSON format file, can be a "top-level" entity.


2. The values of fields in a record in the "normalized input" can come from various nested entities within a "top-level" entity in the "raw input".


3. It may not be feasible for certain elements in a "top-level" entity to contribute to a field value in a record in "normalized input".


4. Several fields in a "top-level" entity may have the same name, so any conversion method will need to provide a mechanism for "name qualification".


5. It is conceivable that other semantics may need to be provided when mapping from "raw input" to "normalized input", which can be utilized in the downstream analytics application (such as whether a field is a pass-through field that needs to end up in the downstream output without any additional processing, or specific text mining logic that needs to be applied to the field value before it can be used as a raw input value).
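
The items above can be illustrated with a minimal Python sketch. It is not the mapping specification described in this disclosure; the spec layout (MAPPING), the dot-qualified path syntax, the "pass_through" flag, and the sample data are all hypothetical and chosen only for illustration. Each object in the top-level JSON array is treated as a "top-level" entity and becomes one normalized record, qualified paths distinguish two nested fields that share the name "name", and elements that are not mapped (here, "debug") contribute nothing.

```python
import json

# Hypothetical mapping spec: normalized field name -> qualified path inside
# a "top-level" entity, plus an optional pass-through flag (item 5).
MAPPING = {
    "source_id": {"path": "id", "pass_through": True},
    "author": {"path": "user.name"},   # qualified: user.name, not body.name
    "kind": {"path": "body.name"},     # same leaf name, different parent
    "text": {"path": "body.text"},
}

# Hypothetical raw input: a top-level JSON array of entities.
RAW = """[
  {"id": "1",
   "user": {"name": "alice"},
   "body": {"name": "post", "text": "great product"},
   "debug": {"trace": "x"}},
  {"id": "2",
   "user": {"name": "bob"},
   "body": {"name": "reply"}}
]"""


def resolve(entity, path):
    """Follow a dot-qualified path through nested dicts; None if absent."""
    current = entity
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def normalize(raw_json, mapping):
    """Convert each top-level entity into one flat normalized record."""
    entities = json.loads(raw_json)
    return [
        {field: resolve(entity, spec["path"]) for field, spec in mapping.items()}
        for entity in entities
    ]


def pass_through_fields(mapping):
    """Names of fields flagged to reach downstream output unprocessed."""
    return [name for name, spec in mapping.items() if spec.get("pass_through")]


records = normalize(RAW, MAPPING)
```

In this sketch a missing nested element (the second entity has no "body.text") yields None rather than an error, which is one possible design choice; a real specification would need to state how absent values are handled.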

This situation presents the following challenges.


1. Dedicated applications need to be implemented to perform the conversion from "raw input" to "normalized input" format that the analytics applications expect. The mapping logic need...