Browse Prior Art Database

Technique for Storing Data Sets of an ETL tool in Client side data store

IP.com Disclosure Number: IPCOM000235830D
Publication Date: 2014-Mar-26
Document File: 2 page(s) / 36K

Publishing Venue

The IP.com Prior Art Database

Abstract

A Data Set in a job is the unit of data that serves as a staging area while data moves from one source to another; a Data Set can also be a target. While performing operations such as a transformation or a lookup, data is read from the source every time it is referred to.




A Data Set in a job is a unit of data that moves from one link to another. In its persistent form, a Data Set is an operating-system file that can be used by other DataStage jobs. Currently, data is stored in a Data Set in the order in which it arrives, fetching all data columns, and these Data Sets are stored in the file system, normally in proprietary formats. This limits how efficiently Data Sets can be used, because additional tooling mechanisms, such as indexing, must be built for faster access. This disclosure describes a technique for storing Data Sets in an engine-side data store, for example Apache Derby, an open-source database. The Data Sets are stored against a common database schema along with pre-created table indexes.
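The common-schema idea above can be sketched as generated Derby DDL: one table per persisted Data Set under a shared schema, with an index created up front so lookups need not scan the whole Data Set. The schema name `DS_STORE`, the `ROW_ID` column, and the VARCHAR column typing are illustrative assumptions, not part of the disclosure.

```java
import java.util.List;

// Sketch: build the Derby DDL under which Data Sets would be persisted.
// All names here (DS_STORE, ROW_ID, IDX_ prefixes) are hypothetical.
public class DataSetDdl {

    // One CREATE TABLE per persisted Data Set, under the shared schema.
    static String createTable(String dataSet, List<String> columns) {
        StringBuilder sb = new StringBuilder(
            "CREATE TABLE DS_STORE." + dataSet + " (ROW_ID BIGINT NOT NULL");
        for (String c : columns) {
            sb.append(", ").append(c).append(" VARCHAR(4000)");
        }
        return sb.append(")").toString();
    }

    // Pre-created index so references to the Data Set avoid full scans.
    static String createIndex(String dataSet, String column) {
        return "CREATE INDEX IDX_" + dataSet + "_" + column
             + " ON DS_STORE." + dataSet + " (" + column + ")";
    }

    public static void main(String[] args) {
        System.out.println(createTable("CUSTOMERS", List.of("CUST_ID", "NAME")));
        System.out.println(createIndex("CUSTOMERS", "CUST_ID"));
    }
}
```

A real implementation would execute these statements through Derby's embedded JDBC driver when the Data Set is first persisted.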

An ETL tool performs data transformation and data movement from source systems to target systems, in batch and in real time. The data sources might include indexed files, sequential files, relational databases, archives, and external data sources. Some of the transformations in an ETL work process, otherwise called a job, can involve the following:


- String and numeric formatting and data type conversions.

- Business derivations and calculations by applying business rules and algorithms to the data.

- Conversion of reference data from various sources to a common reference set, creating consistency across these systems.
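The first two transformation kinds above can be sketched in a few lines: a type conversion from a raw source string to a numeric value, followed by a business derivation applied to it. The discount rule is a made-up example, not taken from the disclosure.

```java
// Sketch of two of the job steps listed above: a data type conversion
// and a business derivation. The discount rule is hypothetical.
public class TransformDemo {

    // Type conversion: a string field from the source becomes numeric.
    static double toAmount(String raw) {
        return Double.parseDouble(raw.trim());
    }

    // Business derivation: apply a rule to the converted value.
    static double applyDiscount(double amount) {
        return amount >= 100.0 ? amount * 0.9 : amount;
    }

    public static void main(String[] args) {
        double amount = toAmount(" 120.50 ");   // formatting + conversion
        System.out.println(applyDiscount(amount)); // derived target value
    }
}
```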

Fetching the data from the data sources constitutes a reasonable amount of the overall elapsed time of the job, and this data has to be moved between the various components within the ETL job. These components are popularly called Stages in products such as IBM InfoSphere DataStage. To cut down the elapsed time and ease job execution, the data fetched from the various data sources is stored in temporary units called Data Sets, which are used across stages and jobs.
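The reuse described above, fetching once and serving later stages from the stored Data Set, can be sketched with a lazily materialised cache. The `Supplier`-based "source" is a stand-in assumption; a real job would read from a database or file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Sketch: materialise source rows into a Data Set once, then reuse them
// across stages instead of re-reading the source each time.
public class DataSetCache {
    private List<String[]> rows;   // the materialised Data Set
    int sourceReads = 0;           // counts hits against the source

    // Fetch lazily, at most once; later stages reuse the stored rows.
    List<String[]> rows(Supplier<List<String[]>> source) {
        if (rows == null) {
            sourceReads++;
            rows = new ArrayList<>(source.get());
        }
        return rows;
    }

    public static void main(String[] args) {
        DataSetCache ds = new DataSetCache();
        Supplier<List<String[]>> source = () -> List.of(
            new String[] {"1", "Alice"}, new String[] {"2", "Bob"});
        ds.rows(source);           // first stage: source is read
        ds.rows(source);           // second stage: served from the Data Set
        System.out.println("source reads: " + ds.sourceReads);
    }
}
```

With the engine-side store of this disclosure, the materialised rows would live in indexed Derby tables rather than in memory, so the reuse also survives across jobs.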
