Browse Prior Art Database

Method to embed resume capability in an ETL environment

IP.com Disclosure Number: IPCOM000200909D
Publication Date: 2010-Oct-29
Document File: 5 page(s) / 165K

Publishing Venue

The IP.com Prior Art Database

Abstract

An Extract-Transform-Load reads inputs from varied sources (like databases, file systems, XML, etc.) and performs pre-designed transformation on the data. The end result is then loaded in a data warehouse or repository. A single iteration of ETL with specific design can be termed as JOB. An ETL JOB can perform various important transactions on data with high efficiency. An ETL job is fairly vulnerable to hang or abort mainly due to server failure or network failure. Due to this, after every abort, once the job is restarted a considerable amount of rework is performed. This article proposes an approach to resume an aborted job which was expected to finish successfully. The pros and cons of the proposed method are also discussed in the paper. Once developed and applied, a resume methodology would save considerable amount of time and improve the overall runtime efficiency of ETL .This paper also discusses about the future challenges that may be encountered due to the proposed method and possible solutions to them.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 35% of the total text.

Page 01 of 5

Method to embed resume capability in an ETL environment

Introduction :

A typical ETL job consists of a source stage feeding data to a transformation stage followed by an output stage where the transformed data can be loaded. The proposed method will work for any typical ETL scenario.

(This page contains 00 pictures or other non-text object)

Figure 1 : Block diagram depicting ETL

Figure 1 is representation of a typical ETL tool .Since the user is free to design and manipulate ETL jobs as per the need, real-time jobs are far more complex. In jobs with multiple stages, it is fairly complex to capture the state of the job. State of the job involves the metadata and the job conditions before the job aborts .Retrieving the state of the job specially in situations where job design is very complex and in case where huge data transformation is involved saves a considerable amount of time and improves the overall efficiency of the ETL tool .The approach proposed will work successfully for simple scenarios and some of the complex scenarios as well. A comprehend and unbiased analysis of the approach is also been added.

Problem Statement :

Resuming an Extract-Transform-Load job, though important, is not a simple process. This is mainly because of following complexities involved:
a.If a job has multiple inputs and outputs, it is fairly difficult to maintain the state of the job. By state of the job, we mean the percentage of the job which has completed successfully.

1


Page 02 of 5

(This page contains 00 pictures or other non-text object)

Figure 2: Multiple Inputs-Multiple Outputs

b.In a single run, different rows are read from different inputs i.e. it is not necessary that same number of rows are read from all inputs

(This page contains 01 pictures or other non-text object)

Figure 3 : Multiple inputs to same stage

c.Inputs are read at different stages.

(This page contains 02 pictures or other non-text object)

Figure 4 : Multiple input at different stages

Some of the scenarios where a resume functionality is essential can be :

System reboots.

Pausing a job and resuming.

Job failure.

Job pausing during migration. Configuring job parameters during runtime.
Debugging and analysis.

Benefits of Job Resume :

Saving the rework involved in restarting the job .Rework involves capturing the state of the job , generating intermediate data/metadata involved etc .

Improvising the overall efficiency of ETL job.

Also saving considerable amount of time as the rework involved is saved due to resume process .

2


Page 03 of 5

Benefits of Job Pause :

Pausing a Job provides us a better flexibility in monitoring a job at different phases (For eg : at different data phases) .

It will be easy to capture the state of the job .The state of the job can be recorded and a report can be generated to for end users reference .

By enabling a check point run at Job level , the possibility of going back to the previous states can be done .

Check point run at Job level will improve the...