Method and system to allow intelligent insertion of stages into an ETL job design

IP.com Disclosure Number: IPCOM000240526D
Publication Date: 2015-Feb-05
Document File: 4 page(s) / 84K

Publishing Venue

The IP.com Prior Art Database

Abstract

This article describes how to enable an ETL developer to insert a new processing stage between two existing stages of a job design without having to reconstruct metadata that would otherwise be lost as a result of breaking the joining link into two pieces. It includes detailed algorithms for how to: (a) check that the inserted stage will not lead to invalid job topology; (b) preserve parallelism and partitioning rules where possible; (c) transfer properties that implement certain other rules to the new objects, where applicable.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 37% of the total text.


At its simplest level, an ETL (Extract, Transform, Load) job is a process that reads data from one source (such as a database), transforms it (for example, removing trailing spaces), and finally writes it to a target (such as a file). An ETL job design consists of one or more stages, each performing a discrete function such as reading from a database, sorting data, or merging data. ETL job designs vary from the simple, consisting of only a handful of stages, to the complex, where the number of stages can exceed 100.

    An ETL job design is constructed by the ETL developer dragging and dropping these stages onto the ETL job design canvas and linking them together to meet a particular requirement. In many ETL applications, if the developer needs to insert a new stage between two existing stages, they must manually disconnect the existing stages from each other, add the new stage, and then reconnect the stages. In doing so, they lose the metadata and settings that were associated with the existing link between those two stages.

    For the purposes of this disclosure, the metadata and settings associated with a link are defined as:
· Source schema
· Target schema
· Mapping rules between the source and target schema (for example input field A is renamed to B on the output)

· Sort rules (key, direction, case sensitivity)

· Partitioning rules if the target stage runs in parallel mode (partitioning algorithm and associated settings)

· Collection rules if the target stage runs in sequential mode (collection algorithm and associated settings)

· Runtime optimisation parameters (buffering algorithm, buffer size)
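The link metadata enumerated above can be modeled as a simple record. The following is a minimal sketch in Python; all class names, field names, and default values are illustrative assumptions, not taken from the disclosure:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SortRule:
    """One sort rule applied on a link (key, direction, case sensitivity)."""
    key: str
    direction: str = "asc"        # "asc" or "desc"
    case_sensitive: bool = True

@dataclass
class LinkMetadata:
    """Illustrative model of the metadata and settings carried by a link."""
    source_schema: list[str]                                     # columns produced by the source stage
    target_schema: list[str]                                     # columns expected by the target stage
    mapping_rules: dict[str, str] = field(default_factory=dict)  # e.g. {"A": "B"}: input A renamed to B
    sort_rules: list[SortRule] = field(default_factory=list)
    partitioning: Optional[str] = None      # set when the target stage runs in parallel mode
    collection: Optional[str] = None        # set when the target stage runs in sequential mode
    buffering_algorithm: str = "auto"       # runtime optimisation parameters
    buffer_size_kb: int = 1024
```

This is the state that is lost when a link is manually deleted, and that the system described here must carry over when the link is split.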

    It would be more productive if a new stage could be intelligently inserted into an existing link while preserving this metadata; hence a new solution is required.

    Our invention allows the ETL developer to drag and drop a new stage onto a link connecting two existing stages together. The system will automatically break the link into two pieces, preserving the metadata and settings, insert the new stage and connect it to the two existing stages.
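The split-and-reconnect step can be sketched as follows. This is a hypothetical illustration using plain dictionaries for the job, stages, and links (none of these names come from the disclosure); the key point is that the original link's metadata is duplicated onto both halves so nothing is lost:

```python
import copy

def insert_stage(job, link, new_stage):
    """Replace `link` (source -> target) with source -> new_stage -> target,
    copying the original link metadata onto both new links."""
    upstream = {
        "source": link["source"],
        "target": new_stage,
        "metadata": copy.deepcopy(link["metadata"]),
    }
    downstream = {
        "source": new_stage,
        "target": link["target"],
        "metadata": copy.deepcopy(link["metadata"]),
    }
    job["links"].remove(link)           # break the original link
    job["links"] += [upstream, downstream]
    job["stages"].append(new_stage)     # insert and connect the new stage
    return upstream, downstream

# Example: dropping a "remove_spaces" stage onto the read -> write link.
job = {"stages": ["read_db", "write_file"],
       "links": [{"source": "read_db", "target": "write_file",
                  "metadata": {"sort_key": "id"}}]}
up, down = insert_stage(job, job["links"][0], "remove_spaces")
```

In practice the metadata copied to each half would then be filtered by the preservation rules described below, since the inserted stage may not support every setting.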

    The novelty in this invention arises from the non-trivial algorithm it uses to preserve as much of the original link metadata as possible:
· The degree of parallelism and associated partitioning/collection rules for the link are preserved, where the stage being inserted allows
· Any sorting being performed on the link is preserved
· Mapping rules between the original source and target stages of the link are preserved, where the stage being inserted allows
· Non-optimal insertions are flagged to the user
· Runtime optimisations are maintained
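A preservation check of this kind might look like the sketch below. The capability flags and setting names are assumptions introduced for illustration only; the idea is that each link setting survives only if the inserted stage supports it, and anything dropped is flagged to the user as a non-optimal insertion:

```python
def preserve_link_settings(link_meta, stage_caps):
    """Filter a link's metadata by the inserted stage's capabilities.

    Returns the settings that can be preserved, plus warnings for any
    that had to be dropped (i.e. non-optimal insertions)."""
    preserved = dict(link_meta)
    warnings = []
    # Parallelism: partitioning rules only survive if the stage runs in parallel.
    if link_meta.get("partitioning") and not stage_caps.get("parallel", False):
        preserved.pop("partitioning")
        warnings.append("inserted stage runs sequentially; partitioning rules dropped")
    # Mapping rules only survive if the stage can carry column mappings.
    if link_meta.get("mapping") and not stage_caps.get("supports_mapping", False):
        preserved.pop("mapping")
        warnings.append("inserted stage cannot carry mapping rules; mapping dropped")
    return preserved, warnings
```

The returned warnings correspond to the "non-optimal insertions are flagged to the user" behaviour listed above.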
The system incorporates a classification of stage types into the following categories, based on particular characterisations that are important to this disclosure:

Data set reduction

    A 'data set reduction' stage is one whereby the number of out...