
Automatic (Dynamic) Generation of Node Configuration File for ETL Tools (DataStage)

IP.com Disclosure Number: IPCOM000200906D
Publication Date: 2010-Oct-29
Document File: 5 page(s) / 133K

Publishing Venue

The IP.com Prior Art Database

Abstract

Dynamic generation of an optimal node configuration file for DataStage on all supported platforms. The proposed solution can be used in multiple scenarios, including new DataStage installations, resolving bottlenecks caused by an improper node configuration file on existing systems, and achieving better performance on already running systems.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 34% of the total text.

Page 01 of 5


Within any ETL process, jobs developed on a specific machine often need to be ported/shipped to a different machine with a different number of nodes. This makes them unusable, or a lot of time is wasted identifying the nodes recorded in the data sets and reconfiguring the apt file on the target machine to match those values. Also, after DataStage or any other ETL tool is installed, proper setup of the node configuration file, scratch disk, temp location, etc. is vital to attaining optimum performance, and this requires skillful decisions based on the hardware configuration.
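For context, a DataStage parallel node configuration file declares each logical node together with its host name, node pools, and disk and scratch-disk resources. The two-node fragment below is purely illustrative; the host name and paths are assumed:

```
{
    node "node1"
    {
        fastname "etlhost"
        pools ""
        resource disk "/opt/ds/datasets" {pools ""}
        resource scratchdisk "/opt/ds/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etlhost"
        pools ""
        resource disk "/opt/ds/datasets" {pools ""}
        resource scratchdisk "/opt/ds/scratch" {pools ""}
    }
}
```

The proposed mechanism would emit a file of this shape, sized to the hardware it detects.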

The proposal is a dynamic file generation mechanism that generates the best-suited configuration file for optimum performance on the given platform/hardware at fresh install time. For any mismatch found between the nodes defined on the system and the data sets within a job (the case of a job imported from another source), the proposed mechanism (tool) should detect the situation and automatically adopt the node configuration from the job concerned, generating a temporary node file on the fly to be used with the job in question.
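The detection-and-adoption step above can be sketched in a few lines. This is a minimal illustrative sketch, not DataStage code: the node-counting regex, the `.apt` suffix, and the function names are all assumptions, and a real tool would compare full node definitions rather than just counts.

```python
# Hypothetical sketch: compare the engine's active configuration file with
# the configuration embedded in a data set descriptor; on mismatch, emit a
# temporary configuration file matching the data set, as the text proposes.
import re
import tempfile

def count_nodes(config_text):
    """Count logical node definitions in an APT-style configuration file."""
    return len(re.findall(r'\bnode\s+"', config_text))

def make_temp_config(dataset_config_text):
    """Write the data set's embedded configuration to a temporary file so
    the imported job can run with the node layout the data set expects."""
    tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".apt", delete=False)
    tmp.write(dataset_config_text)
    tmp.close()
    return tmp.name

def resolve_config(system_config, dataset_config):
    """Return the path of a temporary config for the job, or None when the
    system configuration already matches the data set."""
    if count_nodes(system_config) == count_nodes(dataset_config):
        return None  # no mismatch: use the system configuration as-is
    return make_temp_config(dataset_config)

# A 2-node system running a job whose data set was created on 4 nodes:
system_cfg = '{ node "node1" { } node "node2" { } }'
dataset_cfg = ('{ node "node1" { } node "node2" { } '
               'node "node3" { } node "node4" { } }')
path = resolve_config(system_cfg, dataset_cfg)
print(path is not None)  # mismatch detected, temporary config generated
```

In practice the embedded configuration is read from the data set's descriptor file, which (as described below) stores a copy of the configuration used at creation time.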

NOTE: The proposed mechanism is valid for any available ETL tool, but the details provided below use DataStage as a reference point.

DataStage parallel extender jobs use data sets to store the data being operated on in a persistent form. Data sets are operating system files, each referred to by a descriptor file. The following properties of data sets make them strictly follow the node configuration under which they were created:

The descriptor file for a data set contains the following information:

• Data set header information.

• Creation time and date of the data set.

• The schema of the data set.

• A copy of the configuration file used when the data set was created.

As data sets are a partitioned data type, they are stored across multiple disks on the system. A data set is organized in terms of partitions and segments. Each partition of a data set is stored on a single processing node. Each data segment contains all the records written by a single DataStage job. So a segment can contain files from many partitions, and a partition has files from many segments. The following diagram illustrates the scenario:


(Diagram: a data set organized into partitions and segments)

As shown above, each partition has data from multiple segments, and a single segment's data is stored on multiple partitions. We can conclude that because data sets are stored on multiple partitions, as defined by the node configuration file, they strictly follow the node constraints.
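The partition/segment layout described above can be pictured as a grid, with one on-disk data file per (partition, segment) cell. The short sketch below illustrates only that cross-relationship; the file-naming scheme and counts are assumptions, not DataStage's actual layout:

```python
# Illustrative only: model a data set as a grid of (partition, segment)
# cells, one data file per cell. Naming is invented for the example.
partitions = 4   # one partition per processing node in the config file
segments = 2     # one segment per job run that wrote to the data set

files = {(p, s): f"dataset.part{p}.seg{s}"
         for p in range(partitions) for s in range(segments)}

# A single segment spans every partition...
segment0 = [f for (p, s), f in files.items() if s == 0]
# ...and a single partition holds files from every segment.
partition0 = [f for (p, s), f in files.items() if p == 0]
print(len(segment0), len(partition0))  # 4 2
```

This is why a job cannot simply read a data set under a configuration with a different node count: the number of partitions is baked into the files on disk.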

A big advantage of this approach is the enormous I/O performance gained from parallel processing. It may not be feasible to change the architecture itself, as that would bring many other considerations and complexities.

A better approach and simpl...