
Optimal way of using score generation and usage by processes in ETL – Data stage SMP

IP.com Disclosure Number: IPCOM000199932D
Publication Date: 2010-Sep-21

Publishing Venue

The IP.com Prior Art Database

Abstract

Score generation and score usage by all processes happen sequentially. The drawbacks of generating scores sequentially are: 1. Performance overhead in the generation and usage of the score by all processes (it sometimes takes 1 hour). 2. The operations happen sequentially. 3. The score module is not modularized, so it cannot be used by modules other than DataStage.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 17% of the total text.



Abstract

The score in the conductor and in the section leaders is serialized to disk. All the layers then de-serialize it and act on it. This adds considerable overhead, because the score is serialized and de-serialized repeatedly and the work is not done in parallel. This disclosure proposes generating and using the score in all processes on homogeneous SMP-based systems, to provide a high degree of concurrency in the ETL world. The approach provides parallelism that multi-process applications can use to scale linearly and thus deliver high throughput, with memory shared by all processes. In the proposed architecture, score generation and usage are optimized to deliver high throughput by minimizing wait latencies in processing.
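The contrast between a disk round-trip and shared memory can be sketched in Python. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names `publish_score` and `read_score`, the dictionary layout of the score, and the 8-byte length header are all hypothetical. The score is still serialized once, but every reader attaches to the same in-memory block instead of reading a file.

```python
import pickle
from multiprocessing import shared_memory


def publish_score(score: dict) -> shared_memory.SharedMemory:
    """Serialize the score once and place the bytes in a shared-memory block."""
    payload = pickle.dumps(score)
    # An 8-byte length header tells readers how many bytes to unpickle,
    # because the OS may round the block size up to a page boundary.
    shm = shared_memory.SharedMemory(create=True, size=8 + len(payload))
    shm.buf[:8] = len(payload).to_bytes(8, "little")
    shm.buf[8:8 + len(payload)] = payload
    return shm


def read_score(name: str) -> dict:
    """Attach to the block by name and rebuild the score without touching disk."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        length = int.from_bytes(bytes(shm.buf[:8]), "little")
        return pickle.loads(bytes(shm.buf[8:8 + length]))
    finally:
        shm.close()


if __name__ == "__main__":
    # Hypothetical score contents, for illustration only.
    block = publish_score({"operators": ["generator", "copy"], "partitions": 3})
    try:
        print(read_score(block.name))
    finally:
        block.close()
        block.unlink()
```

In a real SMP deployment the block name would be handed to each child process, which would call `read_score` independently and in parallel.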

INTRODUCTION

ETL is used to process large volumes of data. The architecture of one ETL tool, DataStage, is shown below.

Data stage/Process Based Architecture

[Diagram not reproduced. Recoverable labels: DSX XML Requests, OSH]

Figure 1. Data stage architecture

Score - the combination of a job and a configuration file, together with the data connections between the many processes.

OSH - Orchestrate shell, used to run a job.

Orchestrate framework - the framework used by all components of the Information Server product.
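The definition of a score as "job plus configuration" can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the `Player` class, the `build_score` function, and the node names are hypothetical, and a real score carries far more detail (schemas, data connections, partitioning methods).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Player:
    operator: str   # operator name taken from the job, e.g. "generator"
    partition: int  # index of the node (partition) the player runs on


def build_score(operators: list[str], nodes: list[str]) -> dict[str, list[Player]]:
    """Combine a job (its operators) with a configuration (its nodes).

    Each node gets one player per operator, numbered by the node's index.
    """
    return {
        node: [Player(op, index) for op in operators]
        for index, node in enumerate(nodes)
    }


if __name__ == "__main__":
    # Hypothetical two-operator job on a three-node configuration.
    for node, players in build_score(["generator", "copy"],
                                     ["node0", "node1", "node2"]).items():
        print(node, players)
```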

The next figure converts this architecture (Figure 1) to a more realistic model that includes the GUI (graphical user interface) and jobmon (the job monitor, used to monitor jobs).

[Diagram labels recovered from the extraction: process, DataStage GUI, OSH Generator, Job Monitor, Jobmon XML, Orchestrate Framework]

[Diagram not reproduced. Legend: Control Channel/TCP, Stdout Channel/Pipe, Stderr Channel/Pipe, APT_Communicator, Jobmon Connection. Nodes: GUI; Jobmon; Conductor; Section Leader,0 to Section Leader,2; generator,0 to generator,2; copy,0 to copy,2. Job shown in the diagram:]

$ osh "generator -schema record(a:int32) [par] | same | copy"

Figure 2. Architecture with peripherals - GUI and Jobmon

Where

Control channel - used to transmit information from the conductor to the section leaders and vice versa.

Stdout channel/pipe - used to send logging information generated by the lower layers (namely the players) through the pipe to the conductor.

Stderr channel/pipe - used to send error logging information generated by the lower layers (namely the players) through the pipe to the conductor.

APT_Communicator - used to transfer data between players.

Jobmon communication - used to send monitoring information to the GUI.
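The stdout and stderr channels described above behave like ordinary process pipes. The following Python sketch shows a parent process collecting both streams from one child. It is a simplification under stated assumptions: the section-leader layer is collapsed, the child is a stand-in Python one-liner rather than a real player, and the log messages and the `run_player` name are invented for illustration.

```python
import subprocess
import sys

# Stand-in for a player: one log line on stdout, one error line on stderr.
PLAYER_CODE = (
    "import sys; "
    "print('generator,0: produced 10 records'); "
    "print('generator,0: warning', file=sys.stderr)"
)


def run_player() -> tuple[str, str]:
    """Spawn one child process and collect its stdout and stderr through pipes."""
    proc = subprocess.Popen(
        [sys.executable, "-c", PLAYER_CODE],
        stdout=subprocess.PIPE,   # plays the role of the stdout channel/pipe
        stderr=subprocess.PIPE,   # plays the role of the stderr channel/pipe
        text=True,
    )
    out, err = proc.communicate()  # reads both pipes until the child exits
    return out.strip(), err.strip()


if __name__ == "__main__":
    log_line, error_line = run_player()
    print("stdout channel:", log_line)
    print("stderr channel:", error_line)
```

Keeping the two streams on separate pipes lets the parent tell normal log output from error output without parsing the messages.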

The conversion of the above diagram to parallel engine (PX) architecture terms is shown in the next section.

PX Process Based Architecture

The above architecture, reduced to a pure PX model, is shown in the figure below.

Figure 3. Architecture of PX - Process relationship

Architecture Flow


In the above diagram (Figure 3), the conductor takes the input data from the user (for PX, the user is the DataStage GUI) and spawns the respective processes (s...