Method and system for Providing Data Connections between Processes to Perform ETL

IP.com Disclosure Number: IPCOM000199884D
Publication Date: 2010-Sep-20
Document File: 3 page(s) / 102K

Publishing Venue

The IP.com Prior Art Database

Abstract

A method and system are disclosed for providing data connections between parallel processes to perform Extract, Transform and Load (ETL) operations.

This is the abbreviated version, containing approximately 52% of the total text.

Disclosed is a method and system for providing data connections between parallel processes that perform Extract, Transform and Load (ETL) operations. The parallel processes may run on the same machine or on different machines. ETL operations rely on both data parallelism and task parallelism, and both are achieved by providing data connections between the parallel processes. The parallel processes then use these connections to process the data.

To provide the data connections, all the parallel processes are grouped into two sets. The grouping is based on the Least Significant Bit (LSB) of the operator number of each process. For example, operator 1, with operator number 0001, and operator 3, with operator number 0011, are grouped together because the LSB (the 0th bit) of both operator numbers is 1.
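The grouping rule can be sketched in a few lines of Python. This is only an illustration: the function name `group_by_bit` and the representation of operator numbers as plain integers are assumptions made for the sketch, not part of the disclosure.

```python
def group_by_bit(operator_numbers, bit):
    """Split operator numbers into two sets by the value of one bit.

    Hypothetical helper: returns (numbers with the bit clear,
    numbers with the bit set), preserving input order.
    """
    zeros, ones = [], []
    for op in operator_numbers:
        if (op >> bit) & 1:
            ones.append(op)
        else:
            zeros.append(op)
    return zeros, ones


# Grouping on the LSB (bit 0): operators 0b0001 and 0b0011 land together.
zeros, ones = group_by_bit([0b0000, 0b0001, 0b0010, 0b0011], bit=0)
print(zeros, ones)  # [0, 2] [1, 3]
```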

Processes in one of the two sets send requests for data connections to the processes in the other set. Meanwhile, processes in the other set wait for incoming data connection requests.
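A minimal sketch of this role assignment follows. The text does not say which of the two sets sends and which waits, so the sketch assumes that processes whose selected bit is 1 send and the rest wait. The name `assign_roles` and the `"send"`/`"wait"` labels are hypothetical.

```python
def assign_roles(operator_numbers, bit):
    """Map each operator number to "send" or "wait" for one grouping bit.

    Assumption (not stated in the disclosure): the set whose bit is 1
    sends connection requests and the set whose bit is 0 waits for them.
    """
    roles = {}
    for op in operator_numbers:
        roles[op] = "send" if (op >> bit) & 1 else "wait"
    return roles


roles = assign_roles([0, 1, 2, 3], bit=0)
print(roles)  # {0: 'wait', 1: 'send', 2: 'wait', 3: 'send'}
```

Giving the two sets opposite roles means that no two processes in a pair both wait or both send to each other within one round.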

Thereafter, a check is performed to identify pending data connection requests. If no data connection request is pending, the procedure is complete. Otherwise, all the parallel processes are re-grouped into two sets based on the value of the first bit (the bit next to the LSB) of their operator numbers, and the check for pending data connection requests is performed again. This procedure is repeated until all the data connection requests have been made. Alternatively, it may be repeated until the Most Significant Bit (MSB) of the operator number is reached.
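The round-by-round procedure might look like the following sketch. Several details are assumptions made for illustration only: each round is assumed to advance to the next higher bit, the process whose bit is 1 is assumed to be the sender, and connections are tracked per pair of operator numbers. The function name `connect_in_rounds` and its parameters are hypothetical.

```python
def connect_in_rounds(wanted, bit_width):
    """Schedule connection requests bit by bit, starting at the LSB.

    wanted    -- iterable of (a, b) operator-number pairs needing a connection
    bit_width -- number of bits in an operator number (MSB is bit_width - 1)
    Returns (schedule, pending), where schedule is a list of
    (bit, [(sender, waiter), ...]) and pending holds unconnected pairs.
    """
    pending = {frozenset(pair) for pair in wanted}
    schedule = []
    for bit in range(bit_width):
        if not pending:          # nothing left to connect: done
            break
        made = []
        for pair in sorted(pending, key=sorted):
            a, b = sorted(pair)
            a_bit, b_bit = (a >> bit) & 1, (b >> bit) & 1
            if a_bit != b_bit:   # the two ends are in different sets this round
                # Assumption: the end whose bit is 1 sends the request.
                made.append((a, b) if a_bit else (b, a))
        pending -= {frozenset(pair) for pair in made}
        schedule.append((bit, made))
    return schedule, pending


all_pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
schedule, pending = connect_in_rounds(all_pairs, bit_width=2)
for bit, made in schedule:
    print(bit, made)
# 0 [(1, 0), (3, 0), (1, 2), (3, 2)]
# 1 [(2, 0), (3, 1)]
```

Two distinct operator numbers always differ in at least one bit, so under these assumptions every wanted pair falls into opposite sets in some round no later than the MSB round.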

As an example, consider a job with four operators: Op0, Op1, Op2, and Op3. The four operators run in a three-node configuration, and each operator includes parallel processes. A process of operator m running on node n may be represented as P(m,n). For example, processes P(0,0), P(0,1), and P(0,2) belong to operator 0 (i.e. operator Op0), r...