
A method to quickly configure the parallel setting for ETL data from/to a database

IP.com Disclosure Number: IPCOM000245266D
Publication Date: 2016-Feb-24
Document File: 4 page(s) / 82K

Publishing Venue

The IP.com Prior Art Database

Abstract

Parallel data processing is widely used to divide the same work among many parallel processes or threads. Multiple instances of a data flow are created (called partitions), each processing some fraction of the overall data set in parallel. By using this technique, ETL tools can scale the data flow by creating more and more partitions to extract from the source and load the data into the target, as long as machine resources allow. This provides scalable performance in the Extract and Load processes. However, various performance tests and practical experience show that performance does not always scale linearly as parallelism increases. Finding the optimal configuration (the degree of parallelism) is therefore an important part of performance tuning, and the problem is how to determine that configuration easily.

To address this problem, this invention proposes a new method that only needs to run the ETL workload once before determining the best configuration, instead of running long and tedious tests. The fundamental principle of the new technique is as follows:

1. Utilize the existing database performance curve (curve f1(x), where x is the degree of parallelism) for extracting or loading data, generated by the native database CLI (a sketch of this step follows the list).
2. Run the ETL workload with a small degree of parallelism, capture the performance metrics (I/O, CPU, memory, network), and extrapolate the curve (curve f2(x), where x is the degree of parallelism) of the ETL product's performance.
3. Using curve f1(x) and curve f2(x), calculate the best parallelism configuration for using the ETL tool to extract data from or load data to the database.
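As a rough illustration of step 1, the sketch below drives a native bulk-load command at several degrees of parallelism and records the MB/s samples that make up f1(x). The command name, its flags, and the file names are hypothetical placeholders, not part of this disclosure; only the shape of the measurement loop is meant to be illustrative.

import os
import subprocess
import time

def data_size_mb(path):
    """Size of the input file in MB (decimal)."""
    return os.path.getsize(path) / 1_000_000

def measure_f1(parallelism_values, data_file):
    """Return {x: MB/s} for a native bulk load run at each parallelism x."""
    f1_samples = {}
    for x in parallelism_values:
        start = time.monotonic()
        # "db_native_load" and its flags stand in for the database's own
        # bulk-load CLI; replace them with the real utility and options.
        subprocess.run(
            ["db_native_load", "--parallel", str(x), "--input", data_file],
            check=True,
        )
        elapsed = time.monotonic() - start
        f1_samples[x] = data_size_mb(data_file) / elapsed
    return f1_samples

# Example: f1_samples = measure_f1([1, 2, 4, 8, 16, 32], "orders.dat")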

This technique eases the process of finding an optimized parallelism configuration when performing ETL on massive data from or to a database.



Parallel data processing is widely used to divide the same work among many parallel processes or threads. Multiple instances of a data flow are created (called partitions), each processing some fraction of the overall data set in parallel. In the ETL field, this technique introduces the ability to scale the data flow by creating more and more partitions to extract from the source and load the data into the target, as long as machine resources allow. This helps achieve scalable performance in the extract or load process.

However, various performance tests and practical experience show that it is not guaranteed that the best performance will be achieved with either a higher or a lower degree of parallelism. So the problem is how to determine the right configuration (the degree of parallelism) conveniently.

This disclosure introduces a new technique to suggest the best parallelism configuration when running parallel ETL data flows from or to a database. It only needs to run the ETL workload once before determining the best result, which sharply reduces the effort required for performance optimization.


1. Utilize the existing database performance curve (curve f1(x), where x is the degree of parallelism) for extracting or loading data, generated by the native database CLI.

2. Run the ETL workload with a small degree of parallelism, capture the performance metrics (I/O, CPU, memory, network), and extrapolate the curve (curve f2(x), where x is the degree of parallelism) of the ETL product's performance.

3. Using curve f1(x) and curve f2(x), calculate the best parallelism configuration for using the ETL tool to extract data from or load data to the database (see the sketch after this list).
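The following sketch illustrates one plausible reading of steps 2 and 3, assuming an Amdahl-style saturation model for the ETL-side curve f2(x), assuming SciPy is available for the fit, and assuming the effective end-to-end throughput is bounded by the smaller of the two curves. The model, helper names, and sample numbers are illustrative assumptions, not part of this disclosure; f1_curve is assumed to be a callable built from the step-1 samples (for example via numpy.interp).

import numpy as np
from scipy.optimize import curve_fit

def amdahl(x, t1, s):
    """Throughput at parallelism x: single-stream rate t1, serial fraction s."""
    return t1 / (s + (1.0 - s) / x)

def extrapolate_f2(measured_f2):
    """Fit f2(x) from a few low-parallelism measurements {x: MB/s}."""
    xs = np.array(sorted(measured_f2), dtype=float)
    ys = np.array([measured_f2[x] for x in sorted(measured_f2)], dtype=float)
    (t1, s), _ = curve_fit(amdahl, xs, ys, p0=[ys[0], 0.1],
                           bounds=([0.0, 0.0], [np.inf, 1.0]))
    return lambda x: amdahl(x, t1, s)

def best_parallelism(f1_curve, f2_curve, candidates):
    """Smallest x whose effective rate min(f1, f2) is within 1% of the best."""
    effective = {x: min(f1_curve(x), f2_curve(x)) for x in candidates}
    best_rate = max(effective.values())
    return min(x for x, rate in effective.items() if rate >= 0.99 * best_rate)

# Example (illustrative numbers only):
# f2_curve = extrapolate_f2({1: 95.0, 2: 180.0, 4: 330.0})  # MB/s at x = 1, 2, 4
# x_opt = best_parallelism(f1_curve, f2_curve, range(1, 65))

Choosing the smallest parallelism that reaches the plateau of min(f1(x), f2(x)) avoids spending extra partitions after either the database side or the ETL side has become the bottleneck.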

[Figure: Load performance study]

In one of the connector performance tests of the ETL product (in both the ETL performance study last year and the ETL performance regression this month), the maximum throughput for Load is about 2.35 TB/hour, which is about 650 MB/s, and we never reach the maximum throughput capacity of the database server. In fact, this is more than 20% below the maximum database server capacity of about 850 MB/s.
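A quick arithmetic check of the quoted figures, assuming decimal units:

tb_per_hour = 2.35
mb_per_s = tb_per_hour * 1_000_000 / 3600      # ~= 653 MB/s, i.e. "about 650 MB/s"
db_ceiling_mb_s = 850.0                        # stated max database server capacity
gap = 1.0 - 650.0 / db_ceiling_mb_s            # ~= 0.235, i.e. more than 20% below
print(round(mb_per_s), round(gap * 100, 1))    # 653 23.5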

We did the following analysis, which may indicate a connector performance capacity issue when the ETL product is configured with a smaller degree of parallelism.

Schematic Diagram


For the connector, the design is process based, and each process has only one thread. This means one connector process can utilize the resources of one CPU core thread. With respect to the parallelism configuration of the ETL product, the number of connector processes equals the configured degree of parallelism, which means the number of CPU cores that can be utilized also equals the configured degree of parallelism. This leads the connector to have the ve...
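A rough capacity model consistent with the process-per-partition behaviour described above is sketched below; the per-process rate and the core count are illustrative assumptions, not figures from this disclosure.

def connector_ceiling(n_parallel, cpu_cores, per_process_mb_s, db_ceiling_mb_s):
    """Estimate the achievable load rate for a given parallel configuration."""
    usable_processes = min(n_parallel, cpu_cores)    # one CPU core per connector process
    etl_side = usable_processes * per_process_mb_s   # ETL-side ceiling (f2-like)
    return min(etl_side, db_ceiling_mb_s)            # bounded by the database ceiling

# Example: 8 partitions on a 32-core host, ~90 MB/s per process, 850 MB/s DB ceiling
# connector_ceiling(8, 32, 90.0, 850.0)  -> 720.0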