Near Real-Time ETL pipelines
Publication Date: 2014-May-09
The IP.com Prior Art Database
A method is disclosed to optimize the data flow architecture of micro-batch ETL by defining data change scenarios, using a data change profiler to capture a summary of the data changes, and using a rule engine to trigger the dedicated ETL pipeline accordingly.
Traditional ETL adopts a single permanent process, which does not customize the execution path for a specific change in the source.
The extraction strategy, the transformation logic, and the loading approach are fixed and cannot adapt to the particular data change in the source system. This results in redundant execution and an inefficient process.
In reality, the source system changes can differ in each interval between executions of the micro-batch ETL. Therefore, it does not make sense to unify and fix one permanent ETL process logic to handle varied source changes.
Traditional ETL logic in a micro batch is inefficient and can delay micro-batch completion because redundant processing is hard-coded. Furthermore, it can reduce the real-time capacity.
The fixed steps of extraction, transformation, and loading are defined once by developers, so it is impossible to determine the optimum approach for each execution of the micro-batch ETL. Because flexible ETL strategies that respond in real time to data source changes are lacking, unnecessary and repeated processing occurs in extraction, transformation, and loading.
The Data Change Profiler snapshots and summarizes the data changes that occurred in the source since the last interval, before the newest micro batch runs. (Scenario: the summary of data changes that occurred in the source since the last interval, which can be defined and captured by the Data Change Profiler.)
The data change summary from the profiler is sent to the rule engine, where it is matched with a certain ETL execution path (namely, the pipeline).
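The profiling step can be illustrated with a minimal Python sketch. The disclosure does not specify how changes are captured or what fields a summary holds, so the change-log format (tuples of operation code and row key), the ChangeSummary fields, and the function name profile_changes are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeSummary:
    """Hypothetical table-level summary of changes since the last micro batch."""
    table: str
    inserts: int
    updates: int
    deletes: int

    @property
    def total(self) -> int:
        return self.inserts + self.updates + self.deletes


def profile_changes(table, change_log):
    """Summarize a change log for one table.

    change_log is assumed to be an iterable of (operation, key) tuples,
    where operation is "I" (insert), "U" (update), or "D" (delete).
    Any other operation code is ignored.
    """
    counts = Counter(op for op, _key in change_log)
    return ChangeSummary(
        table=table,
        inserts=counts.get("I", 0),
        updates=counts.get("U", 0),
        deletes=counts.get("D", 0),
    )


# Example: changes observed in a hypothetical "orders" table during one interval.
log = [("I", 101), ("I", 102), ("U", 7)]
summary = profile_changes("orders", log)
```

In this sketch the summary object is what would be handed to the rule engine; a real profiler would likely read from a change-data-capture feed or compare snapshots rather than take an in-memory list.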
The rule engine is defined to match the captured scenario and route it to a dedicated pipeline.
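One possible shape for such a rule engine is sketched below in Python. The scenario is represented as a plain dictionary of change counts, and the rule names (no_change, insert_only) and the three pipeline functions are hypothetical; the disclosure does not define specific scenarios or pipelines at this point, so this shows only the matching-and-routing idea.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str                         # hypothetical scenario label
    matches: Callable[[dict], bool]   # predicate over the change summary
    pipeline: Callable[[dict], str]   # dedicated pipeline for this scenario


# Placeholder pipelines; each returns a label instead of doing real ETL work.
def skip_pipeline(scenario):
    return "skip"


def append_only_pipeline(scenario):
    return "append"


def merge_pipeline(scenario):
    return "merge"


# Rules are checked in order; the first match wins.
RULES = [
    Rule("no_change",
         lambda s: s["inserts"] + s["updates"] + s["deletes"] == 0,
         skip_pipeline),
    Rule("insert_only",
         lambda s: s["updates"] == 0 and s["deletes"] == 0,
         append_only_pipeline),
]


def route(scenario, rules=RULES, default=merge_pipeline):
    """Return the pipeline of the first rule that matches the scenario."""
    for rule in rules:
        if rule.matches(scenario):
            return rule.pipeline
    return default


# Example: an interval with only new rows is routed to the append-only path.
scenario = {"inserts": 500, "updates": 0, "deletes": 0}
result = route(scenario)(scenario)
```

Ordering the rules from most specific to most general, with a full merge as the fallback, keeps the routing predictable: a micro batch with no changes avoids running any ETL steps, and only mixed changes pay for the most expensive path.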
Detailed Description of the Invention
The Source Change Profiler
Definition of Scenario:
A scenario is summary information at the table level or data file level that describes the source changes in the int...