A real-time method using data mining to improve root cause analysis in parallel ETL streams buffering suspicious data
Publication Date: 2011-Feb-23
The IP.com Prior Art Database
In the domain of Extract-Transform-Load (ETL), a typical use case is to move data from multiple sources into a single target (data warehousing, master data management, etc.) or from one source into multiple targets (e.g. a download from an enterprise data warehouse into multiple data marts). A common pattern in this type of processing is that a portion of the same cleansing and transformation logic is executed in all parallel processing streams on a common data model. Unfortunately, when one of the sources fails during nightly batch processing from multiple sources into a single data warehouse, which can happen for a broad range of reasons (e.g. a software upgrade, or invalid data that the ETL logic does not handle), either only a partial load to the data warehouse happens or none at all. The negative impact is that the data required for analytical processing is not available in the morning. We therefore propose a method and system that uses data mining techniques in the ETL stream to detect outliers in the data that could cause problems, searching for such outliers across the parallel running ETL streams. Furthermore, when outliers are detected, our method buffers the data and alerts ETL developers and data stewards, who can then review the buffered data and, for example, abort the stream gracefully or let it continue to completion.
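The detect, buffer and review flow described above can be illustrated with a minimal sketch. The disclosure does not name a specific data mining technique, so the sketch assumes a simple robust outlier score (a modified z-score based on the median and the median absolute deviation) computed over one numeric attribute pooled across all parallel streams. The names Record, SuspectBuffer, screen and resolve, the attribute amount, and the threshold of 3.5 are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class Record:
    stream: str    # name of the parallel ETL stream (hypothetical)
    key: str       # record identifier (hypothetical)
    amount: float  # one numeric attribute of the common data model (hypothetical)


@dataclass
class SuspectBuffer:
    """Holds suspicious records and the alerts raised for reviewers."""
    held: list = field(default_factory=list)
    alerts: list = field(default_factory=list)

    def hold(self, record, score):
        self.held.append(record)
        self.alerts.append(
            f"stream={record.stream} key={record.key} score={score:.1f}"
        )


def robust_scores(values):
    """Modified z-scores from the median and median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Assumption: with no spread to compare against, nothing is flagged.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def screen(streams, threshold=3.5):
    """Score records from all streams together; pass clean ones, buffer the rest."""
    pooled = [r for records in streams.values() for r in records]
    scores = robust_scores([r.amount for r in pooled])
    passed, buffer = [], SuspectBuffer()
    for record, score in zip(pooled, scores):
        if abs(score) > threshold:
            buffer.hold(record, score)  # keep out of the load and raise an alert
        else:
            passed.append(record)
    return passed, buffer


def resolve(passed, buffer, decision):
    """Apply the reviewer's decision to the screened records.

    Assumption: 'continue' releases the buffered records into the load,
    while 'abort' stops the load and returns nothing.
    """
    if decision == "abort":
        return []
    if decision == "continue":
        return passed + buffer.held
    raise ValueError(f"unknown decision: {decision!r}")
```

Pooling the values before scoring is what lets a record from one source be judged against the behaviour of all sources, which is the cross-stream search the text describes. A production system would more likely score records incrementally as they arrive and use a richer model than a single numeric attribute.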