Browse Prior Art Database

A real-time method using data mining to improve root cause analysis in parallel ETL streams buffering suspicious data Disclosure Number: IPCOM000204404D
Publication Date: 2011-Feb-23
Document File: 5 page(s) / 157K

Publishing Venue

The Prior Art Database


In the domain of Extract-Transform-Load (ETL) a typical use case is to move data from multiple sources into a single target (Data Warehousing, Master Data Management, etc.) or from one source into multiple targets (e.g. download from enterprise data warehouse into multiple marts). A common pattern in these type of processing is that a portion of the same cleansing and transformation logic is executed in all parallel processing streams on a common data model. Unfortunately, in a scenario where during nightly batch processing from multiple sources into a single data warehouse and of the sources fails for a broad range of reasons (e.g. software upgrade, invalid data entered which is not handled by the ETL logic, etc. either only a partial load to the data warehouse happens or none at all with negative impacts because the required data for analytical processing is not there in the morning. What we thus proposes is a method and systems which uses data mining techniques in the ETL stream to detect outliers in the data potentially causing problems across the parallel running ETL streams by searching for outliers across them. Furthermore our method includes the buffering of data in case outliers are detected with alerts to ETL developers and data stewards who can then review the buffered data and can then for example abort the stream gracefully or let it continue to completion.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 22% of the total text.

Page 01 of 5

A real-time method using data mining to improve root cause analysis in parallel ETL streams buffering suspicious data

A very common use case today for ETL (Extract-Transform-Load) is to extract data from multiple sources, profile, cleanse, harmonize standardize, match, de-duplicate and transform it to a the data model of a target system and then load it into the target system. Today, the ETL processing is done frequently on a comprehensive software stack typically called the enterprise information integration platform. Now lets look at the following example shown in Figure 1 (simplified view on a real customer situation) to show the problem: S1 to S4 are order entry systems for a line of business. T1 is line of business data warehouse for sales reporting. Only during the night there is a window to extract and process approximately 10.000 orders from S1 to S4 and load them into the data warehouse. In the morning, the sales managers
for the various regions are running their reports to see how much business has been done in their region, by customer segment, etc. to see if the actual business is on track with the business projections done in the business plans. Unfortunately, not every morning the reports are matching the sales managers expectations because:

For example an erroneous entry in the data coming from S1 caused a data processing step in job J1 to convert the data in a way which leads to a subsequent error in job J3 causing all subsequent jobs to fail so that ultimately no data is loaded from S1 into T1.

For example due to erroneous data coming from S3 the job J2 processes the data in a way that the subsequent jobs process only a subset of the data in a fashion that its loaded into the target T1. Unfortunately, the administrator on duty saw the

job failure of job J2 and due to the limited time window for the batch process he

decides to re-run the job with tracing enabled since the log in production systems contains no useful information regarding the nature of the failure. He might not be even aware that a portion of the data made it into T1 due to the minimal log information. This is due to the fact that in production environments logging is typical set to a minimal level for performance reasons. Furthermore, even if tracing is fully enabled - the operators reading/writing to databases are unable to trace how many inserts/updates/deletes occurred during a job run nor are they able to record which record caused a processing failure of a record for a database operation. This means, if a partial load occurred the administrator has to manually compare all source records with all records which made it into the system to first find out which records did not make it. In the set of records which did not make it the administrator then needs to search for suspicious values which might have caused the issue. This is so time consuming that typical batch windows of 1-2 hours are not sufficient to find the guilty record. Unfortunately t...