Make ETL jobs fully transactional using a committer back pipeline
Publication Date: 2019-Mar-14
The IP.com Prior Art Database
It is not easy to make an ETL job fully transactional. Although some stages allow the user to start a single atomic transaction and commit it accordingly, it may still be impossible to ensure that the data is safe through the full pipeline, across all stages. Early stages (e.g. reading from a DB2 database or a Kafka topic) might not be aware of what happens to the data in later stages, so if an error occurs there, the data can be lost. The current solution, the existing EndOfWave (EOW) marker, allows the user to mark the start and end of a transaction. However, it may be implemented in only one direction, and not all stages use it, so the EOW marker can go missing at a stage that does not support it.
A back (reverse) pipeline can be used to make data handling transactional across all job stages. It runs back through all of the stages, carrying a commit message. This allows the user to decide at which job stage the data is safe (e.g. it is stored on disk) and then send a commit back through the reverse pipeline. The final stage (storing the data) would send the data's unique ID (or primary key) back through this pipeline to inform the previous stages that this data is safe and the transaction can be finished. Every job stage may check this pipeline and commit its transaction if needed. This solution improves transaction atomicity, because the full transaction is committed only once the data is safe. The backward pipeline would return information that the particular...
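The flow described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the names Stage, run_job, and back_pipeline are hypothetical, the stages run in a single process, and a plain dict stands in for the durable store. Each stage keeps a transaction open per record ID, and commits it only when that ID comes back through the reverse pipeline.

```python
from collections import deque


class Stage:
    """A job stage that keeps one open transaction per record until a commit arrives."""

    def __init__(self, name, transform=lambda record: record):
        self.name = name
        self.transform = transform  # hypothetical per-stage processing step
        self.pending = {}           # record_id -> input record, transaction still open
        self.committed = []         # record IDs whose transaction was committed

    def process(self, record_id, record):
        # Open a transaction for this record, then do the stage's work.
        self.pending[record_id] = record
        return self.transform(record)

    def on_commit(self, record_id):
        # Commit only if this stage still holds an open transaction for the ID.
        if record_id in self.pending:
            del self.pending[record_id]
            self.committed.append(record_id)


def run_job(stages, records, store):
    """Push records forward through the stages, then send commits backward."""
    back_pipeline = deque()  # reverse channel carrying IDs of records that are safe

    for record_id, record in records:
        try:
            for stage in stages:
                record = stage.process(record_id, record)
            store[record_id] = record        # final stage: the data is now safe
            back_pipeline.append(record_id)  # announce the safe ID to earlier stages
        except Exception:
            # No commit is sent, so the stages that saw the record keep it pending.
            continue

    # Drain the reverse pipeline: each stage, last to first, checks every commit.
    while back_pipeline:
        record_id = back_pipeline.popleft()
        for stage in reversed(stages):
            stage.on_commit(record_id)
```

In this sketch, a record that fails in a later stage never produces a commit message, so it remains in the pending map of every stage that saw it and could be retried or rolled back. A real job would replace the in-memory deque with whatever channel connects its stages.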