Method and System for Behavioral Testing of Big Data Pipelines using Statistical Distributions

IP.com Disclosure Number: IPCOM000240626D
Publication Date: 2015-Feb-13
Document File: 2 page(s) / 19K

Publishing Venue

The IP.com Prior Art Database

Related People

Joshua Walters: INVENTOR [+3]

Abstract

A method and system are disclosed for behavioral testing of big data pipelines using statistical distributions. The method and system perform behavioral testing on big data pipelines by using the statistical distributions of record columns to assert business rules via named aggregates.

This text was extracted from a Microsoft Word document.
This is the abbreviated version, containing approximately 52% of the total text.

Description

Behavioral tests are required to implement continuous delivery (CD) of data pipelines and to certify core business logic for enforcing output data sets. Data sets are split into columns of typed values (e.g. Strings, Integers, Booleans), and each column adheres to certain predefined properties. These columns can also develop relationships with other columns based on certain predefined conditions. Due to the large volume of data in the pipeline, many columns follow some statistical distribution (e.g. for all records in a given hour, a column value averages 46 with a standard deviation of 5, or a column follows a Poisson distribution with certain parameters). There is no standard way to perform these checks on data sets to ensure the correctness of the data pipeline. Most projects have used ‘gold’ data sets: they pre-compute all the expected values for a specially saved data set and then compare them to the data pipeline's output on the same data set. There is a need for a platform/tool that performs behavioral testing of big data pipelines.
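
As an illustration of the kind of check described above, the following minimal Python sketch tests whether one column's values for a given hour stay close to an expected mean and standard deviation. The function name check_column_distribution, the tolerance parameter, and the sample values are assumptions made for this example; they are not taken from the disclosure.

```python
import statistics


def check_column_distribution(values, expected_mean, expected_stdev, tolerance):
    """Return True when the sample mean and the sample standard deviation
    of `values` each lie within `tolerance` of the expected figures.

    Hypothetical helper for illustration only; the disclosure does not
    specify how the comparison is made.
    """
    observed_mean = statistics.fmean(values)
    observed_stdev = statistics.stdev(values)  # sample standard deviation
    return (
        abs(observed_mean - expected_mean) <= tolerance
        and abs(observed_stdev - expected_stdev) <= tolerance
    )


# Made-up values standing in for one column over one hour of records.
hourly_values = [41, 46, 51, 39, 53, 46, 44, 48, 40, 52]

# Passes: the sample is centered on 46 with a spread of roughly 5.
print(check_column_distribution(hourly_values, 46, 5, tolerance=1.0))

# Fails: every value has drifted upward by 10.
drifted = [v + 10 for v in hourly_values]
print(check_column_distribution(drifted, 46, 5, tolerance=1.0))
```

Unlike a ‘gold’ data set comparison, a check of this kind needs no pre-computed expected output for a saved input; it needs only the expected distribution parameters.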

Disclosed is a method and system for behavioral testing of big data pipelines using statistical distributions. The method and system allow enforcement of business rules on live data sets, which enables a user to monitor parameters such as, but not limited to, input data schema changes and column value changes. Enforcement of business rules, while leveraging statistical distribut...
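
One way to picture monitoring for input schema changes and column value changes is sketched below in Python. The sketch first compares each record's schema with an expected schema, then evaluates business rules expressed as bounds on named aggregates. The AGGREGATES table, the evaluate_rules function, the rule tuple layout, and the column names are all hypothetical; the disclosure does not define a concrete interface.

```python
import statistics

# Hypothetical named aggregates: each name maps to a function over a column.
AGGREGATES = {
    "mean": statistics.fmean,
    "stdev": statistics.stdev,
    "max": max,
}


def evaluate_rules(records, expected_schema, rules):
    """Return a list of violation messages; an empty list means all checks pass.

    `expected_schema` maps column name -> type name (e.g. "int").
    Each rule is a tuple (column, aggregate_name, low, high).
    """
    violations = []

    # Schema check: flag any record whose columns or types differ.
    for index, record in enumerate(records):
        actual = {name: type(value).__name__ for name, value in record.items()}
        if actual != expected_schema:
            violations.append(f"record {index}: schema changed to {actual}")
    if violations:
        return violations  # aggregates are unreliable on a changed schema

    # Business rules: each named aggregate must fall inside its bounds.
    for column, aggregate, low, high in rules:
        value = AGGREGATES[aggregate]([r[column] for r in records])
        if not low <= value <= high:
            violations.append(
                f"{aggregate}({column}) = {value:.2f} outside [{low}, {high}]"
            )
    return violations


# Made-up schema, rules, and records for illustration.
schema = {"latency_ms": "int", "status": "str"}
rules = [("latency_ms", "mean", 41, 51), ("latency_ms", "stdev", 0, 10)]
records = [{"latency_ms": v, "status": "ok"} for v in (41, 46, 51, 39, 53)]

print(evaluate_rules(records, schema, rules))
```

In this sketch a schema change stops the evaluation early, on the assumption that aggregates computed over columns of an unexpected type would not be meaningful.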