Method and System for Ad-Hoc Debugging of Extraction and Integration Pipelines

IP.com Disclosure Number: IPCOM000197952D
Publication Date: 2010-Jul-23
Document File: 4 page(s) / 94K

Publishing Venue

The IP.com Prior Art Database

Related People

Alpa Jain: INVENTOR [+3]

Abstract

A method and system for ad-hoc debugging of extraction and integration pipelines is disclosed. More specifically, a generic framework for provenance based ad-hoc debugging of extraction and integration pipelines is disclosed.

This text was extracted from a Microsoft Word document.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 29% of the total text.

Description

Disclosed is a method and system for provenance based ad-hoc debugging of extraction and integration pipelines.

Generally, when building debugging applications (debuggers) for an Information Extraction (IE) pipeline, a critical task is tracing and linking the output records of each operator and understanding how those records are transformed across the different operators.

The system, a Provenance Based Debugger (PROBER), includes a provenance model for IE pipelines that traces the lineage of output records from each operator, including arbitrary (also called “black-box”) operators. The PROBER runs a post-execution analysis of IE pipelines consisting of arbitrary operators. The arbitrary operators (hereinafter, “operators”) may be of a variety of types. For example, they may include operators for which full specifications are available as well as operators for which no specifications are available.

The provenance model utilizes various properties of the operators that can be learned by sampling, for example, whether an operator is monotonic, one-to-one, one-to-many or arbitrary. The provenance model then identifies connections between the data items of the diverse set of operators found in real-world IE pipelines. Thereafter, the system rigorously examines methods to build provenance information for combinations of these operators. The system utilizes a suite of algorithms to build provenance for each instance of the post-execution analysis of the IE pipeline. The suite of algorithms explores a tradeoff between the efficiency of building the provenance and the amount of information captured.
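As an illustration of how operator properties might be learned by sampling, the following minimal sketch probes a black-box operator with random input subsets. It is not taken from the disclosure: the function names, the modelling of an operator as a function from a set of records to a set of records, and the working definitions used here (monotonic: a larger input never loses output records; one-to-one: each input record yields at most one output record, independently of the other records) are all assumptions made for the example. Because the checks are sample based, a `True` result only means that no counterexample was found.

```python
import random
from typing import Callable, FrozenSet, Hashable, Iterable, Set

Record = Hashable
# Hypothetical model of a black-box operator: a set of input records in,
# a set of output records out.
Operator = Callable[[FrozenSet[Record]], Set[Record]]


def looks_monotonic(op: Operator, inputs: Iterable[Record],
                    trials: int = 200, seed: int = 0) -> bool:
    """Sample nested input subsets and check that op(smaller) <= op(bigger)."""
    rng = random.Random(seed)
    pool = list(inputs)
    for _ in range(trials):
        bigger = frozenset(r for r in pool if rng.random() < 0.5)
        smaller = frozenset(r for r in bigger if rng.random() < 0.5)
        if not op(smaller) <= op(bigger):
            return False  # counterexample: a larger input lost an output record
    return True  # no counterexample found in the sampled subsets


def looks_one_to_one(op: Operator, inputs: Iterable[Record],
                     trials: int = 200, seed: int = 0) -> bool:
    """Check that each record alone yields at most one output record and that
    sampled subsets yield exactly the union of the per-record outputs."""
    rng = random.Random(seed)
    pool = list(inputs)
    single = {r: op(frozenset([r])) for r in pool}
    if any(len(out) > 1 for out in single.values()):
        return False  # some record produces several outputs on its own
    for _ in range(trials):
        sample = frozenset(r for r in pool if rng.random() < 0.5)
        expected = set().union(*(single[r] for r in sample))
        if op(sample) != expected:
            return False  # output depends on more than individual records
    return True


# Illustrative operators (assumptions, not operators from the disclosure).
def upper(recs):
    return {r.upper() for r in recs}


def tokenize(recs):
    return {tok for r in recs for tok in r.split()}


def few_only(recs):
    # Emits a flag only while the input is small, so it is not monotonic.
    return {"few"} if len(recs) < 3 else set()


docs = ["new york", "paris", "san jose", "rome", "oslo", "lima"]
```

In this sketch, `upper` passes both checks, `tokenize` passes the monotonicity check but fails the one-to-one check, and `few_only` fails the monotonicity check once a sampled pair straddles the size threshold.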

The provenance model is based on minimal subsets of operator inputs that capture necessary information. This basic model is then extended to operators where multiple minimal subsets may exist. To define the provenance of an IE pipeline P, the system first defines the provenance for each operator in P. The system defines the provenance of an extraction operator O based on the provenance of each output record r of O. The provenance of an output record r represents precisely the set of records that contributed to r. The provenance model utilizes the following definition of a Minimal Subset (MISet) of records. For a given operator O, its input I and output R, $I' \subseteq I$ is considered a MISet of $r \in R$ if and only if $r \in O(I')$ and $r \notin O(I'')$ for every $I'' \subset I'$. Intuitively, a MISet gives the fewest input records required for a particular output record r to be present. Therefore, a MISet provides users with one possible reason for the occurrence of r. This, in turn, reduces the burden of manual annotation on the users. The MISets focus...
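The MISet notion can be illustrated with a small brute-force sketch. It assumes the reading that a MISet of an output record r is a subset of the input that yields r while none of its proper subsets do. The function name `minimal_subsets` and the toy operator are hypothetical and are not the algorithms of the disclosure; the enumeration is exponential in the input size and is meant only to make the definition concrete.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Hashable, Iterable, List, Set

Record = Hashable
# Hypothetical model of a black-box operator: set of records in, set out.
Operator = Callable[[FrozenSet[Record]], Set[Record]]


def minimal_subsets(op: Operator, inputs: Iterable[Record],
                    record: Record) -> List[FrozenSet[Record]]:
    """Enumerate every MISet of `record` by trying input subsets in order of
    increasing size. A candidate that contains an already found MISet has a
    proper subset producing `record`, so it cannot be minimal and is skipped."""
    pool = list(inputs)
    found: List[FrozenSet[Record]] = []
    for size in range(len(pool) + 1):
        for combo in combinations(pool, size):
            candidate = frozenset(combo)
            if any(m <= candidate for m in found):
                continue  # contains a smaller MISet, hence not minimal
            if record in op(candidate):
                found.append(candidate)
    return found


def toy_operator(recs):
    # Illustrative only: emits "hit" when "a" and "b" co-occur, or when "c"
    # is present on its own.
    return {"hit"} if ({"a", "b"} <= recs or "c" in recs) else set()


misets = minimal_subsets(toy_operator, ["a", "b", "c", "d"], "hit")
```

For this toy operator the output record "hit" has two MISets, {"c"} and {"a", "b"}, which shows the case where multiple minimal subsets exist; each one is a separate possible reason for the record's presence.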