Test Coverage of a natural language corpus Disclosure Number: IPCOM000236478D
Publication Date: 2014-Apr-29
Document File: 2 page(s) / 25K

Publishing Venue

The Prior Art Database


Disclosed is a method for selecting an effective subset of a corpus for natural language processing (NLP) pipeline testing. Each passage in the corpus is scored against a vector of interesting attributes so that Combinatorial Test Design (CTD)-style reduction can be applied to identify an optimal subset of documents to include in the test.


Performing natural language processing (NLP) on a large text corpus is time consuming. Developing and testing an NLP pipeline of annotators is even more time consuming, especially given the number of possible document and sentence constructs. To have confidence that the NLP processing is correct, a development team tests the NLP ingestion process against the entire text corpus, which is very expensive. The alternative is to test against a random subset of documents, but a random sample does not deliver the broad test coverage required.

In a hypothetical NLP corpus ingestion, ingesting 600,000 documents can take a week. A method is needed to select a small subset of the documents (300-1,000) that provides 99% or higher coverage yet runs quickly, in less than an hour.

For each document in the corpus, the method generates a vector of attributes (or expected attributes), based either on the annotators that matched the document in a previous ingestion or on the presence of certain keywords. CTD-style reduction is then applied to identify the smallest possible list of documents that covers all the features. Alternatively, the scoring and reduction can be performed at the paragraph or sentence level.
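The reduction step can be sketched as a greedy set cover, a common heuristic stand-in for a full CTD tool. In this sketch, each document's attribute vector is represented as a set of attribute names; the document IDs and attribute names are hypothetical examples, not from the disclosure.

```python
# Greedy set-cover reduction (a heuristic stand-in for CTD tooling):
# pick the smallest document subset whose combined attribute sets
# cover every attribute observed anywhere in the corpus.

def reduce_corpus(doc_attrs):
    """doc_attrs: dict mapping document id -> set of attribute names."""
    remaining = set().union(*doc_attrs.values())  # attributes still uncovered
    selected = []
    while remaining:
        # Choose the document covering the most still-uncovered attributes.
        best = max(doc_attrs, key=lambda d: len(doc_attrs[d] & remaining))
        gained = doc_attrs[best] & remaining
        if not gained:
            break  # no document adds new coverage
        selected.append(best)
        remaining -= gained
    return selected

# Hypothetical corpus: four documents scored against five attributes.
corpus = {
    "doc1": {"Person", "Date", "Negation"},
    "doc2": {"Person", "Dosage"},
    "doc3": {"Date", "Dosage", "Anaphora"},
    "doc4": {"Negation"},
}
subset = reduce_corpus(corpus)  # two documents cover all five attributes
```

Greedy set cover is not guaranteed to find the true minimum subset, but it is fast and in practice comes close, which is usually sufficient for selecting a test corpus.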

The benefit of this approach is maximum test coverage from the smallest possible set of documents, yielding an overall faster processing time.

Summary of solution

Perform an analysis of the NLP pipeline to determine the inputs and outputs of each segment. Scan the input passages to determine which input annotations are present, or are likely to be present, in each passage. Next, determine the output annotations these passages have or are likely to generate. Finally, for every passage, build the list of output annotations it has or will have, and perform CTD reduction, including weighting, to identify the smallest set of documents that covers all the expected annotations.
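The weighting mentioned above can be folded into the greedy selection by scoring each candidate document by coverage gained per unit of cost. This is a minimal sketch under assumed inputs: the cost model (per-document processing time) and the example data are illustrative, not part of the disclosure.

```python
# Weighted greedy reduction: prefer documents that cover many
# still-uncovered output annotations per unit of cost (e.g. the
# document's expected processing time). Cost model is an assumption.

def reduce_weighted(doc_annotations, doc_cost):
    """doc_annotations: doc id -> set of expected output annotations.
    doc_cost: doc id -> positive cost (e.g. processing seconds)."""
    remaining = set().union(*doc_annotations.values())
    selected = []
    while remaining:
        # Value = newly covered annotations per unit cost.
        best = max(doc_annotations,
                   key=lambda d: len(doc_annotations[d] & remaining) / doc_cost[d])
        if not doc_annotations[best] & remaining:
            break  # nothing left gains coverage
        selected.append(best)
        remaining -= doc_annotations[best]
    return selected

# Hypothetical passages with expected output annotations and costs.
docs = {
    "a": {"Token", "Sentence", "Person"},
    "b": {"Token", "Date"},
    "c": {"Sentence", "Person", "Date"},
}
costs = {"a": 5.0, "b": 1.0, "c": 2.0}
picked = reduce_weighted(docs, costs)  # cheap docs "b" and "c" cover everything
```

Here the expensive document "a" is skipped because the cheaper pair "b" and "c" covers the same annotations, illustrating how weighting steers the reduction toward a subset that is both small and fast to run.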

Solution details
1) For all of the annotators in the pipeline, determine their input and output annotation types.

2) Scan the input passages for each of these annotation...