System and method for generating question type distribution of a training data set in a Question/Answering system
Publication Date: 2014-Dec-02
The IP.com Prior Art Database
A system and method for generating question type distribution of a training data set in a question answering system is disclosed.
Disclosed is a system and method for generating question type distribution of a training data set in a question answering system.
Modern question answering (QA) systems rely heavily on machine learning techniques to answer questions correctly. To be effective, they must be given high-quality training data (also known as ground truth). The training data is used to train the system and to compute a predictive model with appropriate weights for the features in use. The training data may be created manually by subject matter experts (SMEs) or automatically using the approaches in Figure 1, which depicts algorithms for matching questions and answers to the corpus. The algorithms are applied in phases:
1. Text search: In this phase, the goal is to screen the questions using simple queries that perform a full-text search on the corpus. Matching options can be set to "perfect match" or "fuzzy match." One possible scenario is to start with perfect match and, if the number of hits is too low, gradually move to fuzzy match until the targeted training data size is met. This algorithm may have a higher likelihood of false positives, but it offers better performance, which matters when running against the entire "Universal Ground Truth," which potentially contains hundreds of millions of Question/Answer pairs.
2. Deeper natural language processing (NLP) analysis: Invoke the supporting evidence phase of the pipeline to score the answers. This step is invoked on the set that results from step 1 ("Text search"). Only the Question/Answer pairs with the highest scores are retained in the final question set.
3. (Optional) Further refinement: The generated questions can go through a workflow to be vetted by SMEs.
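The phased approach above can be illustrated with a minimal sketch. It is not the disclosed implementation: the function names, the threshold schedule, the use of the answer text as the query, and the use of Python's `difflib.SequenceMatcher` as the fuzzy matcher are all assumptions made for illustration. The supporting evidence scorer is system-specific, so it is passed in as a hypothetical callable `score_fn`.

```python
from difflib import SequenceMatcher


def match_score(text, passage):
    """Return 1.0 for a perfect (substring) match, else a fuzzy similarity in [0, 1]."""
    text, passage = text.lower(), passage.lower()
    if text in passage:
        return 1.0
    return SequenceMatcher(None, text, passage).ratio()


def best_match(text, corpus):
    """Best score of `text` against any passage in the corpus (0.0 if the corpus is empty)."""
    return max((match_score(text, passage) for passage in corpus), default=0.0)


def text_search(pairs, corpus, target_size, thresholds=(1.0, 0.9, 0.8, 0.7)):
    """Phase 1: start with perfect match, then relax toward fuzzy match
    until the targeted training data size is met.

    `pairs` is a list of (question, answer) tuples; the answer text is used
    as the query (an assumption of this sketch).
    """
    selected = []
    for threshold in thresholds:
        selected = [p for p in pairs if best_match(p[1], corpus) >= threshold]
        if len(selected) >= target_size:
            break
    return selected


def deeper_analysis(candidates, score_fn, keep):
    """Phase 2: score the phase 1 candidates and retain the highest-scoring pairs.

    `score_fn` stands in for the supporting evidence phase of the pipeline.
    """
    ranked = sorted(candidates, key=score_fn, reverse=True)
    return ranked[:keep]
```

Lowering the threshold admits more pairs at the cost of more false positives, which is why the second phase re-scores only the pairs that survive the first. A caller might run `text_search(pairs, corpus, target_size=1000)` and then pass the result to `deeper_analysis` with whatever evidence scorer the pipeline provides.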
However, whether the training data has been created manually by SMEs or automatically, the distribution of questions may not provide good coverage of the corpus, and the data may contain a large number of similar questions that add little predictive information. A poor distribution of the training data can cause the overall accuracy of the question answering system to drop.
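As a rough illustration of how such a problem might be detected, the sketch below computes a question type distribution and flags near-duplicate questions in a training set. It rests on simplifying assumptions that are not part of the disclosure: the question type is taken to be the leading interrogative word, and similarity is measured with `difflib.SequenceMatcher`. The names `question_type`, `type_distribution`, and `near_duplicates` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

# Assumed set of question types, keyed by the leading interrogative word.
QUESTION_WORDS = ("who", "what", "when", "where", "why", "how", "which")


def question_type(question):
    """Classify a question by its first word; anything else is 'other'."""
    words = question.strip().lower().split()
    first = words[0] if words else ""
    return first if first in QUESTION_WORDS else "other"


def type_distribution(questions):
    """Return the fraction of questions that fall into each question type."""
    counts = Counter(question_type(q) for q in questions)
    total = sum(counts.values())
    return {qtype: n / total for qtype, n in counts.items()}


def near_duplicates(questions, threshold=0.9):
    """Return pairs of questions whose similarity ratio meets the threshold.

    This is a quadratic comparison, suitable only for small training sets.
    """
    pairs = []
    for i, a in enumerate(questions):
        for b in questions[i + 1:]:
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                pairs.append((a, b))
    return pairs
```

A distribution dominated by one type, or a long list of near-duplicate pairs, would suggest that the training set covers the corpus poorly in the sense described above.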
A system for annotating the training set with metadata that identifies characteristics of the question and/or answer is disclosed in the article: http://priorart.ip.com/IPCOM/000232302D
An algorithmic approach to identifying an initial optimal distribution of questions for training a QA system on a new domain builds on that system. A profile of the most prominent features of each document in the new domain's corpora is aggregated across each corpus into a corresponding corpus profile. The training sets for those most similar systems may be annotated and an interpolation of the statistics from those existing training sets a...