Method to identify new source documents and enlarge corpora in question-answer systems by automated stylistic evaluation of uncurated text
Publication Date: 2016-Mar-03
The IP.com Prior Art Database
Disclosed is a system that uses stylistic analysis to determine the quality of human-generated textual content. Uncurated content can then be utilized with confidence, increasing the available corpora for answer generation. Not only can this increase corpora size, but it may allow for answers to more esoteric questions, such as trend reporting in certain demographics.
Question-answer systems are only as good as their source material. They rely on correct data to obtain answers, including free-text sources such as encyclopedias and news media that are considered authoritative. However, this leaves the vast majority of human-generated textual content unavailable: blogs, amateur articles, and even business communications are left out because their content may not be reliable. Furthermore, news sources are increasingly turning to semi-factual opinion pieces or sponsor-paid native advertising, which undermines the credibility of the source. Identifying and excluding such articles should lead to higher-quality answer generation.
Some documents contain false information. This false information often comes from classes of documents such as blogs, opinion articles, or press releases from highly biased sources (for example, think tanks or lobbyists). To most question-answering systems, however, it typically looks and feels like accurate information. A method is needed to down-weight answer candidates discovered in uncurated sources according to their apparent reliability, so that credible information from reliable but unknown sources can still be used. Such a method makes it possible to answer questions by leveraging documents from a broad, uncurated collection such as the World Wide Web, in addition to deeply analyzed documents from a stored corpus.
This invention addresses the problem by using stylistic analysis to determine the quality of human-generated textual content. Uncurated content can then be utilized with confidence, increasing the available corpora for answer generation. Not only can this increase corpus size, but it may also allow for answers to more esoteric questions, such as trend reporting in certain demographics. A curated source may not immediately report on a currently popular boy band's latest activities, but many devoted followers will maintain blogs on the topic. Evaluating the purpose and authority of the author can lend confidence to uncurated sources; checking sources for signs of duplicity or native advertising can further increase confidence, even in curated news sources. While this technique can never eliminate all error, the increase in available corpora is expected to more than compensate for occasionally treating a source as more reliable than it is.
1. Given a question, collect a selection of blogs or other uncurated textual sources relating to the question; the selection can be based on topic modeling or other criteria.
2. For each article, evaluate the reliability of the information present based on the style of its presentation.
3. Heuristically determine a cutoff point, below which the article will not be utilized.
4. For each remaining source, generate answers, using the reliability score from step 2 to help weight the answer candidates.
5. For articles deemed highly reliable, flag...
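Steps 2 through 4 above can be illustrated with a minimal Python sketch. The disclosure does not enumerate the stylistic features it uses, so everything specific here is an assumption made for illustration: the cue lists (SUPERLATIVES, PROMO_PHRASES), the penalty weights, the 0.5 cutoff, and the names Article, style_reliability, and weighted_answers are all hypothetical. Candidate answers and their base scores are assumed to come from an existing question-answer pipeline.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical stylistic cues; the disclosure does not list its features.
SUPERLATIVES = {"best", "amazing", "incredible", "guaranteed", "unbelievable"}
PROMO_PHRASES = ("buy now", "sponsored", "limited offer", "click here")


@dataclass
class Article:
    text: str
    # (answer, base score) pairs assumed to come from an upstream QA pipeline.
    candidate_answers: List[Tuple[str, float]]


def style_reliability(text: str) -> float:
    """Step 2: score an article in [0, 1] from the style of its presentation."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    n = len(words)
    exclaim_rate = text.count("!") / n
    # Fraction of longer words written entirely in capitals ("shouting").
    shout_rate = sum(1 for w in words if len(w) > 3 and w.isupper()) / n
    superlative_rate = sum(1 for w in words if w.lower() in SUPERLATIVES) / n
    promo_hits = sum(text.lower().count(p) for p in PROMO_PHRASES)
    # Arbitrary illustrative weights, not tuned on any data.
    penalty = 5.0 * (exclaim_rate + shout_rate + superlative_rate) + 0.25 * promo_hits
    return max(0.0, 1.0 - penalty)


def weighted_answers(articles: List[Article], cutoff: float = 0.5) -> List[Tuple[str, float]]:
    """Steps 3 and 4: drop articles below the cutoff, then weight the answers."""
    ranked = []
    for article in articles:
        reliability = style_reliability(article.text)
        if reliability < cutoff:
            continue  # Step 3: the article is not utilized.
        for answer, base_score in article.candidate_answers:
            # Step 4: the reliability score scales each candidate's base score.
            ranked.append((answer, base_score * reliability))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


if __name__ == "__main__":
    articles = [
        Article("The band announced three tour dates in March, according to "
                "the venue schedule.", [("March", 0.8)]),
        Article("AMAZING deal!!! Buy now, the BEST tickets, guaranteed!",
                [("tomorrow", 0.9)]),
    ]
    for answer, score in weighted_answers(articles):
        print(answer, round(score, 3))
```

Multiplying the base score by the reliability score is only one possible weighting. A real system would more plausibly learn both the stylistic features and the cutoff from labeled examples than rely on hand-picked word lists.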