
Method to identify new source documents and enlarge corpora in question-answer systems by automated stylistic evaluation of uncurated text

IP.com Disclosure Number: IPCOM000245360D
Publication Date: 2016-Mar-03
Document File: 2 page(s) / 29K

Publishing Venue

The IP.com Prior Art Database

Abstract

Disclosed is a system that uses stylistic analysis to determine the quality of human-generated textual content. Uncurated content can then be utilized with confidence, increasing the available corpora for answer generation. Not only can this increase corpora size, but it may allow for answers to more esoteric questions, such as trend reporting in certain demographics.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 47% of the total text.


Question-answer systems are only as good as their source material. They rely on correct data to obtain answers, including free-text sources that are considered authoritative, such as encyclopedias and news media. However, this leaves the vast majority of human-generated textual content unavailable: blogs, amateur articles, and even business communications are left out because their content may not be reliable. Furthermore, news sources are increasingly turning to semi-factual opinion pieces and sponsor-paid native advertising, which undermines the credibility of the source. Identifying and eliminating such articles should lead to higher-quality answer generation.

    Some documents contain false information. It often comes from classes of documents such as blogs, opinion articles, or press releases from highly biased sources (like think tanks or lobbyists). To most question-answering systems, however, this false information looks and feels like accurate information. A method is needed to down-weight answer candidates that were discovered in uncurated sources, so that credible information from reliable but unknown sources can still be used. This makes it possible to answer questions by leveraging documents from a broad, uncurated collection such as the World Wide Web, in addition to deeply analyzed documents from a stored corpus.

    This invention solves this problem by using stylistic analysis to determine the quality of human-generated textual content. Uncurated content can then be utilized with confidence, increasing the available corpora for answer generation. Not only can this increase corpora size, but it may also allow for answers to more esoteric questions, such as trend reporting in certain demographics. A curated source may not immediately report on a currently popular boy band's latest activities, but many devoted followers will maintain blogs on the topic. Evaluating the purpose and authority of the author can lend confidence to uncurated sources; checking sources for signs of duplicity or native advertising can further increase confidence, even in curated news sources. While this technique can never eliminate all error, the increase in available corpora is expected to more than compensate for occasionally treating a source as more reliable than it is.
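    As a rough illustration of what a stylistic reliability score might look like, the following Python sketch penalizes a few surface cues. The disclosure does not enumerate specific stylistic features, so the cue list (PROMO_PHRASES), the three cues, the penalty weights, and the function name style_reliability are all assumptions made for this example, not the disclosed method.

```python
import re

# Hypothetical promotional cues; the disclosure does not list specific features.
PROMO_PHRASES = ("sponsored", "buy now", "limited time offer", "click here")


def style_reliability(text: str) -> float:
    """Return a reliability score in [0, 1] from surface style cues (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        return 0.0
    lowered = text.lower()
    # Cue 1: exclamation marks per word.
    exclaim_rate = text.count("!") / len(words)
    # Cue 2: share of words written in all capitals (length > 1, so "I" and "A" are ignored).
    shout_rate = sum(1 for w in words if len(w) > 1 and w.isupper()) / len(words)
    # Cue 3: number of distinct promotional phrases present.
    promo_hits = sum(1 for phrase in PROMO_PHRASES if phrase in lowered)
    # Assumed penalty weights, chosen only for illustration.
    penalty = 2.0 * exclaim_rate + 2.0 * shout_rate + 0.25 * promo_hits
    return max(0.0, 1.0 - penalty)
```

    A production system would more plausibly learn such cues and weights from labeled examples of reliable and unreliable text; the hand-set values here only show the shape of the computation.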

1. Given a question, collect a selection of blogs or other uncurated textual sources relating to the question. The selection can be based on topic modeling or other criteria.

2. For each article, evaluate the reliability of the information present based on the style of its presentation.

3. Heuristically determine a cutoff point, below which the article will not be utilized.

4. For each remaining source, generate answers, utilizing the reliability score from step 2 to help weight the answer candidates.

5. For articles deemed highly reliable, flag...
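
Steps 2 through 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the articles are already collected (step 1), the style scorer and the answer generator are passed in as callables, the cutoff is a fixed value of 0.5, and candidates are combined by summing confidence times reliability. The names rank_answers, score_style, and generate_answers are hypothetical, and none of these choices is specified by the disclosure. Step 5 is omitted because its text is truncated in this abbreviated version.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def rank_answers(
    articles: Dict[str, str],
    score_style: Callable[[str], float],
    generate_answers: Callable[[str], Iterable[Tuple[str, float]]],
    cutoff: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rank answer candidates from uncurated articles, weighted by source reliability.

    articles maps a source id to its text; generate_answers yields
    (answer, confidence) pairs for one article. Both callables are placeholders.
    """
    # Step 2: score each article's reliability from its style.
    reliability = {sid: score_style(text) for sid, text in articles.items()}
    # Step 3: drop articles below the cutoff (assumed fixed here; the text says heuristic).
    kept = {sid: r for sid, r in reliability.items() if r >= cutoff}
    # Step 4: weight each candidate by the reliability of its source.
    # Summing across sources is an assumed aggregation rule.
    totals: Dict[str, float] = defaultdict(float)
    for sid, r in kept.items():
        for answer, confidence in generate_answers(articles[sid]):
            totals[answer] += confidence * r
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

With this aggregation, an answer supported by several moderately reliable sources can outrank one found in a single source, which is one possible way to make use of a larger uncurated corpus.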