
Dependency parsing for question normalization

IP.com Disclosure Number: IPCOM000244155D
Publication Date: 2015-Nov-16
Document File: 3 page(s) / 81K

Publishing Venue

The IP.com Prior Art Database

Abstract

Described is a method for normalizing questions to a canonical form that is optimized for question-answering purposes so one can achieve a stronger mapping between questions and relevant primary-source passages. Equivalent natural language questions can be expressed in many different ways, either through synonyms or syntactic alternations.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.



Human language is extremely flexible; individuals asking the same question in the same language often word it quite differently. Speakers of a language easily recognize that questions such as "Who took the sofa?", "Who took the couch?", and "Who is the sofa thief?" are equivalent, but a computer that compares sentences only by string equality does not find them to be the same. Many current QA systems find the correct answer only if the user writes the question in a specific form, and that form is rarely stated explicitly. A method is therefore needed that lets users write queries the way they naturally speak and then converts each query into the form the system needs to generate the correct answer (see the figure below).

To normalize questions to a canonical form, each sentence must be pulled apart and its key pieces extracted. Six key pieces of a query are identified here: the tense (past, present, etc.), the wh- or question word, the subject, the wanted word (optional), the object (optional), and the hypothesis (optional). The normalizer takes a text file of sentences and pulls these pieces into arrays. After all the sentences have been processed, each array is checked to make sure that the words it holds are equivalent (either the same word or synonyms) and that its number of elements either equals the number of sentences in the input file or is zero (acceptable only for the optional pieces). The pieces are then placed into the template for the canonical form, and this final form is submitted to the QA system. The canonical form was chosen by running experiments on many forms of the same question and selecting the form that most often resulted in the correct answer.
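The collect, check, and fill steps described above can be sketched roughly as follows. This is a minimal illustration, not the disclosed implementation: the synonym table, the field names, the hand-written piece values for the sofa/couch questions, and above all the template order are assumptions made for the example, since the disclosure does not state the canonical form that its experiments selected.

```python
from typing import Dict, List, Optional

REQUIRED = ("tense", "wh", "subject")
OPTIONAL = ("wanted", "object", "hypothesis")

# Hypothetical synonym groups; a real system would consult a lexical resource.
SYNONYMS = [{"sofa", "couch"}]

# Hypothetical template order; the actual canonical form is not given in the text.
TEMPLATE_ORDER = ("wh", "tense", "wanted", "subject", "object", "hypothesis")


def equivalent(a: str, b: str) -> bool:
    """True if the two words are the same word or listed as synonyms."""
    a, b = a.lower(), b.lower()
    return a == b or any(a in group and b in group for group in SYNONYMS)


def collect(parsed: List[Dict[str, Optional[str]]]) -> Dict[str, List[str]]:
    """Pull each piece of every sentence into one array per piece."""
    arrays: Dict[str, List[str]] = {name: [] for name in REQUIRED + OPTIONAL}
    for pieces in parsed:
        for name in arrays:
            value = pieces.get(name)
            if value is not None:
                arrays[name].append(value)
    return arrays


def validate(arrays: Dict[str, List[str]], n_sentences: int) -> bool:
    """Check element counts and word equivalence for every array."""
    for name, values in arrays.items():
        if name in OPTIONAL and not values:
            continue  # an empty array is acceptable only for optional pieces
        if len(values) != n_sentences:
            return False
        if not all(equivalent(values[0], v) for v in values[1:]):
            return False
    return True


def fill_template(arrays: Dict[str, List[str]]) -> str:
    """Join the first word of each non-empty array in the assumed order."""
    return " ".join(arrays[n][0].lower() for n in TEMPLATE_ORDER if arrays[n])


def normalize(parsed: List[Dict[str, Optional[str]]]) -> Optional[str]:
    """Return the canonical string, or None if the sentences disagree."""
    arrays = collect(parsed)
    if not validate(arrays, len(parsed)):
        return None
    return fill_template(arrays)


# Hand-written pieces for "Who took the sofa?" and "Who took the couch?"
same_question = [
    {"tense": "past", "wh": "who", "subject": "sofa"},
    {"tense": "past", "wh": "who", "subject": "couch"},
]
different_question = [
    {"tense": "past", "wh": "who", "subject": "sofa"},
    {"tense": "past", "wh": "who", "subject": "table"},
]
```

With these inputs, `normalize(same_question)` yields a single canonical string because "sofa" and "couch" are listed as synonyms, while `normalize(different_question)` returns `None` because the subject array holds non-equivalent words.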

The normalizer is given a text file with one or more line-separated sentences. For each sentence, the six query pieces are extracted. For example, given the sentence "What is the capital of France?": subject=France, wanted=capital, object=NONE, hypothesis=NONE, tense=present, wh=what. Another example is the sentence "Is Paris the city that is the capital of France?": subject=France, wanted=capital, object=city, hypothesis=Paris, tense=present, wh=what. Although the word "what" does not appear in the second sentence, the normalizer detects that the hypothesis refers to a place and so generates the wh- word even though none is present. The syntactic parser used for this detection was based on the XSG slot grammar, which can pull expressions for each word in a sentence, giving information on what the word could be and its part of s...
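The two worked examples above can be written down as records to show that both sentences share the same core pieces. This is only an illustrative sketch: the `QueryPieces` class, its field names (including `object_`), and the `core` method are hypothetical, and the values are entered by hand from the examples and are not produced by a parser.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class QueryPieces:
    """The six query pieces; optional ones default to None (NONE in the text)."""
    tense: str
    wh: str
    subject: str
    wanted: Optional[str] = None
    object_: Optional[str] = None
    hypothesis: Optional[str] = None

    def core(self) -> Tuple[str, str, str, Optional[str]]:
        """Pieces that both example sentences have in common."""
        return (self.tense, self.wh, self.subject, self.wanted)


# "What is the capital of France?"
capital = QueryPieces(tense="present", wh="what",
                      subject="France", wanted="capital")

# "Is Paris the city that is the capital of France?"
is_paris = QueryPieces(tense="present", wh="what", subject="France",
                       wanted="capital", object_="city", hypothesis="Paris")
```

Comparing `capital.core()` with `is_paris.core()` shows that the yes/no question carries the same tense, wh- word, subject, and wanted word as the direct question, plus an object and a hypothesis.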