Scoring terms in a question
Original Publication Date: 2000-Mar-01
Included in the Prior Art Database: 2003-Jun-19
In many cases, a question posed in English contains a significant word that implies what type of answer is desired. For instance, the question "Where is the restaurant Lutece located?" implies that the desired answer is an address or location of the restaurant. However, for some questions the leading word can lead to an open-ended list of possible answer types. An example of such a question is "What store carries the blue dress?". The problem is to determine which terms in the question have more significance than others. Given such a measure, a bag-of-words query weighted by the significance of its terms would return, as its highest-ranking results from a corpus, documents that contain the important terms, which improves the likelihood that they contain the correct answer.
In this disclosure we suggest a significance weighting of the terms in a question. In particular, we assert that the first term (not including stop words) after the query word (i.e. WHAT) has more significance than the others.
The traditional approach to weighting terms in an information retrieval (IR) search is to use a tf*idf function or a variant thereof. The tf factor measures the number of occurrences of a query term in a document, and the idf factor measures how few documents contain a mention of the term. These factors influence how relevant a document is to the query, but they do not address how intrinsically important the various query terms are to the meaning of (and hence the answer to) the query. Our invention helps to single out the most important word in the query, so that if documents are found that do not contain all query terms, those that contain the more important terms can be given more weight than those that do not.
Knowledge of the significant words matters for query expansion as well. In general, query expansion has to be performed to close the gap between the wording of a user's question and the corpus text that paraphrases the same information.
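The first-term heuristic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the set of query words, the weight values, and the function names weight_question_terms and score_document are all hypothetical and do not come from the disclosure.

```python
import re

# Assumed, deliberately small lists; the disclosure does not enumerate them.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "to", "for", "on", "at"}
QUERY_WORDS = {"what", "which", "where", "who", "when", "how", "why"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weight_question_terms(question, boost=2.0, base=1.0):
    """Give the first non-stop-word term after the query word a higher weight.

    The boost and base values are arbitrary illustrative choices.
    """
    weights = {}
    seen_query_word = False
    boosted = False
    for tok in tokenize(question):
        if tok in QUERY_WORDS and not seen_query_word:
            seen_query_word = True      # the query word itself gets no weight
            continue
        if tok in STOP_WORDS:
            continue                    # stop words are ignored
        if seen_query_word and not boosted:
            weights[tok] = boost        # first content term after the query word
            boosted = True
        else:
            weights.setdefault(tok, base)
    return weights


def score_document(weights, document):
    """Weighted bag-of-words score: sum the weights of the query terms present."""
    doc_terms = set(tokenize(document))
    return sum(w for term, w in weights.items() if term in doc_terms)


weights = weight_question_terms("What store carries the blue dress?")
print(weights)
print(score_document(weights, "A store downtown"))
print(score_document(weights, "A blue car"))
```

With these assumed weights, a document that mentions only "store" scores higher than one that mentions only "blue", which is the partial-match behavior the weighting is meant to produce. A fuller system would presumably combine such weights with tf*idf rather than replace it.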
Synonym expansion can help to bridge the gap; however, it can lead to a combinatorial explosion of candidate answers. We suggest performing synonym expansion only on the most significant terms.
We propose to match a question against a set of patterns as described in the disclosure with docket number YO999-503. A linguistic analyzer is then applied, which removes stop words and annotates the remaining terms. Terms identified as corresponding to one of the QA-Tokens (such as NAME$ and ORG$, to name a few) are deemed more significant and should be assigned a higher weight than other terms.
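The following Python sketch illustrates the two ideas above: weighting terms that carry a QA-Token annotation more heavily, and restricting synonym expansion to the most significant terms. It makes several assumptions. The linguistic analyzer is not implemented, so its output is represented as hand-written (term, QA-Token) pairs; the annotations, the synonym table, the weight values, and all function names are hypothetical.

```python
from math import prod

# Hypothetical synonym table; a real system might consult a thesaurus instead.
SYNONYMS = {
    "store": ["shop", "retailer"],
    "carries": ["sells", "stocks"],
    "blue": ["azure", "navy"],
    "dress": ["gown", "frock"],
}


def weight_annotated_terms(annotated_terms, token_weight=2.0, base=1.0):
    """Assign a higher weight to terms that carry a QA-Token annotation.

    annotated_terms is a list of (term, qa_token) pairs, where qa_token is a
    string such as "ORG$" or None. The weight values are arbitrary.
    """
    return {
        term: (token_weight if qa_token else base)
        for term, qa_token in annotated_terms
    }


def expand_significant_terms(weights, synonyms, threshold):
    """Expand only the terms whose weight reaches the threshold."""
    expanded = {}
    for term, weight in weights.items():
        variants = [term]
        if weight >= threshold:
            variants.extend(synonyms.get(term, []))
        expanded[term] = variants
    return expanded


def count_combinations(expanded):
    """Number of distinct queries formed by picking one variant per term."""
    return prod(len(variants) for variants in expanded.values())


# Annotations invented for illustration; they are not analyzer output.
annotated = [("ibm", "ORG$"), ("founded", None)]
print(weight_annotated_terms(annotated))

# Weights as the first-term heuristic might assign them to
# "What store carries the blue dress?" (assumed values).
question_weights = {"store": 2.0, "carries": 1.0, "blue": 1.0, "dress": 1.0}
everything = expand_significant_terms(question_weights, SYNONYMS, threshold=0.0)
selective = expand_significant_terms(question_weights, SYNONYMS, threshold=2.0)
print(count_combinations(everything), count_combinations(selective))
```

In this toy setting, expanding every term yields 81 query variants, while expanding only the most significant term yields 3, which shows how selective expansion limits the combinatorial growth.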