Soft Boolean Search in Text and Speech
Original Publication Date: 2004-Dec-20
Included in the Prior Art Database: 2004-Dec-20
Searching for text queries in documents is a core functionality of any Internet search engine, document retrieval, mail retrieval and other systems handling unstructured data. A search engine accepts words, keywords or natural language query and retrieves a ranked list of documents that correspond to the query, with highest ranks to the most relevant documents. Here we propose a new method called Soft-Boolean search which provides a powerful way for search in unstructured data. This is an extension and fusion of traditional statistically-based text search and of Boolean search. Statistical methods are the most common one in use by Internet search engines. The tf*idf formula is most often being used along with the OKAPI formula for statistical document retrieval. In such schemes, each document gets a score which sums the contribution of the individual words/phrases. In general, a word contributes more if it appears a larger number of times in the document, if it is a less frequent word in the corpus, and if it appears more then one time in the query. This sum might be normalized with respect to the document length and other factors. There are many flavors to the exact formulation of this concept. The present invention is described using one common formulation but is very general and can be applied to other definitions and computation forms as well. Boolean methods are also widely used in search engines, and in particular in those which are used by professional users. For example, see the patents database by Delphion, or the medical database by Medline etc. The advantage of Boolean search over statistical search is that the user can narrow down the set of retrieved documents by adding Boolean constraints such as AND terms, or providing alternative words using OR terms. However, the typical query would be longer than ones used with statistical methods, and need to be formed as a Boolean expression using an appropriate syntax. This is a limitation for novice users who may not know how to use it. Further, there is a great limitation to Boolean search in that of potentially missing many documents which "nearly match" the Boolean query expression.
Soft Boolean Search in Text and Speech
The proposed new formula for document retrieval is a combination of Boolean expression analysis with statistical based ranking.
The new method allows for partial Boolean matching - hence the name soft Boolean.
With standard statistical based ranking, (e.g., OKAPI), a word's weight, in the score of the document containing it, is inversely related to the number of documents containing it (i.e., a rare word scores higher), but is uniform for all occurrences of that word. With the proposed method, each occurrence of the word is re-scored based on its contribution to the Boolean-like query term.
For example, consider the query term "President-Bill-Clinton". The - operator is defined in our scheme as an AND with close proximity and words order significance (one of several different flavors of the AND operator). Hence the word "Bill", when it occurs and is preceded within close proximity (3 Seconds in our current system) of the word "Clinton", would gain a higher score than the word "Bill" without "Clinton" following it. Thus the mutual appearance of those words increases each of their scores. While an AND operator increases the score of a work, an OR operator makes words alternatives to each other without increasing their mutual appearance score. For example, searching for "(system | method | program) and (query | retrieval)" will give a lower score to the appearance of "system ... method ..." in the text than to "system ... query ... " or "query ... method ... ". This is bec...