Soft Boolean Search in Text and Speech
Original Publication Date: 2004-Dec-20
Included in the Prior Art Database: 2004-Dec-20
Searching for text queries in documents is a core functionality of any Internet search engine, document retrieval, mail retrieval and other systems handling unstructured data. A search engine accepts words, keywords or natural language query and retrieves a ranked list of documents that correspond to the query, with highest ranks to the most relevant documents. Here we propose a new method called Soft-Boolean search which provides a powerful way for search in unstructured data. This is an extension and fusion of traditional statistically-based text search and of Boolean search. Statistical methods are the most common one in use by Internet search engines. The tf*idf formula is most often being used along with the OKAPI formula for statistical document retrieval. In such schemes, each document gets a score which sums the contribution of the individual words/phrases. In general, a word contributes more if it appears a larger number of times in the document, if it is a less frequent word in the corpus, and if it appears more then one time in the query. This sum might be normalized with respect to the document length and other factors. There are many flavors to the exact formulation of this concept. The present invention is described using one common formulation but is very general and can be applied to other definitions and computation forms as well. Boolean methods are also widely used in search engines, and in particular in those which are used by professional users. For example, see the patents database by Delphion, or the medical database by Medline etc. The advantage of Boolean search over statistical search is that the user can narrow down the set of retrieved documents by adding Boolean constraints such as AND terms, or providing alternative words using OR terms. However, the typical query would be longer than ones used with statistical methods, and need to be formed as a Boolean expression using an appropriate syntax. This is a limitation for novice users who may not know how to use it. Further, there is a great limitation to Boolean search in that of potentially missing many documents which "nearly match" the Boolean query expression.