Soft Boolean Search in Text and Speech

IP.com Disclosure Number: IPCOM000033637D
Original Publication Date: 2004-Dec-20
Included in the Prior Art Database: 2004-Dec-20
Document File: 2 page(s) / 44K

Publishing Venue

IBM

Abstract

Searching documents for text queries is a core function of Internet search engines, document retrieval, mail retrieval, and other systems that handle unstructured data. A search engine accepts words, keywords, or a natural-language query and retrieves a ranked list of documents that correspond to the query, with the highest ranks given to the most relevant documents. Here we propose a new method, called Soft-Boolean search, which provides a powerful way to search unstructured data. It is an extension and fusion of traditional statistically based text search and Boolean search.

This is the abbreviated version, containing approximately 52% of the total text.

The proposed formula for document retrieval combines Boolean expression analysis with statistically based ranking.

The new method allows for partial Boolean matching - hence the name soft Boolean.

With standard statistically based ranking (e.g., Okapi), a word's weight in the score of a document containing it is inversely related to the number of documents containing that word (i.e., a rare word scores higher), but is uniform across all occurrences of the word. With the proposed method, each occurrence of the word is re-scored based on its contribution to the Boolean-like query term.
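The baseline behavior described above can be illustrated with a minimal sketch. The weight formula used here, log(1 + N/df), is a generic inverse-document-frequency form chosen for illustration only; it is not the Okapi formula and is not taken from this disclosure. The function names and the toy documents are likewise hypothetical.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Return an illustrative inverse-document-frequency weight per word.

    docs: list of documents, each a list of words.
    The formula log(1 + N/df) is an assumption for this sketch.
    """
    n_docs = len(docs)
    df = Counter()
    for words in docs:
        df.update(set(words))  # count each word once per document
    return {w: math.log(1 + n_docs / c) for w, c in df.items()}

def baseline_score(doc, query_words, weights):
    """Add the same weight for every occurrence of each query word."""
    return sum(weights.get(w, 0.0) for w in doc if w in query_words)

# Toy collection (hypothetical data).
docs = [
    "the president spoke to the press".split(),
    "the bill passed the senate".split(),
    "the president signed the bill".split(),
]
weights = idf_weights(docs)
score = baseline_score(docs[2], {"president", "bill"}, weights)
```

In this sketch a rare word such as "senate" receives a larger weight than a common word such as "the", and every occurrence of a given word contributes the same amount, regardless of what surrounds it. That uniformity is what the proposed method changes.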

For example, consider the query term "President-Bill-Clinton". The - operator is defined in our scheme as an AND with close proximity and word-order significance (one of several flavors of the AND operator). Hence an occurrence of the word "Bill" that is followed within close proximity (3 seconds in our current system) by the word "Clinton" gains a higher score than an occurrence of "Bill" without "Clinton" following it. Thus the mutual appearance of those words increases each of their scores. While an AND operator increases the score of a word, an OR operator makes words alternatives to each other without increasing their mutual appearance score. For example, searching for "(system | method | program) and (query | retrieval)" will give a lower score to the appearance of "system ... method ..." in the text than to "system ... query ..." or "query ... method ...". This is bec...
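The ordered, close-proximity AND (the - operator) described above can be sketched as a per-occurrence re-scoring pass over time-stamped words, as in a speech transcript. This is only a sketch under stated assumptions: the 3-second window comes from the text, but the multiplicative BOOST value, the function name rescore_chain, and the sample data are hypothetical, and the disclosure does not state how the score increase is actually computed. The OR operator is not modeled here.

```python
WINDOW_SECONDS = 3.0  # proximity window mentioned in the text
BOOST = 2.0           # hypothetical multiplier, not from the source

def rescore_chain(occurrences, chain, base_weight):
    """Re-score each occurrence of the words in an ordered chain.

    occurrences: list of (word, time_in_seconds).
    chain: ordered query words, e.g. ["president", "bill", "clinton"].
    base_weight: dict mapping each chain word to its statistical weight.
    Returns a list of (word, time, score) for chain words only.
    """
    position = {w: i for i, w in enumerate(chain)}
    scored = []
    for word, t in occurrences:
        if word not in position:
            continue
        i = position[word]
        score = base_weight[word]
        # Is this occurrence followed by the next chain word within the window?
        followed = i + 1 < len(chain) and any(
            w == chain[i + 1] and 0 < t2 - t <= WINDOW_SECONDS
            for w, t2 in occurrences
        )
        # Is it preceded by the previous chain word within the window?
        preceded = i > 0 and any(
            w == chain[i - 1] and 0 < t - t2 <= WINDOW_SECONDS
            for w, t2 in occurrences
        )
        if followed or preceded:
            score *= BOOST  # mutual appearance raises this occurrence's score
        scored.append((word, t, score))
    return scored

# Hypothetical transcript: "bill clinton" early on, a lone "bill" later.
speech = [("bill", 1.0), ("clinton", 1.5), ("the", 10.0), ("bill", 20.0)]
unit_weights = {"president": 1.0, "bill": 1.0, "clinton": 1.0}
result = rescore_chain(speech, ["president", "bill", "clinton"], unit_weights)
```

With this sample data, the "bill" at 1.0 s and the "clinton" at 1.5 s both receive the boosted score, while the isolated "bill" at 20.0 s keeps its base weight. Word order matters: a "clinton" that came before "bill" would not trigger the boost.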