Efficient Processing of Fulltext Queries Containing Stopwords
Original Publication Date: 2002-May-17
Included in the Prior Art Database: 2003-Jun-20
The treatment of so-called stopwords in fulltext search is a well-known problem. Typically, stopwords are the "fillwords" of a language, such as articles and prepositions, that do not contribute significantly to the meaning of a text. Since stopwords occur very frequently in documents (often in every document), they are very expensive to process in queries, and their contribution to the result is limited. Example: the query "Intelligent Miner for Text" contains the stopword "for". Since "for" occurs in almost every English text, the result is almost identical to that of the query "Intelligent Miner Text". Therefore, most search engines simply eliminate stopwords from both the text being indexed and the query being asked.

But there are exceptions: consider the query "to be or not to be", which consists exclusively of stopwords. It can be argued that stopwords are useful for discriminating between results, but only if taken in combination and with proximity information, e.g., the query terms must occur in this order within a single sentence. It can also be argued that this usually reflects the user's intention in such a query closely. A more business-critical example is the query for "A Series", one of IBM's ThinkPad models; unfortunately, the product name contains the stopword "A".

The problem of efficient processing remains, however. An index in a search engine is constructed to provide efficient access from a term to all of the documents and positions in which it occurs. To evaluate a query containing stopwords, the search engine has to compute intersections of very large document/position lists.

Our solution is based on the following main idea: if the intended meaning of a stopword in a user query is almost always determined by the context in which the stopword is used, then when processing input documents we should index stopwords not independently, but in the context in which they occur.
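
The main idea can be illustrated with a minimal sketch. The text above does not define what "context" means, so the sketch assumes the simplest possible choice: each stopword is fused with the token that immediately follows it and indexed as a single compound term. The stopword list, the function names (`index_terms`, `build_index`, `phrase_search`) and the sample documents are all hypothetical and chosen only for illustration; this is not necessarily the scheme the authors implemented.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny stopword list (assumption, for illustration only).
STOPWORDS = {"a", "for", "to", "be", "or", "not", "the", "is", "of"}


def tokenize(text):
    """Lowercase and split on whitespace; a real engine would do far more."""
    return text.lower().split()


def index_terms(tokens):
    """Yield (term, position) pairs.

    Assumption of this sketch: the "context" of a stopword is its right-hand
    neighbour, so a stopword is emitted as the compound term
    "<stopword> <next token>". A stopword with no following token is dropped.
    """
    for pos, tok in enumerate(tokens):
        if tok in STOPWORDS:
            if pos + 1 < len(tokens):
                yield f"{tok} {tokens[pos + 1]}", pos
        else:
            yield tok, pos


def build_index(docs):
    """Build a positional index: term -> doc_id -> set of positions."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for term, pos in index_terms(tokenize(text)):
            index[term][doc_id].add(pos)
    return index


def phrase_search(index, query):
    """Return the ids of documents that contain the query as an exact phrase."""
    terms = list(index_terms(tokenize(query)))
    if not terms:
        return set()
    first_term, first_offset = terms[0]
    hits = set()
    for doc_id, positions in index.get(first_term, {}).items():
        for start in positions:
            base = start - first_offset
            # Every query term must occur at its expected offset in this document.
            if all(base + off in index.get(term, {}).get(doc_id, ())
                   for term, off in terms):
                hits.add(doc_id)
                break
    return hits


docs = {
    1: "the thinkpad a series is a notebook line",
    2: "a new series of thinkpad notebooks",
    3: "to be or not to be that is the question",
    4: "not to be or to be",
}
index = build_index(docs)

print(phrase_search(index, "A Series"))            # {1}
print(phrase_search(index, "to be or not to be"))  # {3}
```

In this sketch the bare term "a" never appears in the index, so the query "A Series" starts from the short posting list of the compound term "a series" instead of intersecting the list of every document that contains "a". The price is a larger vocabulary, since each distinct stopword/neighbour pair becomes its own term.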