Efficient Processing of Fulltext Queries Containing Stopwords

IP.com Disclosure Number: IPCOM000015657D
Original Publication Date: 2002-May-17
Included in the Prior Art Database: 2003-Jun-20
Document File: 3 page(s) / 46K

Publishing Venue

IBM

Abstract

The treatment of so-called stopwords in fulltext search is a well-known problem. Typically, stopwords are the "fillwords" of a language, such as articles and prepositions, that do not contribute significantly to the meaning of a text. Since stopwords occur very frequently in documents, often in every document, it is very expensive to process them in queries, and their contribution to the result is limited. Example: the query "Intelligent Miner for Text" contains the stopword "for". Since "for" occurs in almost every English text, the result is almost identical to that of the query "Intelligent Miner Text". Therefore, most search engines simply eliminate stopwords from both the text being indexed and the query being asked.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.

    The treatment of so-called stopwords in fulltext search is a well-known problem. Typically, stopwords are the "fillwords" of a language, such as articles and prepositions, that do not contribute significantly to the meaning of a text. Since stopwords occur very frequently in documents - often in every document - it is very expensive to process them in queries, and their contribution to the result is limited. Example: the query "Intelligent Miner for Text" contains the stopword "for". Since "for" occurs in almost every English text, the result is almost identical to that of the query "Intelligent Miner Text". Therefore, most search engines simply eliminate stopwords from both the text being indexed and the query being asked.

    But there are exceptions: consider the query "to be or not to be", which consists exclusively of stopwords. It can be argued that stopwords are useful to discriminate results, but only if taken in combination and with proximity information, e.g., the query terms must occur in this order within a single sentence. It can also be argued that this usually closely reflects the user's intention in such a query. A more business-critical example is the query for "A Series", which is one of IBM's ThinkPad models; unfortunately, the product name contains the stopword "A".

    The problem of efficient processing remains, however. An index in a search engine is constructed in a way that provides efficient access from a term to all of the documents and positions it occurs in. Because a stopword occurs in almost every document, its document/position list is very long, so the search engine has to compute very large intersections of document/position lists to evaluate a query containing stopwords.
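
    The cost described above can be illustrated with a minimal sketch of a positional index and a phrase query evaluated by intersecting document/position lists. This is an illustration under stated assumptions, not the disclosed implementation: the names build_index and phrase_query, the whitespace tokenization, the lowercasing of terms, and the second document ID 4712 are all hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]}; positions start at 1."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            # Lowercasing is an assumption of this sketch.
            index[term.lower()][doc_id].append(pos)
    return index


def phrase_query(index, phrase):
    """Return {doc_id: [start positions]} where the terms occur consecutively."""
    terms = [t.lower() for t in phrase.split()]
    if not terms or any(t not in index for t in terms):
        return {}
    # Step 1: intersect the document sets of all query terms.
    doc_ids = set(index[terms[0]])
    for t in terms[1:]:
        doc_ids &= set(index[t])
    # Step 2: within each candidate document, check adjacent positions.
    hits = {}
    for doc_id in doc_ids:
        starts = [
            p
            for p in index[terms[0]][doc_id]
            if all(
                (p + k) in index[t][doc_id]
                for k, t in enumerate(terms[1:], start=1)
            )
        ]
        if starts:
            hits[doc_id] = starts
    return hits


# Document 4712 is a hypothetical second document added for illustration.
docs = {
    4711: "Use IBM Intelligent Miner for Text to discover new knowledge "
          "in huge amounts of textual data",
    4712: "to be or not to be",
}
index = build_index(docs)
result = phrase_query(index, "Intelligent Miner for Text")
```

    In this sketch every query term contributes its full list to the intersection, so a term such as "for" or "to", which appears in nearly every document of a real collection, dominates the work even though it barely narrows the result.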

    Our solution is based on the following main idea: if the intended meaning of a stopword in a user query is almost always determined by the context in which the stopword is used, then - when processing input documents - we should index stopwords not independently, but in the context they occur in.
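
    A minimal sketch of this idea follows, assuming that the "context" of a stopword is the immediately following term and that the two are joined with an underscore. The stopword list, the function name contextual_entries, and the handling of a stopword at the very end of the text (indexed on its own here) are assumptions made for illustration; the source does not specify them in this excerpt.

```python
# Hypothetical stopword list, for illustration only.
STOPWORDS = {"for", "to", "in", "of"}


def contextual_entries(doc_id, text, stopwords=STOPWORDS):
    """Return (term, "doc_id#position") pairs; positions start at 1.

    A stopword is indexed together with the term that follows it;
    all other terms are indexed unchanged.
    """
    tokens = text.split()
    entries = []
    for i, token in enumerate(tokens):
        pos = i + 1
        if token.lower() in stopwords and i + 1 < len(tokens):
            # Join the stopword with its right-hand neighbour.
            term = f"{token}_{tokens[i + 1]}"
        else:
            # Non-stopwords (and a trailing stopword) are kept as-is.
            term = token
        entries.append((term, f"{doc_id}#{pos}"))
    return entries


text = ("Use IBM Intelligent Miner for Text to discover new knowledge "
        "in huge amounts of textual data")
entries = contextual_entries(4711, text)
```

    Because the combined term occurs far less often than the bare stopword, its document/position list stays short, while the position of the stopword itself is still recorded.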

Example: Consider the text with document ID 4711 (numbers in [ ] are the term positions in the text): "Use[1] IBM[2] Intelligent[3] Miner[4] for[5] Text[6] to[7] discover[8] new[9] knowledge[10] in[11] huge[12] amounts[13] of[14] textual[15] data[16]"

    A standard indexer will calculate the following index entries:
use 4711#1
IBM 4711#2
Intelligent 4711#3
Miner 4711#4
for 4711#5
Text 4711#6
to 4711#7
discover 4711#8
new 4711#9
knowledge 4711#10
in 4711#11
huge 4711#12
amounts 4711#13
of 4711#14
textual 4711#15
data 4711#16

    In combination with the information from other documents, the document/position lists for "for", "to", ... will be huge.

    Our approach will create the following index entries:
use 4711#1
IBM 4711#2
Intelligent 4711#3
Miner 4711#4
for_Text 4711#5
Text 4711#6
to_discover 4711#7
discover 4711#8
new 4711#9
knowledge 4711#10
in_huge 4711#11
huge 4711#12
amo...