
A Method and Apparatus of Automatic Stopwords Detection for a Specific Topic

IP.com Disclosure Number: IPCOM000190168D
Original Publication Date: 2009-Nov-19
Included in the Prior Art Database: 2009-Nov-19
Document File: 2 page(s) / 155K

Publishing Venue

IBM

Abstract

People today face an information explosion, and many applications and technologies, such as search, text clustering, and information extraction, are designed to extract valuable information for users. All of these technologies encounter the same problem: how to filter the noise out of the information. Here we narrow the problem down: how can stopwords be detected for a specific topic or corpus? A topic is specified by keywords or by a manually collected corpus. At present, manually building a stopword dictionary is the common approach in most applications. In addition to the dictionary, some computational approaches, such as tf-idf and language models, are applied for further filtering. But the problem is not solved well.

An example illustrates the point. First, we specify a topic, “The air bus plane of Air France is missing”. Assume that the corpus is collected from a single source, and that the terms “Sina News”, “Air France”, “Air Bus”, and “Missing” occur in every document. Clearly, “Sina News” is a stopword while “Air France”, “Air Bus”, and “Missing” are not. Approaches such as tf-idf or a language model cannot make this distinction: since every one of these terms appears in all documents, document-frequency statistics treat them alike.

So the key problems are clear:

a) Manually listed stopwords are not enough for most text applications, especially for a specific topic. How can topic-specific stopwords like the one in the example above be detected automatically?

b) How can global information (distribution) be used to help judge whether a word is a stopword?

Main Idea

Our invention proposes a system and method to automatically detect global (topic- or corpus-specific) stopwords by building a word graph over a specific corpus. The main idea is to use the global distribution, represented by a graph, to detect stopwords, especially topic-specific ones. This differs from previous local distribution computations such as tf-idf.

This is the abbreviated version, containing approximately 63% of the total text.


Architecture (see Fig 1):

The following are the key components of the system and their functions:

Fig.1. The whole process of the invention.

Tokenizer

A component that allows an application to break a document into tokens (words).
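As an illustration only, a tokenizer of this kind might look like the sketch below. The function name `tokenize` and the regular-expression approach are assumptions, not part of the disclosure. Languages written without spaces between words (such as Chinese) would need a proper word segmenter instead.

```python
import re

def tokenize(document):
    """Hypothetical helper: split a document into lowercase word tokens.

    Assumes words are separated by whitespace or punctuation.
    """
    return re.findall(r"\w+", document.lower())

tokens = tokenize("Air France: the plane is missing!")
```

Here `tokens` holds the words of the sentence in order, with punctuation dropped.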

Word Association Analyzer

A component to analyze the distribution of all words in each document list. Implementation details:

Given a keyword $w_j$, there are $K$ documents $doc_1^j, doc_2^j, \ldots, doc_K^j$ that contain it. For each word $w_i$ where $i \neq j$, we can compute a link weight $C_{ij}$ from $w_i$ to $w_j$. Thus a transition matrix $A$ can be built that describes the pairwise link weights between words in the vocabulary. Assuming there are $M$ words in the vocabulary, the dimensions of the matrix are $M \times M$. In this matrix, if a word $w_i$ has a link to another word $w_j$, the element at $(i, j)$ is set to the link weight $C_{ij}$; otherwise it remains 0.
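The construction of the M x M matrix can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `build_transition_matrix` is hypothetical, and the link weight used here (the fraction of the K documents containing w_j that also contain w_i) is only a placeholder, not the link weight defined by the disclosure.

```python
def build_transition_matrix(docs):
    """docs: list of token lists. Returns (vocab, A) with A[i][j] = C_ij.

    The weight is a placeholder assumption: the share of the K documents
    containing w_j in which w_i also occurs.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {word: n for n, word in enumerate(vocab)}
    m = len(vocab)
    A = [[0.0] * m for _ in range(m)]      # entries stay 0 where no link exists
    doc_sets = [set(doc) for doc in docs]

    for w_j in vocab:
        j = index[w_j]
        docs_j = [d for d in doc_sets if w_j in d]   # the K documents with w_j
        k = len(docs_j)
        for w_i in vocab:
            if w_i == w_j:
                continue                   # links are defined only for i != j
            co_occurrences = sum(1 for d in docs_j if w_i in d)
            if co_occurrences:
                A[index[w_i]][j] = co_occurrences / k
    return vocab, A

vocab, A = build_transition_matrix([["a", "b"], ["a", "c"]])
```

A dense list of lists keeps the sketch short. For a realistic vocabulary a sparse representation would likely be preferable, since most word pairs never co-occur.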

The question now is how to compute $C_{ij}$. For word $w_i$ in the $K$ documents $doc_1^j, doc_2^j, \ldots, doc_K^j$, the joint distribution, distance, frequency, and some other parameters for the two words $w_i$ and $w_j$ can be calculated, so the link weight $C_{ij}$ can be a combination of all these parameters. A gappy bigram, which allows gaps between two words, is such a word pair, with link weight:

$C_{ij} = C(w_i \mid w_j) = P(w_i \mid w_j) \times e^{[\ldots]}$,

[the exponent, which involves constants $c$, is illegible in the extracted text]
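To make the gappy-bigram idea concrete, the sketch below counts ordered word pairs that may be separated by a few intervening tokens, then estimates the conditional probability with the standard definition P(w_i | w_j) = P(w_i, w_j) / P(w_j). The names `gappy_bigrams`, `conditional_probs`, and `max_gap`, the window-based pair definition, and the count-based estimates are all assumptions made for illustration. The exponential factor of the link weight is not modelled.

```python
from collections import Counter

def gappy_bigrams(tokens, max_gap=2):
    """Yield ordered pairs (w_i, w_j) where w_j follows w_i with at most
    max_gap tokens in between (max_gap=0 gives ordinary bigrams)."""
    for pos, w_i in enumerate(tokens):
        for w_j in tokens[pos + 1 : pos + 2 + max_gap]:
            yield (w_i, w_j)

def conditional_probs(docs, max_gap=2):
    """Estimate P(w_i | w_j) = P(w_i, w_j) / P(w_j) from gappy-bigram counts."""
    pair_counts = Counter()
    for tokens in docs:
        pair_counts.update(gappy_bigrams(tokens, max_gap))
    total = sum(pair_counts.values())

    # Marginal count of w_j as the second element of any pair.
    marginal = Counter()
    for (_, w_j), n in pair_counts.items():
        marginal[w_j] += n

    return {
        (w_i, w_j): (n / total) / (marginal[w_j] / total)
        for (w_i, w_j), n in pair_counts.items()
    }

probs = conditional_probs([["air", "france", "missing"]], max_gap=1)
```

Estimating both the joint and the marginal probability from the same pair counts guarantees that, for any fixed w_j, the conditional probabilities sum to 1.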


where

$P(w_i \mid w_j) = \frac{P(w_i, w_j)}{P(w_j)}$,

$f($ ...