A Method and System of Topic Words Detection Based on Image Evidence

IP.com Disclosure Number: IPCOM000191333D
Original Publication Date: 2009-Dec-30
Included in the Prior Art Database: 2009-Dec-30
Document File: 4 page(s) / 151K

Publishing Venue

IBM

Abstract

Clustering is a common way to mine deeper subtopics from a large set of documents, and many algorithms have been developed to achieve good clustering results. In addition, each cluster needs a small set of representative keywords that let the reader grasp its meaning quickly. The number of top keywords must stay small: screen space is limited, and a long list (e.g. 10 keywords) obscures what the cluster is actually about, so the reader would rather read the original documents first, and the clustering loses its practical value. Selecting the top keywords is therefore very important, but it is not easy. Tf-idf, named entities, and language models can be used to generate the top keywords, but challenges remain.

An example illustrates the problem. One cluster is about: "Suntech (STP) tonight said that seven of its employees were hurt on November 28 in an explosion accident in its module production facilities in Wuxi, China. The solar company said the injured have been hospitalized. Cause of the accident is under investigation by the responsible government department...". About 20 news articles are related and included in this cluster. After applying tf-idf, language models, and some other ranking methods, "accident" and "explosion" receive close scores. So the problem is clear: which one is the better representative?

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 62% of the total text.

The whole process of selecting top words by analyzing images is shown in the following diagram:

Figure 1. The whole process of our invention

The following are the key components of the system and their functions:

Image evidence collection:

Figure 2. Image collection (part) on "Explosion"

Step 1. Search an image search engine, using the keywords as query words.

Step 2. Extract salient features from each image. The objective of this step is to form the feature space for the subsequent calculations.
- A range of image processing and feature detection algorithms can be applied, for example:
  - local feature descriptors (such as texture feature detectors, SIFT descriptors, DoG features, wavelet filtering, etc.)
  - color feature descriptors (such as hue-saturation, color histograms, etc.)
- Form a comprehensive image descriptor for each image.
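
As an illustration of one of the color feature descriptors named in Step 2, the following minimal sketch builds a normalized per-channel color histogram. It is an assumption-laden example, not the disclosed implementation: the function name `color_histogram`, the pixel representation (a flat list of RGB tuples), and the `bins` parameter are all hypothetical choices made for this sketch.

```python
from typing import List, Sequence, Tuple

# Assumption: an image is given as a flat sequence of (R, G, B) pixels,
# each channel value in the range 0..255.
Pixel = Tuple[int, int, int]


def color_histogram(pixels: Sequence[Pixel], bins: int = 4) -> List[float]:
    """Return a normalized per-channel color histogram of length 3 * bins.

    Hypothetical helper: the disclosure names color histograms as one
    possible descriptor but does not specify how they are computed.
    """
    if not pixels:
        raise ValueError("image has no pixels")
    hist = [0.0] * (3 * bins)
    bin_width = 256 / bins
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            # Clamp so that the value 255 falls into the last bin.
            index = min(int(value / bin_width), bins - 1)
            hist[channel * bins + index] += 1.0
    total = 3 * len(pixels)
    # Normalize so that the descriptor sums to 1.
    return [count / total for count in hist]


# Toy usage: a 4-pixel image that is pure red.
red_image = [(255, 0, 0)] * 4
descriptor = color_histogram(red_image, bins=2)
```

A comprehensive descriptor in the sense of Step 2 could concatenate several such vectors (for example, a color histogram and a local feature summary); how the disclosure combines them is not stated in the available text.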

Step 3. Construct the pair-wise distance matrix of the images, dist_D = [dist(I_i, I_j)], i, j = 1, ..., n, where dist is a distance measure (e.g. KL divergence, cosine distance), D is the image cluster, and n is the total number of images in D.
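
The pair-wise distance matrix of Step 3 can be sketched as follows, using cosine distance, one of the two measures the text mentions. This is only a sketch under stated assumptions: each image is assumed to be already represented by a numeric descriptor vector of equal length, and the function names `cosine_distance` and `pairwise_distances` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, 1 - cos(angle), between two descriptor vectors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def pairwise_distances(descriptors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build the n x n matrix whose entry (i, j) is dist(I_i, I_j)."""
    n = len(descriptors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Cosine distance is symmetric, so compute the upper triangle
        # and mirror it; the diagonal stays at 0.
        for j in range(i + 1, n):
            d = cosine_distance(descriptors[i], descriptors[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Toy usage: three 2-dimensional descriptors (illustrative values only).
toy_descriptors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
dist_matrix = pairwise_distances(toy_descriptors)
```

KL divergence, the other measure named in the text, is not symmetric, so with it the full matrix would have to be computed rather than mirrored.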

Step 4. Calculate the distance of each image I_i to the cluster D.

dist_D(I...