
System and Method of Semantic Similarity Based Fuzzy Vector Space Model Ranking
Disclosure Number: IPCOM000240872D
Publication Date: 2015-Mar-09
Document File: 3 page(s) / 64K

Publishing Venue

The Prior Art Database


A Vector Space Model (VSM) based similarity scorer is a widely used tool in many application systems (e.g., information retrieval and word sense disambiguation). A traditional search framework uses the vector space model to compare the word sequences of a query and a document, and returns the document with the highest ranking score. This method is based on one-hot representation and only considers exactly matched words, so it sometimes cannot capture the full meaning of a document. By using distributed word representation, our new system provides a method to overcome this weakness of one-hot representation and improve scoring performance in some applications.

This is the abbreviated version, containing approximately 52% of the total text.


System and Method of Semantic Similarity Based Fuzzy Vector Space Model Ranking

To give a relation score between two documents for ranking, traditional methods add up the scores of all the words matched between them. Because they only consider exactly matched words, these approaches do not make full use of the information in the query. We propose a new approach that tries to consider more words in the query by using similarity scores between words. Distributed word representation is one of the state-of-the-art tools for measuring word-to-word similarity. We use it to represent the similarity between words and improve the performance of traditional models.

In traditional approaches, to calculate the similarity between a query and a document, we first find all the words in the query that occur in the document, then add up the scores of these words. The TF*IDF score is one of the most widely used scores, and a normalization step follows. The algorithm is illustrated below:

SCORE(query, document):
    total_score = 0;
    normq = 0;
    normd = sqrt(document.length); // number of words in document
    for each word in query:
        tmp_score = TF(word, document) * IDF(word)^2;
        normq += IDF(word)^2;
        total_score += tmp_score;
    normq = sqrt(normq);
    total_score /= normd * normq;
    return total_score;
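The scoring procedure above can be sketched in Python as follows. This is a minimal illustration rather than the disclosure's implementation: TF is assumed to be a raw term count, the smoothed IDF formula in the hypothetical helper `idf` is one common variant chosen for the example, and documents are assumed to be pre-tokenized lists of words.

```python
import math
from collections import Counter


def idf(word, corpus):
    """Smoothed inverse document frequency (an assumed variant, not from the source)."""
    doc_freq = sum(1 for doc in corpus if word in doc)
    return math.log((1 + len(corpus)) / (1 + doc_freq)) + 1.0


def score(query, document, corpus):
    """Exact-match TF*IDF score of a tokenized document for a tokenized query."""
    counts = Counter(document)            # TF assumed to be the raw term count
    norm_d = math.sqrt(len(document))     # document length normalization
    norm_q = 0.0
    total = 0.0
    for word in query:
        weight = idf(word, corpus) ** 2
        total += counts[word] * weight    # zero when the word is absent
        norm_q += weight
    norm_q = math.sqrt(norm_q)
    if norm_d == 0 or norm_q == 0:
        return 0.0                        # empty query or empty document
    return total / (norm_d * norm_q)


corpus = [["red", "car"], ["fast", "vehicle"]]
print(score(["car"], corpus[0], corpus))  # positive: 'car' occurs in the document
print(score(["car"], corpus[1], corpus))  # zero: no exact match for 'car'
```

With this toy corpus, the second document scores zero for the query 'car' because only exact matches contribute.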

If a query word does not appear in the document, TF(word, document) is zero; that is, the word is not taken into account. Yet even when a document does not contain exactly the same words, words with very close meanings may help.

For example, suppose 'car' is one of the words in the query and 'vehicle' is in the document. Under the traditional method, the contribution of 'car' is zero, although 'car' and 'vehicle' are synonyms.
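The contrast between the two representations in this example can be illustrated with a short sketch. The three-word vocabulary and the three-dimensional vectors in `toy_embeddings` are made up for illustration only; real distributed representations would come from a trained model and have far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vocab = ["car", "vehicle", "banana"]


def one_hot(word):
    """One-hot vector: 1 at the word's own vocabulary index, 0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in vocab]


# Hypothetical dense vectors, invented for this example.
toy_embeddings = {
    "car": [0.9, 0.8, 0.1],
    "vehicle": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

# Distinct words are always orthogonal under one-hot representation.
print(cosine(one_hot("car"), one_hot("vehicle")))
# Dense vectors can place related words close together.
print(cosine(toy_embeddings["car"], toy_embeddings["vehicle"]))
print(cosine(toy_embeddings["car"], toy_embeddings["banana"]))
```

Under one-hot representation, any two distinct words have similarity zero, which is why 'car' contributes nothing when the document only contains 'vehicle'. A dense representation can assign the pair a high similarity while keeping unrelated words apart.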

The key idea of our approach is to make full use of the information provided by the query, using semantic similarity to find as many "matched" words as possible. We will consider the similarity of 'car' and 'vehicle' in the ex...