A method of focusing relevancy ranking by using context dependent terms weights

IP.com Disclosure Number: IPCOM000010284D
Original Publication Date: 2002-Nov-15
Included in the Prior Art Database: 2002-Nov-15
Document File: 2 page(s) / 337K

Publishing Venue

IBM

Abstract

Disclosed is a novel method of calculating document relevancy scores in information retrieval systems. The novel relevancy score calculation is based on context-dependent weights of query terms. This method allows the document relevancy score calculation to be tuned to the identified query context, as well as to the user profile.

This is the abbreviated version, containing approximately 50% of the total text.

  Traditional full-text search systems use the so-called TF-IDF formula to calculate document relevancy scores and rankings (see [1]). Under this method, the relevancy score S of a document with respect to a search term T is calculated as S(T) = F(T) * log2[ N / n(T) ], where F(T) is the frequency of term T in the document, N is the number of documents in the collection, and n(T) is the number of documents in which term T appears at least once. This method works well for public web-based search systems, but it may not work well for corporate search systems, which rely on strong classification of documents based on proprietary corporate taxonomies. The following example illustrates some deficiencies of this method in a corporate environment.
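The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular search system: the function name `tf_idf_score`, the representation of documents as lists of tokens, and the choice to return 0.0 for a term that does not occur are all assumptions made for the example.

```python
import math
from typing import Sequence


def tf_idf_score(term: str,
                 document: Sequence[str],
                 collection: Sequence[Sequence[str]]) -> float:
    """Return S(T) = F(T) * log2(N / n(T)) for one term and one document.

    `document` and every entry of `collection` are assumed to be
    already tokenized (hypothetical representation for this sketch).
    """
    frequency = document.count(term)                               # F(T)
    total_docs = len(collection)                                   # N
    docs_with_term = sum(1 for doc in collection if term in doc)   # n(T)
    if frequency == 0 or docs_with_term == 0:
        # Assumption: a term that is absent contributes nothing.
        return 0.0
    return frequency * math.log2(total_docs / docs_with_term)
```

For instance, in a four-document collection where a term appears in two documents, log2(4 / 2) = 1, so the score of a document equals the number of times the term occurs in it.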

Example: consider a search system that contains 500,000 documents, of which 1000 documents contain term T2 - "WAS" (WebSphere Application Server), and 10 of those documents also contain term T1 - "NoResourceException". Assume that there is one document (A) that contains 2 occurrences of term T1 and 10 occurrences of term T2, and one document (B) that contains 1 occurrence of term T1 and 1 occurrence of term T2. A user submits the following query: "NoResourceException thrown while running WAS". The system calculates the scores of documents A and B for terms T1 and T2 as follows:
Score_A(T1) = 2*log2(500000/10) = 31.2; Score_A(T2) = 10*log2(500000/1000) = 89.7; Score_B(T1) = 1*log2(500000/10) = 15.6; Score_B(T2) = 1*log2(500000/1000) = 9.0. According to these scores, the search system puts document A at the top of the hit-list, while document B goes to the bottom. Now, assume that document A contains a customer problem report, and document B contains a link to the patch that should be applied to resolve the problem reported by the customer. In a traditional search system the user will most likely never open document B, because it does not appear at the top of the hit-list.
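The arithmetic in this example can be checked with a short script. The sketch below plugs the counts from the example into S(T) = F(T) * log2[ N / n(T) ]; evaluating the formula gives roughly 31.2 for Score_A(T1), 89.7 for Score_A(T2), 15.6 for Score_B(T1) and 9.0 for Score_B(T2). The names used here are illustrative, and summing the per-term scores into one document score is an assumption, since the text does not say how the per-term scores are combined.

```python
import math

N = 500_000                                  # documents in the collection
DOCS_WITH_TERM = {"T1": 10, "T2": 1_000}     # n(T) for each query term
TERM_FREQ = {                                # F(T) per document
    "A": {"T1": 2, "T2": 10},
    "B": {"T1": 1, "T2": 1},
}


def term_score(doc: str, term: str) -> float:
    """S(T) = F(T) * log2(N / n(T)) for one document and one term."""
    return TERM_FREQ[doc][term] * math.log2(N / DOCS_WITH_TERM[term])


def total_score(doc: str) -> float:
    """Assumption: the document score is the sum of its per-term scores."""
    return sum(term_score(doc, term) for term in TERM_FREQ[doc])


if __name__ == "__main__":
    for doc in ("A", "B"):
        for term in ("T1", "T2"):
            print(f"Score_{doc}({term}) = {term_score(doc, term):.1f}")
        print(f"Total_{doc} = {total_score(doc):.1f}")
```

Under the summation assumption, document A scores about 120.9 and document B about 24.6, so A outranks B by a factor of nearly five.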

The proposed method changes the way the search system calculates document relevancy scores by introducing context-dependent weights for search terms. Document scores for the given search terms are calculated using weights assigned to these terms according to their salience in the given context. If, in the example above, the document scores were calculated using weights assigned to terms T1 and T2 in the context of 'WAS run-time exceptions', then documents A and B would have similar scores, because WC(T1) >> WC(T2), where WC(T) is the weight of term T in the given context C. To assign...
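The effect of context-dependent weights can be illustrated with a rough sketch. The abbreviated text does not show how the weights are assigned or how they enter the score, so everything specific here is an assumption: the weight values (1.0 for T1, 0.01 for T2) are hypothetical, and multiplying each term's TF-IDF score by WC(T) before summing is only one plausible combination rule, not necessarily the one the disclosure uses.

```python
import math
from typing import Mapping

N = 500_000                                  # documents in the collection
DOCS_WITH_TERM = {"T1": 10, "T2": 1_000}     # n(T) for each query term
TERM_FREQ = {                                # F(T) per document
    "A": {"T1": 2, "T2": 10},
    "B": {"T1": 1, "T2": 1},
}

# Hypothetical weights for the context 'WAS run-time exceptions',
# chosen only so that WC(T1) >> WC(T2).
CONTEXT_WEIGHTS = {"T1": 1.0, "T2": 0.01}
UNIFORM_WEIGHTS = {"T1": 1.0, "T2": 1.0}     # reproduces plain TF-IDF


def weighted_score(doc: str, weights: Mapping[str, float]) -> float:
    """Assumed rule: sum over terms of WC(T) * F(T) * log2(N / n(T))."""
    return sum(
        weights[term] * freq * math.log2(N / DOCS_WITH_TERM[term])
        for term, freq in TERM_FREQ[doc].items()
    )


def score_ratio(weights: Mapping[str, float]) -> float:
    """How many times higher document A scores than document B."""
    return weighted_score("A", weights) / weighted_score("B", weights)


if __name__ == "__main__":
    print(f"A/B ratio, plain TF-IDF:    {score_ratio(UNIFORM_WEIGHTS):.2f}")
    print(f"A/B ratio, context weights: {score_ratio(CONTEXT_WEIGHTS):.2f}")
```

With these assumed weights the gap between A and B narrows from a factor of about 4.9 to about 2.0, because the frequent but uninformative term T2 no longer dominates. Under this particular rule the ratio cannot drop below the T1 frequency ratio of 2, so a different combination rule or additional signals would be needed to bring the two scores closer still.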