Density function for query terms
Original Publication Date: 2000-Mar-01
Included in the Prior Art Database: 2003-Jun-19
Many search engines determine a set of documents that satisfy a specific query. In general, the words in the query are submitted to the search engine as a bag of words, and a hit list is returned. Such a hit list needs to be rank ordered, with the most relevant documents at the top. Various scoring algorithms have been suggested in the past. In this paper, we propose a new scoring algorithm which in general will rank the most relevant documents, or fragments of documents, at the top.

First of all, the elements in the hit list are not necessarily whole documents; they could be passages from documents. A passage consists of a predetermined (but user-settable) number of sentences in a document. To score a passage or a document, any scoring algorithm deemed relevant to the particular application can be used. The individual words could have weights associated with them, preset by the user (or by another part of the system), or the frequency of occurrence of the terms could be used. However, with any scoring algorithm, several documents or passages may end up with the same score, so a mechanism is needed to resolve the ranking of equally scored documents or passages.

We propose to use a density function to break such ties. For each element in the hit list, a density function is computed; in its simplest form it is the reciprocal of the number of terms between the first and the last occurrence, in that element, of a term from the "bag of words" submitted as the query. The larger the density, the higher the document or passage should be ranked. The reasoning is that query terms which occur close together most likely address the same topic, whereas terms which are widely distributed over a document may not relate to the same topic and hence may not really satisfy the query.
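The tie-breaking idea can be illustrated with a minimal Python sketch. It is not a definitive implementation: the function names `density` and `rank`, the whitespace tokenization, and the sample hits are all hypothetical. The sketch also assumes that the span between the first and last query-term occurrence is counted inclusively, so that a single hit or two adjacent hits do not cause a division by zero, and that an element containing no query term has density zero. The source specifies neither detail.

```python
from typing import List, Sequence, Set, Tuple


def density(tokens: Sequence[str], query_terms: Set[str]) -> float:
    """Reciprocal of the span covering the first and last query-term hits.

    Assumption: the span is counted inclusively (last - first + 1), so it
    is always at least 1 when any query term is present.
    """
    positions = [i for i, tok in enumerate(tokens) if tok in query_terms]
    if not positions:
        return 0.0  # assumption: no query term present means zero density
    span = positions[-1] - positions[0] + 1
    return 1.0 / span


def rank(hits: List[Tuple[str, float, List[str]]], query_terms: Set[str]) -> List[str]:
    """Order hits by primary score, then by density, both descending.

    Each hit is (identifier, primary score, tokens). The primary score may
    come from any scoring algorithm; density only separates equal scores.
    """
    ordered = sorted(
        hits,
        key=lambda h: (h[1], density(h[2], query_terms)),
        reverse=True,
    )
    return [h[0] for h in ordered]


query = {"density", "function"}
hits = [
    ("spread", 2.0, "density is discussed here and much later a function appears".split()),
    ("tight", 2.0, "the density function breaks ties".split()),
    ("top", 3.0, "function density function".split()),
]
print(rank(hits, query))  # ['top', 'tight', 'spread']
```

Here "tight" and "spread" share the same primary score, and "tight" is ranked higher because its query terms are adjacent (density 1/2) rather than nine terms apart (density 1/9). The element with the higher primary score stays at the top regardless of density.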