
Measuring Topical Coverage and Quality of Document Collections via Relevance Testing

IP.com Disclosure Number: IPCOM000235649D
Publication Date: 2014-Mar-17
Document File: 4 page(s) / 649K

Publishing Venue

The IP.com Prior Art Database

Related People

Bruce T. Smith: AUTHOR [+4]

Abstract

We propose extending standard search relevance testing to measure the relative topical coverage and quality of document collections. Search relevance testing has become an industry best practice for measuring the quality of search engines’ ranking functions. Our method reuses the human judgments on individual query-result pairs—the most expensive part of relevance testing—to derive measurements on document sets.



Detailed Description:

Relevance tests based on human judgments of query-result pairs have become a standard tool for tuning the ranking functions of search engines. Common metrics for measuring search engine quality include Cumulative Gain (CG) and Discounted Cumulative Gain (DCG). DCG uses the position of a result in the result set to "discount" that result's value. When order does not matter, the simpler CG, which omits this discounting, may be preferred.
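In one common formulation (not spelled out in this disclosure; other variants exist), if rel_i is the graded relevance judgment of the result at rank i and p is the result depth, then

    CG_p  = \sum_{i=1}^{p} rel_i
    DCG_p = \sum_{i=1}^{p} rel_i / \log_2(i + 1)

so a relevant result contributes less to DCG the lower it appears in the ranking, while CG ignores position entirely.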

A typical search relevance test might follow these steps:


1. Obtain a sample of queries, Q, from search logs.

2. Obtain the top N results for each query from a search engine. This yields a set of individual query-result pairs.

3. Obtain relevance judgments (from people) for each query-result pair. These judgments are typically on a scale such as Poor = 1, Good = 2, Excellent = 3.

4. Combine the relevance judgments and the result ordering to compute the average DCG over the queries, down to some result depth (a minimal computation is sketched after this list). The result depth is typically chosen to match how many results the search engine's users will actually see.
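As a concrete illustration of step 4, the sketch below computes an average DCG from per-query judged results. The function names, the example queries, the 1-3 judgment scale, and the log2-discounted form of DCG are assumptions for illustration only; the disclosure does not prescribe them.

    from math import log2

    def dcg(grades, depth):
        # grades: graded relevance judgments for one query's results, in ranked
        # order (e.g. Poor = 1, Good = 2, Excellent = 3, as in step 3).
        return sum(rel / log2(rank + 1)
                   for rank, rel in enumerate(grades[:depth], start=1))

    def average_dcg(judged, depth):
        # judged: mapping from query to its graded result list; returns the
        # mean DCG over the query sample Q.
        return sum(dcg(grades, depth) for grades in judged.values()) / len(judged)

    # Hypothetical judged sample (the output of step 3) for three queries, depth 3.
    judged = {
        "laptop battery life": [3, 2, 1],
        "reset vpn password":  [2, 3, 1],
        "expense report form": [1, 1, 2],
    }
    print(average_dcg(judged, depth=3))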

DCG scores are only meaningful for making comparisons, so this is commonly extended to a comparative test between two search engines by modifying step 2 to obtain results for the queries on both search engines. Then, an average DCG for the queries in Q is computed for each search engine, and these values are compared. The search engine with the higher DCG is judged to be better at ranking results. This is shown in Figure 1.
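Continuing the sketch above, and again with hypothetical data, the comparative test reduces to computing an average DCG per engine over the same query sample and comparing the two values (average_dcg is the helper defined earlier):

    # Hypothetical judged collections for the same queries, one per engine,
    # produced by running step 2 against engines A and B.
    judged_a = {"laptop battery life": [3, 2, 1], "reset vpn password": [2, 3, 1]}
    judged_b = {"laptop battery life": [2, 1, 1], "reset vpn password": [3, 2, 2]}

    if average_dcg(judged_a, depth=3) > average_dcg(judged_b, depth=3):
        print("Engine A ranks better on this query sample")
    else:
        print("Engine B ranks better on this query sample")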

We propose modifying this process to compare two document sets, as follows:


1. Obtain a sample of queries, Q, from search logs.

2. Obta...