A Search Engine Evaluation Tool using Document Titles Extraction

IP.com Disclosure Number: IPCOM000011413D
Original Publication Date: 2003-Feb-19
Included in the Prior Art Database: 2003-Feb-19
Document File: 4 page(s) / 22K

Publishing Venue

IBM

Abstract

Disclosed is a tool for automating the performance evaluation of text search engines by extracting titles of documents.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 50% of the total text.

THIS COPY WAS MADE FROM AN INTERNAL IBM DOCUMENT AND NOT FROM THE PUBLISHED BOOK

JP820020834 Koichiro Kato/Japan/IBM Katsuhiko Masuda

Key point of the disclosure: The tool offers a means of automatically evaluating the performance of text search engines. It extracts the titles that accompany documents and uses them as search keywords and/or sentences. This makes it possible to repeat the evaluation work on a large scale.

Issues resolved by the tool: Various indices represent the quality of a search engine (Precision and Recall are among the most common). However, all of them depend on human work: a person must judge which documents count as correct answers and which reference sentences are proper for the search. A great deal of human work is therefore required to remove subjective aspects and obtain a sufficiently accurate evaluation. Because the tool automates the evaluation work, search engine developers and users can perform large-scale and frequent evaluations. Developers get quick feedback on their implementation, and users can adjust the search engine to a specific application domain.

How does the tool solve the problems: A document is commonly accompanied by a title that directly expresses its contents. In electronically processed documents, the title is usually structured data kept separate from the document content. It is therefore easy for a software tool to extract the title, build search keyword(s) and/or sentence(s) from it, and perform a search. The tool can then calculate the recall (Recall) and accuracy (Precision) of the search from the ranking of the documents in the obtained search result list.
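The procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the document shape (`id`, `title`, `body`), the `SearchFn` signature, and the word-overlap engine `make_toy_search` are hypothetical stand-ins, not part of the disclosed tool or of any real search engine API.

```python
from typing import Callable, Dict, List, Optional

Document = Dict[str, str]              # hypothetical shape: id, title, body
SearchFn = Callable[[str], List[str]]  # query text -> ranked list of document ids


def rank_of_source(doc: Document, search: SearchFn) -> Optional[int]:
    """Search with the document's own title and return the 1-based rank
    of that document in the result list, or None if it is not returned."""
    results = search(doc["title"])
    return results.index(doc["id"]) + 1 if doc["id"] in results else None


def make_toy_search(corpus: List[Document]) -> SearchFn:
    """Stand-in engine (assumption): ranks documents by the number of
    query words that also occur in the document body."""
    def search(query: str) -> List[str]:
        words = set(query.lower().split())
        scored = []
        for d in corpus:
            overlap = len(words & set(d["body"].lower().split()))
            if overlap > 0:
                scored.append((overlap, d["id"]))
        # Highest overlap first; ties broken by id so the order is deterministic.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in scored]
    return search


corpus = [
    {"id": "d1", "title": "disk cache tuning",
     "body": "tuning the disk cache improves throughput"},
    {"id": "d2", "title": "network latency",
     "body": "measuring network latency on a disk array"},
    {"id": "d3", "title": "printer setup",
     "body": "how to install the driver"},
]
search = make_toy_search(corpus)

# One search per document: the title is the query, the document is the expected answer.
ranks = {d["id"]: rank_of_source(d, search) for d in corpus}
```

Because the expected answer of each search is known in advance (the document the title came from), no human relevance judgement is needed to score the result list.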

One example of a definition of Recall and Precision is as follows. The definition assumes that the document from which the title was extracted is the correct answer for the search.

Recall = Na/Nq
Precision = Ra/Nq

Where,

Nq is the number of searches performed.
Na is the number of those Nq searches in which the most conforming document is found.
Ra is the conformance ratio value of the document which the search engine returns.
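A small worked example of Recall = Na/Nq, as a sketch only. It assumes each search is summarized by the 1-based rank of the correct document (or `None` when it is not returned). How strictly "found" is interpreted is left open by the text, so the hypothetical `top_k` parameter covers both readings: anywhere in the result list, or within the first `top_k` results. Precision = Ra/Nq is not shown because Ra depends on the conformance values reported by a particular engine.

```python
from typing import List, Optional


def recall(ranks: List[Optional[int]], top_k: Optional[int] = None) -> float:
    """Recall = Na / Nq.

    ranks[i] is the 1-based rank of the correct document in the i-th
    search, or None when the engine did not return it.
    top_k (assumption): if given, a search only counts toward Na when
    the correct document is ranked within the first top_k results.
    """
    nq = len(ranks)
    if nq == 0:
        raise ValueError("at least one search is required")
    na = sum(1 for r in ranks
             if r is not None and (top_k is None or r <= top_k))
    return na / nq


ranks = [1, 3, None, 2]          # four searches, one miss
found_anywhere = recall(ranks)   # 3 of 4 searches found the document
found_first = recall(ranks, top_k=1)  # only 1 of 4 ranked it first
```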

The user may use another definition. For example, the following definition of Precision uses the conformance ranking order.

Precision = Sum((Ri - Ai + 1) / Ri) / Nq

Where,

Sum() is the summation over all searches.
Ri is the number of documents in the result returned by the i-th search.
Ai is the rank of the correct document in the i-th search, with 1 <= Ai <= (Ri+1), and Ai = Ri+1 when the correct document is not returned.
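The rank-based definition can be checked with a few numbers. The sketch below assumes the formula is read as the sum of (Ri - Ai + 1)/Ri over all searches, divided by Nq, so that a correct document at rank 1 contributes 1 and a missing one contributes 0. The function name `rank_precision` and the handling of an empty result list (Ri = 0, where the formula is undefined) are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def rank_precision(searches: List[Tuple[int, Optional[int]]]) -> float:
    """Average of (Ri - Ai + 1) / Ri over all Nq searches.

    Each entry is (Ri, Ai): Ri is the number of returned documents and
    Ai the 1-based rank of the correct document, or None if not returned.
    """
    nq = len(searches)
    if nq == 0:
        raise ValueError("at least one search is required")
    total = 0.0
    for ri, ai in searches:
        if ri == 0:
            # Assumption: an empty result list contributes 0, since the
            # formula divides by Ri and is undefined for Ri = 0.
            continue
        if ai is None:
            ai = ri + 1  # not returned: Ai = Ri + 1, so the term is 0
        total += (ri - ai + 1) / ri
    return total / nq


# Three searches: ranked first of 10, ranked sixth of 10, not returned among 4.
example = [(10, 1), (10, 6), (4, None)]
score = rank_precision(example)  # (1.0 + 0.5 + 0.0) / 3
```

Under this reading the score stays between 0 and 1, and it drops as the correct document moves further down the result list.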

Conventional way

(Quoted from "Information Retrieval and Language Processing", author T. Tokunaga, published by University of Tokyo Publication) Ther...