Extracting Salient Keywords in a Document that belong to a Specific Context

IP.com Disclosure Number: IPCOM000010453D
Original Publication Date: 2002-Dec-03
Included in the Prior Art Database: 2002-Dec-03
Document File: 2 page(s) / 78K

Publishing Venue

IBM

Abstract

Disclosed is a novel method for extracting keywords that are salient and relevant to the documents for a specific context. The method creates a glossary of terms from a large collection of documents and re-calculates the relevancy scores of terms in the glossary for each context. The biased glossary for each context is used to extract salient keywords in the documents.

This is the abbreviated version, containing approximately 50% of the total text.

Keywords generated from documents are generally used by full-text search systems to find documents and to represent the context of each document in the search result list. The list of keywords that are relevant to a document changes according to the perspective or context of the view or search. Different contexts are associated with different lists of keywords.

The tools used to extract keywords from documents rely on a glossary of terms generated from a corpus of documents. Each glossary entry can include a canonical form, variant forms (inflections, abbreviations, misspellings, etc.), and synonyms, along with statistical information including 'confidence' levels (see [1]). When keywords are extracted from a document, only the statistics of the terms in the glossary are used to identify and rank them, so terms with a high frequency of occurrence appear at the top of the keyword list. This method works well in the general literature domain (for example, news articles) for characterizing and finding documents in public web-search systems. It may not work for corporate search systems, however, since documents in a corporate environment are already well classified according to proprietary corporate taxonomies.
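The frequency-based ranking described above can be illustrated with a minimal sketch. It is not the tool referenced in [1]: the glossary contents, the `extract_keywords` function, and the single-word tokenization are all assumptions made for illustration, and the 'confidence' statistics are left out.

```python
import re
from collections import Counter

# Hypothetical glossary: maps each surface form (inflection, abbreviation,
# misspelling, synonym) to its canonical term. Entries are illustrative only.
GLOSSARY = {
    "database": "database",
    "databases": "database",
    "db": "database",
    "datbase": "database",   # misspelling variant
    "server": "server",
    "servers": "server",
    "deadlock": "deadlock",
}

def extract_keywords(text, glossary, top_n=5):
    """Rank canonical glossary terms by how often any of their forms occur."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(glossary[t] for t in tokens if t in glossary)
    return counts.most_common(top_n)

doc = ("The DB server restarted. Databases on other servers "
       "and this server reported a deadlock.")
print(extract_keywords(doc, GLOSSARY))
# [('server', 3), ('database', 2), ('deadlock', 1)]
```

In this sketch the general term "server" outranks the more specific "deadlock" purely because it occurs more often.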

When the purpose of the search is to solve a particular problem (for example, in the product support domain) or to find documents on a specific topic, the existing method falls short in one of two ways: either most of the extracted keywords are too general to identify the documents, or the relevant terms do not score high enough to reach the top of the keyword list. Some terms that are highly relevant and specific to a document may appear only a few times in it. For the specific topic or context, these terms are very important both for characterizing the documents and for enabling the search engine to find them.
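To make the ranking problem concrete, the sketch below re-scores term counts with a per-context weight, in the spirit of the abstract's re-calculation of relevancy scores for each context. The multiplicative scoring rule, the `rank_terms` function, and the weight values are hypothetical; the abbreviated text does not give the disclosed formula.

```python
from collections import Counter

def rank_terms(term_counts, context_weights=None, default_weight=1.0):
    """Score each term as frequency * context weight (hypothetical rule)."""
    weights = context_weights or {}
    scored = {
        term: count * weights.get(term, default_weight)
        for term, count in term_counts.items()
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))

# Illustrative term frequencies for one document.
counts = Counter({"server": 3, "database": 2, "deadlock": 1})

# Assumed weights for a product-support context: boost the specific term,
# damp the general one.
SUPPORT_CONTEXT = {"deadlock": 5.0, "server": 0.5}

general = rank_terms(counts)                   # frequency only
support = rank_terms(counts, SUPPORT_CONTEXT)  # context-biased

print([term for term, _ in general])  # ['server', 'database', 'deadlock']
print([term for term, _ in support])  # ['deadlock', 'database', 'server']
```

With frequency alone, the rare but specific term "deadlock" ranks last. Under the assumed support-context weights it moves to the top.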

The disclosed method describes a glossary biasing process to create a glossary for a specific context and a process to extract salient keywords in the documents for that spe...