Computing Related Terms in a Corpus of Documents

IP.com Disclosure Number: IPCOM000019626D
Original Publication Date: 2003-Sep-23
Included in the Prior Art Database: 2003-Sep-23
Document File: 1 page(s) / 50K

Publishing Venue

IBM

Abstract

A program is disclosed that provides a scalable method of computing relations between terms in a large collection of documents.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.

    We disclose a program that computes the strength of relations between terms across a large collection of documents. To compute these relations, we first use a shallow parser to produce a parse tree for each document and to identify the major noun phrases in each sentence. We define two types of relations: named and unnamed.

Named relations are discovered on a per-document basis. They consist of abbreviations and their expansions, and of names and the appositives describing them, such as "George X, CEO of Y, said..." Here, we detect the relation name "CEO of" and the two terms "George X" and "Company Y." As each pair is discovered, it is written to a database load file, along with the relation name and the document index. In addition, the abbreviation module of the text mining software detects abbreviations and their expansions. For example, it deduces that sRNP stands for "soluble ribonucleoprotein." It then stores a "same-as" named relation in the database load table.

To compute only relations between terms of some importance, we define a salience measure that depends on:

- the total number of times a term appears in the collection
- the total number of documents in the collection
- the number of documents a term appears in
- the number of documents the term appears in more than once

We use this salience measure to eliminate terms that appear so frequently or so infrequently as to be unimportant.

Unnamed relations are more complex. We store all of the multi-word terms from noun phrases in a database load file, along with the document, paragraph, and sentence numbers and the offset within the sentence. The program this invention describes computes unnamed relatio...
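
As an illustration of the salience step described above, the following Python sketch gathers the four quantities the measure depends on and then drops terms that are too rare or too common. The disclosure does not state how the four quantities are combined, so the filter here is an assumption: the function names `term_statistics` and `salient_terms` and the thresholds `min_docs` and `max_doc_fraction` are hypothetical, chosen only to make the idea concrete.

```python
from collections import Counter


def term_statistics(docs):
    """Collect the four inputs to the salience measure.

    docs is a list of documents, each given as a list of already-extracted terms.
    """
    total_occurrences = Counter()     # times a term appears in the collection
    doc_frequency = Counter()         # documents a term appears in
    repeat_doc_frequency = Counter()  # documents a term appears in more than once
    for terms in docs:
        for term, count in Counter(terms).items():
            total_occurrences[term] += count
            doc_frequency[term] += 1
            if count > 1:
                repeat_doc_frequency[term] += 1
    # len(docs) is the total number of documents in the collection.
    return len(docs), total_occurrences, doc_frequency, repeat_doc_frequency


def salient_terms(docs, min_docs=2, max_doc_fraction=0.5):
    """Keep terms that are neither too rare nor too common.

    Assumption: a simple document-frequency band stands in for the
    salience measure, whose actual formula the disclosure does not give.
    """
    n_docs, total, doc_freq, _repeat = term_statistics(docs)
    keep = set()
    for term in total:
        if doc_freq[term] < min_docs:
            continue  # too infrequent to be important
        if doc_freq[term] / n_docs > max_doc_fraction:
            continue  # so frequent that it carries little information
        keep.add(term)
    return keep
```

In this sketch the count of documents in which a term repeats is collected but not used by the filter; a fuller measure could, for example, favor terms that recur within a document, but that weighting would again be an assumption rather than something the text specifies.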