Method for Automatic Extraction of Topics in Expository Prose

IP.com Disclosure Number: IPCOM000099291D
Original Publication Date: 1990-Jan-01
Included in the Prior Art Database: 2005-Mar-14
Document File: 3 page(s) / 118K

Publishing Venue

IBM

Related People

Jensen, K: AUTHOR [+2]

Abstract

Problem Solved: propose a computational method for finding topics of texts. This method should improve current retrieval systems (which typically use simple keyword or word-stem matching) by making it possible to index paragraphs of text by their topics.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      Problem Solved: propose a computational method for finding
topics of texts.  This method should improve current retrieval
systems (which typically use simple keyword or word-stem matching) by
making it possible to index paragraphs of text by their topics.

      Description of the Invention: In automated office environments,
mechanisms are needed to manage incoming textual data and provide
content-based access to stored databases of texts.  Current text
retrieval systems typically use simple keyword or word-stem matching.
Such matching is far from ideal, because often relevant text is not
found and a lot of text that is not of interest is retrieved.

      We propose a computational method for finding topics of texts.
The method we propose should improve text retrieval systems by making
it possible to index paragraphs of text by their topics.

      Notation: By a "lexical unit" we understand a word or a phrase
(a phrase like "rational numbers," "typed [...]," "make up smbd's
mind," etc.)

      For a list of lexical units X, let BK(X) denote a collection of
background knowledge about X: for instance, definitions of elements
of X.  X should not contain auxiliary words like "be," "have," "may"
...
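
      As a rough illustration of this notation, the following Python
sketch treats BK(X) as a lookup of dictionary definitions and filters
auxiliary words out of X.  The names AUXILIARIES, TOY_DICTIONARY,
drop_auxiliaries and background_knowledge, and the single dictionary
entry, are assumptions made for the example; the disclosure does not
say how background knowledge is stored.

```python
# Toy data for illustration only; neither set comes from the disclosure
# beyond the three auxiliaries it names ("be", "have", "may").
AUXILIARIES = {"be", "have", "may"}

TOY_DICTIONARY = {
    "rational numbers": "numbers expressible as a ratio of two integers",
}


def drop_auxiliaries(units):
    """Return the lexical units of X with auxiliary words removed."""
    return [u for u in units if u not in AUXILIARIES]


def background_knowledge(units, dictionary=TOY_DICTIONARY):
    """BK(X) as a dict: each unit that has an entry maps to its definition."""
    return {u: dictionary[u] for u in units if u in dictionary}


X = drop_auxiliaries(["rational numbers", "may", "contain", "be"])
print(X)
print(background_knowledge(X))
```

      Units without a dictionary entry simply contribute nothing to
BK(X) in this sketch; a real system would need a policy for them.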

      Words(X) -- list of lexical units appearing in X (excluding
auxiliaries)

      Words*(X) -- list of lexical units appearing in X [...]
(excluding auxiliaries)

      The graph of a text P, denoted by Gp(P), is defined as follows.

      The nodes (vertices) of the graph are Words*(P) -- the lexical
units of P; these units are divided into two classes: relations
(corresponding to verbs, verb phrases or ISA links) and constants
(corresponding to nouns or noun phrases).  There is an edge between a
relation node and a constant node if in the text P the constant is a
grammatical argument of the relation.  For example, in the sentence
"The program library contains the Graph-f," the node representing
"contain" will be connected to the nodes representing "program
library" and "Graph-f"; similarly, the relationship between [...] and
its dictionary definition "a repository for media" will be depicted
as an ISA link between the nodes representing the two phrases.  Each
connection between a lexical unit and its background knowledge is
represented as a separate relation.

      Topic of P -- denoted by Tp(P) -- a set of phrases such that
each sentence of P refers to Tp either directly or indirectly.

      [...] -- denotes a minimum spanning tree connecting all X in
graph G.
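
      The graph construction described above can be sketched in
Python as follows.  This is a minimal illustration under stated
assumptions, not the disclosed implementation: the class TextGraph,
its method names and the "label#n" node ids are hypothetical, and
attaching the definition "a repository for media" to "program
library" is an assumption made only so that the example graph is
connected.  The path helper shows how one node can reach another
indirectly; it is a plain breadth-first search, not the minimum
spanning tree computation.

```python
from collections import defaultdict, deque


class TextGraph:
    """Undirected graph whose nodes are tagged 'relation' or 'constant'."""

    def __init__(self):
        self.kind = {}               # node -> 'relation' or 'constant'
        self.adj = defaultdict(set)  # node -> set of neighbouring nodes
        self._next_id = 0

    def add_relation(self, label, arguments):
        """Add a fresh relation node linked to each of its argument constants."""
        node = f"{label}#{self._next_id}"  # unique id: every link is a separate relation
        self._next_id += 1
        self.kind[node] = "relation"
        for arg in arguments:
            self.kind[arg] = "constant"
            self.adj[node].add(arg)
            self.adj[arg].add(node)
        return node

    def path(self, start, goal):
        """Shortest chain of nodes from start to goal (BFS), or None."""
        parent = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                chain = []
                while node is not None:
                    chain.append(node)
                    node = parent[node]
                return chain[::-1]
            for nxt in sorted(self.adj[node]):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None


g = TextGraph()
# "The program library contains the Graph-f"
contain = g.add_relation("contain", ["program library", "Graph-f"])
# Background knowledge as an ISA link (the headword is an assumption).
isa = g.add_relation("ISA", ["program library", "a repository for media"])
print(g.path("Graph-f", "a repository for media"))
```

      Because every edge joins a relation node to a constant node,
two constants are never adjacent; they are always linked through at
least one relation, which is why the printed chain alternates between
the two classes.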
            Algorithm "p-topic"
   Input:   A paragraph of (expository) text, denoted by
            P.
   Output:  A list of lexical units that constitute the
            topic of P, and a topic sentence.
   1.   The sentences of P are parsed; records containing
        information about lexical structure (lexical
        units) are computed.  If anaphoric references
        (next step) are to be resolved, also the
        grammatical structure of each sentence (including
        phrase structure, tense, etc.) should be...