Method and System for Constructing a Hierarchical Document Structure from Text Streams

IP.com Disclosure Number: IPCOM000198335D
Publication Date: 2010-Aug-05
Document File: 3 page(s) / 114K

Publishing Venue

The IP.com Prior Art Database

Abstract

A method and system for constructing a hierarchical document structure from text streams is disclosed. A text stream is segmented, and the inherent hierarchical relationships between its segments are extracted. A hierarchical document structure is then constructed from the extracted relationships.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 55% of the total text.

Disclosed is a method and system for constructing a hierarchical document structure from text streams. Constructing the hierarchical document structure involves segmenting a text stream and then identifying the hierarchical relationships between its segments. Identifying these relationships is useful for retrieving segments at different hierarchical levels. The relationships are also useful for reconstructing original documents from the Optical Character Recognition (OCR) output of scanned documents: the originals are reconstructed from the text alone, without using any font-size information. Moreover, the method and system may facilitate conversion of speech signals into organized documents. For example, speech signals from an agent-customer conversation in a call center, a lecture, a public speech, or a television or radio broadcast may be converted into organized documents.

To construct a hierarchical document structure from a text stream, contiguous segments, each covering a homogeneous topic, are first identified in the text stream, as shown in Fig. 1.

Figure 1
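
The segmentation step can be illustrated with a minimal sketch. The disclosure does not state how topic boundaries are detected, so the sketch swaps in a simple lexical-overlap heuristic: a new segment starts whenever a sentence shares too few words with the segment built so far. The function names `cosine` and `segment` and the `threshold` value are hypothetical and are not taken from the disclosure.

```python
from collections import Counter
import math


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, threshold=0.1):
    """Group consecutive sentences into contiguous segments.

    Assumption (not from the disclosure): a topic boundary is placed
    wherever a sentence's word overlap with the running segment falls
    below `threshold`.
    """
    segments = []
    current, current_bag = [], Counter()
    for sentence in sentences:
        bag = Counter(sentence.lower().split())
        if current and cosine(current_bag, bag) < threshold:
            # Low overlap: close the running segment and start a new one.
            segments.append(current)
            current, current_bag = [], Counter()
        current.append(sentence)
        current_bag.update(bag)
    if current:
        segments.append(current)
    return segments
```

Called on a list of sentences, `segment` returns a list of sentence lists, one per contiguous segment. A real system would likely need stop-word handling and a tuned threshold.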

The hierarchical document structure is constructed using a word co-occurrence statistics model. Word co-occurrence statistics obtained from training data are used to build this model, which is shown in Fig. 2.

Figure 2
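
As a rough illustration of gathering word co-occurrence statistics from training data, the sketch below counts how often each pair of vocabulary words appears in the same training segment. The abbreviated text does not give the actual form of the model, so plain pairwise counts are an assumption, and the name `cooccurrence_counts` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(segments, vocabulary):
    """Count segment-level co-occurrences of vocabulary word pairs.

    Assumption (not from the disclosure): two words co-occur when both
    appear at least once in the same training segment. Pairs are stored
    as alphabetically sorted tuples.
    """
    vocab = set(vocabulary)
    counts = Counter()
    for text in segments:
        # Distinct vocabulary words present in this segment, sorted so
        # that each unordered pair maps to a single key.
        present = sorted({w for w in text.lower().split() if w in vocab})
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts
```

The resulting counts could be normalized into probabilities or association scores. Which statistic the disclosed model uses is not recoverable from the abbreviated text.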

For example, for a given vocabulary, a word co-occurrence statistics model is inferred as follows:

Vocabulary: $W = \{w_i\}_{i=1}^{N}$

Word co-occurrence statistics model:...