Extracting Keys for Document Structure from the Table of Contents

IP.com Disclosure Number: IPCOM000118093D
Original Publication Date: 1996-Sep-01
Included in the Prior Art Database: 2005-Mar-31
Document File: 4 page(s) / 129K

Publishing Venue

IBM

Related People

Tateishi, Y: AUTHOR

Abstract

Disclosed is a program that extracts, from the table of contents of a document written in Japanese (a book, a white paper, etc.), a pattern of the key strings that indicate the document's structure. Programs that automatically mark up the structure of plain text (in HTML, SGML, and other tag languages) can use the extracted pattern to mark up the headings of chapters and sections. In this bulletin, the term 'key' denotes the numeric and other expressions that indicate the numbers of the chapters, sections, etc., of a document. In many documents, these keys precede the headings of the chapters etc., and can be crucial information that lets markup programs detect the headings.

      The program reads a plain-text file representing the table of
contents of a document.  It assumes that the file consists of text
lines, called 'entries', like the ones shown in Figs. 1 and 2.  In
both figures, part 1 is the key for 'chapter 2', part 2 is the
chapter heading, part 3 is a separator between the heading and the
page number, and part 4 is the page number.  Note that the lines in
the two figures have the same meaning.  The program consists of two
phases: the first phase separates the keys and headings in the
entries of a table of contents, and the second phase determines the
level of each heading from the extracted keys.

Phase 1:
  The program separates the keys from the headings in the following
  steps.

Step 1:

      Reduce each entry to an 'entry pattern' by converting each run
of numerals to the single character '1', each uppercase letter to
'A', each lowercase letter to 'a', and each katakana letter to 'k'.
For example, the entries in Fig. 3 are reduced to the entry patterns
in Fig. 4.
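
      The reduction in Step 1 might be sketched in Python as follows.
This is only an illustration, not the disclosed program.  The names
reduce_entry and is_katakana are hypothetical, and the sample entry is
invented because Figs. 3 and 4 are not reproduced here.  The sketch
assumes that 'numerals' means decimal digits (half-width or
full-width, but not kanji numerals), that katakana letters are the
code points U+30A1 to U+30FA and U+FF66 to U+FF9D, and that every
other character is copied unchanged.

```python
def is_katakana(ch: str) -> bool:
    # Assumption: full-width katakana letters plus the half-width forms.
    return "\u30a1" <= ch <= "\u30fa" or "\uff66" <= ch <= "\uff9d"


def reduce_entry(entry: str) -> str:
    """Reduce one table-of-contents entry to its entry pattern."""
    out = []
    in_digit_run = False
    for ch in entry:
        if ch.isdecimal():
            # A whole run of numerals collapses to a single '1'.
            if not in_digit_run:
                out.append("1")
            in_digit_run = True
            continue
        in_digit_run = False
        if ch.isupper():
            out.append("A")   # each uppercase letter becomes 'A'
        elif ch.islower():
            out.append("a")   # each lowercase letter becomes 'a'
        elif is_katakana(ch):
            out.append("k")   # each katakana letter becomes 'k'
        else:
            out.append(ch)    # kanji, hiragana, punctuation: unchanged
    return "".join(out)


print(reduce_entry("第12章 はじめに"))  # → 第1章 はじめに
```

Because a run of numerals collapses to one character while letters
are mapped one by one, entries such as '2.1' and '2.10' reduce to the
same pattern '1.1'.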

Step 2:

      Read each entry pattern from left to right, character by
character, and construct a trie whose nodes are pairs <c,n> where c
is a character and n is the number of times that the node is visited
while constructing the trie.  The number n corresponds to the count
of a (reduced) character at a particular position in an entry
pattern.  Fig. 5 shows the trie constructed from the entry patterns
shown in Fig. 4.  In Fig. 5, the boxes (like the one numbered 1)
represent nodes.  The character in the upper part of each box is a
character from the entry patterns, and the number in the lower part
is the count of that character at that position.  Thus, the box
numbered 1 in Fig. 5 corresponds to the pair <'1', 5>.  This means
that a numeral appears five times at the beginning of the entry
patterns in the table of contents.  Rightward lines (like t...
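
      The counted trie of Step 2 might be sketched in Python as
follows.  This is a minimal illustration under stated assumptions,
not the disclosed implementation.  The names TrieNode and build_trie
are hypothetical, and the five sample entry patterns are invented
(Fig. 4 is not reproduced here), chosen so that the first node has
the same count as the pair <'1', 5> described above.

```python
class TrieNode:
    """A node <c, n>: a character c and the number of visits n."""

    def __init__(self, char: str):
        self.char = char
        self.count = 0
        self.children = {}  # maps next character -> TrieNode


def build_trie(patterns):
    """Build a trie from entry patterns, counting visits per node."""
    root = TrieNode("")  # the root stands for the start of an entry
    for pattern in patterns:
        node = root
        # Read the pattern from left to right, character by character.
        for ch in pattern:
            child = node.children.get(ch)
            if child is None:
                child = TrieNode(ch)
                node.children[ch] = child
            child.count += 1  # one more pattern passes through here
            node = child
    return root


# Hypothetical entry patterns, already reduced as in Step 1.
patterns = ["1 A", "1.1 a", "1.1 a", "1.1 A", "1 A"]
root = build_trie(patterns)
first = root.children["1"]
print(first.char, first.count)         # → 1 5
print(first.children["."].count)       # → 3
```

In this sketch the count stored at a node equals the number of entry
patterns that share the prefix leading to that node.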