Method for Formatting Tabular Data

IP.com Disclosure Number: IPCOM000117377D
Original Publication Date: 1996-Feb-01
Included in the Prior Art Database: 2005-Mar-31
Document File: 4 page(s) / 82K

Publishing Venue

IBM

Related People

Hirayama, Y: AUTHOR

Abstract

Disclosed is a method for formatting tabular data processed by an Optical Character Recognition (OCR) system. The method analyzes the arrangement of character strings in a table area of an image and formats the text recognized by the OCR system according to that arrangement.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 55% of the total text.

      Fig. 1 shows a sample document image.  In the first step,
character strings, vertical and horizontal lines, and other groups
of black pixel components are extracted from the page image by an
algorithm for detecting character strings (Fig. 2).  This algorithm
is described in (*).

      In the second step, all vertical and horizontal lines are
extended to the rectangle surrounding the table.  The extended lines
segment the table into a lattice structure, from which the rows and
columns are detected.  In Fig. 3, C1, C2,..., C5 are columns and R1,
R2, and R3 are rows.

      In the third step, the character strings in each table column
are formatted by analyzing their horizontal arrangement.  For this
purpose, the Dynamic Programming (DP) matching method is used.  In
Fig. 3, C1 has 12 character strings and C2 has 13.  Their label
sequences are shown in Fig. 4(a).  The recurrence formula for DP
matching is shown in Fig. 4(b).  Fig. 5 shows four lattice points on
a 13 x 14 plane used in the DP matching method.  Path 1 indicates
that a character string in C2 has no corresponding string in C1.
Path 2 indicates that a character string in C1 corresponds t...
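
      The lattice segmentation described above (extending every
ruling line to the rectangle surrounding the table, then reading off
rows and columns) might be sketched as follows.  This is only an
illustration under stated assumptions: the box representation, the
function names lattice_boundaries and locate, and the rule of
assigning a character string to the cell that contains its centre
are all hypothetical and do not come from the disclosure.

```python
from bisect import bisect_right
from typing import List, Tuple

# Assumed box layout: (left, top, right, bottom) in pixel coordinates.
Box = Tuple[int, int, int, int]


def lattice_boundaries(table: Box,
                       vertical_xs: List[int],
                       horizontal_ys: List[int]) -> Tuple[List[int], List[int]]:
    """Treat every ruling line as if it spanned the whole table rectangle.

    Returns the sorted x boundaries (column edges) and y boundaries
    (row edges), including the edges of the table rectangle itself.
    """
    left, top, right, bottom = table
    xs = sorted({left, right, *(x for x in vertical_xs if left < x < right)})
    ys = sorted({top, bottom, *(y for y in horizontal_ys if top < y < bottom)})
    return xs, ys


def locate(box: Box, xs: List[int], ys: List[int]) -> Tuple[int, int]:
    """Return the 0-based (row, column) of the cell containing the box centre.

    Assumption: the box lies inside the table; indices are clamped so a
    centre on the outer edge still maps to a valid cell.
    """
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    col = max(0, min(bisect_right(xs, cx) - 1, len(xs) - 2))
    row = max(0, min(bisect_right(ys, cy) - 1, len(ys) - 2))
    return row, col


if __name__ == "__main__":
    xs, ys = lattice_boundaries((0, 0, 100, 60), [30, 70], [20, 40])
    print(xs, ys)
    print(locate((35, 22, 60, 30), xs, ys))
```

      With two vertical and two horizontal lines inside the table,
the sketch yields three columns and three rows; a string whose box
centre falls between the first and second line in both directions is
placed in the middle cell.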
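
      The DP matching step can be illustrated with a generic
sequence alignment.  The recurrence in Fig. 4(b) and the label
sequences in Fig. 4(a) are not reproduced in this text, so the sketch
below substitutes a standard three-way alignment recurrence and
should not be read as the disclosure's own formula.  The use of
vertical positions as labels, the absolute-difference match cost, the
constant gap penalty, and the name align_columns are all assumptions.
What the sketch does share with the description is the grid size:
two columns with n and m strings give an (n + 1) x (m + 1) plane,
which is 13 x 14 for 12 and 13 strings.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def align_columns(a: List[float], b: List[float], gap: float = 10.0) -> List[Pair]:
    """Align two columns of character strings by an assumed per-string label.

    a, b : labels (here: vertical positions) of the strings in two columns.
    gap  : assumed penalty for leaving a string without a counterpart.
    Returns (i, j) index pairs; None marks a string with no counterpart.
    """
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j] on an (n+1) x (m+1) grid.
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]),  # pair a[i-1] with b[j-1]
                cost[i - 1][j] + gap,                           # a[i-1] has no counterpart
                cost[i][j - 1] + gap,                           # b[j-1] has no counterpart
            )

    # Trace back from the far corner to recover one optimal path.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    c1 = [10.0, 30.0, 50.0]        # hypothetical labels for column C1
    c2 = [10.0, 20.0, 30.0, 50.0]  # hypothetical labels for column C2
    print(align_columns(c1, c2))
```

      For the hypothetical inputs in the usage example, the result
is [(0, 0), (None, 1), (1, 2), (2, 3)].  The (None, 1) entry is the
situation the text describes for Path 1: a character string in C2
with no corresponding string in C1.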