Line Segmentation Method for Documents in European Languages

IP.com Disclosure Number: IPCOM000101018D
Original Publication Date: 1990-Jun-01
Included in the Prior Art Database: 2005-Mar-16
Document File: 4 page(s) / 131K

Publishing Venue

IBM

Related People

Yamashita, A: AUTHOR

Abstract

This article describes an efficient method for segmenting character lines from a skewed document image. The method estimates base lines, font size, and degree of skew for each page, and all character lines are segmented on the basis of this information. This method can be used to segment characters with underlines and also characters in tables.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      A page image is vertically divided into several partitions like
those shown in Fig. 1.  In each partition, a horizontally projected
histogram is calculated.  Image data are processed horizontally, byte
by byte: the number of black pixels in each 8-bit byte of data is
counted and summed row by row in each partition.  If all pixels in an
8-bit byte of data are black, the byte may be part of an underline or
a scaled line.  Therefore, the number of these all-black patterns is
also counted.  From the distribution of the projected histogram,
rectangular data (line-components) that represent parts of character
strings are detected in each partition. An example of a
line-component and a local base-line is shown in Fig. 1.
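
The projection step described above can be sketched as follows.  This
is a minimal sketch, not the author's implementation.  It assumes a
packed 1-bit-per-pixel image in which a set bit means black and each
partition starts and ends on a byte boundary.  The names
project_partition and find_line_components, and the threshold
parameter, are hypothetical.

```python
def project_partition(rows, byte_start, byte_end):
    """Horizontal projection for one vertical partition.

    rows: sequence of bytes objects, one per scan line (assumed packed
    1 bit per pixel, set bit = black).
    Returns two lists with one entry per scan line: the number of black
    pixels, and the number of all-black bytes (0xFF).
    """
    black_pixels = []
    full_black_bytes = []
    for row in rows:
        segment = row[byte_start:byte_end]
        # Count set bits byte by byte and sum them for this scan line.
        black_pixels.append(sum(bin(b).count("1") for b in segment))
        # All-black bytes hint at an underline or a scaled line.
        full_black_bytes.append(sum(1 for b in segment if b == 0xFF))
    return black_pixels, full_black_bytes


def find_line_components(histogram, threshold=1):
    """Return inclusive (top, bottom) row ranges where histogram >= threshold."""
    components = []
    start = None
    for y, value in enumerate(histogram):
        if value >= threshold and start is None:
            start = y
        elif value < threshold and start is not None:
            components.append((start, y - 1))
            start = None
    if start is not None:
        components.append((start, len(histogram) - 1))
    return components
```

Each returned range, together with the horizontal extent of the
partition, gives one rectangular line-component.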

      A local base-line is then estimated in each line-component.
Since the projected histogram has a maximum value around a base-line
in a line-component, the histogram is scanned upward from the bottom
of the component, and if its value exceeds the threshold level for a
base-line, the position is recorded as a local base-line.  However,
if parts of character strings are connected to an underline, the
position of the underline may be mistaken for a local base-line.  In
this case, many all-black 8-bit patterns will have been counted
around the underline, and therefore the investigation of the
projected histogram continues until the next maximum value is
detected.  If no appropriate
candidate is found, the bottom line of the component is recorded as a
local base-line.
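
The bottom-up search for a local base-line can be sketched as
follows.  This is a simplified sketch under stated assumptions, not
the author's implementation.  The two threshold parameters are
hypothetical, and "the next maximum value" is approximated here by
the next row above the underline whose histogram value exceeds the
base-line threshold.

```python
def estimate_local_baseline(black_pixels, full_black_bytes, top, bottom,
                            baseline_threshold, underline_threshold):
    """Scan one line-component from bottom to top for a local base-line.

    black_pixels / full_black_bytes: per-row projection values for the
    partition.  top, bottom: inclusive row range of the component.
    Returns the row index recorded as the local base-line.
    """
    for y in range(bottom, top - 1, -1):
        if black_pixels[y] <= baseline_threshold:
            continue  # not dense enough to be a base-line
        if full_black_bytes[y] >= underline_threshold:
            continue  # many all-black bytes: likely an underline, keep going
        return y
    # No appropriate candidate: fall back to the bottom of the component.
    return bottom
```

For example, if the lowest dense row of a component also has a high
count of all-black bytes, that row is skipped and the next dense row
above it is returned.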

      In order to eliminate scaled lines connecting the tops of
characters, the projected histogram is investigated in the same way
from the top of each line-component.  If a scaled line is detected,
the boundary between the characters and the scaled line is estimated
by using the distributions of the projected histogram and black
patterns. Line-components that contain only underlines or scaled
lines are eliminated.
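
One way to discard line-components that contain only underlines or
scaled lines is sketched below.  The article does not state the exact
criterion, so the test used here is an assumption: a component is
treated as line-only when, in every row that contains ink, all-black
bytes account for at least a given fraction of the black pixels.  The
function names and the ratio parameter are hypothetical.

```python
def is_rule_only(black_pixels, full_black_bytes, top, bottom, ratio=0.8):
    """Return True if the component looks like a bare underline or rule.

    Assumed criterion: every inked row gets at least `ratio` of its
    black pixels from all-black bytes (8 pixels each).
    """
    inked_rows = [y for y in range(top, bottom + 1) if black_pixels[y] > 0]
    if not inked_rows:
        return False  # an empty component is not classified as a rule
    return all(8 * full_black_bytes[y] >= ratio * black_pixels[y]
               for y in inked_rows)


def drop_rule_only_components(components, black_pixels, full_black_bytes,
                              ratio=0.8):
    """Keep only components that contain something other than rules."""
    return [(top, bottom) for (top, bottom) in components
            if not is_rule_only(black_pixels, full_black_bytes,
                                top, bottom, ratio)]
```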

      To compensate for the skew effect, the degree of skew in a page
is estimated on the basis of local base-lines....