Two Step System for Character Recognition Without Pre-stored Fonts

IP.com Disclosure Number: IPCOM000122708D
Original Publication Date: 1991-Dec-01
Included in the Prior Art Database: 2005-Apr-04
Document File: 3 page(s) / 138K

Publishing Venue

IBM

Related People

Campigli, P: AUTHOR [+3]

This is the abbreviated version, containing approximately 46% of the total text.

      Disclosed is an approach for recognizing printed text without
pre-stored fonts or models against which unknown characters are
compared. The approach is performed in two steps. In the first step,
characters are tentatively recognized on the basis of their
attributed graphs and class information. Numerical features for each
kind of character are computed according to those used in SISTEMA L
(1), an IBM computer program for text recognition. In the second
step, recognition is performed by SISTEMA L using the font built in
the first step.

First Step

      The starting point of this step is a graph representation of
the characters, obtained from a run representation of the character
images. Runs having exactly one father and one son are defined as
regular, and all other runs as singular. The image of each character
can then be represented by a graph in which the singular runs are the
nodes and the regular runs are the arcs (2,3). This representation
allows easy removal of character noise: a graph-based filter prunes
the graph, removing short connections that are not significant for
the overall structure.
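
      The regular/singular distinction can be illustrated with a
minimal Python sketch. It is not taken from the disclosure: it
assumes horizontal runs on a binary image, and it assumes that a
"father" is an overlapping run in the row above and a "son" an
overlapping run in the row below. The names extract_runs, overlaps
and classify_runs are hypothetical.

```python
from typing import Dict, List, Tuple

Run = Tuple[int, int, int]  # (row, first column, last column), inclusive


def extract_runs(image: List[List[int]]) -> List[Run]:
    """Return the horizontal runs of foreground (non-zero) pixels."""
    runs = []
    for r, row in enumerate(image):
        c = 0
        while c < len(row):
            if row[c]:
                start = c
                while c < len(row) and row[c]:
                    c += 1
                runs.append((r, start, c - 1))
            else:
                c += 1
    return runs


def overlaps(a: Run, b: Run) -> bool:
    """True if the two runs share at least one column (assumed criterion)."""
    return a[1] <= b[2] and b[1] <= a[2]


def classify_runs(runs: List[Run]) -> Dict[Run, str]:
    """Label a run 'regular' if it has one father and one son, else 'singular'."""
    labels = {}
    for run in runs:
        # Assumption: fathers lie in the previous row, sons in the next row.
        fathers = [p for p in runs if p[0] == run[0] - 1 and overlaps(p, run)]
        sons = [s for s in runs if s[0] == run[0] + 1 and overlaps(s, run)]
        regular = len(fathers) == 1 and len(sons) == 1
        labels[run] = "regular" if regular else "singular"
    return labels


# A small Y-shaped pattern: two strokes join into one stem.
image = [
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 0],
]
labels = classify_runs(extract_runs(image))
```

      In this pattern the run where the two strokes join has two
fathers and is therefore singular (a node), as are the three stroke
ends, while the runs in between are regular and would be chained
into arcs.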

      After all the connected groups of pixels belonging to a row of
text have been identified, neighboring elements are merged, so that
characters made up of separate pieces (e.g., ; " i : =) can be
identified as a single unit. This is achieved by analyzing the
positions of the barycenters of contiguous pieces along the direction
parallel to the text row.
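
      One way to picture this merging step is the following Python
sketch. The disclosure does not state the actual merging criterion,
so the distance threshold max_gap and the function names
barycenter_x and merge_pieces are assumptions made only for
illustration; the text row is taken to run along the column axis.

```python
from typing import List, Tuple

Pixel = Tuple[int, int]  # (row, column)


def barycenter_x(piece: List[Pixel]) -> float:
    """Mean column of a piece: its barycenter along the text row."""
    return sum(c for _, c in piece) / len(piece)


def merge_pieces(pieces: List[List[Pixel]], max_gap: float) -> List[List[Pixel]]:
    """Merge neighboring pieces whose barycenters are at most max_gap apart.

    max_gap is a hypothetical threshold, measured in columns.
    """
    ordered = sorted(pieces, key=barycenter_x)
    merged: List[List[Pixel]] = []
    for piece in ordered:
        if merged and barycenter_x(piece) - barycenter_x(merged[-1]) <= max_gap:
            # Close enough along the row: treat as part of the same character.
            merged[-1] = merged[-1] + piece
        else:
            merged.append(list(piece))
    return merged


# The dot and the stem of an "i", followed by a separate character piece.
dot = [(0, 2)]
stem = [(2, 2), (3, 2), (4, 2)]
other = [(2, 6), (3, 6), (2, 7), (3, 7)]
units = merge_pieces([stem, other, dot], max_gap=1.5)
```

      Here the dot and the stem have the same barycenter along the
row and end up in one unit, while the third piece stays separate.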

      To improve recognition, arcs and nodes are characterized by
attributes. The attributes for the arcs are:
- number of arcs between nodes i-j;
- length of the arc, normalized with respect to the width of the
character;
- average thickness of the arc, normalized with respect to the height
of the character;
- number of segments approximating the arc;
- geometrical configuration of the arc. The segment connecting the
initial and final nodes is considered together with the positions of
the barycenters of the runs with respect to it. The successions of
these positions are classified as 1 = flat (almost rectilinear arc),
2 = hill (arc above the segment), and 3 = valley (arc below the
segment);
- area enclosed between the previously mentioned segment and the line
connecting the barycenters of the runs, normalized by the area of the
character;
- variance of the thickness of the arc, normalized with respect to
the height of the character.
These attributes are computed for all the arcs existing between two
nodes.
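
      Two of the arc attributes, the geometrical configuration and
the enclosure, can be sketched in Python as follows. This is a
simplified illustration, not the disclosed procedure: it classifies
an arc from the mean signed offset of the run barycenters rather than
from the full succession of positions, it assumes (x, y) coordinates
with y increasing upward and distinct end nodes, and the tolerance
value and all function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), with y assumed to increase upward

FLAT, HILL, VALLEY = 1, 2, 3  # codes as listed for the configuration


def signed_offsets(start: Point, end: Point, barycenters: List[Point]) -> List[float]:
    """Signed distance of each barycenter from the segment start -> end.

    Positive values lie above the segment when it runs left to right.
    """
    dx, dy = end[0] - start[0], end[1] - start[1]
    length = math.hypot(dx, dy)  # assumes start != end
    return [
        (dx * (y - start[1]) - dy * (x - start[0])) / length
        for x, y in barycenters
    ]


def classify_arc(start: Point, end: Point, barycenters: List[Point],
                 tolerance: float = 0.1) -> int:
    """Return FLAT, HILL or VALLEY from the mean offset (a simplification)."""
    if not barycenters:
        return FLAT
    length = math.hypot(end[0] - start[0], end[1] - start[1])
    offsets = signed_offsets(start, end, barycenters)
    mean_offset = sum(offsets) / len(offsets)
    # tolerance is a hypothetical fraction of the segment length.
    if abs(mean_offset) <= tolerance * length:
        return FLAT
    return HILL if mean_offset > 0 else VALLEY


def enclosure(start: Point, end: Point, barycenters: List[Point],
              char_area: float) -> float:
    """Area between the segment and the barycenter line, over char_area.

    Uses the shoelace formula, so an arc that crosses the segment
    contributes only its net area.
    """
    polygon = [start, *barycenters, end]
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0 / char_area


kind = classify_arc((0, 0), (4, 0), [(1, 1), (2, 2), (3, 1)])
ratio = enclosure((0, 0), (4, 0), [(1, 1), (2, 2), (3, 1)], char_area=16.0)
```

      For this example the barycenters all lie above the segment, so
the arc is classified as a hill, and the enclosed triangle of area 4
gives an enclosure of 0.25 for a character area of 16.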

      The attributes for the nodes are:
- run type, i.e., its thickness relative to the character height;
- position of the run (0-1, from low to high position with respect
to the character).

      Further, similar information can be obtained from vertical
runs.

      To recognize a character, it is compared with attributed graph
structures by means of a simple table, built on the basis of a
criteri...