Fast Method of Correcting Substitution Errors in Optical Character Recognition

IP.com Disclosure Number: IPCOM000120069D
Original Publication Date: 1991-Mar-01
Included in the Prior Art Database: 2005-Apr-02
Document File: 4 page(s) / 103K

Publishing Venue

IBM

Related People

Itoh, N: AUTHOR

Abstract

This article describes an efficient method of reducing the number of candidate words in dictionary-aided recovery from errors in optical character recognition (OCR).

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      The method transforms each recognized candidate character into a
number designating the group to which that character belongs, and
selects candidate words according to the similarity between the
resulting number lattice and the words in a dictionary. The method can
easily be applied to large character sets, such as Kanji.
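
      The idea in this paragraph can be sketched in Python as follows.
This is only an illustration under stated assumptions: the toy
alphabet, the group numbers in GROUP_OF, the function names, and the
scoring rule (counting the columns whose candidate groups contain the
word's group at that position) are all hypothetical and are not taken
from the article.

```python
# Hypothetical group numbers for a toy alphabet; the article clusters
# Kanji templates instead.
GROUP_OF = {"O": 1, "0": 1, "Q": 1,
            "l": 2, "1": 2, "I": 2,
            "S": 3, "5": 3}


def to_number_lattice(lattice):
    """Replace every candidate character with its group number."""
    return [[GROUP_OF[ch] for ch in column] for column in lattice]


def score(word, number_lattice):
    """Count columns whose candidate groups contain the word's group.

    Assumed scoring rule. Only substitution errors are considered, so a
    word whose length differs from the lattice scores zero.
    """
    if len(word) != len(number_lattice):
        return 0
    return sum(GROUP_OF[ch] in column
               for ch, column in zip(word, number_lattice))


def select_candidates(lattice, dictionary, min_score):
    """Keep the dictionary words that score at least min_score."""
    number_lattice = to_number_lattice(lattice)
    return [w for w in dictionary if score(w, number_lattice) >= min_score]


# One inner list per character position, best candidate first.
LATTICE = [["5", "S"], ["0", "Q"], ["1", "5"]]
WORDS = ["SOl", "IOS", "lQ5", "SQ5"]
SELECTED = select_candidates(LATTICE, WORDS, min_score=3)
```

      With these toy values, "SOl" is selected even though none of its
characters appears in the lattice, because each of its characters
shares a group with a candidate in the same column.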

      All characters are classified into a suitable number of groups
by clustering the feature vectors of their templates in the
recognition dictionary. Table 1 shows an example in which 2805
templates of Kanji characters are clustered into 123 groups. Similar
(easily confused) characters tend to be classified into the same
group. If an identifier (name) is given to each group, words can be
expressed by the cluster names:

                            (Image Omitted)
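
      As a rough illustration of expressing words by cluster names,
the sketch below assigns each character to the nearest of a few group
centers and then encodes words as tuples of group numbers. The
two-dimensional feature vectors, the centers, and the function names
are hypothetical; the article does not state which clustering
algorithm or features are used, so the centers are simply given here.

```python
from math import dist

# Hypothetical 2-D feature vectors for a toy character set.
TEMPLATES = {
    "O": (0.0, 0.0), "0": (0.1, 0.0), "Q": (0.0, 0.2),
    "l": (5.0, 5.0), "1": (5.1, 5.0), "I": (5.0, 5.2),
    "S": (9.0, 0.0), "5": (9.1, 0.1),
}

# Hypothetical group centers; in practice these would come from
# clustering the template feature vectors.
CENTERS = {1: (0.0, 0.1), 2: (5.0, 5.1), 3: (9.0, 0.0)}


def assign_groups(templates, centers):
    """Map each character to the group with the nearest center."""
    return {ch: min(centers, key=lambda g: dist(vec, centers[g]))
            for ch, vec in templates.items()}


def encode_word(word, groups):
    """Express a word as the tuple of its characters' group numbers."""
    return tuple(groups[ch] for ch in word)


GROUPS = assign_groups(TEMPLATES, CENTERS)
```

      Under these assumptions, encode_word("SO1", GROUPS) and
encode_word("501", GROUPS) both give (3, 1, 2), which shows how
easily confused characters collapse to the same cluster-name string.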

      Another important point is that similarities between the 123
groups can be defined by computing the distances between the centers
of the groups' elements. All the words in the word dictionary are
expressed by these cluster names and classified accordingly.

                            (Image Omitted)
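
      A minimal sketch of these two structures follows, assuming toy
group centers and a toy word list. The similarity formula
1 / (1 + distance) is an assumption chosen only so that identical
groups score 1.0 and distant groups score close to 0; the article
does not give the formula. All names are hypothetical.

```python
from collections import defaultdict
from math import dist

# Hypothetical group centers and character-to-group assignments.
CENTERS = {1: (0.0, 0.1), 2: (5.0, 5.1), 3: (9.0, 0.0)}
GROUPS = {"O": 1, "0": 1, "Q": 1,
          "l": 2, "1": 2, "I": 2,
          "S": 3, "5": 3}


def similarity_table(centers):
    """Similarity of every pair of groups from center distances.

    Assumed formula: 1 / (1 + Euclidean distance between centers).
    """
    return {(a, b): 1.0 / (1.0 + dist(centers[a], centers[b]))
            for a in centers for b in centers}


def build_index(words, groups):
    """Index dictionary words by their cluster-name strings."""
    index = defaultdict(list)
    for word in words:
        key = tuple(groups[ch] for ch in word)
        index[key].append(word)
    return dict(index)


SIM = similarity_table(CENTERS)
INDEX = build_index(["SOl", "501", "IOS"], GROUPS)
```

      Words that share a cluster-name string, such as "SOl" and "501"
in this toy list, land under the same key, so a single lookup
retrieves all of them.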

      Through these procedures, a dictionary is built in which all the
words are indexed by their cluster-name strings, and a similarity
table of cluster names is obtained. The dictionary and the table are
used to select candidate words as follows (Fig. 1). Let the
recognition result of the OCR be C(i,j), where i and j denote the
column position and the rank of the candidate, respectively.
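
      To make the notation concrete, the recognition result can be
pictured as a list of columns, each holding its candidates in rank
order. The sketch below is an assumption about representation only:
the sample characters are invented, and treating i and j as 1-based
is a guess, since the article does not state the index origin.

```python
# Hypothetical recognition result: one inner list per column
# (character position), candidates ordered from best to worst.
LATTICE = [
    ["5", "S"],
    ["0", "O", "Q"],
    ["1", "l"],
]


def C(i, j):
    """Return the j-th ranked candidate in column i (1-based, assumed)."""
    return LATTICE[i - 1][j - 1]


# The string formed by the top-ranked candidate of every column.
TOP_STRING = "".join(C(i, 1) for i in range(1, len(LATTICE) + 1))
```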
Step 1

      Replace each candidate (C(i,j)) with the clust...