Optical Character Recognition Accuracy Estimation

IP.com Disclosure Number: IPCOM000116008D
Original Publication Date: 1995-Jul-01
Included in the Prior Art Database: 2005-Mar-30
Document File: 8 page(s) / 321K

Publishing Venue

IBM

Related People

Bella, IN: AUTHOR

Abstract

A method for evaluating the accuracy of an Optical Character Recognition (OCR) engine is disclosed. This method enables one to compare the output of an OCR engine with a truth model without any modifications to the OCR engine. The novelty of this solution is that no constraints are placed on how many extra characters or lines are in either of the documents being compared, and the algorithm runs in a reasonable amount of time. This flexibility allows an OCR accuracy estimate to be made even if the segmentation portion of the OCR engine is extremely poor.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 17% of the total text.

      In order to evaluate the accuracy of an OCR engine, one needs
to be able to compare the output of the engine with a truth model.
This can be done by creating a database of single character images,
passing them one by one through the OCR engine, and comparing each
result to the truth model.  However, this can be time-consuming, and
is not always possible unless one has the source code for the engine
so that the image segmentation code can be bypassed.  A more reasonable
approach is to feed the OCR engine an entire document image, and then
compare the output to a truth model of the entire document.  The
latter method does not require any major preparation, and hence is
quicker and easier.  The problem then becomes comparing the documents
when the OCR engine could have left out or added extra characters or
lines in the output.  This paper describes an algorithm that can be
used to solve this problem.

      High Level Description - The accuracy of an OCR engine can be
expressed as the number of characters the OCR engine interpreted
correctly relative to the total number of characters in the truth
model.  Since the truth model already gives us the total number of
characters, all that remains is determining the number of correctly
interpreted characters.  This can be done by calculating a maximum
set of matching characters between the truth model and the OCR
output.  The example in Fig. 1 demonstrates this solution.

      An edge represents a potential matching of two characters
between the two documents.  Hence a maximum set of matching
characters is a maximum set of edges such that no two edges cross
each other as depicted in the example.  For simplicity's sake, if two
edges start or end at the same character, then they are considered
crossing.

      The solution output consists of the set of matching characters
between the two documents.  The accuracy can then be calculated as
the ratio of matching characters to the number of characters in the
truth model.  In Fig. 1, the accuracy is 12 edges (matching
characters) over 17 truth model characters which is an accuracy of
70.6 percent.  If one is not concerned whether an OCR engine
determines spaces correctly, then the accuracy would be 9 edges over
14 truth model characters which is an accuracy of 64.3 percent.
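      The maximum set of non-crossing matches described above is the
classic longest-common-subsequence problem.  The sketch below, in
Python, computes it with straightforward dynamic programming; the
disclosure does not specify which algorithm is used, and the sample
strings are hypothetical, not the ones from Fig. 1.

```python
def max_matching(truth, ocr):
    """Size of the maximum set of non-crossing character matches
    (i.e., the longest common subsequence) between the two strings."""
    m, n = len(truth), len(ocr)
    # dp[i][j] = best matching between truth[:i] and ocr[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if truth[i] == ocr[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def ocr_accuracy(truth, ocr, ignore_spaces=False):
    """Matching characters divided by the truth-model length.
    If ignore_spaces is True, spaces are dropped from both documents
    first, as in the second accuracy figure above."""
    if ignore_spaces:
        truth = truth.replace(" ", "")
        ocr = ocr.replace(" ", "")
    return max_matching(truth, ocr) / len(truth)

# Hypothetical truth model and OCR output with two character errors.
truth = "the quick fox"
output = "tne quick fax"
print(round(ocr_accuracy(truth, output), 3))  # prints 0.846
```

      Because the dynamic program considers every pair of prefixes, it
places no constraint on how many characters either document adds or
omits, at a cost proportional to the product of the document lengths.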

      The novelty of this s...