Browse Prior Art Database

# Method for balancing the effects segmentation errors in dictionary assisted OCR error correction

IP.com Disclosure Number: IPCOM000019158D
Original Publication Date: 2003-Sep-02
Included in the Prior Art Database: 2003-Sep-02
Document File: 3 page(s) / 48K

IBM

## Abstract

The purpose of the invention is to indicate how to modify existing approximate matching algorithms so that these cases will be correctly handled, without affecting the overall matching process of other characters.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 41% of the total text.

Page 1 of 3

Method for balancing the effects segmentation errors in dictionary assisted OCR error correction

Problem

Pattern Recognition processes are usually implemented in several stages. For example, in OCR, there is a preliminary layout analysis, followed by a line separation stage. The lines are then segmented into characters, and finally, each character is classified individually.

This process is error-prone, and OCR is generally assisted by an error-correction method, which strives to perform a dictionary lookup of the recognized words. The common correction process uses an approximate string-matching algorithm, based in general on the editing distance proposed by Levenshtein. A dynamic-programming approach, which allows an effective computation of this distance, was developed by Wagner and Fisher.

The editing distance between two strings is defined as follows: First, define the distance between each two characters of the character set (this includes the void character, in which case there is the cost of a deletion or insertion operation).

The cost of a sequence of operations needed to transform one string into another is the sum of costs of the elementary operations.

The distance between two strings is defined as the minimum cost of all the possible sequences that transform one string into another.

This notion of distance is appropriate for correcting typing and spelling errors. When used to assist OCR (and other processes of Pattern Recognition that involve previous segmentation) as a correction process, the weakest aspect of this process is the case when some single characters are incorrectly split into two or more characters. For example, in OCR for printed characters, the most frequent examples are the cases when the character "m" is split into "r" and "n", or the character "B" is split into "I" and "3", or "1" and "3", and so on. In this case, the distance between "m" and "r"+"n" is 2, while in principle it should be 0.

Solution

The purpose of the invention is to indicate how to modify existing approximate matching algorithms so that these cases will be correctly handled, without affecting the overall matching process of other characters.

The algorithm is based on the extension of the character set used to express the strings, a modification of the directories, and an adaptation of the character distance matrix.

While OCR and other recognition processes may split a character into more than two other characters, here only the case of two character splits are handled. The method, however, is not limited to this situation, and is extendible also to the more complex case when a character is incorrectly split into three or more characters.

The proposed invention is based on the fact that each approximate matching algorithm has a distance matrix C.

This matrix indicates the cost of replacing one character by another (in particular, when a character is replaced by the void character, there is a deletion, and vice-versa, an insertion).

A...