Characters Segmentation in Digital Images

IP.com Disclosure Number: IPCOM000031025D
Original Publication Date: 2004-Sep-07
Included in the Prior Art Database: 2004-Sep-07
Document File: 5 page(s) / 102K

Publishing Venue

IBM

Abstract

A new method for segmenting characters in digital text images is introduced. This method can be used as a first stage in the OCR (Optical Character Recognition) process. In addition, it can be instrumental in layout analysis systems, where a text image has to be split into uniform blocks (e.g. abstracts, footnotes, tables, etc.).

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 28% of the total text.

Background

OCR (Optical Character Recognition) is the main tool for extracting textual information from digital images. Hence, the performance of document processing systems, to a large extent, depends on OCR quality. Indeed, even slight improvements in OCR reading rates translate into significant savings in the cost of document handling.

    The OCR process, in turn, can be viewed as a combination of two main sub-processes: segmentation and recognition. The first sub-process locates and isolates the characters. The second classifies each isolated character and assigns it the corresponding alphanumeric symbol. In high-quality images the characters are well separated, and segmentation is relatively straightforward. However, typical scanned images suffer from low contrast and a high degree of noise. Moreover, characters are frequently connected (due to the low quality of printers and typewriters). All these factors complicate the segmentation process. Wrong segmentation leads to recognition failures and, in the worst case, to substitution errors.

    Consider, for example, a character "n" that is badly segmented and truncated on its right side. Such a character can easily be misinterpreted as an "r", so that the word "counting" may be recognized as "courting". Verifying the word against an English dictionary will not help, since both words are valid.

    In this invention we introduce a new method for segmenting characters in digital text images. This method can be used as a first stage in the OCR process. In addition, it can be instrumental in layout analysis systems, where a text image has to be split into uniform blocks (e.g. abstracts, footnotes, tables, etc.).

Description of the invention

Paper documents are scanned and converted into digital images. (It is assumed here that in order to preserve image quality the images are acquired either in grey-scale or in color.)

    Optionally, the images go through a pre-processing stage that includes image enhancement, de-skewing, and binarization.
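
    The disclosure does not say how binarization is performed. As an illustration only, the sketch below assumes a grey-scale image stored as a list of rows of 0..255 values and uses Otsu's global threshold, a common choice that is swapped in here and is not taken from the source. The names otsu_threshold and binarize are hypothetical.

```python
def otsu_threshold(gray):
    """Return Otsu's threshold for a 2-D list of grey levels in 0..255.

    Assumption: Otsu's method is used only as an example of a global
    binarization rule; the disclosure does not name a specific method.
    """
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1

    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0    # number of pixels with grey level <= t (dark class)
    sum0 = 0  # sum of grey levels in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, no valid split at this level
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance.
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


def binarize(gray):
    """Map pixels at or below the threshold to 1 (ink) and the rest to 0."""
    t = otsu_threshold(gray)
    return [[1 if value <= t else 0 for value in row] for row in gray]
```

    The convention of 1 for ink and 0 for paper is also an assumption; it simply keeps dark print as the foreground for any later component analysis.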

    Next, segmentation is applied; this is the core of the present invention. The main steps of the proposed algorithm are described below. Sample images illustrate the process as applied to the image in Figure 1.

* Determine whether the machine print is dot matrix. Dot matrix print is characterized by many small dots rather than solid strokes. This case can easily be detected by a simple connected-components analysis, in which the dominant component size will be close to the dot size of the dot matrix. If this case is encountered, the image can be pre-processed prior to segmentation by applying morphological operators such as dilation and erosion, which convert the dots into strokes. This approach is limited if the distance between the dots is comparable to the distance between the characters, and that shou...
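
The dot-matrix check and the dilation/erosion clean-up described in this step can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes a binary image stored as a list of rows with 1 for ink, 8-connectivity, and a square structuring element. The function names and the parameters max_dot, min_share and radius are hypothetical.

```python
from collections import deque


def component_sizes(img):
    """Return the bounding-box (height, width) of each 8-connected ink component."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    sizes = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            top = bottom = r
            left = right = c
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            sizes.append((bottom - top + 1, right - left + 1))
    return sizes


def looks_dot_matrix(img, max_dot=2, min_share=0.6):
    """Heuristic (assumed): most components are at most max_dot pixels on a side."""
    sizes = component_sizes(img)
    if not sizes:
        return False
    small = sum(1 for h, w in sizes if h <= max_dot and w <= max_dot)
    return small / len(sizes) >= min_share


def _morph(img, radius, combine):
    """Apply max (dilation) or min (erosion) over a square window.

    The window is clipped at the image border, so pixels outside the
    image are ignored rather than treated as background.
    """
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [img[y][x]
                      for y in range(max(0, r - radius), min(rows, r + radius + 1))
                      for x in range(max(0, c - radius), min(cols, c + radius + 1))]
            out[r][c] = combine(window)
    return out


def dilate(img, radius=1):
    return _morph(img, radius, max)


def erode(img, radius=1):
    return _morph(img, radius, min)


def close_dots(img, radius=1):
    """Dilation followed by erosion: bridges gaps of up to about 2 * radius pixels."""
    return erode(dilate(img, radius), radius)
```

In this sketch the radius controls which gaps are bridged. If it is chosen large enough to bridge the gaps between dots and those gaps are comparable to the gaps between characters, neighbouring characters are merged as well, which is the limitation noted in the step above.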