Blank Space Detection for Optical Character Recognition

IP.com Disclosure Number: IPCOM000078806D
Original Publication Date: 1973-Mar-01
Included in the Prior Art Database: 2005-Feb-26
Document File: 1 page(s) / 12K

Publishing Venue

IBM

Related People

Baumgartner, RJ: AUTHOR

Abstract

In some applications of optical character recognition (OCR) systems, it is crucial that the recognized characters be properly separated into words. Words must not, however, be broken up by inserting extra blank spaces. Blank spaces may range in size from 0.036 inch to many inches, while intercharacter spaces may occupy an overlapping range of zero to more than 0.100 inch, in different machine-printed or typewritten fonts. Thus, the size of the blank space must be individually determined for each document or even each line being scanned. The method described below reliably detects blank spaces in both fixed-pitch and variable-pitch fonts.

This is the abbreviated version, containing approximately 71% of the total text.

The number of scans between each character pair in a line is first accumulated in a table. As an example, the width of each scan is assumed to be 0.036 inch, and the number of scans between successive characters may be positioned in the table as follows:

Number of scans:   0, 1, 2    3, 4    5, 6    ....    39, 40
Table position:       1         2       3     ....      20
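The accumulation step can be sketched in Python as follows. This is only an illustration under stated assumptions: the binning is read as scan counts 0 to 2 falling in position 1, then two counts per position up to 39 and 40 in position 20. The names `table_position` and `build_gap_table` are hypothetical, and clamping gaps wider than 40 scans into position 20 is an assumption the disclosure does not state.

```python
NUM_POSITIONS = 20  # positions 1..20, as in the example table


def table_position(scans: int) -> int:
    """Map a gap width (in scans) to a 1-based table position.

    Counts 0-2 share position 1; after that, each position covers two
    counts (3-4 -> 2, 5-6 -> 3, ..., 39-40 -> 20).
    """
    if scans <= 2:
        return 1
    # Assumption: gaps wider than 40 scans are clamped into the last position.
    return min((scans - 1) // 2 + 1, NUM_POSITIONS)


def build_gap_table(gap_scans):
    """Count how many gaps in one line fall into each table position."""
    table = [0] * NUM_POSITIONS
    for scans in gap_scans:
        table[table_position(scans) - 1] += 1
    return table


# Hypothetical gap widths (in scans) measured along one line.
example_table = build_gap_table([1, 2, 3, 12, 13])
```

Here `example_table[0]` holds the number of gaps that were 0 to 2 scans wide, `example_table[1]` the number that were 3 or 4 scans wide, and so on.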

The goal is to select a table position which has intercharacter spaces to the left and interword spaces to the right. This table position then becomes a threshold for the generation of a "blank space" signal by the OCR system.
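Once a threshold position is known, generating the "blank space" signal amounts to comparing each gap's table position with it. The sketch below is illustrative only: `insert_blanks` is a hypothetical helper, and treating a gap that lands exactly on the threshold position as an intercharacter space (rather than a blank) is an assumption, since the text only says that intercharacter spaces lie to the left and interword spaces to the right.

```python
def insert_blanks(chars, gap_positions, threshold_position):
    """Rebuild a line, adding a blank wherever a gap exceeds the threshold.

    chars              recognized characters, in order
    gap_positions      table position of the gap after each character
                       except the last (len(chars) - 1 entries)
    threshold_position table position separating intercharacter gaps
                       from interword gaps
    """
    if len(gap_positions) != max(len(chars) - 1, 0):
        raise ValueError("expected one gap between each character pair")
    pieces = []
    for i, ch in enumerate(chars):
        pieces.append(ch)
        # Assumption: only gaps strictly to the right of the threshold
        # position raise the blank-space signal.
        if i < len(gap_positions) and gap_positions[i] > threshold_position:
            pieces.append(" ")
    return "".join(pieces)


# Hypothetical line: six characters with one wide gap after the third.
line = insert_blanks("THECAT", [1, 2, 6, 1, 2], threshold_position=3)
```

With these made-up gap positions, only the gap in position 6 exceeds the threshold, so a single blank is inserted between "THE" and "CAT".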

The threshold is determined by searching the table positions in ascending order, until the sum of the entries exceeds three. The search continues until two sequential positions have zero entries. If the next position also contains a zero entr...
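The part of the search that survives in this abbreviated text can be sketched as below. It covers only the two stated steps: scan in ascending order until the running sum of entries exceeds three, then continue until two sequential positions are zero. The rule for what happens when the next position is also zero is cut off in the source and is deliberately not reproduced here. The name `find_zero_pair` and the choice to report the first position of the zero pair are assumptions for illustration, not the disclosure's final threshold rule.

```python
def find_zero_pair(table, min_count=3):
    """Return the 1-based position of the first of two sequential zero
    entries found after the running sum exceeds min_count, else None.

    table is a list of counts, index 0 holding table position 1.
    """
    running = 0
    i = 0
    n = len(table)

    # Step 1: ascend until the sum of the entries exceeds three.
    while i < n:
        running += table[i]
        i += 1
        if running > min_count:
            break
    else:
        return None  # the line has too few gaps to pass step 1

    # Step 2: continue until two sequential positions have zero entries.
    while i + 1 < n:
        if table[i] == 0 and table[i + 1] == 0:
            return i + 1  # convert 0-based index to 1-based position
        i += 1
    return None


# Hypothetical 20-position table: narrow gaps cluster in positions 1-3,
# wide gaps in positions 7-8.
counts = [2, 5, 3, 0, 0, 0, 1, 4] + [0] * 12
pair_start = find_zero_pair(counts)
```

For this made-up table the running sum passes three at position 2, and the first pair of empty positions starts at position 4. How the final threshold is derived from that point depends on the truncated remainder of the disclosure.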