
HROP - A fast method for distinguishing between hand-print and machine-print text

IP.com Disclosure Number: IPCOM000012961D
Original Publication Date: 2002-Jul-01
Included in the Prior Art Database: 2003-Jun-11
Document File: 4 page(s) / 50K

Publishing Venue

IBM

Abstract

Many computerized tasks analyze written material for purposes such as text scanning and postal sorting. Systems that analyze written material use some kind of OCR (Optical Character Recognition) engine. To improve OCR performance, different OCR engines are used for machine-print (MP) and hand-print (HP) material. This causes difficulty in applications where the input to the OCR engine may include both MP and HP text. To decide which OCR engine (MP or HP oriented) to activate for a given input text, one must know beforehand whether hand or machine print is involved. A preliminary step that distinguishes between hand and machine print, prior to the OCR stage, is therefore necessary for obtaining good results. Furthermore, when grouping segments into blocks prior to activating the OCR engine, information about the text type (HP or MP) may be of vital importance.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 28% of the total text.




     The presented method offers an efficient and fast way to characterize input text as MP or HP without using the OCR engines. Many image-understanding tasks that the human eye performs with great ease become difficult when computers attempt them automatically. Distinguishing between HP and MP text is one such task: any untrained person can do it easily, but an automatic algorithm faces many problems in attempting the same.

     As described previously, tasks that involve activating OCR engines need to distinguish between MP and HP text. To do so automatically, we must characterize global differences between hand and machine print and use them to form robust features that are invariant to font size, font type, and writer personality. As a first step, let us observe MP and HP text and try to characterize them in terms of line and space widths. HP text is usually written with a pen, which generates a typical line width that is modulated by the pressure applied to the pen. Letter and word spacing have a significant standard deviation, which results from the "human factor" as well as from the fact that handwritten letters are usually rounded and do not include many parallel lines. The widths of both handwritten lines and spaces are correlated with font size.

     MP text is usually characterized by a single dominant line stroke (font width), which is correlated with font size, and by additional line widths that result from the font styling and the specific shapes of the different letters. Letter and word spacing are not constant within the text; however, their standard deviations are not as significant as those of spacing in HP text. The described method will enhance and quantify these differences in an effort to obtain a classifier which will be able to distinguish MP and HP text for a variety of text types (different fo...
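
The abbreviated text does not give the actual HROP features, so the following is only a rough sketch of how the line and space widths discussed above might be measured. It assumes a binarized image supplied as a list of rows, each row a list with 1 for ink and 0 for background. The function names and the choice of a coefficient of variation (standard deviation divided by mean, used here so that the spread does not scale with font size) are illustrative assumptions, not part of the disclosure.

```python
from itertools import groupby
from statistics import mean, pstdev


def run_lengths(row):
    """Split one binary pixel row into ink-run and background-run lengths."""
    strokes, spaces = [], []
    for value, group in groupby(row):
        length = sum(1 for _ in group)
        if value:
            strokes.append(length)  # consecutive ink pixels: a line width sample
        else:
            spaces.append(length)   # consecutive background pixels: a space width sample
    return strokes, spaces


def width_statistics(image):
    """Collect horizontal stroke and space widths and summarize their spread.

    Hypothetical helper: 'image' is a list of rows (lists of 0/1 values).
    Background before the first and after the last ink pixel of each row is
    treated as margin and ignored.
    """
    strokes, spaces = [], []
    for row in image:
        if 1 not in row:
            continue  # blank row, nothing to measure
        first = row.index(1)
        last = len(row) - 1 - row[::-1].index(1)
        row_strokes, row_spaces = run_lengths(row[first:last + 1])
        strokes.extend(row_strokes)
        spaces.extend(row_spaces)

    def summarize(widths):
        if not widths:
            return {"mean": 0.0, "cv": 0.0}
        average = mean(widths)
        # Coefficient of variation: an assumed font-size normalization.
        return {"mean": average, "cv": pstdev(widths) / average}

    return {"stroke": summarize(strokes), "space": summarize(spaces)}


if __name__ == "__main__":
    evenly_spaced = [[1, 1, 0, 0, 1, 1, 0, 0, 1, 1]]
    unevenly_spaced = [[1, 0, 1, 1, 1, 0, 0, 0, 0, 1]]
    print(width_statistics(evenly_spaced))
    print(width_statistics(unevenly_spaced))
```

Under these assumptions, a larger spread in the space widths would point toward HP text and a smaller spread toward MP text. Any decision threshold would have to be tuned on labeled samples; none is given in the text above.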