Field Detection in Forms by White Space Analysis

IP.com Disclosure Number: IPCOM000114005D
Original Publication Date: 1994-Oct-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 4 page(s) / 152K

Publishing Venue

IBM

Related People

Billings, DW: AUTHOR

Abstract

A method to facilitate the location of data entry fields on a bitmap image of a form is disclosed. This method can be used in preparation for data extraction using Intelligent Character Recognition (ICR) or Optical Character Recognition (OCR).

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 35% of the total text.

Field Detection in Forms by White Space Analysis

      A method to facilitate the location of data entry fields on a
bitmap image of a form is disclosed.  This method can be used in
preparation for data extraction using Intelligent Character
Recognition (ICR) or Optical Character Recognition (OCR).

      Forms processing involves data extraction from predefined
fields on a form.  Before reading the data, which may include an
address block, account information, or amounts, the locations on the
form must be defined so that the data extraction program can locate
the data.  This is done manually, using either a ruler or a computer
program.  Whichever procedure is used, the output is a list of bitmap
coordinates which identify the locations of the fields containing
data to be extracted.  The invention described in this bulletin
automatically locates likely data entry fields on the bitmap image,
and returns their coordinates.  In addition to data entry fields
bounded by solid lines (the most common type of data entry field),
this method also locates two other types of data entry fields:  those
bounded by dotted lines and those bounded by irregular shapes, such
as circles.

      One method to automatically locate fields is to search for
horizontal and vertical lines, and then compute the bounding boxes
formed by their intersections.  That method is deficient in that a)
the lines of some scanned images are broken and/or jagged, making
line detection difficult, and b) some data entry areas are bounded
not by lines but by irregular shapes or dotted lines.  The white
space method disclosed here also more closely matches the concept of
field detection, which is the location of large white areas
("fields"), not the location of line intersections.

      This invention assumes that a bitmap image of a blank form
exists, typically created by a scanner at 200 or 300 dots per inch.
The invention first converts the image to run length encoding, in
which white runs are represented by pairs of horizontal coordinates.
A white run of 100 pixels beginning at column 50, for example, would
be represented by the pair (50, 149).  The first value is called the
"left" and the second value is called the "right" of the run.
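
      The run length encoding step can be illustrated with a short
Python sketch.  It is not taken from the bulletin: the function name
white_runs and the pixel convention (0 for white, 1 for black) are
assumptions made for illustration only.  It encodes one bitmap row
as inclusive (left, right) pairs and reproduces the example above.

```python
from typing import List, Optional, Sequence, Tuple

Run = Tuple[int, int]  # (left, right), both inclusive column indices


def white_runs(row: Sequence[int], white: int = 0) -> List[Run]:
    """Return the white runs in one bitmap row as (left, right) pairs.

    Assumption: pixels equal to `white` are white; anything else is black.
    """
    runs: List[Run] = []
    start: Optional[int] = None  # column where the current white run began
    for col, pixel in enumerate(row):
        if pixel == white:
            if start is None:
                start = col
        elif start is not None:
            # A black pixel closes the run at the previous column.
            runs.append((start, col - 1))
            start = None
    if start is not None:
        # The run extends to the right edge of the row.
        runs.append((start, len(row) - 1))
    return runs


# 50 black pixels, then 100 white pixels, then 10 black pixels.
row = [1] * 50 + [0] * 100 + [1] * 10
print(white_runs(row))  # [(50, 149)]
```

      Applying the function to every row of the image gives one list
of runs per row, which is the form the later sketches assume.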

      The invention then begins to search for white run lengths
greater than some threshold, say 50 pixels, beginning at the upper
left corner of the image.  This threshold is the minimum width for a
data entry field that the invention will find.  A large threshold
will only find wide fields, while a small threshold will identify too
many blank areas as possible data entry fields.
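
      The threshold search might look like the following Python
sketch.  The name candidate_runs and the representation of the image
as a list of per-row run lists are assumptions, not part of the
bulletin; the default of 50 pixels is the example value given above.
Rows are scanned from the top and runs from the left, so candidates
are reported starting at the upper left corner.

```python
from typing import Iterator, List, Tuple

Run = Tuple[int, int]  # (left, right), both inclusive column indices


def run_length(run: Run) -> int:
    """Width in pixels of an inclusive (left, right) run."""
    left, right = run
    return right - left + 1


def candidate_runs(
    rle_rows: List[List[Run]], threshold: int = 50
) -> Iterator[Tuple[int, Run]]:
    """Yield (row_index, run) for each white run longer than `threshold`.

    Rows are visited top to bottom, and runs in the order they are stored
    (left to right if each row was encoded in scan order).
    """
    for y, runs in enumerate(rle_rows):
        for run in runs:
            if run_length(run) > threshold:
                yield y, run


rle_rows = [
    [(0, 9), (20, 39)],   # widths 10 and 20: both too short
    [(0, 9), (50, 149)],  # width 100: qualifies
    [(5, 60)],            # width 56: qualifies
]
print(list(candidate_runs(rle_rows)))  # [(1, (50, 149)), (2, (5, 60))]
```

      Lowering the threshold in this example to 15 would also report
the 20-pixel run in the first row, which illustrates how a small
threshold admits more blank areas as candidates.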

      Once a sufficiently long white run is located, the invention
searches downwards to determine the vertical extent of this white
area.  The search continues, row by row, until it reaches a row in
which no white run has left and right coordinates at least as large
as those of the first run.
There are eight possible white run cases:
 1.  No overlapping white run found (end of...
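
      The downward search can be sketched in Python under a
simplifying assumption.  The eight cases are not reproduced in this
abbreviated text, so the sketch does not attempt them.  It handles
only the basic stopping rule, read here as: continue while the next
row contains a white run that spans at least the horizontal extent of
the first run.  That reading, and the names covers and
vertical_extent, are assumptions rather than the bulletin's own
definitions.

```python
from typing import List, Tuple

Run = Tuple[int, int]            # (left, right), inclusive columns
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def covers(run: Run, seed: Run) -> bool:
    """True if `run` spans at least the horizontal extent of `seed`."""
    return run[0] <= seed[0] and run[1] >= seed[1]


def vertical_extent(rle_rows: List[List[Run]], top: int, seed: Run) -> Box:
    """Grow a box downwards from the seed run found on row `top`.

    Simplified rule (an assumption): stop at the first row that has no
    white run covering the seed run's full width.
    """
    bottom = top
    for y in range(top + 1, len(rle_rows)):
        if not any(covers(run, seed) for run in rle_rows[y]):
            break
        bottom = y
    return (seed[0], top, seed[1], bottom)


rle_rows = [
    [(50, 149)],  # row 0: seed run
    [(50, 149)],  # row 1: same extent, search continues
    [(40, 160)],  # row 2: wider run still covers the seed
    [(50, 100)],  # row 3: too narrow, search stops
]
print(vertical_extent(rle_rows, 0, (50, 149)))  # (50, 0, 149, 2)
```

      The returned tuple gives the kind of bitmap coordinates the
method is meant to output for a candidate field; partial overlaps and
the other cases would need additional handling.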