Converting Paper Documents to Electronic Images with Automatic Document Recognition, Index Generation and Template Removal

IP.com Disclosure Number: IPCOM000107869D
Original Publication Date: 1992-Mar-01
Included in the Prior Art Database: 2005-Mar-22
Document File: 5 page(s) / 191K

Publishing Venue

IBM

Related People

Aghili, H: AUTHOR [+2]

Abstract

Disclosed is a method for converting paper documents into digital images incorporating techniques 1) for automatic index generation, 2) for removing the irrelevant background information like form templates, and 3) for reconstructing the documents into their original form for later presentation. These features will result in considerable savings in conversion, storage, and communications costs.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 36% of the total text.

      Eliminating paper as a medium for storing and distributing
information has long been a goal of the information processing
industry. Achieving this objective has recently become more feasible
than ever with the introduction of image systems that convert paper
documents to digital images and store the results in electronic
libraries.

      This is, however, only the initial step toward full automation;
the stored information must be properly indexed to be of any
practical value. The standard solution offered by existing products
is to have an operator review the documents during the conversion
process and manually key in the desired index information. This
approach rapidly becomes impractical as the workload grows.

      Another drawback of the existing products is that they
typically convert paper documents to digital images without regard to
the content of those documents. Many business documents are produced
by filling out standard forms or templates. Merely digitizing the
paper documents and storing the images incurs a significant storage
cost for a large amount of redundant and irrelevant information, that
is, the portion of each image that corresponds to the template. The
net result is considerable storage and communications cost in any
large-scale operation.

      This article describes a new method for converting paper
documents to digital images. It differs from the existing methods in
two ways: 1) it attempts to generate the indices automatically; and
2) it removes the irrelevant template information, storing only the
data portion of each document together with the information necessary
to reconstruct it in its original form. The result is the elimination
of the tedious and time-consuming operation of manual indexing, and
significant savings in storage and communications costs:
1. Templates are stored only once in the library, rather than being
stored for each instance of a document type.
2. Templates need to be transferred only infrequently, ideally just
once, into workstation cache memories.
3. For each instance of a document type only the data portion needs
to be transferred to workstations.
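
The three savings listed above can be illustrated with a minimal sketch. It is not the disclosed implementation: it assumes binary images held as lists of 0/1 rows, scans that are perfectly registered with their template and free of noise, and a document type that has already been recognized. The names Library, Workstation, remove_template, and reconstruct are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Image = List[List[int]]  # binary raster: 1 = black pixel, 0 = white pixel


def remove_template(document: Image, template: Image) -> Image:
    """Keep only the pixels set in the document but not in the template."""
    return [[d & (1 - t) for d, t in zip(drow, trow)]
            for drow, trow in zip(document, template)]


def reconstruct(template: Image, data: Image) -> Image:
    """Overlay the data portion on its template (pixel-wise OR)."""
    return [[t | d for t, d in zip(trow, drow)]
            for trow, drow in zip(template, data)]


@dataclass
class Library:
    # Each template is stored once, keyed by a (hypothetical) template id.
    templates: Dict[str, Image] = field(default_factory=dict)
    # Each document stores only its template id and its data portion.
    documents: Dict[str, Tuple[str, Image]] = field(default_factory=dict)

    def store(self, doc_id: str, template_id: str, scanned: Image) -> None:
        data = remove_template(scanned, self.templates[template_id])
        self.documents[doc_id] = (template_id, data)


@dataclass
class Workstation:
    library: Library
    cache: Dict[str, Image] = field(default_factory=dict)
    template_transfers: int = 0  # counts template fetches from the library

    def view(self, doc_id: str) -> Image:
        # Only the data portion is fetched for every document instance.
        template_id, data = self.library.documents[doc_id]
        if template_id not in self.cache:
            # The template is transferred on first use, then reused.
            self.cache[template_id] = self.library.templates[template_id]
            self.template_transfers += 1
        return reconstruct(self.cache[template_id], data)
```

Under these assumptions, viewing any number of documents of the same type transfers the template to the workstation once, and the OR overlay restores each scan exactly, because every template pixel is also present in a registered scan. Real scans would need registration and noise tolerance, which this sketch leaves out.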

      Assume that enterprise documents are typically produced by
filling out data on templates (forms).  Denote the list of all such
valid templates by:
          EnterpriseTemplateSet = <t1, t2, ... , tN>,
where N, N...