Method of Differentiating Image from Text within Documents

IP.com Disclosure Number: IPCOM000107951D
Original Publication Date: 1992-Apr-01
Included in the Prior Art Database: 2005-Mar-22
Document File: 2 page(s) / 87K

Publishing Venue

IBM

Related People

Fitzpatrick, GP: AUTHOR [+4]

Abstract

A document (one potentially containing both text and graphics components) may enter a computer system via unarchitected methods like a scanner or a facsimile machine. In this case, the boundaries between the text and image components are often unclear. For instance, text may be contained within what may appear to be the image region of the document. A method is needed to identify those regions of a document which are "image" even though those portions may contain text, and even in the absence of borderline boundaries, such as a box or line.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      Often, it is necessary to exclude the image component(s) of a
given document from various processes. For instance, a spell-checking
program would waste resources trying to analyze a textual statement
within an image portion of a document. Furthermore, certain printers
are not well adapted for printing images. Likewise, a user may wish
to reduce storage costs by storing only the textual portions of a
document.

      The solution to this problem lies in extending the capabilities
of existing Optical Character Recognition (OCR) software to:
      -    isolate text from image components, in the absence of
borderline boundaries
      -    treat "text" that is associated with an image (e.g.,
captions) as image
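
      The second capability can be illustrated with a minimal
sketch. It assumes that an OCR pass has already produced bounding
boxes for words and that candidate image regions are known as
rectangles; the names Box, overlap_fraction and label_words, and the
0.5 overlap threshold, are hypothetical and do not come from the
disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned pixel rectangle; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def overlap_fraction(word: Box, region: Box) -> float:
    """Return the fraction of the word box's area lying inside the region."""
    width = min(word.right, region.right) - max(word.left, region.left)
    height = min(word.bottom, region.bottom) - max(word.top, region.top)
    if width <= 0 or height <= 0:
        return 0.0
    area = (word.right - word.left) * (word.bottom - word.top)
    return (width * height) / area if area > 0 else 0.0


def label_words(words, image_regions, threshold=0.5):
    """Label each word box "image" if enough of it falls in an image region.

    The threshold is an illustrative assumption, not a value from the
    disclosure.
    """
    labels = []
    for word in words:
        inside = any(
            overlap_fraction(word, region) >= threshold
            for region in image_regions
        )
        labels.append("image" if inside else "text")
    return labels


if __name__ == "__main__":
    words = [Box(10, 10, 50, 20), Box(110, 110, 150, 120)]
    regions = [Box(100, 100, 200, 200)]
    print(label_words(words, regions))
```

      In this sketch a caption lying within an image rectangle is
labeled "image" and could then be skipped by a later process such as
a spell checker.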

      This invention uses probability techniques to determine
whether a particular region on a page is an image. Such an area can
be delimited by pixel boundaries, which can be defined to approximate
a rectangle, square, circle, or similar geometric shape. These
geometric shapes are used to measure confidence factors that an
identified area is image only.
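
      The available text does not give a formula for the confidence
factor, so the following is only one possible sketch. It assumes the
page area is a rectangular grid of 0/1 pixel values and uses the
share of the bounding rectangle covered by set pixels as a stand-in
score; the names bounding_box and rectangle_confidence are
hypothetical.

```python
def bounding_box(grid):
    """Return (top, left, bottom, right) of the set pixels, or None if empty.

    All four bounds are inclusive. The grid is assumed to be a list of
    equal-length rows holding 0 or 1.
    """
    rows = [r for r, row in enumerate(grid) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(grid[0])) if any(row[c] for row in grid)]
    return rows[0], cols[0], rows[-1], cols[-1]


def rectangle_confidence(grid):
    """Score in [0, 1]: fraction of the bounding rectangle that is set.

    A solidly filled rectangle scores 1.0; scattered marks score lower.
    This fill ratio is an assumed proxy, not the disclosure's measure.
    """
    box = bounding_box(grid)
    if box is None:
        return 0.0
    top, left, bottom, right = box
    area = (bottom - top + 1) * (right - left + 1)
    ink = sum(sum(row[left:right + 1]) for row in grid[top:bottom + 1])
    return ink / area


if __name__ == "__main__":
    solid = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    sparse = [
        [1, 0, 0, 1],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [1, 0, 0, 1],
    ]
    print(rectangle_confidence(solid), rectangle_confidence(sparse))
```

      A comparable ratio could be computed against a circle or other
shape; the rectangle is used here only because it is the simplest
case to show.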

      Traditional OCR techniques are used to identify a character or
symbol; however, this invention places each recognizable and
unrecognizable symbol...