Generating Words from Characters using an Adaptive "Learning" Algorithm

IP.com Disclosure Number: IPCOM000112406D
Original Publication Date: 1994-May-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 4 page(s) / 180K

Publishing Venue

IBM

Related People

Viswanathan, M: AUTHOR

Abstract

Humans use syntactic and semantic knowledge to read proportionally-spaced text. Computers can "read" such text by using an adaptive learning algorithm which generates readable text from an otherwise unreadable stream of characters. An adaptive algorithm to generate readable text from characters being rasterized for use in search engines in image libraries is described here. The algorithm uses a pattern classification technique to shift the inter-character and inter-word space parameter values towards what is observed, reducing uncertainty as the sample size increases.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 27% of the total text.

      Humans use syntactic and semantic knowledge to read
proportionally-spaced text.  Computers can "read" such text by using
an adaptive learning algorithm which generates readable text from an
otherwise unreadable stream of characters.  An adaptive algorithm to
generate readable text from characters being rasterized for use in
search engines in image libraries is described here.  The algorithm
uses a pattern classification technique to shift the inter-character
and inter-word space parameter values towards what is observed,
reducing uncertainty as the sample size increases.

      In proportionally-spaced text, characters and words are spaced
in accordance with the amount of "space" required to satisfy basic
readability and aesthetic considerations.  This means that when
proportionally-spaced text is typeset or formatted, the horizontal
space occupied by an 'i' is less than that occupied by 'n', which in
turn is less than that of 'w'.  Therefore, the space between two
characters is a function of the widths of the characters that flank
it.  When such text samples are rasterized, the rasterizing engine
relies solely on the formatting instructions to form words out of
characters.  In some cases, the formatter provides the spaces
themselves as characters.  In most cases it does not, and human
readers rely on their knowledge of printed text to separate the
characters of one word from those of the next, using the simple rule
of thumb that inter-word spaces are generally larger than
inter-character spaces.  This rule works remarkably well even when
text is badly formatted.  However, the rule does not hold for
proportionally-spaced text.  Our brains use
semantic as well as syntactic knowledge to permit us to read
proportionally-spaced text.  However, it is a particularly difficult
task to build this level of knowledge into a computer to make this
distinction.  Therefore, if rasterization were to be interrupted to
abstract the characters out of the system, computers would need
"vision" to combine characters into words.  Though the positional
information of these characters provides the x- and y-coordinates of
the origins of
the characters, the general rule of thumb about inter-character and
inter-word spacing fails to produce readable text.  Further, humans
divine an estimate of inter-word and inter-character spaces after
scanning the whole string of text, something that is impossible given
the real-time, binary decision that we expect from the computer as
each character is observed.  The adaptive algorithm presented here
provides a means of generating readable text from this otherwise
unreadable stream of characters.
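
      The idea of deciding on each space as the characters arrive,
while shifting the space parameters towards what is observed, can be
illustrated with a minimal sketch.  This is not the algorithm of the
disclosure, whose details are not reproduced in this abbreviated
text.  It assumes a two-class nearest-mean classifier with running
means; the names Glyph, AdaptiveGapClassifier and build_words, the
starting guesses, and all coordinates are hypothetical and chosen
only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # x-coordinate of the character origin (assumed: left edge)
    width: float  # horizontal extent of the character


class AdaptiveGapClassifier:
    """Nearest-mean classifier for gaps; each class mean drifts toward its samples."""

    def __init__(self, char_gap: float, word_gap: float):
        # Starting guesses are treated as one pseudo-observation per class.
        self.means = {"char": char_gap, "word": word_gap}
        self.counts = {"char": 1, "word": 1}

    def classify(self, gap: float) -> str:
        # Binary decision made immediately, from the current estimates only.
        label = min(self.means, key=lambda k: abs(gap - self.means[k]))
        # Running-mean update: the step size 1/n shrinks as samples accumulate.
        self.counts[label] += 1
        self.means[label] += (gap - self.means[label]) / self.counts[label]
        return label


def build_words(glyphs, classifier):
    """Group a left-to-right stream of glyphs on one text line into words."""
    words, current, prev = [], "", None
    for g in glyphs:
        if prev is not None:
            # Gap = white space between the previous glyph's right edge
            # and this glyph's origin.
            gap = g.x - (prev.x + prev.width)
            if classifier.classify(gap) == "word":
                words.append(current)
                current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


# Invented coordinates for the phrase "Fred runs" (illustration only).
glyphs = [
    Glyph("F", 0.0, 6.0), Glyph("r", 7.0, 4.0), Glyph("e", 12.2, 5.0),
    Glyph("d", 18.0, 5.0), Glyph("r", 26.0, 4.0), Glyph("u", 31.1, 5.0),
    Glyph("n", 37.0, 5.0), Glyph("s", 43.0, 4.0),
]
clf = AdaptiveGapClassifier(char_gap=0.5, word_gap=4.0)
words = build_words(glyphs, clf)
print(words)  # ['Fred', 'runs']
```

      In this sketch the starting guesses are deliberately rough; the
inter-character estimate moves from 0.5 towards the observed spaces
of about 1.0 as more characters are seen, and each new sample moves
the estimate by a smaller amount than the one before it.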

      Let us assume that we are at that stage of the rasterization
process wherein the characters being rasterized are known along with
their heights and widths and x and y locations in the page image.
Hence, a phrase of the form "Fred runs," would include the
coordinates fo...