Japanese Handprint Segmentation

IP.com Disclosure Number: IPCOM000117551D
Original Publication Date: 1996-Mar-01
Included in the Prior Art Database: 2005-Mar-31

Publishing Venue

IBM

Related People

Bella, IN: AUTHOR

Abstract

Algorithms were developed for the segmentation of Japanese hand-printed text images as input to Optical Character Recognition (OCR). The algorithms presented here are mostly language independent.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 13% of the total text.

      Disclosed is a Japanese hand-printed text image segmenter.  The
algorithm descriptions are presented in chronological order such that
the evolution of each algorithm can be seen.  The intended audience
is technical personnel who have an understanding of basic computer
science and statistical methods.

The segmenter has the following requirements:
  o  The input consists of an image of Japanese hand-printed text.
      This includes Hiragana, Katakana, Kanji, some upper-case Roman,
      and Numeric characters.  The segmenter also handles machine
      printed Japanese text, but it is tuned for hand print
      characteristics.
  o  Page segmentation has already been done.  Hence, there are no
      intermingled diagrams or severely disjointed sections of text.
      All lines of text must be oriented in the same general
      direction.
  o  The text has been rotated so that the lines are mostly
      horizontal or vertical.  However, lines may be at slightly
      different angles or slightly curved relative to each other.
  o  The output consists of extracted character images in the order
      that they are read from the image.  The character images are
      normalized to the specifications of the recognition module.

Segmentation

      General Algorithm History - To point out the difficulties in
segmenting Japanese text, a comparison with non-cursive English can
be made.  To start out, Japanese and non-cursive English are similar
in that most characters are separated by white space.  However, there
are occasional touching characters.  The main difference between
Japanese and English is in the number of connected components per
character.  English text tends to have one component per character,
whereas Japanese tends to have multiple components per character.  A
second notable difference is that Japanese text may be written
vertically or horizontally.  This requires some additional processing
to determine the line orientation.
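
      The difference in connected components per character can be
illustrated with a small sketch.  It is not taken from the disclosure:
the function name count_components, the choice of 8-connectivity, and
the toy glyphs are assumptions made purely for illustration.

```python
from collections import deque


def count_components(image):
    """Count 8-connected groups of ink pixels (1s) in a binary image.

    `image` is a list of equal-length rows; 1 means ink, 0 means background.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Found an unvisited ink pixel: flood-fill its whole component.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# Hypothetical toy glyphs: one continuous stroke versus three separate
# horizontal strokes (loosely resembling a three-stroke character).
ONE_STROKE = [
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 1],
]
THREE_STROKES = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]

print(count_components(ONE_STROKE))     # 1
print(count_components(THREE_STROKES))  # 3
```

      A segmenter that treated every component as a character would
split the second glyph into three pieces, which is why components
must be grouped before recognition.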

      Given this overview of the problem, two general approaches
may be taken: from the top down or the bottom up.  To take a top-down
approach, one would first divide the text into lines, and then divide
the lines into characters.  To take a bottom-up approach, one would
cluster the entire text into characters, and then determine the
lines.  It seemed logical to take the first approach because we can
then use a one-dimensional clustering as opposed to a more complex
two-dimensional clustering.  In addition, a bottom-up approach would
lose the information one can gain from a pattern of lines.
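
      The top-down idea can be sketched as two passes over
one-dimensional ink profiles.  This is a minimal illustration that
assumes horizontal text lines and splits wherever a profile drops to
zero.  The blank-gap rule stands in for the clustering mentioned
above, whose details are not given in this excerpt, and the names
ink_runs and segment_top_down are hypothetical.

```python
def ink_runs(profile):
    """Return (start, end) pairs, end exclusive, for each maximal run
    of nonzero values in a one-dimensional ink profile."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))        # the run ended at a blank
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_top_down(image):
    """Split a binary image (list of rows, 1 = ink) into lines, then
    split each line into boxes.

    Returns one list per text line, each holding
    (top, bottom, left, right) boxes in left-to-right order.
    Assumes horizontal lines separated by fully blank rows.
    """
    # Pass 1: project ink onto the vertical axis to find the lines.
    row_profile = [sum(row) for row in image]
    lines = []
    for top, bottom in ink_runs(row_profile):
        band = image[top:bottom]
        # Pass 2: project this line's ink onto the horizontal axis.
        col_profile = [sum(col) for col in zip(*band)]
        lines.append([(top, bottom, left, right)
                      for left, right in ink_runs(col_profile)])
    return lines


# Hypothetical 4x5 page: two text lines with two ink blobs each.
PAGE = [
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1],
]

for line in segment_top_down(PAGE):
    print(line)
# [(0, 2, 0, 2), (0, 2, 3, 4)]
# [(3, 4, 1, 2), (3, 4, 3, 5)]
```

      Each pass works on a single list of numbers, which is the
simplification the top-down approach buys.  Splitting at every blank
column would break a multi-component character into pieces, so a
practical segmenter would still need to merge neighboring runs.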

      Given the relatively short time given to complete the task, we
did not have the time to try less c...