Semi-Automatic Corpus Annotation for On-Line Handwriting Recognition

IP.com Disclosure Number: IPCOM000106761D
Original Publication Date: 1993-Dec-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 2 page(s) / 92K

Publishing Venue

IBM

Related People

Bellegarda, EJ: AUTHOR [+3]

Abstract

An efficient algorithm is presented for the semi-automatic annotation of handwritten corpora, based on a coarse partition of the chirographic space into distinctive ways of writing a given character. The annotated corpora can be used for automatic handwriting recognition in either template-matching algorithms (to generate a starter prototype set) or HMM-based algorithms (to generate Markov character models).

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

      An efficient algorithm is presented for the semi-automatic
annotation of handwritten corpora, based on a coarse partition of the
chirographic space into distinctive ways of writing a given
character.  The annotated corpora can be used for automatic
handwriting recognition in either template-matching algorithms (to
generate a starter prototype set) or HMM-based algorithms (to
generate Markov character models).

      A starter prototype set has previously been established by
collecting and analyzing writing styles with wide coverage.  This
is useful for selectively generating a vocabulary of lexemes, where a
lexeme is defined as a variation around a given character.  In
essence, lexemes are topologically distinct versions of the same
letter.  For example, there are many ways to write the lower-case
letter y: it can be written in one, two, or even three strokes, and
the one-stroke y category can itself be subdivided into different
classes according to the starting point and formation of the letter.
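
      As an illustration only, a vocabulary of lexemes might be held
as a mapping from each character to its distinct written forms.  The
sketch below is not taken from the disclosure: the Lexeme record, its
fields, and the variant labels are hypothetical names chosen for this
example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    """One topologically distinct way of writing a character (hypothetical record)."""
    character: str     # the character this lexeme is a variation of
    stroke_count: int  # number of pen strokes used to write it
    variant: str       # free-form label, e.g. where the stroke starts


def build_vocabulary(lexemes):
    """Group lexemes by the character they represent."""
    vocabulary = defaultdict(list)
    for lexeme in lexemes:
        vocabulary[lexeme.character].append(lexeme)
    return dict(vocabulary)


# Illustrative entries only; the variant labels are invented for this sketch.
y_lexemes = [
    Lexeme("y", 1, "starts at upper left"),
    Lexeme("y", 1, "starts at upper right"),
    Lexeme("y", 2, "two separate strokes"),
    Lexeme("y", 3, "three separate strokes"),
]
vocabulary = build_vocabulary(y_lexemes)

# The one-stroke category can be picked out and subdivided further.
one_stroke_y = [lx for lx in vocabulary["y"] if lx.stroke_count == 1]
```

Keeping the stroke count as an explicit field makes the first-level
split (one, two, or three strokes) a simple filter, while finer
distinctions remain in the variant label.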

      The vocabulary of lexemes thus represents a coarse partition
of the chirographic space into regions, each corresponding to a
different way of writing a character.  This partitioning of the
chirographic space is very crude compared to a clustering defined at
the frame level.  However, it is more appropriate for use in
conjunction with a template-matching approach to handwriting
recognition.  In essence, this is the closest we can come to the
equivalent of an alphabet of phones (i.e., morphologically motivated
building blocks) for handwriting recognition.
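
      One way to picture such a partition is a nearest-template rule:
each sample falls into the region of the lexeme whose template it
most resembles.  The sketch below assumes single-stroke samples, arc
length resampling, and a mean point-to-point Euclidean distance; none
of these choices is stated in the disclosure, and the function and
template names are hypothetical.

```python
import math


def resample(points, n=8):
    """Resample a pen trajectory to n points evenly spaced along its arc length."""
    dists = [0.0]  # cumulative arc length at each input point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dists.append(dists[-1] + math.hypot(x1 - x0, y1 - y0))
    total = dists[-1]
    if total == 0.0:
        return [points[0]] * n  # degenerate stroke: a single dot
    out, seg = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the input segment that contains the target arc length.
        while seg < len(points) - 2 and dists[seg + 1] < target:
            seg += 1
        span = dists[seg + 1] - dists[seg]
        t = 0.0 if span == 0.0 else (target - dists[seg]) / span
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


def template_distance(a, b, n=8):
    """Mean Euclidean distance between corresponding resampled points."""
    ra, rb = resample(a, n), resample(b, n)
    return sum(math.hypot(xa - xb, ya - yb)
               for (xa, ya), (xb, yb) in zip(ra, rb)) / n


def nearest_lexeme(sample, templates):
    """Return the name of the template closest to the sample."""
    return min(templates, key=lambda name: template_distance(sample, templates[name]))


# Two hypothetical lexemes with the same shape but opposite starting points.
templates = {
    "stroke-down": [(0.0, 1.0), (0.0, 0.0)],
    "stroke-up": [(0.0, 0.0), (0.0, 1.0)],
}
sample = [(0.05, 0.9), (0.0, 0.1)]  # a slightly slanted downward stroke
best = nearest_lexeme(sample, templates)  # "stroke-down" for this sample
```

Because the points are compared in writing order, two templates with
identical shapes but different starting points occupy different
regions, which matches the distinction drawn between lexemes above.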

      A systematic method to annotate handwriting data based on such
an alphabet of lexemes is presented in this disclosure.  Accurate
annotation of handwritten corpora is desirable in template-matching
based systems to generate a good quality starter prototype set for
writer-independent processing.  It is also required in HMM-based
systems to grow representative, writer-independent Markov character
models (baseforms).

The procedure adopted is as follows:

1.  Display, edit, and clean up the previously obtained starter set,
    removing far-fetched lexemes and keeping only the most
    representative ones.  The idea is to discard accidental instances
    of a character while isolating true variations around the
    character.

2.  Add in "artificially generated" lexemes to account for further
    variations in strok...