
Fully Automatic Corpus Annotation Handwriting Recognition

IP.com Disclosure Number: IPCOM000106749D
Original Publication Date: 1993-Dec-01
Included in the Prior Art Database: 2005-Mar-21
Document File: 4 page(s) / 159K

Publishing Venue

IBM

Related People

Bellegarda, E: AUTHOR [+4]

Abstract

A fully automatic procedure based on a K-means clustering algorithm is proposed for the annotation of handwritten corpora. The algorithm first determines a suitable alphabet of lexemes, each of which is representative of one distinctive way of writing a given character. Each character is then annotated by its closest lexeme as measured by a diagonal Gaussian probability density.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 31% of the total text.

      The ability to accurately annotate handwritten corpora is
crucial to any automatic handwriting recognition system where
parameters used during a given decoding phase are learned during the
associated training phase.  This includes handwriting recognition
systems based on template matching, such as [1], neural networks, e.g.,
[2], and hidden Markov models [3].  In [1], for example, annotated
corpora are required to generate a good-quality starter prototype
set; in [2], they are instrumental in the specification of the neural
network architecture; finally, in [3], they are necessary to train
the statistics of each Markov character model.

      To be able to annotate handwritten corpora, one must first
derive a suitable alphabet of lexemes, where a lexeme (also sometimes
referred to as an allograph) is defined as a variation around a given
character.  For example, there are many ways to write the lower-case
letter y: it can be written in one, two, or even three strokes; the
one-stroke y category can itself be subdivided into different classes
according to the starting point and formation of the letter.  So far
the standard practice for establishing such a lexeme alphabet has
been to collect and analyze, mostly by eye, wide-coverage writing
styles [4].  This, however, tends to introduce subjectivity into the
resulting allographs.

      The present article proposes a new strategy that is more amenable
to a fully automated implementation.  We follow a two-step process
consisting of (i) producing an objective, data-driven,
morphologically motivated alphabet of lexemes, and (ii) consistently
annotating handwriting data with this alphabet, i.e., using the
same measure as was employed to derive the alphabet itself.
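
      As a rough illustration of this two-step process, the sketch
below clusters the samples of a single character with K-means and
then annotates each sample with its closest lexeme under a diagonal
Gaussian density.  It assumes each sample has already been reduced to
a fixed-length feature vector.  The function names, the Euclidean
assignment step, the variance floor, and the toy data are all
illustrative assumptions, not details taken from the article.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Euclidean K-means; returns a cluster index for each point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        assign = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centers[j])))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def fit_lexemes(points, assign, k, var_floor=1e-3):
    """Fit one diagonal Gaussian (per-dimension mean and variance) per cluster."""
    lexemes = []
    for j in range(k):
        members = [p for p, a in zip(points, assign) if a == j]
        if not members:
            continue  # an empty cluster yields no lexeme
        cols = list(zip(*members))
        mean = [sum(col) / len(members) for col in cols]
        # var_floor (assumed value) keeps tiny clusters from collapsing to zero variance.
        var = [max(sum((x - m) ** 2 for x in col) / len(members), var_floor)
               for col, m in zip(cols, mean)]
        lexemes.append((mean, var))
    return lexemes

def log_density(x, mean, var):
    """Log of a diagonal Gaussian density evaluated at x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def annotate(x, lexemes):
    """Index of the lexeme whose diagonal Gaussian gives x the highest density."""
    return max(range(len(lexemes)), key=lambda j: log_density(x, *lexemes[j]))

# Toy data: two well-separated groups standing in for two ways of
# writing the same character (values are made up for illustration).
samples = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.0],
           [5.0, 5.1], [5.1, 4.9], [4.9, 5.0]]
assign = kmeans(samples, k=2)
lexemes = fit_lexemes(samples, assign, k=2)
labels = [annotate(s, lexemes) for s in samples]
```

      In this sketch the clusters are formed with Euclidean distance
and only the annotation uses the Gaussian density.  A variant closer
to requirement (ii) would also score candidates with `log_density`
inside the assignment step of `kmeans`.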

      To achieve (i), we first map each observation of each character
seen in the training data into a point in a suitable high-dimensional
space called the lexographic space.  Note that this space is only
marginally related to the chirographic space mentioned in [5].
Specifically, the chirographic space is populated by frame-level
feature vectors while the lexographic space contains only
character-level feature vectors.  As a result, the lexographic space
is more appropriate for finding the high-level variations
characterizing lexemes.
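
      The article does not spell out, in this abbreviated text, how a
character is mapped into the lexographic space.  One possible
construction, shown here purely as an assumption, is to resample the
variable-length sequence of frame-level vectors of a character to a
fixed number of frames and concatenate them into a single
character-level vector.  The name `to_character_vector` and the
parameter `n_frames` are hypothetical.

```python
def to_character_vector(frames, n_frames=4):
    """Resample a variable-length list of frame-level vectors to n_frames
    by linear interpolation, then concatenate them into one vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    last = len(frames) - 1
    out = []
    for i in range(n_frames):
        # Position of the i-th resampled frame on the original frame axis.
        pos = i * last / (n_frames - 1) if n_frames > 1 else 0.0
        lo = int(pos)
        hi = min(lo + 1, last)
        t = pos - lo
        out.extend((1 - t) * frames[lo][d] + t * frames[hi][d]
                   for d in range(dim))
    return out

# Characters written with different numbers of frames end up in the
# same fixed-dimensional space (here 4 frames x 2 features = 8 dimensions).
short_char = to_character_vector([[0.0, 0.0], [2.0, 4.0]])
long_char = to_character_vector([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0],
                                 [3.0, 0.0], [4.0, 1.0]])
```

      Whatever mapping is actually used, the point is that every
observation of a character becomes one point of fixed dimension, so
that distances between whole characters are well defined.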

      To achieve (ii), one performs the annotation in the same
lexographic space as defined in (i).  Since, for a given character,
lexemes represent a partition of the associated lexographic space
into regions which correspond to...