Hidden Markov model structure determination for on-line handwritten characters
Original Publication Date: 2002-Aug-24
Included in the Prior Art Database: 2003-Jun-19
Disclosed is a method for on-line handwritten character recognition. In the field, left-to-right hidden Markov model (HMM) is used. The HMM cannot provide good accuracy because handwritten characters include stroke order permutations. There is more than one kind of stroke order for a character, especially for a Chinese character. Writing style variation is called allograph. The examples are shown in Figure 2. Allograph can be properly handled when a stroke order is provided with a HMM whose structure corresponds to the stroke order. This disclosure proposes a HMM structure determination method for handling stroke order permutations. 1. Outline of the procedure Figure 1 shows the procedure. The first step (allograph categorization by stroke clustering) is for generating prototype stokes. The second step (stroke matching) is for finding prototype stroke sequences. The third step (HMM training without stroke number variations) is for training HMM using character examples. Before training HMMs, their structures are determined based on the prototype stroke sequences. The forth step (allograph categorization by HMMs) is for finding correspondence between a HMM and a character example. The fifth step (HMM training with stroke number variations) is for training HMM using character examples. Techniques employed in the steps are already known [1-4], but the effectiveness of the combination of the techniques was not known. Another uniqueness of the proposed method is HMM structure is determined by using clean character examples without style variations. Clustering provides these clean examples. After the structure is determined, HMM parameters are estimated based on the all examples. 2. Allograph categorization by stroke clustering Prototype strokes are obtained by clustering stroke examples. A set of stroke examples S(C,N,n) is made by the n-th stroke of category C that has N strokes, where N is the most frequent stroke number. Note that the value N depends on category C. The set S(C,N,n) is divided into clusters. A centroid of a cluster is a prototype stroke. A prototype stroke sequence is obtained for each character example. A sequence is an allograph. For each category C that has N strokes, the most frequent prototype stroke sequence is chosen to obtain the N prototype strokes. Other prototype strokes obtained in this section are not used hereafter. Character examples with N strokes are examples without stroke number variations. They also are used in the following steps. 3. Stroke matching For each category C that has N strokes, where N is the most frequent stroke number, character examples are compared with the most frequent prototype stroke sequence. By comparing stroke examples with prototype strokes, a prototype stroke sequence is obtained for each character example .