Character Recognition Techniques for Forms Processing

IP.com Disclosure Number: IPCOM000127323D
Original Publication Date: 2005-Aug-23
Included in the Prior Art Database: 2005-Aug-23
Document File: 4 page(s) / 28K

Publishing Venue

IBM

Abstract

Disclosed is a program that offers a novel way of recognising characters in a scanned document. Unlike traditional methods, the program demonstrates a method for recognising international characters. The technique allows documents containing multiple languages to be recognised simultaneously, without requiring a separate recognition engine for each language.

This text was extracted from a PDF file.
At least one non-text object (such as an image or picture) has been suppressed.
This is the abbreviated version, containing approximately 25% of the total text.

A program is disclosed for recognising characters in a scanned document. A number of methods are already available for this task. The technique described in this disclosure is unique: although the process is available for use in other contexts, it has never been adopted for character recognition.

The OCR (Optical Character Recognition) process is described below. First, the document to be digitised is scanned. Scanned documents are generally stored as image files, either as TIFF (Tagged Image File Format) or as BMP (bitmap) files. Scanned images are not accurate; they contain noise and rough edges. In addition, if the document is not scanned properly, the scanned image will be skewed. These defects must be corrected before the process continues. The following diagram provides a simple representation of the character recognition process:

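[Figure: flowchart of the character recognition process. Recoverable steps, in order: Document to Scan; Pre-Process (convert grey scales, filter noise, skew correction, etc.); Identify characters, words and lines; Apply rules and extract data; Compare with stored data; Match found? (yes: store the character; no: go to next stage); Decrement stage; End of stage? (the labels on these last branches are truncated in the extracted text).]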
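The pre-processing stage named in the diagram can be illustrated with a minimal sketch. It is not the disclosed program: the function names, the fixed threshold of 128, and the "isolated pixel" noise rule are all assumptions chosen for illustration, and skew correction is left out.

```python
def to_binary(grey, threshold=128):
    """Convert a grey-scale image (rows of 0..255 ints) to 1 = ink, 0 = background.

    The threshold value is an assumption; real scans usually need an adaptive one.
    """
    return [[1 if px < threshold else 0 for px in row] for row in grey]


def remove_specks(binary):
    """Clear ink pixels that have no ink pixel among their 8 neighbours."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1:
                continue
            neighbours = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                out[y][x] = 0  # isolated pixel, treated as scanner noise
    return out


# Hypothetical 4x4 scan: one stray dark pixel and a two-pixel stroke.
grey = [
    [255, 255, 255, 255],
    [255,   0, 255, 255],
    [255, 255, 255, 255],
    [255, 255,  10,  20],
]
cleaned = remove_specks(to_binary(grey))
```

Here the stray pixel is removed while the two-pixel stroke survives, because each stroke pixel has an ink neighbour.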

Many architectures have been defined for the recognition of words. Feature extraction and comparison is the basis for the character recognition described here. It is a simple process of extracting the features of the bitmap pattern. These features are then compared with the stored features of a complete alphabet set. If at least 95% of the features match those of any stored character, the character is assumed to be recognised.
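The comparison step can be sketched as follows. This is only an illustration under assumptions: the feature names, the dictionary representation, and the helper names `match_ratio` and `recognise` are hypothetical, and only the 95% threshold comes from the text.

```python
def match_ratio(candidate, reference):
    """Fraction of the reference features whose values the candidate reproduces."""
    if not reference:
        return 0.0
    hits = sum(1 for name, value in reference.items() if candidate.get(name) == value)
    return hits / len(reference)


def recognise(candidate, stored, threshold=0.95):
    """Return the best-matching stored character, or None if below the threshold."""
    best_char, best_score = None, 0.0
    for char, reference in stored.items():
        score = match_ratio(candidate, reference)
        if score > best_score:
            best_char, best_score = char, score
    return best_char if best_score >= threshold else None


# Hypothetical stored feature set for three characters.
STORED = {
    "b": {"closed_loops": 1, "ascender": True, "descender": False, "vertical_lines": 1},
    "p": {"closed_loops": 1, "ascender": False, "descender": True, "vertical_lines": 1},
    "o": {"closed_loops": 1, "ascender": False, "descender": False, "vertical_lines": 0},
}

clean_b = {"closed_loops": 1, "ascender": True, "descender": False, "vertical_lines": 1}
noisy_b = {"closed_loops": 0, "ascender": True, "descender": False, "vertical_lines": 1}

result_clean = recognise(clean_b, STORED)   # all four features match "b"
result_noisy = recognise(noisy_b, STORED)   # best score is 3/4, below 0.95
```

With only four features per character, a 95% threshold amounts to an exact match; the threshold becomes meaningful only when each character carries many more features.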

There are basically three types of knowledge in character recognition: morphological, pragmatic and linguistic. Morphological knowledge refers to the shape of an ideal representation of a character: the number of vertical and horizontal lines, closed and open loops, the curves and contours that define the character, and the segments that join them. Pragmatic knowledge represents the spatial arrangement of characters within a word boundary, and is more language dependent. For instance, in English 'u' almost always follows 'q', and 'b' and 'f' seldom appear as a pair. Many such language-dependent pragmatic rules can be applied to augment the recognition process. The third type, linguistic knowledge, is the lexicon. Using a dictionary of predefined words, or even a full language dictionary, provides better accuracy on a read word. Other rules, such as grammar checking, can also be used to verify the correctness of a sentence.
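As a rough illustration of how pragmatic and lexical knowledge might augment recognition, the sketch below picks one reading of a word from several alternatives. The word list, the function names, and the idea of ranked alternative readings are assumptions made for this example; only the 'q' followed by 'u' rule comes from the text.

```python
import re

# Hypothetical lexicon; a real system would load a full word list.
LEXICON = {"quick", "quiet", "form"}


def violates_pragmatic_rules(word):
    """True if a lowercase word has a 'q' that is not followed by 'u'."""
    return re.search(r"q(?!u)", word) is not None


def pick_reading(readings):
    """Choose among alternative readings of one word, listed best guess first.

    Readings that break a pragmatic rule are dropped, then the first reading
    found in the lexicon wins. If none is in the lexicon, fall back to the
    first plausible reading, or to the first reading overall.
    """
    plausible = [w for w in readings if not violates_pragmatic_rules(w)]
    for word in plausible:
        if word in LEXICON:
            return word
    return plausible[0] if plausible else readings[0]


choice_rule = pick_reading(["qvick", "quick"])   # "qvick" breaks the q/u rule
choice_lexicon = pick_reading(["fonn", "form"])  # only "form" is in the lexicon
choice_fallback = pick_reading(["qat"])          # nothing plausible, keep the input
```

The fallback keeps the recogniser's original reading, since a strict rule would otherwise reject legitimate words that are exceptions to it.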

Feature extraction falls under morphological knowledge. It is done by applying a set of rules to the scanned image; the approach is derived from the Hidden Markov Model (HMM). Each pattern is identified by its number of closed loops, lines, curves, ascenders and descenders. The model uses the space between the ascenders or descenders to isolate the connecte...
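One of the morphological features listed above, the number of closed loops, can be measured with a simple flood fill. The sketch below is not the HMM-derived rule set of the disclosure; it is an assumed, simplified stand-in that counts background regions fully enclosed by ink in a binary bitmap, and the function name is hypothetical.

```python
def count_closed_loops(bitmap):
    """Count background regions fully enclosed by ink (1 = ink, 0 = background).

    A background region is a closed loop if flood-filling it with
    4-connectivity never reaches the edge of the bitmap.
    """
    h, w = len(bitmap), len(bitmap[0])
    seen = [[False] * w for _ in range(h)]

    def flood(start_y, start_x):
        """Mark one background region; return True if it touches the border."""
        stack = [(start_y, start_x)]
        seen[start_y][start_x] = True
        touches_border = False
        while stack:
            y, x = stack.pop()
            if y in (0, h - 1) or x in (0, w - 1):
                touches_border = True
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and bitmap[ny][nx] == 0 and not seen[ny][nx]:
                    seen[ny][nx] = True
                    stack.append((ny, nx))
        return touches_border

    loops = 0
    for y in range(h):
        for x in range(w):
            if bitmap[y][x] == 0 and not seen[y][x] and not flood(y, x):
                loops += 1
    return loops


# Hypothetical glyph bitmaps: a ring, a two-loop shape, and a plain stroke.
RING = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
TWO_LOOPS = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
STROKE = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
```

On these bitmaps the ring has one closed loop, the second shape has two, and the stroke has none, which is the kind of count that separates characters such as 'o', '8' and 'l'.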