Automatic Character Image Sanitizing Method for Optical Character Recognition Data Setup

IP.com Disclosure Number: IPCOM000114123D
Original Publication Date: 1994-Nov-01
Included in the Prior Art Database: 2005-Mar-27
Document File: 2 page(s) / 36K

Publishing Venue

IBM

Related People

Mano, T: AUTHOR [+2]

Abstract

Disclosed is a method by which character images that are unsuitable for an Optical Character Recognition (OCR) recognition library, such as the dirty and deformed images in the Figure, are removed automatically from a very large set of character images.

      Disclosed is a method by which character images that are
unsuitable for an Optical Character Recognition (OCR) recognition
library, such as the dirty and deformed images in the Figure, are
removed automatically from a very large set of character images.

      The method consists of two processes.  The first process
removes abnormally segmented character images, such as the first and
second characters in the Figure.  If the number of black pixels in
the edge area on one side is smaller than a threshold and the number
of black pixels in the edge area on the opposite side is larger than
a second threshold, the sample is assumed to be mis-segmented and is
removed from the set.  The comparison is made between the top and
bottom edge areas, and between the left and right edge areas.
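
      The edge comparison can be sketched roughly as follows.  This
is an illustration only, not the disclosed implementation: the
function names, the width of the edge area (margin), and the two
threshold values (low, high) are hypothetical, and each image is
assumed to be a list of rows in which 1 marks a black pixel.

```python
def edge_black_counts(image, margin):
    """Count black pixels (value 1) in the four edge bands of a binary image."""
    height = len(image)
    width = len(image[0])
    top = sum(sum(row) for row in image[:margin])
    bottom = sum(sum(row) for row in image[height - margin:])
    left = sum(sum(row[:margin]) for row in image)
    right = sum(sum(row[width - margin:]) for row in image)
    return top, bottom, left, right


def is_mis_segmented(image, margin=1, low=1, high=4):
    """Return True when one edge band is nearly empty and the opposite one is heavily inked.

    margin, low, and high are assumed values chosen for the toy images below;
    the disclosure does not state the thresholds it uses.
    """
    top, bottom, left, right = edge_black_counts(image, margin)
    for side_a, side_b in ((top, bottom), (left, right)):
        if (side_a < low and side_b > high) or (side_b < low and side_a > high):
            return True
    return False


# A character centered in its box: all four edge bands are empty.
centered = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

# A character cut off at the bottom: empty top band, full bottom band.
clipped = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]

samples = [centered, clipped]
kept = [img for img in samples if not is_mis_segmented(img)]
```

      In this sketch only the centered sample stays in the set.  In
practice the margin and both thresholds would have to be tuned to the
image size and the scanning resolution.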

      After the mis-segmented character images are removed, the
remaining character images are grouped into several groups using a
conventional clustering method.  Mis-written and noisy character
images end up isolated in groups with a very small number of members.
The members of these very small groups are then removed from the data
set.
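
      The second process can be sketched in the same spirit.  The
disclosure names only "a conventional clustering method", so the
simple leader clustering on Hamming distance used here is a stand-in,
not the authors' choice.  The names radius and min_size, and the
representation of each image as a flat tuple of 0/1 pixels, are
likewise assumptions made for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length pixel vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def leader_cluster(vectors, radius):
    """Assign each vector to the first cluster whose leader lies within radius.

    This is a stand-in for the unspecified "conventional clustering method".
    """
    leaders = []
    labels = []
    for vec in vectors:
        for idx, leader in enumerate(leaders):
            if hamming(vec, leader) <= radius:
                labels.append(idx)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            leaders.append(vec)
            labels.append(len(leaders) - 1)
    return labels


def drop_small_clusters(vectors, radius, min_size):
    """Keep only the vectors whose cluster has at least min_size members."""
    labels = leader_cluster(vectors, radius)
    sizes = Counter(labels)
    return [vec for vec, label in zip(vectors, labels) if sizes[label] >= min_size]


# Three similar samples and one outlier (hypothetical 4-pixel images).
samples = [
    (1, 1, 0, 0),
    (1, 1, 0, 1),
    (1, 0, 0, 0),
    (0, 0, 1, 1),
]

labels = leader_cluster(samples, radius=1)
cleaned = drop_small_clusters(samples, radius=1, min_size=2)
```

      Here the outlier forms a one-member group and is dropped, while
the three similar samples survive.  For a real library the clustering
would normally be run separately for each character category, with
the minimum group size chosen relative to the number of samples.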