Method for Text Recognition of Printed Documents

IP.com Disclosure Number: IPCOM000049222D
Original Publication Date: 1982-May-01
Included in the Prior Art Database: 2005-Feb-09
Document File: 2 page(s) / 34K

Publishing Venue

IBM

Related People

Wong, KY: AUTHOR

Abstract

Recognition of text information in printed documents has been a difficult problem because printed characters have many font styles and sizes. A particular set of documents may contain 3 to 5 fonts of characters out of a commonly used set of, say, 100 fonts, which may include 15 styles and 6-7 sizes. Each font may contain about 80 characters. Traditional OCR methods try to determine features of each input unknown character and to match the unknown with all the characters in the font library. Feature extraction on each input unknown character as well as comparison with all possible characters in the font library uses significant computational resources.

This invention relates to a method for the recognition of text information in printed documents involving an intermediate step of pattern matching in order to reduce the computation load.
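As a rough illustration of the scale involved, the following sketch uses only the figures quoted above (a library of about 100 fonts, about 80 characters per font, and 3 to 5 fonts in a given set of documents). It compares the size of the full font library with the number of distinct character shapes a document set would contain. Treating that count as a bound on the number of prototypes is an assumption: it supposes one prototype per distinct character per font, whereas noisy scans could produce more. The variable names are hypothetical.

```python
# Figures taken from the text; all names are illustrative only.
FONTS_IN_LIBRARY = 100        # "a commonly used set of, say, 100 fonts"
CHARS_PER_FONT = 80           # "each font may contain about 80 characters"
FONTS_PER_DOCUMENT_SET = (3, 5)  # "3 to 5 fonts" in a particular document set

# Reference patterns a traditional matcher may have to consider per character.
library_size = FONTS_IN_LIBRARY * CHARS_PER_FONT

# Distinct character shapes in the document set (idealized prototype count,
# assuming one prototype per character per font and no noise-induced extras).
prototype_bounds = tuple(n * CHARS_PER_FONT for n in FONTS_PER_DOCUMENT_SET)

print(library_size)      # 8000
print(prototype_bounds)  # (240, 400)
```

Under these assumptions, an unknown character would be compared against at most a few hundred prototypes rather than several thousand library entries.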

Referring to the figure, a text line in a digitized document is first segmented into individual characters. The bit pattern of each unknown character is then matched against the stored reference (prototype) patterns. If there is a match, the new pattern is given the identification number of the matching prototype; otherwise, it is stored as a new prototype. Only a small number of prototypes need be compared with an unknown character, in contrast to the prior art, where an unknown character must be compared with the entire font library for a match. The pattern matching process continues unt...
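The match-or-store step described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the text does not specify the matching criterion, so the sketch assumes that two bit patterns match when they have the same dimensions and differ in at most `max_mismatch` pixels. The names `mismatch`, `assign_prototypes`, and `max_mismatch`, and the toy bitmaps, are hypothetical, and segmentation is assumed to have been done already.

```python
from typing import List, Optional, Sequence, Tuple

Bitmap = Tuple[str, ...]  # one string of '0'/'1' per pixel row


def mismatch(a: Bitmap, b: Bitmap) -> Optional[int]:
    """Count differing pixels, or return None if the shapes differ."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        return None
    return sum(p != q for x, y in zip(a, b) for p, q in zip(x, y))


def assign_prototypes(
    patterns: Sequence[Bitmap], max_mismatch: int = 1
) -> Tuple[List[int], List[Bitmap]]:
    """Give each pattern the ID of a matching prototype, or store it as new."""
    prototypes: List[Bitmap] = []
    ids: List[int] = []
    for pattern in patterns:
        for proto_id, proto in enumerate(prototypes):
            d = mismatch(pattern, proto)
            if d is not None and d <= max_mismatch:  # assumed match criterion
                ids.append(proto_id)
                break
        else:
            # No existing prototype matched: the pattern becomes a prototype.
            prototypes.append(pattern)
            ids.append(len(prototypes) - 1)
    return ids, prototypes


# Toy 3x4 bitmaps standing in for segmented characters.
A = ("010", "101", "111", "101")
A_NOISY = ("010", "101", "111", "100")  # differs from A in one pixel
B = ("110", "101", "110", "101")        # differs from A in two pixels

ids, prototypes = assign_prototypes([A, B, A_NOISY, A, B])
print(ids)              # [0, 1, 0, 0, 1]
print(len(prototypes))  # 2
```

In this toy run, five input characters collapse to two prototypes, so any later, more expensive recognition step would only need to be applied to the prototypes rather than to every character occurrence.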