Arabic Reading Machine

IP.com Disclosure Number: IPCOM000101550D
Original Publication Date: 1990-Aug-01
Included in the Prior Art Database: 2005-Mar-16
Document File: 6 page(s) / 199K

Publishing Venue

IBM

Related People

Abdelazim, HY: AUTHOR

Abstract

This article describes a system for the automatic recognition of typewritten Arabic text. It incorporates existing techniques for Optical Character Recognition (OCR), text recognition, and the handling of Arabic cursive typewritten text to achieve practical automatic processing of large volumes of Arabic/Latin data.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 36% of the total text.

      The state of the art in Latin OCR has advanced from the
recognition of specified machine-printed fonts to the application of
sophisticated techniques for the recognition of handprinted
characters.  Arabic text recognition, by contrast, is still at an
early stage of research and experimentation (1, 2).  From an OCR
standpoint, the main difference between Arabic and Latin writing is
the cursiveness of Arabic.  The most common form of Latin writing is
a simple concatenation of isolated characters into words and
sentences.  Arabic typewriting, however, is essentially cursive, and
concatenating isolated characters is not an acceptable way of writing
Arabic.  Fig. 1 shows the form of an Arabic word versus a Latin word.

      The cursive nature of Arabic writing is the main technical
challenge in Arabic OCR.  Segmentation is thus a necessary and
crucial step in Arabic text recognition, and is the main issue
addressed in this disclosure.

      Several books (e.g., 3, 4), special issues and reports
(e.g., 5), and extensive bibliographies (e.g., 6, 7) published or
compiled on this subject are listed in the references.

      An Arabic Text Recognition System was developed based on a
combined statistical/structural approach.  Fig. 2 shows a block
diagram representation of the system.  Dark blocks in the diagram
represent the new contribution in this field, whereas other blocks
represent selected techniques and methodologies from the literature,
suited for Arabic OCR.

      A "PAW" (8, 9), or Piece of Arabic Word, is the result of
applying vertical windowing to an Arabic word.  A PAW may be a single
character, several characters, or a whole word.  Fig. 3 shows a
typical Arabic word consisting of four PAWs.  In Latin typewritten
text, however, the following relation holds:
           Character <=================> PAW
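
      The PAW extraction step can be illustrated with a minimal
sketch.  It assumes that "vertical windowing" means cutting the word
image wherever a whole pixel column is white; the disclosure does not
spell out the procedure, so this reading, the function name
split_into_paws, and the toy image are illustrative assumptions only.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels: 0 = white, 1 = black


def split_into_paws(image: Image) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, one per ink run.

    Assumption: a PAW boundary is any pixel column with no black pixel.
    """
    if not image:
        return []
    width = len(image[0])
    # A column "has ink" if any row holds a black pixel in it.
    has_ink = [any(row[x] for row in image) for x in range(width)]

    paws = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                  # a new ink run begins
        elif not ink and start is not None:
            paws.append((start, x))    # the run ended at a white column
            start = None
    if start is not None:
        paws.append((start, width))    # run reaches the right edge
    return paws


# Toy word image with three separate ink runs.
word = [
    [0, 1, 1, 0, 0, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0, 1, 1, 1],
]
paws = split_into_paws(word)
print(paws)
```

      The ranges come back in left-to-right image order; because
Arabic is read right to left, a recognizer would presumably consume
them in reverse.  Under this reading, a typewritten Latin word yields
one range per character, which matches the relation above.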

      Segmentation of a PAW into characters is the key to this
disclosure.  The algorithm adopted is derived from the way characters
are connected in Arabic words.  The weld between Arabic characters in
a PAW is characterized by the presence of a neck (N-N), which is a
lump of black pixels (Fig. 4).  The size of this neck (N-N) is taken
as the threshold value for segmentation.  An energy-like curve based
on the black-to-black pixel height is drawn.  Thresholding this curve
at the neck value (N-N) yields significant (above the threshold) and
insignificant (below the threshold) primitives.
Generally, the primitive is not necessarily a character, due to the
presence of (internal-silence) zo...
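
      The neck-threshold step described above can be sketched as
follows.  This is only one plausible reading, not the disclosed
algorithm: it assumes the "black-to-black pixel height" of a column
is the distance from its topmost to its bottommost black pixel, and
that "above the threshold" is a strict comparison.  The names
column_heights and primitives and the toy image are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels: 0 = white, 1 = black


def column_heights(image: Image) -> List[int]:
    """Per-column ink height: topmost to bottommost black pixel,
    inclusive, or 0 for an empty column (assumed definition)."""
    if not image:
        return []
    heights = []
    for x in range(len(image[0])):
        rows = [y for y, row in enumerate(image) if row[x]]
        heights.append(rows[-1] - rows[0] + 1 if rows else 0)
    return heights


def primitives(heights: List[int], neck: int) -> List[Tuple[int, int, bool]]:
    """Group adjacent columns into runs (start, end_exclusive, significant).

    A column is treated as significant when its height exceeds the
    neck size (assumption: strict inequality).
    """
    runs = []
    x = 0
    for significant, group in groupby(heights, key=lambda h: h > neck):
        n = len(list(group))
        runs.append((x, x + n, significant))
        x += n
    return runs


# Toy PAW: two tall strokes joined by a neck one pixel thick.
paw = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1],
]
curve = column_heights(paw)
for start, end, significant in primitives(curve, neck=1):
    print(start, end, "significant" if significant else "insignificant")
```

      In this toy case the two tall strokes come out as significant
primitives and the connecting neck as an insignificant one.  As the
text notes, a primitive is not necessarily a whole character, so a
later stage would still have to decide how primitives combine.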