Method to get text information from images in a web page

IP.com Disclosure Number: IPCOM000015713D
Original Publication Date: 2002-Jul-01
Included in the Prior Art Database: 2003-Jun-21
Document File: 2 page(s) / 66K

Publishing Venue

IBM

Abstract

Disclosed here is a new method to get text information from images in a web page.

This is the abbreviated version, containing approximately 53% of the total text.

Machine translation software and machine reading software can handle text information in a web page after obtaining it directly from the HTML file or through a system API. However, they cannot handle text that is included as part of an image in a web page. For images, they can use only the ALT attribute of each image tag or the hyperlink references.
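To make the limitation concrete, the following is a minimal sketch of what software restricted to the HTML source can recover for an image: the ALT attribute and the hyperlink reference that encloses the image. It uses Python's standard `html.parser` module; the names `ImageTextCollector` and `collect_image_text` are hypothetical and are not part of the disclosure.

```python
from html.parser import HTMLParser


class ImageTextCollector(HTMLParser):
    """Collect the ALT text of each image and the href of the link enclosing it."""

    def __init__(self):
        super().__init__()
        self._href_stack = []  # hrefs of the <a> elements currently open
        self.images = []       # one dict per <img>: src, alt, href

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._href_stack.append(attrs.get("href"))
        elif tag == "img":
            self.images.append({
                "src": attrs.get("src"),
                # A missing or valueless ALT attribute yields an empty string.
                "alt": attrs.get("alt") or "",
                "href": self._href_stack[-1] if self._href_stack else None,
            })

    def handle_endtag(self, tag):
        if tag == "a" and self._href_stack:
            self._href_stack.pop()


def collect_image_text(html):
    """Return the list of image records found in an HTML string."""
    parser = ImageTextCollector()
    parser.feed(html)
    parser.close()
    return parser.images
```

An image without an ALT attribute and without an enclosing link yields no usable text at all in this sketch, which is the gap the disclosed method addresses.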

This method enables a user to obtain text information from image data in a web page. It can be used as a pre-process for machine translation software and machine reading software. To get the text information, this method uses an Optical Character Reader (OCR) and OCR dictionaries.

This method updates the OCR dictionaries dynamically with the information in the web page. There are two types of OCR dictionary: the "OCR Page Dictionary" and the "OCR Image Dictionary". The OCR Page Dictionary consists of page information and is made for each page. The OCR Image Dictionary consists of image information and related tag information, and is made for each image.
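One way to picture the two dictionary types is as a single per-page container that holds one page dictionary and one dictionary per image. The sketch below is only an illustration under stated assumptions: the class names, the choice to key image dictionaries by image source, the rule that a repeated word keeps its larger weight, and the `lookup` rule that consults both dictionaries are all hypothetical and are not specified in the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class OcrDictionary:
    """A set of registered words, each with a weight."""
    words: Dict[str, float] = field(default_factory=dict)

    def register(self, word: str, weight: float) -> None:
        # Assumption: if a word is registered more than once, keep the larger weight.
        key = word.lower()
        self.words[key] = max(weight, self.words.get(key, weight))


@dataclass
class PageOcrDictionaries:
    """One OCR Page Dictionary plus one OCR Image Dictionary per image."""
    page: OcrDictionary = field(default_factory=OcrDictionary)
    images: Dict[str, OcrDictionary] = field(default_factory=dict)

    def for_image(self, image_src: str) -> OcrDictionary:
        # Create the image's dictionary the first time it is requested.
        return self.images.setdefault(image_src, OcrDictionary())

    def lookup(self, image_src: str, word: str) -> float:
        """Return the larger of the image-level and page-level weights (0.0 if absent)."""
        key = word.lower()
        image_dict = self.images.get(image_src, OcrDictionary())
        return max(image_dict.words.get(key, 0.0), self.page.words.get(key, 0.0))
```

Because the dictionaries are built per page and per image, a new `PageOcrDictionaries` instance would be created each time a page is loaded, which matches the "dynamic" update described above.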

In the dictionaries, every registered word has its own "weight". The following is a sample list of data types, sorted by weight, together with the dictionary in which each is registered.

1. ALT attribute of an image in this web page: OCR Image Dictionary
2. TITLE attribute and KEYWORD attribute in the web page that is referred to by the hyperlink of the image: OCR Image Dictionary
3. Text in the web page that is referred to by the hyperlink of the image: OCR Image Dictionary
4. TITLE attribute and KEYWORD attribute...
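
The weighting idea can be sketched as follows, using only the three complete entries of the list. Everything here is an assumption made for illustration: the list is read as running from highest weight to lowest, the numeric weights are invented, and the way a weight influences the OCR result (choosing the candidate word with the highest registered weight) is not described in the available text. The names `SOURCE_WEIGHTS`, `build_image_dictionary`, and `pick_candidate` are hypothetical.

```python
from typing import Dict, Iterable, Optional

# Hypothetical weights for the three complete list entries.
# Assumption: the list is sorted from highest weight to lowest.
SOURCE_WEIGHTS = {
    "image_alt": 3.0,                  # entry 1: ALT attribute of the image
    "linked_page_title_keyword": 2.0,  # entry 2: TITLE/KEYWORD of the linked page
    "linked_page_text": 1.0,           # entry 3: text of the linked page
}


def build_image_dictionary(sources: Dict[str, str]) -> Dict[str, float]:
    """Register every whitespace-separated word with the weight of its source."""
    dictionary: Dict[str, float] = {}
    for source, text in sources.items():
        weight = SOURCE_WEIGHTS[source]
        for word in text.lower().split():
            # A word seen in several sources keeps its largest weight.
            dictionary[word] = max(weight, dictionary.get(word, weight))
    return dictionary


def pick_candidate(candidates: Iterable[str],
                   dictionary: Dict[str, float]) -> Optional[str]:
    """Return the candidate with the highest registered weight, or None."""
    best, best_weight = None, 0.0
    for candidate in candidates:
        weight = dictionary.get(candidate.lower(), 0.0)
        if weight > best_weight:
            best, best_weight = candidate, weight
    return best
```

In this sketch an OCR engine that is unsure between several readings of a word would pass them as `candidates`; a reading that also appears in the image's ALT text would win over one that appears only in the linked page's body text.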