
Automatic Text Extraction from Images and Video for Content-Based Annotation, Search, and Retrieval Disclosure Number: IPCOM000010256D
Original Publication Date: 2002-Nov-13
Included in the Prior Art Database: 2002-Nov-13
Document File: 4 page(s) / 94K

Text embedded or superimposed within images and video frames is very useful for describing the semantic content of the frames, as it enables keyword and free-text search, automatic video logging, and video cataloging. Extracting text directly from video data becomes especially important when closed captioning or speech recognition is not available to generate textual transcripts of the audio, or when video footage that completely lacks audio must be automatically annotated and searched based on frame content. Toward building a video query system, we have developed a scheme for automatically extracting text from digital images and videos for content annotation and retrieval. In this paper, we present our approach to robust text extraction, which can handle complex backgrounds in video frames and deal with different font sizes, font styles, and font appearances such as normal and inverse video. Our algorithm produces segmented characters from video frames that can be processed directly by an OCR (optical character recognition) system to produce ASCII text. Results from our experiments with over 5,000 video frames demonstrate the good performance of our system in terms of text identification accuracy and computational efficiency.



  Automatic Text Extraction from Images and Video for Content-Based Annotation, Search, and Retrieval

A program is disclosed that will automatically extract text embedded in digital images and video frames. "Image" and "frame" are used interchangeably in this paper.

Text Extraction from Video

Text extraction and recognition consists of obtaining an image (scanned from a document or decoded from an MPEG video clip), segmenting the image and extracting the regions containing only text (sometimes referred to as text location), analyzing the text regions into blocks, lines, words, and characters, and finally recognizing the characters using OCR systems to output the text strings contained in the image. Text can appear anywhere in a video frame and in different contexts, either as scene text or as superimposed text. Text that appears as part of the scene and is recorded with it is referred to as scene text; examples include street and shop signs or lettering on a person's clothing. Scene text is difficult to extract reliably because of the unconstrained nature of its appearance. Superimposed text, on the other hand, is intended to carry and stress important information in the video; it is typically generated by video title machines or graphical font generators in studios. Our system is designed to extract superimposed text as well as scene text that possesses typical text attributes. We do not assume any prior knowledge about frame resolution, text location, or font styles. Our algorithm exploits several common characteristics of text: monochromaticity of individual characters, size restrictions (characters cannot be so small as to be unreadable by humans, nor so big that they occupy a large portion of the frame), and the horizontal alignment of text (preferred for ease of human reading).
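As an illustration of how the size-restriction and horizontal-alignment heuristics above might be applied, the following sketch filters candidate character boxes and groups them into horizontal text lines. All function names and thresholds here are hypothetical choices for illustration, not part of the disclosed algorithm.

```python
def plausible_text_region(width, height, frame_w, frame_h):
    """Size restriction: reject boxes too small to be read by humans
    or too large relative to the frame. Thresholds are illustrative."""
    if height < 8 or height > 0.3 * frame_h:
        return False
    if width > 0.9 * frame_w:
        return False
    if width / height < 0.2:  # characters are not extremely tall slivers
        return False
    return True

def group_into_lines(boxes, y_tolerance=5):
    """Horizontal alignment: group character boxes (x, y, w, h) whose
    vertical centers agree within y_tolerance pixels into text lines."""
    lines = []
    for x, y, w, h in sorted(boxes, key=lambda b: (b[1], b[0])):
        cy = y + h / 2
        for line in lines:
            if abs(line["cy"] - cy) <= y_tolerance:
                line["boxes"].append((x, y, w, h))
                break
        else:
            lines.append({"cy": cy, "boxes": [(x, y, w, h)]})
    return lines
```

In practice such geometric filters would be combined with the monochromaticity test on the pixels inside each box before the regions are passed on to character segmentation.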

Locating Text and Extracting Characters: Our Approach

The input to our system is a sequence of gray level images obtained by decompressing MPEG-encoded video sequences. The primary goals of our system are (i) isolating regions that may contain text characters in an image from other image content, (ii) separating each character region from its surroundings, and (iii) verifying the presence of text by consistency analysis.
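The three goals can be read as stages of a pipeline. The skeleton below is a hypothetical sketch, not the disclosed implementation: stages (i) and (ii) are stubs, and the consistency analysis of stage (iii) is illustrated with one plausible test (characters on a line should have similar heights).

```python
def locate_candidate_regions(frame):
    """(i) Isolate regions that may contain text characters (stub)."""
    return []

def segment_characters(frame, regions):
    """(ii) Separate each character region from its surroundings (stub).
    Assumed to return (character_mask, height) pairs."""
    return []

def consistent_heights(char_heights, tol=0.3):
    """(iii) One plausible consistency test: character heights on a text
    line should agree within a relative tolerance (value is hypothetical)."""
    if not char_heights:
        return False
    mean = sum(char_heights) / len(char_heights)
    return all(abs(h - mean) <= tol * mean for h in char_heights)

def process_frame(frame):
    regions = locate_candidate_regions(frame)       # goal (i)
    chars = segment_characters(frame, regions)      # goal (ii)
    heights = [h for _, h in chars]
    return chars if consistent_heights(heights) else []  # goal (iii)
```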

A New Generalized Region Labeling (GRL) Algorithm

A basic process used repeatedly in our system for text extraction is that of labeling the pixels in an image based on a given criterion (e.g., gray-scale homogeneity) using contour traversal, thus partitioning the image into multiple regions; grouping the pixels belonging to each region by determining its interior and boundaries; and extracting region features such as the MBR (minimum bounding rectangle), area, mean gray level, etc. We have developed a fast and efficient algorithm that uses chain codes to perform these tasks collectively. This new generalized region labeling (GRL) algorithm works on all types of images. It is fast, avoids recursion and exte...
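To make the region-labeling idea concrete, here is a simplified stand-in that partitions a gray-level image into homogeneous regions and extracts each region's MBR, area, and mean gray level. For brevity it uses queue-based flood fill rather than the chain-code contour traversal of the GRL algorithm, and its homogeneity criterion (gray levels within `tol` of the region's seed pixel) is an assumption for illustration.

```python
from collections import deque

def label_regions(img, tol=10):
    """img: 2-D list of gray levels (0-255). Returns one feature dict
    per 4-connected homogeneous region (illustrative stand-in for GRL)."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    regions = []
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy][sx]:
                continue  # pixel already belongs to a labeled region
            next_label += 1
            seed = img[sy][sx]
            labels[sy][sx] = next_label
            q = deque([(sy, sx)])
            xs, ys, total = [], [], 0
            while q:
                y, x = q.popleft()
                xs.append(x)
                ys.append(y)
                total += img[y][x]
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w and not labels[ny][nx]
                            and abs(img[ny][nx] - seed) <= tol):
                        labels[ny][nx] = next_label
                        q.append((ny, nx))
            regions.append({
                "mbr": (min(xs), min(ys), max(xs), max(ys)),  # bounding rectangle
                "area": len(xs),
                "mean": total / len(xs),
            })
    return regions
```

A contour-traversal formulation, as in the GRL algorithm proper, would additionally recover each region's boundary chain code in the same pass, which is what makes the disclosed approach efficient.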