Multiple source digitization system
Original Publication Date: 2009-Sep-22
Included in the Prior Art Database: 2009-Sep-22
This article describes a solution to a problem in OCR and digitization of texts: the same (or a similar) book is often available from multiple sources. The purpose of the innovation is to digitize such cases effectively, so that the digitization process is not repeated for the same texts and no information contained in the similar (albeit distinct) books is lost.
Large digitization efforts are under way today in libraries and archives around the world. These efforts scan books, newspapers, and similar material, run OCR on them, and create electronic representations of the content. Hence, the importance of OCR quality is growing. There are numerous techniques (and patents) aimed at digitization in general and the digitization of books in particular. However, state-of-the-art systems address the problem on a case-by-case basis. In other words, each book or document is digitized at the discretion of the individual library or book holder.
The idea behind this invention is to create a generic approach to the digitization problem. In this scenario, a variety of libraries cooperate with each other in a combined digitization effort.
As a result, for many documents there are multiple sources of the same (or a similar) document. The purpose of this invention is to create an effective solution for digitizing such cases, so that the digitization process is not repeated for the same texts and no information contained in the similar (albeit distinct) books is lost.
In the following detailed description, we will assume that the system deals with two representations of a book, although, of course, a similar approach can be extended to cases with three or more representations.
It should be noted that one or both of the book sources may be digitized (recognized), with or without manual data correction.
Hence, our invention calls for utilization of both the book image information and all the relevant metadata.
From the business point of view, both book owners may agree to share in the costs and benefits of such a process. However, such arrangements go beyond the technical scope of this disclosure.
As a first step, both book sources would be processed to bring them to a common denominator. One option would be to perform layout analysis and Optical Character Recognition (OCR) on both sources. (Note that in some cases one of the sources may already be digitized, in which case only the second source would be preprocessed.) This first step envisions automatic processing only, avoiding costly manual post-processing to begin with. The system would then identify the degree of similarity between the two book representations and perform an appropriate fusion of both information sources.
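As an illustration of how the degree of similarity between the two OCR outputs might be estimated, the following minimal Python sketch compares the recognized word sequences using the standard library's difflib. It is not the method of this disclosure, only one possible measure. The function names, the punctuation handling, and the thresholds 0.98 and 0.5 are hypothetical choices made for the example.

```python
from difflib import SequenceMatcher

# Punctuation stripped from the edges of each word (an assumption for this sketch).
_PUNCT = ".,;:!?\"'()[]"


def normalize(text: str) -> list[str]:
    """Lowercase the OCR text and split it into words without edge punctuation."""
    words = (w.strip(_PUNCT).lower() for w in text.split())
    return [w for w in words if w]


def similarity(ocr_a: str, ocr_b: str) -> float:
    """Return a word-level similarity ratio in [0, 1] for two OCR outputs."""
    matcher = SequenceMatcher(None, normalize(ocr_a), normalize(ocr_b), autojunk=False)
    return matcher.ratio()


def classify(score: float, same: float = 0.98, related: float = 0.5) -> str:
    """Map a similarity score to a coarse label (thresholds are hypothetical)."""
    if score >= same:
        return "identical"
    if score >= related:
        return "similar"
    return "different"


if __name__ == "__main__":
    a = "Call me Ishmael. Some years ago, never mind how long precisely."
    b = "Call me Ishmael. Some years ago, never mind how long exactly."
    score = similarity(a, b)
    print(round(score, 3), classify(score))
```

A word-level ratio tolerates isolated OCR errors better than an exact comparison would, but in practice the thresholds would need tuning against real scans.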
we need to handle two different copies of the same book or when we need to handle a photographic reissue of the same book).
The system would verify the photographic correspondence of both sources. The strongest (and best) approach would be to perform two-dimensional matching of all the corresponding word images. For the purposes of speed, it may be sufficient to compare some subset of the words on the page, or even a chosen subset of characters (recognized by OCR with high confidence), making sure that both the recognition values and the relative locations correspond to each other.
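The faster check on high-confidence characters could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the implementation of this disclosure: the Glyph record, the pages_correspond function, and the parameters min_conf, tol, and min_fraction are all hypothetical, and character positions are assumed to be normalized to the page size (0 to 1) with both scans already aligned. It does not perform the full two-dimensional image matching described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One recognized character (hypothetical record for this sketch)."""
    char: str          # recognition value reported by OCR
    x: float           # horizontal position, normalized to page width (0..1)
    y: float           # vertical position, normalized to page height (0..1)
    confidence: float  # OCR confidence in [0, 1]


def pages_correspond(page_a, page_b, min_conf=0.9, tol=0.02, min_fraction=0.9):
    """Check whether high-confidence characters of page_a reappear in page_b.

    A character counts as matched when page_b holds a high-confidence glyph
    with the same recognition value within `tol` of the same location.
    All three thresholds are assumed values, not taken from the disclosure.
    """
    anchors = [g for g in page_a if g.confidence >= min_conf]
    candidates = [g for g in page_b if g.confidence >= min_conf]
    if not anchors:
        return False  # nothing reliable to compare
    matched = sum(
        1
        for a in anchors
        if any(
            b.char == a.char and abs(b.x - a.x) <= tol and abs(b.y - a.y) <= tol
            for b in candidates
        )
    )
    return matched / len(anchors) >= min_fraction


if __name__ == "__main__":
    scan_1 = [Glyph("a", 0.10, 0.10, 0.95), Glyph("b", 0.50, 0.50, 0.99)]
    scan_2 = [Glyph("a", 0.105, 0.10, 0.97), Glyph("b", 0.50, 0.505, 0.93)]
    print(pages_correspond(scan_1, scan_2))
```

Restricting the comparison to high-confidence characters keeps the check cheap and avoids letting OCR errors decide the outcome. A page pair that fails this check could still be passed to the slower word-image matching.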