A System For Detecting, & Extracting Data From, Tables In Text Based Documents

IP.com Disclosure Number: IPCOM000219454D
Publication Date: 2012-Jul-02
Document File: 2 page(s) / 57K

Publishing Venue

The IP.com Prior Art Database

Abstract

This article describes a method for extracting textual information from tabular data held in a range of proprietary formats. Extracting tabular data from proprietary document formats (e.g. MS Word, PDF, etc.) is challenging because there are no common standards. As a result, extraction filters often discard valuable metadata relating to the columns, rows and general layout of the table. This article describes the use of scanning technology to capture this layout data and enable the accurate extraction of tabular data.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.

Identifying tabular data in free text is a major challenge for a text analytics system.

    The underlying problem is that text analytics systems extract free text from a whole range of different document formats (plain text, Word, PDF) and then create a sequence of bytes that is analysed by the text analytics components. The extraction process often removes critical data such as tabulation or any proprietary tags held in the source document type (e.g. PDF). As a result, the text analytics system needs to identify columns, rows, column headings and row headings without this critical tabulation data.
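
    The loss described above can be illustrated with a minimal sketch. It is not taken from the disclosure: the sample rows, the tab-delimited source layout and the function names (naive_extract, split_with_layout, split_without_layout) are all assumptions made purely for illustration. The sketch mimics a lossy extraction filter by collapsing every run of whitespace into a single space, then shows that cell boundaries can no longer be recovered by simple splitting.

```python
import re
from typing import List

# Hypothetical source table: cells are separated by tab characters.
SOURCE_ROWS = [
    "Name\tCity\tSales",
    "Ann Lee\tNew York\t120",
    "Bob\tLeeds\t95",
]


def naive_extract(row: str) -> str:
    """Mimic a lossy extraction filter: each run of whitespace becomes one space."""
    return re.sub(r"\s+", " ", row).strip()


def split_with_layout(row: str) -> List[str]:
    """With the tab delimiters intact, cell boundaries are unambiguous."""
    return row.split("\t")


def split_without_layout(row: str) -> List[str]:
    """After extraction only single spaces remain, so splitting yields words, not cells."""
    return naive_extract(row).split(" ")


if __name__ == "__main__":
    for row in SOURCE_ROWS:
        print(split_with_layout(row), "->", split_without_layout(row))
```

    In this sketch the row containing "Ann Lee" and "New York" has three cells in the source but five tokens after extraction, so the analytics component cannot tell which words belong to which column without further evidence about the layout.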

    The number of different document formats in use inhibits the adoption of standards to rectify this problem. Even when standards eventually emerge, there will be many years' worth of legacy documents that were not produced in accordance with the standards.

    At present, attempts have been made to develop individual rule systems, each capable of processing a single document type. This may be feasible on a case-by-case basis; however, it is not practical for all document types and all sets of content: quite simply, too many different sets of rules would be required.

    There is therefore a requirement for a generic system capable of analysing any document type to identify tables and extract columns, rows, column headings and row headings for processing by a text analytics system.

    A generic solution to...