
Automatic Transformation of Unstructured and Semi-structured Document Tables to Structured Relational Database Format

IP.com Disclosure Number: IPCOM000245463D
Publication Date: 2016-Mar-11
Document File: 7 page(s) / 135K

Publishing Venue

The IP.com Prior Art Database

Abstract

An integrated, automated system that de-normalizes unstructured and semi-structured document tables into a single-dimension relational database format, using the section title, caption, header, footer, cell values, and other information and metadata from the identified and extracted tables. The system joins or appends the transformed, de-normalized tables from multiple similar documents based on metadata and context, and stores and indexes the transformed tables, along with metadata of the original document or set of documents, in a database for further use. All of this happens without human intervention, reducing effort, time, and cost.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 40% of the total text.

Introduction
There is a need for a system that can transform all kinds of document tables (e.g., in PDF, DOC, or HTML files) into a structured format for use in analytics, information retrieval, question answering (e.g., IBM Watson), comparison, calculation, and similar tasks. Document tables usually follow different patterns, depending on the position of the caption, header, footer, cells, and other context embedded in the text of the document. Extracting, relating, and representing all of this information in a systematic and automated way is a challenge. The proposed system performs table extraction, pattern-based classification, and transformation of document tables into relational (RDBMS) tables.
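As an illustration of the table extraction step only, the following is a minimal sketch for the HTML case, using Python's standard `html.parser` module. The class `TableExtractor`, the function `extract_tables`, and the sample markup are hypothetical names and data introduced here; the disclosure does not specify how extraction is implemented, and this sketch ignores nested tables, spanning cells, PDF and DOC inputs.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as a dict holding its caption and rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []      # finished tables
        self._table = None    # table currently being parsed
        self._row = None      # row currently being parsed
        self._text = None     # text buffer for the current cell or caption

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = {"caption": "", "rows": []}
        elif self._table is None:
            return  # ignore markup outside tables
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th", "caption"):
            self._text = []

    def handle_data(self, data):
        # Only keep text that belongs to a cell or a caption.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._table is None:
            return
        if tag == "caption" and self._text is not None:
            self._table["caption"] = "".join(self._text).strip()
            self._text = None
        elif tag in ("td", "th") and self._row is not None and self._text is not None:
            self._row.append("".join(self._text).strip())
            self._text = None
        elif tag == "tr" and self._row is not None:
            self._table["rows"].append(self._row)
            self._row = None
        elif tag == "table":
            self.tables.append(self._table)
            self._table = None


def extract_tables(html):
    """Return every table found in the HTML string (assumed helper)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up sample document used only for illustration.
SAMPLE_HTML = """
<p>Quarterly results are shown below.</p>
<table>
  <caption>Revenue by region (USD million)</caption>
  <tr><th>Region</th><th>Q1</th><th>Q2</th></tr>
  <tr><td>North</td><td>10</td><td>12</td></tr>
  <tr><td>South</td><td>7</td><td>9</td></tr>
</table>
"""

tables = extract_tables(SAMPLE_HTML)
```

Keeping the caption next to the rows preserves the context that the later classification and transformation steps are described as relying on.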

Key Features

• Identifying Data Tables in Documents or Web Pages

• Extracting and Analyzing Tables

• Classifying Tables with respect to Structural Representation Patterns

• De-normalizing Complex Tables into Single Dimension (Top Row As Field Names)

• Storing and Indexing Transformed Tables with Documents or Contexts

• Joining or Appending Tables from Different but Similar Documents

Problem Definition

• There is no existing solution that identifies the appropriate patterns and categories of data tables in documents (e.g., HTM, PDF, DOC, PPT, ODF, XML).

• The proposal is to extract data tables from documents, extract the tabular information, represent it in a relational database structure, and index it with the right context, so that it can be compared, computed on, merged or appended, and presented for analytics.


• Improving cognitive research and practices
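The idea of representing extracted tables in a relational structure and indexing them with their context can be sketched with Python's standard `sqlite3` module. The schema below is an assumption made for illustration: the `table_context` table, the helpers `quote_ident` and `store_table`, and all sample values are hypothetical and do not come from the disclosure.

```python
import sqlite3


def quote_ident(name):
    """Quote an SQL identifier so arbitrary header text can serve as a name."""
    return '"' + name.replace('"', '""') + '"'


def store_table(conn, table_name, columns, rows, context):
    """Create a relational table and record its document context for lookup."""
    cols_sql = ", ".join(f"{quote_ident(c)} TEXT" for c in columns)
    conn.execute(f"CREATE TABLE {quote_ident(table_name)} ({cols_sql})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {quote_ident(table_name)} VALUES ({placeholders})", rows
    )
    # Assumed context index: one row per stored table.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS table_context ("
        "table_name TEXT PRIMARY KEY, document TEXT, section TEXT, caption TEXT)"
    )
    conn.execute(
        "INSERT INTO table_context VALUES (?, ?, ?, ?)",
        (table_name, context["document"], context["section"], context["caption"]),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_table(
    conn,
    "revenue_2015",  # made-up table name and data
    ["Region", "Quarter", "Revenue"],
    [("North", "Q1", "10"), ("North", "Q2", "12"), ("South", "Q1", "7")],
    {
        "document": "report_2015.pdf",
        "section": "Financial Summary",
        "caption": "Revenue by region (USD million)",
    },
)

# Find the table through its context, then run an ordinary SQL query on it.
found = conn.execute(
    "SELECT table_name FROM table_context WHERE caption LIKE ?", ("%Revenue%",)
).fetchone()[0]
total_north = conn.execute(
    f"SELECT SUM(CAST(Revenue AS INTEGER)) FROM {quote_ident(found)} "
    "WHERE Region = ?",
    ("North",),
).fetchone()[0]
```

Storing every cell as text and casting at query time is a simplification; a fuller system would presumably infer column types during transformation.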


Applications

• Natural language to structured query composition projects

• Question-answering systems (e.g., IBM Watson)

• Effective in the fields of finance, retail, and media (news articles)

• Helpful in the education domain (data interpretation)

• Can be implemented efficiently for large-scale artificial intelligence projects and decision support systems with automatic logic building and cognition

Method

• The tabular data in document tables is generally in a semi-structured (or unstructured) format and is hard to query, even for simple questions.

• The proposed idea is a unique and flexible method of converting the semi-structured tabular data within documents or web pages into a structured format so that it can be accessed, queried, and manipulated by plain vanilla programs.

• The idea is to de-normalize complex tables into a single-dimensional table so that the data can answer factual as well as inferred or computed queries.

• It also includes semantically combining (joining or appending) two different tables that are similar in nature in terms of heading and type of data.

• The important aspect of semantically understanding t...
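The de-normalization and appending ideas in the list above can be sketched as follows for one common pattern, a table with a two-level column header. The functions `denormalize` and `append_tables`, their parameters, and the sample data are hypothetical. Matching tables by identical field names is a deliberate simplification standing in for the semantic matching the text describes, which is not specified here.

```python
def denormalize(row_label, spans, sub_headers, body_rows,
                span_field, sub_field, value_field):
    """Flatten a two-level-header table into one record per data cell.

    spans lists each spanning header as (label, number of columns covered).
    """
    # Expand the spanning header so every data column has a top-level label.
    top = [label for label, width in spans for _ in range(width)]
    if len(top) != len(sub_headers):
        raise ValueError("spanning header does not cover all data columns")
    fields = [row_label, span_field, sub_field, value_field]
    records = []
    for row in body_rows:
        key, cells = row[0], row[1:]
        for span_label, sub_label, value in zip(top, sub_headers, cells):
            records.append((key, span_label, sub_label, value))
    return fields, records


def append_tables(first, second):
    """Append two flat tables whose field names match (case-insensitive)."""
    fields_a, rows_a = first
    fields_b, rows_b = second
    if [f.lower() for f in fields_a] != [f.lower() for f in fields_b]:
        raise ValueError("tables do not share the same fields")
    return fields_a, rows_a + rows_b


# Made-up table from one document: year 2014 spans the Q1 and Q2 columns.
table_2014 = denormalize(
    "Region", [("2014", 2)], ["Q1", "Q2"],
    [["North", "8", "9"], ["South", "5", "6"]],
    "Year", "Quarter", "Revenue",
)
# A similar made-up table from a second document.
table_2015 = denormalize(
    "Region", [("2015", 2)], ["Q1", "Q2"],
    [["North", "10", "12"], ["South", "7", "9"]],
    "Year", "Quarter", "Revenue",
)

fields, records = append_tables(table_2014, table_2015)
```

Each resulting record carries its full context (region, year, quarter), so the top row acts as the field names and the combined table can be loaded into a relational database and queried directly.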