HIGH EFFICIENT TOOL FOR EXTRACTING TABULAR DATA FROM XML FILE

IP.com Disclosure Number: IPCOM000235919D
Publication Date: 2014-Mar-29

Publishing Venue

The IP.com Prior Art Database

Abstract

Systems, methods, and computer program product embodiments are provided to extract element and attribute content from a markup language document and convert it to a tabular format. An embodiment includes receiving a group of column path definitions and optional row path definitions, normalizing these path definitions, generating a parser plan, extracting element content and attribute values while parsing the markup language document, and merging these contents and values to form the tabular output data. A further embodiment uses Extensible Markup Language (XML) as the markup language and XPath expressions as the path definitions.

This is the abbreviated version, containing approximately 39% of the total text.

Background

The existing method for converting an XML file to tabular data uses XQuery to extract nodes from the XML file and then combines those nodes into a new XML document; the final tabular data is extracted from that new document. This is a burdensome route to the final goal, because the results of XQuery are represented in DOM or StAX format.

To flatten the XML into a tabular format, all data is accessed at least twice. The first pass extracts nodes from the original XML document to form the intermediate XML document, and the second pass extracts content from the intermediate XML document to form the tabular data. If the data set is large, this two-phase process consumes considerable time and resources.

Tabular data requires only limited information from the document, such as element text content and attribute values, so extracting and storing so much irrelevant information as intermediate results is wasteful.

Brief Summary

The present disclosure aims to overcome the performance and resource usage problems of the previous method by providing a new approach and system that eliminates the intermediate DOM or StAX format document and extracts all required data in one pass.

Description

The XML document shown in FIG. 3 represents a simple structure of a school class. The tree starts at the root element 300, and each element or attribute in the document has been converted to a tree node. An ellipse represents an XML document element, a rectangle represents text content, and a quoted string beside a shape gives the text content value. As shown, the element class 301 has four children: student 303, student 315, teacher 332, and teacher 337. Students 303 and 315 are siblings. Student 303 has an attribute id 304, whose text content is "001", and three other children. The name 305 is a child of student 303 and has text content 306, "Tim".
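Since FIG. 3 is not reproduced in this abbreviated text, the following is only a rough sketch of the tree model just described, written in Python. The sample document is hypothetical: only the class, student, and teacher elements, the id value "001", and the name "Tim" come from the description, and every other value (and the omission of the students' remaining children) is an assumption made for illustration. The helper name tree_nodes is likewise invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the FIG. 3 document. Only id="001" and the
# name "Tim" are taken from the text; all other values are assumed.
SAMPLE = """
<class>
  <student id="001"><name>Tim</name></student>
  <student id="002"><name>Ann</name></student>
  <teacher><name>Bob</name></teacher>
  <teacher><name>Eve</name></teacher>
</class>
"""

def tree_nodes(elem, path=""):
    """Yield (path, kind, value) for each element, attribute, and text node."""
    here = f"{path}/{elem.tag}"
    yield here, "element", None
    # Each attribute becomes its own tree node under the element.
    for name, value in elem.attrib.items():
        yield f"{here}/@{name}", "attribute", value
    # Non-blank text content becomes a text node under the element.
    text = (elem.text or "").strip()
    if text:
        yield f"{here}/text()", "text", text
    for child in elem:
        yield from tree_nodes(child, here)

nodes = list(tree_nodes(ET.fromstring(SAMPLE)))
for path, kind, value in nodes:
    print(kind, path, value)
```

In this sketch the attribute id and the text "Tim" appear as separate nodes below student and name respectively, which mirrors how the description treats attributes and text content as distinct tree nodes.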

In the FIG. 3 diagram, D denotes a data node. A data node is either the text content of an element or an attribute of an element; its value is requested by the user in an input XPath. R denotes a record node. A record node is defined as the common element of all input XPaths. During parsing, when the end of an R node is encountered, a group of records is complete and is output. M denotes a merge node. Merge nodes are defined as all nodes between the record node and a data node. During parsing, data from all of a merge node's children is merged, and the merged result is appended to its parent node.
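The record-node idea can be illustrated with a minimal Python sketch. It is not the disclosed implementation: it assumes simple absolute paths of the form /a/b/c or /a/b/@attr with no predicates or namespaces, it assumes each column yields at most one value per record (so the merge step is reduced to filling one row), and the function names common_record_path and extract_rows as well as the sample data are hypothetical. It uses the streaming parser xml.etree.ElementTree.iterparse so that data is read in a single pass.

```python
import io
import xml.etree.ElementTree as ET

def common_record_path(column_paths):
    """Longest element prefix shared by all column paths (the R node)."""
    # Drop the last step of each path: it names the data node itself.
    parents = [p.strip("/").split("/")[:-1] for p in column_paths]
    prefix = []
    for steps in zip(*parents):
        if len(set(steps)) != 1:
            break
        prefix.append(steps[0])
    return prefix

def extract_rows(xml_source, column_paths):
    """Stream the document once and emit one row per record element."""
    record = common_record_path(column_paths)
    # Column steps relative to the record node.
    columns = [p.strip("/").split("/")[len(record):] for p in column_paths]
    stack, row, rows = [], None, []
    for event, elem in ET.iterparse(xml_source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            if stack == record:
                row = [None] * len(columns)   # a new record begins
            if row is not None:
                rel = stack[len(record):]
                for i, steps in enumerate(columns):
                    # Attribute values are available at the start event.
                    if steps[-1].startswith("@") and steps[:-1] == rel:
                        row[i] = elem.get(steps[-1][1:])
        else:
            if row is not None:
                rel = stack[len(record):]
                for i, steps in enumerate(columns):
                    # Element text is complete only at the end event.
                    if not steps[-1].startswith("@") and steps == rel:
                        row[i] = (elem.text or "").strip()
            if stack == record:
                rows.append(tuple(row))       # end of R node: output the row
                row = None
                elem.clear()                  # release the finished subtree
            stack.pop()
    return rows

# Hypothetical data: only id "001" and name "Tim" come from the description.
SAMPLE = (b'<class><student id="001"><name>Tim</name></student>'
          b'<student id="002"><name>Ann</name></student>'
          b'<teacher><name>Bob</name></teacher></class>')
rows = extract_rows(io.BytesIO(SAMPLE), ["/class/student/@id", "/class/student/name"])
print(rows)
```

Here the two input paths share the prefix /class/student, so student plays the role of the record node, and teacher elements are skipped because they lie outside it. A fuller treatment of merge nodes, where repeated children can produce several records per R node, is outside what this sketch attempts.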

FIG. 1 illustrates an environment that implements the present disclosure, which receives a list of XPath expressions and extracts element text content or attribute values to form a tabular o...