Indexing XML Documents to increase parsing performance
Disclosure Number: IPCOM000132350D
Original Publication Date: 2005-Dec-09
Included in the Prior Art Database: 2005-Dec-09
Document File: 2 page(s) / 68K

Publishing Venue



This disclosure describes a method for improving the performance of parsing an XML document. Many XML documents are semi-structured, containing major sections that may be processed by different parts of a system. The performance improvement described here comes from indexing the start of those major sections so that each subsystem need parse only the sections of the document that are of interest, reducing the amount of redundant parsing.

This is the abbreviated version, containing approximately 53% of the total text.

Indexing XML Documents to increase parsing performance

The most commonly used method for parsing an XML document is to use a SAX parser. The SAX parser starts at the top of the document and fires an event for every tag it encounters. Usually this is what the application program requires, because it is interested in the whole document. Increasingly, however, different sections of a single XML document may be required by several components of a process. One example is the processing of a SOAP document.
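To illustrate the event-per-tag behaviour described above, here is a minimal sketch using Python's standard `xml.sax` module. The `TagLogger` class and the simplified, namespace-free SOAP-like sample are assumptions made for illustration and do not come from the disclosure.

```python
import xml.sax


class TagLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        # Called once for each opening tag, in document order.
        self.seen.append(name)


# Simplified stand-in for a SOAP message (no namespaces, illustrative only).
SOAP_LIKE = (
    b"<Envelope><Header><Security/></Header>"
    b"<Body><GetQuote/></Body></Envelope>"
)

handler = TagLogger()
xml.sax.parseString(SOAP_LIKE, handler)
print(handler.seen)  # ['Envelope', 'Header', 'Security', 'Body', 'GetQuote']
```

A component interested only in `Body` still receives the events for `Envelope`, `Header` and `Security`, because the parse always begins at the top of the document.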

    A SOAP document contains distinct, defined sections: the main envelope, the header sections and the body section. When a SOAP message is processed, these sections are generally handled by different parts of the web services processing pipeline. If each processing step were independent, the whole document would be parsed completely several times. This is inefficient and has major performance implications, as each step parses a large number of tags in which it has no interest.

    This article describes a method to allow sections of an XML document to be processed directly by using a custom parser and an indexing mechanism within the document.

    To avoid the multiple-parsing problem described above, current solutions require the various components with an interest in the document to be closely coupled. This allows information from a single parse to be shared between the processes. The solution disclosed here avoids the requirement for this close coupling.

    If the structure of the document is known in advance, as in the SOAP document example above, an index could be added as a comment at the beginning of the XML document as it is written. The index holds the physical character offset into the XML document of each key element that starts a logical section of the document. An option could then be added to a SAX-style parser to allow an application to parse the document selectively: it could skip the beginning of the document up to the first occurrence of a required tag listed in the index and begin parsing from there.
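The indexing idea above can be sketched as follows. This is only an illustration under stated assumptions: the `<!--index Tag=offset ...-->` comment layout, the fixed eight-digit offsets, and the helper names `write_with_index`, `read_index` and `parse_section` are all hypothetical and are not the format or API used by the disclosure. Stopping the parse by raising an exception from the handler is likewise just one possible way to end at the section's closing tag.

```python
import re
import xml.sax

# Hypothetical index layout: <!--index Header=00000052 Body=00000080-->
INDEX_RE = re.compile(r"<!--index (.*?)-->")


def write_with_index(body, tags):
    """Prefix body with a comment giving the character offset of each tag."""
    # Fixed-width offsets let us know the comment length before filling it in.
    placeholder = "<!--index " + " ".join(f"{t}=00000000" for t in tags) + "-->"
    base = len(placeholder)
    entries = " ".join(f"{t}={base + body.index('<' + t):08d}" for t in tags)
    return f"<!--index {entries}-->{body}"


def read_index(doc):
    """Return a {tag: offset} mapping read from the leading index comment."""
    match = INDEX_RE.match(doc)
    if match is None:
        raise ValueError("document has no leading index comment")
    return {k: int(v) for k, v in (e.split("=") for e in match.group(1).split())}


class _SectionDone(Exception):
    """Raised by the handler to stop once the indexed element has closed."""


class SectionHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.seen = []

    def startElement(self, name, attrs):
        self.depth += 1
        self.seen.append(name)

    def endElement(self, name):
        self.depth -= 1
        if self.depth == 0:  # the section's own end tag was reached
            raise _SectionDone


def parse_section(doc, tag):
    """Parse only the section starting at the indexed offset for tag."""
    offset = read_index(doc)[tag]
    handler = SectionHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    try:
        parser.feed(doc[offset:])  # everything before the offset is skipped
    except _SectionDone:
        pass
    return handler.seen


BODY = "<Envelope><Header><Security/></Header><Body><GetQuote/></Body></Envelope>"
doc = write_with_index(BODY, ["Header", "Body"])
print(parse_section(doc, "Body"))  # ['Body', 'GetQuote']
```

In this sketch the component that handles `Body` never sees events for `Envelope` or `Header`. One consequence of starting mid-document is that namespace declarations made on skipped ancestor elements are not seen by the parser, so a real implementation would need to account for them.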

    This would allow a significant improvement in parsing performance, because the elements that precede the section of interest, and are not required, are never parsed.

    As the document is written out a simple index in xml is added as a comment to the beginning of...