Browse Prior Art Database

High-performance XML parser using LALR parser generator

IP.com Disclosure Number: IPCOM000031980D
Original Publication Date: 2004-Oct-18
Included in the Prior Art Database: 2004-Oct-18
Document File: 3 page(s) / 39K

Publishing Venue

IBM

Abstract

Techniques of using a LALR parser generator to generate generic XML parsers are described. Used in conjunction with a lexical scanner generator, this approach has several advantages over hand-written recursive descent parsing. The key advantages are as follows: (i) user input is specific to rules and high level semantic actions as opposed to hand-coding the details of lexical analysis and parsing, (ii) the generated parser has superior performance as compared to the hand-written parser, and (iii) the generated parser can support a customized interface for high performance or a general interface such as DOM or SAX.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 32% of the total text.

Page 1 of 3

High-performance XML parser using LALR parser generator

I. Introduction

All the general-purpose XML parsers we have seen are hand-written, using recusive-descent parsing, for the reason of easy implementation. However, this approach does not provide best performance, which is a well-known fact, due to heavy recursive calls among small routines. On the other hand, table-driven SLR (or LALR) parsers can provide better performance, because it uses iteration and low overhead stack. They are not easily implemented by hand, but can be generated easily using a parser generator, such as Flex/Bison (GNU version of Lex/Yacc), especially when some restrictions on XML input are imposed.

Database engines require very high performance. Taking any excessive instructions out would amount to performance improvement for the system. A relatively closed environment within an engine gives opportunities for more optimization, by avoiding overhead of supporting standard interfaces, such as DOM or SAX. Calls to semantic action routines can be inlined to further reduce the overhead. This is the first choice for an XML parser inside a database engine. For other application environments, or even within database environment, we present detailed design to also support DOM or SAX interface from a parser-generated high-performance XML parser.

Generating an XML parser using a LALR parser generator is simpler than generating parsers for most other languages, since XML is relatively simple to parse, especially when there are no complex parameter entities and general entities involved. However, one should not take the BNF from the XML recommendation as direct input. Care must be taken to separate the lexical rules from the grammatical rules. For lexical rules, one can use lexical scanner generator (such as Lex or Flex) to generate the scanner, or manually write the lexical scanner, since it is not complex and hand-written ones might have better performance.

In the following, we will discuss the process of generating an XML parser using parser generator approach. First, identify lexical rules and grammatical rules in BNF from the XML recommendation. Then map the BNF grammatical rules to the format a parser generator (such as Yacc or Bison, or any other similar parser generator) requires, add the semantic actions into the parser generator input file, and run the parser generator. Some parser generators have totally separate semantic actions from production rule input, in that case, semantic action routines can be supplied after running parser generator. Following the standard compilation and linking procedure, compile the generated parser source code, link with the library if any, you will get the XML parser executable, or library. You may want to customize it with the following alternatives:
1) Use inline functions or routines for semantic actions, such as building tree nodes for XML documents. This is suitable for extremely high performance with special-purpose node...