Unstructured HTML file conversion to well-formed XHTML using external rule based parser.

IP.com Disclosure Number: IPCOM000016177D
Original Publication Date: 2002-Aug-16
Included in the Prior Art Database: 2003-Jun-21
Document File: 3 page(s) / 51K

Publishing Venue

IBM

Abstract

Disclosed is a rule-based parser that can convert unstructured HTML data (or any SGML data) to well-formed XML.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 53% of the total text.

    HTML is a standard document format used on the World Wide Web. XHTML is a reformulation of HTML conforming to the rules of XML.

    XHTML documents are XML conforming. As such, they are readily viewed, edited, and validated with standard XML tools.

    Well-formed XML files are much easier to manipulate than unstructured files. With standard XML parser technologies such as SAX and Xerces, converting these files to other formats (such as word-processor-specific formats) is greatly simplified.
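
    As a rough illustration of why well-formedness matters, the sketch below feeds a document to Python's standard `xml.sax` module. The `TextCollector` handler and the two helper functions are names invented for this example and are not part of the disclosure. A well-formed document is walked element by element, while a document with an unterminated tag is rejected outright.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Records element names and non-blank text reported by the SAX parser."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self.text = []

    def startElement(self, name, attrs):
        self.elements.append(name)

    def characters(self, content):
        # The parser may deliver text in several chunks; keep non-blank ones.
        if content.strip():
            self.text.append(content.strip())


def parse_well_formed(document):
    """Parse the document with SAX; raises SAXParseException if it is not well formed."""
    handler = TextCollector()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler


def is_well_formed(document):
    """Return True if a SAX parser accepts the document."""
    try:
        parse_well_formed(document)
        return True
    except xml.sax.SAXParseException:
        return False
```

    With these helpers, `is_well_formed("<BODY><h1>Title<br>Text</BODY>")` is false because `h1` and `br` are never terminated, whereas the same content with `</h1>` and `<br/>` parses cleanly and can be handed to any SAX-based converter.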

    The parser uses externalized rules that describe the structure of the input file format, which gives the flexibility to specify rules for converting any unstructured SGML-type format to well-formed XML.

    HTML files can be tokenized into tags and text; tags are surrounded by angle brackets:

<HTML> <HEAD>Here is my heading</HEAD> <BODY> <h1>Here is a heading<br> This is the body of the document</BODY></HTML>
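
    The tokenizing step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: anything between `<` and the next `>` is treated as a tag, everything else is text, and blank text between tags is dropped. The names `TOKEN_PATTERN` and `tokenize` are invented for the example; the disclosure does not describe how its tokenizer is implemented.

```python
import re

# A tag is anything between "<" and the next ">"; any other run of
# characters is text. A "<" with no closing ">" is silently skipped.
TOKEN_PATTERN = re.compile(r"<[^>]*>|[^<]+")


def tokenize(html):
    """Split HTML into a list of ("tag", value) and ("text", value) pairs."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(html):
        piece = match.group(0)
        if piece.startswith("<"):
            tokens.append(("tag", piece))
        elif piece.strip():
            # Keep text only if it contains something other than whitespace.
            tokens.append(("text", piece.strip()))
    return tokens
```

    Applied to the fragment `<h1>Here is a heading<br> This is the body`, this yields a tag, a text run, a second tag, and a second text run, in that order.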

    HTML is loosely structured insofar as rules are not applied to ensure the correct nesting and termination of tags. Web browsers are very forgiving of badly structured HTML documents and of documents containing syntax errors, and will usually display the text in some format or other. The parser described here takes account of this loose structure and, using its rules about the tags, constructs a highly structured, error-free document.

    Taking the above example, after it has passed through the parser and had the rules applied, the result would be:

<HTML> <HEAD>Here is my heading</HEAD> <BODY> <h1>Here is a heading</h1><br/> This is the body of the document </BODY>
</HTML>
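
    One way such a conversion might work is sketched below in Python. It is not the disclosed implementation. Only the `Structure` rule kind appears in the disclosure's sample; the `TextOnly` and `Empty` kinds, the `RULES` table, and the `convert` function are assumptions made so that the example input above produces the example output. Attributes, entities, and comments are ignored in this sketch.

```python
import re

# "Structure" comes from the disclosure's sample rules. "TextOnly" and
# "Empty" are hypothetical rule kinds invented for this sketch.
RULES = {
    "HTML": "Structure",
    "HEAD": "Structure",
    "BODY": "Structure",
    "H1": "TextOnly",  # assumed: holds only text, so the next tag closes it
    "BR": "Empty",     # assumed: never has content, written as <br/>
}

TOKEN = re.compile(r"<[^>]*>|[^<]+")
TAG = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def convert(html, rules=RULES):
    """Rewrite loosely structured HTML so every element is terminated."""
    out = []
    stack = []  # names of the elements that are currently open

    def close_top():
        out.append(f"</{stack.pop()}>")

    for match in TOKEN.finditer(html):
        piece = match.group(0)
        if not piece.startswith("<"):
            out.append(piece)  # text is copied through unchanged
            continue
        tag = TAG.match(piece)
        if tag is None:
            continue  # comments, doctypes and malformed tags are skipped
        closing, name = tag.group(1) == "/", tag.group(2)
        if closing:
            # Ignore stray close tags; otherwise unwind to the matching open tag.
            if name.upper() in (open_name.upper() for open_name in stack):
                while stack[-1].upper() != name.upper():
                    close_top()
                close_top()
            continue
        # An opening tag implicitly ends any text-only element still open.
        while stack and rules.get(stack[-1].upper()) == "TextOnly":
            close_top()
        if rules.get(name.upper()) == "Empty":
            out.append(f"<{name}/>")
        else:
            out.append(f"<{name}>")
            stack.append(name)
    while stack:
        close_top()  # terminate anything left open at end of input
    return "".join(out)
```

    Under these assumed rules, the `<br>` tag closes the open `h1` and is itself emitted as `<br/>`, which is the behaviour shown in the example. Tags without a rule are treated like `Structure` tags and stay open until their close tag or the end of input.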

    The external rules describe the characteristics of the tags that may occur in the input. A sample is as follows:

<Rule>
    <Name>HTML</Name>
    <TagRule>Structure</TagRule>
</Rule>
<Rule>
    <Name>HEAD</Name>
    <TagRule>Structure</TagRule>
</Rule>
<Rule>
    <Name>BODY</Name>
    <TagRule>Structure</TagRule>
    <Comments>The content of the Body is what we want to convert.</Comments>...
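
    Rules in this shape could be read into a lookup table with a standard XML parser. The Python sketch below is an illustration only. The `<Rules>` wrapper element, the `SAMPLE_RULES` string, and the `load_rules` function are assumptions: the disclosure does not show the root element of the rule file, and the `BODY` rule is closed here without whatever followed it in the original.

```python
import xml.etree.ElementTree as ET

# The <Rules> wrapper is an assumption made so the fragment has a single root.
# The <Rule>, <Name>, <TagRule> and <Comments> names come from the sample.
SAMPLE_RULES = """
<Rules>
  <Rule><Name>HTML</Name><TagRule>Structure</TagRule></Rule>
  <Rule><Name>HEAD</Name><TagRule>Structure</TagRule></Rule>
  <Rule>
    <Name>BODY</Name>
    <TagRule>Structure</TagRule>
    <Comments>The content of the Body is what we want to convert.</Comments>
  </Rule>
</Rules>
"""


def load_rules(xml_text):
    """Return a dict mapping upper-cased tag names to their TagRule value."""
    rules = {}
    for rule in ET.fromstring(xml_text).iter("Rule"):
        name = rule.findtext("Name")
        tag_rule = rule.findtext("TagRule")
        if name and tag_rule:  # skip incomplete rules rather than guess
            rules[name.strip().upper()] = tag_rule.strip()
    return rules
```

    Keeping the rules in a separate file like this means a new input format can be supported by editing the rule file alone, which is the flexibility the disclosure attributes to externalized rules.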