Parsing from flat RTF to a complex XML structure.
Original Publication Date: 2005-May-31
Included in the Prior Art Database: 2005-May-31
This publication discloses an application of XSLT technology to the parsing of plain RTF text, with the goal of obtaining structured and qualified XML text. The resulting text can be used for text and context information retrieval with improved efficiency, because explicit XML tags make the text structure evident. Using XSLT and XML we obtain an open, customizable and flexible method for extracting structural information from plain RTF text.
Paper documents such as medical bulletins, law collections, government publications or encyclopedias can generally be converted into RTF text files by, for example, OCR programs. More generally, a large number of documents are currently available in an "RTF-like" electronic format. Like HTML, this document format is oriented toward presentation rather than "content". In general, however, the print presentation carries (possibly in conjunction with some special text) some structural information, such as "this part of the text is the abstract", "this part of the text is the main content", "this part of the text is in bold format" and so on.
A search engine may use that information to refine and extend its search capabilities. If the text is stored in some XML format (i.e. using any specific XML syntax) in which the structural information is defined, the search engine can use this explicit information to permit a search such as: "find a law in which the word revolver is used in the abstract zone and in the crime zone, only if this word is in bold style". Obviously, this XML may be built with manual or semi-manual operations, but that approach is very expensive and its error rate is generally high. In this document we disclose an application of XSLT technology to build a "filter" that addresses the following aspects of the "structurization" process:
1. Recognize, isolate and (possibly) rebuild each single Readable Basic Unit (RBU). By RBU we mean the most elementary unit of text that is significant in the context in use (e.g. a law, an article, an item)
2. Recognize the interdependence between RBUs to obtain a hierarchical organization (such as a chapter that contains paragraphs)
3. Categorize each RBU so that the correct pattern analysis can be applied (e.g. the pattern analysis for a chapter or for a paragraph)
4. Apply pattern analysis to make the text structure evident
5. Convert the various fonts into a single Unicode font using custom transliteration tables, to unify the fonts and to make font style and effects evident (bold, superscript and so on)
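The steps above can be illustrated with a minimal sketch. It is written in Python rather than XSLT, purely to keep the example short and runnable, and it is not the disclosed filter. The input format (paragraphs as lists of text/font/bold runs), the patterns, the transliteration entries and the element names are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: paragraphs already extracted from the RTF file as
# lists of (text, font, bold) runs. Tokenizing real RTF is out of scope.
PARAGRAPHS = [
    [("Chapter 1 General provisions", "Times", True)],
    [("Art. 1 ", "Times", True), ("Scope of the law.", "Times", False)],
    [("Art. 2 Angle ", "Times", False), ("a", "Symbol", False),
     (" is fixed.", "Times", False)],
]

# Step 5: illustrative transliteration table keyed by (font, character).
TRANSLIT = {("Symbol", "a"): "\u03b1", ("Symbol", "b"): "\u03b2"}

# Step 3: illustrative patterns that categorize an RBU, tried in order.
CATEGORIES = [
    ("chapter", re.compile(r"^Chapter\s+(\d+)\s+(.*)$")),
    ("article", re.compile(r"^Art\.\s*(\d+)\s+(.*)$")),
]


def to_unicode(text, font):
    """Map each character through the table; unknown pairs pass through."""
    return "".join(TRANSLIT.get((font, ch), ch) for ch in text)


def categorize(plain):
    """Return (category, match) for the first matching pattern, else (None, None)."""
    for name, pattern in CATEGORIES:
        match = pattern.match(plain)
        if match:
            return name, match
    return None, None


def build(paragraphs):
    """Turn paragraphs into an XML tree plus a list of uncategorized RBUs."""
    root = ET.Element("document")
    parent = root          # articles attach to the most recent chapter
    exceptions = []        # (paragraph index, text) of unrecognized RBUs
    for index, runs in enumerate(paragraphs):
        # Step 1 (simplified): one paragraph is taken as one RBU.
        plain = "".join(to_unicode(text, font) for text, font, _bold in runs)
        category, match = categorize(plain)
        if category is None:
            exceptions.append((index, plain))
            continue
        if category == "chapter":
            # Step 2: a chapter opens a new level in the hierarchy.
            parent = ET.SubElement(root, "chapter", number=match.group(1))
            ET.SubElement(parent, "title").text = match.group(2)
        else:
            # Step 4: the pattern separates the article number from its text.
            article = ET.SubElement(parent, "article", number=match.group(1))
            ET.SubElement(article, "text").text = match.group(2)
    return root, exceptions


if __name__ == "__main__":
    tree, problems = build(PARAGRAPHS)
    print(ET.tostring(tree, encoding="unicode"))
    print(problems)
```

The sketch ignores the bold flag and treats each paragraph as one RBU; a real filter would also have to rebuild RBUs split across paragraphs and carry font style into the output.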
At the end of this process the initial RTF document has become a series of XML RBUs, with the document structure and font properties expressed in a specific XML syntax.
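To show how such output could support the "revolver" search mentioned earlier, here is a small Python sketch. The XML vocabulary (collection, law, abstract, crime, b) and the sample data are hypothetical; the actual XML syntax produced by the filter is not specified in this publication.

```python
import xml.etree.ElementTree as ET

# Hypothetical filter output; element and attribute names are illustrative.
SAMPLE = """
<collection>
  <law id="L1">
    <abstract>Rules on carrying a <b>revolver</b> in public.</abstract>
    <crime>Unlicensed possession of a <b>revolver</b>.</crime>
  </law>
  <law id="L2">
    <abstract>Rules on the sale of a revolver.</abstract>
    <crime>Sale of a <b>revolver</b> to minors.</crime>
  </law>
</collection>
"""


def bold_word_in_zone(law, zone, word):
    """True if `word` occurs inside a <b> element of the given zone of `law`."""
    for zone_element in law.findall(zone):
        for bold in zone_element.iter("b"):
            if word.lower() in (bold.text or "").lower():
                return True
    return False


def find_laws(root, word, zones):
    """Return ids of laws where `word` is bold in every listed zone."""
    return [
        law.get("id")
        for law in root.iter("law")
        if all(bold_word_in_zone(law, zone, word) for zone in zones)
    ]


root = ET.fromstring(SAMPLE)
print(find_laws(root, "revolver", ["abstract", "crime"]))  # prints ['L1']
```

Law L2 is excluded because the word appears in its abstract zone without bold style, which is exactly the kind of distinction that presentation-only RTF cannot expose to a search engine.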
The following figure illustrates the main filter structure:
Particular attention is paid to exception management. The filter explicitly contains a series of objects designed to isolate, categorize and store filtering exceptions, to permit the user a quick and efficient...