
Technique to Generate very Efficient Compression/Decompression Engines for XML Data Streams Based on the Stream's DTD

IP.com Disclosure Number: IPCOM000013426D
Original Publication Date: 2000-Feb-01
Included in the Prior Art Database: 2003-Jun-18
Document File: 2 page(s) / 44K

Publishing Venue

IBM

Abstract

XML, the eXtensible Markup Language, is a popular markup language that defines a common way for people to specify a grammar for encoding a self-defining data stream for a particular industry or group (e.g., banking, insurance policies). XML encourages the use of fairly long character strings (more than a character or two) as data identifiers, or "tags". These tags are part of the data stream along with the data values, so an XML-compliant data stream is often significantly longer than a data stream containing only data values. The extra length can severely affect the speed of transmission of such a data stream, its transient storage in memory, and its more permanent storage on non-volatile media such as a file on disk. Traditional compression algorithms are designed for generic data streams. They look for patterns of repeated characters or bits, or other patterns in the data stream itself; they do not assume any prior knowledge of the content or structure of the data stream. Thus they are less efficient than they could be when used on a data stream with a known structure and some known content, such as an XML-compliant data stream. The invention, in brief, is to use the known structure and partially known content inherent in an XML-compliant data stream to generate compression/decompression engines tailored to the grammar used by a particular industry or group. Because they exploit this prior knowledge, these engines are intended to be faster and to compress more tightly than traditional compression algorithms.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 51% of the total text.



The invention recognizes that a valid XML-compliant data stream (henceforth called an "XML document", per standard XML terminology) is always associated with a Document Type Definition, or DTD. The DTD defines the particular grammar for the particular group exchanging the data. For example, a DTD for exchanging insurance policy information might define such XML tags as "clientname" and "premium_amount". Other character strings besides tags are also defined in the DTD. The same DTD is typically used for all documents of this type exchanged by members of the group; a DTD is only worth designing if a significant number of documents will be encoded this way. Different insurance policy documents have different values for "clientname", but they all use the "clientname" tag. The tag tells the reader that, in a given document, "John Doe" is the client's name and not, for example, the agent's name.
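To make the role of the shared DTD concrete, the following is a minimal sketch in Python of how both parties could derive the same tag table from a DTD and use it to replace tags with one-byte codes. It is an illustration only, not the engine described in this disclosure. The sample DTD, the byte layout (END, TEXT, OPEN_BASE), and the function names are all assumptions made for the example, and attributes, empty-element tags, and text runs longer than 65,535 bytes are deliberately not handled.

```python
import re

# Hypothetical DTD; the element names echo the examples in the text.
DTD = """
<!ELEMENT policy (clientname, premium_amount)>
<!ELEMENT clientname (#PCDATA)>
<!ELEMENT premium_amount (#PCDATA)>
"""

# Assumed byte layout for this sketch (not taken from the disclosure).
END = 0        # closes the most recently opened element
TEXT = 1       # followed by a 2-byte length and UTF-8 character data
OPEN_BASE = 2  # OPEN_BASE + n opens the n-th element declared in the DTD


def build_tag_table(dtd):
    """Map each element name declared in the DTD to a small integer."""
    names = re.findall(r"<!ELEMENT\s+([\w.-]+)", dtd)
    return {name: index for index, name in enumerate(names)}


def compress(xml, table):
    """Replace start and end tags with single bytes; keep text as-is."""
    out = bytearray()
    for token in re.split(r"(<[^>]+>)", xml):
        if not token:
            continue
        if token.startswith("</"):
            out.append(END)
        elif token.startswith("<"):
            out.append(OPEN_BASE + table[token[1:-1]])
        else:
            data = token.encode("utf-8")
            out.append(TEXT)
            out += len(data).to_bytes(2, "big")
            out += data
    return bytes(out)


def decompress(blob, table):
    """Rebuild the XML text using the same DTD-derived table."""
    names = {code: name for name, code in table.items()}
    out, stack, i = [], [], 0
    while i < len(blob):
        byte = blob[i]
        i += 1
        if byte == END:
            out.append("</" + stack.pop() + ">")
        elif byte == TEXT:
            length = int.from_bytes(blob[i:i + 2], "big")
            i += 2
            out.append(blob[i:i + length].decode("utf-8"))
            i += length
        else:
            name = names[byte - OPEN_BASE]
            stack.append(name)
            out.append("<" + name + ">")
    return "".join(out)


if __name__ == "__main__":
    table = build_tag_table(DTD)
    doc = ("<policy><clientname>John Doe</clientname>"
           "<premium_amount>120.50</premium_amount></policy>")
    packed = compress(doc, table)
    print(len(doc.encode("utf-8")), "bytes ->", len(packed), "bytes")
    print(decompress(packed, table) == doc)
```

Because the table is built from the DTD rather than from the document, it never has to be transmitted with the data: any party holding the same DTD can rebuild it. In this sketch only the data values ("John Doe", "120.50") travel as text.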

The invention also recognizes that there is a special family of compression algorithms called "Shannon encoding", in which small binary values (of small but varying bit lengths) are assigned to frequently-used strings of data in order to reduce the size of the data stream. This is usually done by first reading the data stream to look for repeated data strings such as the 32-bit va...