Technique to Generate Very Efficient Compression/Decompression Engines for XML Data Streams Based on the Stream's DTD

IP.com Disclosure Number: IPCOM000013426D
Original Publication Date: 2000-Feb-01
Included in the Prior Art Database: 2003-Jun-18

Publishing Venue

IBM

Abstract

XML, the eXtensible Markup Language, is a popular markup language that defines a common way to specify a grammar for encoding a self-defining data stream for a particular industry or group (e.g., banking, insurance policies). XML encourages the use of fairly long character strings (more than a character or two) as data identifiers, or "tags". These tags are part of the data stream along with the data values, so an XML-compliant data stream is often significantly longer than a data stream containing only the data values. The extra length can severely affect the speed of transmitting such a data stream, its transient storage in memory, and its more permanent storage on non-volatile media such as a file on disk.

Traditional compression algorithms are designed for generic data streams. They look for patterns of repeated characters or bits, or other patterns in the data stream itself; they do not assume any prior knowledge of the content or structure of the data stream. Thus they are less efficient than they could be when used on a data stream with a known structure and some known content, such as an XML-compliant data stream.

The invention, in brief, is to use the known structure and partially known content inherent in an XML-compliant data stream to generate compression/decompression engines tailored to the grammar used by a particular industry or group. Because they exploit this prior knowledge, these engines are by nature faster and compress more tightly than traditional compression algorithms.
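As an illustration only, the general idea of exploiting tag names that are known in advance can be sketched in a few lines of Python. This is not the engine described in the disclosure, whose details the abstract does not give. It is a minimal sketch under several assumptions: the element names in `DTD_ELEMENTS` are hypothetical and stand in for names read from a DTD, tags carry no attributes, the character data is pure ASCII, the input is well-formed, and the token values `END_TOKEN` and `START_BASE` are arbitrary choices.

```python
import re

# Hypothetical element names, standing in for names taken from the
# <!ELEMENT ...> declarations of a DTD shared by sender and receiver.
DTD_ELEMENTS = ["policy", "holder", "name", "premium"]

# Assumed token layout: bytes >= 0x80 never occur in ASCII character data.
END_TOKEN = 0x80    # one shared byte for every closing tag
START_BASE = 0x81   # START_BASE + i stands for the opening tag of element i

# Matches attribute-free start and end tags only (a simplifying assumption).
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>")


def compress(xml, elements=DTD_ELEMENTS):
    """Replace each tag with a one-byte token; copy character data verbatim."""
    index = {name: i for i, name in enumerate(elements)}
    out = bytearray()
    pos = 0
    for m in TAG_RE.finditer(xml):
        out += xml[pos:m.start()].encode("ascii")
        closing, name = m.group(1), m.group(2)
        # An element name missing from the DTD list raises KeyError here.
        out.append(END_TOKEN if closing else START_BASE + index[name])
        pos = m.end()
    out += xml[pos:].encode("ascii")
    return bytes(out)


def decompress(data, elements=DTD_ELEMENTS):
    """Rebuild the tags from tokens, using a stack to name each closing tag."""
    out = []
    stack = []
    for b in data:
        if b == END_TOKEN:
            out.append("</" + stack.pop() + ">")
        elif b >= START_BASE:
            name = elements[b - START_BASE]
            stack.append(name)
            out.append("<" + name + ">")
        else:
            out.append(chr(b))
    return "".join(out)
```

Because both sides hold the same element list, the tag names themselves never travel in the compressed stream, and the nesting rules of XML let a single byte stand for every closing tag. A generic compressor, by contrast, has to discover the repeated tag strings from the data itself.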