Improving Initial Compression Ratio for Lempel-Ziv (LZ) Compressed Files
Original Publication Date: 2004-May-19
Included in the Prior Art Database: 2004-May-19
In many domains, a subset of the symbols (characters, words, phrases, ...) used in a compressed document is known to both the sending and the receiving application. Examples include XML documents, where the DTD (Document Type Definition) has to be known by both processing ends. This knowledge can be used to improve the compression ratio and reduce the amount of data that has to be transferred.
Instead of initializing the buffer with empty symbols (as done in most compression schemes), the buffer is initialized with a predefined dictionary, e.g., the set of all character pairs or keywords.
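As an illustration only, the seeding idea can be sketched with the preset-dictionary feature of zlib (an LZ77-based compressor), which is not necessarily the mechanism this disclosure has in mind. The seed contents, the sample document, and the helper names `compress_seeded` and `decompress_seeded` are assumptions made for this sketch.

```python
import zlib

# Hypothetical seed: strings that sender and receiver are both assumed to know.
SEED = b'<?xml version="1.0"?><order><item><name></name><price></price></item></order>'


def compress_seeded(data: bytes, seed: bytes = SEED) -> bytes:
    """Compress data with the LZ window pre-loaded with seed."""
    comp = zlib.compressobj(level=9, zdict=seed)
    return comp.compress(data) + comp.flush()


def decompress_seeded(blob: bytes, seed: bytes = SEED) -> bytes:
    """Decompress a blob produced by compress_seeded; the same seed is required."""
    decomp = zlib.decompressobj(zdict=seed)
    return decomp.decompress(blob) + decomp.flush()


# Hypothetical sample document that reuses the seeded tags.
doc = b'<order><item><name>pen</name><price>2</price></item></order>'
plain = zlib.compress(doc, 9)   # conventional start with an empty window
seeded = compress_seeded(doc)   # window starts out containing SEED
```

Comparing `len(plain)` with `len(seeded)` shows the effect on a short document, where a conventional compressor has not yet seen enough input to find matches. The receiver must use exactly the same seed, otherwise decompression fails.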
One application is XML parsing, where information automatically derived from the XML Document Type Definition (DTD) or schema can be used to seed the buffer. If multiple versions of the DTD/schema exist (upward/downward compatibility), care must be taken to choose a DTD/schema for seeding that the receiver is known to have. The result is a scheme that combines the advantages of XML compaction and XML compression.
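How the seed is derived from the DTD is left open above. One possible approach, sketched here under simplifying assumptions, is to collect the element names declared in the DTD and emit the corresponding start and end tags. The function name `seed_from_dtd`, the regular expression, and the sample DTD are hypothetical; attribute names, entities, and parameter entities are ignored.

```python
import re


def seed_from_dtd(dtd_text: str) -> bytes:
    """Build a seed of start/end tags from the <!ELEMENT ...> declarations in a DTD."""
    names = re.findall(r'<!ELEMENT\s+([^\s>]+)', dtd_text)
    # Remove duplicates while keeping declaration order.
    unique = list(dict.fromkeys(names))
    return ''.join(f'<{n}></{n}>' for n in unique).encode('utf-8')


# Hypothetical DTD used only for illustration.
DTD = """<!ELEMENT order (item+)>
<!ELEMENT item (name, price)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT price (#PCDATA)>"""

seed = seed_from_dtd(DTD)
```

Because both ends derive the seed from the same DTD with the same deterministic procedure, the seed itself never has to be transmitted.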
To ensure compatibility in case of version skew, a reference to the DTD actually used could be included in a preface to the compressed document; this would allow the recipient to seed its decompression engine correctly even if it does not know in advance which DTD the sender used.
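A preface of this kind could look as follows. The layout (a 2-byte big-endian length, the UTF-8 DTD identifier, then the seeded zlib stream), the `SEEDS` registry, and the identifiers `order-v1.dtd` and `order-v2.dtd` are all assumptions made for this sketch; the disclosure does not prescribe a format.

```python
import struct
import zlib

# Hypothetical registry mapping DTD identifiers to the seeds derived from them.
SEEDS = {
    "order-v1.dtd": b"<order><item><name></name></item></order>",
    "order-v2.dtd": b"<order><item><name></name><price></price></item></order>",
}


def pack(data: bytes, dtd_id: str) -> bytes:
    """Prefix the seeded compressed stream with a reference to the DTD used."""
    ref = dtd_id.encode("utf-8")
    comp = zlib.compressobj(zdict=SEEDS[dtd_id])
    body = comp.compress(data) + comp.flush()
    return struct.pack(">H", len(ref)) + ref + body


def unpack(blob: bytes) -> bytes:
    """Read the DTD reference from the preface, select the seed, and decompress."""
    (ref_len,) = struct.unpack_from(">H", blob, 0)
    dtd_id = blob[2:2 + ref_len].decode("utf-8")
    # Raises KeyError for an unknown identifier; a receiver could instead
    # fetch the referenced DTD and derive the seed at this point.
    seed = SEEDS[dtd_id]
    decomp = zlib.decompressobj(zdict=seed)
    return decomp.decompress(blob[2 + ref_len:]) + decomp.flush()
```

With this layout, `unpack(pack(doc, "order-v2.dtd"))` returns `doc` without the receiver being told out of band which DTD version was used for seeding.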