
Compression Technique for Text Character Streams

IP.com Disclosure Number: IPCOM000042466D
Original Publication Date: 1984-May-01
Included in the Prior Art Database: 2005-Feb-03
Document File: 2 page(s) / 15K

Publishing Venue

IBM

Related People

Herrold, PR: AUTHOR [+2]

Abstract

This compression technique exploits the correlation between characters in a data stream while using only small amounts of storage and CPU time. A classic problem in the data processing and word processing industries is how to store textual data most efficiently with a minimum of overhead for compression and decompression. The storage and performance constraints are especially critical in small-machine applications, such as stand-alone word processors and intelligent terminals. Two interrelated characteristics of the text stream suggest compression possibilities:

1. Some characters occur far more often than others; that is, the probability distribution of characters is non-uniform.

2. The characters are not randomly distributed but form patterns of words.

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 52% of the total text.



This compression technique exploits the correlation between characters in a data stream while using only small amounts of storage and CPU time. A classic problem in the data processing and word processing industries is how to store textual data most efficiently with a minimum of overhead for compression and decompression. The storage and performance constraints are especially critical in small-machine applications, such as stand-alone word processors and intelligent terminals. Two interrelated characteristics of the text stream suggest compression possibilities:

1. Some characters occur far more often than others; that is, the probability distribution of characters is non-uniform.

2. The characters are not randomly distributed but form patterns of words. The probability of the occurrence of a given character is a function of the previous character; in English text, for example, 'U' is highly probable after 'Q'.

The following describes a compression technique that takes advantage of these two characteristics without requiring unreasonable amounts of storage or processor power for execution. This technique has demonstrated 10% greater compression than standard variable-length coding in dictionary applications for the Japanese language.

1. A statistical analysis is done to determine the probability distribution for each of the unique character values C(i) as a function of the previous character: P(C(i)) = f(C(i-1)) is determined for all i. The result is an array of counts of the occurrences of specific character values in the entire set of text data to be compressed. The array has 26 count values (for the simple English example) by 26 values (for each of the C(i-1) values), so a total of 676 counters is retained. (Obviously, practical applications will have larger alphabets, even in English.) This data can also be interpreted as a list of 26 probability distributions.

2. Each of the 26 probability distributions for a given C(i-1) is sorted by probability (counter value). When this sorting is done, the original character value for each count is saved.

3. The 26 sorted probability distributions are summed: T(k) = P(1,k) + ... + P(i,k) + ... + P(n,k), where k is the index for the sorted probabilities (the old i).

4. At this point there are two elements of data that are used:

   a. The mapping of sorted probabilities to the original character values can be used as a table for translating characters.

   b. The summed probability distribution is used as the basis of a variable-length encoding technique. The number of bits assigned to each of the 26 "codes" is given by Number of Bits = -log2(P), where P is the probability of the given code. This technique is well known for data compression.

Actual compression is done by translating each character C(i) in turn and then encoding the translated pa...
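Steps 1 through 4 above can be sketched in Python. This is a minimal illustration under stated assumptions, not the original implementation: the function names, the restriction to a 26-letter lowercase alphabet, and the rounding of -log2(P) to whole bit counts are my own choices, since the disclosure gives no code.

```python
# Sketch of the disclosed technique: per-context bigram statistics,
# rank translation tables, and a summed distribution for coding.
# (Helper names and the lowercase alphabet are assumptions, not from
# the original disclosure.)
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def build_tables(text):
    """Steps 1-3: bigram counts, per-context sort, summed distribution."""
    # Step 1: a 26 x 26 array of counters -- occurrences of each character
    # value C(i) as a function of the previous character C(i-1).
    counts = {prev: {c: 0 for c in ALPHABET} for prev in ALPHABET}
    for prev, cur in zip(text, text[1:]):
        if prev in counts and cur in ALPHABET:
            counts[prev][cur] += 1

    # Step 2: sort each context's distribution by descending count,
    # saving the original character value at each sorted position.
    # translate[prev][char] is the rank k the character maps to when
    # it follows prev (rank 0 = most probable in that context).
    translate = {}
    total = [0] * len(ALPHABET)
    for prev in ALPHABET:
        ranked = sorted(ALPHABET, key=lambda c: counts[prev][c], reverse=True)
        translate[prev] = {c: k for k, c in enumerate(ranked)}
        # Step 3: sum the sorted distributions, T(k) = P(1,k)+...+P(n,k).
        for k, c in enumerate(ranked):
            total[k] += counts[prev][c]
    return translate, total

def code_lengths(total):
    """Step 4b: Number of Bits = -log2(P) for each rank's probability."""
    grand = sum(total) or 1
    return [max(1, round(-math.log2(n / grand))) if n else len(ALPHABET)
            for n in total]
```

The point of the translation step is that, after each character is replaced by its rank within its context, the low rank values are frequent regardless of which character preceded, so a single variable-length code table derived from the summed distribution T(k) serves all 26 contexts.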