Compression Algorithm for information containing "well known" and "often repeated" content.
Original Publication Date: 2002-Jul-14
Included in the Prior Art Database: 2003-Jun-21
Current compression algorithms fall into a number of categories: File based compression, such as PKZIP*, where white space and other redundant or repeated information is encoded using a dictionary table to contain the repeated strings of content, the resulting smaller file being the dictionary and the encoded content comprising dictionary elements and uncompressed content. At the receiving end the data is reverted to uncompressed form by using the dictionary elements to replace the dictionary references in the encoded content. Such mechanisms are often used to reduce the size of files for storing on disks or transmitting as file attachments, downloading over the Internet from web sites etc. State based compression. Here the compression algorithm is being applied to content being streamed through it in the hope that the amount of data squeezed through the communications link between the compressing station and the receiving station which performs the uncompression or expansion. An example is V.42bis. Again a dictionary is used to encode strings in the content. The dictionary table is finite and gradually fills with the various strings being compressed. Usually an algorithm is applied to throw away the least used dictionary elements in favour of new ones which offer more compression for current content. This permits the compression algorithm to adapt to the content at any period in time and remain efficient but causes the updates of the dictionary to be sent between encoding and decoding stations. Any loss of information on the link questions the validity of the dictionary as without simple communications link recovery of data loss the dictionary must be considered corrupted and therefore needs rebuilding which can cause significant time and throughput/performance impairments on slow or unreliable links. Stateless block compression where a block of content, perhaps chunks of streamed data in a TCP/IP session or a large UDP packet, is encoded using a dictionary and the resulting transmitted block comprises the dictionary for that block and the resulting dictionary elements and uncompressed content. Such schemes are used over links with loss that otherwise could use the state based compression. Because the dictionary is sent with each block the efficiency should be lower than the other two. Rigid compression encoding. In this algorithm, "well known" data is being communicated and it is possible to produce a compression scheme which uses a fixed dictionary to encode and decode the information. It may also be possible to have more than one fixed dictionary where several well known content types are supported but a single dictionary would not offer sufficient entries for efficient use. Simple switching between a small number of such dictionary pages may be a way to accommodate this. An example is WAP Forum's WBXML. Typically, today's modems use V.42bis or MNP5 state based compression as the telephone links are reliable; however V.42bis has been shown to have problems in wireless communication due to the error rates. The benefits of using V.42bis apply to both the compression of content structure, e.g. HTML tags and elements, and the