Maximum Entropy Method of Data Encoding for Dictionaries and Texts

IP.com Disclosure Number: IPCOM000102381D
Original Publication Date: 1990-Nov-01
Included in the Prior Art Database: 2005-Mar-17
Document File: 2 page(s) / 79K

Publishing Venue

IBM

Related People

Sharman, R: AUTHOR

Abstract

Conventional methods of data encoding, such as EBCDIC and ASCII, are inefficient for storing and transmitting natural language data such as dictionaries and texts. Conventional compression techniques often use block encoding, which maps larger and larger blocks of data into codewords from which the original data can be recovered by decoding. While such codes perform optimal compression of the data, little is learned about the linguistic nature of the data.

This text was extracted from an ASCII text file.
This is the abbreviated version, containing approximately 52% of the total text.

Maximum Entropy Method of Data Encoding for Dictionaries and Texts

      The method described here creates an efficient compression code
by enlarging the alphabet of symbols used to encode the data, and by
iteratively using a maximum entropy criterion for selecting new
symbols.  It is claimed that linguistically interesting facts about
the data can also be discovered.

      The principal features of the method are, first, to encode a
set of compound symbols, where each compound symbol comprises a word
or other group of at least several letters, as a bit string of the
same length used to encode simple symbols (e.g., a bit string of
length 8, as used to encode EBCDIC and ASCII letters).  The second
feature is a method of finding the optimum symbol set for a body of
training data by maximizing the average information per symbol in a
set containing both compound symbols and simple symbols.
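
      As an illustration of the first feature, the sketch below (in
Python; not from the disclosure, and the compound symbols and table
layout are assumptions) mixes simple and compound symbols in a single
256-entry codebook, so that a whole word costs the same 8 bits as a
single letter:

    # Hypothetical compound symbols; the disclosure does not list a set.
    COMPOUND = ["the ", "and ", "tion", "ing "]
    SIMPLE = [chr(c) for c in range(256)]   # one byte per simple symbol

    # One 256-entry codebook: compound symbols displace the first few
    # byte values (assumed to be otherwise-unused control codes).
    codebook = list(SIMPLE)
    for i, sym in enumerate(COMPOUND):
        codebook[i] = sym
    decode_table = dict(enumerate(codebook))
    encode_table = {s: i for i, s in decode_table.items()}

    def encode(text):
        """Greedy longest-match tokenization; every symbol costs 8 bits."""
        out, i = [], 0
        longest_first = sorted(COMPOUND, key=len, reverse=True)
        while i < len(text):
            for sym in longest_first:
                if text.startswith(sym, i):
                    out.append(encode_table[sym])
                    i += len(sym)
                    break
            else:                       # no compound symbol matched here
                out.append(encode_table[text[i]])
                i += 1
        return bytes(out)

    def decode(data):
        return "".join(decode_table[b] for b in data)

    sample = "the king and the queen"
    assert decode(encode(sample)) == sample
    print(len(sample), "letters ->", len(encode(sample)), "8-bit codes")

      On this sample, 22 letters become 13 fixed-width codes; the gain
comes entirely from the compound entries.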

      Let Sn = {s1, s2, ..., sn} be a symbol set used for encoding a
large body of natural language data, such as a dictionary or a corpus
of running text.

      An example is S26 = {a, b, ..., z}, where |S26| = 26.  This is
the initial starting set from which to encode the data.  The problem
is: for any n with 26 < n <= 256, what is the Sn which results in the
data being encoded in the minimum number of symbols?
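
      Stated in code, the quantity to be minimized is simply the
token count under a candidate symbol set.  A sketch, assuming greedy
longest-match tokenization (the disclosure does not fix the
tokenization rule):

    def encoded_length(text, symbols):
        """Number of codes needed to write `text` with `symbols`.
        Assumes the single letters stay in the set, so every text
        remains encodable."""
        by_len = sorted(symbols, key=len, reverse=True)
        i, count = 0, 0
        while i < len(text):
            sym = next(s for s in by_len if text.startswith(s, i))
            i += len(sym)
            count += 1
        return count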

      The ENTROPY of the symbol set Sn is known from standard
Information Theory.  An expression which can be used to compute the
average amount of self-information in a set of symbols is

                 H(Sn) = - SUM(i=1..n) p(si) log2 p(si)

where p(si) is the probability (relative frequency) of symbol si in
the encoded data.
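
      As an illustration only (the disclosure gives no code), the
same average self-information can be computed from an encoded symbol
stream:

    from collections import Counter
    from math import log2

    def average_self_information(encoded_symbols):
        """H = -sum_i p(si) * log2 p(si), with p(si) estimated as the
        relative frequency of symbol si in the encoded stream."""
        counts = Counter(encoded_symbols)
        total = len(encoded_symbols)
        return -sum(c / total * log2(c / total) for c in counts.values())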

      The following algorithm can be used to create an improved
symbol set Sm from an existing set Sn.  Usually, m > n, although it
...
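
      The description of the algorithm is cut short in this
abbreviated extract.  Purely as a sketch of what one enlargement
iteration could look like (a greedy step; the candidate generation
and selection rule here are assumptions, not the disclosure's stated
procedure):

    from collections import Counter
    from math import log2

    def avg_information(tokens):
        """Average information per symbol, as in the formula above."""
        counts, n = Counter(tokens), len(tokens)
        return -sum(c / n * log2(c / n) for c in counts.values())

    def enlarge(symbols, text):
        """One assumed step: tokenize greedily, propose compound symbols
        from adjacent token pairs, and keep the candidate that maximizes
        the average information per symbol."""
        def tokenize(syms):
            by_len = sorted(syms, key=len, reverse=True)
            toks, i = [], 0
            while i < len(text):
                s = next(t for t in by_len if text.startswith(t, i))
                toks.append(s)
                i += len(s)
            return toks

        base = tokenize(symbols)
        candidates = {a + b for a, b in zip(base, base[1:])}
        best = max(candidates,
                   key=lambda c: avg_information(tokenize(symbols | {c})))
        return symbols | {best}

      Iterating such a step from the initial S26 grows the set toward
the 256-symbol limit of an 8-bit code.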