Two-Phase 4-Huffmann Data Compression Algorithm

IP.com Disclosure Number: IPCOM000118419D
Original Publication Date: 1997-Feb-01
Included in the Prior Art Database: 2005-Apr-01
Document File: 6 page(s) / 182K

Publishing Venue

IBM

Related People

Trius-Chassaigne, F: AUTHOR

Abstract

This is an algorithm which, by applying regular compression techniques in the order explained below, compounds their compression effects to obtain extremely compact compressed blocks that contain enough information for lossless decompression. The algorithm uses two phases: 1. In the first phase, character-oriented redundancy is suppressed by 'tokenizing' the input buffer, thus obtaining a result buffer which is already smaller than the original one. Every occurrence of a repeated string except the first is translated into an 'ESC' character code, plus a relative distance code and a 'run length' code. This is a 'normal technique' used by most data compressors, compilers with tokenized forms, and transaction systems.

This is the abbreviated version, containing approximately 22% of the total text.

Two-Phase 4-Huffmann Data Compression Algorithm

      This is an algorithm which, by applying regular compression
techniques in the order explained below, compounds their compression
effects to obtain extremely compact compressed blocks that contain
enough information for lossless decompression.  The algorithm uses
two phases:
  1.  In the first phase, character-oriented redundancy is
       suppressed by 'tokenizing' the input buffer, thus obtaining
       a result buffer which is already smaller than the original
       one.  Every occurrence of a repeated string except the first
       is translated into an 'ESC' character code, plus a relative
       distance code and a 'run length' code.  This is a 'normal
       technique' used by most data compressors, compilers with
       tokenized forms, and transaction systems.  To achieve its
       objective, any regular programming technique can be used by
       scanning the source buffer and using:
      o  An indexed data dictionary or
      o  A hash table or
      o  An inverted tree or
      o  A set of 'already appeared' bitmaps or
      o  A set of linked lists with offsets to repeated-header
          data or
      o  Full scanning of the previous data.

    Some of these techniques dramatically reduce the time required
to look for a given set of bytes in the already-processed stream,
but even well-known sequential scanning can serve this purpose.
Basically, this phase can be summarized as:
          i = 0
          while i < buffer length
  (* 1 *)   if (buffer(i)..buffer(i+N) appeared before in the
               source buffer at position j) and (N > 3)
              OUTPUT ESC, N, i-j
              i = i + N
            else
              OUTPUT buffer(i)
            endif
            i = i + 1
          end

    It is the checking at (* 1 *) that can be done using any of the
above-mentioned methods.
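
    As an illustration only, the following C sketch implements this
phase with the simplest of the listed options, full scanning of the
previous data.  The token layout (one byte each for the run length
and the relative distance), the window size, and the names ESC_CODE
and tokenize() are assumptions of the sketch, since the disclosure
does not fix them.

    #include <stddef.h>

    #define ESC_CODE  0xFF   /* assumed value of the 'ESC' code   */
    #define MIN_MATCH 4      /* N > 3: encode matches of 4+ bytes */
    #define MAX_MATCH 255    /* assumed one-byte run length field */
    #define MAX_DIST  255    /* assumed one-byte distance field   */

    /* Phase 1 sketch: replace repeated strings with ESC + run
       length + relative distance, implementing the (* 1 *) check
       by full scanning of the previously processed data.  'out'
       must hold at least 'len' bytes; escaping of literal ESC
       bytes is left out (the second phase distinguishes raw data
       from tokens).                                              */
    static size_t tokenize(const unsigned char *src, size_t len,
                           unsigned char *out)
    {
        size_t i = 0, o = 0;

        while (i < len) {
            size_t best_len = 0, best_dist = 0;
            size_t start = (i > MAX_DIST) ? i - MAX_DIST : 0;

            /* (* 1 *): scan every earlier position in the window
               for the longest match starting at i                */
            for (size_t j = start; j < i; j++) {
                size_t n = 0;
                while (i + n < len && n < MAX_MATCH
                       && src[j + n] == src[i + n])
                    n++;
                if (n > best_len) {
                    best_len = n;
                    best_dist = i - j;
                }
            }

            if (best_len >= MIN_MATCH) {
                /* OUTPUT ESC, N, i-j */
                out[o++] = ESC_CODE;
                out[o++] = (unsigned char)best_len;
                out[o++] = (unsigned char)best_dist;
                i += best_len;
            } else {
                /* OUTPUT buffer(i) */
                out[o++] = src[i++];
            }
        }
        return o;             /* size of the tokenized buffer */
    }

    A real implementation would size the length and distance fields
to match the buffer, and would replace the inner full scan with one
of the faster lookup structures listed above.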

          In the current implementation, a set of linked lists has
been used in order to improve performance.  There are several linked
lists, each one holding all the offsets of previously analyzed data
which begins with the same values.  This way, we avoid searching all
previous data in the stream and just scan from the positions with the
same 'headings'.  In addition, if the data does not fit into one
complete buffer, previous data can be used as a 'seed' for finding
matches, even if it is not directly encoded.
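
    The 'linked lists' structure described above can be sketched as
follows.  This is only one plausible reading of the description: the
lists are indexed here by the first byte of the data (its 'heading'),
and every analyzed offset is prepended to the list for its leading
byte, so that the match search at (* 1 *) only walks offsets that
begin with the same value.  The node layout, the single-byte heading,
and the helper names are assumptions of the sketch.

    #include <stdlib.h>
    #include <stddef.h>

    /* One node per previously analyzed offset; offsets whose data
       begins with the same value share one chain.                */
    struct offset_node {
        size_t offset;
        struct offset_node *next;
    };

    /* 256 list heads, one per possible leading byte ('heading');
       the heads must be zero-initialized before use.             */
    struct offset_index {
        struct offset_node *head[256];
    };

    /* Record that position i of the buffer has been analyzed. */
    static void index_insert(struct offset_index *idx,
                             const unsigned char *src, size_t i)
    {
        struct offset_node *node = malloc(sizeof *node);
        if (node == NULL)
            return;                     /* out of memory: skip it */
        node->offset = i;
        node->next = idx->head[src[i]]; /* prepend to the chain   */
        idx->head[src[i]] = node;
    }

    /* Return the chain of earlier offsets whose data begins with
       the same value as src[i]; the match search at (* 1 *) only
       walks this chain instead of the whole previous stream.     */
    static struct offset_node *
    index_candidates(const struct offset_index *idx,
                     const unsigned char *src, size_t i)
    {
        return idx->head[src[i]];
    }

    In the tokenize() sketch above, the full scan over all previous
positions would then become a walk over index_candidates(), with
index_insert() called for each position as it is passed; the index
would be reset when a new buffer is started, or kept as the 'seed'
mentioned above.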
  2.  This phase is where the originality of the algorithm lies.
       Since the previous phase already did some 'encoding' of the
       data, we can now distinguish between:
       - Raw data not encoded in the first phase (Char data)
       - Repeated 'tokens' encoded in the first phase (Beginning
         with the 'ESC' code) (Re...