
Data compression and analysis by macro substitution

IP.com Disclosure Number: IPCOM000244199D
Publication Date: 2015-Nov-23
Document File: 3 page(s) / 60K

Publishing Venue

The IP.com Prior Art Database

Abstract

A data compression scheme that shows the relative complexity and structure of the data in a human-understandable way using repeated macro substitutions.



Disclosed is an algorithm for compressing data, which also shows the structure of the data by identifying repeated elements in a human-understandable way.

    The problem is how to show and compare the complexity of different sequences of symbols. On an Internet mailing list there was a public discussion [1] of what constituted the most complex path through a particular graph that covered all the nodes (a Hamiltonian cycle). There was no exact solution, but one idea was to take each path, compress it using a data compression utility or Lempel-Ziv [2] encoding, and treat the least compressible path as the most complex.

E.g. a path might be


BPPPBBPPPPBPPPBBPPPPBPPPBBPPPP
where P or B represents the choice of edge when leaving a vertex; by manual inspection this can be expressed as
(BPPPBBPPPP)*3
Lempel-Ziv LZ77 compression might encode that as follows:
(offset backwards, length, character)
0,0,B
0,0,P
1,2,B
5,4,P
10,19,P
so for a sequence of 30 characters, we could reserve 5 bits per number and 1 bit per character (B or P), for a total of (5+5+1)*5 = 55 bits.
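
As a sanity check on those triples, a short decoder for the (offset backwards, length, character) form reproduces the path. The following Python sketch is illustrative only (the function name lz77_decode and the hard-coded triples are not part of the disclosure); it assumes overlapping copies are allowed, as usual for LZ77:

def lz77_decode(triples):
    # Rebuild the text from (offset backwards, length, character) triples.
    out = []
    for offset, length, char in triples:
        for _ in range(length):
            out.append(out[-offset])   # copies may overlap the region being written
        out.append(char)
    return "".join(out)

path = lz77_decode([(0, 0, "B"), (0, 0, "P"), (1, 2, "B"), (5, 4, "P"), (10, 19, "P")])
assert path == "BPPPBBPPPP" * 3   # the 30-character example path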

    This does not necessarily show the structure well, though. The simplest path is also of interest to people in the field, and in general the paths that people consider simple have some structure.

    The idea of this invention is to use macro definitions, and replacement of text by a macro name, to condense the input into a more compact form. The macro replacement process starts at the top level, so it emphasises global structure. Also, the way macro replacement is done in this invention sometimes allows replacement with the same macro name in two different contexts, which, although it might not aid understanding, does improve the compression.
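
For example, under a hypothetical notation in which a macro definition is written as name=contents and each later use of the name stands for those contents (the single-letter name A and the = syntax are illustrative assumptions, not fixed by the disclosure), the 30-character path above might condense to

A=BPPPBBPPPP
AAA

which is shorter and makes the three-fold repetition of the 10-symbol block immediately visible.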

Compression:

1. Identify an available macro name.
   1. Find an area of the current input over which it can be used.
      1. Identify a section of the current input which is repeated in this area.
      2. Calculate how much space a macro replacement saves: for a section of length m with n instances, replacement by a macro name of length 1 saves n*(m-1), but the macro definition costs m+2, so the saving = n*(m-1)-(m+2) = (n-1)(m-1)-3 (a worked example follows this list).
      3. Check that using the macro does decompress to the original input.
      4. If it saves more space than the best so far, record it as the best so far.
      5. Repeat for the next section of the current input which is repeated in this area.
   2. Find the next area available for the macro name.
   3. Repeat for the next available area for the macro.
2. Find the next available macro name and repeat from 1.
3. If the best saves space (or takes the same amount of room):
   1. Replace its uses in this area with the macro variable.
   2. Insert the macro definition into the input before the first use of the macro.
   3. Go back to the beginning.
4. Output the final result.
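
As a worked example of the saving formula, the repeated section BPPPBBPPPP in the path above has length m = 10 and n = 3 instances, so replacing it with a one-character macro name saves (3-1)*(10-1)-3 = 15 characters, taking the 30-character path down to about 15 characters including the definition. A minimal Python sketch of a single greedy pass of the loop above follows; the name=contents definition syntax, the pool of single-letter macro names, and the function names are assumptions for illustration, and the sketch searches the whole input rather than separate areas.

import string

def best_macro(text):
    # Search for the repeated section whose replacement by a one-character
    # macro name would save the most space, using the formula above.
    best_saving, best_section = 0, None
    for m in range(2, len(text) // 2 + 1):          # candidate section lengths
        seen = set()
        for i in range(len(text) - m + 1):
            section = text[i:i + m]
            if section in seen:
                continue
            seen.add(section)
            n = text.count(section)                  # non-overlapping instances
            if n < 2:
                continue
            saving = (n - 1) * (m - 1) - 3
            if saving > best_saving:
                best_saving, best_section = saving, section
    return best_saving, best_section

def compress_once(text, names=string.ascii_uppercase):
    # One pass: pick the first unused macro name, apply the best replacement
    # (if it saves space or breaks even), and put the definition before the body.
    name = next((c for c in names if c not in text), None)
    saving, section = best_macro(text)
    if name is None or section is None or saving < 0:
        return text
    return name + "=" + section + "\n" + text.replace(section, name)

print(compress_once("BPPPBBPPPP" * 3))
# A=BPPPBBPPPP
# AAA

Repeating compress_once until the text stops changing corresponds to the "go back to the beginning" step, with each pass introducing a further macro.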

Decompression:

1. Find a macro definition.
2. Expand the contents of the definition itself.
3. Expand uses of the macro (though not inside other unexpanded macro definitions, if that macro has redefined the variable).

Use a stack to identify inside an unexpanded macro definition which mac...
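
The text above is cut off at this point, but the basic expansion it describes can be sketched under the same assumed name=contents syntax used in the examples above. The handling of redefinitions and of uses inside other unexpanded definitions (which the stack is for) is omitted, and the sketch assumes the payload itself never starts a line with a definition-like "=", so this is only an illustrative sketch:

def decompress(compressed):
    # Repeatedly take the leading macro definition and expand its uses in the
    # rest of the text, including inside any later, still-unexpanded definitions.
    lines = compressed.split("\n")
    while lines and "=" in lines[0]:
        name, contents = lines[0].split("=", 1)      # find a macro definition
        lines = [line.replace(name, contents) for line in lines[1:]]
    return "\n".join(lines)

assert decompress("A=BPPPBBPPPP\nAAA") == "BPPPBBPPPP" * 3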