Higher order compressor details

Skip to first unread message

Paul E. Campbell

Jan 23, 1995, 6:28:05 PM1/23/95
Judging by my mail box, there was a *lot* of interest in higher order
compressors (no surprise). The comp.compression FAQ is also rather vague
in that area. So I'm going to extend it a little.

First, arithmetic compression was very well described by Witten, Neal, and
Cleary in "Arithmetic Coding for Data Compression", _Communications of the
ACM_, June 1987, Vol. 30, #6, p/ 520-540. I can't possibly do that justice to
arithmetic compression compared to that article. The article includes source
in C as well. Source is available for ftp from ftp.cpsc.ucalgary.ca in
directory /pub/projects/arithmetic.coding. For others, see the FAQ.

The general idea is this: We have an input string of symbols. We want to
generate an encoded string as a result. We have a model which tells how to
encode the input string.

The basic order 0 model is described and source is given in that article.
Compression is achieved by assigning fewer bits to more probable symbols.
We will assume an adaptive model here, so all the symbol probabilities are
based on the string as far as it has been processed so far. This is nothing

What arithmetic compression will give you is that the actual encoding process
is a purely mechanical function. Within some arbitrarily small bounds,
arithmetic encoding will automatically encode the symbol at the lowest entropy
it can based on the information from your model. The tough part is designing
a good model of the input string.

Okay, now here's the change. Suppose we are dealing with English text and
you are doing the encoding. If you see the string "tio", you will almost
always assume the next letter is "n" rather than "e" or "s" or a space or
some other more common symbol.

Why? Because when you look at the letter following the sequence "tio" you
can rule out most of the other combinations as improbable. The sequence
"tio" is a context. The probability of a particular symbol change depending
on the context you are assuming.

The order 0 model assumes no context at all. Every symbol is encoded strictly
based on the probabilities of the preceding symbols. Higher order models are
based on a past history of a few symbols (usually 1-5). The higher order
models will increase their predictive power up to a certain point, then they
will drop off. For instance, after seeing "tion ", what is the next letter?
It could be almost anything. There is a finite limit to the order of the
model before you start reaching the point where the gain in coding cuts
into your memory requirements (high order models take up lots of memory) and
speed (they also take a while to search).

DMC and PPM and various other "higher order compressors" are all based on the
simple extension of arithmetic encoding where you look at the previous
context of the source string to come up with your symbol table probabilities.

The original article on PPM is "Data Compression using adaptive coding and
partial string matching," by J. Cleary and I. Witten, IEEE Transactions on
Communications, Vol. COM-32, p. 396-402, Apr. 1984.

Cleary and Witten proposed this general idea:

Store a table. Context 0 contains the basic frequencies. Context 1 contains
frequency tables for the symbol following each symbol from context 0. Each
higher level branches out further. This is the generic model so if just
raw arrays were used, the storage for the frequency tables for an order 3
model would be 256+256^2+256^3+256^4 frequencies, which is why the "perfect"
model is never actually used.

So, just store the symbol combinations that have occurred so far (some sort
of garbage collection will be necessary just like in regular arithmetic
compression). When you run into a symbol that has not been seen before,
encode an "escape" and try encoding it with the context one order down.
Keep escaping until at least one context manages to make a match (the
context 0 model should always match).

One of the basic problems is what probability should be assigned to the
escape sequence? 3 suggestions have been made. They are all detailed
in "Implementing the PPM Data Compression Scheme", by Alistair Moffat,
_IEEE Transactions on Communications_, Vol. 38, #11, November 1990,
p. 1917-1921. I will use his notation for describing them.

In "Algorithm A", each symbol occurrance has a count. The escape probability
is "1" always. So the probability of "escape" is the inverse of the number
of times the context has been used except that the number of times the
context has been used is inflated by 1 to make room for the extra probability.

In "Algorithm B", the number of times the context has been used remains
normal but the probability for each symbol is decremented one. The escape
count is equal to the number of symbols.

The problem is that a symbol has to be "double counted" for algorithm B before
the symbol is recognized as in the context. The count of "1" for algorithm A
tended to mean the escape was given an artificially low probability when the
model was first being built up. Thus, the escape mechanism of Algorithm C.
The escape probability is added into the total count. Escape is equal to the
total number of symbols in the context.

These methods are still not really satisfactory. "HA", "Ashford", and others
have also tuned their escape probabilities but they haven't published the
exact details.

The usual problems with flushing the string buffer that Lempel-Ziv
derived algorithms have also occur with Markov modelling systems. Similar
to LZW, either throwing away the table and building it back on a buffer
of data or LRU seems to be the most common technique.

So far, the compression technique has mostly only considered the case of
bytes. Cleary and Witten do some work with regard to arbitrary alphabets,
but some specific examples exist for encoding on a bit level.

DMC is one version. It is described in "Data Compresion using Dynamic
Markov Modelling," by Gordon Cormack and Nigel Horspool, _Computer Journal_,
v. 30, #6 , p. 541-550, December 1987. Cormack and Horspool will tell you
that DMC is totally different from other Markov modellers because they used
Finite Automata for their model, but Bell and Moffat ("A note on the DMC data
compression scheme", _The Computer Journal_, V. 32, p. 16-20) went back and
rederived the original work and showed that the models are identical based
on the implementation Cormack and Horspool used (although leaving the door
open for some innovative work).

Urban is another version. It was one of the entries to the Dr. Dobbs data
compression contest. See the comp.compression FAQ for details because I
don't have any.

Both of these programs show better performance when you rank them against
object code. The prevailing theory is that the programs can capitalize on
the fact that certain bit patterns in machine instructions share common
formats with the bit patterns being broken up into various "fields" that
the byte-level compressors miss and also because of addresses in particular
code fragments showing similarity to others in the same fragment. Obviously,
these algorithms should do even better against byte-level systems if they
are used on some types of image data (such as bitmaps).

So, if someone is planning on extending the zip engine using one of these
monsters, I suggest that you implement both a bit-level and a byte-level
compressor. Also make absolutely sure that they can be turned OFF! These
things are notoriously slow, but the memory requirements can be lowered
to a tolerable level with a little work (though nothing you're going to see
in a Doublespace or other related compressing file system any time soon).

Reply all
Reply to author
0 new messages