http://www.michael-maniscalco.com/m03.pdf
M03 is an extension of my old M99 algorithm but is, to
my knowledge, the first ever context-based compression
scheme for blocksort-transformed data. The paper shows
how encoding and decoding can be done with respect to
all context switches of the original source with no
extra overhead other than the BWT itself.
- Michael A Maniscalco
Amazing...
Now get to coding... Nothing beats the background work except a demo.
:)
Ernst
Ah, where do I find the time? You might have noticed that the
naming suggests that I have had the idea for well over a year! :^)
First I focused on MSufSort and maybe now that that is out of the
way, this will be next.
- Michael
I understand!
I have become involved with Encoders and it would seem little else :)
Strange sort of life I have to admit.
Good luck as this looks very important.
Ernst
There is no compression here. You can't construct the BWT transformed data
from the character counts without also transmitting the context tree. Or am
I missing something?
-- Matt Mahoney
You don't need to send the context tree, only enough information to
uniquely reconstruct the lower level in the context tree. I did an
experiment with the example worked in the paper, carrying the number of
bits along with the letters as the merges occurred, and relying on the
method to tell me the partition count in the next level (this may or may
not be true, but the paper seems to indicate that this can be done).
For example, if I am merging the order 5 contexts, I create the symbol
mapping [s,m] -> [m,s,1:1], which means "the sequence of letters s
followed by m can be represented as a multi-set containing letter m with
count 1 and letter s with count 1; use the letter at position 1 (0-based)
as the first position when reversing". For the sequence [i,i], the merged
symbol becomes [i(2)]; no bits are spent to determine order, since none
are needed. When two sets are merged, say [m,s,1:1] followed by [i,s,0:1],
the new set becomes [i,m,s(2),[1:1]|[0:1]|[1:2,2:2]], meaning symbols m
and s go to the left set along with its count (1:1), and the remainder
goes to the right set along with its count (0:1). The numbers at the end
of the funny-looking notation contain the required bit counts. They are
probably unnecessary (which I have assumed anyway when counting my bits).
I know the ordering and notation are pretty crappy, but I just want to
accumulate the information and count bits. :-)
Then I worked through the letters upwards, always using as few bits
as possible per letter. Finally I ended up with a set of four
elements, and a string of numbers with weird separators (I didn't assume
unique decodability until this point, in case I decided that you need
extra information to work through the list).
The end-result looked something like this (leave some slack for errors -
it is Friday evening, and I am tired):
[i(4:4),m(1:3),p(2:3),s(4:3),
[[1:1]|[1:2,2:2]]|[0:1]|[1:1,1:1]|
[[1:2,2:2,3:2,3:2]|[0:2]|[0:2,2:2]]]]
At this stage, before answering whether I needed special symbols to group
and order the decoding sequence, I decided to count the bits. As you can
see (well ... maybe), I assigned 4/3 bits per symbol count in the master
set (there are 11 letters, so a simplistic approach would require 4 bits
per symbol). When added up, I have used 35 bits for the whole mess,
optimistic assumptions and all.
In order to represent the BWT string itself, i.e. 11 letters from an
alphabet of four characters, I need only 2*11 = 22 bits.
In either representation, the original BWT transformed string can be
reconstructed, but not the alphabet itself.
To conclude, the approach doesn't seem to help use fewer bits with
this encoding method. The author's undisclosed encoding method might
end up using less than 9 bits, where I need at least 22; my whole approach
may be fairly stupid; the example may not be the best one out there; etc.
But now there are some numbers, and a measurement method.
Enjoy,
vedaT
Hi Matt,
The rough paper does not go into the encoding itself. It just
demonstrates that if the encoding is done in a particular fashion,
that is, order by order, then the decoder reveals the context
boundaries, which implies the original context tree. And as decoding
is done, the exact statistics of a parent context are known when
predicting the statistics of the next higher order contexts.
Remember that we are encoding the symbol which *precedes* each branch
of the context tree, not the context tree itself. This is context based
encoding of the BWT. We exploit the idea that similar contexts at higher
orders are preceded by increasingly similar symbols.
The compression happens because the encoder merges higher order
contexts using only the statistics present in those contexts. And if
all contexts to merge have the same preceding character (which happens
very often at higher orders) then the merge is actually free.
Here's a quick example of a single merge/encode where two higher order
contexts are being merged together ....
[a(2), b(2), c(1)] + [a(1), b(2)] -> [a(3), b(4), c(1)]
encoding is done as ...
A: encode 2 of 3 (2 'a' came from left context).
This leaves the left context with a weight of 3 and the right with
a weight of 2.
B: encode 2 of 4 (2 'b' came from left context).
In this case, the encode is actually 0 of 1, since the right context
only has room left for two symbols and we know that there are 4 'b'
symbols, thus at least two came from the left context.
C: encoding merge finished. Enough information is encoded to decode
correctly (if we know that the original context split is (5,3), which
the paper shows we do).
decoding is done from ...
[a(3), b(4), c(1)] -> [a(2), b(2), c(1)] + [a(1), b(2)]
D. decode a value between 0 and 3 (number of 'a' which came from the
left context). Decode is 2 of 3.
E. decode a value between 0 and 4 (or really 0 and 2, as detailed in B).
Decode is 2 of 4.
F. remaining symbol 'c' has its original context known, since it is all
that remains and the left context is missing one symbol.
decoding completed...
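The steps A-F above can be sketched in code. This is a minimal reconstruction of the idea, not Michael's implementation; the function names and the (value, range) code representation are my own, and a real coder would feed these pairs to an arithmetic coder:

```python
def merge_encode(left, right):
    """Encode how a merged multiset splits back into its two child
    contexts.  left/right map symbol -> count; returns (merged, codes),
    where codes is a list of (value, num_possibilities) pairs -- the
    information an entropy coder would actually have to transmit."""
    merged = dict(left)
    for s, c in right.items():
        merged[s] = merged.get(s, 0) + c
    codes = []
    lroom = sum(left.values())    # slots still unassigned in the left part
    rroom = sum(right.values())   # ... and in the right part
    for s in sorted(merged):
        total = merged[s]
        k = left.get(s, 0)            # how many actually came from the left
        lo = max(0, total - rroom)    # forced minimum from the left
        hi = min(total, lroom)        # forced maximum from the left
        if lo < hi:                   # only code when there is a real choice
            codes.append((k - lo, hi - lo + 1))
        lroom -= k
        rroom -= total - k
    return merged, codes


def merge_decode(merged, left_size, right_size, codes):
    """Invert merge_encode, given the merged counts, the child context
    sizes (which the paper argues the decoder already knows) and codes."""
    left, right = {}, {}
    stream = iter(codes)
    lroom, rroom = left_size, right_size
    for s in sorted(merged):
        total = merged[s]
        lo = max(0, total - rroom)
        hi = min(total, lroom)
        k = lo if lo == hi else lo + next(stream)[0]
        if k:
            left[s] = k
        if total - k:
            right[s] = total - k
        lroom -= k
        rroom -= total - k
    return left, right
```

On the example above this yields codes [(2, 4), (0, 2)]: the 'a' split is a real 0..3 choice (step A), the 'b' split collapses to a binary choice (step B), and 'c' is forced (step F).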
This is exactly how M99 works, except M99 merges lists as powers of 2
rather than intelligently along the actual context boundaries. In this
case, simple adaptive models need to be used which predict values from
0 to N based on how many of a given symbol there are in total, the
ratio of the size of the left context vs. the right context, etc. When
this is done and arithmetic coding is used, my early results show that
the compression is very similar to PPMII.
I know that the paper is a bit difficult to follow. The whole idea
is, actually. But if you stare at it long enough, you can see how the
merging of higher contexts into lower ones does allow the decoder to
decode along the original context tree boundaries without extra
information.
In essence, this makes BWT much like PPM except we predict the
preceding character based on a given context when we know the
statistics of the parent context, which is one order up the tree. I'm
going to re-write the paper over the weekend to hopefully make the
whole thing more obvious.
- Michael
You are on the right path. But encoding can be done far better.
For some rough numbers, this method is an advanced version of my
previous work M99. With M99 you just start by merging every two
symbols into lists of 4, then 4 into 8 etc. until you get one list.
This method uses simple bit packing and gets about 850,000 bytes on 18
files of the Calgary corpus.
The new method is similar but can exploit merging along the context
boundaries. So, adding better probability tables and an arithmetic
coder produced preliminary results around that of PPMII.
My approach was an adaptive set of statistics for all merges where the
total after the merge is less than a given number, say 32.
Then we build probability tables for ...
probability[T][R][N]
where T is the total count of the symbol after the merge,
R is some normalized ratio of the size of the left context vs. the right, say 0 .. 10,
and N is the potential quantity of the symbol in the left context.
You will find that, at higher orders, the probability of any
given symbol belonging to one or the other context is far higher.
This skews the probabilities a lot and gives much better compression.
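A table like that can be sketched as a small adaptive model. The quantization details below (ratio buckets 0..10, all counts seeded at 1, hard cap at 32) are assumptions filling in what the post leaves open; `MergeModel` is not from the post:

```python
class MergeModel:
    """Adaptive probability[T][R][N] tables for merge coding, where
    T is the total count of a symbol after the merge,
    R is the left-vs-right context size ratio quantized to 0..10,
    N is the candidate count of that symbol in the left context."""
    MAX_T = 32  # larger merges would fall back to a static model

    def __init__(self):
        self.freq = {}

    def _slot(self, total, left_size, right_size):
        assert total < self.MAX_T
        r = (10 * left_size) // (left_size + right_size)  # ratio bucket
        # one frequency row per (T, R); counts start at 1 so every
        # outcome 0..total stays codable
        return self.freq.setdefault((total, r), [1] * (total + 1))

    def probability(self, total, left_size, right_size, n):
        row = self._slot(total, left_size, right_size)
        return row[n] / sum(row)

    def update(self, total, left_size, right_size, n):
        self._slot(total, left_size, right_size)[n] += 1
```

Each coded value n updates its own (T, R) row, so the heavily skewed splits seen at higher orders quickly dominate their distributions, which is where the compression win over plain bit packing comes from.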
For the given example, using just bit packing ...
order 5 merges [s1] + [m1] -> 1 bit. result ->[s1, m1]
order 4 merges [i1] + [i1] -> no bits. result->[i2]
order 3 merges [s1] + [s1] -> no bits. result->[s2]
order 2 merges ...
[p1] + [s1] + [m1,s1] -> 3 bits. result -> [m1, p1, s2]
[p1] + [i1] -> 1 bit. result [i1,p1]
[s2] + [i2] -> 2 bits. result [i2, s2]
order 1 merges ...
[m1,p1,s2] + [i1] + [i1,p1] + [i2,s2] -> 9 bits ->[i4, m1, p2, s4]
Then there is the overhead for encoding the symbols and total counts;
for any significantly sized stream this is small.
Total bits with just bit packing: 16 + symbol counts. :^)
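The pairwise merge costs above can be reproduced with a small bit-packing counter. This is my own sketch, not code from the post (`merge_bits` is a hypothetical name), and the three-way and four-way merges listed above may be decomposed differently than the left-to-right pairing I assume, so only the pairwise totals are checked here:

```python
from math import ceil, log2

def merge_bits(left, right):
    """Count the bits (ceil(log2) per symbol) needed to record how a
    pairwise merge of two contexts splits back into its two halves."""
    merged = dict(left)
    for s, c in right.items():
        merged[s] = merged.get(s, 0) + c
    bits = 0
    lroom, rroom = sum(left.values()), sum(right.values())
    for s in sorted(merged):
        total = merged[s]
        lo = max(0, total - rroom)   # minimum that must come from the left
        hi = min(total, lroom)       # maximum that can come from the left
        if hi > lo:                  # a forced split costs nothing
            bits += ceil(log2(hi - lo + 1))
        k = left.get(s, 0)           # actual count from the left
        lroom -= k
        rroom -= total - k
    return merged, bits
```

This gives 1 bit for [s1]+[m1], 0 bits for [i1]+[i1], 1 bit for [p1]+[i1] and 2 bits for [s2]+[i2], matching the lines above, and decomposing [p1]+[s1]+[m1,s1] into two pairwise merges gives 1 + 2 = 3 bits, also matching.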
- Michael