Today I tried to replace the MTF in my BWT coder with distance codes, like
Edgar Binder described earlier in his message (aaabccca -> 11501100). But it
was very confusing to see after compressing the distance code -output that
it increased the size of calgary corpus by 1 bpb compared to the very basic
MTF scheme (note that I did not use any special modeling, just plain simple
order-0). I've clearly misunderstod something here. Can anybody give me a
hint?
Regards,
Sami
Sami wrote:
Sorry, this is my fault. I thought I made it clear that I didn't make
any exclusions in this example... Well, there are at least three DC
properties that can be used for that:
#1. Obviously, there are always some known symbols when encoding a
distance, and their locations can be excluded. This means "aaabccca" -->
11301100 (I assume the start configuration "a..bc..." to be known to the
decoder)
#2. Probably not so obvious, one doesn't have to encode distances when
the next symbol is unknown to the decoder, because it has to be equal to
the current symbol (if that wouldn't be true, the symbol would remain
unknown after the encoding of the distance --> impossible). On the
example, with known start configuration asssumed this means:
"aaabccca" --> 300 (this property is equivalent with a "free" RLE)
#3. Not all distances are needed for successful decoding of the input
(encoding should stop when all symbols
are known):
"aaabccca" --> 30 (this should reduce the cost of encoding the initial
configuration)
So, let's try to decode 30, with "a..bc..." as starting configuration.
Pos #1, "a..bc...". The next symbol is unknown, so it must be an "a"
Pos #2, "aa.bc...". Next symbol unknown, must be "a"
Pos #3, "aaabc...". Decoded distance = 3, "b", "c" positions skipped
Pos #4, "aaabc..a". Decoded distance = 0, "b" doesn't appear anymore
Pos #5, "aaabc..a". Next symbol unknown, must be "c"
Pos #6, "aaabcc.a". Next symbol unknown, must be "c"
Pos #7, "aaabccca". Decoding finished (successfully :-)
Well, I realize this isn't the best explanation but I hope it makes this
clear.
The scheme described here achieves with simple order-0 modelling
compression equivalent with BZip. It depends though on your particular
way of handling big distances how close it'll get. My implementation of
straight distance coding (same as used in the public version of DC) gets
with order-0 modelling about 229k on book1, wich I think it's fairly
close to the optimum...
--
Edgar
Thanks for the reply and description Edgar. It's really not your fault that
I got it all wrong, this happens when I don't think and just try things out.
I'll try again and do it as you described and see if I got it right this
time. :-)
Regards,
Sami
The distance coding is efficient on short distances but longer ones
doesn't seem to be a correct measurement of a symbol's probability.
I think distance coding or "inversion frequencies" are interesting
enough to deserve more r&d though.
Have anyone tried any other variations of this?
Maybe some kind of hybrid?
Mikael
In article <390B6254...@T-Online.de>,
Binder Edgar <Binde...@T-Online.de> wrote:
> Hi,
>
> Sami wrote:
>
> > Today I tried to replace the MTF in my BWT coder with distance
> > codes, like Edgar Binder described earlier in his message (aaabccca
> > -> 11501100). But it was very confusing to see after compressing the
> > distance code -output that it increased the size of calgary corpus
> > by 1 bpb compared to the very basic MTF scheme (note that I did not
> > use any special modeling, just plain simple order-0). I've clearly
> > misunderstod something here. Can anybody give me a hint?
>
> Sorry, this is my fault. I thought I made it clear that I didn't make
> any exclusions in this example... Well, there are at least three DC
> properties that can be used for that:
>
> #1. Obviously, there are always some known symbols when encoding a
> distance, and their locations can be excluded. This means "aaabccca" -->
> 11301100 (I assume the start configuration "a..bc..." to be known to the
> decoder)
>
Sent via Deja.com http://www.deja.com/
Before you buy.
mikael.l...@spray.se wrote:
> The distance coding is efficient on short distances but longer ones
> doesn't seem to be a correct measurement of a symbol's probability.
You presume that because distance can be bigger than 255, right ?
Well, this seems to be compensated by exclusions... If you want, you can
try doing a small experiment with DC - namely compress with everything
disabled but the distance coder some text file (or any other file that
doesn't compress with RLE). Well, if you compare the result with an
order-0 arithmetic coder you'll find out that DC compresses as well / or
slightly better. The public distance coder uses only order-0 statistics
to do that, therefore it must be optimal in this sense... (you can make
the same experiment with MTF if you think it's necessary :-)
> I think distance coding or "inversion frequencies" are interesting
> enough to deserve more r&d though.
Yes, they're quite interresting....
> Have anyone tried any other variations of this?
> Maybe some kind of hybrid?
Well, I tryied a few and some are slightly better for some class of
files, but there isn't any constant improvement. Straight distance
coding seems to have the best speed/compression tradeoff.
--
Edgar
P.S. I might be unable to reply to the next messages as I'm leaving
tomorrow to Italy for a week...