Re: BWC by Willem Monsuwe - compression question

Message has been deleted

Willem

unread,

Oct 24, 2005, 4:49:18 AM10/24/05

to

Lyle wrote:
) Hi,
) I am going through the BWC algorithm and I am wondering why in the
) arithmetic coder stage, the compressor outputs the model each time
) before outputting each byte of the MTF
)
) For example, the code does:
) model_symbol(AVB_04_07, AM_BASE, as);
) model_symbol(i - 0x04, AM_04_07, as);
) Before outputting a number in the 4 - 7 range...

It's not outputting the model, it's outputting the symbol range,
and then the specific symbol within that range. In other words, one
number is encoded as two symbols. That's because the model is so skewed
that you'd run into accuracy problems otherwise.
Isn't there a link to some paper describing the algo and model ?

)>From my understanding, in an adaptive arithmetic coder, the model need
) not be explicitely stated in the stream correct? Couldnt the
) decompressor keep track of it?
)
) PS: I am asking because I am getting poorer results compared to just
) bzip2 which uses a huffman coder in the last stage.

That's probably because the model is poorer than the one in bzip2.
I say probably because I can't make head or tail from the bzip2 code,
and I can't really remember the bwc code all that well.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Message has been deleted

Willem

unread,

Oct 24, 2005, 1:41:38 PM10/24/05

to

Lyle wrote:
) Thanks for the reply. I'm not sure I follow when you say model is
) skewed and you may run into accuracy problems... is this different than
) a standard adaptive arith coder?

Well, the probability of a '0x01' is a lot of orders of magnitude
greater than the probability of a '0xf0'.

A normal arith coder with 32 bits of accuracy would run into problems
with such a large difference. I think.

) I couldnt find a link to the paper in the tar ball I have... I would
) appreciate it if you could direct me at it.

I was mistaken; the link is in the bzip2 tarball, the one by Peter Fenwick.

ftp://ftp.cs.auckland.ac.nz/pub/staff/peter-f/ACSC96paper.ps

) PS: bzip2 uses an adaptive huffman, starting with all symbols equi
) probobal and updates the model on the fly in both the encoder and
) decoder

Isn't there some model switching code in there ?

Anyway, the only real advantage bwc still has over bzip2 is the ability
to use much larger blocksizes. I haven't touched it in years.

Message has been deleted

Willem

unread,

Oct 24, 2005, 2:34:18 PM10/24/05

to

Lyle wrote:
) In my data sets, I have a large number of 0s in the input stream (about
) 60% of the file is 0s) with scattered bytes that are in the high ranges
) (0x80 to 0xff).
)
) So I would like to modify the BWC code to not code the symbol range
) into the stream itself but rather use a more traditional adaptive
) arithmetic coder instead... I bet I can beat bzip2 on my data sets with
) this change.
)
) Any pointers on how I can go about doing that?

Not that easy, because the mtf code is hacked to match the model.

) I am assuming I need to start the modification in mtf_send_block and
) mtf_get_block. But I cant figure out the implications on the
) model_symbol and subsequent functions if I make this change.

I think you should be able to rewrite the main loops of those two
functions, (and aditionally rewrite the init code for the models).

(I hope that) I wrote the modeling code in a generic way, so that
shouldn't be too hard.

Message has been deleted

Willem

unread,

Oct 26, 2005, 3:16:41 AM10/26/05

to

Lyle wrote:
) BTW Willem, do you (or anyone) still maintain the BWC code? I have a
) few problem test files that cause some problems for bwc...
)
) Attached is one that causes bwc to hang (loop) with the cpu pegged at
) 100%

Nope, sorry.. Haven't maintained it in years, it was mostly a pet project
and I've dropped it around when bzip2 came along. To me, it was mostly
a test to see how much improvement one could get at larger blocksizes
(over a megabyte) and an exercise in optimisation for the bwt step.

I could take a look if I have some spare time, which happens reasonably
often in my current job.