Data transforms for losless compression

Ryan

unread,

Feb 4, 2004, 1:12:49 AM2/4/04

to

The Burrows-Wheeler transform has fascinated me ever since I read Mark
Nelson's DDJ article in '96, but I haven't seen any other significant
transforms intended for lossless compression. Yes, RLE and MTF were
included in that article, but they don't give anywhere near the benefit of
BWT in transforming the data to be less entropic.

Is there another algorithm that, per the FAQ §9:78, "skews the first
order symbol distributions to make the data more compressible with simple
methods," or is BWT the only one that can be used practically for losless
compression of arbitrary data? I've seen info on lossless DCTs applied to
images/sound, can those be gainfuly applied to text compression?

Guenther von Knakspott

unread,

Feb 4, 2004, 2:25:07 PM2/4/04

to

There is a recent thread "Correlation between random data..." where
some moron describes such a transform. It goes something like:

You put your right foot in
You put your right foot out
You put your right foot in
And you shake it all about.
You do the Hokey-Pokey
And you turn yourself around
That's what it's all about!

Then he describes how he can save 1 bit for some files but absolutely
fails to explain why you need to make a complete ass of yourself
beforehand, instead of just applying the bit saving part right away.

Ryan

unread,

Feb 6, 2004, 9:38:49 PM2/6/04

to

"Guenther von Knakspott" <guenther.v...@gmx.de> wrote:
<snip>

> Then he describes how he can save 1 bit for some files but absolutely
> fails to explain why you need to make a complete ass of yourself
> beforehand, instead of just applying the bit saving part right away.

So I shouldn't BWT before I run my arithmetic or huffman coder?

Rainer Nausedat

unread,

Feb 6, 2004, 10:31:29 PM2/6/04

to

In article <2cedncDSLsd...@comcast.com>, "Ryan" <newsgroup
et dullsville dotty com> says...

> ...

> So I shouldn't BWT before I run my arithmetic or huffman coder?

It depends on the data you are going to compress. If memory serves
me right Mark Nelson were explaining in his article why he were
doing RLE and MTF before BWT (at least in a german article on his
article).

If you are looking for an "universal" archiver/compressor BWT or
LZ77 before arithmetic or huffman coding should be a good choice.
If you are compressing audio data (wav etc) or images other coding
shemes might be a better choice.

R.Nausedat

Mark Nelson

unread,

Feb 8, 2004, 3:12:50 PM2/8/04

to

> It depends on the data you are going to compress. If memory serves
> me right Mark Nelson were explaining in his article why he were
> doing RLE and MTF before BWT (at least in a german article on his
> article).

That's not quite right. You want to do RLE before applying BWT as it helps
pathological sort behavior on strings with long repeated sequences. This can
improve performance quite a bit.

MTF is performed after BWT. The BWT output has a lot of order, but it takes
MTF to help expose that order in a way that is very compressible.

--
|
| Mark Nelson - ma...@ieee.org - http://MarkNelson.us
| The Ultimate Compression Resource - http://www.datacompression.info
| Sponsored by Xceedsoft.com - Serious About Components
|

"Rainer Nausedat" <RNau...@speedproject.de> wrote in message
news:c01m5l$5iq$04$1...@news.t-online.com...

Willem

unread,

Feb 8, 2004, 4:59:56 PM2/8/04

to

Mark wrote:
)> It depends on the data you are going to compress. If memory serves
)> me right Mark Nelson were explaining in his article why he were
)> doing RLE and MTF before BWT (at least in a german article on his
)> article).
)
) That's not quite right. You want to do RLE before applying BWT as it helps
) pathological sort behavior on strings with long repeated sequences. This can
) improve performance quite a bit.

Not if you have a decent sorting algorithm that doesn't barf on long
repeated sequences.

) MTF is performed after BWT. The BWT output has a lot of order, but it takes
) MTF to help expose that order in a way that is very compressible.

You should also do a RLE of the zeroes after the MTF, because it tends to
produces so many zeroes irt. other values that you take a big compression
hit because of rounding off in an arith coder.

SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Phil Carmody

unread,

Feb 8, 2004, 7:30:14 PM2/8/04

to

Willem <wil...@stack.nl> writes:

> Mark wrote:
> )> It depends on the data you are going to compress. If memory serves
> )> me right Mark Nelson were explaining in his article why he were
> )> doing RLE and MTF before BWT (at least in a german article on his
> )> article).
> )
> ) That's not quite right. You want to do RLE before applying BWT as it helps
> ) pathological sort behavior on strings with long repeated sequences. This can
> ) improve performance quite a bit.
>
> Not if you have a decent sorting algorithm that doesn't barf on long
> repeated sequences.

It also doesn't help the ababababababababababababababababababa... case.
(Which is not uncommon in situations such as 16-bit audio or graphics formats)

> ) MTF is performed after BWT. The BWT output has a lot of order, but it takes
> ) MTF to help expose that order in a way that is very compressible.
>
> You should also do a RLE of the zeroes after the MTF, because it tends to
> produces so many zeroes irt. other values that you take a big compression
> hit because of rounding off in an arith coder.

This sounds surprising, but entirely believable. Do you have any figures
for the amount of waste due to rounding in many-zero scenarios?

Phil
--
Unpatched IE vulnerability: XSS in Unparsable XML Files
Description: Cross-Site Scripting on any site hosting files that can
be misrendered in MSXML
Reference: http://sec.greymagic.com/adv/gm013-ie/
Exploit: http://sec.greymagic.com/adv/gm013-ie/

Willem

unread,

Feb 9, 2004, 4:43:59 AM2/9/04

to

)> You should also do a RLE of the zeroes after the MTF, because it tends to
)> produces so many zeroes irt. other values that you take a big compression
)> hit because of rounding off in an arith coder.

Phil wrote:

) This sounds surprising, but entirely believable. Do you have any figures
) for the amount of waste due to rounding in many-zero scenarios?

Well, no.
Just that one of the original papers mentioned something like this, I
vaguely recall trying it with and without this step, and there was a
difference, I wouldn't know how large it was.

Something to test/find out perhaps ?