
BWT and different alphabet orders...


Michael A. Rouse

Apr 19, 2000
I was wondering if anyone has played around with the
Burrows-Wheeler Transform using ordering methods different from
the standard alphabetical order (abcdefg...). (Actually, I'm pretty
sure someone out there has done so -- it's too cool an idea to pass
up -- but I haven't seen any results posted.)
For example, the blocks could be sorted by how often each
character occurs in the text (which would require transmitting the
order with the compression scheme) or by a built-in "standard"
frequency order (etaonisrh...). Or you could treat each block like a
huge binary number, and sort them from smallest to largest (which
might not do anything but take eight times as long and scramble the
text effectively).
It might even be possible to make a listing of digram
frequencies (digrams are two-letter combinations that commonly appear
in text, like th, he, an, in, er, re, es, etc.) and use it to create a
sorting order that compresses a bit better than normal. Since the
(last column)-(sorted first column) set makes digrams as well, perhaps
some relation between the normal digram frequency list and the BWT
generated list could be used to squeeze a little bit more order out of
the set.
Anyway, if anyone has done research on this they'd like to
share, I'd be interested in hearing about it. Thanks!
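[The idea above can be sketched in a few lines of Python -- a naive
rotation-sorting BWT, not the suffix-array construction a real
compressor would use, and the "$" sentinel is an assumption of the
sketch:]

```python
def bwt(text, order):
    """Naive Burrows-Wheeler Transform with a custom alphabet order.

    `order` lists every symbol of `text` in the desired sort priority;
    a '$' sentinel (assumed absent from the text) always sorts first.
    """
    rank = {c: i for i, c in enumerate(order)}
    rank["$"] = -1  # sentinel sorts before everything
    s = text + "$"
    rotations = sorted((s[i:] + s[:i] for i in range(len(s))),
                       key=lambda r: [rank[c] for c in r])
    return "".join(r[-1] for r in rotations)  # last column

# Same input, two orderings -- the output column (and its runs) differ:
print(bwt("banana", "abn"))  # plain alphabetical -> annb$aa
print(bwt("banana", "nba"))  # most-frequent-first -> aaa$nnb
```

[Any ordering scheme -- frequency order, digram-derived order, etc. --
just changes the `order` string; the transform itself is unchanged.]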

Michael Rouse
mro...@cdsnet.net

kaptke...@my-deja.com

Apr 20, 2000
Hi,

Brenton Chapin did some work on the idea. The site is
http://www.cs.unt.edu/~chapin.
On English text, the sorting order aeiou... usually yields some (very
small) improvement over the standard 0-255 ordering.

I was thinking of experimenting with order-1 stats gathered before the
BWT and perhaps even renaming the symbols after the BWT based on these
stats. That is, identify the symbols which have the most predictable
preceding symbols and group them together by modifying the sorting
order. Then, after the transform, rename these preceding symbols by
overall probability (or perhaps only the more predictable symbols).
Thus, if prior to the BWT the symbol A was preceded by B, C, and D
with probabilities of, say, 35%, 10%, and 7% (ignore the rest), and W
was preceded by X, Y, and Z with probabilities of 70%, 3%, and 2%
(ignore the rest), then we could sort with the order AW... and, after
the sort, rename all the B, C, and D preceded by A as 0, 1, and 2.
Also, rename all the X, Y, and Z that precede W as 0, 1, and 2. The
two techniques combined should put the greatest concentration of
symbols together after the BWT with just a little overhead.
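[The statistics-gathering step might look roughly like this -- the
function name is hypothetical, and the renaming pass itself is
omitted:]

```python
from collections import Counter, defaultdict

def predecessor_ranks(text):
    """Order-1 stats: for each symbol, rank its predecessors by how
    often they occur (the B,C,D-preceded-by-A counts described above)."""
    pred = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        pred[cur][prev] += 1
    # most frequent predecessor gets rank 0, the next rank 1, ...
    return {c: {p: r for r, (p, _) in enumerate(cnt.most_common())}
            for c, cnt in pred.items()}

ranks = predecessor_ranks("abracadabra")
print(ranks["a"])  # 'r' is a's most common predecessor -> rank 0
```

[Renaming would then map each symbol, in its context, to its rank --
concentrating the post-BWT output on small values at the cost of
transmitting (or reconstructing) these tables.]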

-Michael A. Maniscalco



mikael.l...@spray.se

Apr 20, 2000
Hi!
Well, my experimentations have indicated that it's not worth the effort
to *compute* the order each time. A simple static reordering is good
and close enough to the optimal.
The order that is suggested in Chapin's paper is not the best possible
though, it's close but with a little experimentation you should easily
find something better.

Mikael


kaptke...@my-deja.com

Apr 20, 2000
In article <8dn2e2$f83$1...@nnrp1.deja.com>,

mikael.l...@spray.se wrote:
> Hi!
> Well, my experimentations have indicated that it's not worth the
> effort to *compute* the order each time. A simple static reordering is
> good
> and close enough to the optimal.
> The order that is suggested in Chapin's paper is not the best possible
> though, it's close but with a little experimentation you should easily
> find something better.
>
> Mikael


This is probably true for any specific type of data like English text;
however, I was thinking of a more general technique to apply to each
unique data stream. And all things considered, I can't believe that
the technique is all that time-consuming, especially if you commit
yourself to only considering those symbols which show a higher
predictability for previous symbols and ignore the rest. My guess is
that only a few symbols for any data might qualify, so it might not
take that much time to model just those few.

-michael

s...@nospam.unt.edu

Apr 20, 2000
kaptke...@my-deja.com wrote:
> In article <8dn2e2$f83$1...@nnrp1.deja.com>,
> mikael.l...@spray.se wrote:

>> Well, my experimentations have indicated that it's not worth the
>> effort to *compute* the order each time. A simple static reordering
>> is good and close enough to the optimal. The order that is
>> suggested in Chapin's paper is not the best possible though, it's
>> close but with a little experimentation you should easily find
>> something better.

> This is probably true for any specific type of data like English text;
> however, I was thinking of a more general technique to apply to each
> unique data stream. And all things considered, I can't believe that
> the technique is all that time-consuming, especially if you commit
> yourself to only considering those symbols which show a higher
> predictability for previous symbols and ignore the rest. My guess is
> that only a few symbols for any data might qualify, so it might not
> take that much time to model just those few.

Since you're talking about the paper I wrote with Chapin, I thought
I'd respond!

We did indeed look at ways of computing a reordering -- basically, you
want to group symbols together that when used as contexts have
"similar" distributions. This way you're not paying as big a cost
every time you switch contexts. The technique we used was to try
different metrics for measuring the difference between distributions,
and then to solve a TSP problem on the resulting graph. That
will give (if the metric is good) a way of minimizing the cost of
switching between distributions.
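[A rough sketch of that pipeline, with two stand-in assumptions: a
greedy nearest-neighbour chain in place of a real TSP solver, and
total-variation distance as one possible metric -- neither is
necessarily the paper's exact choice:]

```python
from collections import Counter, defaultdict

def reorder_alphabet(text):
    """Chain together symbols whose successor distributions are similar,
    so neighbouring contexts in the new order behave alike."""
    follow = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        follow[a][b] += 1

    def dist(x, y):
        # total-variation distance between the two successor distributions
        tx, ty = sum(follow[x].values()), sum(follow[y].values())
        keys = set(follow[x]) | set(follow[y])
        return sum(abs(follow[x][k] / tx - follow[y][k] / ty)
                   for k in keys) / 2

    # greedy path: start anywhere, repeatedly hop to the nearest unused
    # symbol (a cheap stand-in for solving the TSP on the distance graph)
    syms = sorted(follow)
    order, rest = [syms[0]], set(syms[1:])
    while rest:
        nxt = min(rest, key=lambda s: dist(order[-1], s))
        order.append(nxt)
        rest.remove(nxt)
    return "".join(order)

print(reorder_alphabet("abracadabra"))  # some permutation of "abcdr"
```

[The result is then used as the sort order for the BWT in place of the
natural 0-255 ordering.]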

The problem is that if you compute the ordering you also have to
encode it. For a short file, the overhead of encoding the alphabet
ordering usually cancels out most of the improvement. Only by
exploiting the consistency of English, and the poor original ordering
(for BWT) of the English alphabet, can you get reasonable improvements
(so you use a fixed reordering for all English text, and then need only
1 bit to flag the reordering rather than actually encoding the
permutation).
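[For scale: transmitting an arbitrary permutation of a 256-symbol
alphabet costs about log2(256!) bits, which a quick check puts near
210 bytes -- a lot of overhead for a short file:]

```python
import math

# Bits needed to encode an arbitrary permutation of 256 symbols.
bits = math.log2(math.factorial(256))
print(round(bits))  # ~1684 bits, i.e. about 210 bytes
```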

Actually, there was one other way we got improvement: if the data is
a bit unusual, and you're not sure how to manually reorder, and you
will be consistently compressing data from the same source, computing
an order from a sample of that source will be good. Our best example
was the "geo" data in the Calgary corpus. The computed reordering
took that rather funky data file and worked wonders -- it was by far
our biggest improvement (better than the English text results).

--
Steve Tate --- srt[At]cs.unt.edu | Gratuitously stolen quote:
Dept. of Computer Sciences       | "The box said 'Requires Windows 95, NT,
University of North Texas        | or better,' so I installed Linux."
Denton, TX 76201                 |
