Michael Rouse
mro...@cdsnet.net
Brenton Chapin did some work on the idea. The site is
http://www.cs.unt.edu/~chapin.
On english text, the sorting order aeiou... usually yields some (very
small) improvement over 0-255 ordering.
I was thinking of experimenting with order one stats gathered before
BWT and perhaps even changing the symbols after the BWT based on these
stats. That is, identify the symbols which have the most predictable
preceeding symbols and grouping them together by modify the sorting
order. Then, after the transform re-name these preceeding symbols by
overall probability (or perhaps only the more predictable symbols).
Thus, if say prior to the BWT the symbol A was preceeded by B,C and D
with probabilities of say 35%,10% and 7% (ignore the rest) and W was
preceeded by say X,Y and Z with probabilites of 70%,3%,2% (ignore the
rest) then we could sort with the order AZ.... and, after the sort,
rename all the B,C and D preceeded by A as 0,1 and 2. Also, rename all
the X,Y and Z the preceed W with 0,1 and 2. The two techniques
combined should put the greatest concentration of symbols together
after the BWT with just a little overhead.
-Michael A. Maniscalco
Sent via Deja.com http://www.deja.com/
Before you buy.
Mikael
In article <8dlk07$t37$1...@nnrp1.deja.com>,
This is probably true for any specific type of data like English Text,
however, i was thinking of a more general technique to apply for each
unique data stream. And all things considered, I can't believe that
the technique is all that time consuming. especially if you commit
yourself to only considering those symbols which show a higher
predictability for previos symbols and ignore the rest. My guess is
that only a few symbols for any data might qualify and so it might not
take that much time to model just those few.
-michael
>> Well, my experimentations have indicated that it's not worth the
>> effort to *compute* the order each time. A simple static reordering
>> is good and close enough to the optimal. The order that is
>> suggested in Chapin's paper is not the best possible though, it's
>> close but with a little experimentation you should easily find
>> something better.
> This is probably true for any specific type of data like English Text,
> however, i was thinking of a more general technique to apply for each
> unique data stream. And all things considered, I can't believe that
> the technique is all that time consuming. especially if you commit
> yourself to only considering those symbols which show a higher
> predictability for previos symbols and ignore the rest. My guess is
> that only a few symbols for any data might qualify and so it might not
> take that much time to model just those few.
Since you're talking about the paper I wrote with Chapin, I thought
I'd respond!
We did indeed look at ways of computing a reordering -- basically, you
want to group symbols together that when used as contexts have
"similar" distributions. This way you're not paying as big a cost
every time you switch contexts. The technique we used is to try
different metrics for measuring the difference between distributions,
and then tried to solve a TSP problem on the resulting graph. That
will give (if the metric is good) a way of minimizing the cost of
switching between distributions.
The problem is that if you compute the ordering you also have to
encode it. For a short file, the overhead of encoding the alphabet
ordering usually cancels out most of the improvement. Only by
exploiting the consistency of English, and the poor original ordering
(for BWT) of the English alphabet, can you get reasonable improvements
(so a fixed reordering for all English text, and then need 1 bit to
flag the reordering rather than actually encoding the permutation).
Actually, there was one other way we got improvement: if the data is
a bit unusual, and you're not sure how to manually reorder, and you
will be consistently compressing data from the same source, computing
an order from a sample of that source will be good. Our best example
was the "geo" data in the Calgary corpus. The computed reordering
took that rather funky data file and worked wonders -- it was by far
our biggest improvement (better than the English text results).
--
Steve Tate --- srt[At]cs.unt.edu | Gratuitously stolen quote:
Dept. of Computer Sciences | "The box said 'Requires Windows 95, NT,
University of North Texas | or better,' so I installed Linux."
Denton, TX 76201 |