# paper on BWT for source estimation

4 views

### Dror Baron

Mar 24, 2000, 3:00:00 AM3/24/00
to
Hi all,

Anybody interested is welcome to take a look at a paper
I recently showed at the CISS conference.
The paper uses the Burrows Wheeler transform (BWT) to
preprocess a sequence of symbols. Afterwards a block-merging
algorithm estimates states, parameters, etc.

The main contributions are:
1. the bvlock mergins estimates the source well asymptotically
2. application to universal coding with lower redundancies than
previous BWT schemes
3. iid assumption not used (weaker assumption used)

Dror

--------------------------
Dror Baron
ECE Department
University of Illinois at Urbana-Champaign(UIUC)
CSRL room 120
Phone: (217)-265-8234 (office)
Website: http://www.uiuc.edu/~dbaron

### Dmitry Shkarin

Mar 25, 2000, 3:00:00 AM3/25/00
to
Hi, Dror!

Note: Really, BWT has the same asimptotical complexity as PPM - O(d*N),
where d is average context length.

### Dror Baron

Mar 25, 2000, 3:00:00 AM3/25/00
to
Hi!

However, your claim is inaccurate. (I'll use LATEX for math:)

For modelling sources of depth $d$ PPM is indeed $O(d\cdot{N})$
where $N$ is the string length. However, usually $d=O(\log{N})$.
Thus, PPM is typically $O(N\log{N})$.
(NOTE: I got this confirmed by the author of the original
papers on PPM)

The BWT has several variants:
1. Sadakane's method is $O(N\log{N})$ worst case and $O(N\cdot{AML})$
typical case where AML is average matching length,
which is $O(H\cdot\log{N})$ where $H$ is the entropy rate.
2. Suffix trees are $O(N|X|)$ worst case, where $X$ is the alphabet
and $|X|$ its cardinality.
3. Suffix trees with hash tables are better yet...

Therefore, asymptotically $O(\log{N})$ is worse than $|X|$ which
a constant (not to mention the hash table implementations).

Lastly, should you think that CTW (context tree weighting) is
better, I talked at length with one of the authors last week
and he became convinced that his method is $O(N\log{N})$.
By the way, my paper (www.uiuc.edu/~dbaron and look for it)
discusses these issues briefly with pointers to the literature.

Again, thanks for the interest!
Dror

### Dmitry Shkarin

Mar 27, 2000, 3:00:00 AM3/27/00
to
Hi, Dror!

>
>However, your claim is inaccurate. (I'll use LATEX for math:)
Yes, sorry, it was error: d must be replaced with |X|.
BWT and PPM uses the same data structure, so their asimptotical
complexity must be the same.

Mar 27, 2000, 3:00:00 AM3/27/00
to
Hello, Dror !

> Anybody interested is welcome to take a look at a paper
> I recently showed at the CISS conference.
> The paper uses the Burrows Wheeler transform (BWT) to
> preprocess a sequence of symbols. Afterwards a block-merging
> algorithm estimates states, parameters, etc.

Good article. The approach can be found very useful in practical BW-schemes
Here are some notes and suggestions:

> The main contributions are:
> 1. the bvlock mergins estimates the source well asymptotically

Can you produce the results for Calgary?

> 2. application to universal coding with lower redundancies than
> previous BWT schemes

Well, but what in practice? K-T estimating is not good for
BW-postprocessing, because the transformed sequence usually has a kind of
correlation (it's not memoryless). So, the usual approaches are preferred
and the merging algorithm here can help to adjust the coding parameters
(statistics reset, etc). Going this way we may successfully reject
non-optimal MTF-coding.

> 3. iid assumption not used (weaker assumption used)

Block-oriented character is the main disadvantage of the represented
technique. We assume that the information birth is nonrecurrent action. It
put some limitations on the source (from this point CTW has a prime
advantage). Therewith, I'd like to note that such method as well as original

Regards,

E-mail: seme...@unitel.spb.ru

### Dror Baron

Mar 27, 2000, 3:00:00 AM3/27/00
to
The data structures are definitely similar.
However, in PPM we (asymptotically) model the source as
depth-$\log{N}$ thus at best $\log{N}$ operations per symbol are
necessary. For BWT, nothing is being modelled explicitly.
We need (suffix arrays) $|X|$ (without hash tables)
operations per symbol.
For practical purposes $|X|$ is pretty big, but I'm sure you
know how that can be overcome.
So... asymptotically PPM is *not* linear complexity whereas
BWT *is* linear.

I suggest you go through references [2],[8],[13],[14], and [15]
in my paper.
Remember that the basic difference is that PPM is explicitly
modelling the source while BWT is only running a sort.

Thanks again,
Dror

### Dror Baron

Mar 27, 2000, 3:00:00 AM3/27/00
to

I'll try to respond to your suggestions.

> Can you produce the results for Calgary?

I don't have simulation results yet.
However, I believe that for practical purposes
the block lengths will be quite large, so I
would not expect amazing results.
I guess we'll need to be patient.

> Well, but what in practice? K-T estimating is not good for
> BW-postprocessing, because the transformed sequence usually has a kind of
> correlation (it's not memoryless). So, the usual approaches are preferred
> and the merging algorithm here can help to adjust the coding parameters
> (statistics reset, etc). Going this way we may successfully reject
> non-optimal MTF-coding.

Exactly. Instead of running MTF on the BWT output,
we begin by estimating state segments, afterwards we
code each one of those separately.

> Block-oriented character is the main disadvantage of the represented
> technique. We assume that the information birth is nonrecurrent action. It
> put some limitations on the source (from this point CTW has a prime
> advantage). Therewith, I'd like to note that such method as well as original
> Burrows-Wheeler algorithm is not adaptive.

Yes - the block-orientation is limiting.

Thanks again,

### Dror Baron

Mar 27, 2000, 3:00:00 AM3/27/00
to

We agree that good results on text are not expected.
The fundamental reason is that the algorithm is designed for modelling.
It isn't designed for text. I am sure you realize that I'm a
research-oriented PhD student and not a results-oriented software
developer - the flavor of my results will always be biased for
(what I consider to be) beauty and elegance and less for results.

By the way - I talked with Willems last week and he mentioned
they have some results on a variant of CTW which is about as
good as PPM. So I wouldn't rule anything out.
Last comment: from a theoretical viewpoint (universal coding
redundancies) CTW is extremely good.
BWT output will always be biased by $\log{N}$ per segment (see
Merhav's paper on MDL of piecewise stationary sources).
Therefore CTW is 'born' to be better than BWT.

> > Exactly. Instead of running MTF on the BWT output,
> > we begin by estimating state segments, afterwards we
> > code each one of those separately.
>

> Not quite right. Undoubtedly, there is a correlation between the segments.
> Therewith, separable coding means large statistics saving issues in a case

Theoretically you are correct - see reference [1] from my paper.
For practical purposes (justified by simulation) the correlation

> BTW Dmitry's PPMD (variant E) packer as a rule take the lead in coding speed
> over BW-packers : )

Again: my emphasis is on theoretical research and not programming.
I am sure Dmitry is 10 times better at programming ;-)

At the end of the day there is an inherent bound on
the complexity of compression algorithms: it is $O(N\log{|X|})$.
Anybody who claims to be better is cheating somewhere.
Nobody can do real $O(N)$. Current BW algorithms are $O(N|X|)$.

Dror

Mar 28, 2000, 3:00:00 AM3/28/00
to
Hello, Dror !

> > Can you produce the results for Calgary?
>
> I don't have simulation results yet.
> However, I believe that for practical purposes
> the block lengths will be quite large, so I
> would not expect amazing results.
> I guess we'll need to be patient.

I have a serious doubt about amazing results:

(1) for BW (here and later Burrows-Wheeler method): as far as I know,
Michael Shindler's SZIP makes a similar thing in some aspects, but the
results are not very impressive;
(2) for CTW: well, your technique may be good in speed (especially, if radix
sort is in use (I mean partial context sort)), but what about the ratio? As
I said, it's not adaptive, and the estimate is not so clever as in CTW. So,
the compression effect may be poor for this class of algorithms;
(3) big trouble: the need for creating byte-oriented version. It's not easy.
One way is doing something like Tjalkens, Volf and Willems described in "A
Context-Tree Weighting Method for Text Generating Sources". This approach
looks awkward. Unfortunately, for today I don't see the other way.

> > Well, but what in practice? K-T estimating is not good for
> > BW-postprocessing, because the transformed sequence usually has a kind
of
> > correlation (it's not memoryless). So, the usual approaches are
preferred
> > and the merging algorithm here can help to adjust the coding parameters
> > (statistics reset, etc). Going this way we may successfully reject
> > non-optimal MTF-coding.
>

> Exactly. Instead of running MTF on the BWT output,
> we begin by estimating state segments, afterwards we
> code each one of those separately.

Not quite right. Undoubtedly, there is a correlation between the segments.
Therewith, separable coding means large statistics saving issues in a case

the adaptation must depend on segments boundaries. BTW the separation can be
made during the sequential coding after the boundary is reached (an exact
boundary hit is not so important). I mean coder and decoder that
synchronously analyze the data behavior.

[from the other discussion]

>However, in PPM we (asymptotically) model the source as
>depth-$\log{N}$ thus at best $\log{N}$ operations per symbol are
>necessary. For BWT, nothing is being modelled explicitly.
>We need (suffix arrays) $|X|$ (without hash tables)
>operations per symbol.

Modern PPM algorithms don't require hashing. PATRICIA structures are often

> For practical purposes $|X|$ is pretty big, but I'm sure you
> know how that can be overcome.
> So... asymptotically PPM is *not* linear complexity whereas
> BWT *is* linear.

BTW Dmitry's PPMD (variant E) packer as a rule take the lead in coding speed
over BW-packers : )

Regards,

E-mail: seme...@unitel.spb.ru

### Malcolm Taylor

Mar 28, 2000, 3:00:00 AM3/28/00
to
>> BTW Dmitry's PPMD (variant E) packer as a rule take the lead in coding speed
>> over BW-packers : )
>
>Again: my emphasis is on theoretical research and not programming.
>I am sure Dmitry is 10 times better at programming ;-)

I think the point must be made that Dmitry's PPMD uses a suffix tree
approach which gives rise to an O(N) (the point being that the maximum
order and alphabet size are constant and so should be ignored :). This
is better than the O(N*log(N)) you describe for PPM algorithms in
general.

The O(N) that Dmitry achieves, makes use of the coherence between
adjacent contexts, and as such could also be applied to CTW models as
they use similar structures. This would mean that you could create a
CTW that has O(N) complexity as well.

ie. both PPM and CTW are able to achieve the lower bound in complexity
that you quote for 'any' compression algorithm.

Having said that, the current crop of CTW algorithms and PPM algorithm
implementations (not including Dmitry's PPMD that I know of) are
O(N*logN) (eg. RK uses an O(N*logN) implementation of PPM because the
extra complexity and memory saving techniques in my PPMZ destroys the
coherence needed for an O(N) implementation).

Also of note is that complexity analysis is only useful
asymptotically. The classic example is that bubble sort is faster than
quick sort for small sets since the constant per 'operation' is much
lower than quick sort. So a comparison of algorithms should take into
account the differences in constant time per 'operation' (eg. CTW uses
binary decomposition giving rise to at least 8 times the work of PPM).

Just my few points :)

Malcolm

Mar 28, 2000, 3:00:00 AM3/28/00
to
Hello, Dror !

> By the way - I talked with Willems last week and he mentioned
> they have some results on a variant of CTW which is about as
> good as PPM.

Note: I have nothing against CTW. Compression results are really great. But
truth to tell: because of the slow decoding method will never be widely used
in practice (as well as PPM, DMC, PSM). CTW remains nothing but admirable
theoretical issue.

> > > Exactly. Instead of running MTF on the BWT output,
> > > we begin by estimating state segments, afterwards we
> > > code each one of those separately.
> >
> > Not quite right. Undoubtedly, there is a correlation between the
segments.
> > Therewith, separable coding means large statistics saving issues in a
case
>

> Theoretically you are correct - see reference [1] from my paper.
> For practical purposes (justified by simulation) the correlation
> between adjacent segments is negligible.

Don't agree. Saying that you imply the particular class of sources. The
number of counter-examples I know allows to make an assumption that
correlation exists and we can't leave it out of account, especially if we
are in need of a clever statistical information storing mechanism in a case
of 256-symbol alphabet.
.
Regards,

E-mail: seme...@unitel.spb.ru

### Dror Baron

Mar 28, 2000, 3:00:00 AM3/28/00
to
> Don't agree. Saying that you imply the particular class of sources. The
> number of counter-examples I know allows to make an assumption that
> correlation exists and we can't leave it out of account, especially if we
> are in need of a clever statistical information storing mechanism in a case
> of 256-symbol alphabet.

Obviously if you showed me some of these examples I would be
extremely grateful.

Dror

### Dror Baron

Mar 28, 2000, 3:00:00 AM3/28/00
to
Hi Malcolm,

Maximum order grows with $N$ so I don't agree with that approach.
Alphabet size - that's fine.
Theoretically, as one of the guys in this correspondence pointed out,
there is no reason to attain the $O(N\log{|X|})$ bound once we can
do $O(N|X|)$, because we can 'go down' to smaller alphabets.

> ie. both PPM and CTW are able to achieve the lower bound in complexity
> that you quote for 'any' compression algorithm.

This lower bound is simply the length in bits of the original
(uncompressed) version. I'm sure you agree we can't do better than
O(length original).

> Also of note is that complexity analysis is only useful
> asymptotically. The classic example is that bubble sort is faster than
> quick sort for small sets since the constant per 'operation' is much
> lower than quick sort. So a comparison of algorithms should take into
> account the differences in constant time per 'operation' (eg. CTW uses
> binary decomposition giving rise to at least 8 times the work of PPM).

Of course ;-)
Although I usually use the 'sort' in MATLAB which is
'other people's code' ;-)

Dror

### Dror Baron

Mar 28, 2000, 3:00:00 AM3/28/00
to
You propose compressing block by block.
This is drastically different than the proposed scheme.

> I agree. As a general rule BWT can not be better than PPM compressionwise;
> this
> was discussed on Dcc98 (not in the program, but among a few people). The
> reason is
> mainly that the BWT uses wrong statistics for the beginning of new
> contexts/buckerts/whatever.
> This is also the reason for improved compression for bigger blocks - the
> loss due to
> adaption is (nearly) the same for a big and a small block as the number of
> different contexts
> is about the same. For the big block the percentage of such badly compressed
> data will
> simply be smaller.
>
> I did a few experiments and these are approximate data:
> if you take a big text (maybe out of the canterbury large file corpus) and
> run statistics individually
> for the first few bytes of a context/bucket/whatever and for the remainder
> you will notice that
> the compression in the beginning is very low (3 bytes/char or more for the
> first few chars)
> while it then becomes good.
> However for large blocks (above 1 or 2 MB) the fraction of these badly
> compressed bytes is low.
> So do not expect too much from larger blocksizes. I have szip versions
> capable of handling blocks up to
> 2^31-1, but i never really tested above 10MB or so.
> My rule of thumb: limit(size) = 2*(size(using 4MB blocks)-size(using 2MB
> blocks)).

>
> >(1) for BW (here and later Burrows-Wheeler method): as far as I know,
> >Michael Shindler's SZIP makes a similar thing in some aspects, but the
> >results are not very impressive;
>

> szip does a partial context sort, the context size is selectable by -o
> switch. Try -o4 for speed, which
> will do two base 2^16 LSB first radix sorts between writing "processing..."
> and "coding..." (-v1 to see it)
> The time before processing is reading the input, between coding and done
> doing the modelling
> and output. szip does no asyncroneous IO; experiments indicate that when
> in seperate threads speed for -o4 would double - is anyone interested in
> this?
>
> The szip version available does no context border estimation. I found that
> the 2% smaller size was
> not worth 20% more CPU time. I did such experiments, so these 2% smaller for
> files on my harddisk
> are tested. Szip has other places too where it prefers speed over
> compression.
>
> The main benefit of limiting sort order is that you could easily do BWT in
> hardware - MTF is not a
> real problem, and so are entropy coders in hardware.
>
> There are ways to get better compression out of BWT than currently
> available.
>
> michael
>
> [snipped]
>
>
>

### DI Michael Schindler

Mar 29, 2000, 3:00:00 AM3/29/00
to
hi!

i'd like to add my 0.02 euro to this discussion:

Vladimir Semenyuk wrote in message <8bof54$m56$1...@news.runnet.ru>...

>Hello, Dror !
>
>> > Can you produce the results for Calgary?
>>
>> I don't have simulation results yet.
>> However, I believe that for practical purposes
>> the block lengths will be quite large, so I
>> would not expect amazing results.
>> I guess we'll need to be patient.
>
>I have a serious doubt about amazing results:

I agree. As a general rule BWT can not be better than PPM compressionwise;

### Dmitry Shkarin

Mar 29, 2000, 3:00:00 AM3/29/00
to
Hi, Dror!

>The data structures are definitely similar.
>However, in PPM we (asymptotically) model the source as
>depth-$\log{N}$ thus at best $\log{N}$ operations per symbol are
>necessary. For BWT, nothing is being modelled explicitly.
>We need (suffix arrays) $|X|$ (without hash tables)
>operations per symbol.

Let's divide:

1. Finite order PPM.
It is easy: after statistics gathering(all possible states and all
possible transitions are created), it works similarly to ordinary order-0
arithmetic coder with cost proportional |X| per symbol, so PPM behaivor
is O(|X|*N).

2. Unbounded PPM*.
It is more complicated: number of operations per context is bounded by
|X| again and number of scanned contexts(number of escapes) is less or
equal |X| too, because We can visit only those nodes that contain different
number of transitions, their number is limited by |X|, tree nodes are added
in linear time, therefore PPM* behaivor is O(|X|*|X|*N).
Hmmm, it is slightly worse than my previous claim ;-).

3. CTW(bounded).
It is similar to PPM: finite model => finite time for each symbol.

4. CTW(unbounded, does it exists?).
It scans tree in depth for each symbol, tree depth is proportional
log(N), so it is really O(N*log(N)).

Mar 29, 2000, 3:00:00 AM3/29/00
to
Hi, Michael !

>>(1) for BW (here and later Burrows-Wheeler method): as far as I know,
>>Michael Shindler's SZIP makes a similar thing in some aspects, but the
>>results are not very impressive;

> szip does a partial context sort, the context size is selectable by -o
> switch. Try -o4 for speed, which
> will do two base 2^16 LSB first radix sorts between writing
>"processing..." and "coding..." (-v1 to see it)

I knew this ;-)

> The szip version available does no context border estimation. I found that
> the 2% smaller size was
> not worth 20% more CPU time. I did such experiments, so these 2% smaller
> for files on my harddisk
> are tested. Szip has other places too where it prefers speed over
> compression.

I am entirely with you in this. A few percents is not amazing result.
Saying "a similar thing in some aspects" I meant only the partial
renunciation of MTF. I must be mistaken, but it seems to me that SZIP
sometimes doesn't use MTF. Is it right?

Regards,

E-mail: seme...@unitel.spb.ru

### Dror Baron

Mar 29, 2000, 3:00:00 AM3/29/00
to
Hi Dmitry,

Excellent - it seems that our understanding is converging.
There are nice $O(N|X|)$ algorithms, and they lead to $O(N\log{|X|})$.
Anyway BWT is unbounded and $O(N|X|)$ so it is fine.

By the way - CTW unbounded does indeed exist.
Willems told me that he has some paper on this.
I'll send you the reference when I come across it...
(Busy month)

Dror

### Dmitry Shkarin

Mar 30, 2000, 3:00:00 AM3/30/00
to
Hi, Malcolm!

>Also of note is that complexity analysis is only useful
>asymptotically. The classic example is that bubble sort is faster than
>quick sort for small sets since the constant per 'operation' is much
>lower than quick sort. So a comparison of algorithms should take into
>account the differences in constant time per 'operation' (eg. CTW uses
>binary decomposition giving rise to at least 8 times the work of PPM).
I agree with You that asymptotical and real behaivors are two very
different things and here is more actual example: standard blocksorting
scheme(BWT + MTF + entropy coding) is asymptotically nonoptimal, but it is
successfully used by modern compressors ;-).

### mikael.l...@spray.se

Mar 30, 2000, 3:00:00 AM3/30/00
to
Hi all!
I'm pleased to see that there are people still working to improve bwt.
Bwt still has a lot of room for that.
I just wish that you could have included more like examples and results!

Ppm and ctw etc. is mainly interesting from an academic point of view.
It's simply too slow and memory demanding for any real practical use.
Someone said
"We don't need more powerful computers, we need better algorithms!"
Too see parts of what is possible with bwt, visit
http://www.mathematik.uni-bielefeld.de/~bernhard/

I'm working to improve bwt too and my current
results are very similar to these.
I think it's possible to do better than Balkenhol
but it's getting more difficult.
The argument that better compression equals slower
execution is simply not true! (most of the time, lazy programmers?)

You ppm experts, perhaps you should use your expert
knowledge and energy to develop bwt instead?

Mikael

,
> Hi all,

>
> Anybody interested is welcome to take a look at a paper
> I recently showed at the CISS conference.
> The paper uses the Burrows Wheeler transform (BWT) to
> preprocess a sequence of symbols. Afterwards a block-merging
> algorithm estimates states, parameters, etc.
>

> The main contributions are:
> 1. the bvlock mergins estimates the source well asymptotically

> 2. application to universal coding with lower redundancies than
> previous BWT schemes

> 3. iid assumption not used (weaker assumption used)
>

>

> Dror
>
> --------------------------
> Dror Baron
> ECE Department
> University of Illinois at Urbana-Champaign(UIUC)
> CSRL room 120
> Phone: (217)-265-8234 (office)
> Website: http://www.uiuc.edu/~dbaron
>
>

Sent via Deja.com http://www.deja.com/

### Binder Edgar

Mar 30, 2000, 3:00:00 AM3/30/00
to

Hi,

mikael.l...@spray.se wrote:

> Ppm and ctw etc. is mainly interesting from an academic point of view.
> It's simply too slow and memory demanding for any real practical use.

Hmm, I don't agree here. PPMD (the compressor) shows that PPM can be
quite fast. I suspect that it's the same with CTW, but probably most of the
people that work on PPM/CTW don't value speed/memory requirements as much as
compression ratios...

> I think it's possible to do better than Balkenhol
> but it's getting more difficult.

Well, my BWT results are a bit better than Balkenhols, but I use a
completly different modelling technique...

> The argument that better compression equals slower
> execution is simply not true! (most of the time, lazy programmers?)

:-)

--
Edgar

### Bernhard Balkenhol

Apr 6, 2000, 3:00:00 AM4/6/00
to
Dror Baron wrote:
You propose compressing block by block.
This is drastically different than the proposed scheme.

> I agree. As a general rule BWT can not be better than PPM compressionwise;

> this
> was discussed on Dcc98 (not in the program, but among a few people).  The
> reason is
> mainly that the BWT uses wrong statistics for the beginning of new
> contexts/buckerts/whatever.
> This is also the reason for improved compression for bigger blocks - the
> loss due to
> adaption is (nearly) the same for a big and a small block as the number of
> different contexts
> is about the same. For the big block the percentage of such badly compressed
> data will
> simply be smaller.

OK, I was also one of this people who has statet that. But now I am no longer shure
about that.  It can been shown that you can think about the BWT method as a generalisation
of PPM because
1) You have to consider Models where a context is suffix of an other context
2) In fact you can estimate the changings of the real context and then work as in PPM

My implementation gives
bib   111261    26469  1.903
book1   768771   218023  2.269
book2   610856   149734  1.961
geo   102400    53221  4.158
news   377109   114097  2.420
obj1    21504    10020  3.728
obj2   246814    74284  2.408
paper1    53161    16003  2.408
paper2    82199    24186  2.354
pic   513216    46131  0.719
progc    39611    12108  2.445
progl    71646    15028  1.678
progp    49379    10377  1.681
trans    93695    17106  1.461
AVERAGE  14 :           786787  2.257

on the files of the Calgary Corpus
and that is possible in complexity similar to gzip (gzip not bzip or something else)

Everybody who is interrested can find the describtion in a report
http://www.mathematik.uni-bielefeld.de/~bernhard

the reports are under the link "Preprints im SFB"
and the mentioned one is a work together with Yuri M. Shtarkov:
"One Attempt of a Compression Algorithm based on the BWT"

That is still not the best I can achieve, I estimate for the next version which I like
to describe something like 2.20 in the average (hopefully).
Therefore as a conclusion at least for myself it is no longer clear that we can not do better
than PPM or CTW.

> .....

> >(1) for BW (here and later Burrows-Wheeler method): as far as I know,
> >Michael Shindler's SZIP makes a similar thing in some aspects, but the
> >results are not very impressive;

What do you think about the results above ?

> ....

> There are ways to get better compression out of BWT than currently
> available.

As you can see above.

>
> michael
>
> [snipped]
>
With my best regards
    Bernhard

--
---------------------------------------------------------------------------
Dr. Bernhard Balkenhol        e-Mail bern...@mathematik.uni-bielefeld.de
Fakultaet fuer Mathematik     Tel. +0049-521-106-3872
Universitaet Bielefeld
Postfach 10 01 31             Fax  +0049-521-106-4743
D - 33501 Bielefeld
Germany

### Bernhard Balkenhol

Apr 6, 2000, 3:00:00 AM4/6/00
to

mikael.l...@spray.se wrote:

Hi,

Hi all!
I'm pleased to see that there are people still working to improve bwt.
Bwt still has a lot of room for that.
I just wish that you could have included more like examples and results!

>...

I'm working to improve bwt too and my current
results are very similar to these.
I think it's possible to do better than Balkenhol
but it's getting more difficult.

No, we can do better and faster simultaneously as myself.

The argument that better compression equals slower
execution is simply not true! (most of the time, lazy programmers?)

You ppm experts, perhaps you should use your expert

knowledge and energy to develop bwt instead?

The sorting can be done in between 1/4 and 1/2 of the time of Benson Sedwick with
similar memory requirement but not with that terible worst cases like pic. This
can be done with the same algorithm in time complexity smaller than with suffix trees.
As far as I have described it I let you know.

With regards
Bernhard
---------------------------------------------------------------------------
Dr. Bernhard Balkenhol        e-Mail bern...@mathematik.uni-bielefeld.de

Apr 6, 2000, 3:00:00 AM4/6/00
to
Hello, Bernhard !

> OK, I was also one of this people who has statet that. But now
> I am no longer shure about that. It can been shown that you can
> think about the BWT method as a generalisation of PPM because

> 1) You have to consider Models where a context is suffix of an other
> context
> 2) In fact you can estimate the changings of the real context and then
> work as in PPM

I have the objections. First of all, let's disjoint two cases:

(1) We use an original Burrow-Wheeler approach. It means: making a sorting
(i. e. simple x^n->y^n as in your papers), forgetting about contexts (!),
coding.

(a) Fast modeling => fast compression.
(b) Very fast decompression => this method will be used in practice in
contrast to any prediction technique (i.e. statistical methods like PPM).

(a) A kind of abstraction from real model due to context forgetting =>
unavoidable loss of efficiency (even with estimating techniques).
(b) Block nature => loss of adaptivity => possible loss of efficiency.

(2) We make context sorting and the result is used to facilitate the
modeling. It can be useful in dictionary and statistical methods.

(a) Fast and simple modeling => possibility of making good results with low
complexity.

(a) It speeds up the compression, but the complexity of decompression
stays the same.
(b) It is not a new method (not BW). Just a sorting.

So, if you mean the first case, in general it won't guarantee toughest
compression (see Disadvantages); second approach has no connection
to Burrows-Wheeler method.

And now I'll show the same from the point of practice:

> My implementation gives
> bib 111261 26469 1.903
> book1 768771 218023 2.269
> book2 610856 149734 1.961
> geo 102400 53221 4.158
> news 377109 114097 2.420
> obj1 21504 10020 3.728
> obj2 246814 74284 2.408
> paper1 53161 16003 2.408
> paper2 82199 24186 2.354
> pic 513216 46131 0.719
> progc 39611 12108 2.445
> progl 71646 15028 1.678
> progp 49379 10377 1.681
> trans 93695 17106 1.461

Results for PPMD(E) by Dmitry Shkarin ("-o5" option):

bib 1.77
book1 2.22
book2 1.90
geo 4.46
news 2.28
obj1 3.60
obj2 2.30
paper1 2.25
paper2 2.22
pic 0.79
progc 2.28
progl 1.60
progp 1.62
trans 1.37

And speed was really impressive here. (PPMD usually passes ahead of
BW-based programs (I mean coding+decoding time).)

Conclusion: BW is interesting only from the point of speed. So, further
sorting improvements are welcome. Greedy coding is not so important (slowing
down doesn't worth it). It's my standpoint.

Regards,

E-mail: seme...@unitel.spb.ru

### Binder Edgar

Apr 6, 2000, 3:00:00 AM4/6/00
to
Hi,

[Bernhard: BWT as PPM generalisation]

> I have the objections. First of all, let's disjoint two cases:
>
> (1) We use an original Burrow-Wheeler approach. It means: making a sorting
> (i. e. simple x^n->y^n as in your papers), forgetting about contexts (!),
> coding.
>
> (a) Fast modeling => fast compression.
> (b) Very fast decompression => this method will be used in practice in
> contrast to any prediction technique (i.e. statistical methods like PPM).

You forgot (c), Context blending and (d), BW doesn't need very much memory.

> (a) A kind of abstraction from real model due to context forgetting =>
> unavoidable loss of efficiency (even with estimating techniques).
> (b) Block nature => loss of adaptivity => possible loss of efficiency.

Well, this is correct...

> So, if you mean the first case, in general it won't guarantee toughest
> compression (see Disadvantages); second approach has no connection
> to Burrows-Wheeler method.
>
> And now I'll show the same from the point of practice:

[results comparision]

> And speed was really impressive here. (PPMD usually passes ahead of
> BW-based programs (I mean coding+decoding time).)
>
> Conclusion: BW is interesting only from the point of speed. So, further
> sorting improvements are welcome. Greedy coding is not so important (slowing
> down doesn't worth it). It's my standpoint.

Hmm, it isn't neccesary for BW to get slower with better modeling. It's
possible to improve both compression and speed...

--
Edgar

### mikael.l...@spray.se

Apr 7, 2000, 3:00:00 AM4/7/00
to
Hi!

What makes bwt worth the effort is that
1. It's relatively new and unexplored (1994) compared to ppm.
2. When Burrows and Wheeler released the idea to the world,
it could compress Calgary corpus to 2.40 bpc average.
Today we can compress the same using bwt to 2.26 bpc and it's not a
limit.
3. As Balkenhol is saying, bwt is close to ppm in what and how it's doing
it but faster and much more efficient with memory, hinting that it has
not reached its true potential.

If we say it's impossible from the beginning, we will never discover
or gain anything. (Man had never walked on the moon either!)

Mikael

In article <8ciotv$e53$1...@news.runnet.ru>,
> Hello, Bernhard !

>
> > OK, I was also one of this people who has statet that. But now
> > I am no longer shure about that. It can been shown that you can
> > think about the BWT method as a generalisation of PPM because
>
> > 1) You have to consider Models where a context is suffix of an other
> > context
> > 2) In fact you can estimate the changings of the real context and then
> > work as in PPM
>

> I have the objections. First of all, let's disjoint two cases:
>
> (1) We use an original Burrow-Wheeler approach. It means: making a sorting
> (i. e. simple x^n->y^n as in your papers), forgetting about contexts (!),
> coding.
>
> (a) Fast modeling => fast compression.
> (b) Very fast decompression => this method will be used in practice in
> contrast to any prediction technique (i.e. statistical methods like PPM).
>

> (a) A kind of abstraction from real model due to context forgetting =>
> unavoidable loss of efficiency (even with estimating techniques).
> (b) Block nature => loss of adaptivity => possible loss of efficiency.
>

> (2) We make context sorting and the result is used to facilitate the
> modeling. It can be useful in dictionary and statistical methods.
>
> (a) Fast and simple modeling => possibility of making good results with low
> complexity.
>
> (a) It speeds up the compression, but the complexity of decompression
> stays the same.
> (b) It is not a new method (not BW). Just a sorting.
>

> So, if you mean the first case, in general it won't guarantee toughest
> compression (see Disadvantages); second approach has no connection
> to Burrows-Wheeler method.
>
> And now I'll show the same from the point of practice:
>

> > My implementation gives
> > bib 111261 26469 1.903
> > book1 768771 218023 2.269
> > book2 610856 149734 1.961
> > geo 102400 53221 4.158
> > news 377109 114097 2.420
> > obj1 21504 10020 3.728
> > obj2 246814 74284 2.408
> > paper1 53161 16003 2.408
> > paper2 82199 24186 2.354
> > pic 513216 46131 0.719
> > progc 39611 12108 2.445
> > progl 71646 15028 1.678
> > progp 49379 10377 1.681
> > trans 93695 17106 1.461
>

> Results for PPMD(E) by Dmitry Shkarin ("-o5" option):
>
> bib 1.77
> book1 2.22
> book2 1.90
> geo 4.46
> news 2.28
> obj1 3.60
> obj2 2.30
> paper1 2.25
> paper2 2.22
> pic 0.79
> progc 2.28
> progl 1.60
> progp 1.62
> trans 1.37
>

> And speed was really impressive here. (PPMD usually passes ahead of
> BW-based programs (I mean coding+decoding time).)
>
> Conclusion: BW is interesting only from the point of speed. So, further
> sorting improvements are welcome. Greedy coding is not so important (slowing
> down doesn't worth it). It's my standpoint.
>

> Regards,
>
> E-mail: seme...@unitel.spb.ru

### kaptke...@my-deja.com

Apr 7, 2000, 3:00:00 AM4/7/00
to

> What makes bwt worth the effort is that
> 1. It's relatively new and unexplored (1994) compared to ppm.
> 2. When Burrows and Wheeler released the idea to the world,
> it could compress Calgary corpus to 2.40 bpc average.
> Today we can compress the same using bwt to 2.26 bpc and it's not a
> limit.
> 3. As Balkenhol is saying, bwt is close to ppm in what and how it's
doing
> it but faster and much more efficient with memory, hinting that it has
> not reached its true potential.
>
> If we say it's impossible from the beginning, we will never discover
> or gain anything. (Man had never walked on the moon either!)
>
> Mikael

This is exactly the attitude that is needed for any great breakthough
to occur. There are so many things that are new and novel concepts
such as BWT which will only come into existance when we push to
discover them. The focus here seems to be exclusively on the modeling
improvements to BWT however, my work was firstly to build a better
coder for use with the BWT. Here, I have made many such improvements
which will further speed up the overall BWT performance (although I
realize that the real bottlneck in speed is not the coder but the sort
transform. however, every cycle counts.) My coder achieves just over
2.4 b/b on average for the Calgary Corpus without any special modeling
other than the generic BWT. There is no MTF, no RLE and no other
alterations to the output from BWT. Moreover, there is no arithmetic
coding so the whole encoding process is far faster. And decoding is
even faster still. Now, I need to start working on the more advanced
improvements to the basic BWT as is the focus here. If only I have
more time....

-michael maniscalco

Apr 7, 2000, 3:00:00 AM4/7/00
to
Hi, Edgar !

> You forgot (c), Context blending and (d), BW doesn't need very much
memory.

Context blending - this is not an advantage here, because it (blending)
could be made in a better way. BTW BW method doesn't actually makes context
blending (we lose the real contexts). Just source transformation.
Memory: you mean 5N+alpha, but it doesn't provide determined speed
complexity.

> Hmm, it isn't neccesary for BW to get slower with better modeling. It's
> possible to improve both compression and speed...

Better and faster than comp.exe by Michael? If you can, show me how. For
example, your distance coding still has a lack of speed.

Regards,

Apr 7, 2000, 3:00:00 AM4/7/00
to
Hi, Mikael !

> If we say it's impossible from the beginning, we will never discover
> or gain anything. (Man had never walked on the moon either!)

I didn't mean that someone should stop working. BW is great and useful. I
only pointed to the direction (sorting improvement), which by my opinion may

Regards,

E-mail: seme...@unitel.spb.ru

### Bernhard Balkenhol

Apr 7, 2000, 3:00:00 AM4/7/00
to

Hallo all of you.
Hi, Edgar !
>....
Better and faster than comp.exe by Michael? If you can, show me how. For
example, your distance coding still has a lack of speed.

Yes, I can do it even faster then gzip but I have not described it yet.
You can do the hole encoding not much more expensiv then the decoding.
It is even faster then Michaels restricted sorting where I still produce a total order.

And I have compared the compression rates to PPMDE D=5 as mentioned by Jan Aberg
1) But there the results look slieghtly different
2) As I have allready mentioned that is not the best I can achieve.
3) It is much slower then gzip and therefore also slower than my one.

And what I meen with generalisation is the answer to the question why Moffat's idea
to count the occured symbols only ones and not realy the frequencies (going done in the tree)
not only improves speed but also the compression rate.

Regards,

Regards

### Binder Edgar

Apr 7, 2000, 3:00:00 AM4/7/00
to
Hi,

> > You forgot (c), Context blending and (d), BW doesn't need very much
> memory.
>
> Context blending - this is not an advantage here, because it (blending)
> could be made in a better way. BTW BW method doesn't actually makes context
> blending (we lose the real contexts). Just source transformation.

That's not necessary true. Contexts *can* be recovered after a BWT, during
the encoding phase (although not all of them and unlikely very large
contexts)... Still, to be fair, I have to admit that the speed drop and
increased memory usage makes recovering quite unpractical - and it only brought
so far relatively small compression improvements in my tests.

> Memory: you mean 5N+alpha, but it doesn't provide determined speed
> complexity.

With a LZP preprocessing step (that helps BTW compression too), you'll have
determined speed...

> > Hmm, it isn't neccesary for BW to get slower with better modeling. It's
> > possible to improve both compression and speed...
>

> Better and faster than comp.exe by Michael? If you can, show me how. For
> example, your distance coding still has a lack of speed.

Okay, it's difficult to get better & faster than comp.exe. Still, I think
the compression/speed tradeoff is reasonable with DC...

--
Edgar

Apr 7, 2000, 3:00:00 AM4/7/00
to
Hi, Bernhard !

>> Better and faster than comp.exe by Michael? If you can, show me how. For
>> example, your distance coding still has a lack of speed.

> Yes, I can do it even faster then gzip but I have not described it yet.

> You can do the hole encoding not much more expensiv then the decoding.
> It is even faster then Michaels restricted sorting where I still produce a
total order.

Note, comp.exe by Michael Shindler doesn't make sorting.

>And I have compared the compression rates to PPMDE D=5 as mentioned by Jan
Aberg
>1) But there the results look slieghtly different

You took a wrong program :-)
Read a part of my recent message again:

VS> Results for PPMD(E) by Dmitry Shkarin ("-o5" option).

www.act.by.net).

>2) As I have allready mentioned that is not the best I can achieve.
>3) It is much slower then gzip and therefore also slower than my one.

method).
Console version is freeware. GZIP is old and too slow.

Regards,

E-mail: seme...@unitel.spb.ru

### DI Michael Schindler

Apr 7, 2000, 3:00:00 AM4/7/00
to
hi!

I fully belive Bernhard when he says his programs are faster and
compress better than szip. If szip would have started not as demo
to show that limited sorting works at all I would have done several
things different.

szip modelling still is based on the modelling designed back in '96 -
maybe it's about time to exploit the free dictionary approach you get
during the sort and a few other things, as well as commenting the
#define EXTRAFAST line in the rangecod.c used inside szip,
as this costs some compression.
The IO of the compressed data should also change the to
something more reasonable than putchar/getchar. Something like
collecting it in mem and output the whole block once 64 kB are there --
I know putchar does this, but i could shave off some cycles there.
Program counter sampling shows that IO takes a large fraction of
the time. Also CPU time is just about 1/2 of real time in an idle
system, so I could win by parallel IO and sorting.

I should also set the tuning of the modelling more towards small
files and not to the default blocksize of 1.7MB to perform better on
calgary corpus, and maybe focus more on text too :)

And maybe I should use a huffman coder with 20MB/sec throughput
instead of a rangecoder. Or mixed coders.

But then: all of above would break compatibility, and would mean I'd
have to spend some time on it.

When comparing with my restricted sorting: compare with order 4
(-o4 switch), as the other orders are really slow :(

Beeing faster than gzip -9 or even -6 is no miracle; use -1 (or program
imp) for a test. Or if one likes a real challange: beat LZOP (by Markus
Oberhumer) speedwise.

As for counting occurred symbols (and not frequencies): szip does this
since 2 years, so it probably is an old idea. Most of the time it just
counts
the symbols of the symbols of the last 32 runs - which implies that the
probability is never more accurate than 1/32. Nevertheless it gives
acceptable compression ;)

regards
Michael
--
DI Michael Schindler Compression Consulting
http://www.compressconsult.com/ mic...@compressconsult.com

Bernhard Balkenhol wrote in message
<38EDF352...@mathematik.uni-bielefeld.de>...

>Yes, I can do it even faster then gzip but I have not described it yet.
>You can do the hole encoding not much more expensiv then the decoding.
>It is even faster then Michaels restricted sorting where I still produce a
total order.

>And I have compared the compression rates to PPMDE D=5 as mentioned by Jan
>Aberg
>1) But there the results look slieghtly different

>2) As I have allready mentioned that is not the best I can achieve.
>3) It is much slower then gzip and therefore also slower than my one.
>

Apr 8, 2000, 3:00:00 AM4/8/00
to
Hi, Edgar and All !

> > Context blending - this is not an advantage here, because it (blending)
> > could be made in a better way. BTW BW method doesn't actually makes
> > context
> > blending (we lose the real contexts). Just source transformation.

> That's not necessary true. Contexts *can* be recovered after a BWT,
> during the encoding phase (although not all of them and unlikely very
> large contexts)...

Look. We usually make BWT by sorting the real contexts. Then we as if forget
about the reality. Now you propose making the recovery. It looks really
strange. May be it would be better not to do BWT at all. Just try PPM, CTW
or Associative Coding. For speeding-up use sorting, but don't make BWT. Be
sure you will get much more.

*******************************************************
For everyone entrained into this discussion:

I try to show that we shouldn't improve BWT so as it become a variation
of some well-known prediction technique. Doing so we lose the charm of
original Burrows-Wheeler algorithm. Well, just don't name it BWT.

*******************************************************

> > Memory: you mean 5N+alpha, but it doesn't provide determined speed
> > complexity.
>
> With a LZP preprocessing step (that helps BTW compression too), you'll
> have determined speed...

It's a good and quite usual decision, but

(1) LZP step will take a portion of time;
(2) widely used technique is not optimal (it should (I guess it's possible)
be improved without LZP).

> Okay, it's difficult to get better & faster than comp.exe. Still, I
> think the compression/speed tradeoff is reasonable with DC...

Distance Coding is a fresh and elegant decision. You necessarily should
write a paper.

Regards,