
paper on BWT for source estimation


Dror Baron

Mar 24, 2000
Hi all,

Anybody interested is welcome to take a look at a paper
I recently presented at the CISS conference.
The paper uses the Burrows-Wheeler transform (BWT) to
preprocess a sequence of symbols. Afterwards a block-merging
algorithm estimates states, parameters, etc.

The main contributions are:
1. the block merging estimates the source well asymptotically
2. application to universal coding with lower redundancies than
previous BWT schemes
3. the iid assumption is not used (a weaker assumption is used)

It is downloadable from my website...

Dror


--------------------------
Dror Baron
ECE Department
University of Illinois at Urbana-Champaign(UIUC)
CSRL room 120
Phone: (217)-265-8234 (office)
Website: http://www.uiuc.edu/~dbaron

Dmitry Shkarin

Mar 25, 2000
Hi, Dror!

Note: Really, the BWT has the same asymptotic complexity as PPM - O(d*N),
where d is the average context length.


Dror Baron

Mar 25, 2000
Hi!

Thanks for your interest!
However, your claim is inaccurate. (I'll use LaTeX for math:)

For modelling sources of depth $d$ PPM is indeed $O(d\cdot{N})$
where $N$ is the string length. However, usually $d=O(\log{N})$.
Thus, PPM is typically $O(N\log{N})$.
(NOTE: I got this confirmed by the author of the original
papers on PPM)

The BWT has several variants:
1. Sadakane's method is $O(N\log{N})$ worst case and $O(N\cdot{AML})$
typical case where AML is average matching length,
which is $O(H\cdot\log{N})$ where $H$ is the entropy rate.
2. Suffix trees are $O(N|X|)$ worst case, where $X$ is the alphabet
and $|X|$ its cardinality.
3. Suffix trees with hash tables are better yet...

Therefore, asymptotically $O(\log{N})$ is worse than $|X|$, which is
a constant (not to mention the hash table implementations).

Lastly, should you think that CTW (context tree weighting) is
better, I talked at length with one of the authors last week
and he became convinced that his method is $O(N\log{N})$.
By the way, my paper (www.uiuc.edu/~dbaron and look for it)
discusses these issues briefly with pointers to the literature.

Again, thanks for the interest!
Dror

Dmitry Shkarin

Mar 27, 2000
Hi, Dror!

>
>Thanks for your interest!
>However, your claim is inaccurate. (I'll use LATEX for math:)
Yes, sorry, it was an error: d must be replaced with |X|.
BWT and PPM use the same data structures, so their asymptotic
complexity must be the same.


Vladimir Semenyuk

Mar 27, 2000
Hello, Dror !

> Anybody interested is welcome to take a look at a paper
> I recently showed at the CISS conference.
> The paper uses the Burrows Wheeler transform (BWT) to
> preprocess a sequence of symbols. Afterwards a block-merging
> algorithm estimates states, parameters, etc.

Good article. The approach may prove very useful in practical BW schemes
(I advise everybody who works on BWT to read the paper).
Here are some notes and suggestions:

> The main contributions are:
> 1. the bvlock mergins estimates the source well asymptotically

Can you produce the results for Calgary?

> 2. application to universal coding with lower redundancies than
> previous BWT schemes

Well, but what about practice? K-T estimation is not good for
BW postprocessing, because the transformed sequence usually has a kind of
correlation (it's not memoryless). So, the usual approaches are preferred,
and the merging algorithm here can help to adjust the coding parameters
(statistics reset, etc.). Going this way we may successfully reject
non-optimal MTF coding.

> 3. iid assumption not used (weaker assumption used)

The block-oriented character is the main disadvantage of the presented
technique. We assume that information generation is a nonrecurrent process,
which puts some limitations on the source (from this point of view CTW has a
prime advantage). In addition, I'd like to note that this method, like the
original Burrows-Wheeler algorithm, is not adaptive.

Regards,
Vladimir.

E-mail: seme...@unitel.spb.ru

Dror Baron

Mar 27, 2000
The data structures are definitely similar.
However, in PPM we (asymptotically) model the source as
depth-$\log{N}$, thus at best $\log{N}$ operations per symbol are
necessary. For BWT, nothing is being modelled explicitly.
With suffix arrays we need $|X|$ operations per symbol
(without hash tables).
For practical purposes $|X|$ is pretty big, but I'm sure you
know how that can be overcome.
So... asymptotically PPM is *not* linear complexity whereas
BWT *is* linear.

I suggest you go through references [2],[8],[13],[14], and [15]
in my paper.
Remember that the basic difference is that PPM is explicitly
modelling the source while BWT is only running a sort.
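
To make the "only running a sort" point concrete, here is a minimal
illustrative sketch of the forward BWT (the function name and the sentinel
convention are just for illustration; the naive rotation sort is O(N^2 log N),
and the suffix-array/suffix-tree constructions discussed above are what make
it practical):

    # Forward BWT: sort all rotations of the terminated input and keep the
    # last column of the sorted rotation matrix.
    def bwt(s, sentinel="\0"):
        s = s + sentinel                                   # unique end marker
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rotation[-1] for rotation in rotations)

    print(repr(bwt("banana")))    # 'annb\x00aa'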

Thanks again,
Dror


Dror Baron

Mar 27, 2000
Dear Vladimir,

Thanks for your comments!
I'll try to respond to your suggestions.

> Can you produce the results for Calgary?

I don't have simulation results yet.
However, I believe that for practical purposes
the block lengths will be quite large, so I
would not expect amazing results.
I guess we'll need to be patient.

> Well, but what in practice? K-T estimating is not good for
> BW-postprocessing, because the transformed sequence usually has a kind of
> correlation (it's not memoryless). So, the usual approaches are preferred
> and the merging algorithm here can help to adjust the coding parameters
> (statistics reset, etc). Going this way we may successfully reject
> non-optimal MTF-coding.

Exactly. Instead of running MTF on the BWT output,
we begin by estimating state segments, afterwards we
code each one of those separately.

> Block-oriented character is the main disadvantage of the represented
> technique. We assume that the information birth is nonrecurrent action. It
> put some limitations on the source (from this point CTW has a prime
> advantage). Therewith, I'd like to note that such method as well as original
> Burrows-Wheeler algorithm is not adaptive.

Yes - the block-orientation is limiting.

Thanks again,

Dror Baron

Mar 27, 2000
Hi Vladimir,

We agree that good results on text are not expected.
The fundamental reason is that the algorithm is designed for modelling.
It isn't designed for text. I am sure you realize that I'm a
research-oriented PhD student and not a results-oriented software
developer - the flavor of my results will always be biased for
(what I consider to be) beauty and elegance and less for results.

By the way - I talked with Willems last week and he mentioned
they have some results on a variant of CTW which is about as
good as PPM. So I wouldn't rule anything out.
Last comment: from a theoretical viewpoint (universal coding
redundancies) CTW is extremely good.
BWT output will always be biased by $\log{N}$ per segment (see
Merhav's paper on MDL of piecewise stationary sources).
Therefore CTW is 'born' to be better than BWT.

> > Exactly. Instead of running MTF on the BWT output,
> > we begin by estimating state segments, afterwards we
> > code each one of those separately.
>

> Not quite right. Undoubtedly, there is a correlation between the segments.
> Therewith, separable coding means large statistics saving issues in a case

Theoretically you are correct - see reference [1] from my paper.
For practical purposes (justified by simulation) the correlation
between adjacent segments is negligible.

> BTW Dmitry's PPMD (variant E) packer as a rule take the lead in coding speed
> over BW-packers : )

Again: my emphasis is on theoretical research and not programming.
I am sure Dmitry is 10 times better at programming ;-)

At the end of the day there is an inherent bound on
the complexity of compression algorithms: it is $O(N\log{|X|})$.
Anybody who claims to be better is cheating somewhere.
Nobody can do real $O(N)$. Current BW algorithms are $O(N|X|)$.

Dror


Vladimir Semenyuk

Mar 28, 2000
Hello, Dror !

> > Can you produce the results for Calgary?
>
> I don't have simulation results yet.
> However, I believe that for practical purposes
> the block lengths will be quite large, so I
> would not expect amazing results.
> I guess we'll need to be patient.

I have serious doubts about amazing results:

(1) for BW (here and later the Burrows-Wheeler method): as far as I know,
Michael Schindler's SZIP does a similar thing in some aspects, but the
results are not very impressive;
(2) for CTW: well, your technique may be good in speed (especially if a radix
sort is in use (I mean a partial context sort)), but what about the ratio? As
I said, it's not adaptive, and the estimate is not as clever as in CTW. So,
the compression effect may be poor for this class of algorithms;
(3) big trouble: the need to create a byte-oriented version. It's not easy.
One way is doing something like Tjalkens, Volf and Willems described in "A
Context-Tree Weighting Method for Text Generating Sources". This approach
looks awkward. Unfortunately, for today I don't see another way.

> > Well, but what in practice? K-T estimating is not good for
> > BW-postprocessing, because the transformed sequence usually has a kind of
> > correlation (it's not memoryless). So, the usual approaches are preferred
> > and the merging algorithm here can help to adjust the coding parameters
> > (statistics reset, etc). Going this way we may successfully reject
> > non-optimal MTF-coding.
>

> Exactly. Instead of running MTF on the BWT output,
> we begin by estimating state segments, afterwards we
> code each one of those separately.

Not quite right. Undoubtedly, there is a correlation between the segments.
In addition, separate coding means large statistics-saving issues in the case
of a non-adaptive method. So, I guess postprocessing should be adaptive and
the adaptation must depend on segment boundaries. BTW, the separation can be
made during sequential coding after the boundary is reached (an exact
boundary hit is not so important). I mean a coder and decoder that
synchronously analyze the data behavior.


[from the other discussion]


>However, in PPM we (asymptotically) model the source as
>depth-$\log{N}$ thus at best $\log{N}$ operations per symbol are
>necessary. For BWT, nothing is being modelled explicitly.
>We need (suffix arrays) $|X|$ (without hash tables)
>operations per symbol.

Modern PPM algorithms don't require hashing. PATRICIA structures are often
used instead.

> For practical purposes $|X|$ is pretty big, but I'm sure you
> know how that can be overcome.
> So... asymptotically PPM is *not* linear complexity whereas
> BWT *is* linear.

BTW, Dmitry's PPMD (variant E) packer as a rule takes the lead in coding speed
over BW packers : )

Regards,
Vladimir.

E-mail: seme...@unitel.spb.ru

Malcolm Taylor

Mar 28, 2000
Dror Baron <dba...@grenada.csl.uiuc.edu> wrote:
>> BTW Dmitry's PPMD (variant E) packer as a rule take the lead in coding speed
>> over BW-packers : )
>
>Again: my emphasis is on theoretical research and not programming.
>I am sure Dmitry is 10 times better at programming ;-)

I think the point must be made that Dmitry's PPMD uses a suffix tree
approach which gives rise to O(N) complexity (the point being that the maximum
order and alphabet size are constant and so should be ignored :). This
is better than the O(N*log(N)) you describe for PPM algorithms in
general.

The O(N) that Dmitry achieves, makes use of the coherence between
adjacent contexts, and as such could also be applied to CTW models as
they use similar structures. This would mean that you could create a
CTW that has O(N) complexity as well.

ie. both PPM and CTW are able to achieve the lower bound in complexity
that you quote for 'any' compression algorithm.

Having said that, the current crop of CTW algorithms and PPM algorithm
implementations (not including Dmitry's PPMD, that I know of) are
O(N*logN) (eg. RK uses an O(N*logN) implementation of PPM because the
extra complexity and memory saving techniques in my PPMZ destroy the
coherence needed for an O(N) implementation).

Also of note is that complexity analysis is only useful
asymptotically. The classic example is that bubble sort is faster than
quick sort for small sets, since the constant per 'operation' is much
lower than quick sort's. So a comparison of algorithms should take into
account the differences in constant time per 'operation' (eg. CTW uses
binary decomposition, giving rise to at least 8 times the work of PPM).

Just my few points :)

Malcolm

Vladimir Semenyuk

Mar 28, 2000
Hello, Dror !

> By the way - I talked with Willems last week and he mentioned
> they have some results on a variant of CTW which is about as
> good as PPM.

Note: I have nothing against CTW. The compression results are really great. But,
truth be told, because of the slow decoding the method will never be widely used
in practice (just like PPM, DMC, PSM). CTW remains nothing but an admirable
theoretical topic.

> > > Exactly. Instead of running MTF on the BWT output,
> > > we begin by estimating state segments, afterwards we
> > > code each one of those separately.
> >
> > Not quite right. Undoubtedly, there is a correlation between the segments.
> > Therewith, separable coding means large statistics saving issues in a case
>

> Theoretically you are correct - see reference [1] from my paper.
> For practical purposes (justified by simulation) the correlation
> between adjacent segments is negligible.

I don't agree. Saying that, you imply a particular class of sources. The
number of counter-examples I know allows one to assume that
correlation exists and we can't leave it out of account, especially if we
need a clever mechanism for storing statistical information in the case
of a 256-symbol alphabet.
Regards,
Vladimir.

E-mail: seme...@unitel.spb.ru

Dror Baron

Mar 28, 2000
> Don't agree. Saying that you imply the particular class of sources. The
> number of counter-examples I know allows to make an assumption that
> correlation exists and we can't leave it out of account, especially if we
> are in need of a clever statistical information storing mechanism in a case
> of 256-symbol alphabet.

Obviously if you showed me some of these examples I would be
extremely grateful.

Dror


Dror Baron

Mar 28, 2000
Hi Malcolm,

The maximum order grows with $N$, so I don't agree with that approach.
Alphabet size - that's fine.
Theoretically, as one of the guys in this correspondence pointed out,
there is no reason to attain the $O(N\log{|X|})$ bound once we can
do $O(N|X|)$, because we can 'go down' to smaller alphabets.

> ie. both PPM and CTW are able to achieve the lower bound in complexity
> that you quote for 'any' compression algorithm.

This lower bound is simply the length in bits of the original
(uncompressed) version. I'm sure you agree we can't do better than
O(length of the original).
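Spelled out, that is just the cost of reading the input: the original file is
$N$ symbols of $\lceil\log_2{|X|}\rceil$ bits each, i.e.
$N\cdot\lceil\log_2{|X|}\rceil = O(N\log{|X|})$ bits, so no compressor can run
in time sublinear in the bit length of its own input.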

> Also of note is that complexity analysis is only useful
> asymptotically. The classic example is that bubble sort is faster than
> quick sort for small sets since the constant per 'operation' is much
> lower than quick sort. So a comparison of algorithms should take into
> account the differences in constant time per 'operation' (eg. CTW uses
> binary decomposition giving rise to at least 8 times the work of PPM).

Of course ;-)
Although I usually use the 'sort' in MATLAB which is
'other people's code' ;-)

Dror

Dror Baron

Mar 28, 2000
You propose compressing block by block.
This is drastically different from the proposed scheme.

> I agree. As a general rule BWT can not be better than PPM compressionwise; this
> was discussed at DCC98 (not in the program, but among a few people). The reason is
> mainly that the BWT uses wrong statistics for the beginning of new
> contexts/buckets/whatever.
> This is also the reason for improved compression for bigger blocks - the loss due to
> adaption is (nearly) the same for a big and a small block, as the number of different
> contexts is about the same. For the big block the percentage of such badly compressed
> data will simply be smaller.
>
> I did a few experiments and these are approximate data:
> if you take a big text (maybe out of the Canterbury large file corpus) and run
> statistics individually for the first few bytes of a context/bucket/whatever and for
> the remainder, you will notice that the compression in the beginning is very low
> (3 bytes/char or more for the first few chars) while it then becomes good.
> However for large blocks (above 1 or 2 MB) the fraction of these badly compressed
> bytes is low. So do not expect too much from larger blocksizes. I have szip versions
> capable of handling blocks up to 2^31-1, but I never really tested above 10MB or so.
> My rule of thumb: limit(size) = 2*(size(using 4MB blocks) - size(using 2MB blocks)).


>
> >(1) for BW (here and later Burrows-Wheeler method): as far as I know,
> >Michael Shindler's SZIP makes a similar thing in some aspects, but the
> >results are not very impressive;
>
> szip does a partial context sort, the context size is selectable by the -o
> switch. Try -o4 for speed, which will do two base 2^16 LSB first radix sorts
> between writing "processing..." and "coding..." (-v1 to see it).
> The time before "processing" is reading the input; between "coding" and "done" it is
> doing the modelling and output. szip does no asynchronous IO; experiments indicate
> that when reading and sorting in separate threads, speed for -o4 would double -
> is anyone interested in this?
>
> The szip version available does no context border estimation. I found that
> the 2% smaller size was not worth 20% more CPU time. I did such experiments, so these
> 2% smaller for files on my harddisk are tested. Szip has other places too where it
> prefers speed over compression.
>
> The main benefit of limiting sort order is that you could easily do BWT in
> hardware - MTF is not a real problem, and neither are entropy coders in hardware.
>
> There are ways to get better compression out of BWT than currently available.
>
> michael
>
> [snipped]


DI Michael Schindler

Mar 29, 2000
hi!

I'd like to add my 0.02 euro to this discussion:

Vladimir Semenyuk wrote in message <8bof54$m56$1...@news.runnet.ru>...


>Hello, Dror !
>
>> > Can you produce the results for Calgary?
>>
>> I don't have simulation results yet.
>> However, I believe that for practical purposes
>> the block lengths will be quite large, so I
>> would not expect amazing results.
>> I guess we'll need to be patient.
>
>I have a serious doubt about amazing results:

I agree. As a general rule BWT can not be better than PPM compressionwise;

Dmitry Shkarin

Mar 29, 2000
Hi, Dror!

>The data structures are definitely similar.
>However, in PPM we (asymptotically) model the source as
>depth-$\log{N}$ thus at best $\log{N}$ operations per symbol are
>necessary. For BWT, nothing is being modelled explicitly.
>We need (suffix arrays) $|X|$ (without hash tables)
>operations per symbol.

Let's divide into cases:

1. Finite-order PPM.
It is easy: after statistics gathering (all possible states and all
possible transitions are created), it works similarly to an ordinary order-0
arithmetic coder with cost proportional to |X| per symbol, so PPM behavior
is O(|X|*N).

2. Unbounded PPM*.
It is more complicated: the number of operations per context is again bounded
by |X|, and the number of scanned contexts (number of escapes) is less than or
equal to |X| too, because we can visit only those nodes that contain different
numbers of transitions, and their number is limited by |X|. Tree nodes are added
in linear time, therefore PPM* behavior is O(|X|*|X|*N).
Hmm, it is slightly worse than my previous claim ;-).

3. CTW (bounded).
It is similar to PPM: finite model => finite time for each symbol.

4. CTW (unbounded - does it exist?).
It scans the tree in depth for each symbol, and the tree depth is proportional
to log(N), so it is really O(N*log(N)).


Vladimir Semenyuk

Mar 29, 2000
Hi, Michael !

>>(1) for BW (here and later Burrows-Wheeler method): as far as I know,
>>Michael Shindler's SZIP makes a similar thing in some aspects, but the
>>results are not very impressive;

> szip does a partial context sort, the context size is selectable by the -o
> switch. Try -o4 for speed, which will do two base 2^16 LSB first radix sorts
> between writing "processing..." and "coding..." (-v1 to see it)

I knew this ;-)

> The szip version available does no context border estimation. I found that
> the 2% smaller size was not worth 20% more CPU time. I did such experiments,
> so these 2% smaller for files on my harddisk are tested. Szip has other
> places too where it prefers speed over compression.

I am entirely with you on this. A few percent is not an amazing result.
By "a similar thing in some aspects" I meant only the partial
renunciation of MTF. I may be mistaken, but it seems to me that SZIP
sometimes doesn't use MTF. Is that right?

Regards,
Vladimir.

E-mail: seme...@unitel.spb.ru

Dror Baron

Mar 29, 2000
Hi Dmitry,

Excellent - it seems that our understanding is converging.
There are nice $O(N|X|)$ algorithms, and they lead to $O(N\log{|X|})$.
Anyway BWT is unbounded and $O(N|X|)$ so it is fine.

By the way - CTW unbounded does indeed exist.
Willems told me that he has some paper on this.
I'll send you the reference when I come across it...
(Busy month)

Dror

Dmitry Shkarin

Mar 30, 2000
Hi, Malcolm!

>Also of note is that complexity analysis is only useful
>asymptotically. The classic example is that bubble sort is faster than
>quick sort for small sets since the constant per 'operation' is much
>lower than quick sort. So a comparison of algorithms should take into
>account the differences in constant time per 'operation' (eg. CTW uses
>binary decomposition giving rise to at least 8 times the work of PPM).
I agree with you that asymptotic and real behaviors are two very
different things, and here is a more topical example: the standard blocksorting
scheme (BWT + MTF + entropy coding) is asymptotically nonoptimal, but it is
successfully used by modern compressors ;-).

mikael.l...@spray.se

Mar 30, 2000
Hi all!
I'm pleased to see that there are people still working to improve bwt.
Bwt still has a lot of room for that.
I just wish that you could have included more examples and results!

Ppm and ctw etc. are mainly interesting from an academic point of view.
They are simply too slow and memory-demanding for any real practical use.
Someone said
"We don't need more powerful computers, we need better algorithms!"
To see parts of what is possible with bwt, visit
http://www.mathematik.uni-bielefeld.de/~bernhard/

I'm working to improve bwt too and my current
results are very similar to these.
I think it's possible to do better than Balkenhol
but it's getting more difficult.
The argument that better compression equals slower
execution is simply not true! (most of the time - lazy programmers?)

You ppm experts, perhaps you should use your expert
knowledge and energy to develop bwt instead?

Mikael




Binder Edgar

Mar 30, 2000

Hi,

mikael.l...@spray.se wrote:

> Ppm and ctw etc. is mainly interesting from an academic point of view.
> It's simply too slow and memory demanding for any real practical use.

Hmm, I don't agree here. PPMD (the compressor) shows that PPM can be
quite fast. I suspect that it's the same with CTW, but probably most of the
people who work on PPM/CTW don't value speed/memory requirements as much as
compression ratios...

> I think it's possible to do better than Balkenhol
> but it's getting more difficult.

Well, my BWT results are a bit better than Balkenhol's, but I use a
completely different modelling technique...

> The argument that better compression equals slower
> execution is simply not true! (most of the time, lazy programmers?)

:-)

--
Edgar


Bernhard Balkenhol

Apr 6, 2000
Dror Baron wrote:
> You propose compressing block by block.
> This is drastically different from the proposed scheme.

> I agree. As a general rule BWT can not be better than PPM compressionwise; this
> was discussed at DCC98 (not in the program, but among a few people). The reason is
> mainly that the BWT uses wrong statistics for the beginning of new
> contexts/buckets/whatever.
> This is also the reason for improved compression for bigger blocks - the loss due to
> adaption is (nearly) the same for a big and a small block, as the number of different
> contexts is about the same. For the big block the percentage of such badly compressed
> data will simply be smaller.
 

OK, I was also one of the people who stated that. But now I am no longer sure
about that. It can be shown that you can think about the BWT method as a generalisation
of PPM because
   1) You have to consider models where a context is a suffix of another context
   2) In fact you can estimate the changes of the real context and then work as in PPM

My implementation gives (file, original size in bytes, compressed size in bytes, bits/char):
         bib   111261    26469  1.903
       book1   768771   218023  2.269
       book2   610856   149734  1.961
         geo   102400    53221  4.158
        news   377109   114097  2.420
        obj1    21504    10020  3.728
        obj2   246814    74284  2.408
      paper1    53161    16003  2.408
      paper2    82199    24186  2.354
         pic   513216    46131  0.719
       progc    39611    12108  2.445
       progl    71646    15028  1.678
       progp    49379    10377  1.681
       trans    93695    17106  1.461
AVERAGE  14 :           786787  2.257

on the files of the Calgary Corpus,
and that is possible with complexity similar to gzip (gzip, not bzip or something else).

Everybody who is interested can find the description in a report at
http://www.mathematik.uni-bielefeld.de/~bernhard

The reports are under the link "Preprints im SFB",
and the mentioned one is joint work with Yuri M. Shtarkov:
"One Attempt of a Compression Algorithm based on the BWT"

That is still not the best I can achieve; for the next version, which I would like
to describe, I estimate something like 2.20 on average (hopefully).
Therefore, as a conclusion, at least for myself it is no longer clear that we cannot
do better than PPM or CTW.
 
 

> .....

> >(1) for BW (here and later Burrows-Wheeler method): as far as I know,
> >Michael Shindler's SZIP makes a similar thing in some aspects, but the
> >results are not very impressive;
 

What do you think about the results above?
 
> ....

> There are ways to get better compression out of BWT than currently
> available.
 

As you can see above.
 
>
> michael
>
> [snipped]
>
With my best regards
    Bernhard


-- 
---------------------------------------------------------------------------
Dr. Bernhard Balkenhol        e-Mail bern...@mathematik.uni-bielefeld.de
Fakultaet fuer Mathematik     Tel. +0049-521-106-3872
Universitaet Bielefeld       
Postfach 10 01 31             Fax  +0049-521-106-4743
D - 33501 Bielefeld 
Germany
 

Bernhard Balkenhol

Apr 6, 2000
 
Hi,

mikael.l...@spray.se wrote:

> Hi all!
> I'm pleased to see that there are people still working to improve bwt.
> Bwt still has a lot of room for that.
> I just wish that you could have included more examples and results!
> ...
> I'm working to improve bwt too and my current
> results are very similar to these.
> I think it's possible to do better than Balkenhol
> but it's getting more difficult.

No, we can do better and faster at the same time, as I do myself.

> The argument that better compression equals slower
> execution is simply not true! (most of the time - lazy programmers?)
>
> You ppm experts, perhaps you should use your expert
> knowledge and energy to develop bwt instead?

The sorting can be done in between 1/4 and 1/2 of the time of Bentley-Sedgewick, with
similar memory requirements but without such terrible worst cases as pic. This
can be done with the same algorithm in time complexity smaller than with suffix trees.
As soon as I have described it I will let you know.

With regards
  Bernhard
---------------------------------------------------------------------------
Dr. Bernhard Balkenhol        e-Mail bern...@mathematik.uni-bielefeld.de

Vladimir Semenyuk

Apr 6, 2000
Hello, Bernhard !

> OK, I was also one of the people who stated that. But now
> I am no longer sure about that. It can be shown that you can
> think about the BWT method as a generalisation of PPM because
>
> 1) You have to consider models where a context is a suffix of another
> context
> 2) In fact you can estimate the changes of the real context and then
> work as in PPM

I have objections. First of all, let's distinguish two cases:

(1) We use the original Burrows-Wheeler approach. It means: making a sorting
(i.e. a simple x^n -> y^n mapping as in your papers), forgetting about contexts (!),
coding.

Advantages:
(a) Fast modeling => fast compression.
(b) Very fast decompression => this method will be used in practice in
contrast to any prediction technique (i.e. statistical methods like PPM).

Disadvantages:
(a) A kind of abstraction from real model due to context forgetting =>
unavoidable loss of efficiency (even with estimating techniques).
(b) Block nature => loss of adaptivity => possible loss of efficiency.

(2) We make context sorting and the result is used to facilitate the
modeling. It can be useful in dictionary and statistical methods.

Advantages:
(a) Fast and simple modeling => possibility of making good results with low
complexity.

Disadvantages:
(a) It speeds up the compression, but the complexity of decompression
stays the same.
(b) It is not a new method (not BW). Just a sorting.

So, if you mean the first case, in general it won't guarantee the toughest
compression (see Disadvantages); the second approach has no connection
to the Burrows-Wheeler method.

And now I'll show the same from the point of practice:

> My implementation gives
> bib 111261 26469 1.903
> book1 768771 218023 2.269
> book2 610856 149734 1.961
> geo 102400 53221 4.158
> news 377109 114097 2.420
> obj1 21504 10020 3.728
> obj2 246814 74284 2.408
> paper1 53161 16003 2.408
> paper2 82199 24186 2.354
> pic 513216 46131 0.719
> progc 39611 12108 2.445
> progl 71646 15028 1.678
> progp 49379 10377 1.681
> trans 93695 17106 1.461

Results for PPMD(E) by Dmitry Shkarin ("-o5" option):

bib 1.77
book1 2.22
book2 1.90
geo 4.46
news 2.28
obj1 3.60
obj2 2.30
paper1 2.25
paper2 2.22
pic 0.79
progc 2.28
progl 1.60
progp 1.62
trans 1.37

And speed was really impressive here. (PPMD usually passes ahead of
BW-based programs (I mean coding+decoding time).)

Conclusion: BW is interesting only from the point of view of speed. So, further
sorting improvements are welcome. Greedy coding is not so important (the
slowdown isn't worth it). That's my standpoint.

Regards,
Vladimir.

E-mail: seme...@unitel.spb.ru

Binder Edgar

Apr 6, 2000
Hi,

Vladimir Semenyuk wrote:

[Bernhard: BWT as PPM generalisation]

> I have the objections. First of all, let's disjoint two cases:
>
> (1) We use an original Burrow-Wheeler approach. It means: making a sorting
> (i. e. simple x^n->y^n as in your papers), forgetting about contexts (!),
> coding.
>
> Advantages:
> (a) Fast modeling => fast compression.
> (b) Very fast decompression => this method will be used in practice in
> contrast to any prediction technique (i.e. statistical methods like PPM).

You forgot (c), Context blending and (d), BW doesn't need very much memory.

> Disadvantages:
> (a) A kind of abstraction from real model due to context forgetting =>
> unavoidable loss of efficiency (even with estimating techniques).
> (b) Block nature => loss of adaptivity => possible loss of efficiency.

Well, this is correct...

> So, if you mean the first case, in general it won't guarantee toughest
> compression (see Disadvantages); second approach has no connection
> to Burrows-Wheeler method.
>
> And now I'll show the same from the point of practice:

[results comparision]

> And speed was really impressive here. (PPMD usually passes ahead of
> BW-based programs (I mean coding+decoding time).)
>
> Conclusion: BW is interesting only from the point of speed. So, further
> sorting improvements are welcome. Greedy coding is not so important (slowing
> down doesn't worth it). It's my standpoint.

Hmm, it isn't necessary for BW to get slower with better modeling. It's
possible to improve both compression and speed...

--
Edgar

mikael.l...@spray.se

Apr 7, 2000
Hi!

What makes bwt worth the effort is that
1. It's relatively new and unexplored (1994) compared to ppm.
2. When Burrows and Wheeler released the idea to the world,
it could compress the Calgary corpus to 2.40 bpc on average.
Today we can compress the same using bwt to 2.26 bpc, and that's not the
limit.
3. As Balkenhol is saying, bwt is close to ppm in what it's doing and how,
but faster and much more efficient with memory, hinting that it has
not reached its true potential.

If we say it's impossible from the beginning, we will never discover
or gain anything. (Man would never have walked on the moon either!)

Mikael


kaptke...@my-deja.com

Apr 7, 2000

> What makes bwt worth the effort is that
> 1. It's relatively new and unexplored (1994) compared to ppm.
> 2. When Burrows and Wheeler released the idea to the world,
> it could compress Calgary corpus to 2.40 bpc average.
> Today we can compress the same using bwt to 2.26 bpc and it's not a
> limit.
> 3. As Balkenhol is saying, bwt is close to ppm in what it's doing and how,
> but faster and much more efficient with memory, hinting that it has
> not reached its true potential.
>
> If we say it's impossible from the beginning, we will never discover
> or gain anything. (Man had never walked on the moon either!)
>
> Mikael


This is exactly the attitude that is needed for any great breakthrough
to occur. There are so many new and novel concepts,
such as BWT, which will only come into existence when we push to
discover them. The focus here seems to be exclusively on the modeling
improvements to BWT; however, my work was firstly to build a better
coder for use with the BWT. Here, I have made many such improvements
which will further speed up the overall BWT performance (although I
realize that the real bottleneck in speed is not the coder but the sort
transform - however, every cycle counts). My coder achieves just over
2.4 b/b on average for the Calgary Corpus without any special modeling
other than the generic BWT. There is no MTF, no RLE and no other
alterations to the output from BWT. Moreover, there is no arithmetic
coding, so the whole encoding process is far faster. And decoding is
even faster still. Now, I need to start working on the more advanced
improvements to the basic BWT, as is the focus here. If only I had
more time....

-michael maniscalco

Vladimir Semenjuk

Apr 7, 2000
Hi, Edgar !

> You forgot (c), Context blending and (d), BW doesn't need very much
> memory.

Context blending - this is not an advantage here, because it (blending)
could be done in a better way. BTW, the BW method doesn't actually make context
blending (we lose the real contexts) - just a source transformation.
Memory: you mean 5N+alpha, but it doesn't provide deterministic speed
complexity.

> Hmm, it isn't neccesary for BW to get slower with better modeling. It's
> possible to improve both compression and speed...

Better and faster than comp.exe by Michael? If you can, show me how. For
example, your distance coding still has a lack of speed.

Regards,
Vladimir.

Vladimir Semenjuk

Apr 7, 2000
Hi, Mikael !

> If we say it's impossible from the beginning, we will never discover
> or gain anything. (Man had never walked on the moon either!)

I didn't mean that someone should stop working. BW is great and useful. I
only pointed to the direction (sorting improvement) which, in my opinion, may
lead to success.

Regards,
Vladimir.

E-mail: seme...@unitel.spb.ru

Bernhard Balkenhol

Apr 7, 2000
Hello, all of you.

Vladimir Semenjuk wrote:
> Hi, Edgar !
> ....
> Better and faster than comp.exe by Michael? If you can, show me how. For
> example, your distance coding still has a lack of speed.

Yes, I can do it even faster than gzip, but I have not described it yet.
You can do the whole encoding for not much more than the cost of the decoding.
It is even faster than Michael's restricted sorting, and I still produce a total order.

And I have compared the compression rates to PPMD(E) D=5 as mentioned by Jan Aberg:
1) But there the results look slightly different.
2) As I have already mentioned, that is not the best I can achieve.
3) It is much slower than gzip and therefore also slower than mine.

And what I mean by generalisation is the answer to the question of why Moffat's idea -
to count the occurring symbols only once and not really the frequencies (going down
in the tree) - not only improves speed but also improves the compression rate.

Regards,
  Bernhard

Binder Edgar

Apr 7, 2000
Hi,

Vladimir Semenjuk wrote:

> > You forgot (c), Context blending and (d), BW doesn't need very much
> memory.
>
> Context blending - this is not an advantage here, because it (blending)
> could be made in a better way. BTW BW method doesn't actually makes context
> blending (we lose the real contexts). Just source transformation.

That's not necessarily true. Contexts *can* be recovered after a BWT, during
the encoding phase (although not all of them, and unlikely very large
contexts)... Still, to be fair, I have to admit that the speed drop and
increased memory usage make recovering quite impractical - and so far it has
only brought relatively small compression improvements in my tests.

> Memory: you mean 5N+alpha, but it doesn't provide determined speed
> complexity.

With an LZP preprocessing step (which helps BWT compression too), you'll have
deterministic speed...

> > Hmm, it isn't neccesary for BW to get slower with better modeling. It's
> > possible to improve both compression and speed...
>

> Better and faster than comp.exe by Michael? If you can, show me how. For
> example, your distance coding still has a lack of speed.

Okay, it's difficult to get better & faster than comp.exe. Still, I think
the compression/speed tradeoff is reasonable with DC...

--
Edgar

Vladimir Semenyuk

Apr 7, 2000
Hi, Bernhard !

>> Better and faster than comp.exe by Michael? If you can, show me how. For
>> example, your distance coding still has a lack of speed.

> Yes, I can do it even faster than gzip, but I have not described it yet.
> You can do the whole encoding for not much more than the cost of the decoding.
> It is even faster than Michael's restricted sorting, and I still produce a
> total order.

Note: comp.exe by Michael Schindler doesn't do sorting.

>And I have compared the compression rates to PPMD(E) D=5 as mentioned by Jan Aberg
>1) But there the results look slieghtly different

You took the wrong program :-)
Read part of my recent message again:

VS> Results for PPMD(E) by Dmitry Shkarin ("-o5" option).

Test it. You can download it for free (a pointer to PPMD(E) can be found at
www.act.by.net).

>2) As I have already mentioned, that is not the best I can achieve.
>3) It is much slower than gzip and therefore also slower than mine.

Please compare your block-sorter with IMP using the "-1" option (an LZ-based
method). The console version is freeware. GZIP is old and too slow.

Regards,
Vladimir.

E-mail: seme...@unitel.spb.ru

DI Michael Schindler

Apr 7, 2000
hi!

I fully believe Bernhard when he says his programs are faster and
compress better than szip. If szip had not started as a demo
to show that limited sorting works at all, I would have done several
things differently.

szip modelling is still based on the modelling designed back in '96 -
maybe it's about time to exploit the free dictionary approach you get
during the sort and a few other things, as well as commenting out the
#define EXTRAFAST line in the rangecod.c used inside szip,
as this costs some compression.
The IO of the compressed data should also change to
something more reasonable than putchar/getchar - something like
collecting it in memory and outputting the whole block once 64 kB are there.
I know putchar does this, but I could shave off some cycles there.
Program counter sampling shows that IO takes a large fraction of
the time. Also, CPU time is just about 1/2 of real time on an idle
system, so I could win by doing IO and sorting in parallel.

I should also tune the modelling more towards small
files and not to the default blocksize of 1.7MB to perform better on the
Calgary corpus, and maybe focus more on text too :)


And maybe I should use a Huffman coder with 20MB/sec throughput
instead of a rangecoder. Or mixed coders.

But then: all of the above would break compatibility, and would mean I'd
have to spend some time on it.

When comparing with my restricted sorting: compare with order 4
(the -o4 switch), as the other orders are really slow :(

Being faster than gzip -9 or even -6 is no miracle; use -1 (or the program
imp) for a test. Or if one likes a real challenge: beat LZOP (by Markus
Oberhumer) speedwise.


As for counting occurred symbols (and not frequencies): szip has done this
for 2 years, so it is probably an old idea. Most of the time it just counts
the symbols of the last 32 runs - which implies that the
probability is never more accurate than 1/32. Nevertheless it gives
acceptable compression ;)

regards
Michael
--
DI Michael Schindler Compression Consulting
http://www.compressconsult.com/ mic...@compressconsult.com


Bernhard Balkenhol wrote in message
<38EDF352...@mathematik.uni-bielefeld.de>...


>Yes, I can do it even faster than gzip, but I have not described it yet.
>You can do the whole encoding for not much more than the cost of the decoding.
>It is even faster than Michael's restricted sorting, and I still produce a
>total order.

>And I have compared the compression rates to PPMD(E) D=5 as mentioned by Jan
>Aberg:
>1) But there the results look slightly different.

>2) As I have already mentioned, that is not the best I can achieve.
>3) It is much slower than gzip and therefore also slower than mine.
>

Vladimir Semenyuk

Apr 8, 2000
Hi, Edgar and All !

> > Context blending - this is not an advantage here, because it (blending)
> > could be made in a better way. BTW BW method doesn't actually makes
> > context
> > blending (we lose the real contexts). Just source transformation.

> That's not necessary true. Contexts *can* be recovered after a BWT,
> during the encoding phase (although not all of them and unlikely very
> large contexts)...

Look. We usually make the BWT by sorting the real contexts. Then we sort of forget
about the reality. Now you propose making the recovery. It looks really
strange. Maybe it would be better not to do the BWT at all. Just try PPM, CTW
or Associative Coding. For speeding up, use sorting, but don't make the BWT. Be
sure you will get much more.

*******************************************************
For everyone entrained into this discussion:

I am trying to show that we shouldn't improve BWT to the point where it becomes
a variation of some well-known prediction technique. Doing so, we lose the charm
of the original Burrows-Wheeler algorithm. Well, then just don't name it BWT.

*******************************************************

> > Memory: you mean 5N+alpha, but it doesn't provide determined speed
> > complexity.
>
> With a LZP preprocessing step (that helps BTW compression too), you'll
> have determined speed...

It's a good and quite usual decision, but

(1) the LZP step will take a portion of time;
(2) the widely used technique is not optimal (it should - and I guess it's
possible - be improved without LZP).

> Okay, it's difficult to get better & faster than comp.exe. Still, I
> think the compression/speed tradeoff is reasonable with DC...

Distance Coding is a fresh and elegant solution. You should definitely
write a paper.

Regards,
Vladimir.

E-mail: seme...@unitel.spb.ru

mikael.l...@spray.se

Apr 8, 2000
Bernhard,
ALL improvements to the bwt implementation are interesting,
but making it faster than gzip or szip's limited sort is *easy*.
My implementation, ba, compresses the Calgary corpus about as fast as szip -o8,
depending on the machine, and it's still making a full sort and is based on
the *old* CACM arithmetic coding routines.
So, from my point of view there are a few things a dedicated scientist
like you could and should do better than an engineer:
1. A bwt implementation should not really need more than 5n + C memory,
where C is a constant (buckets, hashtables etc.).
A better data structure and algorithm *combined* with Sedgewick's
faster multikey quicksort should be possible (for the repeats) - see the
sketch after this list for the basic idea of that sort.
2. The most interesting part to work on is compression ratios, of course:
building better models, finding patterns and exploiting them as
efficiently as possible using probabilities, or rules if you like.
The idea about ternary sequences - maybe it's possible to code it even
more efficiently?
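
As a rough illustration of the splitting idea behind that sort - a toy sketch
on a list of separate strings, not anybody's actual BWT sorter (a real one
works on the suffixes/contexts of a single buffer and avoids the copying done
here):

    # Ternary-split ("multikey") quicksort in the Bentley-Sedgewick style:
    # partition on one character position, and recurse on the next position
    # only for the strings that tied on this one.
    def multikey_sort(strings, depth=0):
        if len(strings) <= 1:
            return strings
        def key(s):
            # character at the current depth, or "" once the string is exhausted
            return s[depth] if depth < len(s) else ""
        pivot = key(strings[len(strings) // 2])
        less    = [s for s in strings if key(s) <  pivot]
        equal   = [s for s in strings if key(s) == pivot]
        greater = [s for s in strings if key(s) >  pivot]
        if pivot != "":
            equal = multikey_sort(equal, depth + 1)   # tied strings: compare next char
        return multikey_sort(less, depth) + equal + multikey_sort(greater, depth)

    print(multikey_sort(["banana", "band", "bat", "an", "a"]))
    # ['a', 'an', 'banana', 'band', 'bat']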

This is what I think you should put your focus on.
Again, it would be nice if you could move the source code to your ftp
server!

Regards,
Mikael



Binder Edgar

Apr 8, 2000
Hi,

Vladimir Semenyuk wrote:

> > That's not necessary true. Contexts *can* be recovered after a BWT,
> > during the encoding phase (although not all of them and unlikely very
> > large contexts)...
>
> Look. We usually make BWT by sorting the real contexts. Then we as if forget
> about the reality. Now you propose making the recovery. It looks really
> strange. May be it would be better not to do BWT at all. Just try PPM, CTW

Okay, you have a point here...

> or Associative Coding. For speeding-up use sorting, but don't make BWT. Be
> sure you will get much more.

But it just happens that I like BWT :-) Well, maybe after I'm finished
implementing all that I have in mind for DC I'll start with CTW...

> > With a LZP preprocessing step (that helps BTW compression too), you'll
> > have determined speed...
>
> It's a good and quite usual decision, but
>
> (1) LZP step will take a portion of time;

Which, if well implemented, will be compensated for by faster sorting.

> (2) widely used technique is not optimal (it should (I guess it's possible)
> be improved without LZP).

Yes, it's not optimal - but not very bad either. If BWT compression on files
with long matches can get better & faster without LZP I would like to hear about
that method.

> > Okay, it's difficult to get better & faster than comp.exe. Still, I
> > think the compression/speed tradeoff is reasonable with DC...
>
> Distance Coding is a fresh and elegant decision. You necessarily should
> write a paper.

Well, I've been planning to write a paper for a while now, but I've never found
the time to do it. Hopefully in the near future...

--
Edgar


Bernhard Balkenhol

Apr 10, 2000
Vladimir Semenyuk wrote:
> Hi, Bernhard !

Hi, Vladimir

> .....
>
> VS> Results for PPMD(E) by Dmitry Shkarin ("-o5" option).
>
> Test it. You can download it for free (a pointer to PPMD(E) can be found at
> www.act.by.net).

Thank you, I will do it.

 
Please, compare your block-sorter with IMP using "-1" option (LZ-based
method).
Console version is freeware. GZIP is old and too slow.
 

I agree.

By the way, on the site you have mentioned, notice that the calculation of the average
compression rates on that site is different from the one in the publications and in fact not fair.

For each file:
  rate = compressed size (in bytes) * 8 / original size (in bytes)

For the average:
  sum of rates / number of files

The reason is that otherwise you double the influence of the long files.
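
As a small illustration of that calculation (a sketch only; the three
(original, compressed) byte counts are the first rows of my table above):

    # Unweighted average rate: every file contributes equally, so long files
    # do not dominate the corpus average.
    files = {"bib": (111261, 26469), "book1": (768771, 218023), "book2": (610856, 149734)}

    rates = {name: 8.0 * comp / orig for name, (orig, comp) in files.items()}
    average = sum(rates.values()) / len(rates)
    for name, rate in rates.items():
        print(name, round(rate, 3))          # bib 1.903, book1 2.269, book2 1.961
    print("average", round(average, 3))      # average 2.044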

Regards,
  Bernhard
 

 

Vladimir Semenyuk

Apr 10, 2000
Hello, Bernhard !

> By the way, on the site you have mentioned, notice that the calculation
> of the average compression rates on that site is different from the one in
> the publications and in fact not fair.

Fully agree! It's a plain fact that Calgary and Canterbury
are too far from reality ;-)

Regards,
Vladimir.

E-mail: seme...@unitel.spb.ru

alex

Apr 11, 2000

Lots of discussion on BWT. Actually, is the algorithm asymptotically
optimal? If yes, how can one prove that?

mikael.l...@spray.se

Apr 11, 2000
Hi!
There is *lots* of information if you just bother to look.
http://dogma.net/DataCompression/
is one good source but the paper you are looking for is perhaps at
http://www.hn.is.uec.ac.jp/~arimura/ieice98.html

A lot of theory but if you're really interested ...

Mikael

In article <38F2E2...@uxmail.cityu.edu.hk>,

Dmitry Shkarin

Apr 11, 2000
Hi, Mikael!

>> Lots of discussion on BWT. Actually, is the algorithm asymptotically
>> optimal? If yes, how can one prove that?
>Hi!
>There are *lots* of information if you just bother to look.
>http://dogma.net/DataCompression/
>is one good source but the paper you are looking for is perhaps at
>http://www.hn.is.uec.ac.jp/~arimura/ieice98.html
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That proof is wrong. The standard blocksorting scheme (BWT + MTF + entropy
coding) is asymptotically nonoptimal.
Example: an order-0 source with a nonuniform symbol distribution.
1) BWT will not change the source properties, because a permutation does not
affect the symbol probabilities;
2) MTF will flatten the symbol distribution (that is its property);
3) AriCoder will encode this corrupted distribution;
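A quick Python sketch of this example (the skewed order-0 probabilities and the four-symbol alphabet are arbitrary choices; BWT is omitted because a permutation leaves the symbol counts, and hence the order-0 entropy, unchanged):

import math
import random

def order0_entropy(seq):
    # Empirical order-0 entropy in bits per symbol.
    n = len(seq)
    counts = {}
    for s in seq:
        counts[s] = counts.get(s, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mtf(seq, alphabet):
    # Move-to-front: output the rank of each symbol in a self-organizing list.
    table = list(alphabet)
    ranks = []
    for s in seq:
        r = table.index(s)
        ranks.append(r)
        table.pop(r)
        table.insert(0, s)
    return ranks

random.seed(1)
alphabet = "abcd"
data = random.choices(alphabet, weights=[0.7, 0.2, 0.05, 0.05], k=100000)

print("order-0 entropy of the source:", round(order0_entropy(data), 3))
print("order-0 entropy after MTF    :", round(order0_entropy(mtf(data, alphabet)), 3))
# For this skewed i.i.d. source the MTF output comes out with a noticeably
# higher order-0 entropy -- the distribution has been flattened.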

To Bernhard: I cannot see your messages; perhaps you are posting in HTML. Plain
ASCII text is the standard for Usenet, please use it.

mikael.l...@spray.se

unread,
Apr 11, 2000, 3:00:00 AM4/11/00
to
In article <8cvidi$1r2s$1...@gavrilo.mtu.ru>,

"Dmitry Shkarin" <dmitry....@mtu-net.ru> wrote:
> Hi, Mikael!
> >> Lots of discussion on BWT. Actually, is the algorithm asymptotically
> >> optimal? If yes, how can one prove that?
> >Hi!
> >There are *lots* of information if you just bother to look.
> >http://dogma.net/DataCompression/
> >is one good source but the paper you are looking for is perhaps at
> >http://www.hn.is.uec.ac.jp/~arimura/ieice98.html
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> That proof is wrong. The standard blocksorting scheme (BWT + MTF + entropy
> coding) is asymptotically nonoptimal.
> Example: an order-0 source with a nonuniform symbol distribution.
> 1) BWT will not change the source properties, because a permutation does not
> affect the symbol probabilities;
> 2) MTF will flatten the symbol distribution (that is its property);
> 3) AriCoder will encode this corrupted distribution;
>
Well, you're right in that BWT doesn't care about the probabilities in the
source, it just sorts it. But if you make a closer study,
those probabilities play in BWT's favour. Observe the often very
long runs of the same character in the BWT output.
In MY opinion the Japanese researchers are right!
Note that BWT is really just the sorting phase, nothing else (see the sketch
below). The problem for us developing BWT lies mainly in the other areas,
mostly in 2) but also in 3).
I'm working on that.
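For reference, that sorting phase alone can be sketched in a few lines of Python (a naive rotation sort with an assumed terminator character; real implementations use suffix sorting):

def bwt(s, eos="\x00"):
    # Naive BWT: append a unique terminator, sort all rotations,
    # return the last column.  O(n^2 log n) -- illustration only.
    s = s + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(repr(bwt("abracadabra")))   # note how equal characters get grouped into runs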

Regards,
Mikael

Dmitry Shkarin

unread,
Apr 12, 2000, 3:00:00 AM4/12/00
to
Hi, Mikael!

>Well, you're right in that bwt don't care about probabilities in the
>source, it just sorts it. But if you care to make a closer study,
>these probabilities plays in favour to bwt. Observe the often very
>long runs of the same character in the bwt output.
>In MY opinion the japanese researcher is right!
M. Arimura & H. Yamamoto consider the standard blocksorting scheme, so their
proof is incorrect.

>Note that bwt is really just the sorting phase, nothing else!
>The problem for us developing bwt lies mainly in the other areas,
>mostly in 2) but also in 3).
>I'm working on that.

Something similar to D. Baron & Y. Bresler's approach can provide
asymptotic optimality, but it will lead us to a loss of simplicity.
NB: asymptotic optimality has little relation to real performance.

mikael.l...@spray.se

unread,
Apr 12, 2000, 3:00:00 AM4/12/00
to
Hi again!
If I remember my university math correctly, "asymptotic" means that
as a given input value approaches a limit, usually infinity,
the function gets closer to a limit.
When block sizes get bigger and bigger, BWT implementations
tend to perform better and better (easily shown in practice).
However, I don't think it's wise to blind ourselves with theory
when practical experiments can often lead to surprising results.
As for complexity, the PPM scheme is *very* complex (complicated).
But that doesn't stop you from trying to improve it, does it?

Mikael

In article <8d1u20$1c9h$2...@gavrilo.mtu.ru>,

Dmitry Shkarin

unread,
Apr 13, 2000, 3:00:00 AM4/13/00
to
Hi, Mikael!

>If I remember my university math correctly, "asymptotic" means that
>when a given in-value is closing in to a limit, usually infinity,
>the function is getting closer to a limit.
>When block-sizes is getting bigger and bigger, bwt implementations
>tend to perform better and better (easily proven practically).
Yes, and exactly in this sense the standard blocksorting scheme is
asymptotically nonoptimal.

>However, I don't think it's wise to stare ourselves blind on theory
>when practical experiments often can lead to surprising results.
>As for complexity, the ppm scheme is *very* complex (complicated).
>But that doesn't stop you from trying to improve it, does it?

It is a matter of preference; for me, online PPM is simpler than
multipass BWT, but I'm partial in this area ;-).

Some notes on compressor comparison:
1) The Calgary (CC) and Canterbury corpora were designed for comparison of
*universal* lossless data compression algorithms, so various data-type
dependent tricks, playing with program switches, etc. are not allowed.
Otherwise, one could easily invent a method that compresses each file of the CC
to 4 bits and expands any other file by the same 4 bits;
2) The criterion for comparison is unweighted average bpb, not total
archive size. See any paper on compression;

Binder Edgar

unread,
Apr 13, 2000, 3:00:00 AM4/13/00
to

Dmitry Shkarin wrote:

> >As for complexity, the ppm scheme is *very* complex (complicated).
> >But that doesn't stop you from trying to improve it, does it?
> It is matter of preferences, for me, online PPM is more simple than
> multipass BWT, but I`m partial person to this area ;-).

:-)

> Some notes on compressor comparison:
> 1) The Calgary (CC) and Canterbury corpora were designed for comparison of
> *universal* lossless data compression algorithms, so various data-type
> dependent tricks, playing with program switches, etc. are not allowed.
> Otherwise, one could easily invent a method that compresses each file of the CC
> to 4 bits and expands any other file by the same 4 bits;

Hmm, there aren't any universal lossless data compression algorithms,
except maybe the "copy" algorithm. Every compression algorithm makes some
assumptions about its data (for example, PPM assumes that the context of a symbol
is a good predictor), and a compressor gets better when it makes more
assumptions. In my opinion a compression algorithm should be allowed when:
Size(Algorithm(Corpus)) + Size(Algorithm) < Size(Corpus)
BTW, one can do *theoretically* better than 4 bits. There are only 14 files
in the CC... :-)

> 2) Criterion for comparision is unweighted average bpb, but not total
> archive size. See any paper on compression;

I agree here. Yet, this comparison is a bit unfair to BWT archivers
(they're usually worse on small files :-)

--
Edgar


mikael.l...@spray.se

unread,
Apr 13, 2000, 3:00:00 AM4/13/00
to
In article <8d4lq7$1at8$3...@gavrilo.mtu.ru>,
"Dmitry Shkarin" <dmitry....@mtu-net.ru> wrote:
> ...

>
> Some notes on compressor comparison:
> 1) The Calgary (CC) and Canterbury corpora were designed for comparison of
> *universal* lossless data compression algorithms, so various data-type
> dependent tricks, playing with program switches, etc. are not allowed.
> Otherwise, one could easily invent a method that compresses each file of the CC
> to 4 bits and expands any other file by the same 4 bits;
> 2) The criterion for comparison is unweighted average bpb, not total
> archive size. See any paper on compression;
>
Hi again!
1) BA doesn't need any special switches or special tricks to beat ppmd
on the mentioned files. If no switches were allowed, ppmd would not
perform so well either.
My BWT implementation is *not* using any special tricks other than
distinguishing text from non-text. In the case of text it uses
a filter, simply because text is so common. It helps on Russian
too. Everything else is truly universal; it works on *all* files.
2) I don't think they made any rules as to how we should use the test
files. And really, the average measurement should only be interesting
for academic studies. In real life we want our archives as small as
possible.
I've used it mostly to compare my program to others, with a
desire to make a tool as efficient and powerful as possible.
BWT has great potential and I'm surprised that nobody has tried to
push it to the limit before now with any greater success.

I wanted to compare with ppmd since it performs so well, and we should
set our aims high if we ever want to achieve anything.
I don't really think it's one versus the other; again, it's a desire
to produce something really good. I think you have that desire too.
I think BWT has a fair chance to come close to ppmd and be much better
on some file types (including some texts), *without* any "tricks".
And you must admit, the advantages of BWT in speed and modest memory
needs are alluring, aren't they?

Best regards,
Mikael

Dmitry Shkarin

unread,
Apr 14, 2000, 3:00:00 AM4/14/00
to
Hi, Mikael!

>1) Ba don't need any special switches or special tricks to beat ppmd
>on the mentioned files. If no switches were allowed, ppmd would not
>perform so well either.
Sorry, I meant that extra information must not be transmitted to the compressor
with the help of switches.

>2) I don't think they made any rules as how we should use the test
>files. And really, the average measurement should only be interesting
>for academic studies. In real life we want our archives as small as
>ever possible.

Average bpb does not depend on file sizes, so it is a more objective
measure for mixed file-type compression.

PS And last but not least: BA & DC are excellent compressors! 8-O

Dmitry Shkarin

unread,
Apr 14, 2000, 3:00:00 AM4/14/00
to
Hi, Edgar!

> Hmm, there aren't any universal lossless data compression algorithms,
>except maybe the "copy" algorithm. Every compression algorithm makes some
>assumptions on its data (for example PPM assumes that the context of a
symbol
>is a good predictor), and a compressor gets better when it makes more
>assumptions.
No, data properties can be (and must be) derived from the data itself.
Trivial assumptions (2*2=4, data depends on previous data ;-)) don't count.
Thus, universal lossless data compression algorithms exist, no doubt; otherwise,
why would we talk about asymptotic optimality? ;-)

>In my opinion a compression algorithm should be allowed when:
>Size(Algorithm(Corpus))+Size(Algorithm) < Size(Corpus)

Such an evaluation will not work, because it depends on file type and size.
A 1MB English dictionary + a description of English syntax is too big an addition
for BOOK1, it is insignificant for a 1GB collection of English texts, and it is
useless for Russian text.

>> 2) Criterion for comparision is unweighted average bpb, but not total
>> archive size. See any paper on compression;
>

> I agree here. Yet, this comparison is a bit unfair to BWT archivers
> (they're usually worse on small files :-)

Yes, it is a defect of the Canterbury corpus - some files are too small. CC
file sizes are more typical, but the file-type distribution is less typical.
Perfection is not achievable in this world ;-).

Binder Edgar

unread,
Apr 17, 2000, 3:00:00 AM4/17/00
to
Hi,

Dmitry Shkarin wrote:

[universal lossless data compression algorithms]

> No, data properties can be (and must be) derived from data itself.
> Trivial assumptions(2*2=4, data depends on previous data ;-)) don`t count.

Well, that's not always the case; there are, for example, uncorrelated
sources... :-)

> Thus, universal lossless data compression algorithms exist, no doubt, in
> opposite case, why do We talk about asymptotic optimality? ;-)

Hmm... Maybe we are speaking about different "universal lossless data
compression algorithms" :-) Let's take as an example a binary source and xor it
with pi. It's easy to prove that if pi is random, the result will, in most
cases, be random too (at least from the point of view of a statistical
compressor). So, although the true entropy of this result could be quite low,
CTW, PPM, BWT, LZ will think it's random... Therefore they aren't universal
compressors, because they restrict themselves to sources that aren't xored with
pi, sqrt 2 or any other nice number :-)
Okay, we have drifted a bit from the subject, namely that tricks aren't
allowed. I just wanted to ask you how you define "tricks"?
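A quick Python sketch of that xor argument; a seeded pseudo-random keystream stands in here for the binary expansion of pi, since the point is only that the mask is fixed yet looks incompressible to a statistical coder:

import random
import zlib

random.seed(314159)  # fixed seed: a stand-in for the digits of pi
keystream = bytes(random.randrange(256) for _ in range(100000))

source = bytes(100000)                                    # extremely low-entropy: all zero bytes
masked = bytes(a ^ b for a, b in zip(source, keystream))  # "xored with pi"

print("plain source :", len(zlib.compress(source, 9)), "bytes")  # compresses to almost nothing
print("masked source:", len(zlib.compress(masked, 9)), "bytes")  # looks random to the coder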

> >In my opinion a compression algorithm should be allowed when:
> >Size(Algorithm(Corpus))+Size(Algorithm) < Size(Corpus)
> Such evaluation will not work, because it depends on file type and size.
> 1MB english dictionary + description of english syntax is too big addition
> for BOOK1, it is insignificant for 1GB collection of english texts and it is
> useless for russian text.

Well, the corpus should be representative of the input files that are
common for a class of archivers. If the corpus is composed of just English
texts, and it's quite big, then it's probably supposed to test English text
compressors on big files, so in this case a 1 Mb dictionary would be useful. On the
other hand the Calgary/Canterbury corpora contain more than text, and that
dictionary is useless because it's too big, but some other low-cost tricks like
EOL-encoding, phrase replacing, ...., are quite efficient and they'll probably
be efficient on most files where the compressor will be used, so it's worth
implementing them...

--
Edgar


Malcolm Taylor

unread,
Apr 18, 2000, 3:00:00 AM4/18/00
to
Binder Edgar <Binde...@T-Online.de> wrote:
> Let's take as example a binary source and xor it
>with pi. It's easy to prove that if pi is random, the result will, in most of
>the cases be random, too (at liest from the point of view of a statistical
>compressor). So, although the true entropy of this result could be quite low,
>CTW, PPM, BWT, LZ will think it's random... Therefore they aren't universal
>compressors, because they restrict themselves to sources that aren't xored with
>pi, sqrt 2 or any other nice number :-)

Information content or entropy can only be defined relative to a
model. To warp a quote - one man's information is another man's random
data :)
Universal data compression algorithms are ones that are universal with respect to a
model class. The way the term is used in the literature is generally in
reference to Markov sources (such as tree sources).
Universal does not imply it can compress anything.

Malcolm


Binder Edgar

unread,
Apr 18, 2000, 3:00:00 AM4/18/00
to

Malcolm Taylor wrote:

[universal data compression algorithms]

> Information content or entropy can only be defined relative to a
> model. To warp a quote - one mans information is another mans random
> data :)

:-)

> Universal data compression algorithms are ones that are universal to a
> model class. The way the term is used in literature is generally in
> reference to markov sources (such as tree sources).

Yes. Well, I was just arguing that Markov sources aren't the only ones, so some
compression "tricks" can be justified with respect to other reference model
classes...

> Universal does not imply it can compress anything.

Well, on the average there is no compressor that actually compresses :-)

--
Edgar


Dror Baron

unread,
Apr 18, 2000, 3:00:00 AM4/18/00
to
> Lots of discussion on BWT. Actually, is the algorithm asymptotically
> optimal? If yes, how can one prove that?
>
The answer is obviously "yes".
For example, the papers by Arimura, Effros, and Verdu et al.
show this.

--------------------------
Dror Baron
ECE Department
University of Illinois at Urbana-Champaign(UIUC)
CSRL room 120
Phone: (217)-265-8234 (office)
Website: http://www.uiuc.edu/~dbaron


alex

unread,
Apr 19, 2000, 3:00:00 AM4/19/00
to
Yes?????
It makes me very confused now. Someone (e.g. Dmitry Shkarin) said no,
but someone else said "yes"! So can anyone conclude? Maybe giving the proof
here is a good way to discuss it.

Dmitry Shkarin

unread,
Apr 19, 2000, 3:00:00 AM4/19/00
to
Hi, Dror!

>> Lots of discussion on BWT. Actually, is the algorithm asymptotically
>> optimal? If yes, how can one prove that?
>>
>The answer is obviously "yes".
>For example, the papers by Arimura, Effros, and Verdu` et. al.
>show this.
In that case you must disprove my example.
Are the Effros and Verdu et al. papers available online?

Dmitry Shkarin

unread,
Apr 19, 2000, 3:00:00 AM4/19/00
to
Hi, Edgar!

> Okay, we distanced a bit from the subject, namely that tricks aren't
>allowed. I just wanted to ask you how you define "tricks" ?
Any empirically found dependence for some limited set of files is a trick,
because we can say nothing about its applicability to other files.
Good performance on corpora is not the ultimate purpose of a compressor; it
is an indicator of its behavior on real data. If a compressor shows good
performance for BOOK1 then we can expect similar performance for any
text-like file, for example for French text (oops, 'alphabet reordering' and
'phrase substitution' go out) or for program source (oops, 'punctuator
tricks' go out, because punctuators have a special meaning in programming
languages). If a compressor shows good performance for SUM then we can expect
similar performance for IA32 or IA64 executables (oops, the 'E8 trick' goes out).
If a compressor shows good performance for KENNEDY.XLS then we can expect
similar performance for any table-like file (oops, 'playing with switches' goes
out: 'szip -o0 -r13 KENNEDY.XLS' gives an excellent result, but the same is not true
for 'szip -o0 -r13 AnyTable.XLS'). Hm, too long a list, stop here... ;)

> Well, the corpus should be representative of the input files that are
> common for a class of archivers. If the corpus is composed of just English
> texts, and it's quite big, then it's probably supposed to test English text
> compressors on big files, so in this case a 1 Mb dictionary would be useful. On the
> other hand the Calgary/Canterbury corpora contain more than text, and that
> dictionary is useless because it's too big, but some other low-cost tricks like
> EOL-encoding, phrase replacing, ...., are quite efficient and they'll probably
> be efficient on most files where the compressor will be used, so it's worth
> implementing them...

By the way, do you compress pure plain English texts frequently?
I did an experiment: my system disk C contains 2789KB of .TXT
files, which is not such a big amount for a 1GB disk. 2/3 of them are list- and
table-like files, and tricks will not work for such files; 1/4 of the rest are
Russian README files. The disk contains a lot of text-like .INI, .CFG, .LOG,
.HTML files, and tricks don't work for them either.
After removing the enormously big REDIST.TXT (a list-like file) and files
with length < 1000 bytes (cookies from the Internet), the size of the test file set
is 1207368 bytes. Number of files in the set: 59.

'dc.exe -v1 *.txt' gives 283342 bytes(3.3 sec.)
'dc.exe -v1 -cpre *.txt' gives 283539 bytes(3.1 sec.)

For separately compressed files(via batch file):
'dc.exe -v1 %1.txt' gives 285468 bytes(11.2 sec.)
'dc.exe -v1 -cpre %1.txt' gives 285685 bytes(11.0 sec.)

For comparison (I'm partial in my compressor selection again :)):
'ppmd e -f -o6 *.txt' gives 254241 bytes(2.5 sec.)

You can easily reproduce such an experiment on your own computer.

Binder Edgar

unread,
Apr 19, 2000, 3:00:00 AM4/19/00
to
Hi,

Dmitry Shkarin wrote:

> > Okay, we distanced a bit from the subject, namely that tricks aren't
> >allowed. I just wanted to ask you how you define "tricks" ?
> Any empirically found dependence for some limited set of files is trick,
> because We can say nothing about its appliance to other files.

Hmm, according to your definition isn't that limited PPM context a trick ?
(you should have an unbounded context :-)

> Good performance on corpora is not ultimate purpose for compressor, it
> is some indicator of its behaivor on real data. If compressor shows good
> performance for BOOK1 then We can expect similar performance for any
> text-like file, for example, for French text(oops, 'alphabet reordering' and
> 'phrase substitution' go out) or for program source(oops, 'punctuators

AFAIK alphabet reordering & phrase substitution work on most languages...

> If compressor shows good performance for KENNEDY.XLS then We can expect
> similar performance for any table-like file(oops, 'plays with switches' go
> out: 'szip -o0 -r13 KENNEDY.XLS' gives excellent result, but it is not true
> for 'szip -o0 -r13 AnyTable.XLS'). Hm, too long list, stop here...;)

Well, DC has built-in record detection, similar to szip's -r... Without
playing with switches, it gets improved compression even on most image or sound
files...

> By the way, do You compress pure plain English texts frequently?

Okay, you got a point here... :-) But the text filter works on texts that
aren't "pure", too...

[compression experiment]

Hmm, your system disk may not be typical. And as we know, any empirically found
dependence for some limited set of files is a trick.... :-) Now seriously, I
have about 25 Mb of text in 291 files over 1k on 1.4Gb total space. About 2.3Mb
is German text, 0.8Mb is Romanian and the remainder is more or less English text.
Of course this could be non-typical, but it's good to know that DC works
efficiently at least on my computer :-)

--
Edgar


Dror Baron

unread,
Apr 19, 2000, 3:00:00 AM4/19/00
to
Hi Dmitry ;-)

First of all, when I read "approach such as D. Baron and Y. Bresler"
in one of your responses, I was pleased. Thanks.
I hope our next salvo of papers (August, I hope!) will be better.

Seriously, consider a memoryless source.
Let's go through the papers:

1. Arimura and Yamamoto:
You used MTF followed by an arithmetic coder, but Arimura and
Yamamoto use MTF followed by a universal prior on the integers.
In Theorem 2 (p. 2119) the bound by Elias (IT Transactions, 1987)
gives a redundancy of 2*log(H(X)+1)+1. This is still quite a lot!
(H(X) is in our case similar to log|X|!)
But go to p. 2121 and see note 3 on symbol extension.
For a supersymbol of m symbols, asymptotically the redundancy will
be 1/m times the value above. For practical purposes this is
not too good, but asymptotically, with large m, all will be fine.
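For concreteness, the per-symbol cost of sending MTF ranks with an Elias-style universal integer code can be sketched in Python; Elias gamma is used here as one concrete choice of the universal prior, which is an assumption, not necessarily the exact code in the paper:

def gamma_length(n):
    # Elias gamma codelength for an integer n >= 1: 2*floor(log2 n) + 1 bits.
    return 2 * (n.bit_length() - 1) + 1

# Cost of coding MTF ranks shifted to the range 1..|X|:
for rank in [1, 2, 4, 16, 256]:
    print("rank", rank, "->", gamma_length(rank), "bits")
# The 2*log(.)+1 behaviour is the source of the 2*log(H(X)+1)+1 redundancy term;
# coding length-m supersymbols spreads this logarithmic cost over m source symbols.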

2. Effros (DCC '99):
The BWT output is partitioned into blocks.
For each block the Krichevsky-Trofimov estimator is run, followed
by arithmetic coding. Since the source is memoryless, by Verdu (ISIT 2000)
the BWT output will be within log(N) of memoryless (in divergence).
Therefore the blocks will be compressed well by the Krichevsky-Trofimov estimator.
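A minimal Python sketch of the Krichevsky-Trofimov estimator for a binary block (the block contents below are invented; the sequential rule is P(next=1) = (n1 + 1/2)/(n0 + n1 + 1)):

import math

def kt_codelength(bits):
    # Ideal codelength, in bits, assigned to a binary string by the
    # Krichevsky-Trofimov sequential estimator.
    n0 = n1 = 0
    total = 0.0
    for b in bits:
        p1 = (n1 + 0.5) / (n0 + n1 + 1.0)
        total -= math.log2(p1 if b else 1.0 - p1)
        if b:
            n1 += 1
        else:
            n0 += 1
    return total

block = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0] * 100   # a nearly-memoryless, skewed block
print(round(kt_codelength(block), 1), "bits for", len(block), "symbols")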

Thanks again.
Dror

--------------------------
Dror Baron
ECE Department
University of Illinois at Urbana-Champaign(UIUC)
CSRL room 120
Phone: (217)-265-8234 (office)
Website: http://www.uiuc.edu/~dbaron



Dmitry Shkarin

unread,
Apr 20, 2000, 3:00:00 AM4/20/00
to
Hi, Dror!

>1. Arimura and Yamamoto:
>You used MTF followed by arithmetic coder, but Arimura and
>Yamamoto use MTF followed by a universal prior on the integers.
Arimura and Yamamoto don't restrict themselves to a certain coder (see
Sect. 2). Actually, the problem is in the MTF, not in the coder.

>In theorem 2 (p.2119) the bound by Elias (IT transactions, 1987)
>gives a redundancy of 2*log(H(X)+1)+1. This is still quite a lot!
>(H(X) is in our case similar to log|X|!)
>But go to p.2121 and see note 3 on symbol extension.
>For a supersymbol of m symbols, asymptotically the redundancy will
>be 1/m times the value from above. For practical purposes this is
>not too god, but asymptotically with large m all will be fine.

No method exists that can convert a 'bad' flattened distribution into a 'good'
peaked one without any extra information, and 'symbol extension' tries to do
just that, so the example remains undisproved.

>2. Effros (DCC 99')
>The BWT output is partitioned into blocks.
>For each block the Krichevsky Trofimov estimator is run, followed
>by arithmetic coding. Since the source is memoryless, by Verdu (ISIT 2000)
>the BWT output will be within log(N) of memoryless (in divergence).
>Therefore the blocks will be well compressed by the Krichevsky Trofimov.

Yes, such a scheme will be asymptotically optimal, but I meant the standard
scheme only.

Dmitry Shkarin

unread,
Apr 20, 2000, 3:00:00 AM4/20/00
to
Hi, Edgar!

> Hmm, according to your definition isn't that limited PPM context a
trick ?
>(you should have an unbounded context :-)
You can consider it a technical limitation, like the finiteness of
MAX_TOTAL_FREQUENCY in AriCoder ;-)

> AFAIK alphabet reordering & phrase substition work on most languages...

Sorry, I meant that a compressor that uses hard-coded alphabet reordering &
phrase substitution is a tricky compressor, but a compressor that selects the
alphabet order and builds its dictionary adaptively is a normal, straight
compressor (probably it is simply LZ78 :)).

> Well, DC has built-in record detection, similar to szip's -r... Without
>playing with switches, it gets improved compression even on most image or
sound
>files...

Yes, DC is a straight compressor for such files.

> Hmm, your system disk may not be typical. And as we know any empirical
found
>dependence for some limited set of files is a trick.... :-) Now seriously,
I
>have about 25 Mb of text in 291 files over 1k on 1.4Gb total space. About
2.3Mb
>is german text, 0.8Mb is romanian and the reminder is more or less english
text.

Did you exclude your test files from the count? :)

mikael.l...@spray.se

unread,
Apr 20, 2000, 3:00:00 AM4/20/00
to
Hi!
There's no law against using "tricks" in our implementations.
However, I must agree with Dmitry in that true universal coding
is much more interesting.
Edgar, your implementation of "inversion frequencies" is very well done.
It would be very interesting to know at least the basic difference
between your and Arnavut and Magliveras' *implementations* of IF.

Mikael


Binder Edgar

unread,
Apr 20, 2000, 3:00:00 AM4/20/00
to
Hi,

Dmitry Shkarin wrote:

> > Hmm, according to your definition isn't that limited PPM context a
> trick ?
> >(you should have an unbounded context :-)

> You can consider it as a technical limitation, like to finitness of
> MAX_TOTAL_FREQUENCY in AriCoder ;-)

:-)

> > AFAIK alphabet reordering & phrase substition work on most languages...

> Sorry, I meant compressor, that uses hard-coded alphabet reordering &
> phrase substitution, is tricky compressor, but compressor, that selects
> alphabet order and builds dictionary adaptively, is normal, straight
> compressor(probably, it is LZ78 simply:)).

Well, you can consider fixed alphabet reordering & phrase substitution a
technical limitation :-) Experience tells that the adaptive versions are much
slower and usually have lower performance...

[text files test]

> Did You excluded from count your test files? :)

Yes, although I must confess I was tempted to count those 15Mb, too :-)
Well, I guess that could be compensated by 13.8Mb of pascal/delphi source code
that compresses to 2426k without preprocessing and 2411k with the filters...
Assuming that there are 10,000 people out there with 33kbit/sec modems who
would want to download them and they pay about 2 cent/minute internet bills, I
just saved more than 10$ :-) Okay, it's probably better to use a solid mode
(1766k), or best a solid mode + PPMonstr (1720k)....

--
Edgar


Binder Edgar

unread,
Apr 20, 2000, 3:00:00 AM4/20/00
to

Hi,

mikael.l...@spray.se wrote:

> There's no law against using "tricks" in our implementations.

:-)

> However, I must agree with Dmitry in that true universal coding
> is much more interesting.

Yes, I agree too..

> Edgar, your implementation on "inversion frequencies" is very well done.

Thanks! Hmm, I don't like it being called inversion frequencies :-) I discovered it
independently, and had an implementation before that DCC paper appeared, and anyway,
besides the fact that it encodes distances between instances of the same character, it
doesn't have much in common with IF on the modelling part...

> It would be very interesting to know at least the basic difference
> between your and Arnavut and Magliveras' *implementation* on IF.

Well, I don't do it in multiple steps, I encode the distances on the fly... For
example aaabbbbcaaab ---> 116111501100 without any exclusion (a 0 distance means that
the char doesn't appear any more in the input source)...
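A tiny Python sketch of exactly this example scheme (distance to the next occurrence of the same character, 0 if it never reappears); it is only the toy version of the example above, not the full DC model:

def distances(s):
    # For each position, the distance to the next occurrence of the same
    # character, or 0 if the character does not appear again.
    out = []
    for i, ch in enumerate(s):
        j = s.find(ch, i + 1)
        out.append(j - i if j != -1 else 0)
    return out

print(distances("aaabbbbcaaab"))
# -> [1, 1, 6, 1, 1, 1, 5, 0, 1, 1, 0, 0], i.e. "116111501100"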

--
Edgar


Dror Baron

unread,
Apr 20, 2000, 3:00:00 AM4/20/00
to
Hi Dmitry!

> >In theorem 2 (p.2119) the bound by Elias (IT transactions, 1987)
> >gives a redundancy of 2*log(H(X)+1)+1. This is still quite a lot!
> >(H(X) is in our case similar to log|X|!)
> >But go to p.2121 and see note 3 on symbol extension.
> >For a supersymbol of m symbols, asymptotically the redundancy will
> >be 1/m times the value from above. For practical purposes this is
> >not too god, but asymptotically with large m all will be fine.
> No method exists that can convert 'bad' flattened distribution to 'good'
> peaked one without any extra information and 'symbol extension' try to do
> it, so example remains nondisproved.

I think you don't get it.
The 'flattened distributions' are a problem when MTF runs on
single symbols. But when supersymbols are used, it becomes less
of a problem. The reason is that basically the MTF redundancy is
2*log(H) per symbol, where H is the entropy rate.
In your example the source is memoryless, so length-m
supersymbols have entropy H*m. So on the one hand we're paying
H*m, but the redundancy is 2*log(m*H), so the redundancy/entropy
ratio is 2*[log(H)+log(m)]/(m*H), which clearly goes to zero
pretty fast as we increase m.
I'm not saying this is practical! I'm just saying that even
MTF can be 'coerced' into nice results ;-)
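Plugging numbers into that ratio (H = 3 bits per symbol is just an arbitrary example value) shows how quickly it falls with m:

import math

H = 3.0  # example per-symbol entropy rate, in bits (arbitrary choice)
for m in [1, 2, 4, 8, 16, 64]:
    ratio = 2 * (math.log2(H) + math.log2(m)) / (m * H)
    print("m =", m, " redundancy/entropy ~", round(ratio, 3))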

> >2. Effros (DCC 99')
> >The BWT output is partitioned into blocks.
> >For each block the Krichevsky Trofimov estimator is run, followed
> >by arithmetic coding. Since the source is memoryless, by Verdu (ISIT 2000)
> >the BWT output will be within log(N) of memoryless (in divergence).
> >Therefore the blocks will be well compressed by the Krichevsky Trofimov.
> Yes, such scheme will be asymptotically optimal, but I meant standard
> scheme only.

I think you are being unfair.
We all know you are a strong supporter of PPM, but when you are
looking at another method you can't say "standard only".
I could similarly look at PPM and say "d=4 only" (modeling depth).
Clearly your unbounded context work is much much better!

This debate isn't about whether BWT+MTF+arithmetic coding is
perfect and wonderful. The debate is whether the BWT's statistical-gathering
characteristics are good enough to provide 'universal' coders
(I put it in quotes because universality is relative to a class of
sources, albeit a wide and useful class for practical purposes).

Last remark: as a convention, if we use BWT as a preprocessor,
then it seems fair to allow any trick with complexity up to the BWT
complexity.

Dror

kaptke...@my-deja.com

unread,
Apr 20, 2000, 3:00:00 AM4/20/00
to
In article <38FF41A6...@T-Online.de>,

Binder Edgar <Binde...@T-Online.de> wrote:
>
> Hi,
>
> mikael.l...@spray.se wrote:
>
> > There's no law against using "tricks" in our implementations.
>
> :-)
>
> > However, I must agree with Dmitry in that true universal coding
> > is much more interesting.
>
> Yes, I agree too..
>
> > Edgar, your implementation of "inversion frequencies" is very well done.
>
> Thanks! Hmm, I don't like it being called inversion frequencies :-) I discovered it
> independently, and had an implementation before that DCC paper appeared, and anyway,

I too came up with the concept independently, but I'm glad to see the
concept has a name. I'll spend some time reading up on what others have
done.


> besides the fact that it encodes distances between instances of the same character, it
> doesn't have much in common with IF on the modelling part...
>
> > It would be very interesting to know at least the basic difference
> > between your and Arnavut and Magliveras' *implementations* of IF.
>
> Well, I don't do it in multiple steps, I encode the distances on the fly... For
> example aaabbbbcaaab ---> 116111501100 without any exclusion (a 0 distance means that
> the char doesn't appear any more in the input source)...
>


How do you handle distances greater than 255? This was one of the more
difficult points for me to overcome until I had developed my coder
(which doesn't care how big the fixed-size tokens are. Thus I can use
16-bit numbers with no compression penalty.)

Binder Edgar

unread,
Apr 21, 2000, 3:00:00 AM4/21/00
to
Hi,

kaptke...@my-deja.com wrote:

[inversion frequencies (distances coding)]

> How do you handle distances greater than 255? This was one of the more
> difficult points for me to overcome until I had developed my coder

Well, I use a logarithmic encoding scheme (first the log2 of the
distance, and then the rest of the needed bits).
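In Python, such a log2-prefix scheme could look like the sketch below (an Elias-gamma-style layout; the actual bit layout used in DC is an assumption here):

def log_encode(d):
    # Encode a distance d >= 1 as: floor(log2 d) in unary, a terminating 0,
    # then the remaining low-order bits of d.  Codelength is 2*floor(log2 d) + 1.
    k = d.bit_length() - 1             # floor(log2 d)
    return "1" * k + "0" + bin(d)[3:]  # unary length, terminator, low bits

for d in [1, 2, 5, 300, 70000]:
    print(d, "->", log_encode(d))   # distances above 255 cost only a few extra bits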

> (which doesn't care how big the fixed-size tokens are. Thus I can use
> 16-bit numbers with no compression penalty.)

Don't you have high adaptation costs?

--
Edgar


kaptke...@my-deja.com

unread,
Apr 22, 2000, 3:00:00 AM4/22/00
to
In article <39001162...@T-Online.de>,

No, just the opposite. You see, M99 makes NO attempt to predict
anything. That's one of the reasons for its fantastic speed (and its
extremely low memory requirements (around 40kb)). The algorithm has a
kind of natural adaptation to localized statistics. But, I'm sure
we're getting off topic here.

-Michael

Dmitry Shkarin

unread,
Apr 23, 2000, 3:00:00 AM4/23/00
to
Hi, Dror!

>I think you don't get it.
>The 'flattened distributions' are a problem when MTF runs on
>single symbols. But when supersymbols are used, it becomes less
>of a problem. The reason is that basically the MTF redundancy is
>2*log(H) per symbol where H is the entropy rate.
>In your example the source is memoryless, so having length-m
>supersymbols means H*m entropy. So on the one hand we're paying
>H*m but the redundancy is 2*log(m*H) so the redundancy/entropy
>ratio is 2*[log(H)+log(m)]/(m*H) which clearly goes to zero
>pretty fast when we increase m.
>I'm not saying this is practical! - I'm just saying that even
>MTF can be 'coerced' into nice results ;-)
Now, we can apply your considerations to plain ASCII codes. With the
help of the trivial inequality CodeLength(MTF(ASCII)) <= 8 = H + 2*lg(H+1) + 1,
where H=3, we can achieve 3 bpb for any ASCII sequence. Why do we need
compressors at all? We could get any compression rate for ASCII codes that easily.

>I think you are being unfair.
>We all know you are a strong supporter of PPM, but when you are
>looking at another method you can't say "standard only".
>I could similarly look at PPM and say "d=4 only" (modeling depth).
>Clearly your unbounded context work is much much better!

Well, I will name it 'original blocksorting scheme'. :)

Dmitry Shkarin

unread,
Apr 23, 2000, 3:00:00 AM4/23/00
to
Hi, Edgar!

>> > AFAIK alphabet reordering & phrase substitution work on most languages...
>> Sorry, I meant that a compressor that uses hard-coded alphabet reordering &
>> phrase substitution is a tricky compressor, but a compressor that selects the
>> alphabet order and builds its dictionary adaptively is a normal, straight
>> compressor (probably it is simply LZ78 :)).
>
> Well, you can consider fixed alphabet reordering & phrase substitution a
> technical limitation :-) Experience tells that the adaptive versions are much
> slower and usually have lower performance...
Hmm, we are going around in circles (for the 2nd or 3rd time?), so let's stop here...

Dror Baron

unread,
Apr 23, 2000, 3:00:00 AM4/23/00
to
> Now, we can apply your considerations to plain ASCII codes. With the
> help of the trivial inequality CodeLength(MTF(ASCII)) <= 8 = H + 2*lg(H+1) + 1,
> where H=3, we can achieve 3 bpb for any ASCII sequence. Why do we need
> compressors at all? We could get any compression rate for ASCII codes that easily.

Per-symbol.
Not on the entire file.
That's why supersymbols are needed. But MTF isn't good.

> Well, I will name it 'original blocksorting scheme'. :)

OK! I like your approach.
I agree that the 'original blocksorting' has shortcomings.


Dmitry Shkarin

unread,
Apr 25, 2000, 3:00:00 AM4/25/00
to
Hi, Dror!

>Per-symbol.
>Not on the entire file.
>That's why supersymbols are needed. But MTF isn't good.
Yes, it is a bad example.


Dror Baron

unread,
Apr 25, 2000, 3:00:00 AM4/25/00
to
Hi Dmitry!
This is cool. We've come to an agreement that the
original BWT+MTF scheme is suboptimal.
But we've also agreed that BWT combined
with other methods can be asymptotically optimal
(lots of examples around...).

Dror

0 new messages