
New large text benchmark


Matt Mahoney

May 12, 2006, 6:50:36 PM
I am starting a new data compression benchmark here:
http://cs.fit.edu/~mmahoney/compression/text.html

The goal of the benchmark is to solve the problem of natural language text
compression, which I believe is equivalent to passing the Turing test
for AI. Currently, no algorithm can compress text to 1 bpc, the
entropy estimated by Shannon in 1950 based on text prediction
experiments by humans. The benchmark is 1GB of text from Wikipedia,
about the amount it would take a lifetime to read. The entropy of the
resulting model is roughly the same as the long term memory capacity of
the human brain by some estimates. The web page explains the rules and
rationale for this benchmark.

The goal of most benchmarks is to try to find the "best overall"
algorithm. We know that the most general solution to encoding a string
x is the shortest program that outputs x. But Kolmogorov proved that
finding this shortest program is not computable. Instead, the top ranked
programs are usually
those with lots of specialized models for different formats like exe,
pdf, bmp, wav, etc., whatever is included in the benchmark. Every time
a new data type is added to a benchmark (e.g. the digits of pi),
compression developers have to add another model (a program that
computes pi). I think this distracts from the problem and leads to
overly complex systems. Compressing text in a single language is a
hard enough problem.

So far I have only tested a few compressors, but I expect to add many
more over the next few weeks.

-- Matt Mahoney

Matt Mahoney

May 16, 2006, 6:50:33 PM
Matt Mahoney wrote:
> I am starting a new data compression benchmark here:
> http://cs.fit.edu/~mmahoney/compression/text.html

I have updated the benchmark with a few of the most popular compressors
like zip, rar, gzip, 7zip, compress, and a few high end compressors
like paq8h, sbc, slim, ppmonstr, and ppmd. It also includes ppmd and
ppmonstr with xml-wrt preprocessing. xml-wrt improves compression for most
compressors that lack English models (so not paq8h or slim).

My next goal is to benchmark the remaining top compressors on text.
http://maximumcompression.com/data/text.php?historic=true
http://uclc.info/large_text_compression_test.htm

That would of course include WinRK, but unfortunately the "free trial"
download expires as soon as you install it. I only rank programs that
the public can download freely. (Free trials are OK). If this is not
fixed, it will have to wait for the next version.

The benchmark is publicly verifiable so it is open to anyone who wants
to submit results (especially if you wrote the program). I am not only
looking for results for new compressors, but also combinations of
options on already tested compressors that improve compression. I try
to find the best combination, but an exhaustive search is often not
practical. If you do send me results, I will acknowledge your
contribution. It is not possible to compare speeds on different
machines, but since the ranking is by size only, this will not matter
too much.

-- Matt Mahoney

wtx...@gmail.com

May 20, 2006, 10:14:39 PM
Hi Matt,
I am a novice in the compression world, but I find your work on
testing so many compressors tremendous!

Thank you for your useful information.

Weng

Matt Mahoney

May 21, 2006, 6:13:19 PM

Thanks. BTW the server with the new benchmark has been down for 2
days, so I am moving it to http://www.mattmahoney.net/text.html at
least for now.

Updates: added UHBC (6th place and the top ranked BWT compressor), and
added more results for XML-WRT combined with other compressors. UHBC
is the only compressor not improved by XML-WRT, except possibly paq8h,
which I haven't tested yet. Also cleaned up the results a bit.

-- Matt Mahoney

Matt Mahoney

May 21, 2006, 9:18:16 PM

So right after I posted this, the server came back up. It seems their air
conditioner broke over the weekend (this is in Florida). Anyway, I
guess I will mirror it on both sites.

Update: XML-WRT does not improve paq8h compression. However paq8h is
still #1. WinRK might be better but I can't test it until the free
trial expiration is fixed. Right now it expires as soon as you install
it.

-- Matt Mahoney

byb...@rocketmail.com

May 23, 2006, 11:28:42 AM
Matt Mahoney wrote:

> overly complex systems. Compressing text in a single language is a
> hard enough problem.

Hmm, speaking of large text corpora, does anyone know if the
proceedings of the UN that Google used to train their language
translation models are available? I'm sure those are way larger than
1GB.

I'm asking because I'm interested in English-Chinese
machine translation, as well as improving the compressibility of Chinese
text, so a huge parallel corpus would be handy for me.

-Tony

Matt Mahoney

May 26, 2006, 2:11:55 PM

Don't know about that, but Wikipedia has articles in about 100
languages. If an article is written in more than one language, there
are links between them. One problem I noticed is that sometimes a full
article in one language will link to a stub in another, so you need to
weed out these cases.

-- Matt Mahoney

Matt Mahoney

May 26, 2006, 3:04:37 PM
Matt Mahoney wrote:
> I am starting a new data compression benchmark here:
> http://cs.fit.edu/~mmahoney/compression/text.html

Update: I have full benchmarks for 25 compressors, including 10 of the
top 11 in the maximumcompression text benchmark.
http://maximumcompression.com/data/text.php?historic=true
The other one, compressia, is no longer supported, so it is not eligible.

Some surprises: of the top 10, two failed on the 1 GB file: hipp
does not have enough memory to even start (even with 800 MB and order
2), and durilca crashes. The others are severely constrained by having
only 1 GB of memory (effectively 800 MB available) on my PC. The PPM
based compressors must run with low orders (8 to 10) for best
compression, because higher orders use up memory faster, forcing them
to discard the model. slim crashes when run with a high PPM order.
The best BWT compressors have to split the file into 10 parts,
compressed independently. Context mixing algorithms using hash tables
should be less affected, but one big surprise (to me) was that WinRK
only took third place.

I relaxed the rules to remove limits on speed and memory. Previously,
programs had to run in 24 hours on a 2.2 GHz Athlon-64 with 1 GB memory
(effectively 800 MB available). However, after these tests I realized
these arbitrary limits were hampering the goal of finding the best
natural language model. It is theoretically possible to use data
structures that take no more space than the compressed output, but
practical algorithms typically need 10 to 100 times as much space in
order to get reasonable speed. Given the 3 way tradeoff between speed,
memory, and compression, I don't want anything to get in the way of
compression.

Since this is an open benchmark (anyone can contribute), this
restriction also does not make sense. CPUs will get faster and memory
will get cheaper. I will continue testing with the original limits
(until I upgrade), but anyone can submit results run on any machine
they want.

-- Matt Mahoney

michael

May 28, 2006, 10:06:05 AM

Matt Mahoney wrote:
<snip>

> The best BWT compressors have to split the file into 10 parts,
> compressed independently.
</snip>

Actually, with 800MB of space, the best BWTs should use just under 6
blocks as the best suffix array algorithms (both for speed and memory)
are currently 5N.

- Michael Maniscalco

zi...@hotmail.com

May 29, 2006, 6:26:40 AM
Hi Matt

Thanks for the fantastic work! It is very useful.
Have you tested an algorithm called CAB 1.3, and if not, why not?

thanks
ziv.

Matt Mahoney

May 29, 2006, 12:29:19 PM

I have not tested CAB yet. I am going down the list of top compressors
for English text plus a few popular ones like zip, gzip, 7zip, bzip2,
rar.
http://maximumcompression.com/data/text.php
http://uclc.info/large_text_compression_test.htm

This is an open benchmark so if you send me results for any compressor
I will post them.

-- Matt Mahoney

Matt Mahoney

May 29, 2006, 12:46:20 PM
michael wrote:
> Matt Mahoney wrote:
> <snip>
> > The best BWT compressors have to split the file into 10 parts,
> > compressed independently.
> </snip>
>
> Actually, with 800MB of space, the best BWTs should use just under 6
> blocks as the best suffix array algorithms (both for speed and memory)
> are currently 5N.
>
> - Michael Maniscalco

It should be possible to have blocks almost as large as memory. First
you quicksort the suffix pointers in smaller chunks, and then write the
sorted pointers to temporary files taking 4N disk space. The pointers can
be merged
efficiently on disk because access is sequential. The block itself is
accessed randomly (for suffix comparison) so it has to be kept in main
memory.
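
(A rough C++ sketch of this chunked sort-and-merge, purely my own
illustration rather than code from any benchmarked program: it sorts plain
suffixes instead of cyclic rotations, assumes 32 bit pointers, skips error
checking, and for brevity collects the merged result in memory instead of
streaming it back to disk.)

#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

// the text block stays in RAM; suffix comparisons access it randomly
static const string* gtext;
bool suffix_less(uint32_t a, uint32_t b) {
    return gtext->compare(a, string::npos, *gtext, b, string::npos) < 0;
}

vector<uint32_t> external_suffix_sort(const string& text, size_t chunk) {
    gtext = &text;
    size_t n = text.size();
    vector<FILE*> runs;

    // phase 1: quicksort pointer chunks in memory, spill each sorted run
    // to a temporary file (4 bytes per pointer, 4N bytes of disk in total)
    for (size_t start = 0; start < n; start += chunk) {
        size_t end = min(n, start + chunk);
        vector<uint32_t> p;
        for (size_t i = start; i < end; i++) p.push_back((uint32_t)i);
        sort(p.begin(), p.end(), suffix_less);  // real programs use faster suffix sorts
        FILE* f = tmpfile();
        fwrite(p.data(), sizeof(uint32_t), p.size(), f);
        rewind(f);
        runs.push_back(f);
    }

    // phase 2: merge the runs; the disk access here is purely sequential
    vector<uint32_t> head(runs.size());
    vector<bool> alive(runs.size());
    for (size_t r = 0; r < runs.size(); r++)
        alive[r] = fread(&head[r], sizeof(uint32_t), 1, runs[r]) == 1;

    vector<uint32_t> sa;   // in practice this would also stream to disk
    sa.reserve(n);
    while (true) {
        int best = -1;
        for (size_t r = 0; r < runs.size(); r++)
            if (alive[r] && (best < 0 || suffix_less(head[r], head[best])))
                best = (int)r;
        if (best < 0) break;
        sa.push_back(head[best]);
        alive[best] = fread(&head[best], sizeof(uint32_t), 1, runs[best]) == 1;
    }
    for (FILE* f : runs) fclose(f);
    return sa;
}

int main() {
    vector<uint32_t> sa = external_suffix_sort("banana", 3);
    for (uint32_t i : sa) printf("%u ", (unsigned)i);   // prints 5 3 1 0 4 2
    printf("\n");
}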

For the inverse transform, instead of building and traversing a linked
list, keep the transformed data in memory and build an index to give
you the position of every 16th or 32nd occurrence of each character,
which would have size N/4 or N/8. Then to invert the BWT you find the
approximate location of the search character using the index and then
the exact location by linear search. There is an obvious speed/memory
tradeoff, but linear search is pretty fast because it uses sequential
memory access.
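
(Again just my own C++ sketch of that indexing idea, not code from any
existing BWT compressor; the every-32nd checkpoint interval and the
"primary index" convention for where the text starts are assumptions, and
the first-column lookup is left as a linear scan for clarity.)

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>
using namespace std;

// invert a BWT using a sparse "select" index over the transformed data
// instead of a full linked list: about 4N/K bytes of index instead of 4N
string inverse_bwt(const string& bwt, size_t primary, size_t K = 32) {
    size_t n = bwt.size();

    // C[c] = number of symbols smaller than c (start of c's bucket in column F)
    vector<size_t> C(257, 0);
    for (unsigned char c : bwt) C[c + 1]++;
    for (int c = 0; c < 256; c++) C[c + 1] += C[c];

    // index: position of every K-th occurrence of each character
    vector<vector<uint32_t>> idx(256);
    vector<size_t> seen(256, 0);
    for (size_t i = 0; i < n; i++) {
        unsigned char c = bwt[i];
        if (seen[c] % K == 0) idx[c].push_back((uint32_t)i);
        seen[c]++;
    }

    // select(c, r): position of the r-th occurrence of c, found by jumping
    // to the nearest checkpoint and scanning forward (sequential access)
    auto select = [&](unsigned char c, size_t r) -> size_t {
        size_t i = idx[c][r / K], cnt = (r / K) * K;
        while (true) {
            if ((unsigned char)bwt[i] == c && cnt++ == r) return i;
            i++;
        }
    };

    // walk the text forward: row i begins with the (i - C[c])-th occurrence
    // of c = F[i], and that occurrence's position in the BWT is the next row
    string out;
    out.reserve(n);
    size_t i = primary;
    for (size_t k = 0; k < n; k++) {
        int c = 0;
        while (C[c + 1] <= i) c++;            // find the bucket containing row i
        out += (char)c;
        i = select((unsigned char)c, i - C[c]);
    }
    return out;
}

int main() {
    // "nnbaaa" is the BWT of the cyclic rotations of "banana"; row 3 is the text
    printf("%s\n", inverse_bwt("nnbaaa", 3).c_str());   // prints "banana"
}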

I think this technique should yield a large improvement on the enwik9
benchmark. I might try it later this summer if nobody else does.

-- Matt Mahoney

Matt Mahoney

May 31, 2006, 10:28:29 PM
Matt Mahoney wrote:
> I am starting a new data compression benchmark here:
> http://cs.fit.edu/~mmahoney/compression/text.html

I am now up to 40 ranked compressors including contributions by others.
The rankings include most of the high end, popular, fast, and unusual
compressors. Many are under active development and are only a few days
old. I welcome any new submissions. I will test a few more but mostly
I expect to update new versions of existing programs from now on.

One result that was unexpected (to me), but now seems obvious in
hindsight, is that a common feature of all top compressors is that they
use all available memory (800 MB in my tests). It doesn't matter if
they use CM (context mixing), PPM, BWT, LZ77 or DMC. I expect that
memory-efficient algorithms will do well on this benchmark. This
hasn't been a high priority, I think, because most data types other
than text can be modeled without much memory. Text is rather unique
in this respect.

The best compressors include a lexical model, i.e. symbols represent
whole words instead of characters. XML-WRT improves compression of
almost every program by dynamically building a dictionary and mapping
words to 1 or 2 byte symbols. The only reason this doesn't work on
paq8h is that paq8h already has a static dictionary which performs the
same function.

Some high end compressors use text tricks like capital conversion
(replacing "The" with "*the" where a special symbol * means to
capitalize the next letter), or inserting spaces after newlines. This
works because different strings that have the same meaning are mapped
into the same context.
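
(As a concrete illustration, here is a minimal C++ sketch of just the
capital-conversion part; this is my own toy code, not WRT or XML-WRT, and a
real preprocessor would also escape any literal flag bytes in the input so
the transform stays exactly reversible.)

#include <cctype>
#include <cstdio>
#include <string>
using namespace std;

const char FLAG = '\x01';   // escape symbol meaning "capitalize the next letter"

string cap_encode(const string& in) {
    string out;
    for (size_t i = 0; i < in.size(); i++) {
        // "The" becomes FLAG + "the", so it lands in the same context as "the"
        if (isupper((unsigned char)in[i]) && i + 1 < in.size()
            && islower((unsigned char)in[i + 1])) {
            out += FLAG;
            out += (char)tolower((unsigned char)in[i]);
        } else {
            out += in[i];
        }
    }
    return out;
}

string cap_decode(const string& in) {
    string out;
    for (size_t i = 0; i < in.size(); i++) {
        if (in[i] == FLAG && i + 1 < in.size())
            out += (char)toupper((unsigned char)in[++i]);
        else
            out += in[i];
    }
    return out;
}

int main() {
    string s = "The quick Brown fox. The end.";
    printf("%s\n", cap_decode(cap_encode(s)) == s ? "round trip ok" : "mismatch");
}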

I think this idea could be extended to semantic and syntactic modeling.
For example, semantically related words like "star", "moon" and "sky",
or syntactically related words like "of", "to", and "from" could be
mapped into the same context. This language knowledge is learnable
from the input. Semantically related words are likely to appear near
each other, such as in an article about nighttime outdoors.
Syntactically related words are likely to appear in the same immediate
context, such as before the word "the".

Semantic and syntactic modeling would probably work best using a CM
algorithm, but could also be used in BWT. The idea is to reorder the
alphabet (of words) to bring related words together, as is now done for
characters in some compressors. BWT is promising because it can be
very memory efficient. It is possible to sort enwik9 after dictionary
preprocessing in a single block by using temporary files (although no
BWT program uses this technique yet). Once that is done, the
transformed data requires very little memory (order 0) to model.

-- Matt Mahoney

David A. Scott

May 31, 2006, 11:53:33 PM
"Matt Mahoney" <matma...@yahoo.com> wrote in
news:1149128909.2...@h76g2000cwa.googlegroups.com:

>
> Matt Mahoney wrote:
>> I am starting a new data compression benchmark here:
>> http://cs.fit.edu/~mmahoney/compression/text.html
>
> I am now up to 40 ranked compressors including contributions by others.
> The rankings include most of the high end, popular, fast, and unusual
> compressors. Many are under active development and are only a few days
> old. I welcome any new submissions. I will test a few more but mostly
> I expect to update new versions of existing programs from now on.
>
> One result that was unexpected (to me), but now seems obvious in
> hindsight, is that a common feature of all top compressors is that they
> use all available memory (800 MB in my tests). It doesn't matter if
> they use CM (context mixing), PPM, BWT, LZ77 or DMC. I expect that
> memory-efficient algorithms will do well on this benchmark. This
> hasn't been a high priority, I think, because most data types other
>

Matt, as you know, arb255 uses 255 two-cell arithmetic coders. I could write
a version that uses many more cells, maybe something that would consume 800
MB, if you're curious what a large tree-like arithmetic coder would do with
this data. I could never run the code on my machine: not enough memory and
not enough file space. Note arb255 would work on the files without change,
but it's only a simple arithmetic compressor; still, it could provide a
baseline. It should compress better than fpaq for true iid files, but would
be very slow.

In fact, since arb255 works on 8 bit words, it would take only a few
lines of changes to get it to work on 16 bit words and maybe even 24 bit
words. Again, the files don't have to be a full multiple of 16 or 24. And
there are several ways to connect the cells instead of the default tree
structure. I used to have several versions that ran on 19 bit words, and
they still compressed and ran on my machine.

Another thought: a bijective LZW might compress it, but it would be slow;
still, with enough memory and a fast machine it might give interesting
results on a long file. My current LZW would have to be changed a lot to
make full use of the memory. It is fully bijective: it starts with 2 roots,
1 and 0, and each time it compresses a string, that string is removed and 2
strings 1 bit longer are put in its place. So with massive text files the
strings should get very long.


David A. Scott
--
My Crypto code
http://bijective.dogma.net/crypto/scott19u.zip
http://www.jim.com/jamesd/Kong/scott19u.zip old version
My Compression code http://bijective.dogma.net/
**TO EMAIL ME drop the roman "five" **
Disclaimer:I am in no way responsible for any of the statements
made in the above text. For all I know I might be drugged.
As a famous person once said "any cryptographic
system is only as strong as its weakest link"

Matt Mahoney

Jun 1, 2006, 9:58:29 AM

1 GB of text is a rather severe test. If you do want to try it, send
me the numbers and I will post them. I'm looking for "unusual"
compressors to add to the benchmark, whether or not they compress well.
There are not many bijective compressors.

> In fact since arb255 works on 8 bit words it would only be a few
> lines of changes to get it to work on 16 bit words and maybe even 24 bit
> words. Again the files don't have to be a full multiple of 16 or 24. And
> there are several ways to connect the cells instead of the default tree
> structure. I use to have several versions that run on 19 bit words and they
> still compressed and ran on my machine.

Try it. I would be surprised if order 19-bit or 2 byte full statistics
would compress as well as zip, though.

> Another bad thought is a bijective LZW may compress it but would be slow
> but if enough memory and fast machine it might give interesting results on
> a long file. My current LZW would have to be changed a lot to make full use
> of the memory. It is fully bijectie starts with 2 roots 1 and 0 each time
> it compress a string that string is gone and 2 strings 1 bit longer are put
> in its place. So with massive text files the strings should get very long.

What you described sounds like DMC. The top ranked DMC compressor
right now is ocamyd, but it is very slow (a little faster than paq8h).
Its compression is similar to PPMd.

The PAQ model is bijective, with the exception of the header and
encoding of the file length. You could solve that by writing a single
file compressor and using the arithmetic coding trick you used in
fpaq0s. However, the preprocessing transforms like WRT and XML-WRT are
not bijective.

-- Matt Mahoney

David A. Scott

Jun 1, 2006, 10:19:53 AM
"Matt Mahoney" <matma...@yahoo.com> wrote in
news:1149170309....@f6g2000cwb.googlegroups.com:

Well, I could send you a zip of the source and executables for
the 8, 16 and 19 bit versions. These I can test on small files. I have no
way of testing on large files, but they would work. If I remember
right, the 2**19 tree worked better on Calgary than the 16 bit one. You
might not think so, but it is essentially looking at the last 18 bits for
a guess of the next bit, which is further back than the 16 bit version. I
could test these all on the Calgary files and give results, but not on your
large set. Of course they could never compete with paq8h, but they could
serve as benchmarks for higher order arithmetic style coders. If they beat
some PPM model, then I guess the PPM model needs to be tuned.

As for 24 bit models, they would not even run on my machine. I could
send the source and executables. That's if they even compile on my
machine. I could do much of this Saturday, as I have to teach during
the week.

>> In fact since arb255 works on 8 bit words it would only be a few
>> lines of changes to get it to work on 16 bit words and maybe even 24
>> bit words. Again the files don't have to be a full multiple of 16 or
>> 24. And there are several ways to connect the cells instead of the
>> default tree structure. I use to have several versions that run on 19
>> bit words and they still compressed and ran on my machine.
>
> Try it. I would be surprised if order 19-bit or 2 byte full
> statistics would compress as well as zip, though.
>

> -- Matt Mahoney

Matt Mahoney

Jun 1, 2006, 2:12:52 PM

Put the programs on your website. I only benchmark programs that are
available to the public. If they are really slow I might only test
them on enwik8. Of course it will help if you can benchmark it
yourself and send me the numbers. I will run a few more tests but I
plan to rely mostly on contributed results in the future.

-- Matt Mahoney

Claudio Grondi

Jun 1, 2006, 2:42:36 PM
David,

I have a Pentium 4 with 3 GByte of RAM and Windows XP, a Microsoft
Visual Studio C++ .NET 2003 optimizing compiler, and plenty of space on hard
drives, _and_ I am interested in language related subjects like
compression, so if prompted by an appropriate email I am ready to use my
equipment for testing purposes (if you like).

Claudio

Matt Mahoney

Jun 1, 2006, 4:37:25 PM
Matt Mahoney wrote:
> I am starting a new data compression benchmark here:
> http://cs.fit.edu/~mmahoney/compression/text.html

I have added one more column to the table for memory usage. It does
not affect the ranking, but it does show a clear trend. Like the speed
measurements, the numbers are approximate.

-- Matt Mahoney

David A. Scott

Jun 1, 2006, 9:55:05 PM
Claudio Grondi <claudio...@freenet.de> wrote in
news:e5ncem$m9j$1...@newsreader2.netcologne.de:

> David,
>
> I have a Pentium 4 with 3 GByte of RAM and Windows XP, a Microsoft
> Visual Studio C++ .NET 2003 optimizing compiler and much space on hard
> drives _and_ I am interested in language related subjects like
> compression, so if triggered by appropriate email I am ready to use my
> equipment for the testing purposes (if you like).
>
> Claudio
>

Well, the code is GNU C and it uses 64 bit integers, so if your compiler
can't handle these there would be no way to compile and run it. I am
not familiar with the capabilities of other C compilers.

But arb255 is at
http://bijective.dogma.net/arb255.zip

I will try to modify it for larger sizes this weekend.

Thanks

David A. Scott

Jun 1, 2006, 9:52:46 PM
"Matt Mahoney" <matma...@yahoo.com> wrote in
news:1149185572....@u72g2000cwu.googlegroups.com:

>
> Put the programs on your website. I only benchmark programs that are
> available to the public. If they are really slow I might only test
> them on enwik8. Of course it will help if you can benchmark it
> yourself and send me the numbers. I will run a few more tests but I
> plan to rely mostly on contributed results in the future.
>
>

Thanks Matt, arb255 is already there. The page is
http://bijective.dogma.net/compres11.htm
and the file is
http://bijective.dogma.net/arb255.zip
It has been available to the public for years.

Claudio Grondi

Jun 2, 2006, 9:41:35 AM
David A. Scott wrote:
> Claudio Grondi <claudio...@freenet.de> wrote in
> news:e5ncem$m9j$1...@newsreader2.netcologne.de:
>
>
>>David,
>>
>>I have a Pentium 4, 3 GHz with 3 GByte of RAM and Windows XP, a Microsoft
>>Visual Studio C++ .NET 2003 optimizing compiler and much space on hard
>>drives _and_ I am interested in language related subjects like
>>compression, so if triggered by appropriate email I am ready to use my
>>equipment for the testing purposes (if you like).
>>
>>Claudio
>>
>
>
> Well the code is GNU C and it use 64 bit integers so if your complier
> can't handle these there would be no way to complie and run it. I am
> not familar with the capabilites of other C's.
>
> But arb255 is at
> http://bijective.dogma.net/arb255.zip

I have downloaded it and made a Visual C++ .NET 2003 console project out
of arb255.cpp, bit_byts.cpp and bit_byts.h.

The project compiles almost out of the box.
I only had to adjust
int main(int argc, char *argv[])
to
int _tmain(int argc, _TCHAR* argv[])

The 64 bit integers seem to be handled properly in C++ .NET 2003: the
compiler generates an executable, giving some warnings about them
which are probably not critical, since results of __int64 operations are
assigned to variables declared as int (at least as far as I checked for the
bit_byts.h(45) warning).
Maybe it would be a good idea to use explicit conversions to avoid such
warnings.
The same probably goes for the operator precedence warnings; here too it
would be a good idea to insert parentheses to rule out possible
misinterpretations.

Here is the compiler output:

arb255\bit_byts.h(45) : warning C4244: '=' : conversion from '__int64'
to 'int', possible loss of data
arb255\bit_byts.h(82) : warning C4244: '=' : conversion from '__int64'
to 'int', possible loss of data
arb255\bit_byts.h(87) : warning C4244: '=' : conversion from '__int64'
to 'int', possible loss of data
arb255\bit_byts.h(94) : warning C4244: '=' : conversion from '__int64'
to 'int', possible loss of data
arb255\bit_byts.h(102) : warning C4244: '=' : conversion from '__int64'
to 'int', possible loss of data
arb255\bit_byts.h(108) : warning C4244: '=' : conversion from '__int64'
to 'int', possible loss of data
arb255\bit_byts.h(111) : warning C4244: '=' : conversion from '__int64'
to 'int', possible loss of data
arb255\bit_byts.h(114) : warning C4244: '=' : conversion from '__int64'
to 'int', possible loss of data
arb255.cpp(307) : warning C4554: '>>' : check operator precedence for
possible error; use parentheses to clarify precedence
arb255.cpp(307) : warning C4554: '>>' : check operator precedence for
possible error; use parentheses to clarify precedence
arb255.cpp(364) : warning C4554: '>>' : check operator precedence for
possible error; use parentheses to clarify precedence
arb255.cpp(365) : warning C4554: '>>' : check operator precedence for
possible error; use parentheses to clarify precedence
arb255.cpp(368) : warning C4554: '>>' : check operator precedence for
possible error; use parentheses to clarify precedence
arb255.cpp(369) : warning C4554: '>>' : check operator precedence for
possible error; use parentheses to clarify precedence
arb255.cpp(370) : warning C4554: '>>' : check operator precedence for
possible error; use parentheses to clarify precedence

>
> I will try to mode it for larger size this weekend

As I am not very familiar with the subject, in case you expect me to run
some tests please provide, along with the source code, some detailed
instructions on how to run them (which files to use, what to look for,
etc.).

>
> Thanks
You are welcome.

Claudio
>
> David A. Scott

David A. Scott

Jun 2, 2006, 11:41:23 AM
Claudio Grondi <claudio...@freenet.de> wrote in
news:e5pf6a$1ug$1...@newsreader2.netcologne.de:


The above are all from statements like

dw = (ax * dw) % bx;

dw was declared int and is 32 bits;
ax and bx were long long and should be 64 bits.
The result will always be a positive number that
fits in a 32 bit number space.

This is code I added for what I called the stability
interface to bit I/O. It prevents a file of all zeroes
or all ones from blowing up on decompression. It gets
no errors or warnings on the GNU compile. But if
int is 16 bits I would need to change it. I will
try to fix it for your compiler.

First, what is the size of "int" in bits on your machine?
If it's 32, maybe a change like

dw = (int)((ax * dw) % bx);

would not cause the compiler warnings. I will try to modify
it this weekend so that it gets the same answers on Calgary
and no nasty compiler warnings or errors.

> arb255.cpp(307) : warning C4554: '>>' : check operator precedence for
> possible error; use parentheses to clarify precedence


Not sure what's going on here; my line 307, out of 375 lines,
is as follows:

low = high - a;

where all three variables are
unsigned long long,
which are 64 bit unsigned numbers (zero to 2**64 - 1), and at this point
in the code high >= low and high >= a.


Not sure about the other warnings since the count is off. But your compiler
may not like the long compound "if" statements. I think they are
correct and standard, except some are more than one line long. To
add in more parentheses would only make them longer. I feel that
if there were an error here it would have shown up long ago.

There is something else I could do for these tests, since it's impossible
to make all people happy. I could drop the stability code in these versions;
it was mainly there so that I could test decompressing test data files
that are highly regular (that is, files that may be useful). Since the
compressor is bijective, you can decompress first and then compress back
to the file, and it should match the starting file no matter what.
In fact you may notice the following two lines in the code of arb255.cpp:

define dasw out.ws
//define dasw out.wz

The out.ws is for the stability thing;
the out.wz is for straight bit I/O.

When I test it, out.ws, while better behaved during decompression,
is slower than out.wz. Also, out.ws has the side effect of making
the compressor more suitable if one is compressing and then encrypting,
in my view. I code for fun and encryption interests me, but I follow my
own views, which most would say are not mainstream.


> arb255.cpp(307) : warning C4554: '>>' : check operator precedence for
> possible error; use parentheses to clarify precedence
> arb255.cpp(364) : warning C4554: '>>' : check operator precedence for
> possible error; use parentheses to clarify precedence
> arb255.cpp(365) : warning C4554: '>>' : check operator precedence for
> possible error; use parentheses to clarify precedence
> arb255.cpp(368) : warning C4554: '>>' : check operator precedence for
> possible error; use parentheses to clarify precedence
> arb255.cpp(369) : warning C4554: '>>' : check operator precedence for
> possible error; use parentheses to clarify precedence
> arb255.cpp(370) : warning C4554: '>>' : check operator precedence for
> possible error; use parentheses to clarify precedence
>
>>
>> I will try to mode it for larger size this weekend
>
> As I am not very familiar with the subject in case you expect me to
> run some tests please provide with the source code also some detailed
> instructions how to run the tests (which files to use, what to look
> for etc.)
>
>>
>> Thanks
> You are welcome.
>
> Claudio
>>
>> David A. Scott
>


To compress, don't use redirects; use

arb255.exe file.in file.out

file.in is any file to be compressed.
file.out is any file for output.

To decompress:

unarb255.exe file.in file.out

file.in is any file you wish to decompress.
file.out is any file for output.

Note this works with all files, but compression
does not always make files smaller; this is
true of any general compressor.


Since machines are so different, I would use this in
a directory by itself, and as anyone in today's world
has to say: hey, it's free, there are no guarantees.

Note I use these in DOS using the DJGPP port of GNU,
but in the changing world of PCs it's my understanding
Gates doesn't like the user to have much control; not
sure it even runs on XP unless there still is some
sort of DOS.

I've got to go teach, so more later if you're interested.

Claudio Grondi

Jun 2, 2006, 1:27:16 PM
Sorry, I forgot to delete some commented-out lines in the code ...
the line at 307 should be the line at 303 (all are shifted by 4 additional
lines).

>
> Where all three varibles are
> unsigned long long
> which are 64 bit unsigned numbers zero to 2**64 and at this point
> in code high >= low and high >= a
>
>
> Not sure of the other errors since count is off. But you complier
> may not like the long compound 'IF statements"
Exactly, this is the reason for the warnings.
To be honest, I have no idea in which order things are done in these
statements without parentheses, as in my own code I always rely on
them and therefore don't know the precedence rules by heart (they should be
the same, but who knows if they really are...).

Please let me know the exact links to the files to run, the exact command
line to run, and the results you want to know from that run.

Claudio
>
> David A. Scott

Matt Mahoney

Jun 2, 2006, 3:11:18 PM
David A. Scott wrote:
> "Matt Mahoney" <matma...@yahoo.com> wrote in
> news:1149185572....@u72g2000cwu.googlegroups.com:
>
> >
> > Put the programs on your website. I only benchmark programs that are
> > available to the public. If they are really slow I might only test
> > them on enwik8. Of course it will help if you can benchmark it
> > yourself and send me the numbers. I will run a few more tests but I
> > plan to rely mostly on contributed results in the future.
> >
> >
>
> Thinks Matt arb255 is already there
> the page is
> http://bijective.dogma.net/compres11.htm
> the file is
> http://bijective.dogma.net/arb255.zip
> is been available to the public for years.

I benchmarked arb255 and arb2x. Yeah I know these really aren't meant
to compress text. http://cs.fit.edu/~mmahoney/compression/text.html

arb255 is a few bytes smaller than fpaq1 (order 0, 64 bit arithmetic
coder). Both decompress bit-exact on enwik8 but arb2x fails on enwik9.
It seems the compressed output is truncated around 2^31 bits
(overflow?). unarb2x is bit-exact up to the point where the output is
truncated.

-- Matt Mahoney

David A. Scott

Jun 2, 2006, 4:20:03 PM
"Matt Mahoney" <matma...@yahoo.com> wrote in
news:1149275477....@h76g2000cwa.googlegroups.com:

Well, arb255 is a bijective order 0, 64 bit arithmetic coder, using
a tree of coders 1 2 4 8 ... to 128 for a total of 255. The top of
the tree gets used once per byte.
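
(For anyone unfamiliar with that layout, here is a small C++ sketch of the
idea as I understand it, not arb255's actual code: a byte is coded bit by
bit, the bits seen so far select one of the 1+2+4+...+128 = 255 contexts,
and the toy "coder" below just adds up the ideal code length instead of
doing real arithmetic coding.)

#include <cmath>
#include <cstdio>
#include <cstring>
using namespace std;

// toy "coder" that just accumulates the ideal code length in bits
struct CostCoder {
    double bits = 0;
    void encode(int bit, unsigned n0, unsigned n1) {
        double p1 = double(n1) / (n0 + n1);
        bits += -log2(bit ? p1 : 1.0 - p1);
    }
};

struct Node { unsigned n0 = 1, n1 = 1; };   // one two-cell counter per context

struct ByteTree {
    Node node[256];                         // indices 1..255 used: 1+2+4+...+128

    void code(unsigned char byte, CostCoder& coder) {
        unsigned ctx = 1;                   // the root; used exactly once per byte
        for (int i = 7; i >= 0; i--) {
            int bit = (byte >> i) & 1;
            coder.encode(bit, node[ctx].n0, node[ctx].n1);
            if (bit) node[ctx].n1++; else node[ctx].n0++;
            ctx = ctx * 2 + bit;            // bits seen so far pick the next context
        }
    }
};

int main() {
    ByteTree tree;
    CostCoder coder;
    const char* msg = "HELLO WORLD HELLO WORLD HELLO WORLD";
    for (const char* p = msg; *p; p++) tree.code((unsigned char)*p, coder);
    printf("ideal size: %.1f bits for %zu input bytes\n", coder.bits, strlen(msg));
}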

arb2x is just a single coder, used more to test the concept. Since
there is only one coder, it gets used 8 times as much as the top row coder
in arb255. This means that arb255 would also likely fail on a file that
is 8 times longer. You're right, it obviously overflows somewhere. I don't
account for having 2^31 bits, but I want it to work for all files that
C would support, so I will fix it. I don't think it will be that
hard; even though I don't have files that long, I can simulate the
routines separately to find out where it is. It may be a problem where
I have a 32 bit int instead of a 64 bit one, or it could be the scaling
routine where the product of two 64 bit ints overflows. Oh well, that is
going to make my Saturday busy.


I also see that in both cases the 64 bit coders didn't get as small an
answer as the 32 bit ones. It just means the characters are not mixed well,
and maybe here even a 16 bit coder might be better.


But thanks for trying it

David A. Scott

Jun 2, 2006, 5:23:36 PM
Claudio Grondi <claudio...@freenet.de> wrote in
news:e5psdm$nu7$1...@newsreader2.netcologne.de:

> Please let me know the exact links to files to run, the exact command
> line to run and the results you want to know from that run.
>
> Claudio
>

You would have to go to Matt's site
and get enwik8 or enwik9,
or the zip equivalents, at
http://cs.fit.edu/~mmahoney/compression/text.html

The exact command line, at least in DOS, is

arb255.exe enwik8 temp.tmp

To uncompress, the exact command would be

unarb255.exe temp.tmp temp1.tmp

If it works on your machine, enwik8 and
temp1.tmp should be identical, and
temp.tmp would be the compressed file.

As you know, I need to fix an overflow problem
before I make bigger versions of arithmetic
bijective coders. I need to fix arb2x, which
is the heart of the whole thing; it does not
handle files of 2**31 bits, so it will take
time.

Here is an example that should show exactly
how it runs in DOS

here is file f1.tmp in hex:
0000 48 45 4C 4C 4F 20 57 4F 52 4C 44 . . . . . *HELLO WORLD*
number of bytes is 11

next is command line to compress f1.tmp to f1c.tmp
arb255.exe f1.tmp f1c.tmp

here is the compressed file f1c.tmp
0000 A8 06 43 FA A4 FE 12 AC 40 . . . . . . . *..C.....@*
number of bytes is 9

here is the command line to uncompress f1c.tmp to f1cu.tmp
unarb255.exe f1c.tmp f1cu.tmp

here is uncompressed file f1cu.tmp note same as f1.tmp
0000 48 45 4C 4C 4F 20 57 4F 52 4C 44 . . . . . *HELLO WORLD*
number of bytes is 11

here is command line to uncompress f1.tmp to f1u.tmp
unarb255.exe f1.tmp f1u.tmp

here is the uncompressed file f1u.tmp
0000 A8 81 FA E1 F7 C3 F6 FB 26 B8 FB F0 . . . . *........&...*
number of bytes is 12

here is command line to compress file f1u.tmp to f1uc.tmp
arb255.exe f1u.tmp f1uc.tmp

here is the compressed file f1uc.tmp, note same as f1.tmp
0000 48 45 4C 4C 4F 20 57 4F 52 4C 44 . . . . . *HELLO WORLD*
number of bytes is 11

I hope this example explains how to run arb255.exe

David A. Scott

Jun 2, 2006, 5:36:49 PM
"David A. Scott" <daVvid_...@email.com> wrote in
news:Xns97D69DD2232EEH1...@81.174.50.80:

You test on trivially small files first to see if you can even use
it at all.

David A. Scott

Jun 2, 2006, 6:12:36 PM
"Matt Mahoney" <matma...@yahoo.com> wrote in
news:1149275477....@h76g2000cwa.googlegroups.com:

>
> arb255 is a few bytes smaller than fpaq1 (order 0, 64 bit arithmetic
> coder). Both decompress bit-exact on enwik8 but arb2x fails on enwik9.
> It seems the compressed output is truncated around 2^31 bits
> (overflow?). unarb2x is bit-exact up to the point where the output is
> truncated.
>
>

I found the problem: I keep track of the bit counts in variables that
are 32 bits long. Actually, one of them starts at 1 and is incremented
by 2's. Your files are too long for this. I do no scaling whatsoever, so
the counts are never adjusted. I either have to add scaling or I have
to go to LONG LONG in more places. I hate scaling; I could play games
to maybe squeeze out one more bit of use, but that would slow it down
and only double the length before failure. It looks like for files
that long I will have to destroy the beauty of the pure bijective
thing I was striving for, or go to LONG LONG and most likely not only
double memory but slow it down a little bit more. I need to think about
this for a while.

Actually, for arb2x the memory change is small. It's the higher order models
with the large memory trees that would double.
And actually it would not affect arb255 unless you want to go to a few
gigs more in length. The problem is I would like even arb255 to handle
any file the C code allows. Let me sleep on it!!

Matt Mahoney

Jun 2, 2006, 6:22:33 PM
David A. Scott wrote:
> I also see in both cases the 64 bit coders didn't get as small an
> anwser as the 32 bit. It just means the characters are not mixed well
> and maybe here eve a 16 bit might by better.

I think this is a modeling issue. For text you can get good
compression using 5 to 7 bits in your counters to make it more
adaptive. When they overflow, divide by 2.
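
(A minimal sketch of such a counter in C++, my own illustration rather than
code from fpaq1 or PAQ; the 6 bit limit is just one choice in that 5 to 7
bit range.)

#include <cstdio>

struct BitCounter {
    unsigned n0 = 0, n1 = 0;            // counts of 0 and 1 bits in this context
    static const unsigned LIMIT = 63;   // about 6 bits per counter

    // probability that the next bit is a 1, scaled to 12 bits
    unsigned p1() const { return 4096 * (n1 + 1) / (n0 + n1 + 2); }

    void update(int bit) {
        if (bit) n1++; else n0++;
        if (n0 > LIMIT || n1 > LIMIT) { n0 /= 2; n1 /= 2; }   // halve on overflow
    }
};

int main() {
    BitCounter c;
    for (int i = 0; i < 1000; i++) c.update(1);   // a long run of 1 bits
    printf("p1 = %u/4096\n", c.p1());             // prints a probability near 1
}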

-- Matt Mahoney

David A. Scott

Jun 2, 2006, 7:08:21 PM
"Matt Mahoney" <matma...@yahoo.com> wrote in
news:1149286953.5...@h76g2000cwa.googlegroups.com:

Yes, but then it would not be a true benchmark zero order adaptive
compressor. Yes, it might compress smaller, but don't you want to know
just what a true zero order adaptive compressor would get? Now you
could argue, like nitelight, that it's foolish to try since arithmetic
coders are bad. But I think this is close. If you took your test
files and sorted the 8 bit characters used, then most compressors would
recognize the difference and compress much smaller. But since this
compressor attempts to be a true zero order adaptive compressor, it
should not vary by more than one or two bytes.

As for higher order models, with
16 bits I could either treat it as 2 eight bit symbols, getting you
an order 1 adaptive model, or as an order 0 16 bit symbol true arithmetic
model. At least you could have known benchmark compressions. For
example, if the order 1 8 bit thing beats some PPM, then there is
something wrong with the PPM. Anyway, just some thoughts.


Matt, concerning arb2x and arb255, I should not have been in a
hurry to answer. Maybe I should not be in a hurry to write this,
but it appears that in arb2x I did use 32 bit ints for the
counters, I guess thinking of strings. But in arb255 I use 64 bits,
so it appears that even if you expanded your long file by a factor of 10
or higher, the overflow problem is avoided in arb255; however, I will
change arb2x to handle a larger number of bits.

Thanks

David A. Scott

Jun 3, 2006, 10:32:40 AM
"David A. Scott" <daVvid_...@email.com> wrote in
news:Xns97D6930BD78FCH1...@81.174.50.80:


Matt, arb2x.zip is at my site. I tested it against the old version
using the Calgary files. It gets the exact same answer. I made
3 changes to the code:
1) changed 2 variables from type LONG to type LONG LONG to
match what I did in arb255
2) I write periods "." to the screen after a certain number of
bits. Since it's only to show it's running, it's pure waste,
so I changed the rate so ten times fewer periods go to the screen
3) updated version numbers to 20060602

arb255 is untouched since it's working, but I will modify it for the
larger trees and also reduce the number of periods written
to the screen.

ARB2X is nothing more than a true bijective 0 order adaptive
coder with one binary cell. The total amount of compression is
a function of the number of ZEROES and ONES and not the ORDER
in which they appear. So if the ratio of ones to zeroes is close
to ONE you get expansion. If they are lopsided, say only
ZEROES or only ONES, you get maximum compression.
One more note: the actual number of ones and zeroes does not
usually add up to a multiple of 8; it actually compresses the
underlying bijective string of ones and zeroes, which will
never be longer than the number of bits in the byte file.
The intent of this code was to test the binary 64 bit wide
2 state cell and to compress small strings of ones and zeroes.
You can take a text file of ascii "1" and "0", bijectively
map it to a binary byte file, then use arb2x to compress, and then
run code to get the actual compressed string back in ascii
"1" and "0". So in theory it's just a binary string compressor
where the string is any number of ones and/or zeroes.

http://bijective.dogma.net/compres11.htm

is where I talk about it. I am not the best writer, but
there is an example where I show 64 ascii zeroes and
compress them to a string of length 9, and I show the
string in ascii. The example also shows what happens if
less than the full 64 bits are used in the arithmetic HIGH LOW
registers. Of course 64 is overkill, and I show the results
for various lengths; 1, 2, 3 and 4 are really too small but test the
code for different register sizes.

If you test it with just plain binary files of all zeroes
or all ones you get tremendous compression; the longer the file, the
higher the amount of compression. But of course this is just
an exercise.


The question is, will it work on your longest files? I don't know;
I can't test it on anything that long. I am not god, I may have
missed something, but I feel there is at least an 80% chance
it will work, since if what I changed were wrong it would fail
obviously.

Matt Mahoney

Jun 3, 2006, 6:17:17 PM
David A. Scott wrote:
> Matt concerning arb2x and arb255 I should have not been in a
> hurry to anwser. Maybe I should not be in a hurry to write this.
> but it appears that in arb2x I did use 32 bit ints for the
> counters I guess thinking of strings. But in arb255 I use 64 bits
> so it appears even if you expanded your long file by a factor of 10
> or higher the overflow problem is avoided in arb255 however will
> change arb2x to handle longer number of bits.

I understand. fpaq1 uses 64 bit arithmetic like arb255, so the 7 byte
difference in enwik8 can be attributed to not coding the EOF. fpaq1
actually uses 32 bit counters that overflow at 2 x 10^9, then computes
the probabilities using 64 bit arithmetic.

-- Matt Mahoney

David A. Scott

Jun 3, 2006, 9:16:07 PM
"Matt Mahoney" <matma...@yahoo.com> wrote in
news:1149373037.5...@y43g2000cwc.googlegroups.com:

Yes, I think your method adds roughly 53 to 54 bits more, which is 7 bytes,
for the 9th bit; for enwik9 it would be roughly 60 bits or 8 bytes.
That is, if the scaling doesn't change the results.
There are other minor differences that don't seem to make much of
a difference here. I have a lot of junk in mine to handle tests when I want
to make the high and low smaller, which is not needed for 64 bits, but I left
it all in. A second difference is how we cut it into two pieces. You have a
quick fixed way to slice it up. I look at the two slices to see if they
both allow for future expansion. Which means the top is not always for
1 or 0. It's not always for the most occurring symbol or the least. It
varies; this adds more steps, but I was aiming for two things: whether the
code picks the likely or the unlikely, I want both, if possible, to expand.
This is of course what most would call crazy. I am trying to squeeze one
more fraction of a bit, and it has the side effect, I think, of making it
more secure for use with compression followed by encryption, but it's fun
and I like the fact it's bijective, which yours is not.
