On the Calgary corpus, PAQAR 4.5 compresses about 2K smaller than
PAQ8C, but PAQ8C is 3 times faster. Both are about a 20K improvement
over paq8b. PAQARCC 4.5 is a version of PAQAR 4.5 optimized for the
Calgary corpus. It compresses 4K smaller than PAQAR 4.5 and is the top
compressor on my benchmark. All programs use dictionaries.
PAQ8C is based on PAQ8B. However, PAQAR 4.5 is based on the PAQ6 core.
Future versions of PAQAR will probably be based on PAQ7/8, which
replaces the gradient descent model mixer with a neural network.
Below are results for PAQ8C, PAQAR 4.5, and PAQARCC 4.5 each at 2
memory settings (116 and 418 MB for PAQ8C, 191 and 626 MB for the
others). The times below are for a 2.2 GHz Athlon-64 in 32 bit mode
(WinXP) and 1GB memory.
C:\res\calgary>dir/b | ..\paq8c\paq8c -4 ..\x8c
Enter names of files to compress, followed by blank line
111261 BIB: -> 83112 (text) -> 17160
768771 BOOK1: -> 493825 (text) -> 174567
610856 BOOK2: -> 348174 (text) -> 114098
102400 GEO: -> 44468
377109 NEWS: -> 249870 (text) -> 79313
21504 OBJ1: -> 8006
246814 OBJ2: -> 48510
53161 PAPER1: -> 32438 (text) -> 10568
82199 PAPER2: -> 46144 (text) -> 16082
513216 PIC: -> 23103
39611 PROGC: -> 28455 (text) -> 8456
71646 PROGL: -> 51708 (text) -> 9757
49379 PROGP: -> 40239 (text) -> 7084
93695 TRANS: -> 81099 (binary+text) -> 11357
3141622 -> 572723 (1.4584 bpc) in 119.20 sec (26.355 KB/sec)
Time 119.22 sec, memory 115973024 bytes
C:\res\calgary>dir/b | ..\paq8c\paq8c -6 ..\x8c6
Enter names of files to compress, followed by blank line
111261 BIB: -> 83112 (text) -> 17158
768771 BOOK1: -> 493825 (text) -> 174426
610856 BOOK2: -> 348174 (text) -> 113949
102400 GEO: -> 44441
377109 NEWS: -> 249870 (text) -> 79155
21504 OBJ1: -> 8059
246814 OBJ2: -> 48525
53161 PAPER1: -> 32438 (text) -> 10519
82199 PAPER2: -> 46144 (text) -> 16000
513216 PIC: -> 23089
39611 PROGC: -> 28455 (text) -> 8449
71646 PROGL: -> 51708 (text) -> 9810
49379 PROGP: -> 40239 (text) -> 7138
93695 TRANS: -> 81099 (binary+text) -> 11353
3141622 -> 572265 (1.4572 bpc) in 120.13 sec (26.153 KB/sec)
Time 120.16 sec, memory 417962912 bytes
C:\res\calgary>dir/b | ..\paqar45\paqar -5 ..\xar5
PAQAR (PAQ+Dictionary) v4.5 by M.Mahoney+A.Ratushnyak+P.Skibinski,
12.2.2006
Enter names of files to compress, followed by blank line or EOF.
- loading dictionary C:\res\paqar45\pasqda.dic
- loaded dictionary 103370/559168 words
BOOK2 610856 -> 111362+2109
BOOK1 768771 -> 171145+2913
NEWS 377109 -> 79172+469
PAPER1 53161 -> 10377+81
PAPER2 82199 -> 15849+157
BIB 111261 -> 17280
PROGC 39611 -> 8261
PROGL 71646 -> 9701
PROGP 49379 -> 7000
GEO 102400 -> 44340
OBJ1 21504 -> 7851
OBJ2 246814 -> 45937
PIC 513216 -> 24169
TRANS 93695 -> 11998
570374/3141622 in 373.34 sec. (1.4524 bpc, 18.16% at 8 KB/s)
C:\res\calgary>dir/b | ..\paqar45\paqar -7 ..\xar7
PAQAR (PAQ+Dictionary) v4.5 by M.Mahoney+A.Ratushnyak+P.Skibinski,
12.2.2006
Enter names of files to compress, followed by blank line or EOF.
- loading dictionary C:\res\paqar45\pasqda.dic
- loaded dictionary 103370/559168 words
BOOK2 610856 -> 111212+2109
BOOK1 768771 -> 170995+2913
NEWS 377109 -> 78979+469
PAPER1 53161 -> 10313+81
PAPER2 82199 -> 15791+157
BIB 111261 -> 17315
PROGC 39611 -> 8280
PROGL 71646 -> 9743
PROGP 49379 -> 7061
GEO 102400 -> 44300
OBJ1 21504 -> 7901
OBJ2 246814 -> 45977
PIC 513216 -> 24170
TRANS 93695 -> 11987
569956/3141622 in 369.38 sec. (1.4514 bpc, 18.14% at 9 KB/s)
C:\res\calgary>dir/b | ..\paqar45\paqarcc -5 ..\xcc5
PAQARCC (PAQ+Dictionary) v4.5 by M.Mahoney+A.Ratushnyak+P.Skibinski,
12.2.2006
Enter names of files to compress, followed by blank line or EOF.
- loading dictionary C:\res\paqar45\pasqda.dic
- loaded dictionary 103370/559168 words
BOOK2 610856 -> 111370+2109
BOOK1 768771 -> 171134+2913
NEWS 377109 -> 78702+469
PAPER1 53161 -> 10112+81
PAPER2 82199 -> 15773+157
BIB 111261 -> 17190
PROGC 39611 -> 8280
PROGL 71646 -> 9656
PROGP 49379 -> 6953
TRANS 93695 -> 10501
GEO 102400 -> 44330
OBJ1 21504 -> 7578
OBJ2 246814 -> 45404
PIC 513216 -> 23578
566495/3141622 in 372.28 sec. (1.4426 bpc, 18.03% at 8 KB/s)
C:\res\calgary>dir/b | ..\paqar45\paqarcc -7 ..\xcc7
PAQARCC (PAQ+Dictionary) v4.5 by M.Mahoney+A.Ratushnyak+P.Skibinski,
12.2.2006
Enter names of files to compress, followed by blank line or EOF.
- loading dictionary C:\res\paqar45\pasqda.dic
- loaded dictionary 103370/559168 words
BOOK2 610856 -> 111214+2109
BOOK1 768771 -> 170970+2913
NEWS 377109 -> 78513+469
PAPER1 53161 -> 10107+81
PAPER2 82199 -> 15694+157
BIB 111261 -> 17220
PROGC 39611 -> 8246
PROGL 71646 -> 9695
PROGP 49379 -> 6993
TRANS 93695 -> 10431
GEO 102400 -> 44300
OBJ1 21504 -> 7575
OBJ2 246814 -> 45210
PIC 513216 -> 23567
565669/3141622 in 443.01 sec. (1.4405 bpc, 18.01% at 7 KB/s)
-- Matt Mahoney
You can use paq7asm.asm from paq7. It is the same code in all the paq7
and paq8 versions. I am pretty sure that paq8asm.obj came from
paq7asm.asm.
> Has anyone else managed to build the programs with TextFilter.cpp
> under Linux? Any enlightenment will be gratefully received.
>
> Many thanks to Matt and the others for doing this work and making it
> available.
>
> Palimpsest
paq8b (and c, d, e) were written by Przemyslaw Skibinski. These are
modifications of paq8a (that I wrote) that add dictionary preprocessing
and other transforms, essentially integrating WRT. The core algorithm
is otherwise unchanged. PAsQDa, which is based on the older paq6 and
paqar, does similar processing.
AFAIK WRT only works in Windows. I have not tried porting it. I think
the Windows code is mainly used to find the dictionary files by finding
out the directory containing the executable. It could probably be
ported by using $PATH to find it instead, or some hack like that.
I've been doing lots of tests and so far I haven't tried porting it
because what I found is that dictionary preprocessing actually hurts
compression in a lot of cases. The good results on the Calgary corpus
and maximumcompression.com corpus are in part due to using the text
files from this data (plus other sources) to build the dictionary. On
other data sets I've tested, like the untarred gimp-2.0.0 files, paq8a
(no dictionary) compresses better than the later versions that use
dictionaries (below). So I am reluctant to put text preprocessing in
until these problems are solved.
(-6 option uses 460 MB memory, 1 hour CPU on a 2.2 GHz Athlon-64 in XP
home)
6,269,937 gimp.paq8a-6
6,979,714 gimp.paq8b-6
6,486,840 gimp.paq8d-6
6,647,846 gimp.paq8e-6
(-7 option uses 900 MB memory)
6,054,170 gimp.paq8a-7
6,740,613 gimp.paq8b.7
6,245,464 gimp.paq8d-7
6,521,194 gimp.paq8e-7
I am also concerned about code stability. These are experimental
programs, so I expect there to be some bugs. Several older versions of
PAsQDa had bugs in that certain files did not decompress back to the
original due to bugs in the preprocessor. It is very hard to test for
all possible cases. One version failed if the input text contained the
word "bulandsness".
So in paq7 I added self testing of the transform at compression time to
make such errors less likely. During compression, the input is
transformed to a temporary file, then the inverse transform is applied
and compared to the original input. If there is a mismatch then the
file is compressed without the transform, so decompression is still
correct.
Of course this does not eliminate all bugs. paq8c crashed on some text
files (including gimp). This was fixed in paq8d, but then it ran in
Windows XP but not Windows 98 or Me. That was fixed in paq8e and so
far I have not found any bugs, although it only runs in Windows for
now.
-- Matt Mahoney
Yeah, lots of people have modified the code but haven't bothered to
update the comments. When people improve the compression, I just test
it on the Calgary corpus and post the code they give me.
paq8d (Feb. 8, 2006) - improved text and exe detection, and a bug fix
for paq8c, which crashed on some text files.
paq8e (Feb. 10, 2006) - improved detection of XML, HTML and UTF-8, and
a bug fix for paq8d, which did not work in Windows 98 or Me. So far
there has been lots of testing and no more bugs found.
paq8f (Feb. 28, 2006) - based on paq8a, which does not use dictionaries
or text processing. It has a drag and drop interface (in Windows)
improved x86 detection, a more memory efficient context model, and a
new indirect context model. It has the best compression on most data
types except for most English text files, where the dictionary gives
paq8e the edge. The dictionary versions (paq8b/c/d/e) by Przemyslaw
Skibinski only work in Windows. However I have tested paq8f (and
paq8a) in Linux and Mac OS X.
I was rather surprised to find that indirect context modeling improves
compression on almost every file I tested, text or binary. An indirect
context is a context model in which a history is kept only within
another context. For example, given input:
AB...AC...AB...AC...AB...AC...A_
then in the direct order-1 context "A" we have the sequence "BCBCBC".
We observe that in the order-(1,2) context (A,BC), which occurred twice
before, that in both cases the next symbol is "B", so that is what the
model predicts to be most likely. An ordinary context model could not
make such a prediction. paq8f models indirect contexts with orders
(1,1), (1,2), (1,3), (2,1), and (2,2).
An order-(m,n) indirect context would be equivalent to a BWT compressor
sorting on the first m characters (using a stable sort) and using an
order-n context model on the transform. Normally BWT uses an order-0
model.
-- Matt Mahoney
Why? It's columnising/rasterizing data on an alinear context-grid.
There
where PPM-variants that used a linear context-grid for improving
compressing on 'pic' (the dimensions where known, so the grid-stepsize
was choosen to be part of 640). Because of too expensive possible
mixers
(in that time) it crunched down to only observe deterministic
grid-contexts: for example if every (640/8)th byte was predicted by a
deterministic context favor that over prediction by a PPM-context.
> An indirect
> context is a context model in which a history is kept only within
> another context. For example, given input:
>
> AB...AC...AB...AC...AB...AC...A_
>
> then in the direct order-1 context "A" we have the sequence "BCBCBC".
> We observe that in the order-(1,2) context (A,BC), which occurred twice
> before, that in both cases the next symbol is "B", so that is what the
> model predicts to be most likely. An ordinary context model could not
> make such a prediction. paq8f models indirect contexts with orders
> (1,1), (1,2), (1,3), (2,1), and (2,2).
>
> An order-(m,n) indirect context would be equivalent to a BWT compressor
> sorting on the first m characters (using a stable sort) and using an
> order-n context model on the transform. Normally BWT uses an order-0
> model.
In effect it's a pattern-detector, but a not so good adjustable one.
You
have a parameter-space that's very hard to control, because you can in
principle construct every form of repetitive pattern: for example a
context that observes at position 1,2,4,8,16 of each call, or
1,2,1,1,2.
It's unclear which exact pattern may work beforehand, and it's even
extreme expensive to run a two-pass algorithm detecting optimal pattern
configurations.
Also it's unclear when to switch between patterns and off the patterns
in favor of a possible (for ex.) PPM-context.
I don't say it's a bad method (I did it also, see
http://cvs.sourceforge.net/viewcvs.py/pyramidworkshop/libPPM-0.0.0/Context/Run-Interleaved.hpp?rev=1.1&view=markup
) but it's blowing up optimality (the per se detection that it models
better on most files, doesn't say anything, you know).
Ciao
Niels
An indirect context model is a little different from interleaving for
PPM. With interleaving on 2-D structured data you gain the bytes above
as context but lose the context of the immediately preceding byte. paq
allows both contexts by mixing the predictions of different models
(automatically detecting width). An indirect context model is distinct
from both of these contexts allowing further compression by combining 3
models. It does not require repeating at regular intervals.
Here is an interesting experiment I did. First I created the following
49999 line file, count (with CR LF line terminators):
1
2
3
...
49998
49999
This file has low Kolmogorov complexity, so ought to be highly
compressible. Here are some results:
Original size: 338887
pkzip: 103598
bzip2: 67670
GRZipII: 64425
slim v21: 35880
paqar 4.1 -6: 2111
ppmonstr J: 992
paq8a -4: 721
paq8f -6: 671
paqar uses the paq6 style based model mixing, where each model outputs
a prediction based on the history of 0 and 1 counts. Starting in paq7,
the context model also remembers what the last bit was, and also if
there are 4 or fewer observations, the exact order of these bits. Then
that state is mapped to a prediction using an adaptive table. This is
a simple type of indirect context model applied to all the regular
contexts at the bit level. This might account for some the improvement
of paq8a over paqar, although I think it is more from adding a new
model based on column position in text files. The indirect context
model I described was added to paq8f, so you might say there are
actually 2 levels of this model.
-- Matt Mahoney
>> It's columnising/rasterizing data on an alinear context-grid.
I still say it's not. :)
Your only applying an alinear transform on a
linear grid. The alinear transform is (for example) creating columns
after every occurance of an 'A'. Because it's an adaptive process you
don't store the transform (positions) but determine (build them up)
while running the algorithm. In the end you have a transpositional
matrix (or at least can extract it).
Also a simple modification (adding CM) of the pure PPM leads to the
same conceptual algorithm:
a->b->r->a->c->a->d->a->b->a->grid->frequencies
---------------------------- ---- -----------
1) 2) 3)
1) the prefix-context PPM
2) any additional context CM
3) the probabalistic model
I guess that that is even the conceptual clearer variant because
you can incorporate full symbol-exlusion, even with indirect
context.
> paq
> allows both contexts by mixing the predictions of different models
> (automatically detecting width). An indirect context model is distinct
> from both of these contexts allowing further compression by combining 3
> models.
I know. But:
- the RLI is distinct from the PPM / indirect context model is distinct
from any other model
- the context for the RLI is the linear grid-position / the indirect
context is the transformed grid-position
- the mixer is 'choose-only-deterministic-for-rli-model' / the mixer is
neural-net based
So where do you see the conceptual difference?
RLI is part of the indirect context model, you can exactly create a
RLI
by changing only a handfull of characters in your code (I guess :-).
> It does not require repeating at regular intervals.
I know:
>> for example a
>> context that observes at position 1,2,4,8,16 of each call, or
>> 1,2,1,1,2.
>> It's unclear which exact pattern may work beforehand, and it's even
>> extreme expensive to run a two-pass algorithm detecting optimal pattern
>> configurations.
>> Also it's unclear when to switch between patterns and off the patterns
>> in favor of a possible (for ex.) PPM-context.
Also it's unclear when to switch between patterns and off the patterns
in favor of a possible (for ex.) other PAQ-model.
I'm speaking here about a algorithm computing an optimal indirect
context
model. Because there are (+- in the quantity of) n! (n the number of
characters to code) possible indirect context models it's not very
obvious
how to do that.
We're always screaming after pigeon-hole and
what-makes-it-smaller-makes-
it-bigger-too, so:
- for _which_ data is your (1,2) worst? (which data obeys
pattern-rules,
or can be constructed by pattern-based generators? is there an (for
ex.)
english text without this properties?)
- _why_ is it good for the files you tested? (is it only compensating
slow adaption in other models? at which point reach other models the
same or better predictions?)
> Here is an interesting experiment I did. First I created the following
> 49999 line file, count (with CR LF line terminators):
>
> 1
> 2
> 3
> ...
> 49998
> 49999
Why don't you incorporate a delta-rle-model into PAQ? ;^)
> This file has low Kolmogorov complexity, so ought to be highly
> compressible. Here are some results:
>
> Original size: 338887
> pkzip: 103598
> bzip2: 67670
> GRZipII: 64425
> slim v21: 35880
> paqar 4.1 -6: 2111
> ppmonstr J: 992
> paq8a -4: 721
> paq8f -6: 671
>
> paqar uses the paq6 style based model mixing, where each model outputs
> a prediction based on the history of 0 and 1 counts. Starting in paq7,
> the context model also remembers what the last bit was, and also if
> there are 4 or fewer observations, the exact order of these bits. Then
> that state is mapped to a prediction using an adaptive table. This is
> a simple type of indirect context model applied to all the regular
> contexts at the bit level. This might account for some the improvement
> of paq8a over paqar, although I think it is more from adding a new
> model based on column position in text files. The indirect context
> model I described was added to paq8f, so you might say there are
> actually 2 levels of this model.
>
> -- Matt Mahoney
Ciao
Niels
I suppose that's another way to look at it.
> We're always screaming after pigeon-hole and
> what-makes-it-smaller-makes-
> it-bigger-too, so:
> - for _which_ data is your (1,2) worst? (which data obeys
> pattern-rules,
> or can be constructed by pattern-based generators? is there an (for
> ex.)
> english text without this properties?)
> - _why_ is it good for the files you tested? (is it only compensating
> slow adaption in other models? at which point reach other models the
> same or better predictions?)
Typically paq has many models running, most of which are not very
effective. The mixer turns these off, although not completely without
cost. First it takes time to learn that a model is ineffective, and
meanwhile compression is worse. You will see this effect on random
data (where all models are ineffective). Second, the extra models take
away memory from the more effective ones. Mixing models is a hard
problem.
An indirect context model does improve compression on text as well as
many binary formats. In text I believe this is due to the cache
effect. A word seen recently is likely to be seen again in the near
future.
-- Matt Mahoney