paq8hp10


Matt Mahoney

Apr 1, 2007, 10:41:36 PM
to Hutter Prize
I tested paq8hp10 and paq8hp10any by Alexander Ratushnyak
http://cs.fit.edu/~mmahoney/compression/text.html#1334
Both programs (Win32 exe) can be found at http://start.binet.com.ua/~artest/HKCC/

paq8hp10 is a Hutter prize entry. It is a single executable file
which extracts temporary files to the current directory and, I believe,
also to the root directory. That is probably why Jim Bowery reported
that it failed in Windows Vista (tmpfile: permission denied). However,
I verified it in Windows XP Home with a nonprivileged login. The test
machine was a 2.2 GHz Athlon-64 with 2 GB memory. Memory usage is 940
MB (verified with Task Manager). It took about 2 hours to compress
and 2 hours to decompress enwik8. To run:

paq8hp10 -7 enwik8.paq8hp10-7 enwik8

Compressed size: 16,490,947
paq8hp10.exe: 103,224
Total: S = 16,594,171

This improves on paq8hp9: 16,516,789 + 112,628 = 16,629,417.
It also improves on earlier versions on the Hutter prize page (which
has not been updated since paq8hp8) at http://prize.hutter1.net/

The Hutter prize was last awarded for paq8hp5, with L = 17,073,018,
so 1 - S/L = 2.80470% (1402 euros if my math is right).
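
The arithmetic above can be checked with a short script. This is only a sketch: the byte counts come from this post, but the 50,000 euro prize fund is an assumption (it is not stated here, only implied by the 1402 euro figure), as is the rule that the payout is the fund multiplied by the relative improvement 1 - S/L.

```python
# Byte counts taken from the figures quoted in this post.
compressed = 16_490_947        # enwik8 compressed by paq8hp10 -7
decompressor = 103_224         # paq8hp10.exe
previous_record = 17_073_018   # L: paq8hp5, the last awarded entry

# Assumption: total prize fund in euros (not stated in the post).
PRIZE_FUND_EUR = 50_000


def improvement(total: int, baseline: int) -> float:
    """Relative improvement 1 - S/L of a new total S over the baseline L."""
    return 1 - total / baseline


total = compressed + decompressor          # S
gain = improvement(total, previous_record)
award = PRIZE_FUND_EUR * gain              # assumed proportional payout

print(f"S = {total:,}")
print(f"1 - S/L = {gain:.5%}")
print(f"award = about {award:.0f} euros")
```

Swapping in the paq8hp9 figures (16,516,789 and 112,628) gives the corresponding numbers for the previous version.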

Only the -7 option works with paq8hp10. However, paq8hp10any works
with any memory option up to -8. This is not a Hutter prize entry.
The program has 2 external dictionaries which must both be in the
current directory when the program is run. The -7 option produces
archives compatible with paq8hp10. For the large text benchmark:

Compressed size with -8: 16,335,197
paq8hp10any.zip: 333,925
Total: 16,669,112

This is larger because zip does not compress the external dictionaries
as well as the internal compression in paq8hp10. Also, there is some
wasted space because the two dictionaries contain the same words (but
in a different order).
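
The trade-off can be made concrete with the byte counts quoted in this post. The sketch below only does the bookkeeping; the variable names are illustrative. Note that the two paq8hp10any figures listed above add up to 16,669,122, ten bytes more than the total quoted, so one of those three numbers presumably contains a typo.

```python
# Byte counts quoted in this post.
paq8hp10_exe = 103_224       # self-extracting, dictionaries compressed internally
paq8hp10any_zip = 333_925    # program plus two external dictionaries, zipped
enwik8_opt7 = 16_490_947     # paq8hp10 -7 output for enwik8
enwik8_opt8 = 16_335_197     # paq8hp10any -8 output for enwik8

# Extra bytes spent on packaging when the dictionaries are zipped externally.
packaging_overhead = paq8hp10any_zip - paq8hp10_exe

# Bytes saved on the compressed data by using -8 instead of -7.
data_saving = enwik8_opt7 - enwik8_opt8

total_hp10 = enwik8_opt7 + paq8hp10_exe
total_any = enwik8_opt8 + paq8hp10any_zip

print(f"extra packaging bytes: {packaging_overhead:,}")
print(f"bytes saved by -8:     {data_saving:,}")
print(f"net difference:        {total_any - total_hp10:,}")
```

The packaging overhead outweighs the gain from the larger memory option, which is why the paq8hp10any total comes out larger.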

I attempted to test paq8hp10any -8 enwik9, which should have taken about
20 hours to compress, but the disk started thrashing. I killed it after
13 hours when it was about 30% complete. At the time, it was
compressing at a rate that would have taken 60 hours to finish. So
for now, paq8hp8 still has the record on enwik9. I believe the reason
it thrashes on enwik9 but not enwik8 (in spite of allocating 1849 MB
in both cases) is the rotating input buffer (buf) filling up. The
portion which is actually accessed (rather than just allocated) grows
as the input grows.
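
The effect described here can be illustrated with a toy model. This is not paq8hp10 code: the buffer size, page size, and input lengths below are scaled-down, hypothetical values, and the function simply counts how many distinct pages of a rotating buffer get written. The only facts assumed about the real files are that enwik8 is 10^8 bytes and enwik9 is 10^9 bytes, so the larger input reaches much further into (or wraps around) the same allocation.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def touched_pages(input_len: int, buf_size: int, page_size: int = PAGE_SIZE) -> int:
    """Count distinct pages written when input_len bytes pass through a
    rotating buffer of buf_size bytes (byte i is stored at i % buf_size)."""
    buf = bytearray(buf_size)       # allocated up front, like the real buffer
    pages = set()
    for i in range(input_len):
        pos = i % buf_size          # rotate back to the start when full
        buf[pos] = 0xFF             # stand-in for storing an input byte
        pages.add(pos // page_size)
    return len(pages)


BUF_SIZE = 64 * 1024                # hypothetical buffer: 16 pages

small = touched_pages(10_000, BUF_SIZE)    # short input: touches a few pages
large = touched_pages(100_000, BUF_SIZE)   # 10x longer input: wraps around

print(f"short input touches {small} of {BUF_SIZE // PAGE_SIZE} pages")
print(f"long input touches  {large} of {BUF_SIZE // PAGE_SIZE} pages")
```

Both runs allocate the same buffer, but only the longer input writes to every page of it, which is the kind of growth in resident memory that could push a machine into swapping.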

Source code should be released 30 days after the exe, after the public
comment period ends on Apr. 25, 2007. The Hutter prize does not
require source code to be released, although the GPL does
(the code is derived from paq8h). The source will actually be for
paq8hp10any.

-- Matt Mahoney

Matt Mahoney

Apr 2, 2007, 11:00:46 AM
to Hutter Prize
Sportman has benchmarked paq8hp10any -8 enwik9: 15.5 hours on a 2 GHz
2x dual-core Intel Woodcrest with 4 GB memory. It is now on top of the
large text benchmark at 133,313,456 bytes (pending verification of
decompression). There is probably room for a further improvement of
about 220 KB using self-extracting dictionaries as in paq8hp10.

http://cs.fit.edu/~mmahoney/compression/text.html

-- Matt Mahoney

