Hutter Prize data analysis

Matt Mahoney

unread,

Aug 12, 2006, 12:04:32 AM8/12/06

to

I have been studying the data used in the Hutter prize (large text
compression benchmark).
http://cs.fit.edu/~mmahoney/compression/textdata.html
I added two new sections.

1. String matching. I posted a program, FV, that plots x=file
position, y=distance back to last match (log scale), color=length of
match. It can yield insights into the best way to compress many types
of data. I posted plots of enwik8 and enwik9. They have nearest
matching strings that can span the entire length of the file. This
means that you will need lots of memory to compress them well. enwik8
(Hutter prize, 100 MB) is pretty uniform. enwik9 (1 GB) has about 150
MB in the middle that is highly compressible, and this shows up in the
string matching statistics as a repeating structure of length about 3
KB (the length of each article, apparently generated from a census
table).

2. Lexical analysis. I compared the effects of various parsing methods
like n-grams, words, space modeling, capital encoding and stemming on
order 0 models. The most effective techniques are space modeling and
spelling words that occur only once with n-grams. With space modeling,
spaces are omitted before dictionary symbols representing words
depending on the context, and exceptions are coded with a special
symbol.

Stemming and capital encoding don't help on order 0 models but might be
important for higher orders, and especially for compressors that model
syntax or semantics (which I am working on). The goal is to have a
symbol stream where each symbol has an independent meaning, so
"Rotated" becomes (capitalize) + (rotate) + (-ed).

Stemming and capital encoding can be done very accurately using 2
passes, building a dictionary on the first pass. For example, the
stemmer knows that "census" is not a plural form because "censu"
doesn't occur in the dictionary. This allows for very simple rules.
Likewise it knows which words are normally capitalized (proper nouns)
and stores them that way because the lower case equivalents are rare.
This reduces the number of symbols needed to code exceptions compared
to encoding all capital letters as a special symbol followed by the
lower case equivalent (known to help some compressors).

-- Matt Mahoney

budge...@mystarship.com

unread,

Aug 14, 2006, 7:21:15 AM8/14/06

to

Interesting!

I will try my hand at a compressor, i doubt that i will be able to gain
better compression than PAQ however it is a interesting challenge.

Now if i can find some more free time!

Michael Goldshteyn

unread,

Aug 14, 2006, 10:00:56 AM8/14/06

to

"Matt Mahoney" <matma...@yahoo.com> wrote in message
news:1155355472.0...@i3g2000cwc.googlegroups.com...
> ...

> Stemming and capital encoding can be done very accurately using 2
> passes, building a dictionary on the first pass. For example, the
> stemmer knows that "census" is not a plural form because "censu"
> doesn't occur in the dictionary. This allows for very simple rules.
> Likewise it knows which words are normally capitalized (proper nouns)
> and stores them that way because the lower case equivalents are rare.
> This reduces the number of symbols needed to code exceptions compared
> to encoding all capital letters as a special symbol followed by the
> lower case equivalent (known to help some compressors).
>
> -- Matt Mahoney
>

Matt, would it be OK to try to compress the 100MB text by wrapping your
compressor with pre/post processing? In other words, would that be an
acceptable entry, both in terms of the contest's rules and from the
standpoint of reusing what you have already written with regard to paq8?

Mike

Matt Mahoney

unread,

Aug 14, 2006, 3:41:31 PM8/14/06

to

PAQ is open source, so anyone can modify it and claim the prize.

-- Matt Mahoney

Michael Goldshteyn

unread,

Aug 14, 2006, 5:14:55 PM8/14/06

to

"Matt Mahoney" <matma...@yahoo.com> wrote in message

news:1155584491....@h48g2000cwc.googlegroups.com...

> PAQ is open source, so anyone can modify it and claim the prize.
>
> -- Matt Mahoney
>

Also, I've played with your FV program which shows a nice 2-dimensional
graph of the 1, 2, 4, and 8 byte string length/match distributions. I've
modified it to allow for the creation of larger BMP files than the hardcoded
width/height of 512w x 256h and have gotten it go as high as 4096x4096, with
some minor code tweaks, which made the graph for enwik8 look even more
interesting, if not more useful. I also played with the darkness/lightness
of the plot data and substituted in the MT (Mersenne Twister) instead of the
calls to rand(), wherever those were used. All in all an interesting
adventure. If you want the modified source, I'd be happy to post it.

Thanks,

Mike

Message has been deleted

Matt Mahoney

unread,

Aug 15, 2006, 2:40:54 PM8/15/06

to

Post it and I'll add a link to it.

-- Matt Mahoney