Exactly. LSA and LRA implement the transitive property of semantics. They allow you to infer relations like rain-wet from observed pairs like rain-water and water-wet. This would allow a compressor to predict "wet" in context "rain" even though the pair was not previously observed.
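Here is a toy sketch in Python (my own example; the tiny corpus and the rank-2 truncation are made up for illustration) of how a truncated SVD fills in the unobserved rain-wet association from the observed rain-water and water-wet pairs:

import numpy as np

terms = ["rain", "water", "wet", "sun", "dry"]
docs = ["rain water", "water wet", "sun dry"]    # tiny corpus, one document per string

# Term-document count matrix
X = np.zeros((len(terms), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        X[terms.index(word), j] += 1

# Observed association: rain and wet never appear in the same document
assoc = X @ X.T
print("observed rain-wet:", assoc[terms.index("rain"), terms.index("wet")])   # 0.0

# Keep the top k=2 latent dimensions and reconstruct
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Reconstructed association: rain-wet is now positive because both words
# load on the same latent dimension as "water"
assoc_k = Xk @ Xk.T
print("inferred rain-wet:", round(assoc_k[terms.index("rain"), terms.index("wet")], 3))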
In paq8hp12 (and probably durilca and decmprs8), words like "rain", "wet", and "water" would be grouped during dictionary preprocessing so that when they are assigned a 2-byte dictionary symbol, one byte is the same. This improves compression because it allows models that ignore the other context byte.
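A minimal sketch of the grouping idea (this is not the actual paq8hp12 dictionary format; the groups here are hand-picked rather than learned):

groups = [
    ["rain", "wet", "water", "snow"],    # weather/moisture group
    ["run", "walk", "jump", "crawl"],    # motion group
]

code = {}
for g, words in enumerate(groups):
    for i, w in enumerate(words):
        code[w] = bytes([0x80 + g, 0x80 + i])    # (group byte, member byte)

# "rain" and "water" share their first byte, so a model that conditions only
# on that byte treats them as the same context
print(code["rain"], code["water"])
print(code["rain"][0] == code["water"][0])       # True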
LSA could be used to automatically construct such dictionaries. Each word is assigned a context vector. Then we find a short path through this space and output the words in that order. For an ordinary word-word association matrix, the context vector would have one component for each other word in the vocabulary. These vectors could be reduced to a few hundred dimensions by LSA.
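One simple way to find such a path, assuming the reduced context vectors from LSA are already available, is a greedy nearest-neighbor walk (other orderings would work too; this is just an illustration):

import numpy as np

def linear_order(words, vectors):
    # Greedy walk: always step to the nearest unvisited word
    remaining = set(range(len(words)))
    order = [0]                       # start arbitrarily at the first word
    remaining.remove(0)
    while remaining:
        last = vectors[order[-1]]
        nxt = min(remaining, key=lambda i: np.linalg.norm(vectors[i] - last))
        order.append(nxt)
        remaining.remove(nxt)
    return [words[i] for i in order]

# vectors would come from the truncated SVD, e.g. U[:, :k] * s[:k]
words = ["rain", "water", "wet", "sun", "dry"]
vecs = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(linear_order(words, vecs))      # related words come out adjacent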
In Gorrell's online LSA implementation, a 3-layer neural network is trained by backpropagation to predict nearby words in a text corpus. There is one input and one output neuron for each word in the vocabulary. Neurons are gradually added to the hidden layer, up to a few hundred. The context vectors are then given by the weights between the input and hidden layers.
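A simplified sketch of that kind of network (not Gorrell's actual algorithm; the corpus, learning rate, and growth schedule are arbitrary, and a real vocabulary would be much larger):

import numpy as np

rng = np.random.default_rng(0)
corpus = "rain water wet water rain wet sun dry sun dry".split()
vocab = sorted(set(corpus))
V = len(vocab)
idx = {w: i for i, w in enumerate(vocab)}

max_hidden, lr = 8, 0.1
W1 = 0.01 * rng.standard_normal((V, max_hidden))    # input -> hidden weights
W2 = 0.01 * rng.standard_normal((max_hidden, V))    # hidden -> output weights
hidden = 1                                           # start with one hidden neuron

for epoch in range(200):
    if epoch % 50 == 49 and hidden < max_hidden:
        hidden += 1                                  # gradually grow the hidden layer
    for t in range(len(corpus) - 1):
        x, y = idx[corpus[t]], idx[corpus[t + 1]]    # predict the next word
        h = np.tanh(W1[x, :hidden])                  # hidden activations (one-hot input)
        z = W2[:hidden].T @ h
        p = np.exp(z - z.max()); p /= p.sum()        # softmax over the vocabulary
        grad_z = p.copy(); grad_z[y] -= 1.0          # cross-entropy gradient
        grad_h = W2[:hidden] @ grad_z * (1 - h * h)
        W2[:hidden] -= lr * np.outer(h, grad_z)
        W1[x, :hidden] -= lr * grad_h

# The context vector for each word is its row of input-to-hidden weights
for w in vocab:
    print(w, np.round(W1[idx[w], :hidden], 2))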
I think you might get better compression by predicting words directly rather than going through an offline dictionary. Constructing a linear dictionary loses information about the original distances between words. It would probably be slower, though, because training the network requires a lot of computation that would otherwise be done outside the decompressor's model.
-- Matt Mahoney,
matma...@yahoo.com