Introducing the Hutter Prize for Lossless Compression of Human Knowledge

61 views
Skip to first unread message

James A. Bowery

unread,
Aug 6, 2006, 3:39:16 PM8/6/06
to
Introducing the Hutter Prize for Lossless Compression of Human
Knowledge

Artificial intelligence researchers finally have an objective and
rigorously validated measure of the intelligence of their machines.
Furthermore the higher the measured intelligence of their machines the
more money they can win via the Hutter Prize.

The purse for the Hutter Prize was initially underwritten with a 50,000
Euro commitment to the prize fund by Marcus Hutter of the Swiss Dalle
Molle Institute for Artificial Intelligence, affiliated with the
University of Lugano and The University of Applied Sciences of Southern
Switzerland.

The theoretic basis of the Hutter Prize is related to an insight by the
12th century philosopher, William of Ockham, called "Ockham's Razor",
sometimes quoted as: "It is vain to do with more what can be done with
less." But it was not till the year 2000 that this was mathematically
proven*, by Marcus Hutter, to be a founding principle of intelligence.
Indeed, Hutter's Razor** might be phrased, "It is truer to explain with
less that which can be explained with more."

There have been previous tests and related prizes for artificial
intelligence, such as the Turing Test and the Loebner Prize. However,
these tests suffered from subjective definitions of intelligence.
Hutter's recent theoretic breakthrough creates a mathematics of
artificial intelligence which accurately measures the degree of
intelligence possessed by an artificial agent. It does so by measuring
how succinctly it represents knowledge of the world. As Hutter has now
proven, the most succinct computer model of the world isn't just the
most aesthetic or memory-efficient out of all models of the known
observations -- it also most accurately predicts new observations. In
short, it is the most intelligent.

Artificial intelligence has thereby entered the realm of engineering:
Lossless compression of human knowledge.

This is momentous because by optimizing for rigorous metrics, the field
of artificial intelligence may finally clarify the murky waters of
inadequate definition, within which it has been haphazardly swimming
for the last 50 years, to become both a hard science and tractable
engineering discipline.

Named for the discoverer of the proof and the initial 50,000 Euro
donor, the Hutter Prize currently targets the compression of a 100
megabyte sample of human knowledge drawn from the broadly based
Wikipedia online encyclopedia. As Moore's Law increases the capacity
of machines, and as additional donations to the prize fund increase the
incentives of contestants, the intent is to increase the amount of
knowledge targeted for compression. It is reasonable to expect that
the 100 megabyte sample will produce, at the very least, advances in
linguistic modeling. As the targeted depth and breadth of knowledge
increases, conceptual frameworks will come into play, eventually
covering the range of disciplines from political science to physics by
applying theories that prove optimal in compressing the target.

A common objection to this approach to artificial intelligence is that
it offers little that is new -- that the computational difficulty of
searching for patterns in data remains what it has always been. This
objection misses two important points:

1) Hutter's proof provides a new mathematics of intelligence allowing
for "top down" theoretic advances which may render many problems
tractable that otherwise appear intractable.

2) There is a large overlap between succinctly codified knowledge and
an intelligent compression program. Indeed, a reasonable definition of
"knowledge" is that it optimizes the compression of new observations as
instances of old patterns. This means that even if a compressor does
nothing but apply codified human knowledge, generating no new knowledge
of its own, it can still demonstrate greater intelligence than
competing programs and thereby make measurable progress toward
artificial intelligence but also -- and this is key -- progressively
more intelligent bodies of human-generated knowledge by pitting those
bodies of knowledge against each other in what might be called an
epistemological tournament.

The formula for winnings is modeled after the M-Prize or Methuselah
Mouse Prize, which awards money to longevity researchers for progress
in keeping mice alive the longest. Here, modified for compression
ratios, is the formula:

S = size of program outputting the uncompressed knowledge
Snew = new record
Sprev = previous record
P = [Sprev - Snew] / Sprev = percent improvement

Award monies:

Fund contains: Z at noon GMT on day of new record
Winner receives: Z * P

Initially Z is 50,000 Euro with a minimum payout of 500 Euro (or
minimum improvement of 1% over the prior winner).

Donations are welcome. The history of improvement in The Calgary
Corpus Compression Challenge*** is about 3% per year. The larger the
commitment from donors to the fund, the greater the rate of progress
toward a high quality body of human knowledge and, quite possibly, the
long-held promise of artificial intelligence.

For further details of the Hutter Prize see:

http://prize.hutter1.net

For discussion of the Hutter Prize see:

http://groups.google.com/group/Hutter-Prize

-- Jim Bowery

* http://www.hutter1.net/ai/uaibook.htm

** Hutter's Razor has some caveats relating to the nature of the
universe and computability, but those conditions must be met for any
computer-based intelligence.

*** http://mailcom.com/challenge

Ted Dunning

unread,
Aug 7, 2006, 8:22:32 PM8/7/06
to

Wasn't it Fisher in the 20's who proved just about the same thing about
maximum likelihood estimates?

Certainly, there is nothing new in the thesis that if you have an
estimate q of a probability p, then the using q as a compression model
gives minimum compressed size for p = q. Moreover, this minimum is
unique. This is the same as saying that the maximum likelihood value
of q is p (not quite the same as the maximum likelihood *estimator*, of
course).

For exchangeable events x_i, this is easy to prove using the convexity
of the log function,

log(x) <= x-1

By Finetti's theorem, the x_i are conditionally independent so

p(X) = prod_i p(x_i | theta)

E[ log q(x) ] = sum_x p(x) log q(x)

But, comparing with the entropy of the distribution p,

sum_x p(x) log q(x) + H(p) = sum_x p(x) log q(x) / p(x)

and by the inequality above,

sum_x p(x) log (q(x) / p(x)) <= sum_x p(x) [q(x) / p(x) - 1]
<= sum_x q(x) - sum_x p(x) = 0

Thus

- sum_x p(x) log q(x) >= H(p)

and

- sum_x p(x) log q(x) = H(p) <==> p = q

But this is just the expected length of an optimal compressor
representing X using q as a model. Thus, minimum expected compressed
size implies p = q.

Also, the expected compressed size can be estimated by compressing a
large number of x_i, thus the minimum compressed size estimator is
asymptotically equal to p.

All of the excitement is, of course, before the asymptote kicks in.
How can we estimate p with fewer x_i.

Why this is news is beyond me.


James A. Bowery wrote:
> ... But it was not till the year 2000 that this was mathematically


> proven*, by Marcus Hutter, to be a founding principle of intelligence.
> Indeed, Hutter's Razor** might be phrased, "It is truer to explain with
> less that which can be explained with more."
>

> ... As Hutter has now

Steve Richfie1d

unread,
Aug 8, 2006, 4:12:35 PM8/8/06
to
Ted (and James),

We often argue about lots of things, but I am on your side here.

The insistence on limited program size pretty much excuses it from the
realm of the scientific, as it diverts people's efforts from the claimed
goals of the prize and onto clever and concise programming methods, and
in the process obscures what might otherwise be simple but powerful
compression methods. It also indirectly (and needlessly) presumes an
18MB simplicity in human experience, thought, and expression that
probably just isn't there.

Research programs like this should be coded in some user friendly and
extensively checked medium like Microsoft Access, which probably
requires 18MB just for "Hello World". Even the .NET run-time package
would exceed this. HOWEVER, it is unclear whether the run-time package
would count in the total, but the contest seems to presume an EXE
instead of (say) an MDB like Access would use. Probably the best
approach to squeeze things into the smallest EXE would be to "go retro"
with, say, VB5, but to what end - to try to fit some screwball rule?

Further, the program would be disqualified if it were to correct
grammatical errors and/or summarize cumbersome statements using the very
computer understanding that the prise is supposed to be promoting.

Indeed, abandoning lossless compression would pretty much yield what the
semantic web searching people are looking for once it gets good enough.

Further, this sort of a competition blows right past the ITAR problems
that I have posted in the past - a VERY good thing.

Yes, the world would GREATLY benefit from this sort of research, but the
"minor" flaws in this particular competition appear (to me) to be rather
fatal to its stated goals.

Sad. I was really excited about competing until I saw the flaws.

Steve Richfie1d
===============

Matt Mahoney

unread,
Aug 8, 2006, 5:00:20 PM8/8/06
to
Ted Dunning wrote:
> Wasn't it Fisher in the 20's who proved just about the same thing about
> maximum likelihood estimates?
>
> Certainly, there is nothing new in the thesis that if you have an
> estimate q of a probability p, then the using q as a compression model
> gives minimum compressed size for p = q. Moreover, this minimum is
> unique. This is the same as saying that the maximum likelihood value
> of q is p (not quite the same as the maximum likelihood *estimator*, of
> course).

I think it was Shannon who proved this in 1948 (discete noiseless
channel capacity theorem).
http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
I could not find any reference to Fisher in his paper.

> Why this is news is beyond me.

That isn't what Hutter proved. What he proved is that the optimal
behavior of a goal seeking agent in an unknown but computable
environment is to guess that the environment is controlled by the
shortest program (Turing machine) consistent with all behavior observed
so far. Thus, a solution to the problem of maximizing the accumulated
reward signal from an unknown environment reduces to algorithmic
compression, which is an undecidable problem. However, Hutter also
proved that in the restricted case of an envoronment bounded by space s
and time t, that the problem is computable in time O(t2^s) (so the
problem is still hard).

When I developed the large text benchmark (without funding) my goal was
to promote research in NLP. There are 3 arguments why text compression
implies AI.

1. Hutter's proof in the specific case of text prediction (a problem
that is easy for people but hard for machines).
2. An explicit representation of the probability distribution of
natural language strings (a correct language model) enables one to pass
the Turing test for AI.
3. Compression ratio (perplexity) is already used in practice to
measure the quality of language models (mainly for speech recognition).

However, AI does *not* imply compression. Decompression requires a
deterministic computation to invert the compression function, and the
human brain is not deterministic. I believe this is the reason that
many people doubt the converse. I discuss these issues in
http://cs.fit.edu/~mmahoney/compression/rationale.html

-- Matt Mahoney

Matt Mahoney

unread,
Aug 8, 2006, 5:45:30 PM8/8/06
to
Steve Richfie1d wrote:
> Ted (and James),
>
> We often argue about lots of things, but I am on your side here.
>
> The insistence on limited program size pretty much excuses it from the
> realm of the scientific, as it diverts people's efforts from the claimed
> goals of the prize and onto clever and concise programming methods, and
> in the process obscures what might otherwise be simple but powerful
> compression methods. It also indirectly (and needlessly) presumes an
> 18MB simplicity in human experience, thought, and expression that
> probably just isn't there.

Marcus Hutter and I disagreed on what the size of the test data should
be. I argued that it should be 1 GB because that is what Turing
estimated in 1950 it would take to pass the Turing test (a machine with
10^9 bits of memory, which is 1 GB based on Shannon's 1950 estimate of
1 bpc entropy of written English), Landauer's estimate of 10^9 bits of
human long term memory, and because it is about the amount of language
that the average adult has processed since birth. Hutter argued for
100 MB because it is easier to work with and makes the contest more
accessible to the public. He has a good point because many of the top
compressors are there only because they use all of the 800 MB available
on my test machine.

In the original test design I assume that all knowledge needed to pass
any text-based AI test (such as text prediction or the Turing test) is
learnable solely from text. I realize that the vast knowledge that
humans use in these tasks is not all learned verbally, but I think that
most of it *could* be. For example, a blind person can tell you that
the sky is blue. Also, if knowledge cannot easily be learned verbally,
then it is difficult to prove to somebody that you know it in a verbal
test. For example, how would you prove in the Turing test that you
know how to ride a bicycle, that you know what a banana smells like,
that you could recognize somebody's face? Can you give an example of
*any* human knowledge that cannot be learned verbally but can be
demonstrated verbally?

Thus, to successfully pass the compression test, it will be necessary
for a program to learn a language model from the test data itself as it
compresses/decompresses the data.

> Research programs like this should be coded in some user friendly and
> extensively checked medium like Microsoft Access, which probably
> requires 18MB just for "Hello World". Even the .NET run-time package
> would exceed this. HOWEVER, it is unclear whether the run-time package
> would count in the total, but the contest seems to presume an EXE
> instead of (say) an MDB like Access would use. Probably the best
> approach to squeeze things into the smallest EXE would be to "go retro"
> with, say, VB5, but to what end - to try to fit some screwball rule?

Most compressors are written in C, C++ and assembler for speed. File
I/O does not require any fancy programming environment. The executable
files are typically 100 KB or less. It is necessary to include the
size of the decompressor, otherwise I could trivially write a 100 MB
"decompressor" that simply outputs the data, thus achieving a
compressed size of 0.

> Further, the program would be disqualified if it were to correct
> grammatical errors and/or summarize cumbersome statements using the very
> computer understanding that the prise is supposed to be promoting.

The test data is very clean. Misspelled words and grammatical errors
are rare. This is one reason I chose this data set. See
http://cs.fit.edu/~mmahoney/compression/textdata.html for an analysis
of the data.

>
> Indeed, abandoning lossless compression would pretty much yield what the
> semantic web searching people are looking for once it gets good enough.
>
> Further, this sort of a competition blows right past the ITAR problems
> that I have posted in the past - a VERY good thing.
>
> Yes, the world would GREATLY benefit from this sort of research, but the
> "minor" flaws in this particular competition appear (to me) to be rather
> fatal to its stated goals.
>
> Sad. I was really excited about competing until I saw the flaws.
>
> Steve Richfie1d

-- Matt Mahoney

David Kinny

unread,
Aug 8, 2006, 7:54:11 PM8/8/06
to
"Matt Mahoney" <matma...@yahoo.com> writes:

> ...


> When I developed the large text benchmark (without funding) my goal was
> to promote research in NLP. There are 3 arguments why text compression
> implies AI.

> 1. Hutter's proof in the specific case of text prediction (a problem
> that is easy for people but hard for machines).
> 2. An explicit representation of the probability distribution of
> natural language strings (a correct language model) enables one to pass
> the Turing test for AI.

Not if I'm the one asking the questions. This idea is just
the ludicrous "humungous lookup table" in disguise.

David

David Kinny

unread,
Aug 8, 2006, 8:50:01 PM8/8/06
to
"Matt Mahoney" <matma...@yahoo.com> writes:

> Steve Richfie1d wrote:
> > Ted (and James),
> >
> > We often argue about lots of things, but I am on your side here.
> >
> > The insistence on limited program size pretty much excuses it from the
> > realm of the scientific, as it diverts people's efforts from the claimed
> > goals of the prize and onto clever and concise programming methods, and
> > in the process obscures what might otherwise be simple but powerful
> > compression methods. It also indirectly (and needlessly) presumes an
> > 18MB simplicity in human experience, thought, and expression that
> > probably just isn't there.

> [...]


> Thus, to successfully pass the compression test, it will be necessary
> for a program to learn a language model from the test data itself as it
> compresses/decompresses the data.

This "intelligence test" is so silly, it's hard to know where to start.

- There is no notion of "successfully passing the test", unless you
consider every lossless compressor/decompressor to pass the test, and
that includes ones that are clearly completely unintelligent.

- You are seeking minimal size of the decompressor that reproduces a
particular 100MB text file. A compressor/decompressor scheme that is
specialized for the structure and statistics of that particular file
can obviously do better than one that is more general in ability, but
this is equally obviously less intelligent than the more general one.

- It seems there is no limit on the CPU time required to generate
the decompressor, which again penalizes intelligence and rewards
brute force approaches.

- Why do you imagine that any "understanding" of the text is needed
to achieve maximal compression. A purely syntactic/statistical
approach that ignores any meaning of the text is going to be more
intelligent, by your criterion, since addressing semantics just adds
baggage and size to the decompressor. This might change if you
accepted a "fair paraphrase" rather than an exact reproduction,
but you don't.

In summary, what is proposed is no more or less than a test of the
intelligence and skill of lossless compression programmers, not of
AI programs.

David

Matt Mahoney

unread,
Aug 8, 2006, 11:40:49 PM8/8/06
to

This is the point of measuring compressed size. One way to pass the
Turing test would be to look up the interrogator's exact question in
Google and reply with the text that follows from highest ranked
response. The problem is that your lookup table is several terabytes.
The challenge for you is to replace this huge table with something
smaller. That means instead of estimating the probability of a
sentence by counting Google hits, that you analyze the syntax and
semantics of the sentence to determine if it is correct, because the
number of exact matches is most likely to be zero. You have a choice
of either finding a succinct way of describing the rules that govern
whether a sentence is likely or unlikely, or you can find an efficient
way to learn these rules from just enough unlabled text.

Using compression ratio to measure the quality of a language model is
nothing new. Speech recognition researchers have been doing it for
years.

-- Matt Mahoney

Matt Mahoney

unread,
Aug 9, 2006, 12:30:53 AM8/9/06
to
David Kinny wrote:
> "Matt Mahoney" <matma...@yahoo.com> writes:
>
> > Steve Richfie1d wrote:
> > > Ted (and James),
> > >
> > > We often argue about lots of things, but I am on your side here.
> > >
> > > The insistence on limited program size pretty much excuses it from the
> > > realm of the scientific, as it diverts people's efforts from the claimed
> > > goals of the prize and onto clever and concise programming methods, and
> > > in the process obscures what might otherwise be simple but powerful
> > > compression methods. It also indirectly (and needlessly) presumes an
> > > 18MB simplicity in human experience, thought, and expression that
> > > probably just isn't there.
>
> > [...]
> > Thus, to successfully pass the compression test, it will be necessary
> > for a program to learn a language model from the test data itself as it
> > compresses/decompresses the data.
>
> This "intelligence test" is so silly, it's hard to know where to start.
>
> - There is no notion of "successfully passing the test", unless you
> consider every lossless compressor/decompressor to pass the test, and
> that includes ones that are clearly completely unintelligent.

We did not set a criteria for "passing" the test because we do not know
exactly what level of compression is needed. Shannon estimated in 1950
that the written English (using a 27 character alphabet of A-Z and
space) has an entropy of between 0.6 and 1.3 bits per character. This
is better than the best compressors can do, but not by much.

Humans are better than machines at text prediction, e.g. guessing the
next character or word in "roses are ___" or "3+8=__". Clearly this
task requires lexical, syntactic, semantic and vast, real world
knowledge. For a machine to pass the AI test, it must do as well as
humans by some measure, such as the number of guesses on average until
the correct answer is guessed. The measure we use is to assign
probabilities to all possible choices and a score of -log(p) for the
probability p assigned by the guesser to the correct answer. Shannon
proved that this score is minimized when the guesser assigns the exact
same probabilities as the true (but unknown) probability distribution.

A problem with this approach is that it is conceptually difficult for
people to assign numeric probabilities. This is why Shannon used a
guessing game instead, but this resulted in the large error bounds I
described, because different distributions could lead to the same
series of guesses. Therefore, a task that remains is to more
accurately estimate the entropy of written English with respect to
humans. In particular, we need to estimate the entropy for the exact
text used in this particular data set. Until that is done, we can only
reward improvements in the state of the art.

> - You are seeking minimal size of the decompressor that reproduces a
> particular 100MB text file. A compressor/decompressor scheme that is
> specialized for the structure and statistics of that particular file
> can obviously do better than one that is more general in ability, but
> this is equally obviously less intelligent than the more general one.

I expect that a winning decompressor will be tuned specifically to this
data set. That is why the size of the decompressor is included. The
statistics can either be learned from the data or they can be hard
coded into the decompressor by the programmer. It does not matter to
us which way it is done. Any tuning done in the decompressor will
decrease the compressed file size but increase the size of the
decompressor. It is within the rules to write a "decompressor" that
simply outputs the test file verbatim, but good luck with the size of
the program.

> - It seems there is no limit on the CPU time required to generate
> the decompressor, which again penalizes intelligence and rewards
> brute force approaches.

I expect that NLP will be CPU and/or memory intensive, because other
approaches have not worked. We do not want to restrict the range of
possible solutions any more than necessary. Thus, we only require
"reasonable" computation limits by specifying that the solution must be
testable by the public and have a 30 day period of public comment. If
nobody can run the program in that time, then it loses. This rule
allows for better models as computers become more powerful.

> - Why do you imagine that any "understanding" of the text is needed
> to achieve maximal compression. A purely syntactic/statistical
> approach that ignores any meaning of the text is going to be more
> intelligent, by your criterion, since addressing semantics just adds
> baggage and size to the decompressor. This might change if you
> accepted a "fair paraphrase" rather than an exact reproduction,
> but you don't.

A "fair paraphrase" implies lossy compression, which cannot be measured
objectively. Who is to say if a paraphrase is fair or not? The extra
information needed to encode the exact text is small, and is the same
for all contestants.

And what is "understanding"? When the semantics of words are not
grounded to sensory or motor I/O, semantics is just a fuzzy identity
relation: two words are related if they are likely to appear near each
other in running text. Models like LSA exploit the transitive property
of semantics. If "moon" appears near "star" and "star" appears near
"night", then the model predicts that "night" will appear near "moon".
Semantics is just a statistical property.

> In summary, what is proposed is no more or less than a test of the
> intelligence and skill of lossless compression programmers, not of
> AI programs.

Yes, that is what the test is designed to do, unless you think that
machines can think.

> David

-- Matt Mahoney

David Kinny

unread,
Aug 9, 2006, 5:32:20 AM8/9/06
to
"Matt Mahoney" <matma...@yahoo.com> writes:

I don't dispute the relevance of compression in measuring the quality
of a language model. What I dispute is the notion that you can pass
the Turing test by looking up responses in a table, compressed or not.
My claim is that I, as judge, can easily construct a sequence of
questions that cannot be answered correctly by table lookup, as the
correct answers depends on previous aspects of the conversation, or
other variable context information not even part of the conversation:

Me: I was born in December 1966.
System: That's nice. [[or whatever, no correct answer expected here]]
Me: How many days left before Xmas, not counting today?
System: ??

Despite the inadequacy of a table/language model to generate the right
answer, let's suppose somehow it comes up with it. I continue:

Me: OK, add that to the number of whole years since the first moon
landing, and substract my age, what do you get?
System: ??

Do you see the problem now?

David

David Kinny

unread,
Aug 9, 2006, 6:17:10 AM8/9/06
to
"Matt Mahoney" <matma...@yahoo.com> writes:
>David Kinny wrote:
>[...]

>> In summary, what is proposed is no more or less than a test of the
>> intelligence and skill of lossless compression programmers, not of
>> AI programs.

>Yes, that is what the test is designed to do, unless you think that
>machines can think.

Well, I'm glad we agree on what the test measures. Why then did the
original announcement contain the following very misleading claim:

"Artificial intelligence researchers finally have an objective and
rigorously validated measure of the intelligence of their machines.
Furthermore the higher the measured intelligence of their machines
the more money they can win via the Hutter Prize."

The problem is that "AI researchers' machines" are almost always
intended to do things other than lossless compression of slabs of
text, so the test would invariably claim they have no intelligence.

David

Matt Mahoney

unread,
Aug 9, 2006, 12:47:37 PM8/9/06
to

How do you propose that intelligence (artificial or otherwise) be
measured?

-- Matt Mahoney

Ted Dunning

unread,
Aug 9, 2006, 12:58:46 PM8/9/06
to

Matt Mahoney wrote:
> Ted Dunning wrote:
> > Wasn't it Fisher in the 20's who proved just about the same thing about
> > maximum likelihood estimates?

> I think it was Shannon who proved this in 1948 (discete noiseless


> channel capacity theorem).
> http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
> I could not find any reference to Fisher in his paper.


I was referring to the asymptotic correctness of MLE's, not the
compression. The fundamental insight is that the maximum likelihood
estimator converges to truth. The fact that the MLE also gives you
optimal compression is an add-on to the basic idea and does indeed
follow from Shannon's work.

David Kinny

unread,
Aug 9, 2006, 7:48:41 PM8/9/06
to

>-- Matt Mahoney

By carefully crafted tests of knowledge, capability, comprehension,
and ability to reason, to recognise, to solve problems, to learn...

David

James A. Bowery

unread,
Aug 9, 2006, 9:06:18 PM8/9/06
to
Think about it like this:

You've heard of functions, right? They are things with inputs and
outputs. All possible inputs are called the domain. All possible
outputs are called the range. One way to implement functions is to
make a two column table with one column labeled "inputs" and the other
labeled "outputs" where each row is what the function outputs for the
given input. You'd call it a lookup table.

Now, imagine a program that, for every _sequence_ of N text inputs, i1,
i2, i3...iN, creates N rows in the table, the first row has input of
i1, and some output o1. The input for the next row is i2 appended to
the end of i1 and the output is o2... and you continue this until you
have completed all N inputs. Agreed this could be a VERY BIG table but
so could the table representing the function "+" between two 64 bit
numbers. Yet the size of the program you need to implement the
function "+" between two 64 bit numbers is very small... measured in
only tens of bytes on some computers. Incredible isn't it? That's the
magic of what is called Kolmogorov complexity.

Anyway, so let's go back to our ridiculously big text function and ask
ourselves, how big does the program have to be to generate the program
that's equivalent to that table?

Would you agree that it might be somewhat smaller than the table itself?

Matt Mahoney

unread,
Aug 9, 2006, 11:15:23 PM8/9/06
to

I don't claim that a language model can be implemented as a lookup
table. I claim

1. that a better language model is more likely to pass the Turing test,
2. that a better language model will compress text better.

By language model, I mean a probability distribution over the set of
all strings. So a model will assign p("Today is Aug. 10, 2006 ... I
was born in (rest of dialog)... what do you get? x") some number in
the open interval (0,1) for all x.

Now let q be a dialog ending with a question, and x1 and x2 be two
posible answers. I think it is reasonable to assume that the following
statements are equivalent (all true or all false) for all q, x1, x2:

1. A human picked at random and asked question q is more likely to
answer x1 than x2.
2. A human picked at random is more likely to judge the pair q,x1 as
correct than q,x2.
3. A text corpus of human origin is more likely to contain q,x1 than
q,x2.

Now suppose that a machine is able to compute p() such that the above
statements are equivalent to p(q,x1) > p(q,x2) for all q, x1, and x2.
Then that machine would answer all questions in a way that is
indistinguishable from a human. It would pass the Turing test.

Now suppose that we have two models p1() and p2() and want to know
which is better (closer to p)? One way is to test it is to ask it a
multiple choice question q with answers x1 and x2, and have each
program output the answer with the higher probability according to its
model. So if p1(q,x1) > p1(q,x2) in model p1() then the program
answers x1. If p1() answers x1 and p2() answers x2 and a human judge
decides that x1 is more correct, then p1() is judged more human.

Another test would be for the judge to first decide the correct answer
(say, x1), then ask both models to explicitly compute p1(q,x1) and
p2(q,x1). The model that gives the higher value wins. We don't
normally test humans this way because it is conceptually difficult for
people to assign numeric probabilities. But it is not hard for
machines.

This test can be repeated, for example, by multiplying the scores for
several independent questions to compute the probability that all of
the answers are correct. So given a series of questions q1, q2, q3,...
with correct answers x1, x2, x3..., the winner is the one with the
greatest value of PROD_i [p(x_i|q_i)], or equivalently, the smallest
value of -SUM_i [log p(x_i|q_i)], which is the compressed size of the
answers using ideal coding.

The advantage of this approach is that no human is required to evaluate
the language model. The score is exact, repeatable, and verifiable.

What I tried to show is that good text compression requires
sophisticated reasoning ability and vast knowledge, just as your
example Turing test session does.

-- Matt Mahoney

Matt Mahoney

unread,
Aug 9, 2006, 11:43:08 PM8/9/06
to
David Kinny wrote:
> >How do you propose that intelligence (artificial or otherwise) be
> >measured?
>
> >-- Matt Mahoney
>
> By carefully crafted tests of knowledge, capability, comprehension,
> and ability to reason, to recognise, to solve problems, to learn...
>
> David

If you were to run a competition lasting for many years among AI
systems with cash prizes, how would you design the tests so that the
competition is fair, scores are 100% objective, and systems can't
cheat?

-- Matt Mahoney

David Kinny

unread,
Aug 10, 2006, 6:29:34 PM8/10/06
to

You claimed above that "One way to pass the Turing test would be to


look up the interrogator's exact question in Google and reply with
the text that follows from highest ranked response. The problem is
that your lookup table is several terabytes."

This claim is nonsense - the prospect of passing the Turing test
in this way is effectively nil.

> I claim
> 1. that a better language model is more likely to pass the Turing test,
> 2. that a better language model will compress text better.

Another claim you made, to which I responded above, is:


"2. An explicit representation of the probability distribution of
natural language strings (a correct language model) enables one
to pass the Turing test for AI."

You use the phrase "An explicit representation of ... ", which certainly
sounds like a table of probabilities indexed by strings, and "enables
one to pass the Turing test", which is a very strong claim. But it
seems you now want to downgrade this to a program that computes
probabilities, and to "is more likely to pass the TT". Clearly a
program that has an infinitesimal chance of passing the TT is more
likely to pass than one that has zero chance, but it's not very
reasonable to claim that such a program can pass the TT if its
chances of soing so are sufficiently small. I could create a perl
script that has a non-zero chance to pass the TT in half an hour.

> By language model, I mean a probability distribution over the set of
> all strings. So a model will assign p("Today is Aug. 10, 2006 ... I
> was born in (rest of dialog)... what do you get? x") some number in
> the open interval (0,1) for all x.

Before I try to respond to the rest of this, please clarify for me:

- What you meant by "an explict representation of ..."

- What, exactly, does the probability of some string S represent here?
The probability of S occurring next in a particular document?
P(S occurs next in the whole corpus of written English text)?
P(S occurs next in a randomly chosen spoken English conversation)?
P(S occurs next in a randomly chosen Turing Test)?

- Do you really mean "over the set of ALL strings"? Language models
typically deal with n-grams for some small n. Are you proposing
an idealised, infinitely complex model? Or what?

David

Matt Mahoney

unread,
Aug 10, 2006, 9:24:28 PM8/10/06
to
David Kinny wrote:
> You claimed above that "One way to pass the Turing test would be to
> look up the interrogator's exact question in Google and reply with
> the text that follows from highest ranked response. The problem is
> that your lookup table is several terabytes."
>
> This claim is nonsense - the prospect of passing the Turing test
> in this way is effectively nil.

OK, it's nil. The point I was trying to make is that if you have a
small training corpus, then you have to be smart about how you learn a
language model from it. If it's large, you can use a brute force
string matching approach. I believe that 1 GB is the right size, large
enough to learn syntax and semantics, but small enough that you can't
just search for the judge's question and output whatever follows.

It is controversial whether a language model can be learned from an
unlabeled corpus of positive examples, but I believe it is possible.
It is easy to extract a lexical model (dictionary). There are
statistical semantic models such as LSA, and low level syntactic models
such as hidden Markov models, that have been learned on data sets of
similar size. I believe it is possible with the Wikipedia dataset. It
is doubtful to me that the data is sufficient to learn higher level
reasoning and problem solving skills like arithmetic without some help
from the programmer, but we shall see.

> > I claim
> > 1. that a better language model is more likely to pass the Turing test,
> > 2. that a better language model will compress text better.
>
> Another claim you made, to which I responded above, is:
> "2. An explicit representation of the probability distribution of
> natural language strings (a correct language model) enables one
> to pass the Turing test for AI."
>
> You use the phrase "An explicit representation of ... ", which certainly
> sounds like a table of probabilities indexed by strings, and "enables
> one to pass the Turing test", which is a very strong claim. But it
> seems you now want to downgrade this to a program that computes
> probabilities, and to "is more likely to pass the TT". Clearly a
> program that has an infinitesimal chance of passing the TT is more
> likely to pass than one that has zero chance, but it's not very
> reasonable to claim that such a program can pass the TT if its
> chances of soing so are sufficiently small. I could create a perl
> script that has a non-zero chance to pass the TT in half an hour.

It is irrelevant whether a Turing machine implements a language model
as an algorithm or a lookup table.

> > By language model, I mean a probability distribution over the set of
> > all strings. So a model will assign p("Today is Aug. 10, 2006 ... I
> > was born in (rest of dialog)... what do you get? x") some number in
> > the open interval (0,1) for all x.
>
> Before I try to respond to the rest of this, please clarify for me:
>
> - What you meant by "an explict representation of ..."
>
> - What, exactly, does the probability of some string S represent here?
> The probability of S occurring next in a particular document?
> P(S occurs next in the whole corpus of written English text)?
> P(S occurs next in a randomly chosen spoken English conversation)?
> P(S occurs next in a randomly chosen Turing Test)?
>
> - Do you really mean "over the set of ALL strings"? Language models
> typically deal with n-grams for some small n. Are you proposing
> an idealised, infinitely complex model? Or what?
>
> David

A language model is a function, p1(): S -> [0,1] where S is the set of
all string over some alphabet, e.g. S = {0,1}*, and furthermore, p1(s)
>= 0, and SUM_s p1(s) = 1 for s in S.

Define p(s) as the probability that s will occur in a two way dialog
between a human judge and a human confederate in the Turing test. s is
a prefix of the entire dialog from beginning to end. p(s) is unknown,
like all probabilities in real life. p(s) depends on the judge, the
confederate, and when the test is given.

A model, p1(), is an approximation of p(). The cross entropy is
defined as H1 = -SUM_s p(s) log p1(s), for all s in S. The cross
entropy is a measure of how accurately p1() approximates p(). H1 is
minimized when p1(s) = p(s) for all s in S. Note that H1 is the
expected size of s chosen randomly from distribution p() when
compressed with model p1(). It is required that 0 < p1(s) < 1 for all
s in S.

When I say there is an explicit representation of p1(), I mean that
there is a Turing machine that computes it.

I realize that Wikipedia has a different distribution than p(s) for
Turing tests. The proper training corpus for passing the Turing test
would be a set of transcripts between judges and human confederates,
like those at the Loebner prize website. Chat logs would also be a
better choice than Wikipedia. I argue that this difference is an
irrelevant detail. The main difference is that Wikipedia articles are
not interactive, so a machine training on it would not learn to hold a
conversation. Other than that, text prediction still requires high
level language comprehension, common sense, reasoning, and problem
solving abilities just like passing the Turing test does. If you can
model written articles, then I think extending the model to interactive
conversation would not be too hard.

-- Matt Mahoney

Ted Dunning

unread,
Aug 11, 2006, 5:05:37 PM8/11/06
to

David Kinny wrote:
> Another claim you made, to which I responded above, is:
> "2. An explicit representation of the probability distribution of
> natural language strings (a correct language model) enables one
> to pass the Turing test for AI."
>
> You use the phrase "An explicit representation of ... ", which certainly
> sounds like a table of probabilities indexed by strings, and "enables
> one to pass the Turing test", which is a very strong claim.

It may sound like that to you, but it doesn't sound like that to
somebody who is taking just a bit more time to listen.

An explicit representation is one that is explicitly defined. A table
of logarithms is an explicit representation of the log function. So is
a series expansion.

> But it
> seems you now want to downgrade this to a program that computes
> probabilities, and to "is more likely to pass the TT".

There is no downgrading involved. This is conventional usage.

You are fighting a straw man for no discernible reason.

David Kinny

unread,
Aug 12, 2006, 4:50:32 AM8/12/06
to
Sorry for the delay in getting back to you here ...

That might be true in theory, but we are concerned here with what can
be feasibly implemented in current or future practice, are we not?

> > > By language model, I mean a probability distribution over the set of
> > > all strings. So a model will assign p("Today is Aug. 10, 2006 ... I
> > > was born in (rest of dialog)... what do you get? x") some number in
> > > the open interval (0,1) for all x.
> >
> > Before I try to respond to the rest of this, please clarify for me:
> >
> > - What you meant by "an explict representation of ..."
> >
> > - What, exactly, does the probability of some string S represent here?
> > The probability of S occurring next in a particular document?
> > P(S occurs next in the whole corpus of written English text)?
> > P(S occurs next in a randomly chosen spoken English conversation)?
> > P(S occurs next in a randomly chosen Turing Test)?
> >
> > - Do you really mean "over the set of ALL strings"? Language models
> > typically deal with n-grams for some small n. Are you proposing
> > an idealised, infinitely complex model? Or what?

> A language model is a function, p1(): S -> [0,1] where S is the set of


> all string over some alphabet, e.g. S = {0,1}*, and furthermore, p1(s)
> >= 0, and SUM_s p1(s) = 1 for s in S.

> Define p(s) as the probability that s will occur in a two way dialog
> between a human judge and a human confederate in the Turing test. s is
> a prefix of the entire dialog from beginning to end. p(s) is unknown,
> like all probabilities in real life. p(s) depends on the judge, the
> confederate, and when the test is given.

Here I'm a little confused. You say that s is a prefix of the dialog,
and that p(s) depends on the judge, etc, and when the test is given,
implying that this p() is for a *specific* dialog. In that case the
distribution is trivial: all k possible prefixes have probability 1/k,
all other strings 0 (or some variant of this). Surely you wish to
consider the distribution over some set of dialogs, but which set?

For a given p() to be useful in passing the TT, it has to adequately
cover possible dialogs. It's easy, as judge, to create novel dialogs.
For example, the following short question whose answer is obvious has
almost certainly never been asked before in the history of mankind.

How many legs do 3 ants, an octopus and 4 aardvarks have in total?

Then there's the issue of background knowledge. In a previous posting
you addressed the need to know such things as the date by prefixing
that information onto the actual dialog. How do you propose to
decide which background information to add in explicitly?

It seems, the way this is heading, that you need p() to be some
idealized distribution over the set of all possible human to human TT
dialogs prefixed by all necessary background information. In that
case it doesn't depend on the judge, the time, etc. Right?

> A model, p1(), is an approximation of p(). The cross entropy is
> defined as H1 = -SUM_s p(s) log p1(s), for all s in S. The cross
> entropy is a measure of how accurately p1() approximates p(). H1 is
> minimized when p1(s) = p(s) for all s in S. Note that H1 is the
> expected size of s chosen randomly from distribution p() when
> compressed with model p1(). It is required that 0 < p1(s) < 1 for all
> s in S.

> When I say there is an explicit representation of p1(), I mean that
> there is a Turing machine that computes it.

I'll return below to the question of how such a p1() could be created,
but let's move on to another issue. Let's assume that you have an
actual, computationally tractable implementation of an algorithm that
computes p1(), which is a very good approximation of some suitably
extensive p(). Now how can you actually use this to pass the TT?

You've been asked a question, and the dialog so far is the string d.
You need to find a suffix x such that p1(d+x) is non-zero (I assume
you'll settle for any suitable answer rather than the most likely).
This means searching over the space of possible strings x and
repeatedly evaluating p1(d+x). This is not feasible. And if your
algorithm in fact takes d as input and returns an x s.t. p1(d+x) > 0,
it is not a language model as you've defined it, but something else.

For completeness, I'm just going to backtrack to something you wrote
earlier in the thread.

> Now let q be a dialog ending with a question, and x1 and x2 be two
> posible answers. I think it is reasonable to assume that the following
> statements are equivalent (all true or all false) for all q, x1, x2:

> 1. A human picked at random and asked question q is more likely to
> answer x1 than x2.
> 2. A human picked at random is more likely to judge the pair q,x1 as
> correct than q,x2.
> 3. A text corpus of human origin is more likely to contain q,x1 than
> q,x2.

I'd agree that 1 and 2 are equivalent, but dispute the equivalence of 3.
The reason is that, except for a vanishingly small number of dialogs q,
the probability of a text corpus of human origin containing q,x1 is
identical to the probability of it containing q,x2 , namely 0. As I've
shown it is easy to create dialogs such that this is the case.

> I realize that Wikipedia has a different distribution than p(s) for
> Turing tests. The proper training corpus for passing the Turing test
> would be a set of transcripts between judges and human confederates,
> like those at the Loebner prize website. Chat logs would also be a
> better choice than Wikipedia. I argue that this difference is an
> irrelevant detail. The main difference is that Wikipedia articles are
> not interactive, so a machine training on it would not learn to hold a
> conversation. Other than that, text prediction still requires high
> level language comprehension, common sense, reasoning, and problem
> solving abilities just like passing the Turing test does. If you can
> model written articles, then I think extending the model to interactive
> conversation would not be too hard.

> -- Matt Mahoney

So finally you gloss over several issues, dismissing them as irrelevant
details, or easy. You claim, apparently as an act of faith, that:

- It will be feasible to learn a sufficiently good language model from a
tractably small corpus. This is on a par with claiming that it will
be feasible to learn to play championship Go just from example games.

- Given such a model, it will be feasible to use it to pass the TT.

- The difference between the language model of TT dialogs and of a slab
of Wikipedia is unimportant. You, the programmer, can somehow extend
or convert the one into the other.

- The limited extent of the corpus will not matter. Somehow your LM
will have the capability to generalise, extrapolate, answer novel
questions, incorporate all necessary background information, know
how to do computations, solve simple problems, remember things, ...


Here's my take. It will, eventually, be feasible to create programs
that do all these latter things well enough to pass a TT, even in the
face of the types of lines of questioning I've pointed to. Inevitably
such systems will acquire much of their knowledge by learning, from
conversations and texts amongst other things. But they will be so
much more than just language models, in terms of what they consist of
and what they are capable of doing. They could actually be used to
generate language models of a certain type, ones where probabilities
were significant only in whether they were zero or not. Indeed, I
suspect that the only way you can generate a LM with sufficient
coverage to (in theory) pass the TT, is to use a system with such
capabilities as the generator.

To put it simply, I'd claim the only way you can create a LM good
enough to (in theory) pass the TT is to build a system that can
already pass the TT without any need to create that LM. And you
don't achieve that just as a side-effect of achieving compression
of 1G slabs of text that is close to the information-theoretic limit,
as that task can be achieved by much much simpler approaches.
Compressing text better is at best just one small step on the path
to implementing systems that pass the TT.

David

David Kinny

unread,
Aug 12, 2006, 5:01:03 AM8/12/06
to
"Ted Dunning" <ted.d...@gmail.com> writes:
> David Kinny wrote:
> > Another claim you made, to which I responded above, is:
> > "2. An explicit representation of the probability distribution of
> > natural language strings (a correct language model) enables one
> > to pass the Turing test for AI."
> >
> > You use the phrase "An explicit representation of ... ", which certainly
> > sounds like a table of probabilities indexed by strings, and "enables
> > one to pass the Turing test", which is a very strong claim.

> It may sound like that to you, but it doesn't sound like that to
> somebody who is taking just a bit more time to listen.

Steady on there Ted. Matt began by claiming that one could pass the
TT using a several terabyte lookup table. I'm the one who is taking
the time here to try and clarify exactly what he really means, and it
keeps changing from the laughable pop-science claims of the original
announcement. Hopefully now it's approaching something reasonable.

> An explicit representation is one that is explicitly defined. A table
> of logarithms is an explicit representation of the log function. So is
> a series expansion.

I agree, both are explicit. I really don't want to quibble about
semantics, but I'd say that an algorithm that computes logarithms in
some non-transparent way is by contrast an implicit representation.
Your usage of language may vary.

> > But it
> > seems you now want to downgrade this to a program that computes
> > probabilities, and to "is more likely to pass the TT".

> There is no downgrading involved. This is conventional usage.

The change from "can pass" to "is more likely to pass", from a
strong claim to a weak one, is certainly downgrading. And a program
is potentially feasible, while a lookup table is certainly not, as
it seems we all now agree.

> You are fighting a straw man for no discernible reason.

There is confusion being generated here that deserves to be dispelled.

David

Matt Mahoney

unread,
Aug 12, 2006, 1:18:22 PM8/12/06
to
David Kinny wrote:
> "Matt Mahoney" <matma...@yahoo.com> writes:
> > It is irrelevant whether a Turing machine implements a language model
> > as an algorithm or a lookup table.
>
> That might be true in theory, but we are concerned here with what can
> be feasibly implemented in current or future practice, are we not?

In practice, Turing machines don't exist.

> > Define p(s) as the probability that s will occur in a two way dialog
> > between a human judge and a human confederate in the Turing test. s is
> > a prefix of the entire dialog from beginning to end. p(s) is unknown,
> > like all probabilities in real life. p(s) depends on the judge, the
> > confederate, and when the test is given.
>
> Here I'm a little confused. You say that s is a prefix of the dialog,
> and that p(s) depends on the judge, etc, and when the test is given,
> implying that this p() is for a *specific* dialog. In that case the
> distribution is trivial: all k possible prefixes have probability 1/k,
> all other strings 0 (or some variant of this). Surely you wish to
> consider the distribution over some set of dialogs, but which set?

In real life, probabilities are always unknown. Different people will
estimate them differently depending on what they know. For example, I
flip a fair coin, look at it, but don't show it to you. What is the
probability that it is heads? You say 1/2. I say 1. Both of us are
right.

I define p(s) as the probability that s will be a prefix of a dialog in
a Turing test, given that you do not know who will be the judge or the
confederate or when the test will be given.

I realize that p() is still not well defined. But that is the nature
of probability.

> For a given p() to be useful in passing the TT, it has to adequately
> cover possible dialogs. It's easy, as judge, to create novel dialogs.
> For example, the following short question whose answer is obvious has
> almost certainly never been asked before in the history of mankind.
>
> How many legs do 3 ants, an octopus and 4 aardvarks have in total?

And yet, the probability that this will answered in an unknown training
corpus will not be exactly 0.

Also, I think you will agree that this question and the answer could be
phrased as a text prediction problem, in which case a compressor that
can solve problems like this will compress smaller than one that
cannot.

> Then there's the issue of background knowledge. In a previous posting
> you addressed the need to know such things as the date by prefixing
> that information onto the actual dialog. How do you propose to
> decide which background information to add in explicitly?

If a machine responds too fast to my questions, I will know it is not
human. A dialog must contain information about who said what and when.
You can represent this information how you like.

> It seems, the way this is heading, that you need p() to be some
> idealized distribution over the set of all possible human to human TT
> dialogs prefixed by all necessary background information. In that
> case it doesn't depend on the judge, the time, etc. Right?

It does depend, but you don't have this information, so you have to
make reasonable guesses.

> > A model, p1(), is an approximation of p(). The cross entropy is
> > defined as H1 = -SUM_s p(s) log p1(s), for all s in S. The cross
> > entropy is a measure of how accurately p1() approximates p(). H1 is
> > minimized when p1(s) = p(s) for all s in S. Note that H1 is the
> > expected size of s chosen randomly from distribution p() when
> > compressed with model p1(). It is required that 0 < p1(s) < 1 for all
> > s in S.
>
> > When I say there is an explicit representation of p1(), I mean that
> > there is a Turing machine that computes it.
>
> I'll return below to the question of how such a p1() could be created,
> but let's move on to another issue. Let's assume that you have an
> actual, computationally tractable implementation of an algorithm that
> computes p1(), which is a very good approximation of some suitably
> extensive p(). Now how can you actually use this to pass the TT?
>
> You've been asked a question, and the dialog so far is the string d.
> You need to find a suffix x such that p1(d+x) is non-zero (I assume
> you'll settle for any suitable answer rather than the most likely).
> This means searching over the space of possible strings x and
> repeatedly evaluating p1(d+x). This is not feasible. And if your
> algorithm in fact takes d as input and returns an x s.t. p1(d+x) > 0,
> it is not a language model as you've defined it, but something else.

This is a perfectly reasonable implementation for a Turing machine. In
practice, any language model can be converted to a generator with at
most O(n) increase in tme complexity by binary search by appending one
bit at a time. You first compute p1(d+0) and p1(d+1) and randomly
append one bit in proportion to these probabilities, repeating until
you have appended a symbol indicating the end of the message (which is
required in a dialog).

> For completeness, I'm just going to backtrack to something you wrote
> earlier in the thread.
>
> > Now let q be a dialog ending with a question, and x1 and x2 be two
> > posible answers. I think it is reasonable to assume that the following
> > statements are equivalent (all true or all false) for all q, x1, x2:
>
> > 1. A human picked at random and asked question q is more likely to
> > answer x1 than x2.
> > 2. A human picked at random is more likely to judge the pair q,x1 as
> > correct than q,x2.
> > 3. A text corpus of human origin is more likely to contain q,x1 than
> > q,x2.
>
> I'd agree that 1 and 2 are equivalent, but dispute the equivalence of 3.
> The reason is that, except for a vanishingly small number of dialogs q,
> the probability of a text corpus of human origin containing q,x1 is
> identical to the probability of it containing q,x2 , namely 0. As I've
> shown it is easy to create dialogs such that this is the case.

Without full knowledge, you cannot say that any probability is 0. If
you had full knowledge then there would be no use for probability
theory.

> > I realize that Wikipedia has a different distribution than p(s) for
> > Turing tests. The proper training corpus for passing the Turing test
> > would be a set of transcripts between judges and human confederates,
> > like those at the Loebner prize website. Chat logs would also be a
> > better choice than Wikipedia. I argue that this difference is an
> > irrelevant detail. The main difference is that Wikipedia articles are
> > not interactive, so a machine training on it would not learn to hold a
> > conversation. Other than that, text prediction still requires high
> > level language comprehension, common sense, reasoning, and problem
> > solving abilities just like passing the Turing test does. If you can
> > model written articles, then I think extending the model to interactive
> > conversation would not be too hard.
>
> > -- Matt Mahoney
>
> So finally you gloss over several issues, dismissing them as irrelevant
> details, or easy. You claim, apparently as an act of faith, that:
>
> - It will be feasible to learn a sufficiently good language model from a
> tractably small corpus. This is on a par with claiming that it will
> be feasible to learn to play championship Go just from example games.

I believe it is possible to develop lexical, semantic, and syntactic
models from unlabeled text containing only positive examples. Children
seem to do this. It is probably insufficient to develop higher level
skills like solving word problems with arithmetic. That will probably
take some help from the programmer.

A corpus consisting of articles is not sufficient to train a chatterbot
in interactive converstation. But I argue that such a model is still
useful, probably more so. If you want to translate articles, scan
them, recognize them from broadcast speech, etc., you want to train on
the type of text that it would typically encounter.

> - Given such a model, it will be feasible to use it to pass the TT.

Given about 1 GB of the right type of training material (dialogs), yes.

> - The difference between the language model of TT dialogs and of a slab
> of Wikipedia is unimportant. You, the programmer, can somehow extend
> or convert the one into the other.

There is a lot of overlap in the learning mechanisms. The difference
is mainly in the training material and the requirement for interaction.

> - The limited extent of the corpus will not matter. Somehow your LM
> will have the capability to generalise, extrapolate, answer novel
> questions, incorporate all necessary background information, know
> how to do computations, solve simple problems, remember things, ...

I think 1 GB is the right size, as explained in
http://cs.fit.edu/~mmahoney/compression/rationale.html

I do not claim that this particular corpus is right for any particular
AI problem you want to solve. You need to be careful to feed it the
information you want it to know. But I believe for the problem of
predicting text in a corpus, the best information to feed it is the
corpus itself. So I have no problem with the Wikipedia text for this
purpose.

> Here's my take. It will, eventually, be feasible to create programs
> that do all these latter things well enough to pass a TT, even in the
> face of the types of lines of questioning I've pointed to. Inevitably
> such systems will acquire much of their knowledge by learning, from
> conversations and texts amongst other things. But they will be so
> much more than just language models, in terms of what they consist of
> and what they are capable of doing. They could actually be used to
> generate language models of a certain type, ones where probabilities
> were significant only in whether they were zero or not. Indeed, I
> suspect that the only way you can generate a LM with sufficient
> coverage to (in theory) pass the TT, is to use a system with such
> capabilities as the generator.
>
> To put it simply, I'd claim the only way you can create a LM good
> enough to (in theory) pass the TT is to build a system that can
> already pass the TT without any need to create that LM. And you
> don't achieve that just as a side-effect of achieving compression
> of 1G slabs of text that is close to the information-theoretic limit,
> as that task can be achieved by much much simpler approaches.
> Compressing text better is at best just one small step on the path
> to implementing systems that pass the TT.
>
> David

Passing the Turing test requires:
1. Solving the problem of language learning.
2. Modeling interactive conversation.
3. Selecting the right training data.

I am only addressing problem 1.

-- Matt Mahoney

Ian Parker

unread,
Aug 13, 2006, 3:59:15 PM8/13/06
to

Matt Mahoney wrote:
> David
>
> Passing the Turing test requires:
> 1. Solving the problem of language learning.
> 2. Modeling interactive conversation.
> 3. Selecting the right training data.
>
> I am only addressing problem 1.
>
In fact most of the programs that have beeen entered for Turing/Loebner
have been based on databases. (like a Google search). Given a database
only "1" is needed to pass the Turing Test.

Essentially chatterboxes have done "2" quite adaquately.

I realize that this paragraph is one of high subjectivity (nothing like
compression!), but it is doubtful to what extent human beings really
think for themselves. Social scientists and evolutionary biologists
talk about memes, which I suppose you could describe in AI terms as
being elements of a database. The test is for a conversation similar to
that which you could have with a human and it is my experience that you
do not look for reason! Extreme Islamic statements would for example
pass T. They are of course memes.

Clearly if you have compression, or some other purely objective
criterion this does not apply. We are not in compression attempting to
mimic human behaviour. I think you are right, it is the only true test,
although I don't believe monolinual Wiki is the right answer.

James A. Bowery

unread,
Aug 13, 2006, 9:29:11 PM8/13/06
to
Ian Parker wrote:
> I think you are right, it is the only true test,
> although I don't believe monolinual Wiki is the right answer.

Hutter's decision to go with 10e8 bytes was an uneasy, temporary
compromise. Certainly multilingual compression will be very valuable
-- particularly for my personal obsession which is applied relation
arithmetic.

Matt Mahoney

unread,
Aug 14, 2006, 12:15:59 AM8/14/06
to

I had considered a multilingual corpus because it would encourage
research in learning language independent parsing rules rather than
hard coding rules into the program. For example, Chinese does not use
spaces between words (1 to 4 characters), making it a challenge to
build a dictionary. Nevertheless, it is possible to do this in English
text without spaces, starting with no knowledge. Such models might
help us understand how babies learn to parse continuous speech into
words at age 7-10 months, before they learn to speak.
http://cs.fit.edu/~mmahoney/dissertation/lex1.html

I used a monolingual corpus because multilingual ability is not a
requirement for intelligence. The problem is hard enough as it is. If
the same articles appeared in several languages, then compression would
require solving the language translation problem in addition to the
modeling problem.

-- Matt Mahoney

David Kinny

unread,
Aug 14, 2006, 4:57:47 AM8/14/06
to
"Ian Parker" <ianpa...@gmail.com> writes:

> [...] Clearly if you have compression, or some other purely objective


> criterion this does not apply. We are not in compression attempting to
> mimic human behaviour. I think you are right, it is the only true test,
> although I don't believe monolinual Wiki is the right answer.

The problems with compressing text as a benchmark of intelligence are
that most systems don't compress text at all, and it is feasible to
compress text very well while understanding nothing about its meaning,
i.e. being incapable of passing a simple comprehension test.

David

David Kinny

unread,
Aug 14, 2006, 5:10:53 AM8/14/06
to
"Matt Mahoney" <matma...@yahoo.com> writes:

> David Kinny wrote:
> > "Matt Mahoney" <matma...@yahoo.com> writes:
> > > It is irrelevant whether a Turing machine implements a language model
> > > as an algorithm or a lookup table.
> >
> > That might be true in theory, but we are concerned here with what can
> > be feasibly implemented in current or future practice, are we not?

> In practice, Turing machines don't exist.

Tell me something I don't know. That's why I keep talking about
tractable algorithms and feasibility ...

> > > Define p(s) as the probability that s will occur in a two way dialog
> > > between a human judge and a human confederate in the Turing test. s is
> > > a prefix of the entire dialog from beginning to end. p(s) is unknown,
> > > like all probabilities in real life. p(s) depends on the judge, the
> > > confederate, and when the test is given.
> >
> > Here I'm a little confused. You say that s is a prefix of the dialog,
> > and that p(s) depends on the judge, etc, and when the test is given,
> > implying that this p() is for a *specific* dialog. In that case the
> > distribution is trivial: all k possible prefixes have probability 1/k,
> > all other strings 0 (or some variant of this). Surely you wish to
> > consider the distribution over some set of dialogs, but which set?

> In real life, probabilities are always unknown. Different people will
> estimate them differently depending on what they know. For example, I
> flip a fair coin, look at it, but don't show it to you. What is the
> probability that it is heads? You say 1/2. I say 1. Both of us are
> right.

> I define p(s) as the probability that s will be a prefix of a dialog in
> a Turing test, given that you do not know who will be the judge or the
> confederate or when the test will be given.

> I realize that p() is still not well defined. But that is the nature
> of probability.

... while you keep retreating back to unknowable idealisations that
you're somehow going to produce excellent approximations to. Your p()
contains information on all conceivable TT dialogs for all (english
speaking?) humans over all time. You're going to somehow learn this
from 1G of text. I don't think so.

> > For a given p() to be useful in passing the TT, it has to adequately
> > cover possible dialogs. It's easy, as judge, to create novel dialogs.
> > For example, the following short question whose answer is obvious has
> > almost certainly never been asked before in the history of mankind.
> >
> > How many legs do 3 ants, an octopus and 4 aardvarks have in total?

> And yet, the probability that this will answered in an unknown training
> corpus will not be exactly 0.

A mere quibble. Whether it is exactly 0 or vanishingly small, you can't
learn to reliably answer such questions by text prediction based on a
manageably small training corpus, because you can't learn a sufficiently
good approximation of your idealized p(). There are simply too many
easily asked and answered questions you will never have seen before.
Your system must be capable of responding to novel questions like this,
and LM's merely learned directly from text corpora can't.

> Also, I think you will agree that this question and the answer could be
> phrased as a text prediction problem, in which case a compressor that
> can solve problems like this will compress smaller than one that
> cannot.

Surely. But that doesn't imply the converse: that a compressor that
compresses smaller than <one that cannot solve such problems>, can solve
such problems.

> > Then there's the issue of background knowledge. In a previous posting
> > you addressed the need to know such things as the date by prefixing
> > that information onto the actual dialog. How do you propose to
> > decide which background information to add in explicitly?

> If a machine responds too fast to my questions, I will know it is not
> human.

Are you serious? This is easily addressed.

> A dialog must contain information about who said what and when.
> You can represent this information how you like.

Delimiters between utterances are indeed necessary. Timestamps are not.
In any case you've missed (or are deliberately avoiding?) the point here.
You prefixed the actual dialog with non-dialog information such as the
date, presumably to prop up your argument that you could learn an LM
that could answer questions such as "How many days to Xmas?".
The issue is that you need to add a potentially huge amount of
background information that is known by humans but absent from dialogs.

OK, that seems reasonable.

> > For completeness, I'm just going to backtrack to something you wrote
> > earlier in the thread.
> >
> > > Now let q be a dialog ending with a question, and x1 and x2 be two
> > > posible answers. I think it is reasonable to assume that the following
> > > statements are equivalent (all true or all false) for all q, x1, x2:
> >
> > > 1. A human picked at random and asked question q is more likely to
> > > answer x1 than x2.
> > > 2. A human picked at random is more likely to judge the pair q,x1 as
> > > correct than q,x2.
> > > 3. A text corpus of human origin is more likely to contain q,x1 than
> > > q,x2.
> >
> > I'd agree that 1 and 2 are equivalent, but dispute the equivalence of 3.
> > The reason is that, except for a vanishingly small number of dialogs q,
> > the probability of a text corpus of human origin containing q,x1 is
> > identical to the probability of it containing q,x2 , namely 0. As I've
> > shown it is easy to create dialogs such that this is the case.

> Without full knowledge, you cannot say that any probability is 0. If
> you had full knowledge then there would be no use for probability
> theory.

The probabilities assigned to strings by your LM is determined by how
you construct the LM, not by what you or I might consider them to be.
You have yet to explain how you will create an LM by training on a
corpus that will assign non-zero probabilities to strings it has never
encountered, including all questions that might reasonably be asked.

> > > I realize that Wikipedia has a different distribution than p(s) for
> > > Turing tests. The proper training corpus for passing the Turing test
> > > would be a set of transcripts between judges and human confederates,
> > > like those at the Loebner prize website. Chat logs would also be a
> > > better choice than Wikipedia. I argue that this difference is an
> > > irrelevant detail. The main difference is that Wikipedia articles are
> > > not interactive, so a machine training on it would not learn to hold a
> > > conversation. Other than that, text prediction still requires high
> > > level language comprehension, common sense, reasoning, and problem
> > > solving abilities just like passing the Turing test does. If you can
> > > model written articles, then I think extending the model to interactive
> > > conversation would not be too hard.
> >
> > > -- Matt Mahoney
> >
> > So finally you gloss over several issues, dismissing them as irrelevant
> > details, or easy. You claim, apparently as an act of faith, that:
> >
> > - It will be feasible to learn a sufficiently good language model from a
> > tractably small corpus. This is on a par with claiming that it will
> > be feasible to learn to play championship Go just from example games.

> I believe it is possible to develop lexical, semantic, and syntactic
> models from unlabeled text containing only positive examples. Children
> seem to do this. It is probably insufficient to develop higher level
> skills like solving word problems with arithmetic. That will probably
> take some help from the programmer.

Children have inherent capabilities that permit them to do this, and much
more. Until you work out how to reproduce a sufficient subset of those
capabilities, you can't expect to replicate the achievent. You believe
that learning a language model for a suitable corpus is most of the job.
I think you, like many others, are grossly underestimating the complexity
of the problem, and looking for simple solutions to complex problems.

> A corpus consisting of articles is not sufficient to train a chatterbot
> in interactive converstation. But I argue that such a model is still
> useful, probably more so. If you want to translate articles, scan
> them, recognize them from broadcast speech, etc., you want to train on
> the type of text that it would typically encounter.

> > - Given such a model, it will be feasible to use it to pass the TT.

> Given about 1 GB of the right type of training material (dialogs), yes.

Dream on. The information content is inadequate.

> > - The difference between the language model of TT dialogs and of a slab
> > of Wikipedia is unimportant. You, the programmer, can somehow extend
> > or convert the one into the other.

> There is a lot of overlap in the learning mechanisms. The difference
> is mainly in the training material and the requirement for interaction.

How can you claim this, when you have yet to devise the learning
mechanisms?

> > - The limited extent of the corpus will not matter. Somehow your LM
> > will have the capability to generalise, extrapolate, answer novel
> > questions, incorporate all necessary background information, know
> > how to do computations, solve simple problems, remember things, ...

> I think 1 GB is the right size, as explained in
> http://cs.fit.edu/~mmahoney/compression/rationale.html

> I do not claim that this particular corpus is right for any particular
> AI problem you want to solve. You need to be careful to feed it the
> information you want it to know. But I believe for the problem of
> predicting text in a corpus, the best information to feed it is the
> corpus itself. So I have no problem with the Wikipedia text for this
> purpose.

I totally agree that to predict the text that occurs in X, train on X.
But you've still offered no convincing argument as to how this allows
your system to answer questions that don't occur in X, or even ones
that are minor variations of those that do.

> [...]


> Passing the Turing test requires:
> 1. Solving the problem of language learning.
> 2. Modeling interactive conversation.
> 3. Selecting the right training data.

> I am only addressing problem 1.

> -- Matt Mahoney

I guess we've just about reached the point of agreeing to disagree.

David

Ian Parker

unread,
Aug 14, 2006, 6:35:28 AM8/14/06
to
David Kinney wrote

>The problems with compressing text as a benchmark of intelligence are
>that most systems don't compress text at all, and it is feasible to
>compress text very well while understanding nothing about its meaning,
>i.e. being incapable of passing a simple comprehension test.

To achieve high levels of compression, particularly with multilingual
texts, understanding is required. "What do you mean by understand?".
You need an objective definition for that too. The only possible
criterion is a Logical Posivist one. What can you do? In other words.

It is clear to me too that "meaning" is a part of the information
content of text, and a compression algorithm must at some point
consider meaning. With multilinual texts this has to be considered
early on.

- Ian Parker

James A. Bowery

unread,
Aug 14, 2006, 5:19:09 PM8/14/06
to

Actually, rules are compressed knowledge. Tractability and
comprehension constant, the more unified the rules, the less space they
occupy and the more accurate they are at predicting future observations.

Matt Mahoney

unread,
Aug 14, 2006, 5:34:40 PM8/14/06
to
David Kinny wrote:
> "Matt Mahoney" <matma...@yahoo.com> writes:
> > A corpus consisting of articles is not sufficient to train a chatterbot
> > in interactive converstation. But I argue that such a model is still
> > useful, probably more so. If you want to translate articles, scan
> > them, recognize them from broadcast speech, etc., you want to train on
> > the type of text that it would typically encounter.
>
> > > - Given such a model, it will be feasible to use it to pass the TT.
>
> > Given about 1 GB of the right type of training material (dialogs), yes.
>
> Dream on. The information content is inadequate.

OK, then, how much is adequate? I presented evidence that the storage
capacity for verbally expressable knowledge in the human brain is about
10^9 bits.

> I totally agree that to predict the text that occurs in X, train on X.
> But you've still offered no convincing argument as to how this allows
> your system to answer questions that don't occur in X, or even ones
> that are minor variations of those that do.

If I knew how to solve the language modeling problem then I would have
solved it. The fact is that nobody knows how to solve it. That is why
there is a contest. Maybe it can't be solved.

But let's not be pessimistic. I believe the Wikipedia data set has
enough information to infer general knowledge and then apply it to
novel situations occurring later in the text. Here is an example.
There is no occurrence of the string "the United States is a place" in
enwik8 (the Hutter prize data set). Yet there are 1359 occurrences of
"in the United States", 265 of "to the United States", etc. from which
this knowledge could be inferred. Knowing whether X is a place or a
person allows me to make better predictions of the text that follows.
The phrase "X said" tells me that X is probably a person, so I can
predict phrases like "X believes", "X thinks", etc.

So if a program can make such inferences, could it not compress better
than one that couldn't? Or is this not a valid test?

-- Matt Mahoney

Message has been deleted

Rob Freeman

unread,
Aug 15, 2006, 5:31:56 AM8/15/06
to
Matt,

I am interested in this idea that intelligence is equivalent to
compression. I like it. My question is, what guarantee do we have that
a given compression has any generality?

I'm guessing this is related to the assumption that the problem be
limited to a "restricted case of an environment bounded by space s and
time t."

I'm not trying to be negative by saying this. It is just I think there
is a better solution to be found by not making that assumption. Not a
better solution to the compression problem, but a better solution to
the problem of intelligence defined as compression.

-Rob Freeman

Matt Mahoney wrote:
> Ted Dunning wrote:
> > Wasn't it Fisher in the 20's who proved just about the same thing about
> > maximum likelihood estimates?
> >

> > Certainly, there is nothing new in the thesis that if you have an
> > estimate q of a probability p, then the using q as a compression model
> > gives minimum compressed size for p = q. Moreover, this minimum is
> > unique. This is the same as saying that the maximum likelihood value
> > of q is p (not quite the same as the maximum likelihood *estimator*, of
> > course).


>
> I think it was Shannon who proved this in 1948 (discete noiseless
> channel capacity theorem).
> http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
> I could not find any reference to Fisher in his paper.
>

> > Why this is news is beyond me.
>
> That isn't what Hutter proved. What he proved is that the optimal
> behavior of a goal seeking agent in an unknown but computable
> environment is to guess that the environment is controlled by the
> shortest program (Turing machine) consistent with all behavior observed
> so far. Thus, a solution to the problem of maximizing the accumulated
> reward signal from an unknown environment reduces to algorithmic
> compression, which is an undecidable problem. However, Hutter also
> proved that in the restricted case of an envoronment bounded by space s
> and time t, that the problem is computable in time O(t2^s) (so the
> problem is still hard).


>
> When I developed the large text benchmark (without funding) my goal was
> to promote research in NLP. There are 3 arguments why text compression
> implies AI.
>
> 1. Hutter's proof in the specific case of text prediction (a problem
> that is easy for people but hard for machines).

> 2. An explicit representation of the probability distribution of
> natural language strings (a correct language model) enables one to pass
> the Turing test for AI.

> 3. Compression ratio (perplexity) is already used in practice to
> measure the quality of language models (mainly for speech recognition).
>
> However, AI does *not* imply compression. Decompression requires a
> deterministic computation to invert the compression function, and the
> human brain is not deterministic. I believe this is the reason that
> many people doubt the converse. I discuss these issues in
> http://cs.fit.edu/~mmahoney/compression/rationale.html
>
> -- Matt Mahoney

Matt Mahoney

unread,
Aug 15, 2006, 3:28:41 PM8/15/06
to
Rob Freeman wrote:
> Matt,
>
> I am interested in this idea that intelligence is equivalent to
> compression. I like it. My question is, what guarantee do we have that
> a given compression has any generality?

There is no such thing as a perfectly general compression function.
Every compressor that decreases the size of some inputs must increase
the size of others.

Of course, most "interesting" data, like text or images, is
compressable. We could probably define "interesting" to mean that
there is a short description of it. More formally, a string of length
n bits is interesting if there is a universal Turing machine with an
encoding shorter than n bits that outputs the string. The length of
the shortest program to do this is the algorithmic complexity.
(Obviously it depends on the programming language, but only up to a
constant, because any program can be converted to another language by
appending a compiler). The problem of computing algorithmic complexity
for all strings in any given language is not computable.

Since "general" compression is not obtainable, that isn't the goal of
the Hutter prize. I expect that the best solutions will be tuned
specifically to the data set, and not be very useful as general purpose
compressors. However, to do well, these programs will have to solve
hard problems in language modeling, like how to integrate lexical,
semantic and syntactic knowledge to recognize correct sentences and
coherent paragraphs. I think the data has sufficient knowledge to do
this if we can figure out how language is learned.

> I'm guessing this is related to the assumption that the problem be
> limited to a "restricted case of an environment bounded by space s and
> time t."

These bounds reduce AIXI from an undecidable problem to an intractable
one (O(t2^s)), so it is still a theoretical result. The goal is not to
verify AIXI. It is proven, so it doesn't need to be verified
experimentally. The goal is more modest, to understand how people
learn language. I believe this problems is tractable, or else our
brains couldn't do it.

> I'm not trying to be negative by saying this. It is just I think there
> is a better solution to be found by not making that assumption. Not a
> better solution to the compression problem, but a better solution to
> the problem of intelligence defined as compression.

The Turing test is more direct. We have the Loebner prize for that.
The Turing test requires machines to have a world model and to reason
and solve problems using it. The Wikipedia benchmark is probably
insufficient to train a system for this level of abstract thought, but
I think it will bring us to a level of language modeling much higher
than we have now.

-- Matt Mahoney

Rob Freeman

unread,
Aug 15, 2006, 8:08:55 PM8/15/06
to
Matt Mahoney wrote:
>
> There is no such thing as a perfectly general compression function.

Is this true also for language?

> The problem of computing algorithmic complexity
> for all strings in any given language is not computable.

This is probably the result I am interested in. Why is this so?

Most especially, if it is so, why are you looking for a single
compressor as the solution to your "language learning" problem, even
for a single data set?

> Since "general" compression is not obtainable, that isn't the goal of
> the Hutter prize. I expect that the best solutions will be tuned
> specifically to the data set, and not be very useful as general purpose
> compressors. However, to do well, these programs will have to solve
> hard problems in language modeling, like how to integrate lexical,
> semantic and syntactic knowledge to recognize correct sentences and
> coherent paragraphs. I think the data has sufficient knowledge to do
> this if we can figure out how language is learned.

My impression is that you are assuming the problem with "language
modeling" has been that have not known how to compress natural
language. What I am suggesting is that the problem has been we only
ever look for a single best solution.

If you agree there are many solutions, that each data set will have a
different solution, and by implication that each subset of each
data-set will have a different best solution, why should we not
optimize our compressor on, say, each sentence as that sentence is
presented to the system?

Why do we have to optimize globally for a solution to the language
modeling problem, even when we know that global solutions are not
general?

-Rob

Overcomer

unread,
Aug 16, 2006, 3:49:39 AM8/16/06
to
James A. Bowery wrote:
<snip>
> Indeed, Hutter's Razor** might be phrased, "It is truer to explain with
> less that which can be explained with more."
<snip>
> Named for the discoverer of the proof and the initial 50,000 Euro
> donor, the Hutter Prize currently targets the compression of a 100
> megabyte sample of human knowledge drawn from the broadly based
> Wikipedia online encyclopedia.
<snip>

Certianly one is free to put one's money down on any proposition but in
this case it might be better spent on a search for truth. Real
scientists don't really use Occam's Razor except as a convienent excuse
to dismiss things they don't like. To claim it is more true to explain
less is much like adherents of certian programming languages who claim
there is no need to comment code. I would argue the general
proposition of proving less is more cannot be proven because it is
obviously false. Let's prove that. Take any brand of encyclopedia and
reduce it to the character X. There is clearly no more truth in
stating "X" than stating the entire encyclopedia. So hereafter you may
consider Hutter's Razor as presented by Bowery's Post as proven false.
Hutter may have something worthwhile that is not represented with
sufficient accuracy by Bowery's Post. So then more accuracy yeilds
more truth, not less volume, unless we're talking about Liespace where
fewer statements are always more true and silence is golden.

If more truth exists saying less, why post at all? The Wikipedia is
also a big joke of mediocrity, enforced by the moderator method of self
appointed eletist whacked in the head peer reviews who reject real
truth which can hurt for the sake of truth by committee.

So I request you get Hutter himself to post and defend himself as
Bower's post doesn't present stuff that's even defensible to a logical
person. Sometimes truth is so important it is repeated -- see the 4
gospels on Jesus Christ.

Sincerely,
Kirk Fraser

James A. Bowery

unread,
Aug 16, 2006, 1:37:03 PM8/16/06
to

Overcomer wrote:
> James A. Bowery wrote:
> > Indeed, Hutter's Razor** might be phrased, "It is truer to explain with
> > less that which can be explained with more."
> To claim it is more true to explain
> less

Why did you read "explain with less" as "explain less"?

James A. Bowery

unread,
Aug 16, 2006, 1:43:14 PM8/16/06
to
Rob Freeman wrote:
> Most especially, if it is so, why are you looking for a single
> compressor as the solution to your "language learning" problem, even
> for a single data set?

There are actually 2 things of value that can come out of the Hutter
Prize:

1) An AI compressor that works for encyclopedic knowledge.

2) More conicise enyclopedic knowledge.

2 needn't be accomplished via 1. It can also be accomplished by humans
creating formalized knowledge that is more concise.

Rob Freeman

unread,
Aug 16, 2006, 9:38:53 PM8/16/06
to
Hi James,

I think there is a lot of value that can come out of it. Don't get me
wrong. I think you are on the right track. In contrast to some of the
other messages in this thread I'm sure compression is a very good way
to see intelligence, "meaning" anyway, more narrowly, and certainly
natural language.

But I'm most interested in the limitations Matt places on that
compression. Those limitations bring you closer to what I think is the
answer than any other work I've seen.

In particular I like these statements from Matt:

"There is no such thing as a perfectly general compression function.

Every compressor that decreases the size of some inputs must increase
the size of others."

"The problem of computing algorithmic complexity for all strings in any


given language is not computable."

"Since "general" compression is not obtainable ... I expect that the


best solutions will be tuned specifically to the data set, and not be
very useful as general purpose compressors."

I think these are the right observations. I'm just trying to remind you
there are other conclusions you could draw from those observations.

Isn't the attempt to model language by finding a single best
compression function related to the assumption that there is a general
compression function for language? If we accept that general solution
does not exist, shouldn't that make us reassess our goal (a single best
solution) rather than simply the generality of that goal?

What meaning does a global solution for a compression of language have
if it is not general?

I'm just trying to nudge you in the direction of seeing there are
possibilities for modeling language other than general/global solutions
(which surely we only seek because of a traditional bias in the way we
think about language -- as parametrized by a single grammar.)

You are already on the right track when you identify compression with
language modeling, and in particular when you accept that there is no
general solution for that compression. I just want you to think about
what that means for the way you formulate the problem

If solutions are not general, why not optimize locally instead of
globally? When we understand there are no general solutions, we
understand solutions must be local in a sense. Why compromise and make
those solutions only half local by optimizing them over the biggest
data set you can find? Think about what happens as your data set gets
larger. Aren't you only approximating more and more closely a general
solution which we already know does not exist?

My suggestion is, if solutions are necessarily local, why not address
your intelligence problem, language modeling anyway, as a search for
really relevant local solutions, and forget global solutions
completely?

-Rob

Matt Mahoney

unread,
Aug 16, 2006, 9:44:00 PM8/16/06
to
Rob Freeman wrote:
> Matt Mahoney wrote:
> >
> > There is no such thing as a perfectly general compression function.
>
> Is this true also for language?

Only if you exclude some possible strings. For "normal" text where you
exclude highly unlikely byte strings (such as nonprintable characters),
then you can always compress it. But if you allow all binary strings
of n bits, then there is no function that will compress all 2^n
possibilities. That is because there are only 2^n - 1 binary strings
of length less than n. (This is the counting theorem or counting
argument).

> > The problem of computing algorithmic complexity
> > for all strings in any given language is not computable.
>
> This is probably the result I am interested in. Why is this so?

Kolmogorov gave a simple proof of this. Suppose there was a function
K(x) that outputs the length of the shortest program (in some fixed
language) that outputs x. (This number is also known as the
algorithmic complexity or Kolmogorov complexity). Then you could write
the following program to find the first string x (in some
lexicographical ordering) that cannot be compressed to n bits for some
n:

let x = ""
while K(x) <= n do
x = next(x)
output x

Now let n be the length of this program, and you have a contradiction.
Therefore K() is not computable.

Even though K(x) is not computable, it can be proven that K(x) does not
depend on the programming language used, except for a constant
independent of x. This is because if x is coded as a program in
language L1, you can code it in any other language L2 by appending a
compiler for L1 written in L2.

The motivation for using Kolmogorov complexity as opposed to some
probability model to define information content is that it removes any
debate about which model should be used. You intuitively "know" that
"00000000000000" has less information than "hvvW80f8fwl34i". Why? It
is easier to remember. It is easier to describe. It follows a
pattern. Because it has less information, it should compress smaller.
But suppose my model says that characters are independent and letters
are more common than digits. Then my compressor would compress the
second string smaller. This is not intuitive. Instead, algorithmic
complexity has a universal prior. The first string has a shorter
program that will output it in most languages. This includes English,
when I say it is easier to describe ("14 zeros" is shorter than
"hvvW80f8fwl34i").

> Most especially, if it is so, why are you looking for a single
> compressor as the solution to your "language learning" problem, even
> for a single data set?

Essentially, the "intelligence" of a model p1(x) depends on how
accurately it approximates the true but unknown natural language model
p(x) for all strings x. Shannon proved that the cross entropy E[-SUM_x
p(x) log p1(x)] is minimized when p1(x) = p(x) for all x. We don't
know p(x) but we can approximate this value by sampling x from the
unknown distribution p(x) and computing p1(x) with the model under
test. So the test depends on the assumption that WIkipedia is a fair
sample of natural language text

There is still one big problem. Probabilities are not intrinsic
properties of real world events. They depend on what you know. If you
know that you will be compressing a specific x, then your model is p(x)
= 1 and you can compress x to 0 bits. There are two ways to solve
this. One is to not reveal x. The other is to require a universal
prior, p(x) = 2^-K(x).

The first approach is widely used. If we used this approach, we could
test compressors on maybe just a few KB of text and get fairly accurate
results. But nobody could verify or repeat our results.

We used the second approach. The problem is that K(x) is a universal
model which says nothing about natural language. Our solution is to
make x big enough (we think) to learn natural language from x itself.

Whether it is possible to learn natural language at all from unlabeled
text, and whether the Wikipedia data is the right quantity or quality
to do this is controversial. I discuss these issues further in

Rob Freeman

unread,
Aug 16, 2006, 10:29:34 PM8/16/06
to
Matt Mahoney wrote:
>
> > > The problem of computing algorithmic complexity
> > > for all strings in any given language is not computable.
> >
> > This is probably the result I am interested in. Why is this so?
>
> Kolmogorov gave a simple proof of this.

Good, yeah. That was what I thought you meant. I'm not arguing with it,
only how we interpret it for the "language modeling" problem.

> > Most especially, if it is so, why are you looking for a single
> > compressor as the solution to your "language learning" problem, even
> > for a single data set?
>
> Essentially, the "intelligence" of a model p1(x) depends on how
> accurately it approximates the true but unknown natural language model

> p(x) for all strings x...

_The_ "true but unknown natural language model"?

Do you agree there is an assumption here? Why do we assume a "true but
unknown" language model exists at all?

What happens if there is no "true by unknown" language model. What
happens if if all you really have is text?

-Rob

Overcomer

unread,
Aug 17, 2006, 2:15:58 AM8/17/06
to

Why did you ask? Obviously your first quote is contained in your
second with less.

Overcomer

unread,
Aug 17, 2006, 2:16:58 AM8/17/06
to

Ian Parker

unread,
Aug 17, 2006, 12:44:18 PM8/17/06
to

Rob Freeman wrote:
>
> What happens if there is no "true by unknown" language model. What
> happens if if all you really have is text?
>
There must be.Look eclusia/cerradura is one bit. Primavera, ressorte,
mamanthal is two (well just about). If a human has no difficulty in
distinguishing between the different meanings the clear implication is
that these bits are redundant.

I still believe that the best compressor is probably the deterministic
guesser with the top guesses having a small number of bits.

You may of course believe that there is some sort of "vital force" in
human translation and understanding. I doubt though whether many of
this usergroup's readership would agree.

- Ian Parker

Matt Mahoney

unread,
Aug 17, 2006, 4:22:45 PM8/17/06
to
Overcomer wrote:
> So I request you get Hutter himself to post and defend himself as
> Bower's post doesn't present stuff that's even defensible to a logical
> person. Sometimes truth is so important it is repeated -- see the 4
> gospels on Jesus Christ.
>
> Sincerely,
> Kirk Fraser

Marcus Hutter's FAQ answers most of your questions.
http://prize.hutter1.net/prizefaq.htm

-- Matt Mahoney

Ian Parker

unread,
Aug 17, 2006, 4:40:23 PM8/17/06
to

I think Hutter still has to post on multilingualism. Why English? This
is one thing he does not cover. From all the other things he says I
would have thought multilingualism a natural part of what he is trying
to do.

- Ian Parker

Matt Mahoney

unread,
Aug 17, 2006, 7:44:29 PM8/17/06
to

I discuss this in in
http://www.cs.fit.edu/~mmahoney/compression/rationale.html
A multilingual model is sufficient but not necessary to demonstrate AI.
The Turing test is done in one language. If a machine could answer
questions in several languages, then that's nice, but if it only speaks
one language, then it still passes. So why make the problem harder
than it needs to be?

Also, I eventually want to measure the entropy of the data set using
methods like the Shannon game. If it is multilingual, then I need to
find subjects that speak each of the languages, and then I would still
have the wrong answer because there would be translations of articles
containing the same information. To measure the entropy accurately I
would need to find multilingual subjects and present an article in one
language as context while having them predict in the other. This
complication is easily avoided by using just one language.

-- Matt Mahoney

Rob Freeman

unread,
Aug 17, 2006, 7:52:05 PM8/17/06
to
Ian Parker wrote:
> Rob Freeman wrote:
> >
> > What happens if there is no "true by unknown" language model. What
> > happens if if all you really have is text?
>
> ...

>
> You may of course believe that there is some sort of "vital force" in
> human translation and understanding. I doubt though whether many of
> this usergroup's readership would agree.

Are our imaginations really so limited that the choices come down to
"language model" or "vital force"?

Ian, on the basis of past messages I know you believe language
understanding must be based on "relaxation".

The existence of a language model would mean language could be
parametrized by rules.

By saying language understanding must be based on relaxation you are
really also saying a general language model does not exist.

Your very own "relaxation" could be a third alternative to that
mysterious "vital force".

Relaxation is just ad-hoc compression. Which is consistent with the
model these guys are suggesting, in so far that it is based on a proof
all you need for useful intelligence is compression (leaving rules to
be an added assumption, an assumption inconsistent with a proof that a
general solution is impossible, in point of fact.)

-Rob

Ian Parker

unread,
Aug 18, 2006, 6:59:41 AM8/18/06
to
Relaxation simply means that the meanings are defined in an iterative
way. Speech has to be based on relaxation as it is initially ill
defined.

We may be talking slightly at cross purposes.

"El barco attravesta una cerradura" and "La estacion de ressorte" are
examples of GROSS errors in computer translation. If we were to get to
a stage were we could say "In Castile it means this, in Latin America
it means something similar but subtely different" we would have
advanced well beyond the current state of the art. If the differences
in meaning between Spain and Mexico were to become important - in other
words gross errors had been eliminated, VoIP would be virtually error
free.

Also I don't believe that in human/human communication all shades of
meaning are conveyed. Matt Mahoney says, and I agree with this, that
anything the human brain can do a computer can do. True there may be a
spiritual dimension to our lives, but this is certainly not relevant to
basic linguistic understanding.

Ian Parker

unread,
Aug 18, 2006, 7:11:48 AM8/18/06
to

Matt Mahoney wrote:

> Ian Parker wrote:
> A multilingual model is sufficient but not necessary to demonstrate AI.
> The Turing test is done in one language. If a machine could answer
> questions in several languages, then that's nice, but if it only speaks
> one language, then it still passes. So why make the problem harder
> than it needs to be?
>

As I have said multilinualism indicates a Euclidean position.
Primavera, ressote and mamanthal are overt in Spanish but are latent in
English. Latent Semanic Analysis should be providing the correct
positions.

> Also, I eventually want to measure the entropy of the data set using
> methods like the Shannon game. If it is multilingual, then I need to
> find subjects that speak each of the languages, and then I would still
> have the wrong answer because there would be translations of articles
> containing the same information. To measure the entropy accurately I
> would need to find multilingual subjects and present an article in one
> language as context while having them predict in the other. This
> complication is easily avoided by using just one language.
>

To me entropy reflects the ability to guess the next word. Here is
something to think about - Does the ability to measure entropy, imply
you can compress. I proposed a compressor which lised the words in a
language in order of probability at a given point. Yes it does!

The same thing is true with translations and multilingualism. What you
are doing here is second guessing the the original translator.

BTW - Entropy can change on translation. An example of this is the
adjective noun order in Spanish. In French tghe academie francaise has
standardized the order. In French "grand" comes before the noun. In
Spanish the position is optional. You have the "Rio Grande" but in
other places the French order is followed.

I was just saying that a translation forces AI quickly as opposed to
the present compressors which do not give any deep insights.

Matt Mahoney

unread,
Aug 20, 2006, 7:17:32 PM8/20/06
to
Ian Parker wrote:
> Matt Mahoney wrote:
> > Ian Parker wrote:
> > A multilingual model is sufficient but not necessary to demonstrate AI.
> > The Turing test is done in one language. If a machine could answer
> > questions in several languages, then that's nice, but if it only speaks
> > one language, then it still passes. So why make the problem harder
> > than it needs to be?
> >
>
> As I have said multilinualism indicates a Euclidean position.
> Primavera, ressote and mamanthal are overt in Spanish but are latent in
> English. Latent Semanic Analysis should be providing the correct
> positions.

I think it is possible to demonstrate word sense disambiguation in a
monolingual model. For example, "bank" would not predict "interest" in
context "river".

> > Also, I eventually want to measure the entropy of the data set using
> > methods like the Shannon game. If it is multilingual, then I need to
> > find subjects that speak each of the languages, and then I would still
> > have the wrong answer because there would be translations of articles
> > containing the same information. To measure the entropy accurately I
> > would need to find multilingual subjects and present an article in one
> > language as context while having them predict in the other. This
> > complication is easily avoided by using just one language.
> >
> To me entropy reflects the ability to guess the next word. Here is
> something to think about - Does the ability to measure entropy, imply
> you can compress. I proposed a compressor which lised the words in a
> language in order of probability at a given point. Yes it does!

I should have said "estimate" the entropy of a human model, not
"measure". The ability to measure entropy not only implies the ability
to compress (because entropy is a function of the model), but also
knowledge of an ideal model, which is not computable for the general
case.

-- Matt Mahoney

Rob Freeman

unread,
Aug 21, 2006, 1:11:24 AM8/21/06
to
Matt Mahoney wrote:
>
> I should have said "estimate" the entropy of a human model, not
> "measure". The ability to measure entropy not only implies the ability
> to compress (because entropy is a function of the model), but also
> knowledge of an ideal model, which is not computable for the general
> case.

Matt, even if you change "measure" to "estimate", surely you still need
to explain how it is meaningful to try and compute something which it
has been proven is not computable.

-Rob

Jorrit Vermeiren

unread,
Aug 21, 2006, 6:59:49 AM8/21/06
to

Matt Mahoney wrote:
> But let's not be pessimistic. I believe the Wikipedia data set has
> enough information to infer general knowledge and then apply it to
> novel situations occurring later in the text.
(...)

> So if a program can make such inferences, could it not compress better
> than one that couldn't? Or is this not a valid test?

Using your own Occam's Razor reference, the smallest size any data set
can be compressed to would be with a language model that perfectly
models the data *and nothing else*. Anything more would be redundant
and unnecessary to model the data. It would be massively overfitted to
the data in question.

Necessarily, compressing the data more will result in a loss of ability
to predict things not contained within the data.

Ian Parker

unread,
Aug 21, 2006, 4:23:19 PM8/21/06
to

Matt Mahoney wrote:
> > As I have said multilinualism indicates a Euclidean position.
> > Primavera, ressote and mamanthal are overt in Spanish but are latent in
> > English. Latent Semanic Analysis should be providing the correct
> > positions.
>
> I think it is possible to demonstrate word sense disambiguation in a
> monolingual model. For example, "bank" would not predict "interest" in
> context "river".
>
Yes it is. However a 2L is in my view the quickest way to demonste
this. I could talk about fishing on the bank of a river(sp orilla,
rivera) for hours. If it taked about interest rates I could well
conclude that it was making some kind of joke. Spanish tells me
instantly what the level of understanding is. I could find out in pure
English but it would take a lot longer.

Certaily if it cannot understand that you don't fish for €50,000 it
will be unlikely to compress to the Hutter level.. Herr Hutter has set
the compression level to be just beyond was is possible with fractal
type compressors. €50,000 can only be won with AI. The problem really
comes with the lesser prizes. I would be sorry to see €10,000 going
to a slighly more efficient fractal.

Matt, you are on the committee. What precisely are the criteria for the
lesser prizes and how much discretion do the committee have? The top
prize is or course awarded according to a narrowly defined criterion.


>
> I should have said "estimate" the entropy of a human model, not
> "measure". The ability to measure entropy not only implies the ability
> to compress (because entropy is a function of the model), but also
> knowledge of an ideal model, which is not computable for the general
> case.
>

If you ESTIMATE but do not MEASURE entropy it is subjective.

- Ian Parker