Introducing the Hutter Prize for Lossless Compression of Human Knowledge

James A. Bowery

unread,

Aug 6, 2006, 3:39:16 PM8/6/06

to

Introducing the Hutter Prize for Lossless Compression of Human
Knowledge

Artificial intelligence researchers finally have an objective and
rigorously validated measure of the intelligence of their machines.
Furthermore the higher the measured intelligence of their machines the
more money they can win via the Hutter Prize.

The purse for the Hutter Prize was initially underwritten with a 50,000
Euro commitment to the prize fund by Marcus Hutter of the Swiss Dalle
Molle Institute for Artificial Intelligence, affiliated with the
University of Lugano and The University of Applied Sciences of Southern
Switzerland.

The theoretic basis of the Hutter Prize is related to an insight by the
12th century philosopher, William of Ockham, called "Ockham's Razor",
sometimes quoted as: "It is vain to do with more what can be done with
less." But it was not till the year 2000 that this was mathematically
proven*, by Marcus Hutter, to be a founding principle of intelligence.
Indeed, Hutter's Razor** might be phrased, "It is truer to explain with
less that which can be explained with more."

There have been previous tests and related prizes for artificial
intelligence, such as the Turing Test and the Loebner Prize. However,
these tests suffered from subjective definitions of intelligence.
Hutter's recent theoretic breakthrough creates a mathematics of
artificial intelligence which accurately measures the degree of
intelligence possessed by an artificial agent. It does so by measuring
how succinctly it represents knowledge of the world. As Hutter has now
proven, the most succinct computer model of the world isn't just the
most aesthetic or memory-efficient out of all models of the known
observations -- it also most accurately predicts new observations. In
short, it is the most intelligent.

Artificial intelligence has thereby entered the realm of engineering:
Lossless compression of human knowledge.

This is momentous because by optimizing for rigorous metrics, the field
of artificial intelligence may finally clarify the murky waters of
inadequate definition, within which it has been haphazardly swimming
for the last 50 years, to become both a hard science and tractable
engineering discipline.

Named for the discoverer of the proof and the initial 50,000 Euro
donor, the Hutter Prize currently targets the compression of a 100
megabyte sample of human knowledge drawn from the broadly based
Wikipedia online encyclopedia. As Moore's Law increases the capacity
of machines, and as additional donations to the prize fund increase the
incentives of contestants, the intent is to increase the amount of
knowledge targeted for compression. It is reasonable to expect that
the 100 megabyte sample will produce, at the very least, advances in
linguistic modeling. As the targeted depth and breadth of knowledge
increases, conceptual frameworks will come into play, eventually
covering the range of disciplines from political science to physics by
applying theories that prove optimal in compressing the target.

A common objection to this approach to artificial intelligence is that
it offers little that is new -- that the computational difficulty of
searching for patterns in data remains what it has always been. This
objection misses two important points:

1) Hutter's proof provides a new mathematics of intelligence allowing
for "top down" theoretic advances which may render many problems
tractable that otherwise appear intractable.

2) There is a large overlap between succinctly codified knowledge and
an intelligent compression program. Indeed, a reasonable definition of
"knowledge" is that it optimizes the compression of new observations as
instances of old patterns. This means that even if a compressor does
nothing but apply codified human knowledge, generating no new knowledge
of its own, it can still demonstrate greater intelligence than
competing programs and thereby make measurable progress toward
artificial intelligence but also -- and this is key -- progressively
more intelligent bodies of human-generated knowledge by pitting those
bodies of knowledge against each other in what might be called an
epistemological tournament.

The formula for winnings is modeled after the M-Prize or Methuselah
Mouse Prize, which awards money to longevity researchers for progress
in keeping mice alive the longest. Here, modified for compression
ratios, is the formula:

S = size of program outputting the uncompressed knowledge
Snew = new record
Sprev = previous record
P = [Sprev - Snew] / Sprev = percent improvement

Award monies:

Fund contains: Z at noon GMT on day of new record
Winner receives: Z * P

Initially Z is 50,000 Euro with a minimum payout of 500 Euro (or
minimum improvement of 1% over the prior winner).

Donations are welcome. The history of improvement in The Calgary
Corpus Compression Challenge*** is about 3% per year. The larger the
commitment from donors to the fund, the greater the rate of progress
toward a high quality body of human knowledge and, quite possibly, the
long-held promise of artificial intelligence.

For further details of the Hutter Prize see:

http://prize.hutter1.net

For discussion of the Hutter Prize see:

http://groups.google.com/group/Hutter-Prize

-- Jim Bowery

* http://www.hutter1.net/ai/uaibook.htm

** Hutter's Razor has some caveats relating to the nature of the
universe and computability, but those conditions must be met for any
computer-based intelligence.

*** http://mailcom.com/challenge

Ted Dunning

unread,

Aug 7, 2006, 8:22:32 PM8/7/06

to

Wasn't it Fisher in the 20's who proved just about the same thing about
maximum likelihood estimates?

Certainly, there is nothing new in the thesis that if you have an
estimate q of a probability p, then the using q as a compression model
gives minimum compressed size for p = q. Moreover, this minimum is
unique. This is the same as saying that the maximum likelihood value
of q is p (not quite the same as the maximum likelihood *estimator*, of
course).

For exchangeable events x_i, this is easy to prove using the convexity
of the log function,

log(x) <= x-1

By Finetti's theorem, the x_i are conditionally independent so

p(X) = prod_i p(x_i | theta)

E[ log q(x) ] = sum_x p(x) log q(x)

But, comparing with the entropy of the distribution p,

sum_x p(x) log q(x) + H(p) = sum_x p(x) log q(x) / p(x)

and by the inequality above,

sum_x p(x) log (q(x) / p(x)) <= sum_x p(x) [q(x) / p(x) - 1]
<= sum_x q(x) - sum_x p(x) = 0

Thus

- sum_x p(x) log q(x) >= H(p)

and

- sum_x p(x) log q(x) = H(p) <==> p = q

But this is just the expected length of an optimal compressor
representing X using q as a model. Thus, minimum expected compressed
size implies p = q.

Also, the expected compressed size can be estimated by compressing a
large number of x_i, thus the minimum compressed size estimator is
asymptotically equal to p.

All of the excitement is, of course, before the asymptote kicks in.
How can we estimate p with fewer x_i.

Why this is news is beyond me.

James A. Bowery wrote:
> ... But it was not till the year 2000 that this was mathematically

> proven*, by Marcus Hutter, to be a founding principle of intelligence.
> Indeed, Hutter's Razor** might be phrased, "It is truer to explain with
> less that which can be explained with more."
>

> ... As Hutter has now

Steve Richfie1d

unread,

Aug 8, 2006, 4:12:35 PM8/8/06

to

Ted (and James),

We often argue about lots of things, but I am on your side here.

The insistence on limited program size pretty much excuses it from the
realm of the scientific, as it diverts people's efforts from the claimed
goals of the prize and onto clever and concise programming methods, and
in the process obscures what might otherwise be simple but powerful
compression methods. It also indirectly (and needlessly) presumes an
18MB simplicity in human experience, thought, and expression that
probably just isn't there.

Research programs like this should be coded in some user friendly and
extensively checked medium like Microsoft Access, which probably
requires 18MB just for "Hello World". Even the .NET run-time package
would exceed this. HOWEVER, it is unclear whether the run-time package
would count in the total, but the contest seems to presume an EXE
instead of (say) an MDB like Access would use. Probably the best
approach to squeeze things into the smallest EXE would be to "go retro"
with, say, VB5, but to what end - to try to fit some screwball rule?

Further, the program would be disqualified if it were to correct
grammatical errors and/or summarize cumbersome statements using the very
computer understanding that the prise is supposed to be promoting.

Indeed, abandoning lossless compression would pretty much yield what the
semantic web searching people are looking for once it gets good enough.

Further, this sort of a competition blows right past the ITAR problems
that I have posted in the past - a VERY good thing.

Yes, the world would GREATLY benefit from this sort of research, but the
"minor" flaws in this particular competition appear (to me) to be rather
fatal to its stated goals.

Sad. I was really excited about competing until I saw the flaws.

Steve Richfie1d
===============

Matt Mahoney

unread,

Aug 8, 2006, 5:00:20 PM8/8/06

to

Ted Dunning wrote:
> Wasn't it Fisher in the 20's who proved just about the same thing about
> maximum likelihood estimates?
>
> Certainly, there is nothing new in the thesis that if you have an
> estimate q of a probability p, then the using q as a compression model
> gives minimum compressed size for p = q. Moreover, this minimum is
> unique. This is the same as saying that the maximum likelihood value
> of q is p (not quite the same as the maximum likelihood *estimator*, of
> course).

I think it was Shannon who proved this in 1948 (discete noiseless
channel capacity theorem).
http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
I could not find any reference to Fisher in his paper.

> Why this is news is beyond me.

That isn't what Hutter proved. What he proved is that the optimal
behavior of a goal seeking agent in an unknown but computable
environment is to guess that the environment is controlled by the
shortest program (Turing machine) consistent with all behavior observed
so far. Thus, a solution to the problem of maximizing the accumulated
reward signal from an unknown environment reduces to algorithmic
compression, which is an undecidable problem. However, Hutter also
proved that in the restricted case of an envoronment bounded by space s
and time t, that the problem is computable in time O(t2^s) (so the
problem is still hard).

When I developed the large text benchmark (without funding) my goal was
to promote research in NLP. There are 3 arguments why text compression
implies AI.

1. Hutter's proof in the specific case of text prediction (a problem
that is easy for people but hard for machines).
2. An explicit representation of the probability distribution of
natural language strings (a correct language model) enables one to pass
the Turing test for AI.
3. Compression ratio (perplexity) is already used in practice to
measure the quality of language models (mainly for speech recognition).

However, AI does *not* imply compression. Decompression requires a
deterministic computation to invert the compression function, and the
human brain is not deterministic. I believe this is the reason that
many people doubt the converse. I discuss these issues in
http://cs.fit.edu/~mmahoney/compression/rationale.html

-- Matt Mahoney

Matt Mahoney

unread,

Aug 8, 2006, 5:45:30 PM8/8/06

to

Steve Richfie1d wrote:
> Ted (and James),
>
> We often argue about lots of things, but I am on your side here.
>
> The insistence on limited program size pretty much excuses it from the
> realm of the scientific, as it diverts people's efforts from the claimed
> goals of the prize and onto clever and concise programming methods, and
> in the process obscures what might otherwise be simple but powerful
> compression methods. It also indirectly (and needlessly) presumes an
> 18MB simplicity in human experience, thought, and expression that
> probably just isn't there.

Marcus Hutter and I disagreed on what the size of the test data should
be. I argued that it should be 1 GB because that is what Turing
estimated in 1950 it would take to pass the Turing test (a machine with
10^9 bits of memory, which is 1 GB based on Shannon's 1950 estimate of
1 bpc entropy of written English), Landauer's estimate of 10^9 bits of
human long term memory, and because it is about the amount of language
that the average adult has processed since birth. Hutter argued for
100 MB because it is easier to work with and makes the contest more
accessible to the public. He has a good point because many of the top
compressors are there only because they use all of the 800 MB available
on my test machine.

In the original test design I assume that all knowledge needed to pass
any text-based AI test (such as text prediction or the Turing test) is
learnable solely from text. I realize that the vast knowledge that
humans use in these tasks is not all learned verbally, but I think that
most of it *could* be. For example, a blind person can tell you that
the sky is blue. Also, if knowledge cannot easily be learned verbally,
then it is difficult to prove to somebody that you know it in a verbal
test. For example, how would you prove in the Turing test that you
know how to ride a bicycle, that you know what a banana smells like,
that you could recognize somebody's face? Can you give an example of
*any* human knowledge that cannot be learned verbally but can be
demonstrated verbally?

Thus, to successfully pass the compression test, it will be necessary
for a program to learn a language model from the test data itself as it
compresses/decompresses the data.

> Research programs like this should be coded in some user friendly and
> extensively checked medium like Microsoft Access, which probably
> requires 18MB just for "Hello World". Even the .NET run-time package
> would exceed this. HOWEVER, it is unclear whether the run-time package
> would count in the total, but the contest seems to presume an EXE
> instead of (say) an MDB like Access would use. Probably the best
> approach to squeeze things into the smallest EXE would be to "go retro"
> with, say, VB5, but to what end - to try to fit some screwball rule?

Most compressors are written in C, C++ and assembler for speed. File
I/O does not require any fancy programming environment. The executable
files are typically 100 KB or less. It is necessary to include the
size of the decompressor, otherwise I could trivially write a 100 MB
"decompressor" that simply outputs the data, thus achieving a
compressed size of 0.

> Further, the program would be disqualified if it were to correct
> grammatical errors and/or summarize cumbersome statements using the very
> computer understanding that the prise is supposed to be promoting.

The test data is very clean. Misspelled words and grammatical errors
are rare. This is one reason I chose this data set. See
http://cs.fit.edu/~mmahoney/compression/textdata.html for an analysis
of the data.

>
> Indeed, abandoning lossless compression would pretty much yield what the
> semantic web searching people are looking for once it gets good enough.
>
> Further, this sort of a competition blows right past the ITAR problems
> that I have posted in the past - a VERY good thing.
>
> Yes, the world would GREATLY benefit from this sort of research, but the
> "minor" flaws in this particular competition appear (to me) to be rather
> fatal to its stated goals.
>
> Sad. I was really excited about competing until I saw the flaws.
>
> Steve Richfie1d

-- Matt Mahoney

David Kinny

unread,

Aug 8, 2006, 7:54:11 PM8/8/06

to

"Matt Mahoney" <matma...@yahoo.com> writes:

> ...

> When I developed the large text benchmark (without funding) my goal was
> to promote research in NLP. There are 3 arguments why text compression
> implies AI.

> 1. Hutter's proof in the specific case of text prediction (a problem
> that is easy for people but hard for machines).
> 2. An explicit representation of the probability distribution of
> natural language strings (a correct language model) enables one to pass
> the Turing test for AI.

Not if I'm the one asking the questions. This idea is just
the ludicrous "humungous lookup table" in disguise.

David

David Kinny

unread,

Aug 8, 2006, 8:50:01 PM8/8/06

to

"Matt Mahoney" <matma...@yahoo.com> writes:

> Steve Richfie1d wrote:
> > Ted (and James),
> >
> > We often argue about lots of things, but I am on your side here.
> >
> > The insistence on limited program size pretty much excuses it from the
> > realm of the scientific, as it diverts people's efforts from the claimed
> > goals of the prize and onto clever and concise programming methods, and
> > in the process obscures what might otherwise be simple but powerful
> > compression methods. It also indirectly (and needlessly) presumes an
> > 18MB simplicity in human experience, thought, and expression that
> > probably just isn't there.

> [...]

> Thus, to successfully pass the compression test, it will be necessary
> for a program to learn a language model from the test data itself as it
> compresses/decompresses the data.

This "intelligence test" is so silly, it's hard to know where to start.

- There is no notion of "successfully passing the test", unless you
consider every lossless compressor/decompressor to pass the test, and
that includes ones that are clearly completely unintelligent.

- You are seeking minimal size of the decompressor that reproduces a
particular 100MB text file. A compressor/decompressor scheme that is
specialized for the structure and statistics of that particular file
can obviously do better than one that is more general in ability, but
this is equally obviously less intelligent than the more general one.

- It seems there is no limit on the CPU time required to generate
the decompressor, which again penalizes intelligence and rewards
brute force approaches.

- Why do you imagine that any "understanding" of the text is needed
to achieve maximal compression. A purely syntactic/statistical
approach that ignores any meaning of the text is going to be more
intelligent, by your criterion, since addressing semantics just adds
baggage and size to the decompressor. This might change if you
accepted a "fair paraphrase" rather than an exact reproduction,
but you don't.

In summary, what is proposed is no more or less than a test of the
intelligence and skill of lossless compression programmers, not of
AI programs.

David

Matt Mahoney

unread,

Aug 8, 2006, 11:40:49 PM8/8/06

to

This is the point of measuring compressed size. One way to pass the
Turing test would be to look up the interrogator's exact question in
Google and reply with the text that follows from highest ranked
response. The problem is that your lookup table is several terabytes.
The challenge for you is to replace this huge table with something
smaller. That means instead of estimating the probability of a
sentence by counting Google hits, that you analyze the syntax and
semantics of the sentence to determine if it is correct, because the
number of exact matches is most likely to be zero. You have a choice
of either finding a succinct way of describing the rules that govern
whether a sentence is likely or unlikely, or you can find an efficient
way to learn these rules from just enough unlabled text.

Using compression ratio to measure the quality of a language model is
nothing new. Speech recognition researchers have been doing it for
years.

-- Matt Mahoney

Matt Mahoney

unread,

Aug 9, 2006, 12:30:53 AM8/9/06

to

David Kinny wrote:
> "Matt Mahoney" <matma...@yahoo.com> writes:
>
> > Steve Richfie1d wrote:
> > > Ted (and James),
> > >
> > > We often argue about lots of things, but I am on your side here.
> > >
> > > The insistence on limited program size pretty much excuses it from the
> > > realm of the scientific, as it diverts people's efforts from the claimed
> > > goals of the prize and onto clever and concise programming methods, and
> > > in the process obscures what might otherwise be simple but powerful
> > > compression methods. It also indirectly (and needlessly) presumes an
> > > 18MB simplicity in human experience, thought, and expression that
> > > probably just isn't there.
>
> > [...]
> > Thus, to successfully pass the compression test, it will be necessary
> > for a program to learn a language model from the test data itself as it
> > compresses/decompresses the data.
>
> This "intelligence test" is so silly, it's hard to know where to start.
>
> - There is no notion of "successfully passing the test", unless you
> consider every lossless compressor/decompressor to pass the test, and
> that includes ones that are clearly completely unintelligent.

We did not set a criteria for "passing" the test because we do not know
exactly what level of compression is needed. Shannon estimated in 1950
that the written English (using a 27 character alphabet of A-Z and
space) has an entropy of between 0.6 and 1.3 bits per character. This
is better than the best compressors can do, but not by much.

Humans are better than machines at text prediction, e.g. guessing the
next character or word in "roses are ___" or "3+8=__". Clearly this
task requires lexical, syntactic, semantic and vast, real world
knowledge. For a machine to pass the AI test, it must do as well as
humans by some measure, such as the number of guesses on average until
the correct answer is guessed. The measure we use is to assign
probabilities to all possible choices and a score of -log(p) for the
probability p assigned by the guesser to the correct answer. Shannon
proved that this score is minimized when the guesser assigns the exact
same probabilities as the true (but unknown) probability distribution.

A problem with this approach is that it is conceptually difficult for
people to assign numeric probabilities. This is why Shannon used a
guessing game instead, but this resulted in the large error bounds I
described, because different distributions could lead to the same
series of guesses. Therefore, a task that remains is to more
accurately estimate the entropy of written English with respect to
humans. In particular, we need to estimate the entropy for the exact
text used in this particular data set. Until that is done, we can only
reward improvements in the state of the art.

> - You are seeking minimal size of the decompressor that reproduces a
> particular 100MB text file. A compressor/decompressor scheme that is
> specialized for the structure and statistics of that particular file
> can obviously do better than one that is more general in ability, but
> this is equally obviously less intelligent than the more general one.

I expect that a winning decompressor will be tuned specifically to this
data set. That is why the size of the decompressor is included. The
statistics can either be learned from the data or they can be hard
coded into the decompressor by the programmer. It does not matter to
us which way it is done. Any tuning done in the decompressor will
decrease the compressed file size but increase the size of the
decompressor. It is within the rules to write a "decompressor" that
simply outputs the test file verbatim, but good luck with the size of
the program.

> - It seems there is no limit on the CPU time required to generate
> the decompressor, which again penalizes intelligence and rewards
> brute force approaches.

I expect that NLP will be CPU and/or memory intensive, because other
approaches have not worked. We do not want to restrict the range of
possible solutions any more than necessary. Thus, we only require
"reasonable" computation limits by specifying that the solution must be
testable by the public and have a 30 day period of public comment. If
nobody can run the program in that time, then it loses. This rule
allows for better models as computers become more powerful.

> - Why do you imagine that any "understanding" of the text is needed
> to achieve maximal compression. A purely syntactic/statistical
> approach that ignores any meaning of the text is going to be more
> intelligent, by your criterion, since addressing semantics just adds
> baggage and size to the decompressor. This might change if you
> accepted a "fair paraphrase" rather than an exact reproduction,
> but you don't.

A "fair paraphrase" implies lossy compression, which cannot be measured
objectively. Who is to say if a paraphrase is fair or not? The extra
information needed to encode the exact text is small, and is the same
for all contestants.

And what is "understanding"? When the semantics of words are not
grounded to sensory or motor I/O, semantics is just a fuzzy identity
relation: two words are related if they are likely to appear near each
other in running text. Models like LSA exploit the transitive property
of semantics. If "moon" appears near "star" and "star" appears near
"night", then the model predicts that "night" will appear near "moon".
Semantics is just a statistical property.

> In summary, what is proposed is no more or less than a test of the
> intelligence and skill of lossless compression programmers, not of
> AI programs.

Yes, that is what the test is designed to do, unless you think that
machines can think.

> David

-- Matt Mahoney

David Kinny

unread,

Aug 9, 2006, 5:32:20 AM8/9/06

to

"Matt Mahoney" <matma...@yahoo.com> writes:

I don't dispute the relevance of compression in measuring the quality
of a language model. What I dispute is the notion that you can pass
the Turing test by looking up responses in a table, compressed or not.
My claim is that I, as judge, can easily construct a sequence of
questions that cannot be answered correctly by table lookup, as the
correct answers depends on previous aspects of the conversation, or
other variable context information not even part of the conversation:

Me: I was born in December 1966.
System: That's nice. [[or whatever, no correct answer expected here]]
Me: How many days left before Xmas, not counting today?
System: ??

Despite the inadequacy of a table/language model to generate the right
answer, let's suppose somehow it comes up with it. I continue:

Me: OK, add that to the number of whole years since the first moon
landing, and substract my age, what do you get?
System: ??

Do you see the problem now?

David

David Kinny

unread,

Aug 9, 2006, 6:17:10 AM8/9/06

to

"Matt Mahoney" <matma...@yahoo.com> writes:
>David Kinny wrote:

>[...]

>> In summary, what is proposed is no more or less than a test of the
>> intelligence and skill of lossless compression programmers, not of
>> AI programs.

>Yes, that is what the test is designed to do, unless you think that
>machines can think.

Well, I'm glad we agree on what the test measures. Why then did the
original announcement contain the following very misleading claim:

"Artificial intelligence researchers finally have an objective and
rigorously validated measure of the intelligence of their machines.
Furthermore the higher the measured intelligence of their machines
the more money they can win via the Hutter Prize."

The problem is that "AI researchers' machines" are almost always
intended to do things other than lossless compression of slabs of
text, so the test would invariably claim they have no intelligence.

David

Matt Mahoney

unread,

Aug 9, 2006, 12:47:37 PM8/9/06

to

How do you propose that intelligence (artificial or otherwise) be
measured?

-- Matt Mahoney

Ted Dunning

unread,

Aug 9, 2006, 12:58:46 PM8/9/06

to

Matt Mahoney wrote:
> Ted Dunning wrote:
> > Wasn't it Fisher in the 20's who proved just about the same thing about
> > maximum likelihood estimates?

> I think it was Shannon who proved this in 1948 (discete noiseless

> channel capacity theorem).
> http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
> I could not find any reference to Fisher in his paper.

I was referring to the asymptotic correctness of MLE's, not the
compression. The fundamental insight is that the maximum likelihood
estimator converges to truth. The fact that the MLE also gives you
optimal compression is an add-on to the basic idea and does indeed
follow from Shannon's work.

David Kinny

unread,

Aug 9, 2006, 7:48:41 PM8/9/06

to

>-- Matt Mahoney

By carefully crafted tests of knowledge, capability, comprehension,
and ability to reason, to recognise, to solve problems, to learn...

David

James A. Bowery

unread,

Aug 9, 2006, 9:06:18 PM8/9/06

to

Think about it like this:

You've heard of functions, right? They are things with inputs and
outputs. All possible inputs are called the domain. All possible
outputs are called the range. One way to implement functions is to
make a two column table with one column labeled "inputs" and the other
labeled "outputs" where each row is what the function outputs for the
given input. You'd call it a lookup table.

Now, imagine a program that, for every _sequence_ of N text inputs, i1,
i2, i3...iN, creates N rows in the table, the first row has input of
i1, and some output o1. The input for the next row is i2 appended to
the end of i1 and the output is o2... and you continue this until you
have completed all N inputs. Agreed this could be a VERY BIG table but
so could the table representing the function "+" between two 64 bit
numbers. Yet the size of the program you need to implement the
function "+" between two 64 bit numbers is very small... measured in
only tens of bytes on some computers. Incredible isn't it? That's the
magic of what is called Kolmogorov complexity.

Anyway, so let's go back to our ridiculously big text function and ask
ourselves, how big does the program have to be to generate the program
that's equivalent to that table?

Would you agree that it might be somewhat smaller than the table itself?

Matt Mahoney

unread,

Aug 9, 2006, 11:15:23 PM8/9/06

to

I don't claim that a language model can be implemented as a lookup
table. I claim

1. that a better language model is more likely to pass the Turing test,
2. that a better language model will compress text better.

By language model, I mean a probability distribution over the set of
all strings. So a model will assign p("Today is Aug. 10, 2006 ... I
was born in (rest of dialog)... what do you get? x") some number in
the open interval (0,1) for all x.

Now let q be a dialog ending with a question, and x1 and x2 be two
posible answers. I think it is reasonable to assume that the following
statements are equivalent (all true or all false) for all q, x1, x2:

1. A human picked at random and asked question q is more likely to
answer x1 than x2.
2. A human picked at random is more likely to judge the pair q,x1 as
correct than q,x2.
3. A text corpus of human origin is more likely to contain q,x1 than
q,x2.

Now suppose that a machine is able to compute p() such that the above
statements are equivalent to p(q,x1) > p(q,x2) for all q, x1, and x2.
Then that machine would answer all questions in a way that is
indistinguishable from a human. It would pass the Turing test.

Now suppose that we have two models p1() and p2() and want to know
which is better (closer to p)? One way is to test it is to ask it a
multiple choice question q with answers x1 and x2, and have each
program output the answer with the higher probability according to its
model. So if p1(q,x1) > p1(q,x2) in model p1() then the program
answers x1. If p1() answers x1 and p2() answers x2 and a human judge
decides that x1 is more correct, then p1() is judged more human.

Another test would be for the judge to first decide the correct answer
(say, x1), then ask both models to explicitly compute p1(q,x1) and
p2(q,x1). The model that gives the higher value wins. We don't
normally test humans this way because it is conceptually difficult for
people to assign numeric probabilities. But it is not hard for
machines.

This test can be repeated, for example, by multiplying the scores for
several independent questions to compute the probability that all of
the answers are correct. So given a series of questions q1, q2, q3,...
with correct answers x1, x2, x3..., the winner is the one with the
greatest value of PROD_i [p(x_i|q_i)], or equivalently, the smallest
value of -SUM_i [log p(x_i|q_i)], which is the compressed size of the
answers using ideal coding.

The advantage of this approach is that no human is required to evaluate
the language model. The score is exact, repeatable, and verifiable.

What I tried to show is that good text compression requires
sophisticated reasoning ability and vast knowledge, just as your
example Turing test session does.

-- Matt Mahoney

Matt Mahoney

unread,

Aug 9, 2006, 11:43:08 PM8/9/06

to

David Kinny wrote:
> >How do you propose that intelligence (artificial or otherwise) be
> >measured?
>
> >-- Matt Mahoney
>
> By carefully crafted tests of knowledge, capability, comprehension,
> and ability to reason, to recognise, to solve problems, to learn...
>
> David

If you were to run a competition lasting for many years among AI
systems with cash prizes, how would you design the tests so that the
competition is fair, scores are 100% objective, and systems can't
cheat?

-- Matt Mahoney

David Kinny

unread,

Aug 10, 2006, 6:29:34 PM8/10/06

to

You claimed above that "One way to pass the Turing test would be to

look up the interrogator's exact question in Google and reply with
the text that follows from highest ranked response. The problem is
that your lookup table is several terabytes."

This claim is nonsense - the prospect of passing the Turing test
in this way is effectively nil.

> I claim
> 1. that a better language model is more likely to pass the Turing test,
> 2. that a better language model will compress text better.

Another claim you made, to which I responded above, is:

"2. An explicit representation of the probability distribution of
natural language strings (a correct language model) enables one
to pass the Turing test for AI."

You use the phrase "An explicit representation of ... ", which certainly
sounds like a table of probabilities indexed by strings, and "enables
one to pass the Turing test", which is a very strong claim. But it
seems you now want to downgrade this to a program that computes
probabilities, and to "is more likely to pass the TT". Clearly a
program that has an infinitesimal chance of passing the TT is more
likely to pass than one that has zero chance, but it's not very
reasonable to claim that such a program can pass the TT if its
chances of soing so are sufficiently small. I could create a perl
script that has a non-zero chance to pass the TT in half an hour.

> By language model, I mean a probability distribution over the set of
> all strings. So a model will assign p("Today is Aug. 10, 2006 ... I
> was born in (rest of dialog)... what do you get? x") some number in
> the open interval (0,1) for all x.

Before I try to respond to the rest of this, please clarify for me:

- What you meant by "an explict representation of ..."

- What, exactly, does the probability of some string S represent here?
The probability of S occurring next in a particular document?
P(S occurs next in the whole corpus of written English text)?
P(S occurs next in a randomly chosen spoken English conversation)?
P(S occurs next in a randomly chosen Turing Test)?

- Do you really mean "over the set of ALL strings"? Language models
typically deal with n-grams for some small n. Are you proposing
an idealised, infinitely complex model? Or what?

David

Matt Mahoney

unread,

Aug 10, 2006, 9:24:28 PM8/10/06

to

David Kinny wrote:
> You claimed above that "One way to pass the Turing test would be to
> look up the interrogator's exact question in Google and reply with
> the text that follows from highest ranked response. The problem is
> that your lookup table is several terabytes."
>
> This claim is nonsense - the prospect of passing the Turing test
> in this way is effectively nil.

OK, it's nil. The point I was trying to make is that if you have a
small training corpus, then you have to be smart about how you learn a
language model from it. If it's large, you can use a brute force
string matching approach. I believe that 1 GB is the right size, large
enough to learn syntax and semantics, but small enough that you can't
just search for the judge's question and output whatever follows.

It is controversial whether a language model can be learned from an
unlabeled corpus of positive examples, but I believe it is possible.
It is easy to extract a lexical model (dictionary). There are
statistical semantic models such as LSA, and low level syntactic models
such as hidden Markov models, that have been learned on data sets of
similar size. I believe it is possible with the Wikipedia dataset. It
is doubtful to me that the data is sufficient to learn higher level
reasoning and problem solving skills like arithmetic without some help
from the programmer, but we shall see.

> > I claim
> > 1. that a better language model is more likely to pass the Turing test,
> > 2. that a better language model will compress text better.
>
> Another claim you made, to which I responded above, is:
> "2. An explicit representation of the probability distribution of
> natural language strings (a correct language model) enables one
> to pass the Turing test for AI."
>
> You use the phrase "An explicit representation of ... ", which certainly
> sounds like a table of probabilities indexed by strings, and "enables
> one to pass the Turing test", which is a very strong claim. But it
> seems you now want to downgrade this to a program that computes
> probabilities, and to "is more likely to pass the TT". Clearly a
> program that has an infinitesimal chance of passing the TT is more
> likely to pass than one that has zero chance, but it's not very
> reasonable to claim that such a program can pass the TT if its
> chances of soing so are sufficiently small. I could create a perl
> script that has a non-zero chance to pass the TT in half an hour.

It is irrelevant whether a Turing machine implements a language model
as an algorithm or a lookup table.

> > By language model, I mean a probability distribution over the set of
> > all strings. So a model will assign p("Today is Aug. 10, 2006 ... I
> > was born in (rest of dialog)... what do you get? x") some number in
> > the open interval (0,1) for all x.
>
> Before I try to respond to the rest of this, please clarify for me:
>
> - What you meant by "an explict representation of ..."
>
> - What, exactly, does the probability of some string S represent here?
> The probability of S occurring next in a particular document?
> P(S occurs next in the whole corpus of written English text)?
> P(S occurs next in a randomly chosen spoken English conversation)?
> P(S occurs next in a randomly chosen Turing Test)?
>
> - Do you really mean "over the set of ALL strings"? Language models
> typically deal with n-grams for some small n. Are you proposing
> an idealised, infinitely complex model? Or what?
>
> David

A language model is a function, p1(): S -> [0,1] where S is the set of
all string over some alphabet, e.g. S = {0,1}*, and furthermore, p1(s)
>= 0, and SUM_s p1(s) = 1 for s in S.

Define p(s) as the probability that s will occur in a two way dialog
between a human judge and a human confederate in the Turing test. s is
a prefix of the entire dialog from beginning to end. p(s) is unknown,
like all probabilities in real life. p(s) depends on the judge, the
confederate, and when the test is given.

A model, p1(), is an approximation of p(). The cross entropy is
defined as H1 = -SUM_s p(s) log p1(s), for all s in S. The cross
entropy is a measure of how accurately p1() approximates p(). H1 is
minimized when p1(s) = p(s) for all s in S. Note that H1 is the
expected size of s chosen randomly from distribution p() when
compressed with model p1(). It is required that 0 < p1(s) < 1 for all
s in S.

When I say there is an explicit representation of p1(), I mean that
there is a Turing machine that computes it.

I realize that Wikipedia has a different distribution than p(s) for
Turing tests. The proper training corpus for passing the Turing test
would be a set of transcripts between judges and human confederates,
like those at the Loebner prize website. Chat logs would also be a
better choice than Wikipedia. I argue that this difference is an
irrelevant detail. The main difference is that Wikipedia articles are
not interactive, so a machine training on it would not learn to hold a
conversation. Other than that, text prediction still requires high
level language comprehension, common sense, reasoning, and problem
solving abilities just like passing the Turing test does. If you can
model written articles, then I think extending the model to interactive
conversation would not be too hard.

-- Matt Mahoney

Ted Dunning

unread,

Aug 11, 2006, 5:05:37 PM8/11/06

to

David Kinny wrote:
> Another claim you made, to which I responded above, is:
> "2. An explicit representation of the probability distribution of
> natural language strings (a correct language model) enables one
> to pass the Turing test for AI."
>
> You use the phrase "An explicit representation of ... ", which certainly
> sounds like a table of probabilities indexed by strings, and "enables
> one to pass the Turing test", which is a very strong claim.

It may sound like that to you, but it doesn't sound like that to
somebody who is taking just a bit more time to listen.

An explicit representation is one that is explicitly defined. A table
of logarithms is an explicit representation of the log function. So is
a series expansion.

> But it
> seems you now want to downgrade this to a program that computes
> probabilities, and to "is more likely to pass the TT".

There is no downgrading involved. This is conventional usage.

You are fighting a straw man for no discernible reason.

David Kinny

unread,

Aug 12, 2006, 4:50:32 AM8/12/06

to

Sorry for the delay in getting back to you here ...

That might be true in theory, but we are concerned here with what can
be feasibly implemented in current or future practice, are we not?

> > > By language model, I mean a probability distribution over the set of
> > > all strings. So a model will assign p("Today is Aug. 10, 2006 ... I
> > > was born in (rest of dialog)... what do you get? x") some number in
> > > the open interval (0,1) for all x.
> >
> > Before I try to respond to the rest of this, please clarify for me:
> >
> > - What you meant by "an explict representation of ..."
> >
> > - What, exactly, does the probability of some string S represent here?
> > The probability of S occurring next in a particular document?
> > P(S occurs next in the whole corpus of written English text)?
> > P(S occurs next in a randomly chosen spoken English conversation)?
> > P(S occurs next in a randomly chosen Turing Test)?
> >
> > - Do you really mean "over the set of ALL strings"? Language models
> > typically deal with n-grams for some small n. Are you proposing
> > an idealised, infinitely complex model? Or what?

> A language model is a function, p1(): S -> [0,1] where S is the set of

> all string over some alphabet, e.g. S = {0,1}*, and furthermore, p1(s)
> >= 0, and SUM_s p1(s) = 1 for s in S.

> Define p(s) as the probability that s will occur in a two way dialog
> between a human judge and a human confederate in the Turing test. s is
> a prefix of the entire dialog from beginning to end. p(s) is unknown,
> like all probabilities in real life. p(s) depends on the judge, the
> confederate, and when the test is given.

Here I'm a little confused. You say that s is a prefix of the dialog,
and that p(s) depends on the judge, etc, and when the test is given,
implying that this p() is for a *specific* dialog. In that case the
distribution is trivial: all k possible prefixes have probability 1/k,
all other strings 0 (or some variant of this). Surely you wish to
consider the distribution over some set of dialogs, but which set?

For a given p() to be useful in passing the TT, it has to adequately
cover possible dialogs. It's easy, as judge, to create novel dialogs.
For example, the following short question whose answer is obvious has
almost certainly never been asked before in the history of mankind.

How many legs do 3 ants, an octopus and 4 aardvarks have in total?

Then there's the issue of background knowledge. In a previous posting
you addressed the need to know such things as the date by prefixing
that information onto the actual dialog. How do you propose to
decide which background information to add in explicitly?

It seems, the way this is heading, that you need p() to be some
idealized distribution over the set of all possible human to human TT
dialogs prefixed by all necessary background information. In that
case it doesn't depend on the judge, the time, etc. Right?

> A model, p1(), is an approximation of p(). The cross entropy is
> defined as H1 = -SUM_s p(s) log p1(s), for all s in S. The cross
> entropy is a measure of how accurately p1() approximates p(). H1 is
> minimized when p1(s) = p(s) for all s in S. Note that H1 is the
> expected size of s chosen randomly from distribution p() when
> compressed with model p1(). It is required that 0 < p1(s) < 1 for all
> s in S.

> When I say there is an explicit representation of p1(), I mean that
> there is a Turing machine that computes it.

I'll return below to the question of how such a p1() could be created,
but let's move on to another issue. Let's assume that you have an
actual, computationally tractable implementation of an algorithm that
computes p1(), which is a very good approximation of some suitably
extensive p(). Now how can you actually use this to pass the TT?

You've been asked a question, and the dialog so far is the string d.
You need to find a suffix x such that p1(d+x) is non-zero (I assume
you'll settle for any suitable answer rather than the most likely).
This means searching over the space of possible strings x and
repeatedly evaluating p1(d+x). This is not feasible. And if your
algorithm in fact takes d as input and returns an x s.t. p1(d+x) > 0,
it is not a language model as you've defined it, but something else.

For completeness, I'm just going to backtrack to something you wrote
earlier in the thread.

> Now let q be a dialog ending with a question, and x1 and x2 be two
> posible answers. I think it is reasonable to assume that the following
> statements are equivalent (all true or all false) for all q, x1, x2:

> 1. A human picked at random and asked question q is more likely to
> answer x1 than x2.
> 2. A human picked at random is more likely to judge the pair q,x1 as
> correct than q,x2.
> 3. A text corpus of human origin is more likely to contain q,x1 than
> q,x2.

I'd agree that 1 and 2 are equivalent, but dispute the equivalence of 3.
The reason is that, except for a vanishingly small number of dialogs q,
the probability of a text corpus of human origin containing q,x1 is
identical to the probability of it containing q,x2 , namely 0. As I've
shown it is easy to create dialogs such that this is the case.

> I realize that Wikipedia has a different distribution than p(s) for
> Turing tests. The proper training corpus for passing the Turing test
> would be a set of transcripts between judges and human confederates,
> like those at the Loebner prize website. Chat logs would also be a
> better choice than Wikipedia. I argue that this difference is an
> irrelevant detail. The main difference is that Wikipedia articles are
> not interactive, so a machine training on it would not learn to hold a
> conversation. Other than that, text prediction still requires high
> level language comprehension, common sense, reasoning, and problem
> solving abilities just like passing the Turing test does. If you can
> model written articles, then I think extending the model to interactive
> conversation would not be too hard.

> -- Matt Mahoney

So finally you gloss over several issues, dismissing them as irrelevant
details, or easy. You claim, apparently as an act of faith, that:

- It will be feasible to learn a sufficiently good language model from a
tractably small corpus. This is on a par with claiming that it will
be feasible to learn to play championship Go just from example games.

- Given such a model, it will be feasible to use it to pass the TT.

- The difference between the language model of TT dialogs and of a slab
of Wikipedia is unimportant. You, the programmer, can somehow extend
or convert the one into the other.

- The limited extent of the corpus will not matter. Somehow your LM
will have the capability to generalise, extrapolate, answer novel
questions, incorporate all necessary background information, know
how to do computations, solve simple problems, remember things, ...

Here's my take. It will, eventually, be feasible to create programs
that do all these latter things well enough to pass a TT, even in the
face of the types of lines of questioning I've pointed to. Inevitably
such systems will acquire much of their knowledge by learning, from
conversations and texts amongst other things. But they will be so
much more than just language models, in terms of what they consist of
and what they are capable of doing. They could actually be used to
generate language models of a certain type, ones where probabilities
were significant only in whether they were zero or not. Indeed, I
suspect that the only way you can generate a LM with sufficient
coverage to (in theory) pass the TT, is to use a system with such
capabilities as the generator.

To put it simply, I'd claim the only way you can create a LM good
enough to (in theory) pass the TT is to build a system that can
already pass the TT without any need to create that LM. And you
don't achieve that just as a side-effect of achieving compression
of 1G slabs of text that is close to the information-theoretic limit,
as that task can be achieved by much much simpler approaches.
Compressing text better is at best just one small step on the path
to implementing systems that pass the TT.

David

David Kinny

unread,

Aug 12, 2006, 5:01:03 AM8/12/06

to

"Ted Dunning" <ted.d...@gmail.com> writes:
> David Kinny wrote:
> > Another claim you made, to which I responded above, is:
> > "2. An explicit representation of the probability distribution of
> > natural language strings (a correct language model) enables one
> > to pass the Turing test for AI."
> >
> > You use the phrase "An explicit representation of ... ", which certainly
> > sounds like a table of probabilities indexed by strings, and "enables
> > one to pass the Turing test", which is a very strong claim.

> It may sound like that to you, but it doesn't sound like that to
> somebody who is taking just a bit more time to listen.

Steady on there Ted. Matt began by claiming that one could pass the
TT using a several terabyte lookup table. I'm the one who is taking
the time here to try and clarify exactly what he really means, and it
keeps changing from the laughable pop-science claims of the original
announcement. Hopefully now it's approaching something reasonable.

> An explicit representation is one that is explicitly defined. A table
> of logarithms is an explicit representation of the log function. So is
> a series expansion.

I agree, both are explicit. I really don't want to quibble about
semantics, but I'd say that an algorithm that computes logarithms in
some non-transparent way is by contrast an implicit representation.
Your usage of language may vary.

> > But it
> > seems you now want to downgrade this to a program that computes
> > probabilities, and to "is more likely to pass the TT".

> There is no downgrading involved. This is conventional usage.

The change from "can pass" to "is more likely to pass", from a
strong claim to a weak one, is certainly downgrading. And a program
is potentially feasible, while a lookup table is certainly not, as
it seems we all now agree.

> You are fighting a straw man for no discernible reason.

There is confusion being generated here that deserves to be dispelled.

David

Matt Mahoney

unread,

Aug 12, 2006, 1:18:22 PM8/12/06

to

David Kinny wrote:
> "Matt Mahoney" <matma...@yahoo.com> writes:

> > It is irrelevant whether a Turing machine implements a language model
> > as an algorithm or a lookup table.
>
> That might be true in theory, but we are concerned here with what can
> be feasibly implemented in current or future practice, are we not?

In practice, Turing machines don't exist.

> > Define p(s) as the probability that s will occur in a two way dialog
> > between a human judge and a human confederate in the Turing test. s is
> > a prefix of the entire dialog from beginning to end. p(s) is unknown,
> > like all probabilities in real life. p(s) depends on the judge, the
> > confederate, and when the test is given.
>
> Here I'm a little confused. You say that s is a prefix of the dialog,
> and that p(s) depends on the judge, etc, and when the test is given,
> implying that this p() is for a *specific* dialog. In that case the
> distribution is trivial: all k possible prefixes have probability 1/k,
> all other strings 0 (or some variant of this). Surely you wish to
> consider the distribution over some set of dialogs, but which set?

In real life, probabilities are always unknown. Different people will
estimate them differently depending on what they know. For example, I
flip a fair coin, look at it, but don't show it to you. What is the
probability that it is heads? You say 1/2. I say 1. Both of us are
right.

I define p(s) as the probability that s will be a prefix of a dialog in
a Turing test, given that you do not know who will be the judge or the
confederate or when the test will be given.

I realize that p() is still not well defined. But that is the nature
of probability.

> For a given p() to be useful in passing the TT, it has to adequately
> cover possible dialogs. It's easy, as judge, to create novel dialogs.
> For example, the following short question whose answer is obvious has
> almost certainly never been asked before in the history of mankind.
>
> How many legs do 3 ants, an octopus and 4 aardvarks have in total?

And yet, the probability that this will answered in an unknown training
corpus will not be exactly 0.

Also, I think you will agree that this question and the answer could be
phrased as a text prediction problem, in which case a compressor that
can solve problems like this will compress smaller than one that
cannot.

> Then there's the issue of background knowledge. In a previous posting
> you addressed the need to know such things as the date by prefixing
> that information onto the actual dialog. How do you propose to
> decide which background information to add in explicitly?

If a machine responds too fast to my questions, I will know it is not
human. A dialog must contain information about who said what and when.
You can represent this information how you like.

> It seems, the way this is heading, that you need p() to be some
> idealized distribution over the set of all possible human to human TT
> dialogs prefixed by all necessary background information. In that
> case it doesn't depend on the judge, the time, etc. Right?

It does depend, but you don't have this information, so you have to
make reasonable guesses.

> > A model, p1(), is an approximation of p(). The cross entropy is
> > defined as H1 = -SUM_s p(s) log p1(s), for all s in S. The cross
> > entropy is a measure of how accurately p1() approximates p(). H1 is
> > minimized when p1(s) = p(s) for all s in S. Note that H1 is the
> > expected size of s chosen randomly from distribution p() when
> > compressed with model p1(). It is required that 0 < p1(s) < 1 for all
> > s in S.
>
> > When I say there is an explicit representation of p1(), I mean that
> > there is a Turing machine that computes it.
>
> I'll return below to the question of how such a p1() could be created,
> but let's move on to another issue. Let's assume that you have an
> actual, computationally tractable implementation of an algorithm that
> computes p1(), which is a very good approximation of some suitably
> extensive p(). Now how can you actually use this to pass the TT?
>
> You've been asked a question, and the dialog so far is the string d.
> You need to find a suffix x such that p1(d+x) is non-zero (I assume
> you'll settle for any suitable answer rather than the most likely).
> This means searching over the space of possible strings x and
> repeatedly evaluating p1(d+x). This is not feasible. And if your
> algorithm in fact takes d as input and returns an x s.t. p1(d+x) > 0,
> it is not a language model as you've defined it, but something else.

This is a perfectly reasonable implementation for a Turing machine. In
practice, any language model can be converted to a generator with at
most O(n) increase in tme complexity by binary search by appending one
bit at a time. You first compute p1(d+0) and p1(d+1) and randomly
append one bit in proportion to these probabilities, repeating until
you have appended a symbol indicating the end of the message (which is
required in a dialog).

> For completeness, I'm just going to backtrack to something you wrote
> earlier in the thread.
>
> > Now let q be a dialog ending with a question, and x1 and x2 be two
> > posible answers. I think it is reasonable to assume that the following
> > statements are equivalent (all true or all false) for all q, x1, x2:
>
> > 1. A human picked at random and asked question q is more likely to
> > answer x1 than x2.
> > 2. A human picked at random is more likely to judge the pair q,x1 as
> > correct than q,x2.
> > 3. A text corpus of human origin is more likely to contain q,x1 than
> > q,x2.
>
> I'd agree that 1 and 2 are equivalent, but dispute the equivalence of 3.
> The reason is that, except for a vanishingly small number of dialogs q,
> the probability of a text corpus of human origin containing q,x1 is
> identical to the probability of it containing q,x2 , namely 0. As I've
> shown it is easy to create dialogs such that this is the case.

Without full knowledge, you cannot say that any probability is 0. If
you had full knowledge then there would be no use for probability
theory.

> > I realize that Wikipedia has a different distribution than p(s) for
> > Turing tests. The proper training corpus for passing the Turing test
> > would be a set of transcripts between judges and human confederates,
> > like those at the Loebner prize website. Chat logs would also be a
> > better choice than Wikipedia. I argue that this difference is an
> > irrelevant detail. The main difference is that Wikipedia articles are
> > not interactive, so a machine training on it would not learn to hold a
> > conversation. Other than that, text prediction still requires high
> > level language comprehension, common sense, reasoning, and problem
> > solving abilities just like passing the Turing test does. If you can
> > model written articles, then I think extending the model to interactive
> > conversation would not be too hard.
>
> > -- Matt Mahoney
>
> So finally you gloss over several issues, dismissing them as irrelevant
> details, or easy. You claim, apparently as an act of faith, that:
>
> - It will be feasible to learn a sufficiently good language model from a
> tractably small corpus. This is on a par with claiming that it will
> be feasible to learn to play championship Go just from example games.

I believe it is possible to develop lexical, semantic, and syntactic
models from unlabeled text containing only positive examples. Children
seem to do this. It is probably insufficient to develop higher level
skills like solving word problems with arithmetic. That will probably
take some help from the programmer.

A corpus consisting of articles is not sufficient to train a chatterbot
in interactive converstation. But I argue that such a model is still
useful, probably more so. If you want to translate articles, scan
them, recognize them from broadcast speech, etc., you want to train on
the type of text that it would typically encounter.

> - Given such a model, it will be feasible to use it to pass the TT.

Given about 1 GB of the right type of training material (dialogs), yes.

> - The difference between the language model of TT dialogs and of a slab
> of Wikipedia is unimportant. You, the programmer, can somehow extend
> or convert the one into the other.

There is a lot of overlap in the learning mechanisms. The difference
is mainly in the training material and the requirement for interaction.

> - The limited extent of the corpus will not matter. Somehow your LM
> will have the capability to generalise, extrapolate, answer novel
> questions, incorporate all necessary background information, know
> how to do computations, solve simple problems, remember things, ...

I think 1 GB is the right size, as explained in
http://cs.fit.edu/~mmahoney/compression/rationale.html

I do not claim that this particular corpus is right for any particular
AI problem you want to solve. You need to be careful to feed it the
information you want it to know. But I believe for the problem of
predicting text in a corpus, the best information to feed it is the
corpus itself. So I have no problem with the Wikipedia text for this
purpose.

> Here's my take. It will, eventually, be feasible to create programs
> that do all these latter things well enough to pass a TT, even in the
> face of the types of lines of questioning I've pointed to. Inevitably
> such systems will acquire much of their knowledge by learning, from
> conversations and texts amongst other things. But they will be so
> much more than just language models, in terms of what they consist of
> and what they are capable of doing. They could actually be used to
> generate language models of a certain type, ones where probabilities
> were significant only in whether they were zero or not. Indeed, I
> suspect that the only way you can generate a LM with sufficient
> coverage to (in theory) pass the TT, is to use a system with such
> capabilities as the generator.
>
> To put it simply, I'd claim the only way you can create a LM good
> enough to (in theory) pass the TT is to build a system that can
> already pass the TT without any need to create that LM. And you
> don't achieve that just as a side-effect of achieving compression
> of 1G slabs of text that is close to the information-theoretic limit,
> as that task can be achieved by much much simpler approaches.
> Compressing text better is at best just one small step on the path
> to implementing systems that pass the TT.
>
> David

Passing the Turing test requires:
1. Solving the problem of language learning.
2. Modeling interactive conversation.
3. Selecting the right training data.

I am only addressing problem 1.

-- Matt Mahoney

Ian Parker

unread,

Aug 13, 2006, 3:59:15 PM8/13/06

to

Matt Mahoney wrote:
> David
>
> Passing the Turing test requires:
> 1. Solving the problem of language learning.
> 2. Modeling interactive conversation.
> 3. Selecting the right training data.
>
> I am only addressing problem 1.
>

In fact most of the programs that have beeen entered for Turing/Loebner
have been based on databases. (like a Google search). Given a database
only "1" is needed to pass the Turing Test.

Essentially chatterboxes have done "2" quite adaquately.

I realize that this paragraph is one of high subjectivity (nothing like
compression!), but it is doubtful to what extent human beings really
think for themselves. Social scientists and evolutionary biologists
talk about memes, which I suppose you could describe in AI terms as
being elements of a database. The test is for a conversation similar to
that which you could have with a human and it is my experience that you
do not look for reason! Extreme Islamic statements would for example
pass T. They are of course memes.

Clearly if you have compression, or some other purely objective
criterion this does not apply. We are not in compression attempting to
mimic human behaviour. I think you are right, it is the only true test,
although I don't believe monolinual Wiki is the right answer.

James A. Bowery

unread,

Aug 13, 2006, 9:29:11 PM8/13/06

to

Ian Parker wrote:
> I think you are right, it is the only true test,
> although I don't believe monolinual Wiki is the right answer.

Hutter's decision to go with 10e8 bytes was an uneasy, temporary
compromise. Certainly multilingual compression will be very valuable
-- particularly for my personal obsession which is applied relation
arithmetic.

Matt Mahoney

unread,

Aug 14, 2006, 12:15:59 AM8/14/06

to

I had considered a multilingual corpus because it would encourage
research in learning language independent parsing rules rather than
hard coding rules into the program. For example, Chinese does not use
spaces between words (1 to 4 characters), making it a challenge to
build a dictionary. Nevertheless, it is possible to do this in English
text without spaces, starting with no knowledge. Such models might
help us understand how babies learn to parse continuous speech into
words at age 7-10 months, before they learn to speak.
http://cs.fit.edu/~mmahoney/dissertation/lex1.html

I used a monolingual corpus because multilingual ability is not a
requirement for intelligence. The problem is hard enough as it is. If
the same articles appeared in several languages, then compression would
require solving the language translation problem in addition to the
modeling problem.

-- Matt Mahoney

David Kinny

unread,

Aug 14, 2006, 4:57:47 AM8/14/06

to

"Ian Parker" <ianpa...@gmail.com> writes:

> [...] Clearly if you have compression, or some other purely objective

> criterion this does not apply. We are not in compression attempting to
> mimic human behaviour. I think you are right, it is the only true test,
> although I don't believe monolinual Wiki is the right answer.

The problems with compressing text as a benchmark of intelligence are
that most systems don't compress text at all, and it is feasible to
compress text very well while understanding nothing about its meaning,
i.e. being incapable of passing a simple comprehension test.

David

David Kinny

unread,

Aug 14, 2006, 5:10:53 AM8/14/06

to

"Matt Mahoney" <matma...@yahoo.com> writes:

> David Kinny wrote:
> > "Matt Mahoney" <matma...@yahoo.com> writes:
> > > It is irrelevant whether a Turing machine implements a language model
> > > as an algorithm or a lookup table.
> >
> > That might be true in theory, but we are concerned here with what can
> > be feasibly implemented in current or future practice, are we not?

> In practice, Turing machines don't exist.

Tell me something I don't know. That's why I keep talking about
tractable algorithms and feasibility ...

> > > Define p(s) as the probability that s will occur in a two way dialog
> > > between a human judge and a human confederate in the Turing test. s is
> > > a prefix of the entire dialog from beginning to end. p(s) is unknown,
> > > like all probabilities in real life. p(s) depends on the judge, the
> > > confederate, and when the test is given.
> >
> > Here I'm a little confused. You say that s is a prefix of the dialog,
> > and that p(s) depends on the judge, etc, and when the test is given,
> > implying that this p() is for a *specific* dialog. In that case the
> > distribution is trivial: all k possible prefixes have probability 1/k,
> > all other strings 0 (or some variant of this). Surely you wish to
> > consider the distribution over some set of dialogs, but which set?

> In real life, probabilities are always unknown. Different people will
> estimate them differently depending on what they know. For example, I
> flip a fair coin, look at it, but don't show it to you. What is the
> probability that it is heads? You say 1/2. I say 1. Both of us are
> right.

> I define p(s) as the probability that s will be a prefix of a dialog in
> a Turing test, given that you do not know who will be the judge or the
> confederate or when the test will be given.

> I realize that p() is still not well defined. But that is the nature
> of probability.

... while you keep retreating back to unknowable idealisations that
you're somehow going to produce excellent approximations to. Your p()
contains information on all conceivable TT dialogs for all (english
speaking?) humans over all time. You're going to somehow learn this
from 1G of text. I don't think so.

> > For a given p() to be useful in passing the TT, it has to adequately
> > cover possible dialogs. It's easy, as judge, to create novel dialogs.
> > For example, the following short question whose answer is obvious has
> > almost certainly never been asked before in the history of mankind.
> >
> > How many legs do 3 ants, an octopus and 4 aardvarks have in total?

> And yet, the probability that this will answered in an unknown training
> corpus will not be exactly 0.

A mere quibble. Whether it is exactly 0 or vanishingly small, you can't
learn to reliably answer such questions by text prediction based on a
manageably small training corpus, because you can't learn a sufficiently
good approximation of your idealized p(). There are simply too many
easily asked and answered questions you will never have seen before.
Your system must be capable of responding to novel questions like this,
and LM's merely learned directly from text corpora can't.

> Also, I think you will agree that this question and the answer could be
> phrased as a text prediction problem, in which case a compressor that
> can solve problems like this will compress smaller than one that
> cannot.

Surely. But that doesn't imply the converse: that a compressor that
compresses smaller than <one that cannot solve such problems>, can solve
such problems.

> > Then there's the issue of background knowledge. In a previous posting
> > you addressed the need to know such things as the date by prefixing
> > that information onto the actual dialog. How do you propose to
> > decide which background information to add in explicitly?

> If a machine responds too fast to my questions, I will know it is not
> human.

Are you serious? This is easily addressed.

> A dialog must contain information about who said what and when.
> You can represent this information how you like.

Delimiters between utterances are indeed necessary. Timestamps are not.
In any case you've missed (or are deliberately avoiding?) the point here.
You prefixed the actual dialog with non-dialog information such as the
date, presumably to prop up your argument that you could learn an LM
that could answer questions such as "How many days to Xmas?".
The issue is that you need to add a potentially huge amount of
background information that is known by humans but absent from dialogs.

OK, that seems reasonable.

> > For completeness, I'm just going to backtrack to something you wrote
> > earlier in the thread.
> >
> > > Now let q be a dialog ending with a question, and x1 and x2 be two
> > > posible answers. I think it is reasonable to assume that the following
> > > statements are equivalent (all true or all false) for all q, x1, x2:
> >
> > > 1. A human picked at random and asked question q is more likely to
> > > answer x1 than x2.
> > > 2. A human picked at random is more likely to judge the pair q,x1 as
> > > correct than q,x2.
> > > 3. A text corpus of human origin is more likely to contain q,x1 than
> > > q,x2.
> >
> > I'd agree that 1 and 2 are equivalent, but dispute the equivalence of 3.
> > The reason is that, except for a vanishingly small number of dialogs q,
> > the probability of a text corpus of human origin containing q,x1 is
> > identical to the probability of it containing q,x2 , namely 0. As I've
> > shown it is easy to create dialogs such that this is the case.

> Without full knowledge, you cannot say that any probability is 0. If
> you had full knowledge then there would be no use for probability
> theory.

The probabilities assigned to strings by your LM is determined by how
you construct the LM, not by what you or I might consider them to be.
You have yet to explain how you will create an LM by training on a
corpus that will assign non-zero probabilities to strings it has never
encountered, including all questions that might reasonably be asked.

> > > I realize that Wikipedia has a different distribution than p(s) for
> > > Turing tests. The proper training corpus for passing the Turing test
> > > would be a set of transcripts between judges and human confederates,
> > > like those at the Loebner prize website. Chat logs would also be a
> > > better choice than Wikipedia. I argue that this difference is an
> > > irrelevant detail. The main difference is that Wikipedia articles are
> > > not interactive, so a machine training on it would not learn to hold a
> > > conversation. Other than that, text prediction still requires high
> > > level language comprehension, common sense, reasoning, and problem
> > > solving abilities just like passing the Turing test does. If you can
> > > model written articles, then I think extending the model to interactive
> > > conversation would not be too hard.
> >
> > > -- Matt Mahoney
> >
> > So finally you gloss over several issues, dismissing them as irrelevant
> > details, or easy. You claim, apparently as an act of faith, that:
> >
> > - It will be feasible to learn a sufficiently good language model from a
> > tractably small corpus. This is on a par with claiming that it will
> > be feasible to learn to play championship Go just from example games.

> I believe it is possible to develop lexical, semantic, and syntactic
> models from unlabeled text containing only positive examples. Children
> seem to do this. It is probably insufficient to develop higher level
> skills like solving word problems with arithmetic. That will probably
> take some help from the programmer.

Children have inherent capabilities that permit them to do this, and much
more. Until you work out how to reproduce a sufficient subset of those
capabilities, you can't expect to replicate the achievent. You believe
that learning a language model for a suitable corpus is most of the job.
I think you, like many others, are grossly underestimating the complexity
of the problem, and looking for simple solutions to complex problems.

> A corpus consisting of articles is not sufficient to train a chatterbot
> in interactive converstation. But I argue that such a model is still
> useful, probably more so. If you want to translate articles, scan
> them, recognize them from broadcast speech, etc., you want to train on
> the type of text that it would typically encounter.

> > - Given such a model, it will be feasible to use it to pass the TT.

> Given about 1 GB of the right type of training material (dialogs), yes.

Dream on. The information content is inadequate.

> > - The difference between the language model of TT dialogs and of a slab
> > of Wikipedia is unimportant. You, the programmer, can somehow extend
> > or convert the one into the other.

> There is a lot of overlap in the learning mechanisms. The difference
> is mainly in the training material and the requirement for interaction.

How can you claim this, when you have yet to devise the learning
mechanisms?

> > - The limited extent of the corpus will not matter. Somehow your LM
> > will have the capability to generalise, extrapolate, answer novel
> > questions, incorporate all necessary background information, know
> > how to do computations, solve simple problems, remember things, ...

> I think 1 GB is the right size, as explained in
> http://cs.fit.edu/~mmahoney/compression/rationale.html

> I do not claim that this particular corpus is right for any particular
> AI problem you want to solve. You need to be careful to feed it the
> information you want it to know. But I believe for the problem of
> predicting text in a corpus, the best information to feed it is the
> corpus itself. So I have no problem with the Wikipedia text for this
> purpose.

I totally agree that to predict the text that occurs in X, train on X.
But you've still offered no convincing argument as to how this allows
your system to answer questions that don't occur in X, or even ones
that are minor variations of those that do.

> [...]

> Passing the Turing test requires:
> 1. Solving the problem of language learning.
> 2. Modeling interactive conversation.
> 3. Selecting the right training data.

> I am only addressing problem 1.

> -- Matt Mahoney

I guess we've just about reached the point of agreeing to disagree.

David

Ian Parker

unread,

Aug 14, 2006, 6:35:28 AM8/14/06

to

David Kinney wrote

>The problems with compressing text as a benchmark of intelligence are
>that most systems don't compress text at all, and it is feasible to
>compress text very well while understanding nothing about its meaning,
>i.e. being incapable of passing a simple comprehension test.

To achieve high levels of compression, particularly with multilingual
texts, understanding is required. "What do you mean by understand?".
You need an objective definition for that too. The only possible
criterion is a Logical Posivist one. What can you do? In other words.

It is clear to me too that "meaning" is a part of the information
content of text, and a compression algorithm must at some point
consider meaning. With multilinual texts this has to be considered
early on.

- Ian Parker

James A. Bowery

unread,

Aug 14, 2006, 5:19:09 PM8/14/06

to

Actually, rules are compressed knowledge. Tractability and
comprehension constant, the more unified the rules, the less space they
occupy and the more accurate they are at predicting future observations.

Matt Mahoney

unread,

Aug 14, 2006, 5:34:40 PM8/14/06

to

David Kinny wrote:
> "Matt Mahoney" <matma...@yahoo.com> writes:
> > A corpus consisting of articles is not sufficient to train a chatterbot
> > in interactive converstation. But I argue that such a model is still
> > useful, probably more so. If you want to translate articles, scan
> > them, recognize them from broadcast speech, etc., you want to train on
> > the type of text that it would typically encounter.
>
> > > - Given such a model, it will be feasible to use it to pass the TT.
>
> > Given about 1 GB of the right type of training material (dialogs), yes.
>
> Dream on. The information content is inadequate.

OK, then, how much is adequate? I presented evidence that the storage
capacity for verbally expressable knowledge in the human brain is about
10^9 bits.

> I totally agree that to predict the text that occurs in X, train on X.
> But you've still offered no convincing argument as to how this allows
> your system to answer questions that don't occur in X, or even ones
> that are minor variations of those that do.

If I knew how to solve the language modeling problem then I would have
solved it. The fact is that nobody knows how to solve it. That is why
there is a contest. Maybe it can't be solved.

But let's not be pessimistic. I believe the Wikipedia data set has
enough information to infer general knowledge and then apply it to
novel situations occurring later in the text. Here is an example.
There is no occurrence of the string "the United States is a place" in
enwik8 (the Hutter prize data set). Yet there are 1359 occurrences of
"in the United States", 265 of "to the United States", etc. from which
this knowledge could be inferred. Knowing whether X is a place or a
person allows me to make better predictions of the text that follows.
The phrase "X said" tells me that X is probably a person, so I can
predict phrases like "X believes", "X thinks", etc.

So if a program can make such inferences, could it not compress better
than one that couldn't? Or is this not a valid test?

-- Matt Mahoney

Message has been deleted

Rob Freeman

unread,

Aug 15, 2006, 5:31:56 AM8/15/06

to

Matt,

I am interested in this idea that intelligence is equivalent to
compression. I like it. My question is, what guarantee do we have that
a given compression has any generality?

I'm guessing this is related to the assumption that the problem be
limited to a "restricted case of an environment bounded by space s and
time t."

I'm not trying to be negative by saying this. It is just I think there
is a better solution to be found by not making that assumption. Not a
better solution to the compression problem, but a better solution to
the problem of intelligence defined as compression.

-Rob Freeman

Matt Mahoney wrote:
> Ted Dunning wrote:
> > Wasn't it Fisher in the 20's who proved just about the same thing about
> > maximum likelihood estimates?
> >

> > Certainly, there is nothing new in the thesis that if you have an
> > estimate q of a probability p, then the using q as a compression model
> > gives minimum compressed size for p = q. Moreover, this minimum is
> > unique. This is the same as saying that the maximum likelihood value
> > of q is p (not quite the same as the maximum likelihood *estimator*, of
> > course).

>
> I think it was Shannon who proved this in 1948 (discete noiseless
> channel capacity theorem).
> http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
> I could not find any reference to Fisher in his paper.
>

> > Why this is news is beyond me.
>
> That isn't what Hutter proved. What he proved is that the optimal
> behavior of a goal seeking agent in an unknown but computable
> environment is to guess that the environment is controlled by the
> shortest program (Turing machine) consistent with all behavior observed
> so far. Thus, a solution to the problem of maximizing the accumulated
> reward signal from an unknown environment reduces to algorithmic
> compression, which is an undecidable problem. However, Hutter also
> proved that in the restricted case of an envoronment bounded by space s
> and time t, that the problem is computable in time O(t2^s) (so the
> problem is still hard).

>
> When I developed the large text benchmark (without funding) my goal was
> to promote research in NLP. There are 3 arguments why text compression
> implies AI.
>
> 1. Hutter's proof in the specific case of text prediction (a problem
> that is easy for people but hard for machines).

> 2. An explicit representation of the probability distribution of
> natural language strings (a correct language model) enables one to pass
> the Turing test for AI.

> 3. Compression ratio (perplexity) is already used in practice to
> measure the quality of language models (mainly for speech recognition).
>
> However, AI does *not* imply compression. Decompression requires a
> deterministic computation to invert the compression function, and the
> human brain is not deterministic. I believe this is the reason that
> many people doubt the converse. I discuss these issues in
> http://cs.fit.edu/~mmahoney/compression/rationale.html
>
> -- Matt Mahoney

Matt Mahoney

unread,

Aug 15, 2006, 3:28:41 PM8/15/06

to

Rob Freeman wrote:
> Matt,
>
> I am interested in this idea that intelligence is equivalent to
> compression. I like it. My question is, what guarantee do we have that
> a given compression has any generality?

There is no such thing as a perfectly general compression function.
Every compressor that decreases the size of some inputs must increase
the size of others.

Of course, most "interesting" data, like text or images, is
compressable. We could probably define "interesting" to mean that
there is a short description of it. More formally, a string of length
n bits is interesting if there is a universal Turing machine with an
encoding shorter than n bits that outputs the string. The length of
the shortest program to do this is the algorithmic complexity.
(Obviously it depends on the programming language, but only up to a
constant, because any program can be converted to another language by
appending a compiler). The problem of computing algorithmic complexity
for all strings in any given language is not computable.

Since "general" compression is not obtainable, that isn't the goal of
the Hutter prize. I expect that the best solutions will be tuned
specifically to the data set, and not be very useful as general purpose
compressors. However, to do well, these programs will have to solve
hard problems in language modeling, like how to integrate lexical,
semantic and syntactic knowledge to recognize correct sentences and
coherent paragraphs. I think the data has sufficient knowledge to do
this if we can figure out how language is learned.

> I'm guessing this is related to the assumption that the problem be
> limited to a "restricted case of an environment bounded by space s and
> time t."

These bounds reduce AIXI from an undecidable problem to an intractable
one (O(t2^s)), so it is still a theoretical result. The goal is not to
verify AIXI. It is proven, so it doesn't need to be verified
experimentally. The goal is more modest, to understand how people
learn language. I believe this problems is tractable, or else our
brains couldn't do it.

> I'm not trying to be negative by saying this. It is just I think there
> is a better solution to be found by not making that assumption. Not a
> better solution to the compression problem, but a better solution to
> the problem of intelligence defined as compression.

The Turing test is more direct. We have the Loebner prize for that.
The Turing test requires machines to have a world model and to reason
and solve problems using it. The Wikipedia benchmark is probably
insufficient to train a system for this level of abstract thought, but
I think it will bring us to a level of language modeling much higher
than we have now.

-- Matt Mahoney

Rob Freeman

unread,

Aug 15, 2006, 8:08:55 PM8/15/06

to

Matt Mahoney wrote:
>
> There is no such thing as a perfectly general compression function.

Is this true also for language?

> The problem of computing algorithmic complexity
> for all strings in any given language is not computable.

This is probably the result I am interested in. Why is this so?

Most especially, if it is so, why are you looking for a single
compressor as the solution to your "language learning" problem, even
for a single data set?

> Since "general" compression is not obtainable, that isn't the goal of
> the Hutter prize. I expect that the best solutions will be tuned
> specifically to the data set, and not be very useful as general purpose
> compressors. However, to do well, these programs will have to solve
> hard problems in language modeling, like how to integrate lexical,
> semantic and syntactic knowledge to recognize correct sentences and
> coherent paragraphs. I think the data has sufficient knowledge to do
> this if we can figure out how language is learned.

My impression is that you are assuming the problem with "language
modeling" has been that have not known how to compress natural
language. What I am suggesting is that the problem has been we only
ever look for a single best solution.

If you agree there are many solutions, that each data set will have a
different solution, and by implication that each subset of each
data-set will have a different best solution, why should we not
optimize our compressor on, say, each sentence as that sentence is
presented to the system?

Why do we have to optimize globally for a solution to the language
modeling problem, even when we know that global solutions are not
general?

-Rob

Overcomer

unread,

Aug 16, 2006, 3:49:39 AM8/16/06

to

James A. Bowery wrote:
<snip>
> Indeed, Hutter's Razor** might be phrased, "It is truer to explain with
> less that which can be explained with more."
<snip>
> Named for the discoverer of the proof and the initial 50,000 Euro
> donor, the Hutter Prize currently targets the compression of a 100
> megabyte sample of human knowledge drawn from the broadly based
> Wikipedia online encyclopedia.
<snip>

Certianly one is free to put one's money down on any proposition but in
this case it might be better spent on a search for truth. Real
scientists don't really use Occam's Razor except as a convienent excuse
to dismiss things they don't like. To claim it is more true to explain
less is much like adherents of certian programming languages who claim
there is no need to comment code. I would argue the general
proposition of proving less is more cannot be proven because it is
obviously false. Let's prove that. Take any brand of encyclopedia and
reduce it to the character X. There is clearly no more truth in
stating "X" than stating the entire encyclopedia. So hereafter you may
consider Hutter's Razor as presented by Bowery's Post as proven false.
Hutter may have something worthwhile that is not represented with
sufficient accuracy by Bowery's Post. So then more accuracy yeilds
more truth, not less volume, unless we're talking about Liespace where
fewer statements are always more true and silence is golden.

If more truth exists saying less, why post at all? The Wikipedia is
also a big joke of mediocrity, enforced by the moderator method of self
appointed eletist whacked in the head peer reviews who reject real
truth which can hurt for the sake of truth by committee.

So I request you get Hutter himself to post and defend himself as
Bower's post doesn't present stuff that's even defensible to a logical
person. Sometimes truth is so important it is repeated -- see the 4
gospels on Jesus Christ.

Sincerely,
Kirk Fraser

James A. Bowery

unread,

Aug 16, 2006, 1:37:03 PM8/16/06

to

Overcomer wrote:
> James A. Bowery wrote:

> > Indeed, Hutter's Razor** might be phrased, "It is truer to explain with
> > less that which can be explained with more."

> To claim it is more true to explain
> less

Why did you read "explain with less" as "explain less"?

James A. Bowery

unread,

Aug 16, 2006, 1:43:14 PM8/16/06

to

Rob Freeman wrote:
> Most especially, if it is so, why are you looking for a single
> compressor as the solution to your "language learning" problem, even
> for a single data set?

There are actually 2 things of value that can come out of the Hutter
Prize:

1) An AI compressor that works for encyclopedic knowledge.

2) More conicise enyclopedic knowledge.

2 needn't be accomplished via 1. It can also be accomplished by humans
creating formalized knowledge that is more concise.

Rob Freeman

unread,

Aug 16, 2006, 9:38:53 PM8/16/06

to

Hi James,

I think there is a lot of value that can come out of it. Don't get me
wrong. I think you are on the right track. In contrast to some of the
other messages in this thread I'm sure compression is a very good way
to see intelligence, "meaning" anyway, more narrowly, and certainly
natural language.

But I'm most interested in the limitations Matt places on that
compression. Those limitations bring you closer to what I think is the
answer than any other work I've seen.

In particular I like these statements from Matt:

"There is no such thing as a perfectly general compression function.

Every compressor that decreases the size of some inputs must increase
the size of others."

"The problem of computing algorithmic complexity for all strings in any

given language is not computable."

"Since "general" compression is not obtainable ... I expect that the

best solutions will be tuned specifically to the data set, and not be
very useful as general purpose compressors."

I think these are the right observations. I'm just trying to remind you
there are other conclusions you could draw from those observations.

Isn't the attempt to model language by finding a single best
compression function related to the assumption that there is a general
compression function for language? If we accept that general solution
does not exist, shouldn't that make us reassess our goal (a single best
solution) rather than simply the generality of that goal?

What meaning does a global solution for a compression of language have
if it is not general?

I'm just trying to nudge you in the direction of seeing there are
possibilities for modeling language other than general/global solutions
(which surely we only seek because of a traditional bias in the way we
think about language -- as parametrized by a single grammar.)

You are already on the right track when you identify compression with
language modeling, and in particular when you accept that there is no
general solution for that compression. I just want you to think about
what that means for the way you formulate the problem

If solutions are not general, why not optimize locally instead of
globally? When we understand there are no general solutions, we
understand solutions must be local in a sense. Why compromise and make
those solutions only half local by optimizing them over the biggest
data set you can find? Think about what happens as your data set gets
larger. Aren't you only approximating more and more closely a general
solution which we already know does not exist?

My suggestion is, if solutions are necessarily local, why not address
your intelligence problem, language modeling anyway, as a search for
really relevant local solutions, and forget global solutions
completely?

-Rob

Matt Mahoney

unread,

Aug 16, 2006, 9:44:00 PM8/16/06

to

Rob Freeman wrote:
> Matt Mahoney wrote:
> >
> > There is no such thing as a perfectly general compression function.
>
> Is this true also for language?

Only if you exclude some possible strings. For "normal" text where you
exclude highly unlikely byte strings (such as nonprintable characters),
then you can always compress it. But if you allow all binary strings
of n bits, then there is no function that will compress all 2^n
possibilities. That is because there are only 2^n - 1 binary strings
of length less than n. (This is the counting theorem or counting
argument).

> > The problem of computing algorithmic complexity
> > for all strings in any given language is not computable.
>
> This is probably the result I am interested in. Why is this so?

Kolmogorov gave a simple proof of this. Suppose there was a function
K(x) that outputs the length of the shortest program (in some fixed
language) that outputs x. (This number is also known as the
algorithmic complexity or Kolmogorov complexity). Then you could write
the following program to find the first string x (in some
lexicographical ordering) that cannot be compressed to n bits for some
n:

let x = ""
while K(x) <= n do
x = next(x)
output x

Now let n be the length of this program, and you have a contradiction.
Therefore K() is not computable.

Even though K(x) is not computable, it can be proven that K(x) does not
depend on the programming language used, except for a constant
independent of x. This is because if x is coded as a program in
language L1, you can code it in any other language L2 by appending a
compiler for L1 written in L2.

The motivation for using Kolmogorov complexity as opposed to some
probability model to define information content is that it removes any
debate about which model should be used. You intuitively "know" that
"00000000000000" has less information than "hvvW80f8fwl34i". Why? It
is easier to remember. It is easier to describe. It follows a
pattern. Because it has less information, it should compress smaller.
But suppose my model says that characters are independent and letters
are more common than digits. Then my compressor would compress the
second string smaller. This is not intuitive. Instead, algorithmic
complexity has a universal prior. The first string has a shorter
program that will output it in most languages. This includes English,
when I say it is easier to describe ("14 zeros" is shorter than
"hvvW80f8fwl34i").

> Most especially, if it is so, why are you looking for a single
> compressor as the solution to your "language learning" problem, even
> for a single data set?

Essentially, the "intelligence" of a model p1(x) depends on how
accurately it approximates the true but unknown natural language model
p(x) for all strings x. Shannon proved that the cross entropy E[-SUM_x
p(x) log p1(x)] is minimized when p1(x) = p(x) for all x. We don't
know p(x) but we can approximate this value by sampling x from the
unknown distribution p(x) and computing p1(x) with the model under
test. So the test depends on the assumption that WIkipedia is a fair
sample of natural language text

There is still one big problem. Probabilities are not intrinsic
properties of real world events. They depend on what you know. If you
know that you will be compressing a specific x, then your model is p(x)
= 1 and you can compress x to 0 bits. There are two ways to solve
this. One is to not reveal x. The other is to require a universal
prior, p(x) = 2^-K(x).

The first approach is widely used. If we used this approach, we could
test compressors on maybe just a few KB of text and get fairly accurate
results. But nobody could verify or repeat our results.

We used the second approach. The problem is that K(x) is a universal
model which says nothing about natural language. Our solution is to
make x big enough (we think) to learn natural language from x itself.

Whether it is possible to learn natural language at all from unlabeled
text, and whether the Wikipedia data is the right quantity or quality
to do this is controversial. I discuss these issues further in

Rob Freeman

unread,

Aug 16, 2006, 10:29:34 PM8/16/06

to

Matt Mahoney wrote:
>
> > > The problem of computing algorithmic complexity
> > > for all strings in any given language is not computable.
> >
> > This is probably the result I am interested in. Why is this so?
>
> Kolmogorov gave a simple proof of this.

Good, yeah. That was what I thought you meant. I'm not arguing with it,
only how we interpret it for the "language modeling" problem.

> > Most especially, if it is so, why are you looking for a single
> > compressor as the solution to your "language learning" problem, even
> > for a single data set?
>
> Essentially, the "intelligence" of a model p1(x) depends on how
> accurately it approximates the true but unknown natural language model

> p(x) for all strings x...

_The_ "true but unknown natural language model"?

Do you agree there is an assumption here? Why do we assume a "true but
unknown" language model exists at all?

What happens if there is no "true by unknown" language model. What
happens if if all you really have is text?

-Rob

Overcomer

unread,

Aug 17, 2006, 2:15:58 AM8/17/06

to

Why did you ask? Obviously your first quote is contained in your
second with less.

Overcomer

unread,

Aug 17, 2006, 2:16:58 AM8/17/06

to

Ian Parker

unread,

Aug 17, 2006, 12:44:18 PM8/17/06

to

Rob Freeman wrote:
>
> What happens if there is no "true by unknown" language model. What
> happens if if all you really have is text?
>

There must be.Look eclusia/cerradura is one bit. Primavera, ressorte,
mamanthal is two (well just about). If a human has no difficulty in
distinguishing between the different meanings the clear implication is
that these bits are redundant.

I still believe that the best compressor is probably the deterministic
guesser with the top guesses having a small number of bits.

You may of course believe that there is some sort of "vital force" in
human translation and understanding. I doubt though whether many of
this usergroup's readership would agree.

- Ian Parker

Matt Mahoney

unread,

Aug 17, 2006, 4:22:45 PM8/17/06

to

Overcomer wrote:
> So I request you get Hutter himself to post and defend himself as
> Bower's post doesn't present stuff that's even defensible to a logical
> person. Sometimes truth is so important it is repeated -- see the 4
> gospels on Jesus Christ.
>
> Sincerely,
> Kirk Fraser

Marcus Hutter's FAQ answers most of your questions.
http://prize.hutter1.net/prizefaq.htm

-- Matt Mahoney

Ian Parker

unread,

Aug 17, 2006, 4:40:23 PM8/17/06

to

I think Hutter still has to post on multilingualism. Why English? This
is one thing he does not cover. From all the other things he says I
would have thought multilingualism a natural part of what he is trying
to do.

- Ian Parker

Matt Mahoney

unread,

Aug 17, 2006, 7:44:29 PM8/17/06

to

I discuss this in in
http://www.cs.fit.edu/~mmahoney/compression/rationale.html
A multilingual model is sufficient but not necessary to demonstrate AI.
The Turing test is done in one language. If a machine could answer
questions in several languages, then that's nice, but if it only speaks
one language, then it still passes. So why make the problem harder
than it needs to be?

Also, I eventually want to measure the entropy of the data set using
methods like the Shannon game. If it is multilingual, then I need to
find subjects that speak each of the languages, and then I would still
have the wrong answer because there would be translations of articles
containing the same information. To measure the entropy accurately I
would need to find multilingual subjects and present an article in one
language as context while having them predict in the other. This
complication is easily avoided by using just one language.

-- Matt Mahoney

Rob Freeman

unread,

Aug 17, 2006, 7:52:05 PM8/17/06

to

Ian Parker wrote:
> Rob Freeman wrote:
> >
> > What happens if there is no "true by unknown" language model. What
> > happens if if all you really have is text?
>

> ...

>
> You may of course believe that there is some sort of "vital force" in
> human translation and understanding. I doubt though whether many of
> this usergroup's readership would agree.

Are our imaginations really so limited that the choices come down to
"language model" or "vital force"?

Ian, on the basis of past messages I know you believe language
understanding must be based on "relaxation".

The existence of a language model would mean language could be
parametrized by rules.

By saying language understanding must be based on relaxation you are
really also saying a general language model does not exist.

Your very own "relaxation" could be a third alternative to that
mysterious "vital force".

Relaxation is just ad-hoc compression. Which is consistent with the
model these guys are suggesting, in so far that it is based on a proof
all you need for useful intelligence is compression (leaving rules to
be an added assumption, an assumption inconsistent with a proof that a
general solution is impossible, in point of fact.)

-Rob

Ian Parker

unread,

Aug 18, 2006, 6:59:41 AM8/18/06

to

Relaxation simply means that the meanings are defined in an iterative
way. Speech has to be based on relaxation as it is initially ill
defined.

We may be talking slightly at cross purposes.

"El barco attravesta una cerradura" and "La estacion de ressorte" are
examples of GROSS errors in computer translation. If we were to get to
a stage were we could say "In Castile it means this, in Latin America
it means something similar but subtely different" we would have
advanced well beyond the current state of the art. If the differences
in meaning between Spain and Mexico were to become important - in other
words gross errors had been eliminated, VoIP would be virtually error
free.

Also I don't believe that in human/human communication all shades of
meaning are conveyed. Matt Mahoney says, and I agree with this, that
anything the human brain can do a computer can do. True there may be a
spiritual dimension to our lives, but this is certainly not relevant to
basic linguistic understanding.

Ian Parker

unread,

Aug 18, 2006, 7:11:48 AM8/18/06

to

Matt Mahoney wrote:

> Ian Parker wrote:
> A multilingual model is sufficient but not necessary to demonstrate AI.
> The Turing test is done in one language. If a machine could answer
> questions in several languages, then that's nice, but if it only speaks
> one language, then it still passes. So why make the problem harder
> than it needs to be?
>

As I have said multilinualism indicates a Euclidean position.
Primavera, ressote and mamanthal are overt in Spanish but are latent in
English. Latent Semanic Analysis should be providing the correct
positions.

> Also, I eventually want to measure the entropy of the data set using
> methods like the Shannon game. If it is multilingual, then I need to
> find subjects that speak each of the languages, and then I would still
> have the wrong answer because there would be translations of articles
> containing the same information. To measure the entropy accurately I
> would need to find multilingual subjects and present an article in one
> language as context while having them predict in the other. This
> complication is easily avoided by using just one language.
>

To me entropy reflects the ability to guess the next word. Here is
something to think about - Does the ability to measure entropy, imply
you can compress. I proposed a compressor which lised the words in a
language in order of probability at a given point. Yes it does!

The same thing is true with translations and multilingualism. What you
are doing here is second guessing the the original translator.

BTW - Entropy can change on translation. An example of this is the
adjective noun order in Spanish. In French tghe academie francaise has
standardized the order. In French "grand" comes before the noun. In
Spanish the position is optional. You have the "Rio Grande" but in
other places the French order is followed.

I was just saying that a translation forces AI quickly as opposed to
the present compressors which do not give any deep insights.

Matt Mahoney

unread,

Aug 20, 2006, 7:17:32 PM8/20/06

to

Ian Parker wrote:
> Matt Mahoney wrote:
> > Ian Parker wrote:
> > A multilingual model is sufficient but not necessary to demonstrate AI.
> > The Turing test is done in one language. If a machine could answer
> > questions in several languages, then that's nice, but if it only speaks
> > one language, then it still passes. So why make the problem harder
> > than it needs to be?
> >
>
> As I have said multilinualism indicates a Euclidean position.
> Primavera, ressote and mamanthal are overt in Spanish but are latent in
> English. Latent Semanic Analysis should be providing the correct
> positions.

I think it is possible to demonstrate word sense disambiguation in a
monolingual model. For example, "bank" would not predict "interest" in
context "river".

> > Also, I eventually want to measure the entropy of the data set using
> > methods like the Shannon game. If it is multilingual, then I need to
> > find subjects that speak each of the languages, and then I would still
> > have the wrong answer because there would be translations of articles
> > containing the same information. To measure the entropy accurately I
> > would need to find multilingual subjects and present an article in one
> > language as context while having them predict in the other. This
> > complication is easily avoided by using just one language.
> >
> To me entropy reflects the ability to guess the next word. Here is
> something to think about - Does the ability to measure entropy, imply
> you can compress. I proposed a compressor which lised the words in a
> language in order of probability at a given point. Yes it does!

I should have said "estimate" the entropy of a human model, not
"measure". The ability to measure entropy not only implies the ability
to compress (because entropy is a function of the model), but also
knowledge of an ideal model, which is not computable for the general
case.

-- Matt Mahoney

Rob Freeman

unread,

Aug 21, 2006, 1:11:24 AM8/21/06

to

Matt Mahoney wrote:
>
> I should have said "estimate" the entropy of a human model, not
> "measure". The ability to measure entropy not only implies the ability
> to compress (because entropy is a function of the model), but also
> knowledge of an ideal model, which is not computable for the general
> case.

Matt, even if you change "measure" to "estimate", surely you still need
to explain how it is meaningful to try and compute something which it
has been proven is not computable.

-Rob

Jorrit Vermeiren

unread,

Aug 21, 2006, 6:59:49 AM8/21/06

to

Matt Mahoney wrote:
> But let's not be pessimistic. I believe the Wikipedia data set has
> enough information to infer general knowledge and then apply it to
> novel situations occurring later in the text.

(...)

> So if a program can make such inferences, could it not compress better
> than one that couldn't? Or is this not a valid test?

Using your own Occam's Razor reference, the smallest size any data set
can be compressed to would be with a language model that perfectly
models the data *and nothing else*. Anything more would be redundant
and unnecessary to model the data. It would be massively overfitted to
the data in question.

Necessarily, compressing the data more will result in a loss of ability
to predict things not contained within the data.

Ian Parker

unread,

Aug 21, 2006, 4:23:19 PM8/21/06

to

Matt Mahoney wrote:
> > As I have said multilinualism indicates a Euclidean position.
> > Primavera, ressote and mamanthal are overt in Spanish but are latent in
> > English. Latent Semanic Analysis should be providing the correct
> > positions.
>
> I think it is possible to demonstrate word sense disambiguation in a
> monolingual model. For example, "bank" would not predict "interest" in
> context "river".
>

Yes it is. However a 2L is in my view the quickest way to demonste
this. I could talk about fishing on the bank of a river(sp orilla,
rivera) for hours. If it taked about interest rates I could well
conclude that it was making some kind of joke. Spanish tells me
instantly what the level of understanding is. I could find out in pure
English but it would take a lot longer.

Certaily if it cannot understand that you don't fish for €50,000 it
will be unlikely to compress to the Hutter level.. Herr Hutter has set
the compression level to be just beyond was is possible with fractal
type compressors. €50,000 can only be won with AI. The problem really
comes with the lesser prizes. I would be sorry to see €10,000 going
to a slighly more efficient fractal.

Matt, you are on the committee. What precisely are the criteria for the
lesser prizes and how much discretion do the committee have? The top
prize is or course awarded according to a narrowly defined criterion.

>
> I should have said "estimate" the entropy of a human model, not
> "measure". The ability to measure entropy not only implies the ability
> to compress (because entropy is a function of the model), but also
> knowledge of an ideal model, which is not computable for the general
> case.
>

If you ESTIMATE but do not MEASURE entropy it is subjective.

- Ian Parker

Matt Mahoney

unread,

Aug 21, 2006, 7:08:00 PM8/21/06

to

Let's not confuse modeling with algorithmic complexity. The ideal
model of string x is p(x) = 1. This would compress to size 0. Writing
such a compressor is trivial,e.g.
http://cs.fit.edu/~mmahoney/compression/barf.html

But you can't neglect the size of the decompressor. Using algorithmic
complexity does not allow it. Probability no longer depends on what
you know.

-- Matt Mahoney

Matt Mahoney

unread,

Aug 21, 2006, 7:27:01 PM8/21/06

to

The goal is to estimate the entropy of a human language model. Given a
model p1(X) over random variable X with true but unknown distribution
X, one can estimate the cross entropy E[-log p1(x)] by averaging over a
large sampling of x from X. So if p = p1, then we are estimating the
entropy of X, defined as E[-log p(x)]. In the Shannon game, p is the
distribution of text in the corpus, and p1 is the model of the test
subject. We presume that these models are the same (or close) since
they are both of human origin.

The problem with is that it is difficult to have humans explicitly
state p1(x). Rather, it is conceptually easier for people to rank
p1(x1) > p1(x2), e.g. x1 is more likely than x2. So the Shannon game
has people guess successive characters in text. But this only places
course bounds on the result (0.6 to 1.3 bits per character for a 27
character alphabet of A-Z and space).

One possible improvement might be a form of human assisted computer
modeling, similar to what Shannon did by allowing subjects to use
1,2,3-gram and word frequency tables. At each decision point, a
program could output its next guess as e.g. a pie chart based on an
n-gram model, and the subject could make adjustments based on high
level knowledge not available to the computer. This could be made into
a game, where the subjects wins points for better compression. I think
this type of test should provide a tighter upper bound.

-- Matt Mahoney

Matt Mahoney

unread,

Aug 21, 2006, 7:34:50 PM8/21/06

to

Ian Parker wrote:
> Certaily if it cannot understand that you don't fish for €50,000 it
> will be unlikely to compress to the Hutter level.. Herr Hutter has set
> the compression level to be just beyond was is possible with fractal
> type compressors. €50,000 can only be won with AI. The problem really
> comes with the lesser prizes. I would be sorry to see €10,000 going
> to a slighly more efficient fractal.

I'm not aware of fractals being useful for text compression. AFAIK
they are only useful in lossy image compression to encode textures.

> Matt, you are on the committee. What precisely are the criteria for the
> lesser prizes and how much discretion do the committee have? The top
> prize is or course awarded according to a narrowly defined criterion.

Each 1% improvement in compression ratio of enwik8 (100 MB of Wikipedia
text) wins 500 euros. The minimum is 1%.

> > I should have said "estimate" the entropy of a human model, not
> > "measure". The ability to measure entropy not only implies the ability
> > to compress (because entropy is a function of the model), but also
> > knowledge of an ideal model, which is not computable for the general
> > case.
> >
> If you ESTIMATE but do not MEASURE entropy it is subjective.
>
> - Ian Parker

Any measurement is also an estimate, but not all estimates are
measurements. I would like to estimate the entropy of a human model
using the text in enwik8 so that we can estimate what level of
compression would achieve AI.

-- Matt Mahoney

Ted Dunning

unread,

Aug 21, 2006, 7:41:46 PM8/21/06

to

You may have already mentioned this, but it suddenly occurs to me that
compression of a bilingual parallel corpus at a rate better than
compression of independent corpora in the two languages is an
interesting and possibly even useful test of machine translation just
as compression of a single corpus is a test of a language model.

Matt Mahoney wrote:
> ...
> I used a monolingual corpus because multilingual ability is not a
> requirement for intelligence. The problem is hard enough as it is. If
> the same articles appeared in several languages, then compression would
> require solving the language translation problem in addition to the
> modeling problem.
>
> -- Matt Mahoney

Ted Dunning

unread,

Aug 21, 2006, 7:48:53 PM8/21/06

to

This is a good point, but it misses the requirement that the size of
the compression program be included in the cost of the compression.

Most over-fitting results in an increase in the size of the compressor
that out-weighs the improvement in compression. This difference is
roughly equivalent to searching for an MAP estimator rather than an ML
estimator.

Ted Dunning

unread,

Aug 21, 2006, 7:54:50 PM8/21/06

to

The guarantee of generality is that if tested on data from the same
distribution as it approximates, then the new data will be compressed
well.

This follows from Shannon's theorem that I quoted earlier (and
attributed to Fisher).

Basically, if your approximation q of the true distribution p is good,
then it will achieve something near the optimum value of sum p log q
and vice versa. Moreover, sum p log q has a unique optimum where p = q
and is nice and smooth so that any q that produces a nearly optimal
value will be near p. If p doesn't change, then q will still be near
p which will make sum p log q still be nearly optimal and you have your
generality (or not if q is far from p).

Rob Freeman wrote:
> Matt,
>
> I am interested in this idea that intelligence is equivalent to
> compression. I like it. My question is, what guarantee do we have that
> a given compression has any generality?
>
> I'm guessing this is related to the assumption that the problem be
> limited to a "restricted case of an environment bounded by space s and
> time t."
>
> I'm not trying to be negative by saying this. It is just I think there
> is a better solution to be found by not making that assumption. Not a
> better solution to the compression problem, but a better solution to
> the problem of intelligence defined as compression.
>
> -Rob Freeman
>
> Matt Mahoney wrote:
> > Ted Dunning wrote:
> > > Wasn't it Fisher in the 20's who proved just about the same thing about
> > > maximum likelihood estimates?
> > >
> > > Certainly, there is nothing new in the thesis that if you have an
> > > estimate q of a probability p, then the using q as a compression model
> > > gives minimum compressed size for p = q. Moreover, this minimum is
> > > unique. This is the same as saying that the maximum likelihood value
> > > of q is p (not quite the same as the maximum likelihood *estimator*, of
> > > course).
> >
> > I think it was Shannon who proved this in 1948 (discete noiseless
> > channel capacity theorem).
> > http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
> > I could not find any reference to Fisher in his paper.
> >
> > > Why this is news is beyond me.
> >
> > That isn't what Hutter proved. What he proved is that the optimal
> > behavior of a goal seeking agent in an unknown but computable
> > environment is to guess that the environment is controlled by the
> > shortest program (Turing machine) consistent with all behavior observed
> > so far. Thus, a solution to the problem of maximizing the accumulated
> > reward signal from an unknown environment reduces to algorithmic
> > compression, which is an undecidable problem. However, Hutter also
> > proved that in the restricted case of an envoronment bounded by space s
> > and time t, that the problem is computable in time O(t2^s) (so the
> > problem is still hard).
> >
> > When I developed the large text benchmark (without funding) my goal was
> > to promote research in NLP. There are 3 arguments why text compression
> > implies AI.
> >
> > 1. Hutter's proof in the specific case of text prediction (a problem
> > that is easy for people but hard for machines).
> > 2. An explicit representation of the probability distribution of
> > natural language strings (a correct language model) enables one to pass
> > the Turing test for AI.
> > 3. Compression ratio (perplexity) is already used in practice to
> > measure the quality of language models (mainly for speech recognition).
> >
> > However, AI does *not* imply compression. Decompression requires a
> > deterministic computation to invert the compression function, and the
> > human brain is not deterministic. I believe this is the reason that
> > many people doubt the converse. I discuss these issues in

Matt Mahoney

unread,

Aug 21, 2006, 8:41:22 PM8/21/06

to

Ted Dunning wrote:
> You may have already mentioned this, but it suddenly occurs to me that
> compression of a bilingual parallel corpus at a rate better than
> compression of independent corpora in the two languages is an
> interesting and possibly even useful test of machine translation just
> as compression of a single corpus is a test of a language model.

That is right. If you could compress a monolingual corpus to 1 bpc,
then a perfect translator implies you could compress a parallel corpus
to 0.5 bpc because the translation adds no new information. This is an
applications of Keogh's CDM.

-- Matt Mahoney

Rob Freeman

unread,

Aug 21, 2006, 9:50:41 PM8/21/06

to

Well sure, Ted. If you get a general compression you can test it this
way. The real question is "How do we know a general compression is
possible?" Your answer implies you think it is possible. Matt does not.
Matt answers this question by saying that ' "general" compression is
not obtainable'.

I am happy to see Matt saying ' "general" compression is not
obtainable' because I also believe this to be the case for language.

I just want him to think through the implications of that result a bit
more deeply.

-Rob

Ted Dunning wrote:
> The guarantee of generality is that if tested on data from the same
> distribution as it approximates, then the new data will be compressed
> well.

> ...

> Rob Freeman wrote:
> > Matt,
> >
> > I am interested in this idea that intelligence is equivalent to
> > compression. I like it. My question is, what guarantee do we have that
> > a given compression has any generality?

> > ...

Rob Freeman

unread,

Aug 21, 2006, 10:02:25 PM8/21/06

to

But Matt, this seems to be a description of how you intend to compute
something. What I want to know is how this relates to the proof that
what you are trying to compute is not computable.

-Rob

Matt Mahoney

unread,

Aug 22, 2006, 6:03:41 PM8/22/06

to

Rob Freeman wrote:
> But Matt, this seems to be a description of how you intend to compute
> something. What I want to know is how this relates to the proof that
> what you are trying to compute is not computable.

Kolmogorov complexity is not computable for the general case, but this
does not exclude that K(s) can be proven for some s (but not all s).
But it is an argument that it is a hard problem, and worthy of a
competition.

My reason for wanting to estimate tighter bounds on the entropy of
written English is because we are now compressing text to within the
uncertainty bounds determined by Shannon. It would be nice to know how
far we are from the goal.

-- Matt Mahoney

Rob Freeman

unread,

Aug 22, 2006, 8:34:16 PM8/22/06

to

Matt Mahoney wrote:
> ...

> My reason for wanting to estimate tighter bounds on the entropy of
> written English is because we are now compressing text to within the
> uncertainty bounds determined by Shannon. It would be nice to know how
> far we are from the goal.

But how do we know the goal exists, Matt? Shannon tells us the
interpretation of his results is uncertain (based on different ways of
interpreting his observed distributions in terms of an underlying
probability model), and Kolmogorov tells us it is not computable.
Actually, isn't want Kolmogorov telling us that the goal is unknowable,
because his proof rests on the observation not that to compute it is
very difficult, or would take a long time, but that to know it would
result in a paradox.

What does it mean that to know something would result in a paradox?
Mightn't the paradox be an attempt to reconcile many inconsistent
perspectives?

What I am getting at is what I see as this assumption that language is
governed by a single probability model.

Isn't it more consistent with the facts to conclude that language is
not governed by a single probability model?

-Rob

Message has been deleted

Matt Mahoney

unread,

Aug 23, 2006, 12:31:02 AM8/23/06

to

Kolmogorov says it is impossible to compute K(s) for all s. It is not
impossible to compute K(s) for some s. For example, I could prove that
the smallest C program to output "x" is main(){printf("x");} by
enumerating all shorter syntactically correct programs and showing that
they halt and do not output "x". Kolmogorov only proves that I can't
do this for every string.

Also, I am not attempting to find K(enwik8), but rather the entropy of
a human language model with respect to enwik8, -SUM_x p(x) log p1(x),
where p() is the language model of the source that generated enwik8 and
p1() is the model of the human subject predicting enwik8. Since both
models are human, we can assume they are approximately the same and
therefore estimate the entropy of the human model. This should be
close to K(enwik8) as long as there are no hidden patterns, for
example, data generated by a cryptographic process.

Now if we know the entropy H, then we can say that if a model p1()
compresses to H then it must equal p(), which is the goal.

Shannon's rough approximation of H is due to the uncertainty of p1(),
because there are many models that could result in the observed
sequence of guesses. Humans are good at judging if p(x1) > p(x2), but
are poor at judging by how much, which limits the methods we can use.

-- Matt Mahoney

Rob Freeman

unread,

Aug 23, 2006, 1:24:51 AM8/23/06

to

Matt Mahoney wrote:
>
> Kolmogorov says it is impossible to compute K(s) for all s. It is not
> impossible to compute K(s) for some s. For example, I could prove that
> the smallest C program to output "x" is main(){printf("x");} by
> enumerating all shorter syntactically correct programs and showing that
> they halt and do not output "x". Kolmogorov only proves that I can't
> do this for every string.

Matt, that's OK, but you still haven't said how we know we can do this
for natural language.

How do we know a general solution exists, even, or especially, when we
limit s to natural language?

Do you now think a general compression is possible for natural
language? How does this compare with your earlier statement that:
'Since "general" compression is not obtainable, that isn't the goal of
the Hutter prize. I expect that the best solutions will be tuned
specifically to the data set, and not be very useful as general purpose
compressors.'

-Rob

Matt Mahoney

unread,

Aug 24, 2006, 3:26:59 PM8/24/06

to

Rob Freeman wrote:
> Matt Mahoney wrote:
> >
> > Kolmogorov says it is impossible to compute K(s) for all s. It is not
> > impossible to compute K(s) for some s. For example, I could prove that
> > the smallest C program to output "x" is main(){printf("x");} by
> > enumerating all shorter syntactically correct programs and showing that
> > they halt and do not output "x". Kolmogorov only proves that I can't
> > do this for every string.
>
> Matt, that's OK, but you still haven't said how we know we can do this
> for natural language.
>
> How do we know a general solution exists, even, or especially, when we
> limit s to natural language?

It does not. Here is the proof. Suppose that K(s) is computable for
the special case of natural language strings, and suppose further there
is a function f(s) that returns true if s is a string from natural
language and false otherwise. Assume the set of natural language
strings is infinite. Then we can write the following program to find
the first (in some lexicographical order) natural language string x
that cannot be compressed to n bits.

let s := ""
while not f(x) or else K(x) <= n do
let x := next(x)
output x

Then if you let n be the length of this program, then you have a
contradiction. So K(X) is not computable even for the special case of
X = {x: f(x) = true} where |X| is infinite. Note that the proof does
not depend on the tricky meaning of "natural language". You can define
it however you want.

The undecidability of text compression is not a limitation. Rather it
is motivation to hold a contest to tighten the upper bound. As you get
closer, the problem gets harder.

> Do you now think a general compression is possible for natural
> language? How does this compare with your earlier statement that:
> 'Since "general" compression is not obtainable, that isn't the goal of
> the Hutter prize. I expect that the best solutions will be tuned
> specifically to the data set, and not be very useful as general purpose
> compressors.'
>
> -Rob

The goal of the Hutter prize is to find solutions to the problem of
natural language modeling. I expect the winning programs to be
specifically tuned to the data set. This does not distract from the
goal because a solution will still require solving difficult problems
in learning and modeling syntax and semantics. I discussed this
further in comp.compression under the thread "Hutter prize entries?".

http://groups.google.com/group/comp.compression/browse_frm/thread/1091241d8923d34d/3343ee9386f528bb#3343ee9386f528bb

-- Matt Mahoney

rob...@gmail.com

unread,

Aug 28, 2006, 12:23:48 AM8/28/06

to

Matt Mahoney wrote:
> Rob Freeman wrote:

> > ..

> > How do we know a general solution exists, even, or especially, when we
> > limit s to natural language?
>

> It does not. ...
>
> ...

>
> The goal of the Hutter prize is to find solutions to the problem of
> natural language modeling. I expect the winning programs to be
> specifically tuned to the data set. This does not distract from the
> goal because a solution will still require solving difficult problems
> in learning and modeling syntax and semantics.

Matt,

I am happy to hear you emphasize again that a general compression for
natural language does not exist.

It is just that I think this result alone provides the answer to the
biggest of those problems in natural language processing you talk
about, i.e. why historically we have been unable to learn a single
complete natural language grammar.

The answer is: because a single complete natural language grammar does
not exist.

Having understood this I think it is going to be easy to model natural
language grammar (using all kinds of ad-hoc compressions.) It seems the
problem all along has been that we have insisted there must be only
one, and that it must be complete.

-Rob

Overcomer

unread,

Sep 9, 2006, 1:01:29 AM9/9/06

to

Matt Mahoney wrote:
> Overcomer wrote:
> > Sometimes truth is so important it is repeated -- see the 4
> > gospels on Jesus Christ.
>

> Marcus Hutter's FAQ answers most of your questions.
> http://prize.hutter1.net/prizefaq.htm

While interesting the FAQ also inspires new questions.

"Intelligence has many faces, like creativity, solving problems,
pattern recognition, classification, learning, induction, deduction,
building analogies, optimization, surviving in an environment, language
processing, knowledge, and many more. ...[snip]... The key concepts for
a formal definition are compression and utility maximization, the other
aspects are emergent. phenomena."

How does compression include creativity? Compression actually doesn't
include many of the faces mentioned. Certianly compression can be done
more intelligently than WinZip and related programs and that could
minimize the file size which may require restricted forms of the faces
of AI but compression could never maximize creativity or many of the
rest.

"The contest will have highest participation when the prize will be
initiated, so the file-size should be appropriate for moderate-cost
home-computers."

Having worked on a compressor to transmit changes instead of whole
files, I found the most efficient file size for desktop computers was
1Mb, not 100Mb. There may be some adjustments for not having to
transmit or receive in the process but generally a programmer testing
1Mb files would be able to do say 1000 iterations compared to 10 for a
100Mb file in the same time on the surface. Deeper analysis would take
longer thus there should be expected an arithmetic reduction in
performance with larger file sizes.

Why demand the first contest be on 100Mb instead of 1Mb? It would seem
people who need to bicycle to borrow a computer would appreciate being
able to test new ideas several times an hour instead of once every 10
hours.

"The total prize is not exactly 50'000€"

Why not? A formula award sounds like scam. It may not be a scam but
it is not honest sounding like a solid prize for the winner. And it
may discourage some entrants who find the risk of wasted time vs.
expectation of 50K acceptable but taking away from 50K even for the
winner may make the risk less acceptable.

Interestingly, 50K in dollars is the amount Bill Gates paid for his
first operating system. What do you plan to do with the winning
software -- the hint of scam on prize leads me to ask if you intend to
reverse engineer and sell it? [I'd like Loebner's answer too.]

KF

Ian Parker

unread,

Sep 9, 2006, 8:05:52 AM9/9/06

to

Overcomer wrote:
> Matt Mahoney wrote:
> > Overcomer wrote:
> > > Sometimes truth is so important it is repeated -- see the 4
> > > gospels on Jesus Christ.
> >
> > Marcus Hutter's FAQ answers most of your questions.
> > http://prize.hutter1.net/prizefaq.htm
>
> While interesting the FAQ also inspires new questions.
>
> "Intelligence has many faces, like creativity, solving problems,
> pattern recognition, classification, learning, induction, deduction,
> building analogies, optimization, surviving in an environment, language
> processing, knowledge, and many more. ...[snip]... The key concepts for
> a formal definition are compression and utility maximization, the other
> aspects are emergent. phenomena."
>
> How does compression include creativity? Compression actually doesn't
> include many of the faces mentioned. Certianly compression can be done
> more intelligently than WinZip and related programs and that could
> minimize the file size which may require restricted forms of the faces
> of AI but compression could never maximize creativity or many of the
> rest.
>

First question - What do you define creativity as? The main barrier to
computers behaving in an intelligent way is their lack of understanding
of natural language. If natural language could be readily undersood,
creativity would take place as part of the process of manipulating
linguistic ideas. To put this in a very banal way a spider that
understood NL could look at tthe descriptions and comment statements in
code and cobble together a C++ or Java program to do whatever.

BTW - I have been thinking about the fossil record and evolution. Matts
published program could quite esily look at DNA. A compression of DNA
gives us the lineage of all life on Earth. This surely is an
intelligent deduction.

> "The contest will have highest participation when the prize will be
> initiated, so the file-size should be appropriate for moderate-cost
> home-computers."

These powers are increasing all the time.

>
> Having worked on a compressor to transmit changes instead of whole
> files, I found the most efficient file size for desktop computers was
> 1Mb, not 100Mb. There may be some adjustments for not having to
> transmit or receive in the process but generally a programmer testing
> 1Mb files would be able to do say 1000 iterations compared to 10 for a
> 100Mb file in the same time on the surface. Deeper analysis would take
> longer thus there should be expected an arithmetic reduction in
> performance with larger file sizes.

For NL 100MB allows the size of a dictionary to be amortized.

>
> Why demand the first contest be on 100Mb instead of 1Mb? It would seem
> people who need to bicycle to borrow a computer would appreciate being
> able to test new ideas several times an hour instead of once every 10
> hours.
>
> "The total prize is not exactly 50'000€"
>
> Why not? A formula award sounds like scam. It may not be a scam but
> it is not honest sounding like a solid prize for the winner. And it
> may discourage some entrants who find the risk of wasted time vs.
> expectation of 50K acceptable but taking away from 50K even for the
> winner may make the risk less acceptable.
>
> Interestingly, 50K in dollars is the amount Bill Gates paid for his
> first operating system. What do you plan to do with the winning
> software -- the hint of scam on prize leads me to ask if you intend to
> reverse engineer and sell it? [I'd like Loebner's answer too.]
>

I think it is valuable in publicity value. A linuistic model will be
worth billions of Euros. Think of the money the EU spends translating
documents.

You can indeed use terms like entropy to describe the content of
language.

- Ian Parker

Overcomer

unread,

Sep 9, 2006, 6:44:28 PM9/9/06

to

Ian Parker wrote:
> First question - What do you define creativity as?

Several definitions come to mind. Professors mint PhD's for expanding
the state of the art in a known field or inventing a new field of
study. The Biblical useage is making something from nothing (as
opposed to recycling). A cute one used in the hippy era was dealing
with a mess as a mess instead of organizing it. I personally like to
discover new things nobody I've heard of has discovered before,
although I don't do exhaustive research to find if ANYBODY has EVER
discovered the same things. One definition for children is
fingerpainting -- making patterns in the sandbox of life that are at
least pretty and hopefully unique and useful. In NL educated by the
Bible, one could specify absolute perfection.

> The main barrier to
> computers behaving in an intelligent way is their lack of understanding
> of natural language.

Agreed. People trying to enlighten a computer must know NL composition
and analysis better than most linguists.

> If natural language could be readily undersood,
> creativity would take place as part of the process of manipulating
> linguistic ideas.

No not necessarily. Linguistic ideas as taught where I went to college
made for a class that wasn't worth taking. Let's postulate the
existance of a demonic force or human knowledge virus called
anti-creativity which causes people infected to operate in ways that
defeat creativity instead of nurture it (Communism for example). The
linguistic class I skipped was like that. So creativity is NOT a
natural outcome of text processing but it may happen by coincidence and
should be pursued as a distinct goal in its own right.

> To put this in a very banal way a spider that
> understood NL could look at tthe descriptions and comment statements in
> code and cobble together a C++ or Java program to do whatever.

Nope, the creativity required for good programming is often not
reflected at all in any descriptions or comment statements whatsoever.
In fact, the state of the art in Smalltalk's "Extreme Programming"
movement is to program simply and refactor until comments are
unnecessary. That makes it impossible for a newbie to get a grasp on
what's going on. So I would tend to categorize banality as
anti-creative.

> BTW - I have been thinking about the fossil record and evolution. Matts
> published program could quite esily look at DNA. A compression of DNA
> gives us the lineage of all life on Earth. This surely is an
> intelligent deduction.

Perhaps but having a picture of DNA is like having a picture of a
living being instead of the living being.

> I think it is valuable in publicity value. A linuistic model will be
> worth billions of Euros. Think of the money the EU spends translating
> documents.

Hmm, things are worth what other people are willing to pay you for it.
That is why oil is so much more expensive than it ought to be. I find
it unfair that certian Saudi's through being wealthy at birth can
afford and do procreate a classroom of children when I can barely
afford to live alone. The EU can assign values as it chooses but those
choices are not necessarily fair nor good for the world.

Ian Parker

unread,

Sep 10, 2006, 7:00:56 AM9/10/06

to

It is not necesary that is true. But let me put this another way. Where
the computer "understands" the parameters as is the case with chess, a
high standard of play results. In the case of Bridge and Poker this is
not the case - yet. Perhaps it is due to our failure to have a proper
mathematical understanding. Mahbe we need some pure mathematical
research on probability. If we take as our definition the ability to
produce new results, computers do this already. Who would do
engineering without a computer?

We ought to perhaps be more precise about the type of intelligence we
are looking for. There is the intelligence involved in innovation.
There is also the intellegence involved in searching for information
To a degree we rate people who can command a lot of information as
being intelligent.

If we have crossed the Pons Asinorum of natural language a lot becomes
possible. Security is one application which spings to mind. We could
for example keep a very close watch on extremist Islamic websites. We
could even insert counter arguments.

http://groups.google.co.uk/group/sci.military.naval/browse_frm/thread/6131a95f5290fcd3/b86cda291eaaa959?lnk=st&q=&rnum=1&hl=en#b86cda291eaaa959

AI could take a look at this and similar contributiuons before
attempting the Turing Test!

> > To put this in a very banal way a spider that
> > understood NL could look at tthe descriptions and comment statements in
> > code and cobble together a C++ or Java program to do whatever.
>
> Nope, the creativity required for good programming is often not
> reflected at all in any descriptions or comment statements whatsoever.
> In fact, the state of the art in Smalltalk's "Extreme Programming"
> movement is to program simply and refactor until comments are
> unnecessary. That makes it impossible for a newbie to get a grasp on
> what's going on. So I would tend to categorize banality as
> anti-creative.
>

This is part of what I mean. There is an awful lot of stuff out there.
The Carnigie Mellon metrics relate to the reusability of code. I am
convinced that if we could categorize previous code in some way we
could write large Java systems without in fact cutting very much code
ourselves. I say Java because with JVM it is completely portable. NL
understanding will certainly revolutionize code development systems
such as Forte if nothing else.

> > BTW - I have been thinking about the fossil record and evolution. Matts
> > published program could quite esily look at DNA. A compression of DNA
> > gives us the lineage of all life on Earth. This surely is an
> > intelligent deduction.
>
> Perhaps but having a picture of DNA is like having a picture of a
> living being instead of the living being.
>

Yes, I simply used this to illustrate the fact that sheer compression
can produce intelligent results. Matt claims (and on balance I think I
agree) that compression implies AI. Optimal compression, which may or
may not be computable MUST give us NL understanding. I have said I
thought Hutter should be bi/multi lingual. This would mean that the
problem was contronted at an EARLY stage.

> > I think it is valuable in publicity value. A linuistic model will be
> > worth billions of Euros. Think of the money the EU spends translating
> > documents.
>
> Hmm, things are worth what other people are willing to pay you for it.
> That is why oil is so much more expensive than it ought to be. I find
> it unfair that certian Saudi's through being wealthy at birth can
> afford and do procreate a classroom of children when I can barely
> afford to live alone. The EU can assign values as it chooses but those
> choices are not necessarily fair nor good for the world.

If I could sit here and talk to someone in Iraq - BTW I know no Arabic
and have a simultaneous translation of high quality that would ease
understanding no end. This is true within the EU I am simply basing my
"what is worth" estimates on what is being spent now. We need a
European Parliament with these facilities.

The price of Oil - True but off topic.

- Ian Parker

Overcomer

unread,

Sep 10, 2006, 1:23:12 PM9/10/06

to

Ian Parker wrote:
> > So creativity is NOT a
> > natural outcome of text processing but it may happen by coincidence and
> > should be pursued as a distinct goal in its own right.
> >
> It is not necesary that is true.

Wrong! So it appears your logic unit isn't working today. Perhaps you
got infected with some anti-creativity?

> But let me put this another way. Where
> the computer "understands" the parameters as is the case with chess, a
> high standard of play results.

Yes, but the computer's learning and creativity is still low.

> If we take as our definition the ability to
> produce new results, computers do this already. Who would do
> engineering without a computer?

Ha. Now you go off topic. Engineering is done with HUMAN creativity
applied by the CAD operator, not computer creativity.

> We ought to perhaps be more precise about the type of intelligence we
> are looking for. There is the intelligence involved in innovation.
> There is also the intellegence involved in searching for information
> To a degree we rate people who can command a lot of information as
> being intelligent.

Again you slip into talking about people in a way that doesn't support
your point much. Yes people who command lots of information are
generally considered intelligent since they can often apply the
information to new situations instead of simply recall it like a tape
recorder. So far AI is closer to a tape recorder with very limited
application abilities. The real problem is how do you convert a
dictionary or encyclopedia (lots of information) into something useful
to AI? There are two ways, rote "brain surgery" programming or
learning. The first is too big a job and the second requires too much
intelligence for most human programmers. So it requires a
million-dollar a year team of highly intelligent programmers with a
heart to accomplish the job or some very intelligent individual working
on learning (including learning creativity) to leverage his skills
until there is enough AI software to be useful.

The lossless compression of human knowledge contest is a fraud like the
whole EU. It may work for a while but it is based on false ideas so it
must ultimately fail or at best be refocussed by the showman to keep
participant's attention moving from failure to failure keeping hope
alive so the unpaid volunteers as well as the paid ones don't all quit.
Anyone can write a "scholarly paper" saying creativity is a natural
outcome of compression but it is not actually true. Likewise when the
EU made law banning religion specifically the Bible from itself, they
guaranteed their ultimate failure because they guided their future away
from the pursuit of absolute perfection aka. Jesus Christ. So they
have nothing left but their con-artist shell game of imperfection,
telling people the gold is over here, now the gold is over here, etc.
and maybe creating artificial economies just like their failed
prototype Communism.

> If we have crossed the Pons Asinorum of natural language a lot becomes
> possible. Security is one application which spings to mind. We could
> for example keep a very close watch on extremist Islamic websites. We
> could even insert counter arguments.

True but compression will not get you there. You have to make NL your
primary goal for many man-years and forget about compression until you
are getting ready to deploy.

> I am
> convinced that if we could categorize previous code in some way we
> could write large Java systems without in fact cutting very much code
> ourselves. I say Java because with JVM it is completely portable. NL
> understanding will certainly revolutionize code development systems
> such as Forte if nothing else.

I proposed working on such a system to augment www.krugle.com but they
don't have enough money. I proposed such to Google which does have
enough money for R&D but they simply don't want to fund it. The
combination of wealth and willingness to spend it on something creative
is hard to find and hard to do. I used my limited resources to hire
off-shore programmers to work on AI and results were mostly poor. One
guy who was promising went to the hospital with a fever and never came
back to email me. The best result I could afford was from an American
high-schooler.

> > > BTW - I have been thinking about the fossil record and evolution. Matts
> > > published program could quite esily look at DNA. A compression of DNA
> > > gives us the lineage of all life on Earth. This surely is an
> > > intelligent deduction.
> >

> Yes, I simply used this to illustrate the fact that sheer compression
> can produce intelligent results.

Your illustration was ineffective on me. I was granting you credit for
your intelligent deduction, not the program.

> Matt claims (and on balance I think I
> agree) that compression implies AI.

No. Compression may SOMEDAY include a little AI but Matt is getting a
big head like many with a little success to proudly make false claims.
It is sad that success does not necessarily create humility enough to
remain truthful.

> Optimal compression, which may or
> may not be computable MUST give us NL understanding. I have said I
> thought Hutter should be bi/multi lingual. This would mean that the
> problem was contronted at an EARLY stage.

No, no, no. All you need for "optimal" compression is just a few
recursive sentence structures based on parts of speech represented as
many people have already written in AI magazine and elsewhere then add
those structures to word frequency lists to attempt further compression
beyond bit or byte wise representation of the most frequent strings.
NL is way more comprehensive than compression.

Ian Parker

unread,

Sep 11, 2006, 10:30:21 AM9/11/06

to

Overcomer wrote:

> No, no, no. All you need for "optimal" compression is just a few
> recursive sentence structures based on parts of speech represented as
> many people have already written in AI magazine and elsewhere then add
> those structures to word frequency lists to attempt further compression
> beyond bit or byte wise representation of the most frequent strings.
> NL is way more comprehensive than compression.

Matt MUST be right in theory. For let us suppose you have got a text
with zero entropy. That is to say a string which is completely random.
The computer must have extracted all the meaning since there can be no
further meaning to extract.

Let us do a "gedanken" compression. Let us predict the next word and
let us arrange the words we have in order of probability. We then have
a small number of binary bits defining each word. As an illustration of
how little entropy language contains.

Ahura pudis comer el pan y beber el agua (Now you can eat bread and
drink water)

If that occurred in let us say El Cid do you suppose you could read the
whole book with just a knowledge of French and Latin? Well that is how
the Hittite language was decoded! An ideal compressor could clearly do
this. We can say that compression will be able to do certain things
almost by theorem. In fact as we have seen a zero entropy compression
involves understanding of a language, again as Matt observes the
converse is not true. Matt talks about computability and wonders
whether of not zero entropy compression is computable. I think the
conclusion we could perhaps come to is that it is not. An ideal
compressor would thus be able to read any ancient text. One prize that
perhaps someone could offer if a prize for the decoding of Linear A
(Unlike linear B not decoded).

What in fact is not certain is whether or not the situation is
sufficiently computable for AI to beat other compression techniques. If
you have English and Spanish in a translation AI techniques clearly
must beat anything else. With just English AI techniques will still
win, but can it win in the age of the Universe? I mentioned Evolution
and DNA since this is what existing compression techniques are
basically programmed to do.

As Matt rightly says the Turing/Loebner tests are subjective. However
they do indicate, if somewhat subjectively, some quite important facts.

1) Every Loebner contestant has used a database and the chatterbox
makes up replies accoring to what has been said and what is in the
database. It is clear that a good NL predictor added to this will
produce Loebner.
2) As far as imagination goes it is difficult to see where we will end
up. Compression alone will not produce imaginative solutions. It could
however remove an important block. If we were to treat concepts as Java
Classes we could make progress. In fact something like a class is now
one scool of AI is approaching the problem.

If we had bots on the Web with full Turing characteristics we would be
in a powerful position to put our ideas forward. What we would need, in
essence, is a bot linked to our database. We are told that the battle
(if it could be called that) with Islam is going to be fought in the
field of ideas as much as militarily. A Turing bot would seem be be the
ideal thing.

BTW - We must, according to Asimov, go round extreme Islamic sites. "No
robot can by neglect lead to an injury of a human being".

Overcomer

unread,

Sep 11, 2006, 1:21:21 PM9/11/06

to

Ian Parker wrote:
> Overcomer wrote:
>
> > No, no, no. All you need for "optimal" compression is just a few
> > recursive sentence structures based on parts of speech represented as
> > many people have already written in AI magazine and elsewhere then add
> > those structures to word frequency lists to attempt further compression
> > beyond bit or byte wise representation of the most frequent strings.
> > NL is way more comprehensive than compression.
>
> Matt MUST be right in theory. For let us suppose you have got a text
> with zero entropy. That is to say a string which is completely random.
> The computer must have extracted all the meaning since there can be no
> further meaning to extract.

So far you have not proven "Matt MUST be right in theory." I have
previously shown why "Matt is wrong in theory." A random string is
uncompressable unless there IS some pattern such as all the characters
are represented as bytes and are alphabetic (or alphanumeric plus
puncutation) giving some extra bits per byte which can be used for
compression.

> Let us do a "gedanken" compression. Let us predict the next word and
> let us arrange the words we have in order of probability. We then have
> a small number of binary bits defining each word. As an illustration of
> how little entropy language contains.
>
> Ahura pudis comer el pan y beber el agua (Now you can eat bread and
> drink water)
>
> If that occurred in let us say El Cid do you suppose you could read the
> whole book with just a knowledge of French and Latin?

Of course, word frequency or even character frequency analysis is a
simple first step in compression.

I personally would never attempt such a project -- I attempted to learn
German in college but found nobody who wanted to study with me so it
was to me impossible to learn more than a few words. Now if I get an
AI translator working it may use the original (pre-Communist abridged)
Chinese language since I have some books written in both Chinese and
English. But I value real AI capable of doing human work more than I
value working on translation.

Eventually everyone should be taught only English or only Hebrew
anyway. All the different languages and cultures just enable war and
ignorance of truth. BTW, I understand none of the words in your
example. I can't even tell what language they are written in.

> Well that is how
> the Hittite language was decoded! An ideal compressor could clearly do
> this. We can say that compression will be able to do certain things
> almost by theorem.

Ok, I grant compression can do some interesting things if those tasks
are made part of the project. This is nothing new. Word frequency
analysis has been around for years. You can use the algorithm for that
in compression programs or in decoding language programs. Just because
the same subroutine exists in two applications does not mean both
applications can accomplish the same end result. You and Matt are
overgeneralizing, creating a fantasy that App. A and B both have
routine C therefore they are equal. Only if you reduce A, B, and C to
a neutral form inside and AI knowledge base are they generally equal.
Then you would have an AI with compression and language decoder
capabilities, not a compressor with language decoder capabilites. And
notice A, B, and C do not produce significant generic creativity in NL
or any field of study.

> In fact as we have seen a zero entropy compression
> involves understanding of a language, again as Matt observes the
> converse is not true.

Wrong on both counts. A compression can include word frequency
analysis but that is not natural language understanding. Even a
computer language compiler doesn't understand its own computer language
-- it's just a text processor that translates one language to another
for the execution machine (which can be the CPU or some abstraction
thereof). As for the converse, if you had actual understanding of a
language then it could be assumed that would include understanding of
the word "compression."

> Matt talks about computability and wonders
> whether of not zero entropy compression is computable. I think the
> conclusion we could perhaps come to is that it is not. An ideal
> compressor would thus be able to read any ancient text. One prize that
> perhaps someone could offer if a prize for the decoding of Linear A
> (Unlike linear B not decoded).

Hmm, I recently got a free sample Linear ADC. I think it is more
computable than your ideas on compression.

> What in fact is not certain is whether or not the situation is
> sufficiently computable for AI to beat other compression techniques. If
> you have English and Spanish in a translation AI techniques clearly
> must beat anything else. With just English AI techniques will still
> win, but can it win in the age of the Universe? I mentioned Evolution
> and DNA since this is what existing compression techniques are
> basically programmed to do.

Well if you want to get picky, I think Spanish is one of many
unnecessary languages. No knowledge exists in Spanish that is
technically useful compared to English except that knowledge which came
from English speaking inventors which was translated into Spanish. So
Spanish people can convert to English making future translation totally
unnecessary.

> As Matt rightly says the Turing/Loebner tests are subjective. However
> they do indicate, if somewhat subjectively, some quite important facts.
>
> 1) Every Loebner contestant has used a database and the chatterbox
> makes up replies accoring to what has been said and what is in the
> database. It is clear that a good NL predictor added to this will
> produce Loebner.

Your analysis is true but your conclusion is as clear as mud. If you
believe it so strongly, try it. I doubt you will even pass the first
level of competition to enter the actual contest but if you do, your
results will be unlikely to exceed other contestants using your
theories.

> 2) As far as imagination goes it is difficult to see where we will end
> up. Compression alone will not produce imaginative solutions.

Finally! An admission of my point on creativity, thank you.

> It could
> however remove an important block. If we were to treat concepts as Java
> Classes we could make progress. In fact something like a class is now
> one scool of AI is approaching the problem.

Yes and that school of AI was doing so long before the abberant
compression concept came forth. So it really has nothing taken from
compression. That school of AI might be more valuable than
compression.

> If we had bots on the Web with full Turing characteristics we would be
> in a powerful position to put our ideas forward. What we would need, in
> essence, is a bot linked to our database. We are told that the battle
> (if it could be called that) with Islam is going to be fought in the
> field of ideas as much as militarily. A Turing bot would seem be be the
> ideal thing.

This is no big deal. One student wrote if I get AI going he would hook
it up to the Web for free. I believe some chatterbots are already
online. AI is the hard part.

> BTW - We must, according to Asimov, go round extreme Islamic sites. "No
> robot can by neglect lead to an injury of a human being".

Your statement is compressed confusion. A robot is mechanical, a bot
is software only. A robot can break things and kill people with bad
non-Asimov software. A bot can only stretch the limits of free speech.

Ian Parker

unread,

Sep 12, 2006, 5:51:48 AM9/12/06

to

Overcomer wrote:
> > Matt MUST be right in theory. For let us suppose you have got a text
> > with zero entropy. That is to say a string which is completely random.
> > The computer must have extracted all the meaning since there can be no
> > further meaning to extract.
>
> So far you have not proven "Matt MUST be right in theory." I have
> previously shown why "Matt is wrong in theory." A random string is
> uncompressable unless there IS some pattern such as all the characters
> are represented as bytes and are alphabetic (or alphanumeric plus
> puncutation) giving some extra bits per byte which can be used for
> compression.
>

Let us suppose that we have done it we have a string of zero entropy
that can be losslessly expanded. Never mind how we got there. Part of
the information content is the actual meaning. This clearly contradicts
the fact that the computer has now understanding of meaning.

As far as entering the Loebner competition is concerned, what I am
doing is proving that certain things are AI hard as Matt describes
them. That is to say solving these problems will solve AI or at any
rate a significant part of it.

I have postulated that an understanding of NL is AI hard. It is a pivot
on which a whole lot else depends. The PRESENT Loebner competitors
would work iff (two fs deliberate) they used a concept language.
Primavera/ressorte/mamanthal is AI hard. It is hard wrt Loebner/Turing.
If we have La estacion de ressorte is is manifestly clear that we will
fail Turing since the way to defeat a Loebner/Turing bot is to have a
sentence which is grammatically correct but nonsense. Does the converse
hold? Not of logical necessity but I feel it does in practice.

Matt has of course gone one stage further and described compression as
being AI hard. This MUST be true for the following reason. Suppose we
cannot distinguish between "eclusia" and "cerradura" or "primavera",
"ressorte" or "mamanthal". We have entropy left, we can compress
further by the word guessing proceedure if by nothing else.

- Ian Parker

Overcomer

unread,

Sep 12, 2006, 1:25:22 PM9/12/06

to

Ian Parker wrote:
After finishing your previous post, it is evident that Matt might be
correct in theory but you Ian are not.

> 2) As far as imagination goes it is difficult to see where we will end

> up. Compression alone will not produce imaginative solutions. It could

> however remove an important block. If we were to treat concepts as Java
> Classes we could make progress. In fact something like a class is now
> one scool of AI is approaching the problem.

This is correct thinking as far as it goes. Then you Ian ignored this
and proceeded with your fantasy imagination. So it is not Matt that is
quite wrong, it is you Ian. I'm sorry I gave you the benefit of the
doubt earlier and considered Matt was wrong based on the information
you provided. Of course the above quote was provided by you too,
indicating a possibility of further inaccuracy, but trusting your
representation of Matt's comments, it is you who are ultimately misled
by your own wish to contribute to something you don't yet understand.

> Let us suppose that we have done it we have a string of zero entropy
> that can be losslessly expanded. Never mind how we got there. Part of
> the information content is the actual meaning. This clearly contradicts
> the fact that the computer has now understanding of meaning.

Your statement is true but useful how? The standard of truth is
usually both factual and useful. Please pursue absolute truth.

> As far as entering the Loebner competition is concerned, what I am
> doing is proving that certain things are AI hard as Matt describes
> them. That is to say solving these problems will solve AI or at any
> rate a significant part of it.

No, you are playing in fantasy land. Get real, write a Loebner entry
if you can to prove your concepts.

> I have postulated that an understanding of NL is AI hard. It is a pivot
> on which a whole lot else depends.

No, it is a lazy academic way of looking at things to spin miles of
papers without an ounce of code in a publish or perish world. Free
yourself of such considerations and get busy designing and coding
something useful.

> The PRESENT Loebner competitors
> would work iff (two fs deliberate) they used a concept language.
> Primavera/ressorte/mamanthal is AI hard. It is hard wrt Loebner/Turing.
> If we have La estacion de ressorte is is manifestly clear that we will
> fail Turing since the way to defeat a Loebner/Turing bot is to have a
> sentence which is grammatically correct but nonsense. Does the converse
> hold? Not of logical necessity but I feel it does in practice.

Refer to my previous remark.

> Matt has of course gone one stage further and described compression as
> being AI hard. This MUST be true for the following reason. Suppose we
> cannot distinguish between "eclusia" and "cerradura" or "primavera",
> "ressorte" or "mamanthal". We have entropy left, we can compress
> further by the word guessing proceedure if by nothing else.

Matt may not be pursuing NLAI any more than you but at least it appears
he knows his limitations. I prefer to pursue the real thing instead of
your fantasies Ian.

Matt Mahoney

unread,

Sep 12, 2006, 4:09:59 PM9/12/06

to

Somewhat off topic, but the latest Hutter prize candidates are now
using syntactic and semantic modeling. The modeling is somewhat crude
(grouping related words in the dictionary), but it is still an advance
over the purely lexical models used in the baseline. This is exactly
the type of research that the prize was designed to encourage. I
expect that more sophisticated models will be used in future
submissions.

I have summarized my analysis of the candidates with links to relevant
papers here.
http://cs.fit.edu/~mmahoney/compression/text.html#1426

-- Matt Mahoney

Overcomer

unread,

Sep 12, 2006, 11:30:08 PM9/12/06

to

One of the biggest problems in NL is it takes man-years to tweak
dictionaries to meet needs for most any model which makes it harder for
individuals. The only approach I see outside of being wealthy enough
to hire a team or pursuasive enough to get free help like Linux was
produced with is working on learning. Have you found anyone working in
that direction?

Thanks,
KF

Ian Parker

unread,

Sep 13, 2006, 6:54:22 AM9/13/06

to

I will reply to "overcomer" later after I have thought a little bit.
One quick question for you. You used progressive training on "Alice in
Wonderland" rather than reading the whole book and performing matrix
calculations. Any particular reason? Was it to demonstrate progressive
compression?

- Ian Parker

Ian Parker

unread,

Sep 13, 2006, 10:44:59 AM9/13/06

to

Overcomer wrote:
> > 2) As far as imagination goes it is difficult to see where we will end
> > up. Compression alone will not produce imaginative solutions. It could
> > however remove an important block. If we were to treat concepts as Java
> > Classes we could make progress. In fact something like a class is now
> > one scool of AI is approaching the problem.
>
> This is correct thinking as far as it goes. Then you Ian ignored this
> and proceeded with your fantasy imagination. So it is not Matt that is
> quite wrong, it is you Ian. I'm sorry I gave you the benefit of the
> doubt earlier and considered Matt was wrong based on the information
> you provided. Of course the above quote was provided by you too,

> > Let us suppose that we have done it we have a string of zero entropy

> > that can be losslessly expanded. Never mind how we got there. Part of
> > the information content is the actual meaning. This clearly contradicts
> > the fact that the computer has now understanding of meaning.
>
> Your statement is true but useful how? The standard of truth is
> usually both factual and useful. Please pursue absolute truth.
>
> > As far as entering the Loebner competition is concerned, what I am
> > doing is proving that certain things are AI hard as Matt describes
> > them. That is to say solving these problems will solve AI or at any
> > rate a significant part of it.

> indicating a possibility of further inaccuracy, but trusting your
> representation of Matt's comments, it is you who are ultimately misled
> by your own wish to contribute to something you don't yet understand.
>

Look I never said that there was not a lot of work to be done. In fact

"El barco attravesta una cerradura" and "La estacion de ressorte"
indicate how far off we are. In one sense it is quite remarkable how
far Turing/Loebner have in fact gone with these glaring difficiencies.
In fact the Jaberwocky, last years Loebner winner is in fact being
used.

http://www.creativematch.co.uk/viewNews/?92730

For many applications, like insurance, imagination is not required, the
salesperson/bot has to stick to a database and trivial calculations
only are performed by the bot. If we have got to what I would term the
"bueno espagnol" stage. That is to say we can perform good translations
which would be of UN interperator quality. (Compression -> bueno
espagnol as has been demonstrated). Bueno espsagnol (and compression by
inplication) directly enables us to do the following.

1) Recofnize speech. Gove the Jabwewocky a voice that speaks and an ear
that listens as well as Jaws that bite and claws that catch This is
so since the recognition of individual phonemes by computer is as good
if not better than the recognition by humans. Why current speech
recognition is so bad is because no context is included, there is no
estimate made of the probability of different words based on meaning.

> No, you are playing in fantasy land. Get real, write a Loebner entry
> if you can to prove your concepts.
>

No I am simply saying that existing software, such as the Jaberwocky
would pass the test given "bueno espagnol". I have felt instinctively
for some time that bueno espagnol = Turing. Certainly the converse is
true. Consider this. Two humans chat to one and another. We take their
converstaion and compress it to zero entropy (we assume for the sake of
argument we can do it). Now if we take the side of Human 1 the
compressor MUST be predicting what human 2 is saying. If it cannot the
fact of zero entropy is thereby contradicted. If therefore we have a
database out of which replies are constructed, given a database of
reasonable size we must have Turing. Bueno espagnol does not strictly
speaking imply Turing although compression does.

> > I have postulated that an understanding of NL is AI hard. It is a pivot
> > on which a whole lot else depends.
>
> No, it is a lazy academic way of looking at things to spin miles of
> papers without an ounce of code in a publish or perish world. Free
> yourself of such considerations and get busy designing and coding
> something useful.
>

> Matt may not be pursuing NLAI any more than you but at least it appears
> he knows his limitations. I prefer to pursue the real thing instead of
> your fantasies Ian.

You were specifically asking for imagination. There is a lot of work to
do I make no bones about it. A lot of useful work can be done purely by
selecting replies from a database. Insurance selling is of course one
of them. My suggestion of extremist islamic sites is another. However
imagination requires us to go one stage further. Of course as we are
nowhere near "bueno espagnol" and if you say we need to concentrate on
this I will agree with you, with one important reservation. We need to
understand the basic theory, we need to be able to say that this
inplies that. We perhaps also need to talk about creativity in purely
humanistic terms. What can we do to encourage it.

- Ian Parker

James A. Bowery

unread,

Sep 13, 2006, 10:15:19 PM9/13/06

to

Overcomer wrote:
> One of the biggest problems in NL is it takes man-years to tweak
> dictionaries to meet needs for most any model which makes it harder for
> individuals. The only approach I see outside of being wealthy enough
> to hire a team or pursuasive enough to get free help like Linux was
> produced with is working on learning. Have you found anyone working in
> that direction?

http://www.cognitionresearch.org.uk/papers/overview/overview_short/jgw2.pdf

Inductive inference and IC come together in the concept of a `grammar'
for a language. The rules of
a grammar may be seen to be recurrent patterns in the language (at
various levels of abstraction) and
these rules may be used like a dictionary to compress a sample of the
language. They may also be used
to predict missing elements of a sample like \It's a lovely |".
Solomono® realised that, if we are trying to devise a grammar for a
given sample of language, there are
typically many alternative grammars that describe the sample, some
better than others. It is tempting
at ¯rst sight to choose the simplest grammar but this is typically
something trivial that generates lots
of `garbage' as well as valid sentences. Another possibility is to
choose the grammar that compresses a
sample of the language most e®ectively. But such a grammar will only
generate the original sample and
cannot generate other valid sentences.
Solomono®'s key insight was that, in choosing amongst the many
possible grammars that one might
infer from a sample of language, one should try to minimise (G+E),
where G is the size of the grammar
(measured in `bits') and E is the size of the sample (in bits) after it
has been encoded in terms of the
grammar. This guards against the induction of trivially small grammars
(where a small G is o®set by a
relatively large E) and also avoids the induction of excessively large
grammars (where E may be small
but G is disproportionately large).
This and related principles have various names but perhaps the most
useful umbrella term is \Mini-
mum Length Encoding" (MLE).

2.3 Language Learning and the ICMAUS concepts
My interest in these kinds of ideas was sparked originally by
fascinating lecturers about economical coding
in the nervous system given by Horace Barlow when I was an
undergraduate at Cambridge University.
Some time later, I developed two computer models of language learning:
MK10 which demonstrates
how a knowledge of the segmental structure of language (words, phrases
etc) can be bootstrapped with-
out error-correction or `supervision' by a `teacher' and SNPR|an
augmented version of MK10|that
demonstrates how grammars can be learned without external supervision.
In the course of developing
these models, the importance of economical coding and MLE principles
became increasingly clear.
At about that time, I became acquainted with Prolog and I was struck by
the parallels that seemed to
exist between that system, designed originally for theorem proving, and
my computer models of language
learning. A prominent feature of my learning models is a process of IC
by searching for patterns that
match each other and a process of merging or `unifying' patterns that
are the same. Although IC is not
a recognised feature of Prolog, a process of searching for patterns
that match each other is fundamental
in that system and the merging of matching patterns is an important
part of `uni¯cation' as that term is
understood in logic. It seemed possible that IC might have the same
fundamental importance in logic as
it has in inductive learning.
These observations led to the thought that it might be possible to
integrate inductive learning and
logical inference within a single system, dedicated to IC by pattern
matching, uni¯cation and search.
Further thinking suggested that the scope of this integrated system
might be expanded to include such
things as information retrieval, pattern recognition, parsing and
production of language, and probabilistic
inference.
Development of these ideas has been underway since 1987.

Ian Parker

unread,

Sep 14, 2006, 4:41:40 AM9/14/06

to

Grammar will undoubtedly come in, but apparently not now. The
remarkable thing about Latent Semantic Analysis is that it simply takes
words, does not attempt to parse or anything yet still gets good
results. It must however still be very sub optimal, there must be a lot
of information left. Sometimes I wonder what the effect of doing LSA in
a language like Latin would be, if we treated every inflexion as being
a different word.

As I have said all along, compression is theoretically AI. Whether or
not the present generation of AI (the programs in Matt's list of
publications) will be better on a monolingual text than simple Zip like
programs is an open question.

- Ian Parker

Overcomer

unread,

Sep 14, 2006, 4:59:50 PM9/14/06

to

Ian Parker wrote:
> Look I never said that there was not a lot of work to be done. In fact
> "El barco attravesta una cerradura" and "La estacion de ressorte"
> indicate how far off we are. In one sense it is quite remarkable how
> far Turing/Loebner have in fact gone with these glaring difficiencies.
> In fact the Jaberwocky, last years Loebner winner is in fact being
> used.
>
> http://www.creativematch.co.uk/viewNews/?92730
>
> For many applications, like insurance, imagination is not required, the
> salesperson/bot has to stick to a database and trivial calculations
> only are performed by the bot.

Have you ever worked in insurance? It is not quite as rote as you
imagine, at least in the real companies, not just the sales shops. The
underwriters must make decisions on things like whether to pay,
investigate, or stop the policy on a client. Also, there are computer
programmers trying to make the process faster. And like law, there are
many rules which require interpretation, not simple rote justice. Even
some of the trivial calculations are based on formulas influenced by a
perceived balance between risk of financial reward / loss for the
insurance itself vs. the perceived value to the customer who may leave
if the insurance is written in a way that is not actually covering a
real risk. Warren Buffett's GIECO sends liturature frequently but
having talked to their sales force, I know their claim of lower rates
than I pay now can only be met if they would accept me directly as a
preferred customer instead of a general customer so I can now reject
their junk mail -- a bot choice arrived at with imagination. So while
there may be applications in which bot programming would save time,
there is no complete lack of imagination in insurance.

> If we have got to what I would term the
> "bueno espagnol" stage. That is to say we can perform good translations
> which would be of UN interperator quality. (Compression -> bueno
> espagnol as has been demonstrated). Bueno espsagnol (and compression by
> inplication) directly enables us to do the following.

I repeat, it would be much better to convert to English or some other
language to unite the world instead of wasting computation to coddle
the irrational cultures of the world.

> 1) Recofnize speech. Gove the Jabwewocky a voice that speaks and an ear
> that listens as well as Jaws that bite and claws that catch This is
> so since the recognition of individual phonemes by computer is as good
> if not better than the recognition by humans. Why current speech
> recognition is so bad is because no context is included, there is no
> estimate made of the probability of different words based on meaning.

Although applications for this technology exist, some are poorly done
like HP's printer support bots which fail to recognize my voice even on
"yes." As for my research, I consider the Bible's warning against the
image which speaks important so I will not work on speech I/O so I
won't contribute to whatever that danger is. Obviously if the world
were automated nobody could earn money to pay for anything so everyone
would be poor and no progress would be made. So some technology is
ultimately worthless even if it could produce millionaires short term.
You have to look at the big picture, not in fantasy but reality.

> No I am simply saying that existing software, such as the Jaberwocky
> would pass the test given "bueno espagnol". I have felt instinctively
> for some time that bueno espagnol = Turing. Certainly the converse is
> true. Consider this. Two humans chat to one and another. We take their
> converstaion and compress it to zero entropy (we assume for the sake of
> argument we can do it). Now if we take the side of Human 1 the
> compressor MUST be predicting what human 2 is saying. If it cannot the
> fact of zero entropy is thereby contradicted. If therefore we have a
> database out of which replies are constructed, given a database of
> reasonable size we must have Turing. Bueno espagnol does not strictly
> speaking imply Turing although compression does.

I consider that inflation -- puffed up knowledge that floats over
reality needing deflation to fit the real world. If you have a
database of replies that perfectly predict what the other human is
saying, the human is no longer even slightly imaginative and therefore
unnecessary. Loebner implies Turing, not compression. Refer to Matt's
remark on imagination, which is the very thing LACKING from Loebner
entries which the human confederates have in abundance and correctly
thinking judges catch.

> You were specifically asking for imagination. There is a lot of work to
> do I make no bones about it. A lot of useful work can be done purely by
> selecting replies from a database. Insurance selling is of course one
> of them. My suggestion of extremist islamic sites is another. However
> imagination requires us to go one stage further. Of course as we are
> nowhere near "bueno espagnol" and if you say we need to concentrate on
> this I will agree with you, with one important reservation. We need to
> understand the basic theory, we need to be able to say that this
> inplies that. We perhaps also need to talk about creativity in purely
> humanistic terms. What can we do to encourage it.

That is a step I agree with but it is hard for me to answer. It is so
necessary Loebner entrants will always fail without it. To use the HP
failed voice example, I percieve it is the ability to adapt to new
input quantizing which will take care of the edge instead of the
majority. In a standard distribution of responses, software that
responds correctly to one to three standard deviations is so necessary
it should be considered as basic as the character set -- required at
the lowest level. The responses on four to five standard deviations
ought to be considered the common ordinary level of performance
programmers are fired for if they don't achieve. The responses on six
and above deviations should be considered the leading edge where actual
work is accomplished. That may be a rough but generic way of stating
the need for imagination & creativity.

AI Programmers have to achieve 6 deviations or a program with IQ of 100
to enter the game, then they START earning their pay. That is quite a
requirement but if it were observed by employers more often, it might
eventually result in more imagination & creativity. That requirement
would encourage it. The question of how to achieve it can be stated in
an inflated way -- you start with standard deviation 1 or IQ 1 and keep
working until you reach the minimum of SD 6 or IQ 100 then keep
improving beyond that. It is like a student which must achieve a GED
to enter college, then the student can really start learning --
likewise for programs. Last I heard Cyc was considered 4th grade and
OpenCyc is supposed to be 2 years behind that and all v.1 had was
WordNet. If we tell Douglas B. Lenat your CYC software is not worth
working on until it achieves a GED that would require imagination in
CYC but it wouldn't necessarily achieve it. It may be the whole CYC
approach needs to be abandon to develop something with enough
imagination that is worth starting work on. But trust me, compression
isn't closer to doing imagination in theory or practice than CYC.

Matt Mahoney

unread,

Sep 14, 2006, 6:03:46 PM9/14/06

to

Online and offline modeling are equivalent in compression power. For
the algorithms I was using, online methods seem easier to implement.
Offline methods have the disadvantage that you need to find an
efficient representation for your model. But I am looking at 2 or 3
pass algorithms now.

Pass 1 - build a dictionary.
Pass 2 - reduce dictionary by stemming, removing rare words.
Pass 3 - tokenize and compress.

This requires that I compress the dictionary and transmit it along with
the text.

-- Matt Mahoney

Ian Parker

unread,

Sep 15, 2006, 9:57:05 AM9/15/06

to

Overcomer wrote:
> Ian Parker wrote:

> > http://www.creativematch.co.uk/viewNews/?92730
> >
> > For many applications, like insurance, imagination is not required, the
> > salesperson/bot has to stick to a database and trivial calculations
> > only are performed by the bot.
>
> Have you ever worked in insurance? It is not quite as rote as you
> imagine, at least in the real companies, not just the sales shops. The
> underwriters must make decisions on things like whether to pay,
> investigate, or stop the policy on a client. Also, there are computer
> programmers trying to make the process faster. And like law, there are
> many rules which require interpretation, not simple rote justice. Even
> some of the trivial calculations are based on formulas influenced by a
> perceived balance between risk of financial reward / loss for the
> insurance itself vs. the perceived value to the customer who may leave
> if the insurance is written in a way that is not actually covering a
> real risk. Warren Buffett's GIECO sends liturature frequently but
> having talked to their sales force, I know their claim of lower rates
> than I pay now can only be met if they would accept me directly as a
> preferred customer instead of a general customer so I can now reject
> their junk mail -- a bot choice arrived at with imagination. So while
> there may be applications in which bot programming would save time,
> there is no complete lack of imagination in insurance.
>

I think we may mean different things when we talk about "imagination".
To me it is clear that an insurance salesperson needs to stick to a
database. You simply cannot make policy on the hoof.

What you may be saying is that a salesperson needs to have empathy with
the purchaser. Now psychologists have given us personality tests.
According to security chiefs threat can be measured by the tone of
voice being used. If you fit the jabberwocky up with psychological
soiftware/appeciation of body language, it could do an evaluation. I
said "could", nowhere in the reference did it say "does". Now you may
call this imagination. On my definition the act of imagination lies in
the appication of psychological testing. The fact that a computer will
evaluate data extremely rapidly does not make it imaginative even
though it may be seen to be psychic when viewed by humans in real time.

For the other example I gave of extreme Islam, ditto. Real time
psychological evaluation is needed.

> > If we have got to what I would term the
> > "bueno espagnol" stage. That is to say we can perform good translations
> > which would be of UN interperator quality. (Compression -> bueno
> > espagnol as has been demonstrated). Bueno espsagnol (and compression by
> > inplication) directly enables us to do the following.
>
> I repeat, it would be much better to convert to English or some other
> language to unite the world instead of wasting computation to coddle
> the irrational cultures of the world.
>

That is your opinion. There are 2 possible relies.

1) Scientific - If I say "La estacion de ressorte" I am answering a
question. That question is "Has the computer got any internal
representation of meaning, hyperspatial LSA or anything else?" The
answer if "NO". We know that to compress we need this representation
(proof reductio ad absurdam as given earlier). We know therefore that
compression -> bueno espagnol. We know that our station (season) is
thus a stiff spring - rather imcompressible! We also suspect that if we
do get good translatons we can get Wiki to the required 16MB or so.

2) Political (A BIT OFF TOPIC)- Why must everyone speak one language?
Americans must realize that they live in a hemisphere. If they did that
relations with other countries would be improved no end.

> > 1) Recofnize speech. Gove the Jabwewocky a voice that speaks and an ear
> > that listens as well as Jaws that bite and claws that catch This is
> > so since the recognition of individual phonemes by computer is as good
> > if not better than the recognition by humans. Why current speech
> > recognition is so bad is because no context is included, there is no
> > estimate made of the probability of different words based on meaning.
>
> Although applications for this technology exist, some are poorly done
> like HP's printer support bots which fail to recognize my voice even on
> "yes." As for my research, I consider the Bible's warning against the
> image which speaks important so I will not work on speech I/O so I
> won't contribute to whatever that danger is. Obviously if the world
> were automated nobody could earn money to pay for anything so everyone
> would be poor and no progress would be made. So some technology is
> ultimately worthless even if it could produce millionaires short term.
> You have to look at the big picture, not in fantasy but reality.
>

The basic point is that ambiguities are exactly the same as thise in
translation. Speech is the SAME problem logically as language
translation. There are words which sould EXACTLY the same. I will give
the Spanish to illustrate the point.

knight, night - caballero, noche
wether, weather - si, tiempo
there, their - aquel, su

Imagine you are translating phonetic English text into Spanish. It is
the SAME problem as translating anything into Spanish.

Looking at it as a compression problem we can be even more definite.
How do we compress? Basically we can produce a zero entropy string by
guessing words. To avoid ambiguites in speech we again use the
guesswork of compression. We have a compression algorithm and we
register the most probable words.

OK you can have a measure of speech recognition based on pure phonemes.
Many people however find this unsatisfactory.

> > No I am simply saying that existing software, such as the Jaberwocky
> > would pass the test given "bueno espagnol". I have felt instinctively
> > for some time that bueno espagnol = Turing. Certainly the converse is
> > true. Consider this. Two humans chat to one and another. We take their
> > converstaion and compress it to zero entropy (we assume for the sake of
> > argument we can do it). Now if we take the side of Human 1 the
> > compressor MUST be predicting what human 2 is saying. If it cannot the
> > fact of zero entropy is thereby contradicted. If therefore we have a
> > database out of which replies are constructed, given a database of
> > reasonable size we must have Turing. Bueno espagnol does not strictly
> > speaking imply Turing although compression does.
>
> I consider that inflation -- puffed up knowledge that floats over
> reality needing deflation to fit the real world. If you have a
> database of replies that perfectly predict what the other human is
> saying, the human is no longer even slightly imaginative and therefore
> unnecessary. Loebner implies Turing, not compression. Refer to Matt's
> remark on imagination, which is the very thing LACKING from Loebner
> entries which the human confederates have in abundance and correctly
> thinking judges catch.

You do not need perfect prediction for reductio ad absurdam to be
valid.

>
> > You were specifically asking for imagination. There is a lot of work to
> > do I make no bones about it. A lot of useful work can be done purely by
> > selecting replies from a database. Insurance selling is of course one
> > of them. My suggestion of extremist islamic sites is another. However
> > imagination requires us to go one stage further. Of course as we are
> > nowhere near "bueno espagnol" and if you say we need to concentrate on
> > this I will agree with you, with one important reservation. We need to
> > understand the basic theory, we need to be able to say that this
> > inplies that. We perhaps also need to talk about creativity in purely
> > humanistic terms. What can we do to encourage it.
>
> That is a step I agree with but it is hard for me to answer. It is so
> necessary Loebner entrants will always fail without it. To use the HP
> failed voice example, I percieve it is the ability to adapt to new
> input quantizing which will take care of the edge instead of the
> majority. In a standard distribution of responses, software that
> responds correctly to one to three standard deviations is so necessary
> it should be considered as basic as the character set -- required at
> the lowest level. The responses on four to five standard deviations
> ought to be considered the common ordinary level of performance
> programmers are fired for if they don't achieve. The responses on six
> and above deviations should be considered the leading edge where actual
> work is accomplished. That may be a rough but generic way of stating
> the need for imagination & creativity.
>
> AI Programmers have to achieve 6 deviations or a program with IQ of 100

Don't talk about IQ. IQ represents questions of a specific type. A
computer with "bueno espasgnol" that is to say logical understanding
would have an IQ of 200. You see mathematical IQ is a given set of
matches that you go through in turn. This is where my "Java Compendium"
would come in. All you do is have a sequence of tests.

As far as verbal reasoning is concerned. This would indeed be a direct
test of the computer's internal representation. Personally I would
prefer to look at a Spanish or French translation rather than set an IQ
test.

> to enter the game, then they START earning their pay. That is quite a
> requirement but if it were observed by employers more often, it might
> eventually result in more imagination & creativity. That requirement
> would encourage it. The question of how to achieve it can be stated in
> an inflated way -- you start with standard deviation 1 or IQ 1 and keep
> working until you reach the minimum of SD 6 or IQ 100 then keep
> improving beyond that. It is like a student which must achieve a GED
> to enter college, then the student can really start learning --
> likewise for programs.

Bur we are NOT students. Our boundaries are differently. A student
makes mistakes in Spanish because of poor knowledge of Spanish. A
computer makes mistakes because of poor knowledge of English and the
lack of an adaquate knowledge representation.

>Last I heard Cyc was considered 4th grade and

See above - You either have "La estacion de ressorte" - or you are at
UN interperater level.

> OpenCyc is supposed to be 2 years behind that and all v.1 had was
> WordNet. If we tell Douglas B. Lenat your CYC software is not worth
> working on until it achieves a GED that would require imagination in
> CYC but it wouldn't necessarily achieve it. It may be the whole CYC
> approach needs to be abandon to develop something with enough
> imagination that is worth starting work on. But trust me, compression
> isn't closer to doing imagination in theory or practice than CYC.

We will see how CYC goes. I believe that a good linguistic approach
woulld augment CYC in the ways suggested.

- Ian Parker

Overcomer

unread,

Sep 15, 2006, 10:57:29 PM9/15/06

to

I think you need more skill in reading. Try again on the above. I
clearly was not describing salespersons. Insurance has many
applications salespeople never see. Yet in a sense, the entire
workforce is a part of selling their products.

> What you may be saying is that a salesperson needs to have empathy with
> the purchaser. Now psychologists have given us personality tests.
> According to security chiefs threat can be measured by the tone of
> voice being used. If you fit the jabberwocky up with psychological
> soiftware/appeciation of body language, it could do an evaluation. I
> said "could", nowhere in the reference did it say "does". Now you may
> call this imagination. On my definition the act of imagination lies in
> the appication of psychological testing. The fact that a computer will
> evaluate data extremely rapidly does not make it imaginative even
> though it may be seen to be psychic when viewed by humans in real time.

Bull. All those pseudo psych employment tests are simply a way for
LAZY personnel people to discriminate between applicants without
resorting to illegal values such as race, age, sex, religion, and
whatever the 4 others are or doing something that makes them look worse
like hiring first come first serve. What they really should be doing
is evaluating job performance with trial job tasks like HP used to do
before they moved their printer manufacturing overseas. Psychologists
are probably the least valueable professionals requiring an acedemic
degree -- the whole concept is not science but an accumulation of
coincidental results wrapped in a facade of double-blind studies. It
is more likely hiring first come first serve would produce as good
employees as "psychological testing."

> For the other example I gave of extreme Islam, ditto. Real time
> psychological evaluation is needed.

Not really. What is needed is re-education to adopt the only true
religion, that of Jesus Christ.

> > I repeat, it would be much better to convert to English or some other
> > language to unite the world instead of wasting computation to coddle
> > the irrational cultures of the world.
> >
> That is your opinion.

It is also a statement of fact.

> 1) Scientific - If I say "La estacion de ressorte" I am answering a

Nope, you are saying a vacuity, words which have no meaning to me or
most other English speakers. I note few posts by non-English speakers
here. So why not get a clue from that and translate your own words or
abandon your decent into foreign tongues as I suggested?

> 2) Political (A BIT OFF TOPIC)- Why must everyone speak one language?
> Americans must realize that they live in a hemisphere. If they did that
> relations with other countries would be improved no end.

Have you bats flying about your belfry man? What does geography have
to do with language? Convert the world to English or Hebrew and trash
the obsolete.

The improvement in relations came from being oceans distant from the
irrational power mad Europeans. We developed supreme intelligence and
economy making USA the target or example for most of the world's
citizens who want to improve thier standard of living. Note: We did
not become the world's sole remaining superpower by speaking obsolete
languages. We learned to speak other languages mainly for security to
spy on the backward nations of the world. We can have no better
relations than being the envy of the world. That some mini-minds
choose jealousy instead of admiration and emulation is their problem
which we can end as we did with Sadam.

> Imagine you are translating phonetic English text into Spanish. It is
> the SAME problem as translating anything into Spanish.

Actually it is not a problem at all for me as I don't know obsolete
languages like Spanish nor do I forsee my products will ever need to
know Spanish. Spanish is a dead language like Latin. Even genuine
Chinese know their language is dead since China politically obsoleted
and replaced it with a dumb language which they renamed Chinese. That
was a wasted effort intended to brainwash the masses by enforcing
ignorance to the thoughts embedded in real Chinese. To make an
improved language they should have replaced the whole with English.

> Looking at it as a compression problem we can be even more definite.
> How do we compress? Basically we can produce a zero entropy string by
> guessing words. To avoid ambiguites in speech we again use the
> guesswork of compression. We have a compression algorithm and we
> register the most probable words.

I guess you have yet to write a program. Have you any background in
computers other than as a newsgroup poster?

> Don't talk about IQ. IQ represents questions of a specific type. A
> computer with "bueno espasgnol" that is to say logical understanding
> would have an IQ of 200. You see mathematical IQ is a given set of
> matches that you go through in turn. This is where my "Java Compendium"
> would come in. All you do is have a sequence of tests.

See question above. You pretend, you inflate, you fantasize but you
have no reality because you cannot produce IQ100 let alone IQ200. Your
Java tests are as much vaporhype as Java itself. Java is in no way the
best computer language but many people were brainwashed by Sun
marketing into thinking so. This is your function it seems, you are a
human brainwasher instead of a real computer programmer. Why not
switch sides and brainwash all Spanish speakers to drop the obsolete
hinderance and become English or Hebrew speakers?

> As far as verbal reasoning is concerned. This would indeed be a direct
> test of the computer's internal representation. Personally I would
> prefer to look at a Spanish or French translation rather than set an IQ
> test.

That is your preference. In reality, the words of a Spanish or French
saying nothing continue to say nothing in translation. IQ is an
attempt to measure something of content.

> Bur we are NOT students. Our boundaries are differently. A student
> makes mistakes in Spanish because of poor knowledge of Spanish. A
> computer makes mistakes because of poor knowledge of English and the
> lack of an adaquate knowledge representation.

That is your personal problem. You see my replies to your invalid
nonsense and never learn or humble yourself to submit to my suggestion
to write a computer program. That you are no student is quite evident.
I am likely old compared to you but I still take classes when I think
I would improve by them. Last year I spent over $500 to learn some
electronics. So I have a life-long attitude as a student. You are
like a rote program that was released before it learned enough.

Ian Parker

unread,

Sep 16, 2006, 7:49:56 AM9/16/06

to

Could you just answer two simnple questions.

1) What do IQ tests measure?
2) Is it relevant for AI?

You cannot apply human type intelligence tests to AI.

- Ian Parker

Overcomer

unread,

Sep 16, 2006, 1:31:51 PM9/16/06

to

Ian Parker wrote:
> Could you just answer two simnple questions.
>
> 1) What do IQ tests measure?

The degree of trivia accumulation including some simple perception and
problem solving skills.

> 2) Is it relevant for AI?

Certianly much more relevant than "compression." Compression can be
made to include some NL but it will NEVER achieve AI because it will
NEVER achieve creative imagination (per Matt). Therefore you shouldn't
often mention AI when mentioning compression since they are generally
unlinked / incompatible concepts.

> You cannot apply human type intelligence tests to AI.

Here's a simple question for you -- what is AI?

James A. Bowery

unread,

Sep 16, 2006, 5:23:20 PM9/16/06

to

Overcomer wrote:
> Certianly much more relevant than "compression." Compression can be
> made to include some NL but it will NEVER achieve AI because it will
> NEVER achieve creative imagination (per Matt).

Every time a compression algorithm creates an ontology, which it does
every time it creates a dictionary or any other set of patterns, it is
engaging in "creative imagination".

The only question is how effective are its reifications to compression.
That is the measure of the degree of its creativity.

Overcomer

unread,

Sep 17, 2006, 12:36:38 AM9/17/06

to

James A. Bowery wrote:
> Overcomer wrote:
> > Certianly much more relevant than "compression." Compression can be
> > made to include some NL but it will NEVER achieve AI because it will
> > NEVER achieve creative imagination (per Matt).
>
> Every time a compression algorithm creates an ontology, which it does
> every time it creates a dictionary or any other set of patterns, it is
> engaging in "creative imagination".

In the limited sense by which an insect which finds its way into your
home and becomes a bed-bug is creatively imaginative. You are talking
on such a low level of creative imagination, I would say Matt's comment
rules.

> The only question is how effective are its reifications to compression.
> That is the measure of the degree of its creativity.

That is nonsense. You seem to be programmed or shall we say
brainwashed like Ian. Get a clue! That may be a measure of success
relative to compression but it has virtually nothing to do with
creative imagination. Matt knows this, why do you inflate you
statements to fantasy?.

James A. Bowery

unread,

Sep 17, 2006, 1:06:48 AM9/17/06

to

Overcomer wrote:

> James A. Bowery wrote:
> > Every time a compression algorithm creates an ontology, which it does
> > every time it creates a dictionary or any other set of patterns, it is
> > engaging in "creative imagination".

> > The only question is how effective are its reifications to compression.
> > That is the measure of the degree of its creativity.
>
> That is nonsense. You seem to be programmed or shall we say
> brainwashed like Ian. Get a clue! That may be a measure of success
> relative to compression but it has virtually nothing to do with
> creative imagination. Matt knows this, why do you inflate you
> statements to fantasy?.

I'm simply providing an operation definition of creative imagination
and I believe it to be powerful enough to encompass the stuff people
normally mean when they use the phrase.

What is _your_ operational definition of creative imagination?

Ian Parker

unread,

Sep 17, 2006, 10:39:16 AM9/17/06

to

Overcomer wrote:
> Ian Parker wrote:
> > Could you just answer two simnple questions.
> >
> > 1) What do IQ tests measure?
>
> The degree of trivia accumulation including some simple perception and
> problem solving skills.
>

Excellent. Why then do you build them up as being a good metric. In
fact the general principle of the mathematical part is precisely
compression. You are asked to predict & compress. As stated a computer
will (by definition almost) score 100% on anything formalized. Certain
series (like fibonacci have been formalized. We say Fibonacci (this is
an integer code) a,b and this predicts everything. - Perfect
compression eh.

On verbal reasoning the objective is a meaning metric. This will in
fact result from compression

> > 2) Is it relevant for AI?
>
> Certianly much more relevant than "compression." Compression can be
> made to include some NL but it will NEVER achieve AI because it will
> NEVER achieve creative imagination (per Matt). Therefore you shouldn't
> often mention AI when mentioning compression since they are generally
> unlinked / incompatible concepts.

Compression is the sum total of the mathematical tests and compression
-> "bueno espagnol" in verbal reasoning.

>
> > You cannot apply human type intelligence tests to AI.
>
> Here's a simple question for you -- what is AI?

What is intelligence. This is quite a tricky philosophical question. We
talk about strong AI and weak AI. What we are discussing now is weak.
Linguistics is now generally regarded as a branch of AI.

- Ian Parker

Overcomer

unread,

Sep 17, 2006, 12:35:51 PM9/17/06

to

Ian Parker wrote:
> Excellent. Why then do you build them up as being a good metric.

The IQ test is at least real even if weak. Compression has NOTHING to
do with AI. I am pursuing AI.

> In
> fact the general principle of the mathematical part is precisely
> compression. You are asked to predict & compress. As stated a computer
> will (by definition almost) score 100% on anything formalized. Certain
> series (like fibonacci have been formalized. We say Fibonacci (this is
> an integer code) a,b and this predicts everything. - Perfect
> compression eh.

True, math has a great compressive strength. It is not too likely that
anyone can make a set of NL formulas that perfectly (losslessly)
predict sentences but I grant that would be a useful area of research
that would have overlap into AI IFF they can be generalized. Such a
set of NL formulas wouldn't be useful for AI if they were always
calculated specific to the input text as would be desired for maximum
compression.

> > > You cannot apply human type intelligence tests to AI.
> >
> > Here's a simple question for you -- what is AI?
>
> What is intelligence. This is quite a tricky philosophical question. We
> talk about strong AI and weak AI. What we are discussing now is weak.
> Linguistics is now generally regarded as a branch of AI.

Why then do you say you cannot apply human type INTELLIGENCE tests to
AI?

Overcomer

unread,

Sep 17, 2006, 12:38:40 PM9/17/06

to

James A. Bowery wrote:
> I'm simply providing an operation definition of creative imagination
> and I believe it to be powerful enough to encompass the stuff people
> normally mean when they use the phrase.
>
> What is _your_ operational definition of creative imagination?

See previous post on this thread answering Ian.

James A. Bowery

unread,

Sep 17, 2006, 3:48:48 PM9/17/06

to

It's sort of hard to make sense out of all the responses filled insults
and appeals to authority like "per Matt". I even went so far as to
search for all occurances of "creative" and "imagination" to try to
decipher what you are saying.

I asked you a simple question.

Provide a simple answer: What is _your_ operational definition of
creative imagination?

Cut and paste if you need to go find the precise words you previously
wrote.

Overcomer

unread,

Sep 17, 2006, 7:33:33 PM9/17/06

to

James A. Bowery wrote:
> > > What is _your_ operational definition of creative imagination?
> >
> > See previous post on this thread answering Ian.
>
> It's sort of hard to make sense out of all the responses filled insults
> and appeals to authority like "per Matt". I even went so far as to
> search for all occurances of "creative" and "imagination" to try to
> decipher what you are saying.
>

That plus this enables you to grade yourself on literature research.
See my post of Sept. 9 which includes:

Ian Parker wrote:
> First question - What do you define creativity as?

Several definitions come to mind. Professors mint PhD's for expanding
the state of the art in a known field or inventing a new field of
study. The Biblical useage is making something from nothing (as
opposed to recycling). A cute one used in the hippy era was dealing
with a mess as a mess instead of organizing it. I personally like to
discover new things nobody I've heard of has discovered before,
although I don't do exhaustive research to find if ANYBODY has EVER
discovered the same things. One definition for children is
fingerpainting -- making patterns in the sandbox of life that are at
least pretty and hopefully unique and useful. In NL educated by the
Bible, one could specify absolute perfection.

> If natural language could be readily undersood,

> creativity would take place as part of the process of manipulating
> linguistic ideas.

No not necessarily. Linguistic ideas as taught where I went to college
made for a class that wasn't worth taking. Let's postulate the
existance of a demonic force or human knowledge virus called
anti-creativity which causes people infected to operate in ways that
defeat creativity instead of nurture it (Communism for example). The

linguistic class I skipped was like that. So creativity is NOT a

natural outcome of text processing but it may happen by coincidence and

should be pursued as a distinct goal in its own right.

> ... Compression alone will not produce imaginative solutions. It could

James A. Bowery

unread,

Sep 17, 2006, 8:50:05 PM9/17/06

to

Overcomer wrote:
> James A. Bowery wrote:
> > > > What is _your_ operational definition of creative imagination?

The following isn't even close to an operation definition of creative
imagination.

Overcomer wrote:
> Ian Parker wrote:
> > First question - What do you define creativity as?
>
> Several definitions come to mind. Professors mint PhD's for expanding
> the state of the art in a known field or inventing a new field of
> study. The Biblical useage is making something from nothing (as
> opposed to recycling). A cute one used in the hippy era was dealing
> with a mess as a mess instead of organizing it. I personally like to
> discover new things nobody I've heard of has discovered before,
> although I don't do exhaustive research to find if ANYBODY has EVER
> discovered the same things. One definition for children is
> fingerpainting -- making patterns in the sandbox of life that are at
> least pretty and hopefully unique and useful. In NL educated by the
> Bible, one could specify absolute perfection.

...

> > ... Compression alone will not produce imaginative solutions. It could
> > remove an important block. If we were to treat concepts as Java
> > Classes we could make progress. In fact something like a class is now
> > one scool of AI is approaching the problem.
>
> This is correct thinking as far as it goes.

My statement was that compression requires creative imagination. I
stand by that statement but further assert that compression provides
the basis for creative imagination via the model or grammar. The two
part code of algorithmic statistics (or the two part messages of
minimum message length, minimum description length etc) has, as one of
its parts, the model which provides the foundation for the other part
-- the compressed message or data. Strip off the compressed
message/data and you have removed the "over-training" constraints on
the overall message and can thereby generate all manner of
possibilities shaped by the model.

This is what people do when they creatively imagine except that people
frequenly suspend belief in various parts of their model and provide
alternate hypotheses upon which their creative imaginations are based
-- something straight forward to do with the model portion of the
algorithmic statistics -- you just add or subtract (or add and subtrace
-- modify) various components of the model.

Note, this doesn't mean you can create an algorithm that can find the
Kolmogorov program for any sequence of observations but then neither
can people.