66 views

Skip to first unread message

Aug 6, 2006, 3:39:16 PM8/6/06

to

Introducing the Hutter Prize for Lossless Compression of Human

Knowledge

Knowledge

Artificial intelligence researchers finally have an objective and

rigorously validated measure of the intelligence of their machines.

Furthermore the higher the measured intelligence of their machines the

more money they can win via the Hutter Prize.

The purse for the Hutter Prize was initially underwritten with a 50,000

Euro commitment to the prize fund by Marcus Hutter of the Swiss Dalle

Molle Institute for Artificial Intelligence, affiliated with the

University of Lugano and The University of Applied Sciences of Southern

Switzerland.

The theoretic basis of the Hutter Prize is related to an insight by the

12th century philosopher, William of Ockham, called "Ockham's Razor",

sometimes quoted as: "It is vain to do with more what can be done with

less." But it was not till the year 2000 that this was mathematically

proven*, by Marcus Hutter, to be a founding principle of intelligence.

Indeed, Hutter's Razor** might be phrased, "It is truer to explain with

less that which can be explained with more."

There have been previous tests and related prizes for artificial

intelligence, such as the Turing Test and the Loebner Prize. However,

these tests suffered from subjective definitions of intelligence.

Hutter's recent theoretic breakthrough creates a mathematics of

artificial intelligence which accurately measures the degree of

intelligence possessed by an artificial agent. It does so by measuring

how succinctly it represents knowledge of the world. As Hutter has now

proven, the most succinct computer model of the world isn't just the

most aesthetic or memory-efficient out of all models of the known

observations -- it also most accurately predicts new observations. In

short, it is the most intelligent.

Artificial intelligence has thereby entered the realm of engineering:

Lossless compression of human knowledge.

This is momentous because by optimizing for rigorous metrics, the field

of artificial intelligence may finally clarify the murky waters of

inadequate definition, within which it has been haphazardly swimming

for the last 50 years, to become both a hard science and tractable

engineering discipline.

Named for the discoverer of the proof and the initial 50,000 Euro

donor, the Hutter Prize currently targets the compression of a 100

megabyte sample of human knowledge drawn from the broadly based

Wikipedia online encyclopedia. As Moore's Law increases the capacity

of machines, and as additional donations to the prize fund increase the

incentives of contestants, the intent is to increase the amount of

knowledge targeted for compression. It is reasonable to expect that

the 100 megabyte sample will produce, at the very least, advances in

linguistic modeling. As the targeted depth and breadth of knowledge

increases, conceptual frameworks will come into play, eventually

covering the range of disciplines from political science to physics by

applying theories that prove optimal in compressing the target.

A common objection to this approach to artificial intelligence is that

it offers little that is new -- that the computational difficulty of

searching for patterns in data remains what it has always been. This

objection misses two important points:

1) Hutter's proof provides a new mathematics of intelligence allowing

for "top down" theoretic advances which may render many problems

tractable that otherwise appear intractable.

2) There is a large overlap between succinctly codified knowledge and

an intelligent compression program. Indeed, a reasonable definition of

"knowledge" is that it optimizes the compression of new observations as

instances of old patterns. This means that even if a compressor does

nothing but apply codified human knowledge, generating no new knowledge

of its own, it can still demonstrate greater intelligence than

competing programs and thereby make measurable progress toward

artificial intelligence but also -- and this is key -- progressively

more intelligent bodies of human-generated knowledge by pitting those

bodies of knowledge against each other in what might be called an

epistemological tournament.

The formula for winnings is modeled after the M-Prize or Methuselah

Mouse Prize, which awards money to longevity researchers for progress

in keeping mice alive the longest. Here, modified for compression

ratios, is the formula:

S = size of program outputting the uncompressed knowledge

Snew = new record

Sprev = previous record

P = [Sprev - Snew] / Sprev = percent improvement

Award monies:

Fund contains: Z at noon GMT on day of new record

Winner receives: Z * P

Initially Z is 50,000 Euro with a minimum payout of 500 Euro (or

minimum improvement of 1% over the prior winner).

Donations are welcome. The history of improvement in The Calgary

Corpus Compression Challenge*** is about 3% per year. The larger the

commitment from donors to the fund, the greater the rate of progress

toward a high quality body of human knowledge and, quite possibly, the

long-held promise of artificial intelligence.

For further details of the Hutter Prize see:

For discussion of the Hutter Prize see:

http://groups.google.com/group/Hutter-Prize

-- Jim Bowery

* http://www.hutter1.net/ai/uaibook.htm

** Hutter's Razor has some caveats relating to the nature of the

universe and computability, but those conditions must be met for any

computer-based intelligence.

Aug 7, 2006, 8:22:32 PM8/7/06

to

Wasn't it Fisher in the 20's who proved just about the same thing about

maximum likelihood estimates?

Certainly, there is nothing new in the thesis that if you have an

estimate q of a probability p, then the using q as a compression model

gives minimum compressed size for p = q. Moreover, this minimum is

unique. This is the same as saying that the maximum likelihood value

of q is p (not quite the same as the maximum likelihood *estimator*, of

course).

For exchangeable events x_i, this is easy to prove using the convexity

of the log function,

log(x) <= x-1

By Finetti's theorem, the x_i are conditionally independent so

p(X) = prod_i p(x_i | theta)

E[ log q(x) ] = sum_x p(x) log q(x)

But, comparing with the entropy of the distribution p,

sum_x p(x) log q(x) + H(p) = sum_x p(x) log q(x) / p(x)

and by the inequality above,

sum_x p(x) log (q(x) / p(x)) <= sum_x p(x) [q(x) / p(x) - 1]

<= sum_x q(x) - sum_x p(x) = 0

Thus

- sum_x p(x) log q(x) >= H(p)

and

- sum_x p(x) log q(x) = H(p) <==> p = q

But this is just the expected length of an optimal compressor

representing X using q as a model. Thus, minimum expected compressed

size implies p = q.

Also, the expected compressed size can be estimated by compressing a

large number of x_i, thus the minimum compressed size estimator is

asymptotically equal to p.

All of the excitement is, of course, before the asymptote kicks in.

How can we estimate p with fewer x_i.

Why this is news is beyond me.

James A. Bowery wrote:

> ... But it was not till the year 2000 that this was mathematically

> proven*, by Marcus Hutter, to be a founding principle of intelligence.

> Indeed, Hutter's Razor** might be phrased, "It is truer to explain with

> less that which can be explained with more."

>

> ... As Hutter has now

Aug 8, 2006, 4:12:35 PM8/8/06

to

Ted (and James),

We often argue about lots of things, but I am on your side here.

The insistence on limited program size pretty much excuses it from the

realm of the scientific, as it diverts people's efforts from the claimed

goals of the prize and onto clever and concise programming methods, and

in the process obscures what might otherwise be simple but powerful

compression methods. It also indirectly (and needlessly) presumes an

18MB simplicity in human experience, thought, and expression that

probably just isn't there.

Research programs like this should be coded in some user friendly and

extensively checked medium like Microsoft Access, which probably

requires 18MB just for "Hello World". Even the .NET run-time package

would exceed this. HOWEVER, it is unclear whether the run-time package

would count in the total, but the contest seems to presume an EXE

instead of (say) an MDB like Access would use. Probably the best

approach to squeeze things into the smallest EXE would be to "go retro"

with, say, VB5, but to what end - to try to fit some screwball rule?

Further, the program would be disqualified if it were to correct

grammatical errors and/or summarize cumbersome statements using the very

computer understanding that the prise is supposed to be promoting.

Indeed, abandoning lossless compression would pretty much yield what the

semantic web searching people are looking for once it gets good enough.

Further, this sort of a competition blows right past the ITAR problems

that I have posted in the past - a VERY good thing.

Yes, the world would GREATLY benefit from this sort of research, but the

"minor" flaws in this particular competition appear (to me) to be rather

fatal to its stated goals.

Sad. I was really excited about competing until I saw the flaws.

Steve Richfie1d

===============

Aug 8, 2006, 5:00:20 PM8/8/06

to

Ted Dunning wrote:

> Wasn't it Fisher in the 20's who proved just about the same thing about

> maximum likelihood estimates?

>

> Certainly, there is nothing new in the thesis that if you have an

> estimate q of a probability p, then the using q as a compression model

> gives minimum compressed size for p = q. Moreover, this minimum is

> unique. This is the same as saying that the maximum likelihood value

> of q is p (not quite the same as the maximum likelihood *estimator*, of

> course).

> Wasn't it Fisher in the 20's who proved just about the same thing about

> maximum likelihood estimates?

>

> Certainly, there is nothing new in the thesis that if you have an

> estimate q of a probability p, then the using q as a compression model

> gives minimum compressed size for p = q. Moreover, this minimum is

> unique. This is the same as saying that the maximum likelihood value

> of q is p (not quite the same as the maximum likelihood *estimator*, of

> course).

I think it was Shannon who proved this in 1948 (discete noiseless

channel capacity theorem).

http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

I could not find any reference to Fisher in his paper.

> Why this is news is beyond me.

That isn't what Hutter proved. What he proved is that the optimal

behavior of a goal seeking agent in an unknown but computable

environment is to guess that the environment is controlled by the

shortest program (Turing machine) consistent with all behavior observed

so far. Thus, a solution to the problem of maximizing the accumulated

reward signal from an unknown environment reduces to algorithmic

compression, which is an undecidable problem. However, Hutter also

proved that in the restricted case of an envoronment bounded by space s

and time t, that the problem is computable in time O(t2^s) (so the

problem is still hard).

When I developed the large text benchmark (without funding) my goal was

to promote research in NLP. There are 3 arguments why text compression

implies AI.

1. Hutter's proof in the specific case of text prediction (a problem

that is easy for people but hard for machines).

2. An explicit representation of the probability distribution of

natural language strings (a correct language model) enables one to pass

the Turing test for AI.

3. Compression ratio (perplexity) is already used in practice to

measure the quality of language models (mainly for speech recognition).

However, AI does *not* imply compression. Decompression requires a

deterministic computation to invert the compression function, and the

human brain is not deterministic. I believe this is the reason that

many people doubt the converse. I discuss these issues in

http://cs.fit.edu/~mmahoney/compression/rationale.html

-- Matt Mahoney

Aug 8, 2006, 5:45:30 PM8/8/06

to

Steve Richfie1d wrote:

> Ted (and James),

>

> We often argue about lots of things, but I am on your side here.

>

> The insistence on limited program size pretty much excuses it from the

> realm of the scientific, as it diverts people's efforts from the claimed

> goals of the prize and onto clever and concise programming methods, and

> in the process obscures what might otherwise be simple but powerful

> compression methods. It also indirectly (and needlessly) presumes an

> 18MB simplicity in human experience, thought, and expression that

> probably just isn't there.

> Ted (and James),

>

> We often argue about lots of things, but I am on your side here.

>

> The insistence on limited program size pretty much excuses it from the

> realm of the scientific, as it diverts people's efforts from the claimed

> goals of the prize and onto clever and concise programming methods, and

> in the process obscures what might otherwise be simple but powerful

> compression methods. It also indirectly (and needlessly) presumes an

> 18MB simplicity in human experience, thought, and expression that

> probably just isn't there.

Marcus Hutter and I disagreed on what the size of the test data should

be. I argued that it should be 1 GB because that is what Turing

estimated in 1950 it would take to pass the Turing test (a machine with

10^9 bits of memory, which is 1 GB based on Shannon's 1950 estimate of

1 bpc entropy of written English), Landauer's estimate of 10^9 bits of

human long term memory, and because it is about the amount of language

that the average adult has processed since birth. Hutter argued for

100 MB because it is easier to work with and makes the contest more

accessible to the public. He has a good point because many of the top

compressors are there only because they use all of the 800 MB available

on my test machine.

In the original test design I assume that all knowledge needed to pass

any text-based AI test (such as text prediction or the Turing test) is

learnable solely from text. I realize that the vast knowledge that

humans use in these tasks is not all learned verbally, but I think that

most of it *could* be. For example, a blind person can tell you that

the sky is blue. Also, if knowledge cannot easily be learned verbally,

then it is difficult to prove to somebody that you know it in a verbal

test. For example, how would you prove in the Turing test that you

know how to ride a bicycle, that you know what a banana smells like,

that you could recognize somebody's face? Can you give an example of

*any* human knowledge that cannot be learned verbally but can be

demonstrated verbally?

Thus, to successfully pass the compression test, it will be necessary

for a program to learn a language model from the test data itself as it

compresses/decompresses the data.

> Research programs like this should be coded in some user friendly and

> extensively checked medium like Microsoft Access, which probably

> requires 18MB just for "Hello World". Even the .NET run-time package

> would exceed this. HOWEVER, it is unclear whether the run-time package

> would count in the total, but the contest seems to presume an EXE

> instead of (say) an MDB like Access would use. Probably the best

> approach to squeeze things into the smallest EXE would be to "go retro"

> with, say, VB5, but to what end - to try to fit some screwball rule?

Most compressors are written in C, C++ and assembler for speed. File

I/O does not require any fancy programming environment. The executable

files are typically 100 KB or less. It is necessary to include the

size of the decompressor, otherwise I could trivially write a 100 MB

"decompressor" that simply outputs the data, thus achieving a

compressed size of 0.

> Further, the program would be disqualified if it were to correct

> grammatical errors and/or summarize cumbersome statements using the very

> computer understanding that the prise is supposed to be promoting.

The test data is very clean. Misspelled words and grammatical errors

are rare. This is one reason I chose this data set. See

http://cs.fit.edu/~mmahoney/compression/textdata.html for an analysis

of the data.

>

> Indeed, abandoning lossless compression would pretty much yield what the

> semantic web searching people are looking for once it gets good enough.

>

> Further, this sort of a competition blows right past the ITAR problems

> that I have posted in the past - a VERY good thing.

>

> Yes, the world would GREATLY benefit from this sort of research, but the

> "minor" flaws in this particular competition appear (to me) to be rather

> fatal to its stated goals.

>

> Sad. I was really excited about competing until I saw the flaws.

>

> Steve Richfie1d

-- Matt Mahoney

Aug 8, 2006, 7:54:11 PM8/8/06

to

"Matt Mahoney" <matma...@yahoo.com> writes:

> ...

> When I developed the large text benchmark (without funding) my goal was

> to promote research in NLP. There are 3 arguments why text compression

> implies AI.

> 1. Hutter's proof in the specific case of text prediction (a problem

> that is easy for people but hard for machines).

> 2. An explicit representation of the probability distribution of

> natural language strings (a correct language model) enables one to pass

> the Turing test for AI.

Not if I'm the one asking the questions. This idea is just

the ludicrous "humungous lookup table" in disguise.

David

Aug 8, 2006, 8:50:01 PM8/8/06

to

"Matt Mahoney" <matma...@yahoo.com> writes:

> Steve Richfie1d wrote:

> > Ted (and James),

> >

> > We often argue about lots of things, but I am on your side here.

> >

> > The insistence on limited program size pretty much excuses it from the

> > realm of the scientific, as it diverts people's efforts from the claimed

> > goals of the prize and onto clever and concise programming methods, and

> > in the process obscures what might otherwise be simple but powerful

> > compression methods. It also indirectly (and needlessly) presumes an

> > 18MB simplicity in human experience, thought, and expression that

> > probably just isn't there.

> [...]

> Thus, to successfully pass the compression test, it will be necessary

> for a program to learn a language model from the test data itself as it

> compresses/decompresses the data.

This "intelligence test" is so silly, it's hard to know where to start.

- There is no notion of "successfully passing the test", unless you

consider every lossless compressor/decompressor to pass the test, and

that includes ones that are clearly completely unintelligent.

- You are seeking minimal size of the decompressor that reproduces a

particular 100MB text file. A compressor/decompressor scheme that is

specialized for the structure and statistics of that particular file

can obviously do better than one that is more general in ability, but

this is equally obviously less intelligent than the more general one.

- It seems there is no limit on the CPU time required to generate

the decompressor, which again penalizes intelligence and rewards

brute force approaches.

- Why do you imagine that any "understanding" of the text is needed

to achieve maximal compression. A purely syntactic/statistical

approach that ignores any meaning of the text is going to be more

intelligent, by your criterion, since addressing semantics just adds

baggage and size to the decompressor. This might change if you

accepted a "fair paraphrase" rather than an exact reproduction,

but you don't.

In summary, what is proposed is no more or less than a test of the

intelligence and skill of lossless compression programmers, not of

AI programs.

David

Aug 8, 2006, 11:40:49 PM8/8/06

to

This is the point of measuring compressed size. One way to pass the

Turing test would be to look up the interrogator's exact question in

Google and reply with the text that follows from highest ranked

response. The problem is that your lookup table is several terabytes.

The challenge for you is to replace this huge table with something

smaller. That means instead of estimating the probability of a

sentence by counting Google hits, that you analyze the syntax and

semantics of the sentence to determine if it is correct, because the

number of exact matches is most likely to be zero. You have a choice

of either finding a succinct way of describing the rules that govern

whether a sentence is likely or unlikely, or you can find an efficient

way to learn these rules from just enough unlabled text.

Using compression ratio to measure the quality of a language model is

nothing new. Speech recognition researchers have been doing it for

years.

-- Matt Mahoney

Aug 9, 2006, 12:30:53 AM8/9/06

to

David Kinny wrote:

> "Matt Mahoney" <matma...@yahoo.com> writes:

>

> > Steve Richfie1d wrote:

> > > Ted (and James),

> > >

> > > We often argue about lots of things, but I am on your side here.

> > >

> > > The insistence on limited program size pretty much excuses it from the

> > > realm of the scientific, as it diverts people's efforts from the claimed

> > > goals of the prize and onto clever and concise programming methods, and

> > > in the process obscures what might otherwise be simple but powerful

> > > compression methods. It also indirectly (and needlessly) presumes an

> > > 18MB simplicity in human experience, thought, and expression that

> > > probably just isn't there.

>

> > [...]

> > Thus, to successfully pass the compression test, it will be necessary

> > for a program to learn a language model from the test data itself as it

> > compresses/decompresses the data.

>

> This "intelligence test" is so silly, it's hard to know where to start.

>

> - There is no notion of "successfully passing the test", unless you

> consider every lossless compressor/decompressor to pass the test, and

> that includes ones that are clearly completely unintelligent.

> "Matt Mahoney" <matma...@yahoo.com> writes:

>

> > Steve Richfie1d wrote:

> > > Ted (and James),

> > >

> > > We often argue about lots of things, but I am on your side here.

> > >

> > > The insistence on limited program size pretty much excuses it from the

> > > realm of the scientific, as it diverts people's efforts from the claimed

> > > goals of the prize and onto clever and concise programming methods, and

> > > in the process obscures what might otherwise be simple but powerful

> > > compression methods. It also indirectly (and needlessly) presumes an

> > > 18MB simplicity in human experience, thought, and expression that

> > > probably just isn't there.

>

> > [...]

> > Thus, to successfully pass the compression test, it will be necessary

> > for a program to learn a language model from the test data itself as it

> > compresses/decompresses the data.

>

> This "intelligence test" is so silly, it's hard to know where to start.

>

> - There is no notion of "successfully passing the test", unless you

> consider every lossless compressor/decompressor to pass the test, and

> that includes ones that are clearly completely unintelligent.

We did not set a criteria for "passing" the test because we do not know

exactly what level of compression is needed. Shannon estimated in 1950

that the written English (using a 27 character alphabet of A-Z and

space) has an entropy of between 0.6 and 1.3 bits per character. This

is better than the best compressors can do, but not by much.

Humans are better than machines at text prediction, e.g. guessing the

next character or word in "roses are ___" or "3+8=__". Clearly this

task requires lexical, syntactic, semantic and vast, real world

knowledge. For a machine to pass the AI test, it must do as well as

humans by some measure, such as the number of guesses on average until

the correct answer is guessed. The measure we use is to assign

probabilities to all possible choices and a score of -log(p) for the

probability p assigned by the guesser to the correct answer. Shannon

proved that this score is minimized when the guesser assigns the exact

same probabilities as the true (but unknown) probability distribution.

A problem with this approach is that it is conceptually difficult for

people to assign numeric probabilities. This is why Shannon used a

guessing game instead, but this resulted in the large error bounds I

described, because different distributions could lead to the same

series of guesses. Therefore, a task that remains is to more

accurately estimate the entropy of written English with respect to

humans. In particular, we need to estimate the entropy for the exact

text used in this particular data set. Until that is done, we can only

reward improvements in the state of the art.

> - You are seeking minimal size of the decompressor that reproduces a

> particular 100MB text file. A compressor/decompressor scheme that is

> specialized for the structure and statistics of that particular file

> can obviously do better than one that is more general in ability, but

> this is equally obviously less intelligent than the more general one.

I expect that a winning decompressor will be tuned specifically to this

data set. That is why the size of the decompressor is included. The

statistics can either be learned from the data or they can be hard

coded into the decompressor by the programmer. It does not matter to

us which way it is done. Any tuning done in the decompressor will

decrease the compressed file size but increase the size of the

decompressor. It is within the rules to write a "decompressor" that

simply outputs the test file verbatim, but good luck with the size of

the program.

> - It seems there is no limit on the CPU time required to generate

> the decompressor, which again penalizes intelligence and rewards

> brute force approaches.

I expect that NLP will be CPU and/or memory intensive, because other

approaches have not worked. We do not want to restrict the range of

possible solutions any more than necessary. Thus, we only require

"reasonable" computation limits by specifying that the solution must be

testable by the public and have a 30 day period of public comment. If

nobody can run the program in that time, then it loses. This rule

allows for better models as computers become more powerful.

> - Why do you imagine that any "understanding" of the text is needed

> to achieve maximal compression. A purely syntactic/statistical

> approach that ignores any meaning of the text is going to be more

> intelligent, by your criterion, since addressing semantics just adds

> baggage and size to the decompressor. This might change if you

> accepted a "fair paraphrase" rather than an exact reproduction,

> but you don't.

A "fair paraphrase" implies lossy compression, which cannot be measured

objectively. Who is to say if a paraphrase is fair or not? The extra

information needed to encode the exact text is small, and is the same

for all contestants.

And what is "understanding"? When the semantics of words are not

grounded to sensory or motor I/O, semantics is just a fuzzy identity

relation: two words are related if they are likely to appear near each

other in running text. Models like LSA exploit the transitive property

of semantics. If "moon" appears near "star" and "star" appears near

"night", then the model predicts that "night" will appear near "moon".

Semantics is just a statistical property.

> In summary, what is proposed is no more or less than a test of the

> intelligence and skill of lossless compression programmers, not of

> AI programs.

Yes, that is what the test is designed to do, unless you think that

machines can think.

> David

-- Matt Mahoney

Aug 9, 2006, 5:32:20 AM8/9/06

to

"Matt Mahoney" <matma...@yahoo.com> writes:

I don't dispute the relevance of compression in measuring the quality

of a language model. What I dispute is the notion that you can pass

the Turing test by looking up responses in a table, compressed or not.

My claim is that I, as judge, can easily construct a sequence of

questions that cannot be answered correctly by table lookup, as the

correct answers depends on previous aspects of the conversation, or

other variable context information not even part of the conversation:

Me: I was born in December 1966.

System: That's nice. [[or whatever, no correct answer expected here]]

Me: How many days left before Xmas, not counting today?

System: ??

Despite the inadequacy of a table/language model to generate the right

answer, let's suppose somehow it comes up with it. I continue:

Me: OK, add that to the number of whole years since the first moon

landing, and substract my age, what do you get?

System: ??

Do you see the problem now?

David

Aug 9, 2006, 6:17:10 AM8/9/06

to

>[...]

>> In summary, what is proposed is no more or less than a test of the

>> intelligence and skill of lossless compression programmers, not of

>> AI programs.

>> In summary, what is proposed is no more or less than a test of the

>> intelligence and skill of lossless compression programmers, not of

>> AI programs.

>Yes, that is what the test is designed to do, unless you think that

>machines can think.

Well, I'm glad we agree on what the test measures. Why then did the

original announcement contain the following very misleading claim:

"Artificial intelligence researchers finally have an objective and

rigorously validated measure of the intelligence of their machines.

Furthermore the higher the measured intelligence of their machines

the more money they can win via the Hutter Prize."

The problem is that "AI researchers' machines" are almost always

intended to do things other than lossless compression of slabs of

text, so the test would invariably claim they have no intelligence.

David

Aug 9, 2006, 12:47:37 PM8/9/06

to

How do you propose that intelligence (artificial or otherwise) be

measured?

-- Matt Mahoney

Aug 9, 2006, 12:58:46 PM8/9/06

to

Matt Mahoney wrote:

> Ted Dunning wrote:

> > Wasn't it Fisher in the 20's who proved just about the same thing about

> > maximum likelihood estimates?

> I think it was Shannon who proved this in 1948 (discete noiseless

> channel capacity theorem).

> http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

> I could not find any reference to Fisher in his paper.

I was referring to the asymptotic correctness of MLE's, not the

compression. The fundamental insight is that the maximum likelihood

estimator converges to truth. The fact that the MLE also gives you

optimal compression is an add-on to the basic idea and does indeed

follow from Shannon's work.

Aug 9, 2006, 7:48:41 PM8/9/06

to

>-- Matt Mahoney

By carefully crafted tests of knowledge, capability, comprehension,

and ability to reason, to recognise, to solve problems, to learn...

David

Aug 9, 2006, 9:06:18 PM8/9/06

to

Think about it like this:

You've heard of functions, right? They are things with inputs and

outputs. All possible inputs are called the domain. All possible

outputs are called the range. One way to implement functions is to

make a two column table with one column labeled "inputs" and the other

labeled "outputs" where each row is what the function outputs for the

given input. You'd call it a lookup table.

Now, imagine a program that, for every _sequence_ of N text inputs, i1,

i2, i3...iN, creates N rows in the table, the first row has input of

i1, and some output o1. The input for the next row is i2 appended to

the end of i1 and the output is o2... and you continue this until you

have completed all N inputs. Agreed this could be a VERY BIG table but

so could the table representing the function "+" between two 64 bit

numbers. Yet the size of the program you need to implement the

function "+" between two 64 bit numbers is very small... measured in

only tens of bytes on some computers. Incredible isn't it? That's the

magic of what is called Kolmogorov complexity.

Anyway, so let's go back to our ridiculously big text function and ask

ourselves, how big does the program have to be to generate the program

that's equivalent to that table?

Would you agree that it might be somewhat smaller than the table itself?

Aug 9, 2006, 11:15:23 PM8/9/06

to

I don't claim that a language model can be implemented as a lookup

table. I claim

1. that a better language model is more likely to pass the Turing test,

2. that a better language model will compress text better.

By language model, I mean a probability distribution over the set of

all strings. So a model will assign p("Today is Aug. 10, 2006 ... I

was born in (rest of dialog)... what do you get? x") some number in

the open interval (0,1) for all x.

Now let q be a dialog ending with a question, and x1 and x2 be two

posible answers. I think it is reasonable to assume that the following

statements are equivalent (all true or all false) for all q, x1, x2:

1. A human picked at random and asked question q is more likely to

answer x1 than x2.

2. A human picked at random is more likely to judge the pair q,x1 as

correct than q,x2.

3. A text corpus of human origin is more likely to contain q,x1 than

q,x2.

Now suppose that a machine is able to compute p() such that the above

statements are equivalent to p(q,x1) > p(q,x2) for all q, x1, and x2.

Then that machine would answer all questions in a way that is

indistinguishable from a human. It would pass the Turing test.

Now suppose that we have two models p1() and p2() and want to know

which is better (closer to p)? One way is to test it is to ask it a

multiple choice question q with answers x1 and x2, and have each

program output the answer with the higher probability according to its

model. So if p1(q,x1) > p1(q,x2) in model p1() then the program

answers x1. If p1() answers x1 and p2() answers x2 and a human judge

decides that x1 is more correct, then p1() is judged more human.

Another test would be for the judge to first decide the correct answer

(say, x1), then ask both models to explicitly compute p1(q,x1) and

p2(q,x1). The model that gives the higher value wins. We don't

normally test humans this way because it is conceptually difficult for

people to assign numeric probabilities. But it is not hard for

machines.

This test can be repeated, for example, by multiplying the scores for

several independent questions to compute the probability that all of

the answers are correct. So given a series of questions q1, q2, q3,...

with correct answers x1, x2, x3..., the winner is the one with the

greatest value of PROD_i [p(x_i|q_i)], or equivalently, the smallest

value of -SUM_i [log p(x_i|q_i)], which is the compressed size of the

answers using ideal coding.

The advantage of this approach is that no human is required to evaluate

the language model. The score is exact, repeatable, and verifiable.

What I tried to show is that good text compression requires

sophisticated reasoning ability and vast knowledge, just as your

example Turing test session does.

-- Matt Mahoney

Aug 9, 2006, 11:43:08 PM8/9/06

to

David Kinny wrote:

> >How do you propose that intelligence (artificial or otherwise) be

> >measured?

>

> >-- Matt Mahoney

>

> By carefully crafted tests of knowledge, capability, comprehension,

> and ability to reason, to recognise, to solve problems, to learn...

>

> David

> >How do you propose that intelligence (artificial or otherwise) be

> >measured?

>

> >-- Matt Mahoney

>

> By carefully crafted tests of knowledge, capability, comprehension,

> and ability to reason, to recognise, to solve problems, to learn...

>

> David

If you were to run a competition lasting for many years among AI

systems with cash prizes, how would you design the tests so that the

competition is fair, scores are 100% objective, and systems can't

cheat?

-- Matt Mahoney

Aug 10, 2006, 6:29:34 PM8/10/06

to

You claimed above that "One way to pass the Turing test would be to

look up the interrogator's exact question in Google and reply with

the text that follows from highest ranked response. The problem is

that your lookup table is several terabytes."

This claim is nonsense - the prospect of passing the Turing test

in this way is effectively nil.

> I claim

> 1. that a better language model is more likely to pass the Turing test,

> 2. that a better language model will compress text better.

Another claim you made, to which I responded above, is:

"2. An explicit representation of the probability distribution of

natural language strings (a correct language model) enables one

to pass the Turing test for AI."

You use the phrase "An explicit representation of ... ", which certainly

sounds like a table of probabilities indexed by strings, and "enables

one to pass the Turing test", which is a very strong claim. But it

seems you now want to downgrade this to a program that computes

probabilities, and to "is more likely to pass the TT". Clearly a

program that has an infinitesimal chance of passing the TT is more

likely to pass than one that has zero chance, but it's not very

reasonable to claim that such a program can pass the TT if its

chances of soing so are sufficiently small. I could create a perl

script that has a non-zero chance to pass the TT in half an hour.

> By language model, I mean a probability distribution over the set of

> all strings. So a model will assign p("Today is Aug. 10, 2006 ... I

> was born in (rest of dialog)... what do you get? x") some number in

> the open interval (0,1) for all x.

Before I try to respond to the rest of this, please clarify for me:

- What you meant by "an explict representation of ..."

- What, exactly, does the probability of some string S represent here?

The probability of S occurring next in a particular document?

P(S occurs next in the whole corpus of written English text)?

P(S occurs next in a randomly chosen spoken English conversation)?

P(S occurs next in a randomly chosen Turing Test)?

- Do you really mean "over the set of ALL strings"? Language models

typically deal with n-grams for some small n. Are you proposing

an idealised, infinitely complex model? Or what?

David

Aug 10, 2006, 9:24:28 PM8/10/06

to

David Kinny wrote:

> You claimed above that "One way to pass the Turing test would be to

> look up the interrogator's exact question in Google and reply with

> the text that follows from highest ranked response. The problem is

> that your lookup table is several terabytes."

>

> This claim is nonsense - the prospect of passing the Turing test

> in this way is effectively nil.

> You claimed above that "One way to pass the Turing test would be to

> look up the interrogator's exact question in Google and reply with

> the text that follows from highest ranked response. The problem is

> that your lookup table is several terabytes."

>

> This claim is nonsense - the prospect of passing the Turing test

> in this way is effectively nil.

OK, it's nil. The point I was trying to make is that if you have a

small training corpus, then you have to be smart about how you learn a

language model from it. If it's large, you can use a brute force

string matching approach. I believe that 1 GB is the right size, large

enough to learn syntax and semantics, but small enough that you can't

just search for the judge's question and output whatever follows.

It is controversial whether a language model can be learned from an

unlabeled corpus of positive examples, but I believe it is possible.

It is easy to extract a lexical model (dictionary). There are

statistical semantic models such as LSA, and low level syntactic models

such as hidden Markov models, that have been learned on data sets of

similar size. I believe it is possible with the Wikipedia dataset. It

is doubtful to me that the data is sufficient to learn higher level

reasoning and problem solving skills like arithmetic without some help

from the programmer, but we shall see.

> > I claim

> > 1. that a better language model is more likely to pass the Turing test,

> > 2. that a better language model will compress text better.

>

> Another claim you made, to which I responded above, is:

> "2. An explicit representation of the probability distribution of

> natural language strings (a correct language model) enables one

> to pass the Turing test for AI."

>

> You use the phrase "An explicit representation of ... ", which certainly

> sounds like a table of probabilities indexed by strings, and "enables

> one to pass the Turing test", which is a very strong claim. But it

> seems you now want to downgrade this to a program that computes

> probabilities, and to "is more likely to pass the TT". Clearly a

> program that has an infinitesimal chance of passing the TT is more

> likely to pass than one that has zero chance, but it's not very

> reasonable to claim that such a program can pass the TT if its

> chances of soing so are sufficiently small. I could create a perl

> script that has a non-zero chance to pass the TT in half an hour.

It is irrelevant whether a Turing machine implements a language model

as an algorithm or a lookup table.

> > By language model, I mean a probability distribution over the set of

> > all strings. So a model will assign p("Today is Aug. 10, 2006 ... I

> > was born in (rest of dialog)... what do you get? x") some number in

> > the open interval (0,1) for all x.

>

> Before I try to respond to the rest of this, please clarify for me:

>

> - What you meant by "an explict representation of ..."

>

> - What, exactly, does the probability of some string S represent here?

> The probability of S occurring next in a particular document?

> P(S occurs next in the whole corpus of written English text)?

> P(S occurs next in a randomly chosen spoken English conversation)?

> P(S occurs next in a randomly chosen Turing Test)?

>

> - Do you really mean "over the set of ALL strings"? Language models

> typically deal with n-grams for some small n. Are you proposing

> an idealised, infinitely complex model? Or what?

>

> David

A language model is a function, p1(): S -> [0,1] where S is the set of

all string over some alphabet, e.g. S = {0,1}*, and furthermore, p1(s)

>= 0, and SUM_s p1(s) = 1 for s in S.

Define p(s) as the probability that s will occur in a two way dialog

between a human judge and a human confederate in the Turing test. s is

a prefix of the entire dialog from beginning to end. p(s) is unknown,

like all probabilities in real life. p(s) depends on the judge, the

confederate, and when the test is given.

A model, p1(), is an approximation of p(). The cross entropy is

defined as H1 = -SUM_s p(s) log p1(s), for all s in S. The cross

entropy is a measure of how accurately p1() approximates p(). H1 is

minimized when p1(s) = p(s) for all s in S. Note that H1 is the

expected size of s chosen randomly from distribution p() when

compressed with model p1(). It is required that 0 < p1(s) < 1 for all

s in S.

When I say there is an explicit representation of p1(), I mean that

there is a Turing machine that computes it.

I realize that Wikipedia has a different distribution than p(s) for

Turing tests. The proper training corpus for passing the Turing test

would be a set of transcripts between judges and human confederates,

like those at the Loebner prize website. Chat logs would also be a

better choice than Wikipedia. I argue that this difference is an

irrelevant detail. The main difference is that Wikipedia articles are

not interactive, so a machine training on it would not learn to hold a

conversation. Other than that, text prediction still requires high

level language comprehension, common sense, reasoning, and problem

solving abilities just like passing the Turing test does. If you can

model written articles, then I think extending the model to interactive

conversation would not be too hard.

-- Matt Mahoney

Aug 11, 2006, 5:05:37 PM8/11/06

to

David Kinny wrote:

> Another claim you made, to which I responded above, is:

> "2. An explicit representation of the probability distribution of

> natural language strings (a correct language model) enables one

> to pass the Turing test for AI."

>

> You use the phrase "An explicit representation of ... ", which certainly

> sounds like a table of probabilities indexed by strings, and "enables

> one to pass the Turing test", which is a very strong claim.

It may sound like that to you, but it doesn't sound like that to

somebody who is taking just a bit more time to listen.

An explicit representation is one that is explicitly defined. A table

of logarithms is an explicit representation of the log function. So is

a series expansion.

> But it

> seems you now want to downgrade this to a program that computes

> probabilities, and to "is more likely to pass the TT".

There is no downgrading involved. This is conventional usage.

You are fighting a straw man for no discernible reason.

Aug 12, 2006, 4:50:32 AM8/12/06

to

Sorry for the delay in getting back to you here ...

That might be true in theory, but we are concerned here with what can

be feasibly implemented in current or future practice, are we not?

> > > By language model, I mean a probability distribution over the set of

> > > all strings. So a model will assign p("Today is Aug. 10, 2006 ... I

> > > was born in (rest of dialog)... what do you get? x") some number in

> > > the open interval (0,1) for all x.

> >

> > Before I try to respond to the rest of this, please clarify for me:

> >

> > - What you meant by "an explict representation of ..."

> >

> > - What, exactly, does the probability of some string S represent here?

> > The probability of S occurring next in a particular document?

> > P(S occurs next in the whole corpus of written English text)?

> > P(S occurs next in a randomly chosen spoken English conversation)?

> > P(S occurs next in a randomly chosen Turing Test)?

> >

> > - Do you really mean "over the set of ALL strings"? Language models

> > typically deal with n-grams for some small n. Are you proposing

> > an idealised, infinitely complex model? Or what?

> A language model is a function, p1(): S -> [0,1] where S is the set of

> all string over some alphabet, e.g. S = {0,1}*, and furthermore, p1(s)

> >= 0, and SUM_s p1(s) = 1 for s in S.

> Define p(s) as the probability that s will occur in a two way dialog

> between a human judge and a human confederate in the Turing test. s is

> a prefix of the entire dialog from beginning to end. p(s) is unknown,

> like all probabilities in real life. p(s) depends on the judge, the

> confederate, and when the test is given.

Here I'm a little confused. You say that s is a prefix of the dialog,

and that p(s) depends on the judge, etc, and when the test is given,

implying that this p() is for a *specific* dialog. In that case the

distribution is trivial: all k possible prefixes have probability 1/k,

all other strings 0 (or some variant of this). Surely you wish to

consider the distribution over some set of dialogs, but which set?

For a given p() to be useful in passing the TT, it has to adequately

cover possible dialogs. It's easy, as judge, to create novel dialogs.

For example, the following short question whose answer is obvious has

almost certainly never been asked before in the history of mankind.

How many legs do 3 ants, an octopus and 4 aardvarks have in total?

Then there's the issue of background knowledge. In a previous posting

you addressed the need to know such things as the date by prefixing

that information onto the actual dialog. How do you propose to

decide which background information to add in explicitly?

It seems, the way this is heading, that you need p() to be some

idealized distribution over the set of all possible human to human TT

dialogs prefixed by all necessary background information. In that

case it doesn't depend on the judge, the time, etc. Right?

> A model, p1(), is an approximation of p(). The cross entropy is

> defined as H1 = -SUM_s p(s) log p1(s), for all s in S. The cross

> entropy is a measure of how accurately p1() approximates p(). H1 is

> minimized when p1(s) = p(s) for all s in S. Note that H1 is the

> expected size of s chosen randomly from distribution p() when

> compressed with model p1(). It is required that 0 < p1(s) < 1 for all

> s in S.

> When I say there is an explicit representation of p1(), I mean that

> there is a Turing machine that computes it.

I'll return below to the question of how such a p1() could be created,

but let's move on to another issue. Let's assume that you have an

actual, computationally tractable implementation of an algorithm that

computes p1(), which is a very good approximation of some suitably

extensive p(). Now how can you actually use this to pass the TT?

You've been asked a question, and the dialog so far is the string d.

You need to find a suffix x such that p1(d+x) is non-zero (I assume

you'll settle for any suitable answer rather than the most likely).

This means searching over the space of possible strings x and

repeatedly evaluating p1(d+x). This is not feasible. And if your

algorithm in fact takes d as input and returns an x s.t. p1(d+x) > 0,

it is not a language model as you've defined it, but something else.

For completeness, I'm just going to backtrack to something you wrote

earlier in the thread.

> Now let q be a dialog ending with a question, and x1 and x2 be two

> posible answers. I think it is reasonable to assume that the following

> statements are equivalent (all true or all false) for all q, x1, x2:

> 1. A human picked at random and asked question q is more likely to

> answer x1 than x2.

> 2. A human picked at random is more likely to judge the pair q,x1 as

> correct than q,x2.

> 3. A text corpus of human origin is more likely to contain q,x1 than

> q,x2.

I'd agree that 1 and 2 are equivalent, but dispute the equivalence of 3.

The reason is that, except for a vanishingly small number of dialogs q,

the probability of a text corpus of human origin containing q,x1 is

identical to the probability of it containing q,x2 , namely 0. As I've

shown it is easy to create dialogs such that this is the case.

> I realize that Wikipedia has a different distribution than p(s) for

> Turing tests. The proper training corpus for passing the Turing test

> would be a set of transcripts between judges and human confederates,

> like those at the Loebner prize website. Chat logs would also be a

> better choice than Wikipedia. I argue that this difference is an

> irrelevant detail. The main difference is that Wikipedia articles are

> not interactive, so a machine training on it would not learn to hold a

> conversation. Other than that, text prediction still requires high

> level language comprehension, common sense, reasoning, and problem

> solving abilities just like passing the Turing test does. If you can

> model written articles, then I think extending the model to interactive

> conversation would not be too hard.

> -- Matt Mahoney

So finally you gloss over several issues, dismissing them as irrelevant

details, or easy. You claim, apparently as an act of faith, that:

- It will be feasible to learn a sufficiently good language model from a

tractably small corpus. This is on a par with claiming that it will

be feasible to learn to play championship Go just from example games.

- Given such a model, it will be feasible to use it to pass the TT.

- The difference between the language model of TT dialogs and of a slab

of Wikipedia is unimportant. You, the programmer, can somehow extend

or convert the one into the other.

- The limited extent of the corpus will not matter. Somehow your LM

will have the capability to generalise, extrapolate, answer novel

questions, incorporate all necessary background information, know

how to do computations, solve simple problems, remember things, ...

Here's my take. It will, eventually, be feasible to create programs

that do all these latter things well enough to pass a TT, even in the

face of the types of lines of questioning I've pointed to. Inevitably

such systems will acquire much of their knowledge by learning, from

conversations and texts amongst other things. But they will be so

much more than just language models, in terms of what they consist of

and what they are capable of doing. They could actually be used to

generate language models of a certain type, ones where probabilities

were significant only in whether they were zero or not. Indeed, I

suspect that the only way you can generate a LM with sufficient

coverage to (in theory) pass the TT, is to use a system with such

capabilities as the generator.

To put it simply, I'd claim the only way you can create a LM good

enough to (in theory) pass the TT is to build a system that can

already pass the TT without any need to create that LM. And you

don't achieve that just as a side-effect of achieving compression

of 1G slabs of text that is close to the information-theoretic limit,

as that task can be achieved by much much simpler approaches.

Compressing text better is at best just one small step on the path

to implementing systems that pass the TT.

David

Aug 12, 2006, 5:01:03 AM8/12/06

to

"Ted Dunning" <ted.d...@gmail.com> writes:

> David Kinny wrote:

> > Another claim you made, to which I responded above, is:

> > "2. An explicit representation of the probability distribution of

> > natural language strings (a correct language model) enables one

> > to pass the Turing test for AI."

> >

> > You use the phrase "An explicit representation of ... ", which certainly

> > sounds like a table of probabilities indexed by strings, and "enables

> > one to pass the Turing test", which is a very strong claim.

> David Kinny wrote:

> > Another claim you made, to which I responded above, is:

> > "2. An explicit representation of the probability distribution of

> > natural language strings (a correct language model) enables one

> > to pass the Turing test for AI."

> >

> > You use the phrase "An explicit representation of ... ", which certainly

> > sounds like a table of probabilities indexed by strings, and "enables

> > one to pass the Turing test", which is a very strong claim.

> It may sound like that to you, but it doesn't sound like that to

> somebody who is taking just a bit more time to listen.

Steady on there Ted. Matt began by claiming that one could pass the

TT using a several terabyte lookup table. I'm the one who is taking

the time here to try and clarify exactly what he really means, and it

keeps changing from the laughable pop-science claims of the original

announcement. Hopefully now it's approaching something reasonable.

> An explicit representation is one that is explicitly defined. A table

> of logarithms is an explicit representation of the log function. So is

> a series expansion.

I agree, both are explicit. I really don't want to quibble about

semantics, but I'd say that an algorithm that computes logarithms in

some non-transparent way is by contrast an implicit representation.

Your usage of language may vary.

> > But it

> > seems you now want to downgrade this to a program that computes

> > probabilities, and to "is more likely to pass the TT".

> There is no downgrading involved. This is conventional usage.

The change from "can pass" to "is more likely to pass", from a

strong claim to a weak one, is certainly downgrading. And a program

is potentially feasible, while a lookup table is certainly not, as

it seems we all now agree.

> You are fighting a straw man for no discernible reason.

There is confusion being generated here that deserves to be dispelled.

David

Aug 12, 2006, 1:18:22 PM8/12/06

to

> > It is irrelevant whether a Turing machine implements a language model

> > as an algorithm or a lookup table.

>

> That might be true in theory, but we are concerned here with what can

> be feasibly implemented in current or future practice, are we not?

> > as an algorithm or a lookup table.

>

> That might be true in theory, but we are concerned here with what can

> be feasibly implemented in current or future practice, are we not?

In practice, Turing machines don't exist.

> > Define p(s) as the probability that s will occur in a two way dialog

> > between a human judge and a human confederate in the Turing test. s is

> > a prefix of the entire dialog from beginning to end. p(s) is unknown,

> > like all probabilities in real life. p(s) depends on the judge, the

> > confederate, and when the test is given.

>

> Here I'm a little confused. You say that s is a prefix of the dialog,

> and that p(s) depends on the judge, etc, and when the test is given,

> implying that this p() is for a *specific* dialog. In that case the

> distribution is trivial: all k possible prefixes have probability 1/k,

> all other strings 0 (or some variant of this). Surely you wish to

> consider the distribution over some set of dialogs, but which set?

In real life, probabilities are always unknown. Different people will

estimate them differently depending on what they know. For example, I

flip a fair coin, look at it, but don't show it to you. What is the

probability that it is heads? You say 1/2. I say 1. Both of us are

right.

I define p(s) as the probability that s will be a prefix of a dialog in

a Turing test, given that you do not know who will be the judge or the

confederate or when the test will be given.

I realize that p() is still not well defined. But that is the nature

of probability.

> For a given p() to be useful in passing the TT, it has to adequately

> cover possible dialogs. It's easy, as judge, to create novel dialogs.

> For example, the following short question whose answer is obvious has

> almost certainly never been asked before in the history of mankind.

>

> How many legs do 3 ants, an octopus and 4 aardvarks have in total?

And yet, the probability that this will answered in an unknown training

corpus will not be exactly 0.

Also, I think you will agree that this question and the answer could be

phrased as a text prediction problem, in which case a compressor that

can solve problems like this will compress smaller than one that

cannot.

> Then there's the issue of background knowledge. In a previous posting

> you addressed the need to know such things as the date by prefixing

> that information onto the actual dialog. How do you propose to

> decide which background information to add in explicitly?

If a machine responds too fast to my questions, I will know it is not

human. A dialog must contain information about who said what and when.

You can represent this information how you like.

> It seems, the way this is heading, that you need p() to be some

> idealized distribution over the set of all possible human to human TT

> dialogs prefixed by all necessary background information. In that

> case it doesn't depend on the judge, the time, etc. Right?

It does depend, but you don't have this information, so you have to

make reasonable guesses.

> > A model, p1(), is an approximation of p(). The cross entropy is

> > defined as H1 = -SUM_s p(s) log p1(s), for all s in S. The cross

> > entropy is a measure of how accurately p1() approximates p(). H1 is

> > minimized when p1(s) = p(s) for all s in S. Note that H1 is the

> > expected size of s chosen randomly from distribution p() when

> > compressed with model p1(). It is required that 0 < p1(s) < 1 for all

> > s in S.

>

> > When I say there is an explicit representation of p1(), I mean that

> > there is a Turing machine that computes it.

>

> I'll return below to the question of how such a p1() could be created,

> but let's move on to another issue. Let's assume that you have an

> actual, computationally tractable implementation of an algorithm that

> computes p1(), which is a very good approximation of some suitably

> extensive p(). Now how can you actually use this to pass the TT?

>

> You've been asked a question, and the dialog so far is the string d.

> You need to find a suffix x such that p1(d+x) is non-zero (I assume

> you'll settle for any suitable answer rather than the most likely).

> This means searching over the space of possible strings x and

> repeatedly evaluating p1(d+x). This is not feasible. And if your

> algorithm in fact takes d as input and returns an x s.t. p1(d+x) > 0,

> it is not a language model as you've defined it, but something else.

This is a perfectly reasonable implementation for a Turing machine. In

practice, any language model can be converted to a generator with at

most O(n) increase in tme complexity by binary search by appending one

bit at a time. You first compute p1(d+0) and p1(d+1) and randomly

append one bit in proportion to these probabilities, repeating until

you have appended a symbol indicating the end of the message (which is

required in a dialog).

> For completeness, I'm just going to backtrack to something you wrote

> earlier in the thread.

>

> > Now let q be a dialog ending with a question, and x1 and x2 be two

> > posible answers. I think it is reasonable to assume that the following

> > statements are equivalent (all true or all false) for all q, x1, x2:

>

> > 1. A human picked at random and asked question q is more likely to

> > answer x1 than x2.

> > 2. A human picked at random is more likely to judge the pair q,x1 as

> > correct than q,x2.

> > 3. A text corpus of human origin is more likely to contain q,x1 than

> > q,x2.

>

> I'd agree that 1 and 2 are equivalent, but dispute the equivalence of 3.

> The reason is that, except for a vanishingly small number of dialogs q,

> the probability of a text corpus of human origin containing q,x1 is

> identical to the probability of it containing q,x2 , namely 0. As I've

> shown it is easy to create dialogs such that this is the case.

Without full knowledge, you cannot say that any probability is 0. If

you had full knowledge then there would be no use for probability

theory.

> > I realize that Wikipedia has a different distribution than p(s) for

> > Turing tests. The proper training corpus for passing the Turing test

> > would be a set of transcripts between judges and human confederates,

> > like those at the Loebner prize website. Chat logs would also be a

> > better choice than Wikipedia. I argue that this difference is an

> > irrelevant detail. The main difference is that Wikipedia articles are

> > not interactive, so a machine training on it would not learn to hold a

> > conversation. Other than that, text prediction still requires high

> > level language comprehension, common sense, reasoning, and problem

> > solving abilities just like passing the Turing test does. If you can

> > model written articles, then I think extending the model to interactive

> > conversation would not be too hard.

>

> > -- Matt Mahoney

>

> So finally you gloss over several issues, dismissing them as irrelevant

> details, or easy. You claim, apparently as an act of faith, that:

>

> - It will be feasible to learn a sufficiently good language model from a

> tractably small corpus. This is on a par with claiming that it will

> be feasible to learn to play championship Go just from example games.

I believe it is possible to develop lexical, semantic, and syntactic

models from unlabeled text containing only positive examples. Children

seem to do this. It is probably insufficient to develop higher level

skills like solving word problems with arithmetic. That will probably

take some help from the programmer.

A corpus consisting of articles is not sufficient to train a chatterbot

in interactive converstation. But I argue that such a model is still

useful, probably more so. If you want to translate articles, scan

them, recognize them from broadcast speech, etc., you want to train on

the type of text that it would typically encounter.

> - Given such a model, it will be feasible to use it to pass the TT.

Given about 1 GB of the right type of training material (dialogs), yes.

> - The difference between the language model of TT dialogs and of a slab

> of Wikipedia is unimportant. You, the programmer, can somehow extend

> or convert the one into the other.

There is a lot of overlap in the learning mechanisms. The difference

is mainly in the training material and the requirement for interaction.

> - The limited extent of the corpus will not matter. Somehow your LM

> will have the capability to generalise, extrapolate, answer novel

> questions, incorporate all necessary background information, know

> how to do computations, solve simple problems, remember things, ...

I think 1 GB is the right size, as explained in

http://cs.fit.edu/~mmahoney/compression/rationale.html

I do not claim that this particular corpus is right for any particular

AI problem you want to solve. You need to be careful to feed it the

information you want it to know. But I believe for the problem of

predicting text in a corpus, the best information to feed it is the

corpus itself. So I have no problem with the Wikipedia text for this

purpose.

> Here's my take. It will, eventually, be feasible to create programs

> that do all these latter things well enough to pass a TT, even in the

> face of the types of lines of questioning I've pointed to. Inevitably

> such systems will acquire much of their knowledge by learning, from

> conversations and texts amongst other things. But they will be so

> much more than just language models, in terms of what they consist of

> and what they are capable of doing. They could actually be used to

> generate language models of a certain type, ones where probabilities

> were significant only in whether they were zero or not. Indeed, I

> suspect that the only way you can generate a LM with sufficient

> coverage to (in theory) pass the TT, is to use a system with such

> capabilities as the generator.

>

> To put it simply, I'd claim the only way you can create a LM good

> enough to (in theory) pass the TT is to build a system that can

> already pass the TT without any need to create that LM. And you

> don't achieve that just as a side-effect of achieving compression

> of 1G slabs of text that is close to the information-theoretic limit,

> as that task can be achieved by much much simpler approaches.

> Compressing text better is at best just one small step on the path

> to implementing systems that pass the TT.

>

> David

Passing the Turing test requires:

1. Solving the problem of language learning.

2. Modeling interactive conversation.

3. Selecting the right training data.

I am only addressing problem 1.

-- Matt Mahoney

Aug 13, 2006, 3:59:15 PM8/13/06

to

Matt Mahoney wrote:

> David

>

> Passing the Turing test requires:

> 1. Solving the problem of language learning.

> 2. Modeling interactive conversation.

> 3. Selecting the right training data.

>

> I am only addressing problem 1.

>

have been based on databases. (like a Google search). Given a database

only "1" is needed to pass the Turing Test.

Essentially chatterboxes have done "2" quite adaquately.

I realize that this paragraph is one of high subjectivity (nothing like

compression!), but it is doubtful to what extent human beings really

think for themselves. Social scientists and evolutionary biologists

talk about memes, which I suppose you could describe in AI terms as

being elements of a database. The test is for a conversation similar to

that which you could have with a human and it is my experience that you

do not look for reason! Extreme Islamic statements would for example

pass T. They are of course memes.

Clearly if you have compression, or some other purely objective

criterion this does not apply. We are not in compression attempting to

mimic human behaviour. I think you are right, it is the only true test,

although I don't believe monolinual Wiki is the right answer.

Aug 13, 2006, 9:29:11 PM8/13/06

to

Ian Parker wrote:

> I think you are right, it is the only true test,

> although I don't believe monolinual Wiki is the right answer.

> I think you are right, it is the only true test,

> although I don't believe monolinual Wiki is the right answer.

Hutter's decision to go with 10e8 bytes was an uneasy, temporary

compromise. Certainly multilingual compression will be very valuable

-- particularly for my personal obsession which is applied relation

arithmetic.

Aug 14, 2006, 12:15:59 AM8/14/06

to

I had considered a multilingual corpus because it would encourage

research in learning language independent parsing rules rather than

hard coding rules into the program. For example, Chinese does not use

spaces between words (1 to 4 characters), making it a challenge to

build a dictionary. Nevertheless, it is possible to do this in English

text without spaces, starting with no knowledge. Such models might

help us understand how babies learn to parse continuous speech into

words at age 7-10 months, before they learn to speak.

http://cs.fit.edu/~mmahoney/dissertation/lex1.html

I used a monolingual corpus because multilingual ability is not a

requirement for intelligence. The problem is hard enough as it is. If

the same articles appeared in several languages, then compression would

require solving the language translation problem in addition to the

modeling problem.

-- Matt Mahoney

Aug 14, 2006, 4:57:47 AM8/14/06