61 views

Skip to first unread message

Aug 6, 2006, 3:39:16 PM8/6/06

to

Introducing the Hutter Prize for Lossless Compression of Human

Knowledge

Knowledge

Artificial intelligence researchers finally have an objective and

rigorously validated measure of the intelligence of their machines.

Furthermore the higher the measured intelligence of their machines the

more money they can win via the Hutter Prize.

The purse for the Hutter Prize was initially underwritten with a 50,000

Euro commitment to the prize fund by Marcus Hutter of the Swiss Dalle

Molle Institute for Artificial Intelligence, affiliated with the

University of Lugano and The University of Applied Sciences of Southern

Switzerland.

The theoretic basis of the Hutter Prize is related to an insight by the

12th century philosopher, William of Ockham, called "Ockham's Razor",

sometimes quoted as: "It is vain to do with more what can be done with

less." But it was not till the year 2000 that this was mathematically

proven*, by Marcus Hutter, to be a founding principle of intelligence.

Indeed, Hutter's Razor** might be phrased, "It is truer to explain with

less that which can be explained with more."

There have been previous tests and related prizes for artificial

intelligence, such as the Turing Test and the Loebner Prize. However,

these tests suffered from subjective definitions of intelligence.

Hutter's recent theoretic breakthrough creates a mathematics of

artificial intelligence which accurately measures the degree of

intelligence possessed by an artificial agent. It does so by measuring

how succinctly it represents knowledge of the world. As Hutter has now

proven, the most succinct computer model of the world isn't just the

most aesthetic or memory-efficient out of all models of the known

observations -- it also most accurately predicts new observations. In

short, it is the most intelligent.

Artificial intelligence has thereby entered the realm of engineering:

Lossless compression of human knowledge.

This is momentous because by optimizing for rigorous metrics, the field

of artificial intelligence may finally clarify the murky waters of

inadequate definition, within which it has been haphazardly swimming

for the last 50 years, to become both a hard science and tractable

engineering discipline.

Named for the discoverer of the proof and the initial 50,000 Euro

donor, the Hutter Prize currently targets the compression of a 100

megabyte sample of human knowledge drawn from the broadly based

Wikipedia online encyclopedia. As Moore's Law increases the capacity

of machines, and as additional donations to the prize fund increase the

incentives of contestants, the intent is to increase the amount of

knowledge targeted for compression. It is reasonable to expect that

the 100 megabyte sample will produce, at the very least, advances in

linguistic modeling. As the targeted depth and breadth of knowledge

increases, conceptual frameworks will come into play, eventually

covering the range of disciplines from political science to physics by

applying theories that prove optimal in compressing the target.

A common objection to this approach to artificial intelligence is that

it offers little that is new -- that the computational difficulty of

searching for patterns in data remains what it has always been. This

objection misses two important points:

1) Hutter's proof provides a new mathematics of intelligence allowing

for "top down" theoretic advances which may render many problems

tractable that otherwise appear intractable.

2) There is a large overlap between succinctly codified knowledge and

an intelligent compression program. Indeed, a reasonable definition of

"knowledge" is that it optimizes the compression of new observations as

instances of old patterns. This means that even if a compressor does

nothing but apply codified human knowledge, generating no new knowledge

of its own, it can still demonstrate greater intelligence than

competing programs and thereby make measurable progress toward

artificial intelligence but also -- and this is key -- progressively

more intelligent bodies of human-generated knowledge by pitting those

bodies of knowledge against each other in what might be called an

epistemological tournament.

The formula for winnings is modeled after the M-Prize or Methuselah

Mouse Prize, which awards money to longevity researchers for progress

in keeping mice alive the longest. Here, modified for compression

ratios, is the formula:

S = size of program outputting the uncompressed knowledge

Snew = new record

Sprev = previous record

P = [Sprev - Snew] / Sprev = percent improvement

Award monies:

Fund contains: Z at noon GMT on day of new record

Winner receives: Z * P

Initially Z is 50,000 Euro with a minimum payout of 500 Euro (or

minimum improvement of 1% over the prior winner).

Donations are welcome. The history of improvement in The Calgary

Corpus Compression Challenge*** is about 3% per year. The larger the

commitment from donors to the fund, the greater the rate of progress

toward a high quality body of human knowledge and, quite possibly, the

long-held promise of artificial intelligence.

For further details of the Hutter Prize see:

For discussion of the Hutter Prize see:

http://groups.google.com/group/Hutter-Prize

-- Jim Bowery

* http://www.hutter1.net/ai/uaibook.htm

** Hutter's Razor has some caveats relating to the nature of the

universe and computability, but those conditions must be met for any

computer-based intelligence.

Aug 7, 2006, 8:22:32 PM8/7/06

to

Wasn't it Fisher in the 20's who proved just about the same thing about

maximum likelihood estimates?

Certainly, there is nothing new in the thesis that if you have an

estimate q of a probability p, then the using q as a compression model

gives minimum compressed size for p = q. Moreover, this minimum is

unique. This is the same as saying that the maximum likelihood value

of q is p (not quite the same as the maximum likelihood *estimator*, of

course).

For exchangeable events x_i, this is easy to prove using the convexity

of the log function,

log(x) <= x-1

By Finetti's theorem, the x_i are conditionally independent so

p(X) = prod_i p(x_i | theta)

E[ log q(x) ] = sum_x p(x) log q(x)

But, comparing with the entropy of the distribution p,

sum_x p(x) log q(x) + H(p) = sum_x p(x) log q(x) / p(x)

and by the inequality above,

sum_x p(x) log (q(x) / p(x)) <= sum_x p(x) [q(x) / p(x) - 1]

<= sum_x q(x) - sum_x p(x) = 0

Thus

- sum_x p(x) log q(x) >= H(p)

and

- sum_x p(x) log q(x) = H(p) <==> p = q

But this is just the expected length of an optimal compressor

representing X using q as a model. Thus, minimum expected compressed

size implies p = q.

Also, the expected compressed size can be estimated by compressing a

large number of x_i, thus the minimum compressed size estimator is

asymptotically equal to p.

All of the excitement is, of course, before the asymptote kicks in.

How can we estimate p with fewer x_i.

Why this is news is beyond me.

James A. Bowery wrote:

> ... But it was not till the year 2000 that this was mathematically

> proven*, by Marcus Hutter, to be a founding principle of intelligence.

> Indeed, Hutter's Razor** might be phrased, "It is truer to explain with

> less that which can be explained with more."

>

> ... As Hutter has now

Aug 8, 2006, 4:12:35 PM8/8/06

to

Ted (and James),

We often argue about lots of things, but I am on your side here.

The insistence on limited program size pretty much excuses it from the

realm of the scientific, as it diverts people's efforts from the claimed

goals of the prize and onto clever and concise programming methods, and

in the process obscures what might otherwise be simple but powerful

compression methods. It also indirectly (and needlessly) presumes an

18MB simplicity in human experience, thought, and expression that

probably just isn't there.

Research programs like this should be coded in some user friendly and

extensively checked medium like Microsoft Access, which probably

requires 18MB just for "Hello World". Even the .NET run-time package

would exceed this. HOWEVER, it is unclear whether the run-time package

would count in the total, but the contest seems to presume an EXE

instead of (say) an MDB like Access would use. Probably the best

approach to squeeze things into the smallest EXE would be to "go retro"

with, say, VB5, but to what end - to try to fit some screwball rule?

Further, the program would be disqualified if it were to correct

grammatical errors and/or summarize cumbersome statements using the very

computer understanding that the prise is supposed to be promoting.

Indeed, abandoning lossless compression would pretty much yield what the

semantic web searching people are looking for once it gets good enough.

Further, this sort of a competition blows right past the ITAR problems

that I have posted in the past - a VERY good thing.

Yes, the world would GREATLY benefit from this sort of research, but the

"minor" flaws in this particular competition appear (to me) to be rather

fatal to its stated goals.

Sad. I was really excited about competing until I saw the flaws.

Steve Richfie1d

===============

Aug 8, 2006, 5:00:20 PM8/8/06

to

Ted Dunning wrote:

> Wasn't it Fisher in the 20's who proved just about the same thing about

> maximum likelihood estimates?

>

> Certainly, there is nothing new in the thesis that if you have an

> estimate q of a probability p, then the using q as a compression model

> gives minimum compressed size for p = q. Moreover, this minimum is

> unique. This is the same as saying that the maximum likelihood value

> of q is p (not quite the same as the maximum likelihood *estimator*, of

> course).

> Wasn't it Fisher in the 20's who proved just about the same thing about

> maximum likelihood estimates?

>

> Certainly, there is nothing new in the thesis that if you have an

> estimate q of a probability p, then the using q as a compression model

> gives minimum compressed size for p = q. Moreover, this minimum is

> unique. This is the same as saying that the maximum likelihood value

> of q is p (not quite the same as the maximum likelihood *estimator*, of

> course).

I think it was Shannon who proved this in 1948 (discete noiseless

channel capacity theorem).

http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

I could not find any reference to Fisher in his paper.

> Why this is news is beyond me.

That isn't what Hutter proved. What he proved is that the optimal

behavior of a goal seeking agent in an unknown but computable

environment is to guess that the environment is controlled by the

shortest program (Turing machine) consistent with all behavior observed

so far. Thus, a solution to the problem of maximizing the accumulated

reward signal from an unknown environment reduces to algorithmic

compression, which is an undecidable problem. However, Hutter also

proved that in the restricted case of an envoronment bounded by space s

and time t, that the problem is computable in time O(t2^s) (so the

problem is still hard).

When I developed the large text benchmark (without funding) my goal was

to promote research in NLP. There are 3 arguments why text compression

implies AI.

1. Hutter's proof in the specific case of text prediction (a problem

that is easy for people but hard for machines).

2. An explicit representation of the probability distribution of

natural language strings (a correct language model) enables one to pass

the Turing test for AI.

3. Compression ratio (perplexity) is already used in practice to

measure the quality of language models (mainly for speech recognition).

However, AI does *not* imply compression. Decompression requires a

deterministic computation to invert the compression function, and the

human brain is not deterministic. I believe this is the reason that

many people doubt the converse. I discuss these issues in

http://cs.fit.edu/~mmahoney/compression/rationale.html

-- Matt Mahoney

Aug 8, 2006, 5:45:30 PM8/8/06

to

Steve Richfie1d wrote:

> Ted (and James),

>

> We often argue about lots of things, but I am on your side here.

>

> The insistence on limited program size pretty much excuses it from the

> realm of the scientific, as it diverts people's efforts from the claimed

> goals of the prize and onto clever and concise programming methods, and

> in the process obscures what might otherwise be simple but powerful

> compression methods. It also indirectly (and needlessly) presumes an

> 18MB simplicity in human experience, thought, and expression that

> probably just isn't there.

> Ted (and James),

>

> We often argue about lots of things, but I am on your side here.

>

> The insistence on limited program size pretty much excuses it from the

> realm of the scientific, as it diverts people's efforts from the claimed

> goals of the prize and onto clever and concise programming methods, and

> in the process obscures what might otherwise be simple but powerful

> compression methods. It also indirectly (and needlessly) presumes an

> 18MB simplicity in human experience, thought, and expression that

> probably just isn't there.

Marcus Hutter and I disagreed on what the size of the test data should

be. I argued that it should be 1 GB because that is what Turing

estimated in 1950 it would take to pass the Turing test (a machine with

10^9 bits of memory, which is 1 GB based on Shannon's 1950 estimate of

1 bpc entropy of written English), Landauer's estimate of 10^9 bits of

human long term memory, and because it is about the amount of language

that the average adult has processed since birth. Hutter argued for

100 MB because it is easier to work with and makes the contest more

accessible to the public. He has a good point because many of the top

compressors are there only because they use all of the 800 MB available

on my test machine.

In the original test design I assume that all knowledge needed to pass

any text-based AI test (such as text prediction or the Turing test) is

learnable solely from text. I realize that the vast knowledge that

humans use in these tasks is not all learned verbally, but I think that

most of it *could* be. For example, a blind person can tell you that

the sky is blue. Also, if knowledge cannot easily be learned verbally,

then it is difficult to prove to somebody that you know it in a verbal

test. For example, how would you prove in the Turing test that you

know how to ride a bicycle, that you know what a banana smells like,

that you could recognize somebody's face? Can you give an example of

*any* human knowledge that cannot be learned verbally but can be

demonstrated verbally?

Thus, to successfully pass the compression test, it will be necessary

for a program to learn a language model from the test data itself as it

compresses/decompresses the data.

> Research programs like this should be coded in some user friendly and

> extensively checked medium like Microsoft Access, which probably

> requires 18MB just for "Hello World". Even the .NET run-time package

> would exceed this. HOWEVER, it is unclear whether the run-time package

> would count in the total, but the contest seems to presume an EXE

> instead of (say) an MDB like Access would use. Probably the best

> approach to squeeze things into the smallest EXE would be to "go retro"

> with, say, VB5, but to what end - to try to fit some screwball rule?

Most compressors are written in C, C++ and assembler for speed. File

I/O does not require any fancy programming environment. The executable

files are typically 100 KB or less. It is necessary to include the

size of the decompressor, otherwise I could trivially write a 100 MB

"decompressor" that simply outputs the data, thus achieving a

compressed size of 0.

> Further, the program would be disqualified if it were to correct

> grammatical errors and/or summarize cumbersome statements using the very

> computer understanding that the prise is supposed to be promoting.

The test data is very clean. Misspelled words and grammatical errors

are rare. This is one reason I chose this data set. See

http://cs.fit.edu/~mmahoney/compression/textdata.html for an analysis

of the data.

>

> Indeed, abandoning lossless compression would pretty much yield what the

> semantic web searching people are looking for once it gets good enough.

>

> Further, this sort of a competition blows right past the ITAR problems

> that I have posted in the past - a VERY good thing.

>

> Yes, the world would GREATLY benefit from this sort of research, but the

> "minor" flaws in this particular competition appear (to me) to be rather

> fatal to its stated goals.

>

> Sad. I was really excited about competing until I saw the flaws.

>

> Steve Richfie1d

-- Matt Mahoney

Aug 8, 2006, 7:54:11 PM8/8/06

to

"Matt Mahoney" <matma...@yahoo.com> writes:

> ...

> When I developed the large text benchmark (without funding) my goal was

> to promote research in NLP. There are 3 arguments why text compression

> implies AI.

> 1. Hutter's proof in the specific case of text prediction (a problem

> that is easy for people but hard for machines).

> 2. An explicit representation of the probability distribution of

> natural language strings (a correct language model) enables one to pass

> the Turing test for AI.

Not if I'm the one asking the questions. This idea is just

the ludicrous "humungous lookup table" in disguise.

David

Aug 8, 2006, 8:50:01 PM8/8/06

to

"Matt Mahoney" <matma...@yahoo.com> writes:

> Steve Richfie1d wrote:

> > Ted (and James),

> >

> > We often argue about lots of things, but I am on your side here.

> >

> > The insistence on limited program size pretty much excuses it from the

> > realm of the scientific, as it diverts people's efforts from the claimed

> > goals of the prize and onto clever and concise programming methods, and

> > in the process obscures what might otherwise be simple but powerful

> > compression methods. It also indirectly (and needlessly) presumes an

> > 18MB simplicity in human experience, thought, and expression that

> > probably just isn't there.

> [...]

> Thus, to successfully pass the compression test, it will be necessary

> for a program to learn a language model from the test data itself as it

> compresses/decompresses the data.

This "intelligence test" is so silly, it's hard to know where to start.

- There is no notion of "successfully passing the test", unless you

consider every lossless compressor/decompressor to pass the test, and

that includes ones that are clearly completely unintelligent.

- You are seeking minimal size of the decompressor that reproduces a

particular 100MB text file. A compressor/decompressor scheme that is

specialized for the structure and statistics of that particular file

can obviously do better than one that is more general in ability, but

this is equally obviously less intelligent than the more general one.

- It seems there is no limit on the CPU time required to generate

the decompressor, which again penalizes intelligence and rewards

brute force approaches.

- Why do you imagine that any "understanding" of the text is needed

to achieve maximal compression. A purely syntactic/statistical

approach that ignores any meaning of the text is going to be more

intelligent, by your criterion, since addressing semantics just adds

baggage and size to the decompressor. This might change if you

accepted a "fair paraphrase" rather than an exact reproduction,

but you don't.

In summary, what is proposed is no more or less than a test of the

intelligence and skill of lossless compression programmers, not of

AI programs.

David

Aug 8, 2006, 11:40:49 PM8/8/06

to

This is the point of measuring compressed size. One way to pass the

Turing test would be to look up the interrogator's exact question in

Google and reply with the text that follows from highest ranked

response. The problem is that your lookup table is several terabytes.

The challenge for you is to replace this huge table with something

smaller. That means instead of estimating the probability of a

sentence by counting Google hits, that you analyze the syntax and

semantics of the sentence to determine if it is correct, because the

number of exact matches is most likely to be zero. You have a choice

of either finding a succinct way of describing the rules that govern

whether a sentence is likely or unlikely, or you can find an efficient

way to learn these rules from just enough unlabled text.

Using compression ratio to measure the quality of a language model is

nothing new. Speech recognition researchers have been doing it for

years.

-- Matt Mahoney

Aug 9, 2006, 12:30:53 AM8/9/06

to

David Kinny wrote:

> "Matt Mahoney" <matma...@yahoo.com> writes:

>

> > Steve Richfie1d wrote:

> > > Ted (and James),

> > >

> > > We often argue about lots of things, but I am on your side here.

> > >

> > > The insistence on limited program size pretty much excuses it from the

> > > realm of the scientific, as it diverts people's efforts from the claimed

> > > goals of the prize and onto clever and concise programming methods, and

> > > in the process obscures what might otherwise be simple but powerful

> > > compression methods. It also indirectly (and needlessly) presumes an

> > > 18MB simplicity in human experience, thought, and expression that

> > > probably just isn't there.

>

> > [...]

> > Thus, to successfully pass the compression test, it will be necessary

> > for a program to learn a language model from the test data itself as it

> > compresses/decompresses the data.

>

> This "intelligence test" is so silly, it's hard to know where to start.

>

> - There is no notion of "successfully passing the test", unless you

> consider every lossless compressor/decompressor to pass the test, and

> that includes ones that are clearly completely unintelligent.

> "Matt Mahoney" <matma...@yahoo.com> writes:

>

> > Steve Richfie1d wrote:

> > > Ted (and James),

> > >

> > > We often argue about lots of things, but I am on your side here.

> > >

> > > The insistence on limited program size pretty much excuses it from the

> > > realm of the scientific, as it diverts people's efforts from the claimed

> > > goals of the prize and onto clever and concise programming methods, and

> > > in the process obscures what might otherwise be simple but powerful

> > > compression methods. It also indirectly (and needlessly) presumes an

> > > 18MB simplicity in human experience, thought, and expression that

> > > probably just isn't there.

>

> > [...]

> > Thus, to successfully pass the compression test, it will be necessary

> > for a program to learn a language model from the test data itself as it

> > compresses/decompresses the data.

>

> This "intelligence test" is so silly, it's hard to know where to start.

>

> - There is no notion of "successfully passing the test", unless you

> consider every lossless compressor/decompressor to pass the test, and

> that includes ones that are clearly completely unintelligent.

We did not set a criteria for "passing" the test because we do not know

exactly what level of compression is needed. Shannon estimated in 1950

that the written English (using a 27 character alphabet of A-Z and

space) has an entropy of between 0.6 and 1.3 bits per character. This

is better than the best compressors can do, but not by much.

Humans are better than machines at text prediction, e.g. guessing the

next character or word in "roses are ___" or "3+8=__". Clearly this

task requires lexical, syntactic, semantic and vast, real world

knowledge. For a machine to pass the AI test, it must do as well as

humans by some measure, such as the number of guesses on average until

the correct answer is guessed. The measure we use is to assign

probabilities to all possible choices and a score of -log(p) for the

probability p assigned by the guesser to the correct answer. Shannon

proved that this score is minimized when the guesser assigns the exact

same probabilities as the true (but unknown) probability distribution.

A problem with this approach is that it is conceptually difficult for

people to assign numeric probabilities. This is why Shannon used a

guessing game instead, but this resulted in the large error bounds I

described, because different distributions could lead to the same

series of guesses. Therefore, a task that remains is to more

accurately estimate the entropy of written English with respect to

humans. In particular, we need to estimate the entropy for the exact

text used in this particular data set. Until that is done, we can only

reward improvements in the state of the art.

> - You are seeking minimal size of the decompressor that reproduces a

> particular 100MB text file. A compressor/decompressor scheme that is

> specialized for the structure and statistics of that particular file

> can obviously do better than one that is more general in ability, but

> this is equally obviously less intelligent than the more general one.

I expect that a winning decompressor will be tuned specifically to this

data set. That is why the size of the decompressor is included. The

statistics can either be learned from the data or they can be hard

coded into the decompressor by the programmer. It does not matter to

us which way it is done. Any tuning done in the decompressor will

decrease the compressed file size but increase the size of the

decompressor. It is within the rules to write a "decompressor" that

simply outputs the test file verbatim, but good luck with the size of

the program.

> - It seems there is no limit on the CPU time required to generate

> the decompressor, which again penalizes intelligence and rewards

> brute force approaches.

I expect that NLP will be CPU and/or memory intensive, because other

approaches have not worked. We do not want to restrict the range of

possible solutions any more than necessary. Thus, we only require

"reasonable" computation limits by specifying that the solution must be

testable by the public and have a 30 day period of public comment. If

nobody can run the program in that time, then it loses. This rule

allows for better models as computers become more powerful.

> - Why do you imagine that any "understanding" of the text is needed

> to achieve maximal compression. A purely syntactic/statistical

> approach that ignores any meaning of the text is going to be more

> intelligent, by your criterion, since addressing semantics just adds

> baggage and size to the decompressor. This might change if you

> accepted a "fair paraphrase" rather than an exact reproduction,

> but you don't.

A "fair paraphrase" implies lossy compression, which cannot be measured

objectively. Who is to say if a paraphrase is fair or not? The extra

information needed to encode the exact text is small, and is the same

for all contestants.

And what is "understanding"? When the semantics of words are not

grounded to sensory or motor I/O, semantics is just a fuzzy identity

relation: two words are related if they are likely to appear near each

other in running text. Models like LSA exploit the transitive property

of semantics. If "moon" appears near "star" and "star" appears near

"night", then the model predicts that "night" will appear near "moon".

Semantics is just a statistical property.

> In summary, what is proposed is no more or less than a test of the

> intelligence and skill of lossless compression programmers, not of

> AI programs.

Yes, that is what the test is designed to do, unless you think that

machines can think.

> David

-- Matt Mahoney

Aug 9, 2006, 5:32:20 AM8/9/06

to

"Matt Mahoney" <matma...@yahoo.com> writes:

I don't dispute the relevance of compression in measuring the quality

of a language model. What I dispute is the notion that you can pass

the Turing test by looking up responses in a table, compressed or not.

My claim is that I, as judge, can easily construct a sequence of

questions that cannot be answered correctly by table lookup, as the

correct answers depends on previous aspects of the conversation, or

other variable context information not even part of the conversation:

Me: I was born in December 1966.

System: That's nice. [[or whatever, no correct answer expected here]]

Me: How many days left before Xmas, not counting today?

System: ??

Despite the inadequacy of a table/language model to generate the right

answer, let's suppose somehow it comes up with it. I continue:

Me: OK, add that to the number of whole years since the first moon

landing, and substract my age, what do you get?

System: ??

Do you see the problem now?

David

Aug 9, 2006, 6:17:10 AM8/9/06

to

>[...]

>> In summary, what is proposed is no more or less than a test of the

>> intelligence and skill of lossless compression programmers, not of

>> AI programs.

>> In summary, what is proposed is no more or less than a test of the

>> intelligence and skill of lossless compression programmers, not of

>> AI programs.

>Yes, that is what the test is designed to do, unless you think that

>machines can think.

Well, I'm glad we agree on what the test measures. Why then did the

original announcement contain the following very misleading claim:

"Artificial intelligence researchers finally have an objective and

rigorously validated measure of the intelligence of their machines.

Furthermore the higher the measured intelligence of their machines

the more money they can win via the Hutter Prize."

The problem is that "AI researchers' machines" are almost always

intended to do things other than lossless compression of slabs of

text, so the test would invariably claim they have no intelligence.

David

Aug 9, 2006, 12:47:37 PM8/9/06

to

How do you propose that intelligence (artificial or otherwise) be

measured?

-- Matt Mahoney

Aug 9, 2006, 12:58:46 PM8/9/06

to

Matt Mahoney wrote:

> Ted Dunning wrote:

> > Wasn't it Fisher in the 20's who proved just about the same thing about

> > maximum likelihood estimates?

> I think it was Shannon who proved this in 1948 (discete noiseless

> channel capacity theorem).

> http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

> I could not find any reference to Fisher in his paper.

I was referring to the asymptotic correctness of MLE's, not the

compression. The fundamental insight is that the maximum likelihood

estimator converges to truth. The fact that the MLE also gives you

optimal compression is an add-on to the basic idea and does indeed

follow from Shannon's work.

Aug 9, 2006, 7:48:41 PM8/9/06

to

>-- Matt Mahoney

By carefully crafted tests of knowledge, capability, comprehension,

and ability to reason, to recognise, to solve problems, to learn...

David

Aug 9, 2006, 9:06:18 PM8/9/06

to

Think about it like this:

You've heard of functions, right? They are things with inputs and

outputs. All possible inputs are called the domain. All possible

outputs are called the range. One way to implement functions is to

make a two column table with one column labeled "inputs" and the other

labeled "outputs" where each row is what the function outputs for the

given input. You'd call it a lookup table.

Now, imagine a program that, for every _sequence_ of N text inputs, i1,

i2, i3...iN, creates N rows in the table, the first row has input of

i1, and some output o1. The input for the next row is i2 appended to

the end of i1 and the output is o2... and you continue this until you

have completed all N inputs. Agreed this could be a VERY BIG table but

so could the table representing the function "+" between two 64 bit

numbers. Yet the size of the program you need to implement the

function "+" between two 64 bit numbers is very small... measured in

only tens of bytes on some computers. Incredible isn't it? That's the

magic of what is called Kolmogorov complexity.

Anyway, so let's go back to our ridiculously big text function and ask

ourselves, how big does the program have to be to generate the program

that's equivalent to that table?

Would you agree that it might be somewhat smaller than the table itself?

Aug 9, 2006, 11:15:23 PM8/9/06

to

I don't claim that a language model can be implemented as a lookup

table. I claim

1. that a better language model is more likely to pass the Turing test,

2. that a better language model will compress text better.

By language model, I mean a probability distribution over the set of

all strings. So a model will assign p("Today is Aug. 10, 2006 ... I

was born in (rest of dialog)... what do you get? x") some number in

the open interval (0,1) for all x.

Now let q be a dialog ending with a question, and x1 and x2 be two

posible answers. I think it is reasonable to assume that the following

statements are equivalent (all true or all false) for all q, x1, x2:

1. A human picked at random and asked question q is more likely to

answer x1 than x2.

2. A human picked at random is more likely to judge the pair q,x1 as

correct than q,x2.

3. A text corpus of human origin is more likely to contain q,x1 than

q,x2.

Now suppose that a machine is able to compute p() such that the above

statements are equivalent to p(q,x1) > p(q,x2) for all q, x1, and x2.

Then that machine would answer all questions in a way that is

indistinguishable from a human. It would pass the Turing test.

Now suppose that we have two models p1() and p2() and want to know

which is better (closer to p)? One way is to test it is to ask it a

multiple choice question q with answers x1 and x2, and have each

program output the answer with the higher probability according to its

model. So if p1(q,x1) > p1(q,x2) in model p1() then the program

answers x1. If p1() answers x1 and p2() answers x2 and a human judge

decides that x1 is more correct, then p1() is judged more human.

Another test would be for the judge to first decide the correct answer

(say, x1), then ask both models to explicitly compute p1(q,x1) and

p2(q,x1). The model that gives the higher value wins. We don't

normally test humans this way because it is conceptually difficult for

people to assign numeric probabilities. But it is not hard for

machines.

This test can be repeated, for example, by multiplying the scores for

several independent questions to compute the probability that all of

the answers are correct. So given a series of questions q1, q2, q3,...

with correct answers x1, x2, x3..., the winner is the one with the

greatest value of PROD_i [p(x_i|q_i)], or equivalently, the smallest

value of -SUM_i [log p(x_i|q_i)], which is the compressed size of the

answers using ideal coding.

The advantage of this approach is that no human is required to evaluate

the language model. The score is exact, repeatable, and verifiable.

What I tried to show is that good text compression requires

sophisticated reasoning ability and vast knowledge, just as your

example Turing test session does.

-- Matt Mahoney

Aug 9, 2006, 11:43:08 PM8/9/06

to

David Kinny wrote:

> >How do you propose that intelligence (artificial or otherwise) be

> >measured?

>

> >-- Matt Mahoney

>

> By carefully crafted tests of knowledge, capability, comprehension,

> and ability to reason, to recognise, to solve problems, to learn...

>

> David

> >How do you propose that intelligence (artificial or otherwise) be

> >measured?

>

> >-- Matt Mahoney

>

> By carefully crafted tests of knowledge, capability, comprehension,

> and ability to reason, to recognise, to solve problems, to learn...

>

> David

If you were to run a competition lasting for many years among AI

systems with cash prizes, how would you design the tests so that the

competition is fair, scores are 100% objective, and systems can't

cheat?

-- Matt Mahoney

Aug 10, 2006, 6:29:34 PM8/10/06

to

You claimed above that "One way to pass the Turing test would be to

look up the interrogator's exact question in Google and reply with

the text that follows from highest ranked response. The problem is

that your lookup table is several terabytes."

This claim is nonsense - the prospect of passing the Turing test

in this way is effectively nil.

> I claim

> 1. that a better language model is more likely to pass the Turing test,

> 2. that a better language model will compress text better.

Another claim you made, to which I responded above, is:

"2. An explicit representation of the probability distribution of

natural language strings (a correct language model) enables one

to pass the Turing test for AI."

You use the phrase "An explicit representation of ... ", which certainly

sounds like a table of probabilities indexed by strings, and "enables

one to pass the Turing test", which is a very strong claim. But it

seems you now want to downgrade this to a program that computes

probabilities, and to "is more likely to pass the TT". Clearly a

program that has an infinitesimal chance of passing the TT is more

likely to pass than one that has zero chance, but it's not very

reasonable to claim that such a program can pass the TT if its

chances of soing so are sufficiently small. I could create a perl

script that has a non-zero chance to pass the TT in half an hour.

> By language model, I mean a probability distribution over the set of

> all strings. So a model will assign p("Today is Aug. 10, 2006 ... I

> was born in (rest of dialog)... what do you get? x") some number in

> the open interval (0,1) for all x.

Before I try to respond to the rest of this, please clarify for me:

- What you meant by "an explict representation of ..."

- What, exactly, does the probability of some string S represent here?

The probability of S occurring next in a particular document?

P(S occurs next in the whole corpus of written English text)?

P(S occurs next in a randomly chosen spoken English conversation)?

P(S occurs next in a randomly chosen Turing Test)?

- Do you really mean "over the set of ALL strings"? Language models

typically deal with n-grams for some small n. Are you proposing

an idealised, infinitely complex model? Or what?

David

Aug 10, 2006, 9:24:28 PM8/10/06

to

David Kinny wrote:

> You claimed above that "One way to pass the Turing test would be to

> look up the interrogator's exact question in Google and reply with

> the text that follows from highest ranked response. The problem is

> that your lookup table is several terabytes."

>

> This claim is nonsense - the prospect of passing the Turing test

> in this way is effectively nil.

> You claimed above that "One way to pass the Turing test would be to

> look up the interrogator's exact question in Google and reply with

> the text that follows from highest ranked response. The problem is

> that your lookup table is several terabytes."

>

> This claim is nonsense - the prospect of passing the Turing test

> in this way is effectively nil.

OK, it's nil. The point I was trying to make is that if you have a

small training corpus, then you have to be smart about how you learn a

language model from it. If it's large, you can use a brute force

string matching approach. I believe that 1 GB is the right size, large

enough to learn syntax and semantics, but small enough that you can't

just search for the judge's question and output whatever follows.

It is controversial whether a language model can be learned from an

unlabeled corpus of positive examples, but I believe it is possible.

It is easy to extract a lexical model (dictionary). There are

statistical semantic models such as LSA, and low level syntactic models

such as hidden Markov models, that have been learned on data sets of

similar size. I believe it is possible with the Wikipedia dataset. It

is doubtful to me that the data is sufficient to learn higher level

reasoning and problem solving skills like arithmetic without some help

from the programmer, but we shall see.

> > I claim

> > 1. that a better language model is more likely to pass the Turing test,

> > 2. that a better language model will compress text better.

>

> Another claim you made, to which I responded above, is:

> "2. An explicit representation of the probability distribution of

> natural language strings (a correct language model) enables one

> to pass the Turing test for AI."

>

> You use the phrase "An explicit representation of ... ", which certainly

> sounds like a table of probabilities indexed by strings, and "enables

> one to pass the Turing test", which is a very strong claim. But it

> seems you now want to downgrade this to a program that computes

> probabilities, and to "is more likely to pass the TT". Clearly a

> program that has an infinitesimal chance of passing the TT is more

> likely to pass than one that has zero chance, but it's not very

> reasonable to claim that such a program can pass the TT if its

> chances of soing so are sufficiently small. I could create a perl

> script that has a non-zero chance to pass the TT in half an hour.

It is irrelevant whether a Turing machine implements a language model

as an algorithm or a lookup table.

> > By language model, I mean a probability distribution over the set of

> > all strings. So a model will assign p("Today is Aug. 10, 2006 ... I

> > was born in (rest of dialog)... what do you get? x") some number in

> > the open interval (0,1) for all x.

>

> Before I try to respond to the rest of this, please clarify for me:

>

> - What you meant by "an explict representation of ..."

>

> - What, exactly, does the probability of some string S represent here?

> The probability of S occurring next in a particular document?

> P(S occurs next in the whole corpus of written English text)?

> P(S occurs next in a randomly chosen spoken English conversation)?

> P(S occurs next in a randomly chosen Turing Test)?

>

> - Do you really mean "over the set of ALL strings"? Language models

> typically deal with n-grams for some small n. Are you proposing

> an idealised, infinitely complex model? Or what?

>

> David

A language model is a function, p1(): S -> [0,1] where S is the set of

all string over some alphabet, e.g. S = {0,1}*, and furthermore, p1(s)

>= 0, and SUM_s p1(s) = 1 for s in S.

Define p(s) as the probability that s will occur in a two way dialog

between a human judge and a human confederate in the Turing test. s is

a prefix of the entire dialog from beginning to end. p(s) is unknown,

like all probabilities in real life. p(s) depends on the judge, the

confederate, and when the test is given.

A model, p1(), is an approximation of p(). The cross entropy is

defined as H1 = -SUM_s p(s) log p1(s), for all s in S. The cross

entropy is a measure of how accurately p1() approximates p(). H1 is

minimized when p1(s) = p(s) for all s in S. Note that H1 is the

expected size of s chosen randomly from distribution p() when

compressed with model p1(). It is required that 0 < p1(s) < 1 for all

s in S.

When I say there is an explicit representation of p1(), I mean that

there is a Turing machine that computes it.

I realize that Wikipedia has a different distribution than p(s) for

Turing tests. The proper training corpus for passing the Turing test

would be a set of transcripts between judges and human confederates,

like those at the Loebner prize website. Chat logs would also be a

better choice than Wikipedia. I argue that this difference is an

irrelevant detail. The main difference is that Wikipedia articles are

not interactive, so a machine training on it would not learn to hold a

conversation. Other than that, text prediction still requires high

level language comprehension, common sense, reasoning, and problem

solving abilities just like passing the Turing test does. If you can

model written articles, then I think extending the model to interactive

conversation would not be too hard.

-- Matt Mahoney

Aug 11, 2006, 5:05:37 PM8/11/06

to

David Kinny wrote:

> Another claim you made, to which I responded above, is:

> "2. An explicit representation of the probability distribution of

> natural language strings (a correct language model) enables one

> to pass the Turing test for AI."

>

> You use the phrase "An explicit representation of ... ", which certainly

> sounds like a table of probabilities indexed by strings, and "enables

> one to pass the Turing test", which is a very strong claim.

It may sound like that to you, but it doesn't sound like that to

somebody who is taking just a bit more time to listen.

An explicit representation is one that is explicitly defined. A table

of logarithms is an explicit representation of the log function. So is

a series expansion.

> But it

> seems you now want to downgrade this to a program that computes

> probabilities, and to "is more likely to pass the TT".

There is no downgrading involved. This is conventional usage.

You are fighting a straw man for no discernible reason.

Aug 12, 2006, 4:50:32 AM8/12/06

to

Sorry for the delay in getting back to you here ...

That might be true in theory, but we are concerned here with what can

be feasibly implemented in current or future practice, are we not?

> > > By language model, I mean a probability distribution over the set of

> > > all strings. So a model will assign p("Today is Aug. 10, 2006 ... I

> > > was born in (rest of dialog)... what do you get? x") some number in

> > > the open interval (0,1) for all x.

> >

> > Before I try to respond to the rest of this, please clarify for me:

> >

> > - What you meant by "an explict representation of ..."

> >

> > - What, exactly, does the probability of some string S represent here?

> > The probability of S occurring next in a particular document?

> > P(S occurs next in the whole corpus of written English text)?

> > P(S occurs next in a randomly chosen spoken English conversation)?

> > P(S occurs next in a randomly chosen Turing Test)?

> >

> > - Do you really mean "over the set of ALL strings"? Language models

> > typically deal with n-grams for some small n. Are you proposing

> > an idealised, infinitely complex model? Or what?

> A language model is a function, p1(): S -> [0,1] where S is the set of

> all string over some alphabet, e.g. S = {0,1}*, and furthermore, p1(s)

> >= 0, and SUM_s p1(s) = 1 for s in S.

> Define p(s) as the probability that s will occur in a two way dialog

> between a human judge and a human confederate in the Turing test. s is

> a prefix of the entire dialog from beginning to end. p(s) is unknown,

> like all probabilities in real life. p(s) depends on the judge, the

> confederate, and when the test is given.

Here I'm a little confused. You say that s is a prefix of the dialog,

and that p(s) depends on the judge, etc, and when the test is given,

implying that this p() is for a *specific* dialog. In that case the

distribution is trivial: all k possible prefixes have probability 1/k,

all other strings 0 (or some variant of this). Surely you wish to

consider the distribution over some set of dialogs, but which set?

For a given p() to be useful in passing the TT, it has to adequately

cover possible dialogs. It's easy, as judge, to create novel dialogs.

For example, the following short question whose answer is obvious has

almost certainly never been asked before in the history of mankind.

How many legs do 3 ants, an octopus and 4 aardvarks have in total?

Then there's the issue of background knowledge. In a previous posting

you addressed the need to know such things as the date by prefixing

that information onto the actual dialog. How do you propose to

decide which background information to add in explicitly?

It seems, the way this is heading, that you need p() to be some

idealized distribution over the set of all possible human to human TT

dialogs prefixed by all necessary background information. In that

case it doesn't depend on the judge, the time, etc. Right?

> A model, p1(), is an approximation of p(). The cross entropy is

> defined as H1 = -SUM_s p(s) log p1(s), for all s in S. The cross

> entropy is a measure of how accurately p1() approximates p(). H1 is

> minimized when p1(s) = p(s) for all s in S. Note that H1 is the

> expected size of s chosen randomly from distribution p() when

> compressed with model p1(). It is required that 0 < p1(s) < 1 for all

> s in S.

> When I say there is an explicit representation of p1(), I mean that

> there is a Turing machine that computes it.

I'll return below to the question of how such a p1() could be created,

but let's move on to another issue. Let's assume that you have an

actual, computationally tractable implementation of an algorithm that

computes p1(), which is a very good approximation of some suitably

extensive p(). Now how can you actually use this to pass the TT?

You've been asked a question, and the dialog so far is the string d.

You need to find a suffix x such that p1(d+x) is non-zero (I assume

you'll settle for any suitable answer rather than the most likely).

This means searching over the space of possible strings x and

repeatedly evaluating p1(d+x). This is not feasible. And if your

algorithm in fact takes d as input and returns an x s.t. p1(d+x) > 0,

it is not a language model as you've defined it, but something else.

For completeness, I'm just going to backtrack to something you wrote

earlier in the thread.

> Now let q be a dialog ending with a question, and x1 and x2 be two

> posible answers. I think it is reasonable to assume that the following

> statements are equivalent (all true or all false) for all q, x1, x2:

> 1. A human picked at random and asked question q is more likely to

> answer x1 than x2.

> 2. A human picked at random is more likely to judge the pair q,x1 as

> correct than q,x2.

> 3. A text corpus of human origin is more likely to contain q,x1 than

> q,x2.

I'd agree that 1 and 2 are equivalent, but dispute the equivalence of 3.

The reason is that, except for a vanishingly small number of dialogs q,

the probability of a text corpus of human origin containing q,x1 is

identical to the probability of it containing q,x2 , namely 0. As I've

shown it is easy to create dialogs such that this is the case.

> I realize that Wikipedia has a different distribution than p(s) for

> Turing tests. The proper training corpus for passing the Turing test

> would be a set of transcripts between judges and human confederates,

> like those at the Loebner prize website. Chat logs would also be a

> better choice than Wikipedia. I argue that this difference is an

> irrelevant detail. The main difference is that Wikipedia articles are

> not interactive, so a machine training on it would not learn to hold a

> conversation. Other than that, text prediction still requires high

> level language comprehension, common sense, reasoning, and problem

> solving abilities just like passing the Turing test does. If you can

> model written articles, then I think extending the model to interactive

> conversation would not be too hard.

> -- Matt Mahoney

So finally you gloss over several issues, dismissing them as irrelevant

details, or easy. You claim, apparently as an act of faith, that:

- It will be feasible to learn a sufficiently good language model from a

tractably small corpus. This is on a par with claiming that it will

be feasible to learn to play championship Go just from example games.

- Given such a model, it will be feasible to use it to pass the TT.

- The difference between the language model of TT dialogs and of a slab

of Wikipedia is unimportant. You, the programmer, can somehow extend

or convert the one into the other.

- The limited extent of the corpus will not matter. Somehow your LM

will have the capability to generalise, extrapolate, answer novel

questions, incorporate all necessary background information, know

how to do computations, solve simple problems, remember things, ...

Here's my take. It will, eventually, be feasible to create programs

that do all these latter things well enough to pass a TT, even in the

face of the types of lines of questioning I've pointed to. Inevitably

such systems will acquire much of their knowledge by learning, from

conversations and texts amongst other things. But they will be so

much more than just language models, in terms of what they consist of

and what they are capable of doing. They could actually be used to

generate language models of a certain type, ones where probabilities

were significant only in whether they were zero or not. Indeed, I

suspect that the only way you can generate a LM with sufficient

coverage to (in theory) pass the TT, is to use a system with such

capabilities as the generator.

To put it simply, I'd claim the only way you can create a LM good

enough to (in theory) pass the TT is to build a system that can

already pass the TT without any need to create that LM. And you

don't achieve that just as a side-effect of achieving compression

of 1G slabs of text that is close to the information-theoretic limit,

as that task can be achieved by much much simpler approaches.

Compressing text better is at best just one small step on the path

to implementing systems that pass the TT.

David

Aug 12, 2006, 5:01:03 AM8/12/06

to

"Ted Dunning" <ted.d...@gmail.com> writes:

> David Kinny wrote:

> > Another claim you made, to which I responded above, is:

> > "2. An explicit representation of the probability distribution of

> > natural language strings (a correct language model) enables one

> > to pass the Turing test for AI."

> >

> > You use the phrase "An explicit representation of ... ", which certainly

> > sounds like a table of probabilities indexed by strings, and "enables

> > one to pass the Turing test", which is a very strong claim.

> David Kinny wrote:

> > Another claim you made, to which I responded above, is:

> > "2. An explicit representation of the probability distribution of

> > natural language strings (a correct language model) enables one

> > to pass the Turing test for AI."

> >

> > You use the phrase "An explicit representation of ... ", which certainly

> > sounds like a table of probabilities indexed by strings, and "enables

> > one to pass the Turing test", which is a very strong claim.

> It may sound like that to you, but it doesn't sound like that to

> somebody who is taking just a bit more time to listen.

Steady on there Ted. Matt began by claiming that one could pass the

TT using a several terabyte lookup table. I'm the one who is taking

the time here to try and clarify exactly what he really means, and it

keeps changing from the laughable pop-science claims of the original

announcement. Hopefully now it's approaching something reasonable.

> An explicit representation is one that is explicitly defined. A table

> of logarithms is an explicit representation of the log function. So is

> a series expansion.

I agree, both are explicit. I really don't want to quibble about

semantics, but I'd say that an algorithm that computes logarithms in

some non-transparent way is by contrast an implicit representation.

Your usage of language may vary.

> > But it

> > seems you now want to downgrade this to a program that computes

> > probabilities, and to "is more likely to pass the TT".

> There is no downgrading involved. This is conventional usage.

The change from "can pass" to "is more likely to pass", from a

strong claim to a weak one, is certainly downgrading. And a program

is potentially feasible, while a lookup table is certainly not, as

it seems we all now agree.

> You are fighting a straw man for no discernible reason.

There is confusion being generated here that deserves to be dispelled.

David

Aug 12, 2006, 1:18:22 PM8/12/06

to

> > It is irrelevant whether a Turing machine implements a language model

> > as an algorithm or a lookup table.

>

> That might be true in theory, but we are concerned here with what can

> be feasibly implemented in current or future practice, are we not?

> > as an algorithm or a lookup table.

>

> That might be true in theory, but we are concerned here with what can

> be feasibly implemented in current or future practice, are we not?

In practice, Turing machines don't exist.

> > Define p(s) as the probability that s will occur in a two way dialog

> > between a human judge and a human confederate in the Turing test. s is

> > a prefix of the entire dialog from beginning to end. p(s) is unknown,

> > like all probabilities in real life. p(s) depends on the judge, the

> > confederate, and when the test is given.

>

> Here I'm a little confused. You say that s is a prefix of the dialog,

> and that p(s) depends on the judge, etc, and when the test is given,

> implying that this p() is for a *specific* dialog. In that case the

> distribution is trivial: all k possible prefixes have probability 1/k,

> all other strings 0 (or some variant of this). Surely you wish to

> consider the distribution over some set of dialogs, but which set?

In real life, probabilities are always unknown. Different people will

estimate them differently depending on what they know. For example, I

flip a fair coin, look at it, but don't show it to you. What is the

probability that it is heads? You say 1/2. I say 1. Both of us are

right.

I define p(s) as the probability that s will be a prefix of a dialog in

a Turing test, given that you do not know who will be the judge or the

confederate or when the test will be given.

I realize that p() is still not well defined. But that is the nature

of probability.

> For a given p() to be useful in passing the TT, it has to adequately

> cover possible dialogs. It's easy, as judge, to create novel dialogs.

> For example, the following short question whose answer is obvious has

> almost certainly never been asked before in the history of mankind.

>

> How many legs do 3 ants, an octopus and 4 aardvarks have in total?

And yet, the probability that this will answered in an unknown training

corpus will not be exactly 0.

Also, I think you will agree that this question and the answer could be

phrased as a text prediction problem, in which case a compressor that

can solve problems like this will compress smaller than one that

cannot.

> Then there's the issue of background knowledge. In a previous posting

> you addressed the need to know such things as the date by prefixing

> that information onto the actual dialog. How do you propose to

> decide which background information to add in explicitly?

If a machine responds too fast to my questions, I will know it is not

human. A dialog must contain information about who said what and when.

You can represent this information how you like.

> It seems, the way this is heading, that you need p() to be some

> idealized distribution over the set of all possible human to human TT

> dialogs prefixed by all necessary background information. In that

> case it doesn't depend on the judge, the time, etc. Right?

It does depend, but you don't have this information, so you have to

make reasonable guesses.

> > A model, p1(), is an approximation of p(). The cross entropy is

> > defined as H1 = -SUM_s p(s) log p1(s), for all s in S. The cross

> > entropy is a measure of how accurately p1() approximates p(). H1 is

> > minimized when p1(s) = p(s) for all s in S. Note that H1 is the

> > expected size of s chosen randomly from distribution p() when

> > compressed with model p1(). It is required that 0 < p1(s) < 1 for all

> > s in S.

>

> > When I say there is an explicit representation of p1(), I mean that

> > there is a Turing machine that computes it.

>

> I'll return below to the question of how such a p1() could be created,

> but let's move on to another issue. Let's assume that you have an

> actual, computationally tractable implementation of an algorithm that

> computes p1(), which is a very good approximation of some suitably

> extensive p(). Now how can you actually use this to pass the TT?

>

> You've been asked a question, and the dialog so far is the string d.

> You need to find a suffix x such that p1(d+x) is non-zero (I assume

> you'll settle for any suitable answer rather than the most likely).

> This means searching over the space of possible strings x and

> repeatedly evaluating p1(d+x). This is not feasible. And if your

> algorithm in fact takes d as input and returns an x s.t. p1(d+x) > 0,

> it is not a language model as you've defined it, but something else.

This is a perfectly reasonable implementation for a Turing machine. In

practice, any language model can be converted to a generator with at

most O(n) increase in tme complexity by binary search by appending one

bit at a time. You first compute p1(d+0) and p1(d+1) and randomly

append one bit in proportion to these probabilities, repeating until

you have appended a symbol indicating the end of the message (which is

required in a dialog).

> For completeness, I'm just going to backtrack to something you wrote

> earlier in the thread.

>

> > Now let q be a dialog ending with a question, and x1 and x2 be two

> > posible answers. I think it is reasonable to assume that the following

> > statements are equivalent (all true or all false) for all q, x1, x2:

>

> > 1. A human picked at random and asked question q is more likely to

> > answer x1 than x2.

> > 2. A human picked at random is more likely to judge the pair q,x1 as

> > correct than q,x2.

> > 3. A text corpus of human origin is more likely to contain q,x1 than

> > q,x2.

>

> I'd agree that 1 and 2 are equivalent, but dispute the equivalence of 3.

> The reason is that, except for a vanishingly small number of dialogs q,

> the probability of a text corpus of human origin containing q,x1 is

> identical to the probability of it containing q,x2 , namely 0. As I've

> shown it is easy to create dialogs such that this is the case.

Without full knowledge, you cannot say that any probability is 0. If

you had full knowledge then there would be no use for probability

theory.

> > I realize that Wikipedia has a different distribution than p(s) for

> > Turing tests. The proper training corpus for passing the Turing test

> > would be a set of transcripts between judges and human confederates,

> > like those at the Loebner prize website. Chat logs would also be a

> > better choice than Wikipedia. I argue that this difference is an

> > irrelevant detail. The main difference is that Wikipedia articles are

> > not interactive, so a machine training on it would not learn to hold a

> > conversation. Other than that, text prediction still requires high

> > level language comprehension, common sense, reasoning, and problem

> > solving abilities just like passing the Turing test does. If you can

> > model written articles, then I think extending the model to interactive

> > conversation would not be too hard.

>

> > -- Matt Mahoney

>

> So finally you gloss over several issues, dismissing them as irrelevant

> details, or easy. You claim, apparently as an act of faith, that:

>

> - It will be feasible to learn a sufficiently good language model from a

> tractably small corpus. This is on a par with claiming that it will

> be feasible to learn to play championship Go just from example games.

I believe it is possible to develop lexical, semantic, and syntactic

models from unlabeled text containing only positive examples. Children

seem to do this. It is probably insufficient to develop higher level

skills like solving word problems with arithmetic. That will probably

take some help from the programmer.

A corpus consisting of articles is not sufficient to train a chatterbot

in interactive converstation. But I argue that such a model is still

useful, probably more so. If you want to translate articles, scan

them, recognize them from broadcast speech, etc., you want to train on

the type of text that it would typically encounter.

> - Given such a model, it will be feasible to use it to pass the TT.

Given about 1 GB of the right type of training material (dialogs), yes.

> - The difference between the language model of TT dialogs and of a slab

> of Wikipedia is unimportant. You, the programmer, can somehow extend

> or convert the one into the other.

There is a lot of overlap in the learning mechanisms. The difference

is mainly in the training material and the requirement for interaction.

> - The limited extent of the corpus will not matter. Somehow your LM

> will have the capability to generalise, extrapolate, answer novel

> questions, incorporate all necessary background information, know

> how to do computations, solve simple problems, remember things, ...

I think 1 GB is the right size, as explained in

http://cs.fit.edu/~mmahoney/compression/rationale.html

I do not claim that this particular corpus is right for any particular

AI problem you want to solve. You need to be careful to feed it the

information you want it to know. But I believe for the problem of

predicting text in a corpus, the best information to feed it is the

corpus itself. So I have no problem with the Wikipedia text for this

purpose.

> Here's my take. It will, eventually, be feasible to create programs

> that do all these latter things well enough to pass a TT, even in the

> face of the types of lines of questioning I've pointed to. Inevitably

> such systems will acquire much of their knowledge by learning, from

> conversations and texts amongst other things. But they will be so

> much more than just language models, in terms of what they consist of

> and what they are capable of doing. They could actually be used to

> generate language models of a certain type, ones where probabilities

> were significant only in whether they were zero or not. Indeed, I

> suspect that the only way you can generate a LM with sufficient

> coverage to (in theory) pass the TT, is to use a system with such

> capabilities as the generator.

>

> To put it simply, I'd claim the only way you can create a LM good

> enough to (in theory) pass the TT is to build a system that can

> already pass the TT without any need to create that LM. And you

> don't achieve that just as a side-effect of achieving compression

> of 1G slabs of text that is close to the information-theoretic limit,

> as that task can be achieved by much much simpler approaches.

> Compressing text better is at best just one small step on the path

> to implementing systems that pass the TT.

>

> David

Passing the Turing test requires:

1. Solving the problem of language learning.

2. Modeling interactive conversation.

3. Selecting the right training data.

I am only addressing problem 1.

-- Matt Mahoney

Aug 13, 2006, 3:59:15 PM8/13/06

to

Matt Mahoney wrote:

> David

>

> Passing the Turing test requires:

> 1. Solving the problem of language learning.

> 2. Modeling interactive conversation.

> 3. Selecting the right training data.

>

> I am only addressing problem 1.

>

have been based on databases. (like a Google search). Given a database

only "1" is needed to pass the Turing Test.

Essentially chatterboxes have done "2" quite adaquately.

I realize that this paragraph is one of high subjectivity (nothing like

compression!), but it is doubtful to what extent human beings really

think for themselves. Social scientists and evolutionary biologists

talk about memes, which I suppose you could describe in AI terms as

being elements of a database. The test is for a conversation similar to

that which you could have with a human and it is my experience that you

do not look for reason! Extreme Islamic statements would for example

pass T. They are of course memes.

Clearly if you have compression, or some other purely objective

criterion this does not apply. We are not in compression attempting to

mimic human behaviour. I think you are right, it is the only true test,

although I don't believe monolinual Wiki is the right answer.

Aug 13, 2006, 9:29:11 PM8/13/06

to

Ian Parker wrote:

> I think you are right, it is the only true test,

> although I don't believe monolinual Wiki is the right answer.

> I think you are right, it is the only true test,

> although I don't believe monolinual Wiki is the right answer.

Hutter's decision to go with 10e8 bytes was an uneasy, temporary

compromise. Certainly multilingual compression will be very valuable

-- particularly for my personal obsession which is applied relation

arithmetic.

Aug 14, 2006, 12:15:59 AM8/14/06

to

I had considered a multilingual corpus because it would encourage

research in learning language independent parsing rules rather than

hard coding rules into the program. For example, Chinese does not use

spaces between words (1 to 4 characters), making it a challenge to

build a dictionary. Nevertheless, it is possible to do this in English

text without spaces, starting with no knowledge. Such models might

help us understand how babies learn to parse continuous speech into

words at age 7-10 months, before they learn to speak.

http://cs.fit.edu/~mmahoney/dissertation/lex1.html

I used a monolingual corpus because multilingual ability is not a

requirement for intelligence. The problem is hard enough as it is. If

the same articles appeared in several languages, then compression would

require solving the language translation problem in addition to the

modeling problem.

-- Matt Mahoney

Aug 14, 2006, 4:57:47 AM8/14/06

to

"Ian Parker" <ianpa...@gmail.com> writes:

> [...] Clearly if you have compression, or some other purely objective

> criterion this does not apply. We are not in compression attempting to

> mimic human behaviour. I think you are right, it is the only true test,

> although I don't believe monolinual Wiki is the right answer.

The problems with compressing text as a benchmark of intelligence are

that most systems don't compress text at all, and it is feasible to

compress text very well while understanding nothing about its meaning,

i.e. being incapable of passing a simple comprehension test.

David

Aug 14, 2006, 5:10:53 AM8/14/06

to

"Matt Mahoney" <matma...@yahoo.com> writes:

> David Kinny wrote:

> > "Matt Mahoney" <matma...@yahoo.com> writes:

> > > It is irrelevant whether a Turing machine implements a language model

> > > as an algorithm or a lookup table.

> >

> > That might be true in theory, but we are concerned here with what can

> > be feasibly implemented in current or future practice, are we not?

> In practice, Turing machines don't exist.

Tell me something I don't know. That's why I keep talking about

tractable algorithms and feasibility ...

> > > Define p(s) as the probability that s will occur in a two way dialog

> > > between a human judge and a human confederate in the Turing test. s is

> > > a prefix of the entire dialog from beginning to end. p(s) is unknown,

> > > like all probabilities in real life. p(s) depends on the judge, the

> > > confederate, and when the test is given.

> >

> > Here I'm a little confused. You say that s is a prefix of the dialog,

> > and that p(s) depends on the judge, etc, and when the test is given,

> > implying that this p() is for a *specific* dialog. In that case the

> > distribution is trivial: all k possible prefixes have probability 1/k,

> > all other strings 0 (or some variant of this). Surely you wish to

> > consider the distribution over some set of dialogs, but which set?

> In real life, probabilities are always unknown. Different people will

> estimate them differently depending on what they know. For example, I

> flip a fair coin, look at it, but don't show it to you. What is the

> probability that it is heads? You say 1/2. I say 1. Both of us are

> right.

> I define p(s) as the probability that s will be a prefix of a dialog in

> a Turing test, given that you do not know who will be the judge or the

> confederate or when the test will be given.

> I realize that p() is still not well defined. But that is the nature

> of probability.

... while you keep retreating back to unknowable idealisations that

you're somehow going to produce excellent approximations to. Your p()

contains information on all conceivable TT dialogs for all (english

speaking?) humans over all time. You're going to somehow learn this

from 1G of text. I don't think so.

> > For a given p() to be useful in passing the TT, it has to adequately

> > cover possible dialogs. It's easy, as judge, to create novel dialogs.

> > For example, the following short question whose answer is obvious has

> > almost certainly never been asked before in the history of mankind.

> >

> > How many legs do 3 ants, an octopus and 4 aardvarks have in total?

> And yet, the probability that this will answered in an unknown training

> corpus will not be exactly 0.

A mere quibble. Whether it is exactly 0 or vanishingly small, you can't

learn to reliably answer such questions by text prediction based on a

manageably small training corpus, because you can't learn a sufficiently

good approximation of your idealized p(). There are simply too many

easily asked and answered questions you will never have seen before.

Your system must be capable of responding to novel questions like this,

and LM's merely learned directly from text corpora can't.

> Also, I think you will agree that this question and the answer could be

> phrased as a text prediction problem, in which case a compressor that

> can solve problems like this will compress smaller than one that

> cannot.

Surely. But that doesn't imply the converse: that a compressor that

compresses smaller than <one that cannot solve such problems>, can solve

such problems.

> > Then there's the issue of background knowledge. In a previous posting

> > you addressed the need to know such things as the date by prefixing

> > that information onto the actual dialog. How do you propose to

> > decide which background information to add in explicitly?

> If a machine responds too fast to my questions, I will know it is not

> human.

Are you serious? This is easily addressed.

> A dialog must contain information about who said what and when.

> You can represent this information how you like.

Delimiters between utterances are indeed necessary. Timestamps are not.

In any case you've missed (or are deliberately avoiding?) the point here.

You prefixed the actual dialog with non-dialog information such as the

date, presumably to prop up your argument that you could learn an LM

that could answer questions such as "How many days to Xmas?".

The issue is that you need to add a potentially huge amount of

background information that is known by humans but absent from dialogs.

OK, that seems reasonable.

> > For completeness, I'm just going to backtrack to something you wrote

> > earlier in the thread.

> >

> > > Now let q be a dialog ending with a question, and x1 and x2 be two

> > > posible answers. I think it is reasonable to assume that the following

> > > statements are equivalent (all true or all false) for all q, x1, x2:

> >

> > > 1. A human picked at random and asked question q is more likely to

> > > answer x1 than x2.

> > > 2. A human picked at random is more likely to judge the pair q,x1 as

> > > correct than q,x2.

> > > 3. A text corpus of human origin is more likely to contain q,x1 than

> > > q,x2.

> >

> > I'd agree that 1 and 2 are equivalent, but dispute the equivalence of 3.

> > The reason is that, except for a vanishingly small number of dialogs q,

> > the probability of a text corpus of human origin containing q,x1 is

> > identical to the probability of it containing q,x2 , namely 0. As I've

> > shown it is easy to create dialogs such that this is the case.

> Without full knowledge, you cannot say that any probability is 0. If

> you had full knowledge then there would be no use for probability

> theory.

The probabilities assigned to strings by your LM is determined by how

you construct the LM, not by what you or I might consider them to be.

You have yet to explain how you will create an LM by training on a

corpus that will assign non-zero probabilities to strings it has never

encountered, including all questions that might reasonably be asked.

> > > I realize that Wikipedia has a different distribution than p(s) for

> > > Turing tests. The proper training corpus for passing the Turing test

> > > would be a set of transcripts between judges and human confederates,

> > > like those at the Loebner prize website. Chat logs would also be a

> > > better choice than Wikipedia. I argue that this difference is an

> > > irrelevant detail. The main difference is that Wikipedia articles are

> > > not interactive, so a machine training on it would not learn to hold a

> > > conversation. Other than that, text prediction still requires high

> > > level language comprehension, common sense, reasoning, and problem

> > > solving abilities just like passing the Turing test does. If you can

> > > model written articles, then I think extending the model to interactive

> > > conversation would not be too hard.

> >

> > > -- Matt Mahoney

> >

> > So finally you gloss over several issues, dismissing them as irrelevant

> > details, or easy. You claim, apparently as an act of faith, that:

> >

> > - It will be feasible to learn a sufficiently good language model from a

> > tractably small corpus. This is on a par with claiming that it will

> > be feasible to learn to play championship Go just from example games.

> I believe it is possible to develop lexical, semantic, and syntactic

> models from unlabeled text containing only positive examples. Children

> seem to do this. It is probably insufficient to develop higher level

> skills like solving word problems with arithmetic. That will probably

> take some help from the programmer.

Children have inherent capabilities that permit them to do this, and much

more. Until you work out how to reproduce a sufficient subset of those

capabilities, you can't expect to replicate the achievent. You believe

that learning a language model for a suitable corpus is most of the job.

I think you, like many others, are grossly underestimating the complexity

of the problem, and looking for simple solutions to complex problems.

> A corpus consisting of articles is not sufficient to train a chatterbot

> in interactive converstation. But I argue that such a model is still

> useful, probably more so. If you want to translate articles, scan

> them, recognize them from broadcast speech, etc., you want to train on

> the type of text that it would typically encounter.

> > - Given such a model, it will be feasible to use it to pass the TT.

> Given about 1 GB of the right type of training material (dialogs), yes.

Dream on. The information content is inadequate.

> > - The difference between the language model of TT dialogs and of a slab

> > of Wikipedia is unimportant. You, the programmer, can somehow extend

> > or convert the one into the other.

> There is a lot of overlap in the learning mechanisms. The difference

> is mainly in the training material and the requirement for interaction.

How can you claim this, when you have yet to devise the learning

mechanisms?

> > - The limited extent of the corpus will not matter. Somehow your LM

> > will have the capability to generalise, extrapolate, answer novel

> > questions, incorporate all necessary background information, know

> > how to do computations, solve simple problems, remember things, ...

> I think 1 GB is the right size, as explained in

> http://cs.fit.edu/~mmahoney/compression/rationale.html

> I do not claim that this particular corpus is right for any particular

> AI problem you want to solve. You need to be careful to feed it the

> information you want it to know. But I believe for the problem of

> predicting text in a corpus, the best information to feed it is the

> corpus itself. So I have no problem with the Wikipedia text for this

> purpose.

I totally agree that to predict the text that occurs in X, train on X.

But you've still offered no convincing argument as to how this allows

your system to answer questions that don't occur in X, or even ones

that are minor variations of those that do.

> [...]

> Passing the Turing test requires:

> 1. Solving the problem of language learning.

> 2. Modeling interactive conversation.

> 3. Selecting the right training data.

> I am only addressing problem 1.

> -- Matt Mahoney

I guess we've just about reached the point of agreeing to disagree.

David

Aug 14, 2006, 6:35:28 AM8/14/06

to

>The problems with compressing text as a benchmark of intelligence are

>that most systems don't compress text at all, and it is feasible to

>compress text very well while understanding nothing about its meaning,

>i.e. being incapable of passing a simple comprehension test.

To achieve high levels of compression, particularly with multilingual

texts, understanding is required. "What do you mean by understand?".

You need an objective definition for that too. The only possible

criterion is a Logical Posivist one. What can you do? In other words.

It is clear to me too that "meaning" is a part of the information

content of text, and a compression algorithm must at some point

consider meaning. With multilinual texts this has to be considered

early on.

- Ian Parker

Aug 14, 2006, 5:19:09 PM8/14/06

to

Actually, rules are compressed knowledge. Tractability and

comprehension constant, the more unified the rules, the less space they

occupy and the more accurate they are at predicting future observations.

Aug 14, 2006, 5:34:40 PM8/14/06

to

David Kinny wrote:

> "Matt Mahoney" <matma...@yahoo.com> writes:

> > A corpus consisting of articles is not sufficient to train a chatterbot

> > in interactive converstation. But I argue that such a model is still

> > useful, probably more so. If you want to translate articles, scan

> > them, recognize them from broadcast speech, etc., you want to train on

> > the type of text that it would typically encounter.

>

> > > - Given such a model, it will be feasible to use it to pass the TT.

>

> > Given about 1 GB of the right type of training material (dialogs), yes.

>

> Dream on. The information content is inadequate.

> "Matt Mahoney" <matma...@yahoo.com> writes:

> > A corpus consisting of articles is not sufficient to train a chatterbot

> > in interactive converstation. But I argue that such a model is still

> > useful, probably more so. If you want to translate articles, scan

> > them, recognize them from broadcast speech, etc., you want to train on

> > the type of text that it would typically encounter.

>

> > > - Given such a model, it will be feasible to use it to pass the TT.

>

> > Given about 1 GB of the right type of training material (dialogs), yes.

>

> Dream on. The information content is inadequate.

OK, then, how much is adequate? I presented evidence that the storage

capacity for verbally expressable knowledge in the human brain is about

10^9 bits.

> I totally agree that to predict the text that occurs in X, train on X.

> But you've still offered no convincing argument as to how this allows

> your system to answer questions that don't occur in X, or even ones

> that are minor variations of those that do.

If I knew how to solve the language modeling problem then I would have

solved it. The fact is that nobody knows how to solve it. That is why

there is a contest. Maybe it can't be solved.

But let's not be pessimistic. I believe the Wikipedia data set has

enough information to infer general knowledge and then apply it to

novel situations occurring later in the text. Here is an example.

There is no occurrence of the string "the United States is a place" in

enwik8 (the Hutter prize data set). Yet there are 1359 occurrences of

"in the United States", 265 of "to the United States", etc. from which

this knowledge could be inferred. Knowing whether X is a place or a

person allows me to make better predictions of the text that follows.

The phrase "X said" tells me that X is probably a person, so I can

predict phrases like "X believes", "X thinks", etc.

So if a program can make such inferences, could it not compress better

than one that couldn't? Or is this not a valid test?

-- Matt Mahoney

Message has been deleted

Aug 15, 2006, 5:31:56 AM8/15/06

to

Matt,

I am interested in this idea that intelligence is equivalent to

compression. I like it. My question is, what guarantee do we have that

a given compression has any generality?

I'm guessing this is related to the assumption that the problem be

limited to a "restricted case of an environment bounded by space s and

time t."

I'm not trying to be negative by saying this. It is just I think there

is a better solution to be found by not making that assumption. Not a

better solution to the compression problem, but a better solution to

the problem of intelligence defined as compression.

-Rob Freeman

Matt Mahoney wrote:

> Ted Dunning wrote:

> > Wasn't it Fisher in the 20's who proved just about the same thing about

> > maximum likelihood estimates?

> >

> > Certainly, there is nothing new in the thesis that if you have an

> > estimate q of a probability p, then the using q as a compression model

> > gives minimum compressed size for p = q. Moreover, this minimum is

> > unique. This is the same as saying that the maximum likelihood value

> > of q is p (not quite the same as the maximum likelihood *estimator*, of

> > course).

>

> I think it was Shannon who proved this in 1948 (discete noiseless

> channel capacity theorem).

> http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf

> I could not find any reference to Fisher in his paper.

>

> > Why this is news is beyond me.

>

> That isn't what Hutter proved. What he proved is that the optimal

> behavior of a goal seeking agent in an unknown but computable

> environment is to guess that the environment is controlled by the

> shortest program (Turing machine) consistent with all behavior observed

> so far. Thus, a solution to the problem of maximizing the accumulated

> reward signal from an unknown environment reduces to algorithmic

> compression, which is an undecidable problem. However, Hutter also

> proved that in the restricted case of an envoronment bounded by space s

> and time t, that the problem is computable in time O(t2^s) (so the

> problem is still hard).

>

> When I developed the large text benchmark (without funding) my goal was

> to promote research in NLP. There are 3 arguments why text compression

> implies AI.

>

> 1. Hutter's proof in the specific case of text prediction (a problem

> that is easy for people but hard for machines).

> 2. An explicit representation of the probability distribution of

> natural language strings (a correct language model) enables one to pass

> the Turing test for AI.

> 3. Compression ratio (perplexity) is already used in practice to

> measure the quality of language models (mainly for speech recognition).

>

> However, AI does *not* imply compression. Decompression requires a

> deterministic computation to invert the compression function, and the

> human brain is not deterministic. I believe this is the reason that

> many people doubt the converse. I discuss these issues in

> http://cs.fit.edu/~mmahoney/compression/rationale.html

>

> -- Matt Mahoney

Aug 15, 2006, 3:28:41 PM8/15/06

to

Rob Freeman wrote:

> Matt,

>

> I am interested in this idea that intelligence is equivalent to

> compression. I like it. My question is, what guarantee do we have that

> a given compression has any generality?

> Matt,

>

> I am interested in this idea that intelligence is equivalent to

> compression. I like it. My question is, what guarantee do we have that

> a given compression has any generality?

There is no such thing as a perfectly general compression function.

Every compressor that decreases the size of some inputs must increase

the size of others.

Of course, most "interesting" data, like text or images, is

compressable. We could probably define "interesting" to mean that

there is a short description of it. More formally, a string of length

n bits is interesting if there is a universal Turing machine with an

encoding shorter than n bits that outputs the string. The length of

the shortest program to do this is the algorithmic complexity.

(Obviously it depends on the programming language, but only up to a

constant, because any program can be converted to another language by

appending a compiler). The problem of computing algorithmic complexity

for all strings in any given language is not computable.

Since "general" compression is not obtainable, that isn't the goal of

the Hutter prize. I expect that the best solutions will be tuned

specifically to the data set, and not be very useful as general purpose

compressors. However, to do well, these programs will have to solve

hard problems in language modeling, like how to integrate lexical,

semantic and syntactic knowledge to recognize correct sentences and

coherent paragraphs. I think the data has sufficient knowledge to do

this if we can figure out how language is learned.

> I'm guessing this is related to the assumption that the problem be

> limited to a "restricted case of an environment bounded by space s and

> time t."

These bounds reduce AIXI from an undecidable problem to an intractable

one (O(t2^s)), so it is still a theoretical result. The goal is not to

verify AIXI. It is proven, so it doesn't need to be verified

experimentally. The goal is more modest, to understand how people

learn language. I believe this problems is tractable, or else our

brains couldn't do it.

> I'm not trying to be negative by saying this. It is just I think there

> is a better solution to be found by not making that assumption. Not a

> better solution to the compression problem, but a better solution to

> the problem of intelligence defined as compression.

The Turing test is more direct. We have the Loebner prize for that.

The Turing test requires machines to have a world model and to reason

and solve problems using it. The Wikipedia benchmark is probably

insufficient to train a system for this level of abstract thought, but

I think it will bring us to a level of language modeling much higher

than we have now.

-- Matt Mahoney

Aug 15, 2006, 8:08:55 PM8/15/06

to

Matt Mahoney wrote:

>

> There is no such thing as a perfectly general compression function.

>

> There is no such thing as a perfectly general compression function.

Is this true also for language?

> The problem of computing algorithmic complexity

> for all strings in any given language is not computable.

This is probably the result I am interested in. Why is this so?

Most especially, if it is so, why are you looking for a single

compressor as the solution to your "language learning" problem, even

for a single data set?

> Since "general" compression is not obtainable, that isn't the goal of

> the Hutter prize. I expect that the best solutions will be tuned

> specifically to the data set, and not be very useful as general purpose

> compressors. However, to do well, these programs will have to solve

> hard problems in language modeling, like how to integrate lexical,

> semantic and syntactic knowledge to recognize correct sentences and

> coherent paragraphs. I think the data has sufficient knowledge to do

> this if we can figure out how language is learned.

My impression is that you are assuming the problem with "language

modeling" has been that have not known how to compress natural

language. What I am suggesting is that the problem has been we only

ever look for a single best solution.

If you agree there are many solutions, that each data set will have a

different solution, and by implication that each subset of each

data-set will have a different best solution, why should we not

optimize our compressor on, say, each sentence as that sentence is

presented to the system?

Why do we have to optimize globally for a solution to the language

modeling problem, even when we know that global solutions are not

general?

-Rob

Aug 16, 2006, 3:49:39 AM8/16/06

to

James A. Bowery wrote:

<snip>

> Indeed, Hutter's Razor** might be phrased, "It is truer to explain with

> less that which can be explained with more."

<snip>

> Named for the discoverer of the proof and the initial 50,000 Euro

> donor, the Hutter Prize currently targets the compression of a 100

> megabyte sample of human knowledge drawn from the broadly based

> Wikipedia online encyclopedia.

<snip>

<snip>

> Indeed, Hutter's Razor** might be phrased, "It is truer to explain with

> less that which can be explained with more."

<snip>

> Named for the discoverer of the proof and the initial 50,000 Euro

> donor, the Hutter Prize currently targets the compression of a 100

> megabyte sample of human knowledge drawn from the broadly based

> Wikipedia online encyclopedia.

<snip>

Certianly one is free to put one's money down on any proposition but in

this case it might be better spent on a search for truth. Real

scientists don't really use Occam's Razor except as a convienent excuse

to dismiss things they don't like. To claim it is more true to explain

less is much like adherents of certian programming languages who claim

there is no need to comment code. I would argue the general

proposition of proving less is more cannot be proven because it is

obviously false. Let's prove that. Take any brand of encyclopedia and

reduce it to the character X. There is clearly no more truth in

stating "X" than stating the entire encyclopedia. So hereafter you may

consider Hutter's Razor as presented by Bowery's Post as proven false.

Hutter may have something worthwhile that is not represented with

sufficient accuracy by Bowery's Post. So then more accuracy yeilds

more truth, not less volume, unless we're talking about Liespace where

fewer statements are always more true and silence is golden.

If more truth exists saying less, why post at all? The Wikipedia is

also a big joke of mediocrity, enforced by the moderator method of self

appointed eletist whacked in the head peer reviews who reject real

truth which can hurt for the sake of truth by committee.

So I request you get Hutter himself to post and defend himself as

Bower's post doesn't present stuff that's even defensible to a logical

person. Sometimes truth is so important it is repeated -- see the 4

gospels on Jesus Christ.

Sincerely,

Kirk Fraser

Aug 16, 2006, 1:37:03 PM8/16/06

to

Overcomer wrote:

> James A. Bowery wrote:

> > Indeed, Hutter's Razor** might be phrased, "It is truer to explain with

> > less that which can be explained with more."

> > less that which can be explained with more."

> To claim it is more true to explain

> less

> less

Why did you read "explain with less" as "explain less"?

Aug 16, 2006, 1:43:14 PM8/16/06

to

Rob Freeman wrote:

> Most especially, if it is so, why are you looking for a single

> compressor as the solution to your "language learning" problem, even

> for a single data set?

> Most especially, if it is so, why are you looking for a single

> compressor as the solution to your "language learning" problem, even

> for a single data set?

There are actually 2 things of value that can come out of the Hutter

Prize:

1) An AI compressor that works for encyclopedic knowledge.

2) More conicise enyclopedic knowledge.

2 needn't be accomplished via 1. It can also be accomplished by humans

creating formalized knowledge that is more concise.

Aug 16, 2006, 9:38:53 PM8/16/06

to

Hi James,

I think there is a lot of value that can come out of it. Don't get me

wrong. I think you are on the right track. In contrast to some of the

other messages in this thread I'm sure compression is a very good way

to see intelligence, "meaning" anyway, more narrowly, and certainly

natural language.

But I'm most interested in the limitations Matt places on that

compression. Those limitations bring you closer to what I think is the

answer than any other work I've seen.

In particular I like these statements from Matt:

"There is no such thing as a perfectly general compression function.

Every compressor that decreases the size of some inputs must increase

the size of others."

"The problem of computing algorithmic complexity for all strings in any

given language is not computable."

"Since "general" compression is not obtainable ... I expect that the

best solutions will be tuned specifically to the data set, and not be

very useful as general purpose compressors."

I think these are the right observations. I'm just trying to remind you

there are other conclusions you could draw from those observations.

Isn't the attempt to model language by finding a single best

compression function related to the assumption that there is a general

compression function for language? If we accept that general solution

does not exist, shouldn't that make us reassess our goal (a single best

solution) rather than simply the generality of that goal?

What meaning does a global solution for a compression of language have

if it is not general?

I'm just trying to nudge you in the direction of seeing there are

possibilities for modeling language other than general/global solutions

(which surely we only seek because of a traditional bias in the way we

think about language -- as parametrized by a single grammar.)

You are already on the right track when you identify compression with

language modeling, and in particular when you accept that there is no

general solution for that compression. I just want you to think about

what that means for the way you formulate the problem

If solutions are not general, why not optimize locally instead of

globally? When we understand there are no general solutions, we

understand solutions must be local in a sense. Why compromise and make

those solutions only half local by optimizing them over the biggest

data set you can find? Think about what happens as your data set gets

larger. Aren't you only approximating more and more closely a general

solution which we already know does not exist?

My suggestion is, if solutions are necessarily local, why not address

your intelligence problem, language modeling anyway, as a search for

really relevant local solutions, and forget global solutions

completely?

-Rob

Aug 16, 2006, 9:44:00 PM8/16/06

to

Rob Freeman wrote:

> Matt Mahoney wrote:

> >

> > There is no such thing as a perfectly general compression function.

>

> Is this true also for language?

> Matt Mahoney wrote:

> >

> > There is no such thing as a perfectly general compression function.

>

> Is this true also for language?

Only if you exclude some possible strings. For "normal" text where you

exclude highly unlikely byte strings (such as nonprintable characters),

then you can always compress it. But if you allow all binary strings

of n bits, then there is no function that will compress all 2^n

possibilities. That is because there are only 2^n - 1 binary strings

of length less than n. (This is the counting theorem or counting

argument).

> > The problem of computing algorithmic complexity

> > for all strings in any given language is not computable.

>

> This is probably the result I am interested in. Why is this so?

Kolmogorov gave a simple proof of this. Suppose there was a function

K(x) that outputs the length of the shortest program (in some fixed

language) that outputs x. (This number is also known as the

algorithmic complexity or Kolmogorov complexity). Then you could write

the following program to find the first string x (in some

lexicographical ordering) that cannot be compressed to n bits for some

n:

let x = ""

while K(x) <= n do

x = next(x)

output x

Now let n be the length of this program, and you have a contradiction.

Therefore K() is not computable.

Even though K(x) is not computable, it can be proven that K(x) does not

depend on the programming language used, except for a constant

independent of x. This is because if x is coded as a program in

language L1, you can code it in any other language L2 by appending a

compiler for L1 written in L2.

The motivation for using Kolmogorov complexity as opposed to some

probability model to define information content is that it removes any

debate about which model should be used. You intuitively "know" that

"00000000000000" has less information than "hvvW80f8fwl34i". Why? It

is easier to remember. It is easier to describe. It follows a

pattern. Because it has less information, it should compress smaller.

But suppose my model says that characters are independent and letters

are more common than digits. Then my compressor would compress the

second string smaller. This is not intuitive. Instead, algorithmic

complexity has a universal prior. The first string has a shorter

program that will output it in most languages. This includes English,

when I say it is easier to describe ("14 zeros" is shorter than

"hvvW80f8fwl34i").

> Most especially, if it is so, why are you looking for a single

> compressor as the solution to your "language learning" problem, even

> for a single data set?

Essentially, the "intelligence" of a model p1(x) depends on how

accurately it approximates the true but unknown natural language model

p(x) for all strings x. Shannon proved that the cross entropy E[-SUM_x

p(x) log p1(x)] is minimized when p1(x) = p(x) for all x. We don't

know p(x) but we can approximate this value by sampling x from the

unknown distribution p(x) and computing p1(x) with the model under

test. So the test depends on the assumption that WIkipedia is a fair

sample of natural language text

There is still one big problem. Probabilities are not intrinsic

properties of real world events. They depend on what you know. If you

know that you will be compressing a specific x, then your model is p(x)

= 1 and you can compress x to 0 bits. There are two ways to solve

this. One is to not reveal x. The other is to require a universal

prior, p(x) = 2^-K(x).

The first approach is widely used. If we used this approach, we could

test compressors on maybe just a few KB of text and get fairly accurate

results. But nobody could verify or repeat our results.

We used the second approach. The problem is that K(x) is a universal

model which says nothing about natural language. Our solution is to

make x big enough (we think) to learn natural language from x itself.

Whether it is possible to learn natural language at all from unlabeled

text, and whether the Wikipedia data is the right quantity or quality

to do this is controversial. I discuss these issues further in

Aug 16, 2006, 10:29:34 PM8/16/06

to

Matt Mahoney wrote:

>

> > > The problem of computing algorithmic complexity

> > > for all strings in any given language is not computable.

> >

> > This is probably the result I am interested in. Why is this so?

>

> Kolmogorov gave a simple proof of this.

>

> > > The problem of computing algorithmic complexity

> > > for all strings in any given language is not computable.

> >

> > This is probably the result I am interested in. Why is this so?

>

> Kolmogorov gave a simple proof of this.

Good, yeah. That was what I thought you meant. I'm not arguing with it,

only how we interpret it for the "language modeling" problem.

> > Most especially, if it is so, why are you looking for a single

> > compressor as the solution to your "language learning" problem, even

> > for a single data set?

>

> Essentially, the "intelligence" of a model p1(x) depends on how

> accurately it approximates the true but unknown natural language model

> p(x) for all strings x...

_The_ "true but unknown natural language model"?

Do you agree there is an assumption here? Why do we assume a "true but

unknown" language model exists at all?

What happens if there is no "true by unknown" language model. What

happens if if all you really have is text?

-Rob

Aug 17, 2006, 2:15:58 AM8/17/06

to

Why did you ask? Obviously your first quote is contained in your

second with less.

Aug 17, 2006, 2:16:58 AM8/17/06

to

Aug 17, 2006, 12:44:18 PM8/17/06

to

Rob Freeman wrote:

>

> What happens if there is no "true by unknown" language model. What

> happens if if all you really have is text?

>

mamanthal is two (well just about). If a human has no difficulty in

distinguishing between the different meanings the clear implication is

that these bits are redundant.

I still believe that the best compressor is probably the deterministic

guesser with the top guesses having a small number of bits.

You may of course believe that there is some sort of "vital force" in

human translation and understanding. I doubt though whether many of

this usergroup's readership would agree.

- Ian Parker

Aug 17, 2006, 4:22:45 PM8/17/06

to

Overcomer wrote:

> So I request you get Hutter himself to post and defend himself as

> Bower's post doesn't present stuff that's even defensible to a logical

> person. Sometimes truth is so important it is repeated -- see the 4

> gospels on Jesus Christ.

>

> Sincerely,

> Kirk Fraser

> So I request you get Hutter himself to post and defend himself as

> Bower's post doesn't present stuff that's even defensible to a logical

> person. Sometimes truth is so important it is repeated -- see the 4

> gospels on Jesus Christ.

>

> Sincerely,

> Kirk Fraser

Marcus Hutter's FAQ answers most of your questions.

http://prize.hutter1.net/prizefaq.htm

-- Matt Mahoney

Aug 17, 2006, 4:40:23 PM8/17/06

to

I think Hutter still has to post on multilingualism. Why English? This

is one thing he does not cover. From all the other things he says I

would have thought multilingualism a natural part of what he is trying

to do.

- Ian Parker

Aug 17, 2006, 7:44:29 PM8/17/06

to

I discuss this in in

http://www.cs.fit.edu/~mmahoney/compression/rationale.html

A multilingual model is sufficient but not necessary to demonstrate AI.

The Turing test is done in one language. If a machine could answer

questions in several languages, then that's nice, but if it only speaks

one language, then it still passes. So why make the problem harder

than it needs to be?

Also, I eventually want to measure the entropy of the data set using

methods like the Shannon game. If it is multilingual, then I need to

find subjects that speak each of the languages, and then I would still

have the wrong answer because there would be translations of articles

containing the same information. To measure the entropy accurately I

would need to find multilingual subjects and present an article in one

language as context while having them predict in the other. This

complication is easily avoided by using just one language.

-- Matt Mahoney

Aug 17, 2006, 7:52:05 PM8/17/06

to

Ian Parker wrote:

> Rob Freeman wrote:

> >

> > What happens if there is no "true by unknown" language model. What

> > happens if if all you really have is text?

>

> ...> Rob Freeman wrote:

> >

> > What happens if there is no "true by unknown" language model. What

> > happens if if all you really have is text?

>

>

> You may of course believe that there is some sort of "vital force" in

> human translation and understanding. I doubt though whether many of

> this usergroup's readership would agree.

Are our imaginations really so limited that the choices come down to

"language model" or "vital force"?

Ian, on the basis of past messages I know you believe language

understanding must be based on "relaxation".

The existence of a language model would mean language could be

parametrized by rules.

By saying language understanding must be based on relaxation you are

really also saying a general language model does not exist.

Your very own "relaxation" could be a third alternative to that

mysterious "vital force".

Relaxation is just ad-hoc compression. Which is consistent with the

model these guys are suggesting, in so far that it is based on a proof

all you need for useful intelligence is compression (leaving rules to

be an added assumption, an assumption inconsistent with a proof that a

general solution is impossible, in point of fact.)

-Rob

Aug 18, 2006, 6:59:41 AM8/18/06

to

way. Speech has to be based on relaxation as it is initially ill

defined.

We may be talking slightly at cross purposes.

"El barco attravesta una cerradura" and "La estacion de ressorte" are

examples of GROSS errors in computer translation. If we were to get to

a stage were we could say "In Castile it means this, in Latin America

it means something similar but subtely different" we would have

advanced well beyond the current state of the art. If the differences

in meaning between Spain and Mexico were to become important - in other

words gross errors had been eliminated, VoIP would be virtually error

free.

Also I don't believe that in human/human communication all shades of

meaning are conveyed. Matt Mahoney says, and I agree with this, that

anything the human brain can do a computer can do. True there may be a

spiritual dimension to our lives, but this is certainly not relevant to

basic linguistic understanding.

Aug 18, 2006, 7:11:48 AM8/18/06

to

Matt Mahoney wrote:

> Ian Parker wrote:

> A multilingual model is sufficient but not necessary to demonstrate AI.

> The Turing test is done in one language. If a machine could answer

> questions in several languages, then that's nice, but if it only speaks

> one language, then it still passes. So why make the problem harder

> than it needs to be?

>

As I have said multilinualism indicates a Euclidean position.

Primavera, ressote and mamanthal are overt in Spanish but are latent in

English. Latent Semanic Analysis should be providing the correct

positions.

> Also, I eventually want to measure the entropy of the data set using

> methods like the Shannon game. If it is multilingual, then I need to

> find subjects that speak each of the languages, and then I would still

> have the wrong answer because there would be translations of articles

> containing the same information. To measure the entropy accurately I

> would need to find multilingual subjects and present an article in one

> language as context while having them predict in the other. This

> complication is easily avoided by using just one language.

>

To me entropy reflects the ability to guess the next word. Here is

something to think about - Does the ability to measure entropy, imply

you can compress. I proposed a compressor which lised the words in a

language in order of probability at a given point. Yes it does!

The same thing is true with translations and multilingualism. What you

are doing here is second guessing the the original translator.

BTW - Entropy can change on translation. An example of this is the

adjective noun order in Spanish. In French tghe academie francaise has

standardized the order. In French "grand" comes before the noun. In

Spanish the position is optional. You have the "Rio Grande" but in

other places the French order is followed.

I was just saying that a translation forces AI quickly as opposed to

the present compressors which do not give any deep insights.

Aug 20, 2006, 7:17:32 PM8/20/06

to

Ian Parker wrote:

> Matt Mahoney wrote:

> > Ian Parker wrote:

> > A multilingual model is sufficient but not necessary to demonstrate AI.

> > The Turing test is done in one language. If a machine could answer

> > questions in several languages, then that's nice, but if it only speaks

> > one language, then it still passes. So why make the problem harder

> > than it needs to be?

> >

>

> As I have said multilinualism indicates a Euclidean position.

> Primavera, ressote and mamanthal are overt in Spanish but are latent in

> English. Latent Semanic Analysis should be providing the correct

> positions.

> Matt Mahoney wrote:

> > Ian Parker wrote:

> > A multilingual model is sufficient but not necessary to demonstrate AI.

> > The Turing test is done in one language. If a machine could answer

> > questions in several languages, then that's nice, but if it only speaks

> > one language, then it still passes. So why make the problem harder

> > than it needs to be?

> >

>

> As I have said multilinualism indicates a Euclidean position.

> Primavera, ressote and mamanthal are overt in Spanish but are latent in

> English. Latent Semanic Analysis should be providing the correct

> positions.

I think it is possible to demonstrate word sense disambiguation in a

monolingual model. For example, "bank" would not predict "interest" in

context "river".

> > Also, I eventually want to measure the entropy of the data set using

> > methods like the Shannon game. If it is multilingual, then I need to

> > find subjects that speak each of the languages, and then I would still

> > have the wrong answer because there would be translations of articles

> > containing the same information. To measure the entropy accurately I

> > would need to find multilingual subjects and present an article in one

> > language as context while having them predict in the other. This

> > complication is easily avoided by using just one language.

> >

> To me entropy reflects the ability to guess the next word. Here is

> something to think about - Does the ability to measure entropy, imply

> you can compress. I proposed a compressor which lised the words in a

> language in order of probability at a given point. Yes it does!

I should have said "estimate" the entropy of a human model, not

"measure". The ability to measure entropy not only implies the ability

to compress (because entropy is a function of the model), but also

knowledge of an ideal model, which is not computable for the general

case.

-- Matt Mahoney

Aug 21, 2006, 1:11:24 AM8/21/06

to

Matt Mahoney wrote:

>

> I should have said "estimate" the entropy of a human model, not

> "measure". The ability to measure entropy not only implies the ability

> to compress (because entropy is a function of the model), but also

> knowledge of an ideal model, which is not computable for the general

> case.

>

> I should have said "estimate" the entropy of a human model, not

> "measure". The ability to measure entropy not only implies the ability

> to compress (because entropy is a function of the model), but also

> knowledge of an ideal model, which is not computable for the general

> case.

Matt, even if you change "measure" to "estimate", surely you still need

to explain how it is meaningful to try and compute something which it

has been proven is not computable.

-Rob

Aug 21, 2006, 6:59:49 AM8/21/06

to

Matt Mahoney wrote:

> But let's not be pessimistic. I believe the Wikipedia data set has

> enough information to infer general knowledge and then apply it to

> novel situations occurring later in the text.

> So if a program can make such inferences, could it not compress better

> than one that couldn't? Or is this not a valid test?

Using your own Occam's Razor reference, the smallest size any data set

can be compressed to would be with a language model that perfectly

models the data *and nothing else*. Anything more would be redundant

and unnecessary to model the data. It would be massively overfitted to

the data in question.

Necessarily, compressing the data more will result in a loss of ability

to predict things not contained within the data.

Aug 21, 2006, 4:23:19 PM8/21/06

to

Matt Mahoney wrote:

> > As I have said multilinualism indicates a Euclidean position.

> > Primavera, ressote and mamanthal are overt in Spanish but are latent in

> > English. Latent Semanic Analysis should be providing the correct

> > positions.

>

> I think it is possible to demonstrate word sense disambiguation in a

> monolingual model. For example, "bank" would not predict "interest" in

> context "river".

>

this. I could talk about fishing on the bank of a river(sp orilla,

rivera) for hours. If it taked about interest rates I could well

conclude that it was making some kind of joke. Spanish tells me

instantly what the level of understanding is. I could find out in pure

English but it would take a lot longer.

Certaily if it cannot understand that you don't fish for €50,000 it

will be unlikely to compress to the Hutter level.. Herr Hutter has set

the compression level to be just beyond was is possible with fractal

type compressors. €50,000 can only be won with AI. The problem really

comes with the lesser prizes. I would be sorry to see €10,000 going

to a slighly more efficient fractal.

Matt, you are on the committee. What precisely are the criteria for the

lesser prizes and how much discretion do the committee have? The top

prize is or course awarded according to a narrowly defined criterion.

>

> I should have said "estimate" the entropy of a human model, not

> "measure". The ability to measure entropy not only implies the ability

> to compress (because entropy is a function of the model), but also

> knowledge of an ideal model, which is not computable for the general

> case.

>

If you ESTIMATE but do not MEASURE entropy it is subjective.

- Ian Parker

Aug 21, 2006, 7:08:00 PM8/21/06

to

Let's not confuse modeling with algorithmic complexity. The ideal

model of string x is p(x) = 1. This would compress to size 0. Writing

such a compressor is trivial,e.g.

http://cs.fit.edu/~mmahoney/compression/barf.html

But you can't neglect the size of the decompressor. Using algorithmic

complexity does not allow it. Probability no longer depends on what

you know.

-- Matt Mahoney

Aug 21, 2006, 7:27:01 PM8/21/06

to

The goal is to estimate the entropy of a human language model. Given a

model p1(X) over random variable X with true but unknown distribution

X, one can estimate the cross entropy E[-log p1(x)] by averaging over a

large sampling of x from X. So if p = p1, then we are estimating the

entropy of X, defined as E[-log p(x)]. In the Shannon game, p is the

distribution of text in the corpus, and p1 is the model of the test

subject. We presume that these models are the same (or close) since

they are both of human origin.

The problem with is that it is difficult to have humans explicitly

state p1(x). Rather, it is conceptually easier for people to rank

p1(x1) > p1(x2), e.g. x1 is more likely than x2. So the Shannon game

has people guess successive characters in text. But this only places

course bounds on the result (0.6 to 1.3 bits per character for a 27

character alphabet of A-Z and space).

One possible improvement might be a form of human assisted computer

modeling, similar to what Shannon did by allowing subjects to use

1,2,3-gram and word frequency tables. At each decision point, a

program could output its next guess as e.g. a pie chart based on an

n-gram model, and the subject could make adjustments based on high

level knowledge not available to the computer. This could be made into

a game, where the subjects wins points for better compression. I think

this type of test should provide a tighter upper bound.

-- Matt Mahoney

Aug 21, 2006, 7:34:50 PM8/21/06

to

Ian Parker wrote:

> Certaily if it cannot understand that you don't fish for €50,000 it

> will be unlikely to compress to the Hutter level.. Herr Hutter has set

> the compression level to be just beyond was is possible with fractal

> type compressors. €50,000 can only be won with AI. The problem really

> comes with the lesser prizes. I would be sorry to see €10,000 going

> to a slighly more efficient fractal.

> Certaily if it cannot understand that you don't fish for €50,000 it

> will be unlikely to compress to the Hutter level.. Herr Hutter has set

> the compression level to be just beyond was is possible with fractal

> type compressors. €50,000 can only be won with AI. The problem really

> comes with the lesser prizes. I would be sorry to see €10,000 going

> to a slighly more efficient fractal.

I'm not aware of fractals being useful for text compression. AFAIK

they are only useful in lossy image compression to encode textures.

> Matt, you are on the committee. What precisely are the criteria for the

> lesser prizes and how much discretion do the committee have? The top

> prize is or course awarded according to a narrowly defined criterion.

Each 1% improvement in compression ratio of enwik8 (100 MB of Wikipedia

text) wins 500 euros. The minimum is 1%.

> > I should have said "estimate" the entropy of a human model, not

> > "measure". The ability to measure entropy not only implies the ability

> > to compress (because entropy is a function of the model), but also

> > knowledge of an ideal model, which is not computable for the general

> > case.

> >

> If you ESTIMATE but do not MEASURE entropy it is subjective.

>

> - Ian Parker

Any measurement is also an estimate, but not all estimates are

measurements. I would like to estimate the entropy of a human model

using the text in enwik8 so that we can estimate what level of

compression would achieve AI.

-- Matt Mahoney

Aug 21, 2006, 7:41:46 PM8/21/06