Google Groups Home
Help | Sign in
The C-Prize
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  19 messages - Collapse all
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
jim_bow...@hotmail.com  
View profile
 More options Jun 15 2005, 11:01 am
Newsgroups: comp.compression
From: jim_bow...@hotmail.com
Date: 15 Jun 2005 08:01:39 -0700
Local: Wed, Jun 15 2005 11:01 am
Subject: The C-Prize
What are some of the potential practical problems with running the
following proposed prize competition?

PS: Many people disagree with the posited definition of "artificial
intelligence quality", but it seems obvious to me.  I'm not sure why
its not obvious to everyone but such is the case.  Any suggestions on
how to get this across more effectively?  Or are we dealing with
different cognitive types -- some will simply 'get it' and others
won't?

http://www.geocities.com/jim_bowery/cprize.html

The C-Prize

The most crucial technology prize of all.
By Jim Bowery
Copyright May 2005
The author grants the right to copy and distribute without
modification.

Since all technology prize awards are geared toward solving crucial
problems, the most crucial technology prize award of them all would be
one that solves the rest of them:

The C-Prize -- A prize that solves the artificial intelligence problem.

The C-Prize award criterion is as follows:

Let anyone submit a program that produces, with no inputs, one of the
major natural language corpora as output.

S = size of uncompressed corpus
P = size of program outputting the uncompressed corpus
R = S/P (the compression ratio).

Award monies in a manner similar to the M-Prize:

Previous record ratio: R0
New record ratio: R1=R0+X
Fund contains: $Z at noon GMT on day of new record
Winner receives: $Z * (X/(R0+X))

Compression program and decompression program are made open source.

Explanation

A very severe meta-problem with artificial intelligence is the question
of how one can define the quality of an artificial intelligence.

Fortunately there is an objective technique for ranking the quality of
artificial intelligence:

Kolmogorov Complexity

Kolmogorov Complexity is a mathematically precise formulation of
Ockham's Razor, which basically just says "Don't over-simplify or
over-complicate things." More formally, the Kolmogorov Complexity of a
given bit string is the minimum size of a Turing machine program
required to output, with no inputs, the given bit string.

Any set of programs which purport to be the standards of artificial
intelligence can be compared by simply comparing their Artificial
Intelligence Quality. Their AIQs can be precisely measured as follows:

Take an arbitrarily large corpus of writings sampled from the world
wide web. This corpus will establish the equivalent of an IQ test. Give
the AIs the task of compressing this corpus into the smallest
representation. This representation must be a program that, taking no
outside inputs, produces the exact sample it compressed. The AIQ of an
AI is simply the ratio of the size of the uncompressed writings to the
size of the program that, when executed, produces the uncompressed
writings.

In other words, the AIQ is the compression ratio achieved by the AI on
the AIQ test.

The reason this works as an AI quality test is that compression
requires predictive modeling. If you can predict what someone is going
to say, you have modeled their mental processes and by inference have a
superset of their mental faculties.

Mechanics

The C-Prize is to be modeled after the Methusela Mouse Prize or M-Prize
where people make pledges of money to the prize fund. If you would like
to help with the set up and/or administration of this prize award
similar to the M-Prize let me know by email. 1


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
jackokr...@yahoo.com  
View profile
 More options Jun 15 2005, 7:26 pm
Newsgroups: comp.compression
From: jackokr...@yahoo.com
Date: 15 Jun 2005 16:26:58 -0700
Local: Wed, Jun 15 2005 7:26 pm
Subject: Re: The C-Prize
it comes down to the money thing again but from a different source.
wouldn't it also pressure later discovery due to a higher price above
inflation?

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
jim_bow...@hotmail.com  
View profile
 More options Jun 16 2005, 3:33 am
Newsgroups: comp.compression
From: jim_bow...@hotmail.com
Date: 16 Jun 2005 00:33:03 -0700
Local: Thurs, Jun 16 2005 3:33 am
Subject: Re: The C-Prize
"The money thing" can be very volatile.  For example, in 1995 I put up
$1000 of my own money for the first amateur launch to reach 200km:

http://www.geocities.com/jim_bowery/bafar.html

2 years later someone with a lot more money had put up a quarter
million dollars for the same criterion (but with a time limit):

http://groups-beta.google.com/group/sci.space.tech/browse_frm/thread/...
aka
http://tinyurl.com/bo7py

Then some guys in St. Louis started the ball rolling on the X-Prize,
with a bit different rules, which eventually paid out $10 million
dollars.

A similar story has happened with the M-Prize which is now quite
active.  Indeed the money is rolling in:

http://mprize.org/index.php?pagename=releases

Prize legislation I drafted for fusion energy technology back in 1992
didn't get funded but it did pick up the support of one of the founders
of the Tokamak program:

http://www.geocities.com/jim_bowery/BussardsLetter.html

Prize awards for technology milestones are gaining credibility these
days and if people "get it" the C-Prize could take off in an even
bigger way than the rest of them.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Matt Mahoney  
View profile
 More options Jun 17 2005, 12:38 am
Newsgroups: comp.compression
From: "Matt Mahoney" <matmaho...@yahoo.com>
Date: 16 Jun 2005 21:38:35 -0700
Local: Fri, Jun 17 2005 12:38 am
Subject: Re: The C-Prize

jim_bow...@hotmail.com wrote:
> What are some of the potential practical problems with running the
> following proposed prize competition?

> PS: Many people disagree with the posited definition of "artificial
> intelligence quality", but it seems obvious to me.  I'm not sure why
> its not obvious to everyone but such is the case.  Any suggestions on
> how to get this across more effectively?  Or are we dealing with
> different cognitive types -- some will simply 'get it' and others
> won't?

> http://www.geocities.com/jim_bowery/cprize.html

> The C-Prize

Very interesting.  There is a similar contest that has been going on
for 9 years.  It might help you think about the details.
http://mailcom.com/challenge/

Have you picked out a text corpus?  I think this is a good approach to
the AI problem, very objective, and easier to administer than the
Loebner prize.

Last year I once proposed almost exactly the same thing to NSF with a
$50K prize, but they would not fund it.

The connection between text compression and AI is not obvious, but
there is a simple proof that it solves the Turing test.  The
compression problem is to code text strings x with length log2 1/P(x)
bits.  But if you know the probability distribution P(x) (specifically
the distribution over chat sessions) then you could build a machine
that could answer questions such that it is indistinguishable from
human.  Given a question q, you generate response r with distribution
P(r|q) = P(qr)/P(q).  You can generalize q to any sequence of questions
and responses up to the last question.

I am glossing over some assumptions that need to be more clearly
stated.  For example, I am assuming that the interrogator, machine, and
the machine's human opponent know the same distribution P(x), and that
people use the same distribution for writing text as they do for
recognition.  I think these are reasonable approximations because the
machine only has to be close enough to fool the interrogator, not an
exact match to the "average" human.

Neither AI nor text compression is solved.  In 1950 Shannon estimated
the entropy of written English to be about 1 bit per character (based
on how well humans can guess successive letters), but the best text
compressors achieve about 1.2-1.3 bpc.  These are language models for
speech recognition that use a combination of word trigram models and
semantic models based on word proximity.  I did some of my dissertation
work in this area before I switched to a less interesting topic where I
could get funding.  http://cs.fit.edu/~mmahoney/dissertation/

-- Matt Mahoney


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
jim_bow...@hotmail.com  
View profile
 More options Jun 17 2005, 9:16 am
Newsgroups: comp.compression
From: jim_bow...@hotmail.com
Date: 17 Jun 2005 06:16:31 -0700
Local: Fri, Jun 17 2005 9:16 am
Subject: Re: The C-Prize
Right.  I thought I'd ask here before bothering Leonid Broukhis since
people often have ideas about how to exploit weaknesses in contest
rules that they are willing to share.  I'm sure his experience will
turn out to be invaluable.

I ran across your poster session paper  "Text Compression as a Test for
Artificial Intelligence" which is one of the things that inspired me to
look into the prize award approach.  My original motivation is my
rather strong sense that Ockham's Razor is more than a nice rule of
thumb for scientists.

Your proof is elegant, simple and convincing however people still don't
"get it".  When people start arguing about intelligence tests they can
get pretty squirrley and I think the Turing test is a poison that has
impeded thought about artificial intelligence more than it has advanced
it.  Your paper is crucial to overcoming that handicap.

However, since the 1960s there have been other arguments advanced,
primarily by the algorithmic complexity guys, that should have taken
root in the AI community but didn't.  I don't understand exactly why
that intellectual failure occurred in the AI community but it most
obviously did.

As for the text corpus I would choose:

That was one of the main questions I wanted to discuss because it is so
crucial to the success of such a prize.  Right now if I had to choose a
single text corpus I'd pick the 1 terabyte corpus that Peter Turney
used in his recent accomplishment of human level performance on the SAT
verbal analogies test -- a feat that is particularly interesting given
the exceptionally high correlation between that test and general
intelligence aka the 'g' factor.

In this regard, you may be interested in my recently published article:

"AI Breakthrough or the Mismeasure of Machine?"

http://www.kuro5hin.org/story/2005/5/26/192639/466

I'm thinking of writing another article on the C-Prize.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Matt Mahoney  
View profile
 More options Jun 17 2005, 3:13 pm
Newsgroups: comp.compression
From: "Matt Mahoney" <matmaho...@yahoo.com>
Date: 17 Jun 2005 12:13:08 -0700
Local: Fri, Jun 17 2005 3:13 pm
Subject: Re: The C-Prize

It would be hard to conduct a test with a 1 TB corpus.  Few people have
enough computing power to do this, and I am sure you don't want to host
the corpus on your website.

I would suggest something on the order of 1 GB.  Turing estimated in
1950 that the AI problem would take 10^9 bits of memory.  He did not
say how he came up with this number.  But consider a human exposed to
150 words per minute, 12 hours a day for 20 years.  This is about 4 GB
of text, or a 1 GB zip file.  So a machine learning algorithm should
have a similar constraint.

Even this may be too high.  The average human vocabulary is 30K common
words plus an equal number of proper nouns.  To learn a word you have
to be exposed to it about 20 times.  By Zipf's law, the n'th most
frequent word in English is 0.1/n.  This means that you need about 12
million words or 50 MB.

Semantic models using LSA have used about 250 MB, like the WSJ corpus.
However, with a larger corpus there is a tendency to get lazy.  For
example, with LSA you use the fuzzy equivalence A ~ B, meaning "A has
similar meaning to B" or "A and B appear in the same proximity".

A ~ A (reflexive, used by most data compressors)
A ~ B implies B ~ A (symmetric, used in distant bigram models)
A ~ B and B ~ C implies A ~ C (transitive, used by LSA)

LSA uses the transitive property to predict that C follows A even if
those two words were never seen together before.  However with a larger
corpus you can collect the statistics directly rather than use LSA to
infer them, so you tend take the easy route.  In the above paper on
word analogy solving, I see a lot of steps were data was discarded, but
for a 1 TB corpus it didn't matter.

You might be able to construct a high quality (diverse) corpus by
finding the minimum set of documents that contain all the words in a
dictionary.  You might also consider a corpus of several languages to
test whether a model can learn a generic language without hardcoded
rules for English.

-- Matt Mahoney


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
jim_bow...@hotmail.com  
View profile
 More options Jun 17 2005, 6:30 pm
Newsgroups: comp.compression
From: jim_bow...@hotmail.com
Date: 17 Jun 2005 15:30:41 -0700
Local: Fri, Jun 17 2005 6:30 pm
Subject: Re: The C-Prize
Well obviously I was thinking too big for an early contest.

Perhaps Wikipedia would suffice:

http://download.wikimedia.org/#wikipedia

The current english language edition, gzipped, is about 1GB, but as you
see, there are many Wikipedias in different languages.  I doubt they
total over 4GB gzipped.

Compressing multiple languages would have a very good side effect:
Machine translation technology could be dramatically enhanced and
properly constructed compressors would discover things about the
various cultures that might not otherwise be obvious.

Part of my preference for picking a big corpus to begin with is so that
if the contest goes big time -- which it would if proper funding levels
were applied to it -- some sort of rules might be constructed to allow
some divisions of the contest for large computer systems to crunch away
doing more brute-force statistics such as Turney used, while other
divisions could do a fairly sampled subset of the total corpus.  But
this is something we can deal with later when the competition (and
prize) gets much bigger.

PS:  Another reason I'd like to see huge corpora ultimately is that I'd
really like to see this system beat humans by a substantial amount:
We're so stupid we desperately need all the help we can get.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Matt Mahoney  
View profile
 More options Jun 17 2005, 10:51 pm
Newsgroups: comp.compression
From: "Matt Mahoney" <matmaho...@yahoo.com>
Date: 17 Jun 2005 19:51:02 -0700
Local: Fri, Jun 17 2005 10:51 pm
Subject: Re: The C-Prize

jim_bow...@hotmail.com wrote:
> Well obviously I was thinking too big for an early contest.

> Perhaps Wikipedia would suffice:

> http://download.wikimedia.org/#wikipedia

> The current english language edition, gzipped, is about 1GB, but as you
> see, there are many Wikipedias in different languages.  I doubt they
> total over 4GB gzipped.

Nice choice.  Lots of contributors, lots of topics, high quality.

> PS:  Another reason I'd like to see huge corpora ultimately is that I'd
> really like to see this system beat humans by a substantial amount:
> We're so stupid we desperately need all the help we can get.

When computers are smarter than humans, how will we know?

-- Matt Mahoney


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Matt Mahoney  
View profile
(1 user)  More options Jun 17 2005, 11:18 pm
Newsgroups: comp.compression
From: "Matt Mahoney" <matmaho...@yahoo.com>
Date: 17 Jun 2005 20:18:59 -0700
Local: Fri, Jun 17 2005 11:18 pm
Subject: Re: The C-Prize
Hutter's AIXI, http://www.idsia.ch/~marcus/ai/paixi.htm makes another
argument for the connection between compression and AI that is more
general than the Turing test.  He proves that the optimal behavior of
an agent (an interactive system that receives a reward signal from an
unknown environment) is to guess that the environement is most likely
computed by the shortest possible program that is consistent with the
behavior observed so far. In other words, the most likely outcome for
any experiment is the one with the simplest explanation, where
"simplest" means the smallest program that could model what you
currently know about the universe.

He gives a formal proof, but it basically says that the only possible
distribution of the infinite set of programs (or strings) with nonzero
probability is one which favors shorter programs over longer ones.
Given any string of length n with probability p > 0, there are an
infinite set of strings longer than n, but only a finite number of
these can have probability higher than p.

-- Matt Mahoney


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
jim_bow...@hotmail.com  
View profile
 More options Jun 18 2005, 3:15 pm
Newsgroups: comp.compression
From: jim_bow...@hotmail.com
Date: 18 Jun 2005 12:15:47 -0700
Local: Sat, Jun 18 2005 3:15 pm
Subject: Re: The C-Prize

Matt Mahoney writes:
> He proves... the most likely outcome for
> any experiment is the one with the simplest explanation, where
> "simplest" means the smallest program that could model what you
> currently know about the universe.

Wow.

If what you say is easily verifiable by consulting mathematicians then
you've just dramatically strengthened the case for the C-Prize.  A
philanthropist can have a trusted consultant verify Hutter's proof and
the C-Prize's salience to it.  The only remaining question is whether
the C-Prize is properly administered -- something the philanthropist
can verify with a combination of business and engineering consultants.

I had run across Hutter's AIXI paper before, when searching for
Solomonoff's algorithmic probability theory, but I read just the
abstract and didn't have the background for it so I sent it off to a
friend who is a doctoral candidate in statistical mechanics to review.
Being a doctoral candidate he doesn't have much time left to review
papers ourside his area so I never got the low-down you just gave.
Thanks for the clear synopsis.

A proof like this should have been available to researchers from the
1960s work on minimum length description (AP, KC, etc.) and decision
theory but apparently no one bothered to put 2 and 2 together for what
-- 40 years now?

This is going to be a major question for science historians:

Why did the AI community fail to formalize this proof for so long?

Given Moore's Law originated at about that time, 40 years ago, there's
a good argument that this ranks as one of the greatest losses of
potential to befall humanity.


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
jim_bow...@hotmail.com  
View profile
 More options Jun 18 2005, 5:28 pm
Newsgroups: comp.compression
From: jim_bow...@hotmail.com
Date: 18 Jun 2005 14:28:35 -0700
Local: Sat, Jun 18 2005 5:28 pm
Subject: Re: The C-Prize

Matt Mahoney writes:
> Nice choice.  Lots of contributors, lots of topics, high quality.

Here are the drawbacks I see to using the "cur" download of Wikipedia:

1) The purported "neutral point of view" is subject to systemic bias.
The process supported by Wikipedia biases the content toward people who
are able and willing to contribute under those circumstances.  It's not
clear how to go about identifying this bias let alone its impact.

2) If a snapshot of the "cur" downloads is delayed, it will be subject
to gaming by people who submit changes to articles that have a
Kolmogorov complexity that is known to be low just to the gamers.
Wikipedia keeps prior archives of the 'cur' downloads and it is
unlikely anyone has gamed the system so far but now that we've broached
the subject there is the potential that future versions of the 'cur'
downloads will have been so-gamed.

3) The entire history of edits is about a factor of 10 larger than the
current version.  This would be a superior corpus for the purpose of
ferreting out how various points of view bias content -- and be very
relevant to epistemology, social and political sciences -- thereby
creating a superior AI capable of considering the source of various
assertions.  If this really is beyond the capacity of computers that
would be available to viable contestants then it might be necessary to
defer use of that larger corpus until more capacity or more funding for
the prize is available.

For a discussion of the downloads see:

http://en.wikipedia.org/wiki/Wikipedia:Database_download#Weekly_datab...

> When computers are smarter than humans, how will we know?

Given Hutter's proof my guess is we'll know when they are capable of
more compression than humans given a comparable base of knowledge.
There is always the question of how do we know how much compression
humans are capable of. I don't have that question answered.

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.