Message from discussion The C-Prize
Matt Mahoney  
Newsgroups: comp.compression
From: "Matt Mahoney" <matmaho...@yahoo.com>
Date: 17 Jun 2005 12:13:08 -0700
Local: Fri, Jun 17 2005 3:13 pm
Subject: Re: The C-Prize

jim_bow...@hotmail.com wrote:
> As for the text corpus I would choose:

> That was one of the main questions I wanted to discuss because it is so
> crucial to the success of such a prize.  Right now if I had to choose a
> single text corpus I'd pick the 1 terabyte corpus that Peter Turney
> used in his recent accomplishment of human level performance on the SAT
> verbal analogies test -- a feat that is particularly interesting given
> the exceptionally high correlation between that test and general
> intelligence aka the 'g' factor.

> In this regard, you may be interested in my recently published article:

> "AI Breakthrough or the Mismeasure of Machine?"

> http://www.kuro5hin.org/story/2005/5/26/192639/466

> I'm thinking of writing another article on the C-Prize.

It would be hard to conduct a test with a 1 TB corpus.  Few people have
enough computing power to do this, and I am sure you don't want to host
the corpus on your website.

I would suggest something on the order of 1 GB.  Turing estimated in
1950 that the AI problem would take 10^9 bits of memory.  He did not
say how he came up with this number.  But consider a human exposed to
150 words per minute, 12 hours a day for 20 years.  This is about 4 GB
of text, or a 1 GB zip file.  So a machine learning algorithm should
have a similar constraint.
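
As a rough check of the exposure arithmetic above, here is a minimal
Python sketch.  The figure of 5 bytes per word (an average word plus a
space) is an assumption, not something stated in the post, and the
function names are made up for illustration.

```python
# Back-of-the-envelope check of the lifetime text exposure estimate.
WORDS_PER_MINUTE = 150
HOURS_PER_DAY = 12
YEARS = 20
BYTES_PER_WORD = 5  # assumption: average English word plus one space


def lifetime_words(wpm=WORDS_PER_MINUTE, hours=HOURS_PER_DAY, years=YEARS):
    """Total words heard or read over the given period."""
    return wpm * 60 * hours * 365 * years


def lifetime_bytes(bytes_per_word=BYTES_PER_WORD):
    """Size of that exposure as uncompressed text."""
    return lifetime_words() * bytes_per_word


if __name__ == "__main__":
    print(f"{lifetime_words():,} words")
    print(f"{lifetime_bytes() / 1e9:.1f} GB uncompressed")
```

Under that assumption the total is a little under 800 million words, or
roughly 4 GB; the 1 GB zip figure then corresponds to about 4:1
compression.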

Even this may be too high.  The average human vocabulary is about 30K
common words plus an equal number of proper nouns.  To learn a word you
have to be exposed to it about 20 times.  By Zipf's law, the frequency
of the n'th most frequent word in English is about 0.1/n, so the rarest
of the 60K words occurs about once in every 600,000 words.  Seeing it 20
times therefore takes about 12 million words, or 50 MB.
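
The Zipf estimate above can be reproduced with a short sketch.  It
counts expected sightings only, so it is a rough lower bound rather than
a guarantee.  The 4 bytes per word is an assumption chosen to match the
12 million words / 50 MB ratio, and the names are illustrative.

```python
VOCABULARY = 60_000   # 30K common words plus 30K proper nouns
EXPOSURES = 20        # sightings needed to learn a word
ZIPF_CONSTANT = 0.1   # frequency of the n'th most frequent word ~ 0.1 / n
BYTES_PER_WORD = 4    # assumption implied by 12 million words ~ 50 MB


def words_needed(vocab=VOCABULARY, exposures=EXPOSURES, c=ZIPF_CONSTANT):
    """Corpus length at which the rarest word is expected `exposures` times."""
    # The rarest word has frequency c / vocab, so we need
    # exposures / (c / vocab) = exposures * vocab / c words.
    return round(exposures * vocab / c)


def corpus_bytes(bytes_per_word=BYTES_PER_WORD):
    """Approximate size of that corpus as plain text."""
    return words_needed() * bytes_per_word


if __name__ == "__main__":
    print(f"{words_needed():,} words")
    print(f"{corpus_bytes() / 1e6:.0f} MB")
```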

Semantic models using LSA have used corpora of about 250 MB, such as the
WSJ corpus.  However, with a larger corpus there is a tendency to get
lazy.  For
example, with LSA you use the fuzzy equivalence A ~ B, meaning "A has
similar meaning to B" or "A and B appear in the same proximity".

A ~ A (reflexive, used by most data compressors)
A ~ B implies B ~ A (symmetric, used in distant bigram models)
A ~ B and B ~ C implies A ~ C (transitive, used by LSA)

LSA uses the transitive property to predict that C follows A even if
those two words were never seen together before.  However, with a larger
corpus you can collect the statistics directly rather than use LSA to
infer them, so you tend to take the easy route.  In the above paper on
word analogy solving, I see a lot of steps where data was discarded, but
for a 1 TB corpus it didn't matter.
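
To make the transitive step concrete, here is a toy sketch.  It is not
LSA (which factors a term-document matrix with SVD); it is a simpler
stand-in that scores A ~ C through shared neighbours B, using
same-sentence co-occurrence as the "proximity" relation.  The function
names and the example sentences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(sentences):
    """Symmetric counts of distinct word pairs seen in the same sentence."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        for a, b in combinations(words, 2):
            counts[a][b] += 1  # A ~ B
            counts[b][a] += 1  # implies B ~ A (symmetric)
    return counts


def second_order(counts, a, c):
    """Score A ~ C via shared neighbours B: sum of count(A,B) * count(B,C)."""
    return sum(n * counts[b].get(c, 0) for b, n in counts[a].items())


if __name__ == "__main__":
    corpus = ["doctor treats patient", "nurse treats patient"]
    counts = cooccurrence(corpus)
    print("direct:", counts["doctor"].get("nurse", 0))
    print("via neighbours:", second_order(counts, "doctor", "nurse"))
```

Here "doctor" and "nurse" never appear together, so the direct count is
zero, yet the shared neighbours "treats" and "patient" give a nonzero
second-order score.  With a big enough corpus the direct count would
usually be available and the inference step could be skipped.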

You might be able to construct a high quality (diverse) corpus by
finding the minimum set of documents that contain all the words in a
dictionary.  You might also consider a corpus of several languages to
test whether a model can learn a generic language without hardcoded
rules for English.
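
Finding the true minimum set of documents is an instance of set cover,
which is NP-hard, so in practice one would approximate it.  Below is a
minimal sketch of the standard greedy heuristic, assuming
whitespace-separated tokens and a lowercase dictionary; the function
name and data layout are illustrative only.

```python
def greedy_cover(documents, dictionary):
    """Greedily pick documents until no document adds a new dictionary word.

    documents: dict mapping a document name to its text.
    dictionary: iterable of lowercase words to cover.
    Returns (document names in pick order, words that no document contains).
    """
    remaining = set(dictionary)
    # Keep only the dictionary words each document actually contains.
    doc_words = {name: set(text.lower().split()) & remaining
                 for name, text in documents.items()}
    chosen = []
    while remaining:
        # Pick the document covering the most still-uncovered words.
        best = max(doc_words,
                   key=lambda name: len(doc_words[name] & remaining),
                   default=None)
        if best is None or not (doc_words[best] & remaining):
            break  # nothing left can be covered
        chosen.append(best)
        remaining -= doc_words[best]
    return chosen, remaining


if __name__ == "__main__":
    docs = {"a": "the cat sat", "b": "the dog sat on the mat", "c": "cat mat"}
    words = {"the", "cat", "sat", "dog", "mat", "on", "zebra"}
    print(greedy_cover(docs, words))
```

The greedy choice does not guarantee the smallest possible set, and ties
are broken by document order, so the result is only an approximation of
the corpus described above.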

-- Matt Mahoney


 