Newsgroups: comp.compression
From: "Matt Mahoney" <matmaho...@yahoo.com>
Date: 19 Jun 2005 11:38:04 -0700
Local: Sun, Jun 19 2005 2:38 pm
Subject: Re: The C-Prize

jim_bow...@hotmail.com wrote:
> Matt Mahoney writes:
> > Nice choice.  Lots of contributors, lots of topics, high quality.

> Here are the drawbacks I see to using the "cur" download of Wikipedia:

> 1) The purported "neutral point of view" is subject to systemic bias.
> The process supported by Wikipedia biases the content toward people who
> are able and willing to contribute under those circumstances.  It's not
> clear how to go about identifying this bias let alone its impact.

> 2) If a snapshot of the "cur" downloads is delayed, it will be subject
> to gaming by people who submit changes to articles that have a
> Kolmogorov complexity that is known to be low just to the gamers.
> Wikipedia keeps prior archives of the 'cur' downloads and it is
> unlikely anyone has gamed the system so far but now that we've broached
> the subject there is the potential that future versions of the 'cur'
> downloads will have been so-gamed.

> 3) The entire history of edits is about a factor of 10 larger than the
> current version.  This would be a superior corpus for the purpose of
> ferreting out how various points of view bias content -- and be very
> relevant to epistemology, social and political sciences -- thereby
> creating a superior AI capable of considering the source of various
> assertions.  If this really is beyond the capacity of computers that
> would be available to viable contestants then it might be necessary to
> defer use of that larger corpus until more capacity or more funding for
> the prize is available.

> For a discussion of the downloads see:

> http://en.wikipedia.org/wiki/Wikipedia:Database_download#Weekly_datab...

I think that as long as the data contains enough text that a compressor
has to learn semantics, syntax, coherence, etc. to compress it
effectively, it should be a good test of AI.  As long as everyone works
with the same data, it should be fair.  Whatever version is used will
become outdated, but this should not matter to a compressor that starts
with no knowledge.  Wikipedia is not completely representative of all
possible human communication, but I think it is close enough.  I don't
think the edit history is that useful, because you end up with lots of
nearly identical texts that are easy to compress.
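
The point about nearly identical revisions can be illustrated with a rough sketch. It assumes synthetic text in place of Wikipedia and Python's standard lzma module as a stand-in for whatever compressor a contestant would use; the helper names (make_revisions, compressed_size) and the sizes chosen are hypothetical, picked only for illustration.

```python
import lzma
import random


def make_revisions(base_words, count, rng):
    """Return `count` successive revisions, each changing one word of the previous one."""
    revisions = []
    words = list(base_words)
    for _ in range(count):
        # Simulate a small edit: overwrite a single randomly chosen word.
        words[rng.randrange(len(words))] = "edited%d" % rng.randrange(10**6)
        revisions.append(" ".join(words))
    return revisions


def compressed_size(text):
    """Size in bytes of the LZMA-compressed UTF-8 encoding of `text`."""
    return len(lzma.compress(text.encode("utf-8")))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic "article": 2000 words drawn from a made-up 500-word vocabulary.
vocab = [
    "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(rng.randint(3, 9)))
    for _ in range(500)
]
base_words = [rng.choice(vocab) for _ in range(2000)]
current = " ".join(base_words)

# "Edit history": the current text plus ten lightly edited revisions.
history = "\n".join([current] + make_revisions(base_words, 10, rng))

raw_ratio = len(history) / len(current)
packed_ratio = compressed_size(history) / compressed_size(current)

print("raw size ratio (history / current):        %.2f" % raw_ratio)
print("compressed size ratio (history / current): %.2f" % packed_ratio)
```

In this sketch the history is about eleven times the size of the current text before compression, but much less than that after compression, because each revision is encoded mostly as references back to the previous one. Real edit histories contain larger and messier changes, so the actual ratio for Wikipedia would differ.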

-- Matt Mahoney


 