PAN-WVC-10 final corpus released


Martin Potthast

Sep 9, 2010, 7:56:53 AM
to pan-workshop-series
Dear all,

the final version of the PAN-WVC-10 can be found at:
http://www.webis.de/research/corpora/pan-wvc-10

Basically, it combines the training collection and the test collection
into a single corpus. Please note that the improvements that many of
you suggested will be considered for the next corpus version.

Best regards,
Martin

--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de  ---  www.netspeak.cc

Santiago M. Mola

Nov 9, 2010, 6:52:40 AM
to pan-works...@googlegroups.com
Hi Martin,

On Thu, Sep 9, 2010 at 1:56 PM, Martin Potthast
<martin....@uni-weimar.de> wrote:
>
> the final version of the PAN-WVC-10 can be found at:
> http://www.webis.de/research/corpora/pan-wvc-10
>

I've noticed that there are, at least, 4 edits that were in the
training corpus but are not in the final one. Here they are:

6588,"Bfitzy7",328459175,328459300,"http://en.wikipedia.org/w/index.php?diff=328459300&oldid=328459175","2009-11-28T22:13:31Z","null","Brian Fitzgerald (meteorologist)"
18609,"AFL5979",327899402,327899603,"http://en.wikipedia.org/w/index.php?diff=327899603&oldid=327899402","2009-11-25T18:23:45Z","null","Scott Allan"
23474,"Kurdistan-tv",329104085,329104363,"http://en.wikipedia.org/w/index.php?diff=329104363&oldid=329104085","2009-12-01T20:36:35Z","null","Kurdo TV"
35220,"Jennymhumphreys",327460610,327460685,"http://en.wikipedia.org/w/index.php?diff=327460685&oldid=327460610","2009-11-23T11:55:36Z","/* Charity Work */","Kate Hardcastle"

I noticed the only thing they have in common is that their page does not exist.
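
For anyone who wants to repeat this comparison, here is a minimal sketch. It assumes each release lists its edits in a CSV file whose first column is the numeric edit id, as in the rows above. The function names and the file paths in the usage comment are hypothetical and not part of the corpus.

```python
import csv


def edit_ids(csv_file):
    """Collect the numeric first column (assumed to be the edit id) of each row.

    Rows whose first field is not a plain integer, such as a header row,
    are skipped.
    """
    return {int(row[0]) for row in csv.reader(csv_file) if row and row[0].isdigit()}


def missing_edits(training_file, final_file):
    """Return the sorted edit ids found in the training listing but not the final one."""
    return sorted(edit_ids(training_file) - edit_ids(final_file))


# Usage with hypothetical paths (adjust to wherever the two listings live):
#
# with open("training/edits.csv", newline="", encoding="utf-8") as training, \
#      open("final/edits.csv", newline="", encoding="utf-8") as final:
#     print(missing_edits(training, final))
```

Comparing sets of ids rather than row counts shows exactly which edits differ, not just that the totals disagree.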

Are there any other changes between the training corpus and the final
release that we should be aware of?

Thank you,
--
Santiago M. Mola
Jabber ID: cool...@gmail.com

Martin Potthast

Nov 9, 2010, 10:08:20 AM
to pan-works...@googlegroups.com
Hi Santiago,

Thanks for noting.

> I've noticed that there are, at least, 4 edits that were in the
> training corpus but are not in the final one. Here they are: [...]
>
> I noticed the only thing they have in common is that their page does not exist.

Exactly. We were late with downloading all the metadata as well as
the wikitexts, and somehow these four edits went amiss. This is why
they have been left out of the final release. 4 edits out of 15000 is
not that much, anyway.

> Are there any other changes between the training corpus and the final
> release that we should be aware of?

Not that I know of. There are still the two small errors of wrong
wikitexts for some revision ids that were reported earlier. Teresa is
correcting them at the moment, and I'll update the corpus later on.

Santiago M. Mola

Nov 9, 2010, 10:11:37 AM
to pan-works...@googlegroups.com
Hi Martin,

On Tue, Nov 9, 2010 at 4:08 PM, Martin Potthast
<martin....@uni-weimar.de> wrote:
>
> Exactly. We were late with downloading all the metadata as well as
> the wikitexts, and somehow these four edits went amiss. This is why
> they have been left out of the final release. 4 edits out of 15000 is
> not that much, anyway.

Yeah. It's definitely not a problem. It's just that something blew up
here because of the number mismatch ;-)

>
> Not that I know of. There are still the two small errors of wrong
> wikitexts for some revision ids that were reported earlier. Teresa is
> correcting them at the moment, and I'll update the corpus later on.
>

Ok. Thank you for the confirmation.

Best,
