PAN Wikipedia Vandalism Corpus (Training Version) Available

Martin Potthast

Mar 15, 2010, 2:22:14 PM

to pan-workshop-series

Dear all,

I am happy to announce that we have just now uploaded the training
collection for the vandalism detection task to our Web pages.
You can download it directly using the following link:
http://www.uni-weimar.de/medien/webis/research/corpora/pan-wvc-10/pan10/pan10-vandalism-training-collection-2010-03-15.zip

The training collection comprises 15000 edits, of which 944 have been
found to be vandalism. This ratio of about 6% is in concordance with the
literature. The edits have been sampled at random from a week's worth
of edits on Wikipedia. The annotations have been obtained using
the crowdsourcing platform Mechanical Turk from Amazon
(www.mturk.com). Each edit has been labeled by at least three
annotators recruited there. If no more than 2/3 of the annotators
agreed, additional annotations from other annotators were
gathered until more than 2/3 agreed. In some cases up to 15 annotators
have reviewed an edit. Based on these annotations we have set up the
gold standard that is to be learned by your vandalism detector.
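
As a rough illustration of the stopping rule described above, the following Python sketch checks whether an edit still needs further annotations. The function name and the label strings are hypothetical, and reading the rule as "strictly more than 2/3 of the annotators must agree" is an assumption; the exact procedure run on Mechanical Turk may have differed in its details.

```python
from collections import Counter

MIN_ANNOTATORS = 3  # each edit is labeled by at least three annotators


def needs_more_annotations(labels):
    """Return True if another annotation should be gathered for an edit.

    Assumption: agreement is reached once strictly more than 2/3 of the
    annotators have chosen the same label.
    """
    if len(labels) < MIN_ANNOTATORS:
        return True
    # Count of the most frequent label among the annotations so far.
    top_count = Counter(labels).most_common(1)[0][1]
    # Integer form of top_count / len(labels) <= 2/3 (avoids float rounding).
    return 3 * top_count <= 2 * len(labels)
```

Under this reading, two out of three annotators agreeing is exactly 2/3, so a fourth annotation would be requested, whereas three out of four agreeing would be enough.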

Since the annotators we recruited resemble the group of people
who frequently use Wikipedia, we hope that the concept of vandalism
has been captured well, while our edit sampling strategy ensures a
representative distribution of regular edits versus vandalism edits.
The high class imbalance between regular edits and vandalism edits
makes vandalism detection a particular challenge.

Unlike the preliminary training corpus offered until now, this corpus
contains for each edit its old and new revision, and a lot of meta
information on the edits as well as the annotators. We found that
having everything mise en place makes all the difference, and we'd
rather you work on the task itself than on acquiring the data
necessary. However, if you find anything important missing in the
corpus, or if you find errors, please let us know.

Good luck with your experiments!

Martin

--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.cc

Sameer Rao_200911015

Apr 8, 2010, 4:29:36 AM

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

Hello Sir,
This is regarding the vandalism corpus. I am trying to
figure out the "article-revisions" directory.

> Unlike the preliminary training corpus offered until now, this corpus
> contains for each edit its old and new revision, and a lot of meta
> information on the edits as well as the annotators. We found that
> having everything mise en place makes all the difference, and we'd
> rather you work on the task itself than on acquiring the data
> necessary. However, if you find anything important missing in the
> corpus, or if you find errors, please let us know.
>

In the directory, I found that the original and the revised edit
are placed at different locations.
Put another way, given two edits, how do I find which is the original
and which is the revised edit?
Or is it that my algorithm should find this?
I hope you understand my question.

Regards,
Sameer Rao,
DA-IICT, Gandhinagar,
India

Martin Potthast

Apr 8, 2010, 4:50:32 AM

to pan-works...@googlegroups.com

Dear Sameer,

>> Unlike the preliminary training corpus offered until now, this corpus
>> contains for each edit its old and new revision, and a lot of meta
>> information on the edits as well as the annotators.
>

> Here, I found that the old and revised edits are placed at different
> places which makes it difficult to keep a track of which is the source
> and which is vandalized.

The physical storage position of a revision is not important: the name
of a file is its revision identifier, which is referenced in
edits.csv. So, an edit is composed of two revision IDs,
<oldrevisionid> and <newrevisionid>. Given these two integers, all you
have to do is to look up the files <oldrevisionid>.txt and
<newrevisionid>.txt in the article-revisions directory.

As an example take the first edit in edits.csv:
oldrevisionid = 327116914
newrevisionid = 327119318

The respective files can be found here:
Old article revision: ./article-revisions/part26/327116914.txt
New article revision: ./article-revisions/part07/327119318.txt

> Put in other way, given two edits how do we get to know which is the
> original and which one is revised. Or is it that our algorithm should
> make this out?

All you have to do is to traverse the article-revisions directory and
build a hash map which maps file names to the respective file paths
and which can subsequently be used to retrieve the revision files that
correspond to any given edit.
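
A minimal Python sketch of such a hash map, in case it helps. The function name is made up for illustration, and the sketch assumes that every revision file is named <revisionid>.txt and sits somewhere below the article-revisions directory, as in the example above.

```python
import os


def build_revision_index(root):
    """Map each revision ID to the path of its .txt file below root."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            stem, ext = os.path.splitext(name)
            # Revision files are named <revisionid>.txt; skip anything else.
            if ext == ".txt" and stem.isdigit():
                index[int(stem)] = os.path.join(dirpath, name)
    return index
```

Calling build_revision_index("./article-revisions") once and then looking up index[327116914] should give the path of the old revision from the example, in whichever part directory it happens to be stored.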

A new revision of an edit can be the old revision of another edit on
the same article. Therefore we have chosen this organization in order
not to store the respective revisions twice.

It is definitely _not_ your task to figure out which revisions follow
each other. This information is already at hand in edits.csv.
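
To illustrate, here is a small Python sketch that reads edits.csv and resolves each edit to its pair of revision files. It assumes that edits.csv has a header row with the columns oldrevisionid and newrevisionid and is UTF-8 encoded, and that index is a dictionary from integer revision IDs to file paths; the function name is hypothetical.

```python
import csv


def edit_revision_paths(edits_csv, index):
    """Yield (old_path, new_path) for every edit listed in edits_csv.

    index maps integer revision IDs to file paths.
    Assumption: edits.csv has a header row with the columns
    'oldrevisionid' and 'newrevisionid'.
    """
    with open(edits_csv, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            old_id = int(row["oldrevisionid"])
            new_id = int(row["newrevisionid"])
            # The order comes straight from edits.csv; nothing is inferred.
            yield index[old_id], index[new_id]
```

The first element of each pair is always the old revision and the second the new one, exactly as given in edits.csv.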

Best,

Sameer Rao_200911015

Apr 8, 2010, 7:02:17 AM

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

Hey Martin,
Thanks for the wonderful reply. Now the task is clear to me.

On Apr 8, 1:50 pm, Martin Potthast <martin.potth...@uni-weimar.de>
wrote:

Regards,
Sameer Rao.

Martin Potthast

Apr 8, 2010, 7:25:46 AM

to pan-works...@googlegroups.com

You're welcome, Sameer! :-)
