[PAN'10] Wikipedia revid/annotation from the pan10 corpus [code snippet (python)]

dmtr

May 5, 2010, 10:18:11 PM

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

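import csv

# Map each edit ID to its revision ID (column names as in edits.csv).
with open("edits.csv", newline="") as f:
    d = {e['editid']: e['oldrevisionid'] for e in csv.DictReader(f)}

# Print the revision ID and the gold class of every annotated edit.
with open("gold-annotations.csv", newline="") as f:
    for g in csv.DictReader(f):
        print(d[g['editid']], g['class'])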

--
You received this message because you are subscribed to the Google Group "PAN".
Visit this group at http://groups.google.com/group/pan-workshop-series

Martin Potthast

May 6, 2010, 4:01:09 AM

to pan-works...@googlegroups.com

Good contribution, Dmitry!
This code snippet shows how simple it is to parse the PAN-WVC-10.

If you don't mind, I'll put this in the readme of the final version.

Best,
Martin

--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.cc

dmtr

May 6, 2010, 5:30:06 PM

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

Sure. By the way - in that script example 'newrevisionid' should be
used instead of 'oldrevisionid'. Apparently this is the one that
uniquely identifies the edit and can be used to generate the diff,
like: 'newrevisionid': '327485098', 'editid': '17799', 'class':
'regular', 'totalannotators': '7'.

http://en.wikipedia.org/w/index.php?diff=prev&oldid=327485098
http://en.wikipedia.org/w/index.php?diff=327485098&oldid=327480713
http://en.wikipedia.org/wiki/User_talk:X-N2O
http://en.wikipedia.org/wiki/Special:Contributions/X-N2O
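
The correction above could be sketched roughly as follows: look up 'newrevisionid' instead of 'oldrevisionid' and build the "diff=prev" link from it. This is only an illustration. The helper name diff_links and the DIFF_URL template are made up for this example, and the two in-memory CSV strings stand in for edits.csv and gold-annotations.csv, showing only the columns mentioned in this thread (the real files may contain more).

```python
import csv
import io

# URL pattern taken from the first example link above.
DIFF_URL = "http://en.wikipedia.org/w/index.php?diff=prev&oldid={}"

def diff_links(edits_file, gold_file):
    """Yield (diff URL, class) for every annotated edit."""
    new_rev = {e["editid"]: e["newrevisionid"]
               for e in csv.DictReader(edits_file)}
    for g in csv.DictReader(gold_file):
        yield DIFF_URL.format(new_rev[g["editid"]]), g["class"]

# Assumption: minimal stand-ins for the corpus files, one edit each.
edits_csv = io.StringIO(
    "editid,oldrevisionid,newrevisionid\n17799,327480713,327485098\n")
gold_csv = io.StringIO(
    "editid,class,totalannotators\n17799,regular,7\n")

links = list(diff_links(edits_csv, gold_csv))
for url, cls in links:
    print(url, cls)
```

With the real corpus, the two StringIO objects would be replaced by open file handles for edits.csv and gold-annotations.csv.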

BTW - looks like another link spam/false negative?

-- Dmitry


Martin Potthast

May 7, 2010, 4:06:55 AM

to pan-works...@googlegroups.com

Hi Dmitry,

> Sure. By the way - in that script example 'newrevisionid' should be
> used instead of 'oldrevisionid'. Apparently this is the one that

The reason we include both the old and the new revision ID is that we do
not wish to rely on Wikipedia too much. This is also why we downloaded
the articles and the meta information about the edits. Wikipedia
may change its API and its appearance in the future.
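
To illustrate the point, a diff can be computed offline once the texts of both revisions are available locally. The sketch below uses Python's difflib on two made-up revision texts. How the corpus actually stores the revision texts is not described in this thread, so the strings and the helper name local_diff are assumptions for illustration only.

```python
import difflib

def local_diff(old_text, new_text, old_id, new_id):
    """Return a unified diff between two revision texts as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile="rev " + old_id, tofile="rev " + new_id, lineterm=""))

# Assumption: invented revision texts; only the IDs come from the thread.
old = "First line.\nSecond line.\n"
new = "First line.\nSecond line.\nSee http://example.com\n"

diff = local_diff(old, new, "327480713", "327485098")
print("\n".join(diff))
```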

> BTW - looks like another link spam/false negative?

This comment suggests otherwise:
http://en.wikipedia.org/w/index.php?title=User_talk:Samboy&diff=prev&oldid=327706966

But as I said before, there appears to be a considerable gray area
between regular edits and vandalism edits.

Best,
Martin