[PAN'10] Task 2 training corpus. Gold is gold is gold is gold?

dmtr

May 6, 2010, 5:12:15 PM

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

Here are a few annotations from the training corpus that look
questionable ('false negatives'). These 3 edits are from 25 edits
selected randomly (*) from the training corpus, analyzed (using
'future' information) automatically, identified as false negatives and
reviewed by me.

This edit ('newrevisionid': '326862946', 'editid': '33213', 'class':
'regular') was identified as regular in the corpus, but you can see at
http://en.wikipedia.org/w/index.php?diff=326862946&oldid=326142240
that it was in fact reverted later on as link spam / a link to
copyrighted material. The contributions of user
http://en.wikipedia.org/wiki/Special:Contributions/76.30.60.171
were mostly external links and were generally reverted.

This edit ('newrevisionid': '327477782', 'editid': '14627', 'class':
'regular') was identified as regular in the corpus, but you can see at
http://en.wikipedia.org/w/index.php?diff=327477782&oldid=326754371
that in the next revision it was reverted as link spam. The only
contribution of user
http://en.wikipedia.org/wiki/Special:Contributions/221.247.180.140
was this external link, which was reverted as a spam link.

This edit ('newrevisionid': '329972485', 'editid': '19204') was
identified as regular in the corpus, but you can see at
http://en.wikipedia.org/w/index.php?diff=329972485&oldid=329950401
that in the next revision it was reverted as redundant and unnecessary
wikilinks. In the next few revisions of the article the user continued
the redundant wikilinking and was reverted again and again. On the
user's talk page, http://en.wikipedia.org/wiki/User_talk:24.37.41.212,
you can see that the user had a block history for overlinking.

-- Dmitry

--
You received this message because you are subscribed to the Google Group "PAN".
Visit this group at http://groups.google.com/group/pan-workshop-series

Martin Potthast

May 7, 2010, 3:56:05 AM

to pan-works...@googlegroups.com

Hi Dmitry,

thanks for pointing those out.

In general, false negatives are to be expected, and they will appear
more often than false positives. The annotations obtained via AMT are
not perfect.

> This edit ('newrevisionid': '326862946', 'editid': '33213', 'class':
> 'regular') was identified as regular in the corpus, but you can see
> http://en.wikipedia.org/w/index.php?diff=326862946&oldid=326142240
> that in fact it was reverted later on as link spam / link to
> copyrighted material, user http://en.wikipedia.org/wiki/Special:Contributions/76.30.60.171
> contributions were mostly external links and were generally reverted.

True, this may be considered vandalism, since the editor did not post
this link for the first time.
Had he done it for the first time, the case would be different.

> This edit ('newrevisionid': '327477782', 'editid': '14627', 'class':
> 'regular') was identified as regular in the corpus, but you can see
> http://en.wikipedia.org/w/index.php?diff=327477782&oldid=326754371
> that in the next revision it was reverted as link spam, user
> http://en.wikipedia.org/wiki/Special:Contributions/221.247.180.140
> only contribution was this external link reverted as a spam link.

I tend to disagree on this one. The following editor removed the link
with the comment 'rvs'. Does this mean "reverted because of spam"?
If so, you've got a point. Otherwise, this is just a link that got rejected.

Links appear to be a problem in general on Wikipedia, since every fact
should be linked, but not every link is citeable.

> This edit ('newrevisionid': '329972485', 'editid': '19204') was
> identified as regular in the corpus, but you can see
> http://en.wikipedia.org/w/index.php?diff=329972485&oldid=329950401
> that in the next revision it was reverted as redundant and unnecessary
> wikilinks, in the next few revisions of the article user had continued
> redundant wikilinking and was reverted again and again. On the talk
> page of the user you can see http://en.wikipedia.org/wiki/User_talk:24.37.41.212
> that the user had a block history for overlinking.

In this case it seems the editor who included the intra-Wikipedia
links got carried away a bit by the fact that for many words used in the
text there are in fact articles available. Maybe this editor was
simply unaware of Wikipedia's policy not to link everything that could
be linked but only the important things. Maybe he or she is just
stubborn. In any case, even if the editor does those kinds of things
repeatedly, edit wars are not vandalism, and otherwise I see no
indication of meaning harm.

In general, I'd say that there will be many debatable cases: we ended
up with 70 edits on which up to 30 people were completely unsure
whether the editor meant harm or not. And I'm inclined to think that
there are many more such cases we didn't find. However, our budget was
too limited to get that many votes for every edit. ;-)

However, if the features you implement are sensitive enough to detect
the above cases, we may try to get more votes on those edits where you
disagree with the gold standard to see what's happening, to further
reduce the error rate, and to judge your algorithm fairly. But keep an
eye on your false positive rate, because most edits are certainly
regular.

Best,
Martin

--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.cc

Santiago M. Mola

May 7, 2010, 5:08:22 AM

to pan-works...@googlegroups.com

On Thu, May 6, 2010 at 11:12 PM, dmtr <dchi...@gmail.com> wrote:
> Here are a few annotations from the training corpus that look
> questionable ('false negatives'). These 3 edits are from 25 edits
> selected randomly (*) from the training corpus, analyzed (using
> 'future' information) automatically, identified as false negatives and
> reviewed by me.
>
> This edit ('newrevisionid': '326862946', 'editid': '33213', 'class':
> 'regular') was identified as regular in the corpus, but you can see
> http://en.wikipedia.org/w/index.php?diff=326862946&oldid=326142240
> that in fact it was reverted later on as link spam / link to
> copyrighted material, user http://en.wikipedia.org/wiki/Special:Contributions/76.30.60.171
> contributions were mostly external links and were generally reverted.

Apart from what Martin says, I'd say this is a bit clearer than the next
one. For a human, this is quickly spotted as spam: it follows a
pattern common in online game spam.

> This edit ('newrevisionid': '327477782', 'editid': '14627', 'class':
> 'regular') was identified as regular in the corpus, but you can see
> http://en.wikipedia.org/w/index.php?diff=327477782&oldid=326754371
> that in the next revision it was reverted as link spam, user
> http://en.wikipedia.org/wiki/Special:Contributions/221.247.180.140
> only contribution was this external link reverted as a spam link.

While it looks like spam, I wouldn't be inclined to classify this as
vandalism in this task. But then, it depends on your approach.

Detecting spam that is done through well-formed, legitimate-looking
links probably needs to be treated as a separate problem, with
a specialized tool for link analysis that considers the history of the
article, related articles, the contributor, various statistics on external
links across all of Wikipedia, etc.

As a Wikipedia editor, I find myself arguing about link spam from time
to time. Sometimes it's extremely obvious and is handled with
blacklists in the end, but in other cases it is so subtle (and
sometimes not even in bad faith) that senior editors can't reach a
unanimous decision.

That said, my answer is probably a bit biased because the system I'm
working on will be absolutely incapable of detecting this as vandalism
any time soon. If yours does an advanced analysis of links, I look
forward to seeing it; it'll be great ;-)

> This edit ('newrevisionid': '329972485', 'editid': '19204') was
> identified as regular in the corpus, but you can see
> http://en.wikipedia.org/w/index.php?diff=329972485&oldid=329950401
> that in the next revision it was reverted as redundant and unnecessary
> wikilinks, in the next few revisions of the article user had continued
> redundant wikilinking and was reverted again and again. On the talk
> page of the user you can see http://en.wikipedia.org/wiki/User_talk:24.37.41.212
> that the user had a block history for overlinking.

In my opinion, this looks like repeated policy violation and a deep
ignorance of Wikipedia style guidelines. But it doesn't look like
vandalism.

Best regards,
--
Santiago M. Mola
Jabber ID: cool...@gmail.com

Dmitry Chichkov

May 7, 2010, 5:47:37 PM

to pan-works...@googlegroups.com

Hi Martin,

> > This edit ('newrevisionid': '329972485', 'editid': '19204') was
> > identified as regular in the corpus, but you can see
> > http://en.wikipedia.org/w/index.php?diff=329972485&oldid=329950401
> > that in the next revision it was reverted as redundant and unnecessary
> > wikilinks, in the next few revisions of the article user had continued
> > redundant wikilinking and was reverted again and again. On the talk
> > page of the user you can see http://en.wikipedia.org/wiki/User_talk:24.37.41.212
> > that the user had a block history for overlinking.
>
> In this case it seems the editor who included the intra-Wikipedia
> links got taken away a bit by the fact that for many words used in the
> text there are in fact articles available. Maybe this editor was
> simply unaware of Wikipedia's policy not to link everything that could
> be linked but only the important things. Maybe he or she is just
> stubborn. In any case, even if the editor does those kinds of things
> repeatedly, edit wars are not vandalism, and otherwise I see no
> indication of meaning harm.

Here's a bit more on that case: http://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard/IncidentArchive562#Overlinking_by_User:24.37.41.212

> In general, I'd say that there will be many debatable cases: we ended
> up with 70 edits on which up to 30 people were completely unsure
> whether the editor meant harm or no. And I'm inclined to think that
> there are many more such cases we didn't find. However, our budget was
> too limited to get that many votes for every edit. ;-)

Do we need to keep debatable cases in the training set?
In the evaluation set - yes - definitely. But in the training set?

> However, if the features you implement are that sensitive, to detect
> the above cases, we may try to get more votes on those edits where you
> disagree with the gold standard to see what's happening, to further
> reduce the error rate, and to judge your algorithm right. But keep an
> eye on your false positive rate, because most edits are certainly
> regular.

These cases were only picked up because I used 'edit future' information.
I doubt I'd be able to identify them that easily with information only from the past.
There is something of a belief propagation problem here (NP-complete).
On the other hand, we only need to get 1 bit of entropy per edit :)

-- Regards, Dmitry Chichkov

Martin Potthast

May 10, 2010, 3:19:30 AM

to pan-works...@googlegroups.com

> Do we need to keep debatable cases in the training set?
> In the evaluation set - yes - definitely. But in the training set?

If they are not present in the training corpus, then you wouldn't be
able to prepare for them.

> These cases were only picked up because I've used 'edit future' information.
> I doubt I'd be able to identify them that easily with information only from
> the past.

That's what I thought, too.

Best,
Martin

--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.cc

Dmitry Chichkov

May 10, 2010, 4:33:53 PM

to pan-works...@googlegroups.com

Here is an interesting one:

Diff: http://en.wikipedia.org/w/index.php?diff=328327765
Annotation: ({'editcomment': '/* History */', 'newrevisionid': '328327765', 'editid': '4090', 'diffurl': 'http://en.wikipedia.org/w/index.php?diff=328327765&oldid=328327614', 'articletitle': 'Indonesia', 'oldrevisionid': '328327614', 'edittime': '2009-11-28T05:23:25Z', 'editor': '170.140.59.68'}, {'class': 'regular', 'totalannotators': '3', 'annotators': '3', 'editid': '4090'})

Apparently 170.140.59.68 sneakily deleted article content in the process of removing his own blatant vandalism.

-- Cheers, Dmitry

dmtr

Jun 7, 2010, 5:49:32 AM

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

Hi Martin,

I've done a more thorough analysis of the training corpus quality,
evaluating approximately 20% of it.
In brief, I can tell that it has very high quality and very few
'false positives' (~0.1%).
The rate of 'false negatives', however, is not as good (~15%?), with
the most problematic areas including:

* 'Tests': 31 -- a test edit (not really vandalism, but the article is
ruined anyway);
* 'NPOV': 23 -- violation of the neutral point of view (only from users
with an NPOV history);
* 'Sneaky Vandalism': 17 -- the edit is part of several edits by the
same user resulting in vandalism;
* 'Notability Guidelines Violation': 10;
* 'Misinformation': 9 -- deliberate misinformation;
* 'Blanking': 9 -- an edit blanking an article;
* 'Graffiti': 7 -- graffiti left by a user.

The following information was used during that evaluation (human review
stage):
* edit diff/comment;
* edit-group diff (if user did several consecutive edits in a row);
* editor talk/user pages/contributions;
* revert diff/comment (if the edit was later reverted);
* reverting editor talk/user pages/contributions;

So here are some more annotations from the training corpus that look
questionable ('false negatives'). These 62 edits are from 3045 edits
selected randomly (*) from the training corpus, analyzed (using
'future' information) automatically, identified as false negatives and
reviewed by me. There are also 2 'false positives', similarly
analyzed and identified.

http://pan-workshop-series.googlegroups.com/web/pan-wvc-10-training-set-evaluation-3k.zip

The archived .csv file has the following structure:
editid,newrevisionid,class,annotators,totalannotators,known,verified,diffurl,editgroupdiffurl,revertdiffurl,revertcomment

Fields:
known - machine label + human override ('good' = regular, 'bad' =
vandalism);
verified - human review label ('Regular' and 'Constructive' are 'good';
everything else is considered vandalism. See the -legend.csv file);
editgroupdiffurl - contains a diff, if the user did several edits in a
row;
revertdiffurl - contains a diff to a revert, if the edit was later
reverted (MD5 detection only);
revertcomment - contains a revert comment, if the edit was later
reverted (MD5 detection only);

Here are a few sample lines from the -3k-positives.csv file:
22306,328702608,regular,3,4,bad,Tests,http://en.wikipedia.org/w/index.php?diff=328702608&oldid=326099193,,http://en.wikipedia.org/w/index.php?diff=328996039,"revert test edit and vandalism"
39110,328823508,regular,3,3,bad,Tests,http://en.wikipedia.org/w/index.php?diff=328823508&oldid=327900510,http://en.wikipedia.org/w/index.php?diff=328823589&oldid=327900510,http://en.wikipedia.org/w/index.php?diff=328850862,"vandalism revert to Sift&Winnow"
10670,326837092,regular,3,3,bad,Graffiti,http://en.wikipedia.org/w/index.php?diff=326837092&oldid=320823225,http://en.wikipedia.org/w/index.php?diff=326837215&oldid=320823225,http://en.wikipedia.org/w/index.php?diff=326912207,"Reverted to revision 320823225 by [[Special:Contributions/Peterscobie|Peterscobie]]; restoring last good version; vandalism. ([[WP:TW|TW]])"
28950,328281439,regular,3,3,bad,Original Research,http://en.wikipedia.org/w/index.php?diff=328281439&oldid=328131790,,http://en.wikipedia.org/w/index.php?diff=328281917,"Rmv unsourced analysis."
18793,327202262,regular,3,4,bad,Misinformation,http://en.wikipedia.org/w/index.php?diff=327202262&oldid=327200176,http://en.wikipedia.org/w/index.php?diff=327202262&oldid=327199257,http://en.wikipedia.org/w/index.php?diff=327257234,"Rvt to last known stable edit - might have missed some useful stuff; sorry"
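
For anyone loading these files, here is a minimal Python sketch. It assumes the column order given in the structure line and that the file has no header row (the sample lines show none). The names FIELDS, parse_rows, disagreements and SAMPLE are illustrative only and are not part of the distributed helper module.

```python
import csv
import io

# Column order as listed in the post; assumed to match the file exactly.
FIELDS = ["editid", "newrevisionid", "class", "annotators", "totalannotators",
          "known", "verified", "diffurl", "editgroupdiffurl",
          "revertdiffurl", "revertcomment"]

def parse_rows(text):
    """Parse header-less CSV text into a list of dicts keyed by FIELDS."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    return list(reader)

def disagreements(rows):
    """Rows the corpus calls 'regular' but the review marks as 'bad'."""
    return [r for r in rows if r["class"] == "regular" and r["known"] == "bad"]

# One of the sample lines quoted in the post.
SAMPLE = (
    '22306,328702608,regular,3,4,bad,Tests,'
    'http://en.wikipedia.org/w/index.php?diff=328702608&oldid=326099193,,'
    'http://en.wikipedia.org/w/index.php?diff=328996039,'
    '"revert test edit and vandalism"\n'
)

rows = parse_rows(SAMPLE)
for row in disagreements(rows):
    print(row["editid"], row["verified"], row["revertcomment"])
```

Using the csv module rather than splitting on commas matters here, because revert comments are quoted and may themselves contain commas or semicolons.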

Some stats:
{'PAN-WVC-10 label Vandalism on known': {'bad': 366, 'good': 2},
 'PAN-WVC-10 label Vandalism on verified': {
     'Blanking': 4,
     'Constructive': 3,
     'Regular': 1,
     'Graffiti': 29,
     'Joke': 3,
     'Revert Warring': 1,
     'SPAM': 1,
     'Tests': 2,
     'Unintentional': 2,
     'Vandalism': 3},
 'PAN-WVC-10 label Regular on known': {'bad': 62, 'good': 2614},
 'PAN-WVC-10 class Regular on verified': {
     'Regular': 22,
     'Constructive': 103,
     'Abuse of Tags': 1,
     'Blanking': 9,
     'Formatting': 1,
     'Graffiti': 7,
     'Image Attack': 3,
     'Irregular Formatting': 3,
     'Joke': 1,
     'Link spam': 7,
     'Misinformation': 9,
     'NONSence': 1,
     'NPOV': 23,
     'Notability Guidelines Violation': 10,
     'Original Research': 2,
     'Partial self-revert': 2,
     'Revert Warring': 1,
     'SPAM': 1,
     'Sneaky Vandalism': 17,
     'Tests': 31,
     'Unintentional': 1}}

-- Regards, Dmitry

Martin Potthast

Jun 7, 2010, 8:28:53 AM

to pan-works...@googlegroups.com

Hi Dmitry,

first of all, thank you for all your hard work on this task, and
thanks also for sharing! I guess that the other participants will
find your evaluation helpful; I certainly do!

> I've done more throughout analysis of the training corpus quality,
> evaluating approximately 20% of it.

That's quite a lot, thank you!

How did you experience manual annotation? How long did it take you?
How much could you do in a row without getting tired? I'm asking
because I noticed that getting tired happens quite quickly, and that
this makes one's annotations a little more prone to error.

> In brief, I can tell that it have very high quality, and very few
> 'false positives' (~ 0.1%).
> The rate of 'false negatives' however is not as good (~15% ?), with

This sounds good. The corpus annotation process on MTurk was designed
to ensure a low false positive ratio. The false negative percentage,
however, appears a bit high to me. I.e., if this number were true, it
would contradict what has been reported time and again in the
literature that vandalism makes up about 5-7% of all edits. This may
very well be the case...

> the most problematic areas including:
>
> * 'Tests': 31 -- a test edit (not really v., but article is ruined
> anyway.);
> * 'NPOV': 23 -- violation of the neutral pov. (only from users with
> npov history);
> * 'Sneaky Vandalism': 17 -- edit is a part or several edits by the
> same user resulting in v;
> * 'Notability Guidelines Violation': 10
> * 'Misinformation': 9 -- deliberate misinformation;
> * 'Blanking': 9 -- edit blanking an article;
> * 'Graffiti': 7 -- graffiti left by a user;

Some of these are easily detected automatically, e.g. blanking. I
noticed during the MTurk annotation process that people somehow
overlooked page blanking more often than other cases.
I would not consider Tests and NPOV as part of the vandalism class,
though they may at times look like vandalism. The former is
specifically excluded from vandalism in the vandalism definition.
But then again, the different sub-classes of vandalism suggested on
Wikipedia are not very discriminative.

> So here are some more annotations from the training corpus that look
> questionable ('false negatives'). These 62 edits are from 3045 edits
> selected randomly (*) from the training corpus, analyzed (using
> 'future' information) automatically, identified as false negatives and
> reviewed by me. There are also 2 ('false positives') similarly
> analyzed and identified.
>
> http://pan-workshop-series.googlegroups.com/web/pan-wvc-10-training-set-evaluation-3k.zip

Thank you! I will be sure to review this data before compiling the
final corpus. Moreover, as mentioned before I plan on including the
machine results if they appear to be viable into the final corpus, and
I will re-annotate edits if the machines frequently disagree with the
corpus annotations. This way, it may be possible to bootstrap an even
better vandalism corpus.

Best regards,

Dmitry Chichkov

Jun 7, 2010, 4:01:04 PM

to pan-works...@googlegroups.com

Hi Martin,

> How did you experience manual annotation? How long did it take you?
> How much could you do in a row without getting tired? I'm asking
> because I noticed that getting tired happens quite quick, and that
> this makes one's annotations a little more prone to error.

Out of the 3045 edits under evaluation, 304 had been flagged automatically and were evaluated manually. Most of these flagged edits were corner cases requiring thorough analysis (including user/reverter contributions, talk pages, etc.), and evaluation took several minutes per edit. I could do approximately 10-15 evaluations in a row. Overall it took ~15 hours spread over a week.

Interestingly, what really helped was having all these different labels. It is much easier to assign a discriminative label ('good', 'constructive', 'link spam', 'test', 'graffiti', 'npov', 'partial self-revert', ...) than a binary label ('vandalism', 'regular'). The requirement to assign a binary label can really get a human stuck.

> > In brief, I can tell that it have very high quality, and very few
> > 'false positives' (~ 0.1%).
> > The rate of 'false negatives' however is not as good (~15% ?), with
>
> [...] literature that vandalism makes up about 5-7% of all edits. This may
> very well be the case...

I've estimated that rate against the total number of vandalism edits, not the total number of edits. The rate of 'false negatives' against the total number of edits would be ~2%. Note that we are also using slightly different criteria in our evaluations. For example, edits with the 'sneaky vandalism' or 'partial self-revert' labels may look good by themselves, but really they are part of a group of several edits in a row made by the same user; the resulting diff is clear vandalism.
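
To make the two denominators explicit, the arithmetic can be sketched in Python using the 'known' counts from the stats posted earlier in the thread. The variable names are illustrative, and the only inputs are the four counts quoted there.

```python
# Counts from the 'known' column of the 3k evaluation, as posted in the stats.
vandalism_bad, vandalism_good = 366, 2   # corpus class 'vandalism'
regular_bad, regular_good = 62, 2614     # corpus class 'regular'

total_edits = vandalism_bad + vandalism_good + regular_bad + regular_good
total_bad = vandalism_bad + regular_bad  # edits judged vandalism by the review

# False negatives: corpus says 'regular', review says 'bad'.
fn_vs_vandalism = regular_bad / total_bad
fn_vs_all_edits = regular_bad / total_edits
# False positives: corpus says 'vandalism', review says 'good'.
fp_vs_all_edits = vandalism_good / total_edits

print(f"FN / vandalism edits: {fn_vs_vandalism:.1%}")
print(f"FN / all edits:       {fn_vs_all_edits:.1%}")
print(f"FP / all edits:       {fp_vs_all_edits:.2%}")
```

With these counts the three rates come out at roughly 14.5%, 2.0% and 0.07%, in line with the ~15%, ~2% and ~0.1% figures quoted in the thread. Note that the four counts sum to 3044, one short of the 3045 edits mentioned.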

> I would not consider Tests and NPOV as part of the vandalism class,
> though they may at times look like vandalism. The former is
> specifically excluded from vandalism in the vandalism definition.
> But then again, the different sub-classes of vandalism suggested on
> Wikipedia are not very discriminative.

Yes, I very much agree. In fact I think that many NPOV and Test edits are quite good, resulting in an overall improvement, simply because the article gets looked at, etc. (random noise/dynamic equilibrium, that kind of thing). I put the 'Constructive' label on these edits and do _not_ consider them vandalism. You'll find ~100 of these in my evaluation.

On the other hand, some users repeatedly violate NPOV guidelines or make 'test' edits which only waste other contributors' time and do not improve anything. These I put into the 'NPOV' and 'Tests' categories.

By the way, it would be interesting to generalize this 'common sense' approach to vandalism evaluation in terms of entropy/information/free energy: something along the lines of the 'free energy' required to make the edit, review the article, etc., and the 'free energy savings' for the users reading the article (because the information was readily available).

> Thank you! I will be sure to review this data before compiling the
> final corpus. Moreover, as mentioned before I plan on including the
> machine results if they appear to be viable into the final corpus, and
> I will re-annotate edits if the machines frequently disagree with the
> corpus annotations. This way, it may be possible to bootstrap an even
> better vandalism corpus.

I'll upload the machine evaluation of the complete training corpus. And if anybody has students in need of some chores, there is an excellent task: reviewing the labels that disagree with the corpus data. You'll be most welcome.

-- With Regards, Dmitry

dmtr

Jun 8, 2010, 3:41:19 PM

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

> > Thank you! I will be sure to review this data before compiling the
> > final corpus. Moreover, as mentioned before I plan on including the
> > machine results if they appear to be viable into the final corpus, and
> > I will re-annotate edits if the machines frequently disagree with the
> > corpus annotations. This way, it may be possible to bootstrap an even
> > better vandalism corpus.
>
> I'll upload the machine evaluation of the complete training corpus. And if
> anybody have some students in a need to do some chores - there is an
> excellent task - reviewing the labels that disagree with the corpus data.
> You'll be most welcome.

The machine evaluation of the complete _training_ corpus (with partial
human review/labels) is available at:
http://pan-workshop-series.googlegroups.com/web/pan-wvc-10-training-set-evaluation-15k.zip

This file supersedes the pan-wvc-10-training-set-evaluation-3k.zip, so
I'm going to delete the xxx-3k file.

Archived .csv file fields:
"editid" - id in the PAN WVC 10 training corpus;
"newrevisionid" - Wikipedia revision id;
"class" - class in the PAN WVC 10 training corpus;
"annotators" - annotators in the PAN WVC 10 training corpus;
"totalannotators" - totalannotators in the PAN WVC 10 training corpus;
"known" - machine label + human override ('good' = regular, 'bad' =
vandalism). Generated by a maxent/megam classifier trained on a
separate corpus (full history of the Wikipedia/Rocket article).
Future revision information was used (if the revision was reverted).
Overridden for the labels that had been reviewed by a human (see the
"verified" field);
"verified" - human review label ('Regular' and 'Constructive' are 'good';
everything else is considered vandalism. See the -legend.csv file);
"diffurl" - contains a diff url for the user edit;
"editgroupdiffurl" - contains a diff url, if the user did several
edits in a row;
"revertdiffurl" - contains a diff url to a revert, if the edit was
later reverted (MD5 detection only);
"revertcomment" - contains a revert comment, if the edit was later
reverted (MD5 detection only);
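
The "MD5 detection only" note presumably refers to identity reverts: a revision whose text hashes to the same MD5 as an earlier revision restores that earlier state, so the revisions in between were reverted. Below is a minimal Python sketch of that idea, assuming a chronological list of (revision id, text) pairs. The function find_reverted and the toy history are hypothetical; this is not the code that produced the files.

```python
import hashlib

def find_reverted(history):
    """Map each reverted revision id to the id of the revision that reverted it.

    history: chronological list of (revision_id, text) pairs.
    A revision counts as a revert when its text's MD5 equals that of an
    earlier revision; the revisions strictly in between are marked reverted.
    """
    last_seen = {}   # md5 digest -> index of the latest revision with that text
    reverted = {}
    for i, (rev_id, text) in enumerate(history):
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in last_seen:
            for j in range(last_seen[digest] + 1, i):
                # Keep the earliest revert if a revision is covered twice.
                reverted.setdefault(history[j][0], rev_id)
        last_seen[digest] = i
    return reverted

# Toy history (made-up texts): revision 3 restores the text of revision 1.
history = [
    (1, "Rocket article text"),
    (2, "Rocket article text BUY CHEAP PILLS"),
    (3, "Rocket article text"),
]
print(find_reverted(history))
```

A scheme like this only catches exact restorations of an earlier state; partial or manual reverts that change even one character go unnoticed, which is consistent with the "MD5 detection only" caveat on these two fields.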

-- With Regards, Dmitry

Martin Potthast

Jun 9, 2010, 4:01:05 AM

to pan-works...@googlegroups.com

Thanks, once more!
Somehow, the download doesn't seem to work.

Best,
Martin

Dmitry Chichkov

Jun 9, 2010, 5:38:03 AM

to pan-works...@googlegroups.com

Looks like some kind of bug/failure on the Google storage network.
Other files cannot be found either: http://groups.google.com/group/pan-workshop-series/files

-- Dmitry

Santiago M. Mola

Jun 10, 2010, 6:41:28 PM

to pan-works...@googlegroups.com

Hello Dmitry,

On Wed, Jun 9, 2010 at 11:38 AM, Dmitry Chichkov <dchi...@gmail.com> wrote:
> Looks like some kind of bug/failure on the google storage network.
> Other files could not be found too:
> http://groups.google.com/group/pan-workshop-series/files

Any chance you can upload it somewhere else or send it by email?

I'm currently doing manual review of errors from my classifier and
also found some false negatives (I'll share them soon). Also found a
good number of test/bad edits that look like vandalism in every regard
except they seem to be in good faith, so I'm trying to get them out of
the training corpus. Surely your findings will be of great help for
that.

Every time I look at the problem, I'm more convinced that, in the
future, we will need a wider range of labels (probably multilabels) in
the corpus. Seeing every "non-bad-faith-vandalism" edit as a good edit
is an added obstacle to building a good model. In particular, test and
spam categories would be really helpful.

Thank you,

Dmitry Chichkov

Jun 10, 2010, 8:04:04 PM

to pan-works...@googlegroups.com

> Any chance you can upload it somehwere else or send it through email?

Apparently Google had temporarily censored .ZIP files from Google Groups:
http://groups.google.com/group/is-something-broken/browse_thread/thread/ee0bc4d6e3cd92da
So I've re-uploaded the file as .tar.gz:
http://groups.google.com/group/pan-workshop-series/web/pan-wvc-10-training-set-evaluation-15k.tar.gz

> I'm currently doing manual review of errors from my classifier and
> also found some false negatives (I'll share them soon). Also found a
> good number of test/bad edits that look like vandalism in every regard
> except they seem to be in good faith, so I'm trying to get them out of
> the training corpus. Surely your findings will be of great help for
> that.
>
> Everytime I look at the problem, I'm more convinced that, in the
> future, we will need a wider range of labels (probably multilabels) in
> the corpus. Seeing every "non-bad-faith-vandalism" edit as a good edit
> is an added obstacle for building a good model. Specially, test and
> spam categories would be really helpful.

Yes, I very much agree with that. For machine labels, at least something like:
'vandalism', 'constructive' (which includes npov/non-bad-faith-vandalism/etc.), 'regular'.

And for human labels I'd prefer to have classes for every type and kind of vandalism, because:
1) These can easily be downcast into binary labels; the opposite is not true;
2) In my opinion it's actually a lot easier for humans to assign concrete labels (e.g. 'npov', 'test', etc.).
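
Point 1 can be sketched in a few lines of Python. The mapping assumes the rule stated for the "verified" field earlier in the thread ('Regular' and 'Constructive' are good, everything else counts as vandalism). The names GOOD_LABELS and downcast are illustrative and are not the helper module's API.

```python
# Labels treated as 'good', per the description of the "verified" field.
GOOD_LABELS = {"Regular", "Constructive"}

def downcast(label):
    """Collapse a fine-grained review label into the binary corpus classes."""
    return "regular" if label in GOOD_LABELS else "vandalism"

labels = ["Tests", "Constructive", "NPOV", "Regular", "Link spam"]
binary = [downcast(label) for label in labels]
print(binary)
```

The mapping is many-to-one, which is why it cannot be inverted: once 'Tests' and 'NPOV' have both become 'vandalism', the distinction is gone.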

There has been some work on that:
* Si-Chi Chin, W. Nick Street, Padmini Srinivasan, and David Eichmann. Detecting Wikipedia vandalism with active learning and statistical language models.
* http://en.wikipedia.org/wiki/Wikipedia:Vandalism#Types_of_vandalism

I use an internal classification based on these works and some other ideas. If you are interested, you can find a draft with label descriptions in pan-wvc-10-training-set-evaluation-labels-15k.csv (there is also a helper Python module available that manages labels, allows reading/writing/merging datasets, etc.).

-- Regards, Dmitry

dmtr

Jun 10, 2010, 8:26:22 PM

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

> I use some internal classification based on these works and some other
> ideas. If you interested, you can find a draft with labels descriptions in
> the: pan-wvc-10-training-set-evaluation-labels-15k.csv (and there is also a
> helper python module available that manages labels, allows to
> read/write/merge datasets, etc).

Here are a few labels (edit class, label, shortcut, description):
=================================================================
good, Regular, R, "Regular constructive edit done in good faith. In
other words, a good edit."

good, Constructive, C, "Inaccurate but constructive edit done in
good faith. Includes partial vandalism cleanup, etc. In other words, a
good but inaccurate edit."

bad, Vandalism, V, "Generic vandalism. Destructive edit done in bad
faith. In other words, an unclassified bad edit."

bad, Tests, T, "Adding unhelpful content to a page (e.g., a few random
characters) as a test. Not done in bad faith."

bad, Unintentional, U, "Inaccurate and destructive addition or removal
of content, but in the belief that it is accurate. Done in good
faith."

bad, Blanking, B, "Removing all or significant parts of a page's
content without any reason, or replacing entire pages with nonsense."

bad, Link spam, L, "Adding or continuing to add external links to non-
notable or irrelevant sites."

bad, Graffiti, G, "Adding profanity, graffiti, random characters
(gibberish) to pages."

bad, Partial self-revert, P, "Hiding vandalism (by making two bad
edits and only reverting one or by reverting an edit only partially)."

bad, Irregular Formatting, IF, "Formatting incorrectly or using
incorrect wiki markup and style."

bad, Misinformation, M, "Adding plausible misinformation to articles,
(e.g. minor alteration of facts or additions of plausible-sounding
hoaxes)."

bad, Image Attack, IA, "Uploading shock images, inappropriately
placing explicit images on pages, or simply using any image in a way
that is disruptive."

bad, Revert Warring, RW, "Reverting good faith contributions of other
users without any reason. Engaging in a revert war."

bad, Overlinking, O, "Using wiki links that have little related
content. Unnecessary linking of common words used in the common way,
for which the reader can be expected to understand the word's full
meaning in context, without any hyperlink help. A link for any single
term (other than for date formats) is excessively repeated in the same
article."

bad, Notability Guidelines Violation, NGV, "Adding an article or text
that does not meet the general notability guideline."

bad, Original Research, OR, "Adding obviously non-encyclopedic
Original Research."

bad, NPOV, NPOV, "Introducing inappropriate material which is not
ideal from a NPOV perspective."

bad, SPAM, SPAM, "Adding text (with or without external links) that
promotes one's personal interests."

bad, Sneaky Vandalism, SV, "Vandalism that is harder to spot, or that
otherwise circumvents detection. Using two or more different accounts
and/or IP addresses at a time to vandalize, abuse of maintenance and
deletion templates, or reverting legitimate edits with the intent of
hindering the improvement of pages. Some vandals even follow their
vandalism with an edit that states "rv vandalism" in the edit summary
in order to give the appearance the vandalism was reverted."
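
Since every label above carries an edit class, collapsing the labels to the binary good/bad split is a plain table lookup. A minimal Python sketch, assuming the shortcuts are used as keys; `LABEL_CLASS` and `to_binary` are illustrative names and not part of the helper module mentioned earlier:

```python
# Shortcut -> edit class, copied from the label list above.
LABEL_CLASS = {
    "R": "good", "C": "good",
    "V": "bad", "T": "bad", "U": "bad", "B": "bad", "L": "bad",
    "G": "bad", "P": "bad", "IF": "bad", "M": "bad", "IA": "bad",
    "RW": "bad", "O": "bad", "NGV": "bad", "OR": "bad",
    "NPOV": "bad", "SPAM": "bad", "SV": "bad",
}


def to_binary(shortcut):
    """Return 'good' or 'bad' for a label shortcut (hypothetical helper)."""
    # Normalise whitespace and case so that ' sv ' and 'SV' are the same key.
    return LABEL_CLASS[shortcut.strip().upper()]
```

An unknown shortcut raises `KeyError`, which seems preferable to silently defaulting to one class.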

-- Regards, Dmitry

Martin Potthast

Jun 11, 2010, 3:22:49 AM6/11/10

to pan-works...@googlegroups.com

All of this doesn't sound much like class labels, because their
semantics are not discriminative enough, but Dmitry is right in that
his labels can be mapped to a binary classification, e.g., vandalism
and regular.

In fact, your labels very much look like tags to me! It shouldn't be a
problem to include those in the corpus to provide additional
meta information about the edits. The problem, of course, is to supply
such tags for all edits.

Best,
Martin

Santiago M. Mola

Jun 12, 2010, 5:33:15 AM6/12/10

to pan-works...@googlegroups.com

Hello,

On Fri, Jun 11, 2010 at 2:04 AM, Dmitry Chichkov <dchi...@gmail.com> wrote:
>
> So I've re-uploaded the file as .tar.gz:
>

> [...]
>

Thank you!

Your classification matches the most obvious false negatives and
false positives that I found. I manually reviewed around 200 edits [*]
that seemed to hurt the performance of my system and
found these:

[*] I might have missed some of them. I did more than 100 in a row. Can't
tell how long it took, since I lost track of time in the
process ;-)

Incorrectly (or debatably) tagged as 'regular':
--------------------------------------------------------------
Clearly vandalism: 42097, 9012, 2321, 5364
Tests that might be vandalism: 38947
Tests that might be in good faith: 20606, 16104, 3949

Incorrectly tagged as 'vandalism'
---------------------------------------------
Regular: 17102
Part of a legitimate partial revert series: 11732, 8780
Debatable guidelines violation but otherwise looks 'regular': 18529

>
> I use some internal classification based on these works and some other
> ideas. If you are interested, you can find a draft with label descriptions in
> pan-wvc-10-training-set-evaluation-labels-15k.csv (there is also a
> helper Python module available that manages labels and can
> read/write/merge datasets, etc).

I'm totally out of time to do anything useful with the additional
tags, but the discrepancies in the main regular/vandalism labels are a
great help.

Best regards,

dmtr

Jun 19, 2010, 6:39:59 PM6/19/10

to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.

It'd be nice to review revisions that are labeled as 'regular' in the
PAN10 training set but were in fact reverted on Wikipedia.
There are 1134 such revisions (out of 14079 labeled as 'regular' in
the training set). Quite a few are false negatives.

Here are the detailed stats:
=================================
Future information/reverts analysis, MD5 hash matching in the
Wikipedia database dump (enwiki-20100312-pages-meta-history).

PAN 10 Training Set:
Revisions labeled as Regular:   Regular: 12945 (91%)  Reverted: 1134 (8%)
Revisions labeled as Vandalism: Regular: 134 (14%)  Reverted: 787 (85%)

PAN 10 Training Set (detailed reverts analysis):
Revisions labeled as Regular:
:regular revision: 10860 (77%)
:regular revert (a duplicate of a previous revision): 2080 (14%)
:between duplicates, by single user (reverted, most likely bad): 733 (5%)
:between duplicates, revert that was reverted (revert war): 188 (1%)
:between duplicates, by other users (reverted, questionable): 101 (0%)
:between duplicates, reverted (self-reverted): 112 (0%)
:regular revision (duplicate of the first revision): 5 (0%)

Revisions labeled as Vandalism:
:between duplicates, by single user (reverted, most likely bad): 700 (76%)
:regular revision: 119 (12%)
:between duplicates, revert that was reverted (revert war): 41 (4%)
:between duplicates, reverted (self-reverted): 33 (3%)
:regular revert (a duplicate of a previous revision): 15 (1%)
:between duplicates, by other users (reverted, questionable): 13 (1%)
=================================
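
For readers who want to try the idea, the core of MD5-based revert detection can be sketched roughly as follows. This is a simplified guess at the approach and not the code that produced the numbers above: it assumes the revision texts of one article are already available in chronological order, and `find_reverts` is a hypothetical name. The finer categories (single user, self-reverted, revert war) need author information and are not reproduced here.

```python
import hashlib


def find_reverts(texts):
    """Classify the revisions of one article by comparing MD5 hashes.

    texts: revision texts in chronological order.
    Returns a list with one of 'regular', 'revert' or 'reverted' per revision.
    """
    hashes = [hashlib.md5(t.encode("utf-8")).hexdigest() for t in texts]
    status = ["regular"] * len(texts)
    last_seen = {}  # hash -> index of the most recent revision with that text
    for j, h in enumerate(hashes):
        i = last_seen.get(h)
        if i is not None:
            # Revision j restores the text of revision i, so it is a revert...
            status[j] = "revert"
            # ...and everything between the two duplicates was undone by it.
            # A revert lying in that range is itself marked as reverted.
            for k in range(i + 1, j):
                status[k] = "reverted"
        last_seen[h] = j
    return status
```

For example, `find_reverts(["a", "a spam", "a"])` marks the middle revision as reverted and the last one as a revert.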

If someone is up for doing a review, "revertdiffurl" in
pan-wvc-10-training-set-evaluation-labels-15k.csv is nonempty for reverted
revisions identified with this algorithm.
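
A reviewer could pull the candidates out of the CSV along these lines. Only the "revertdiffurl" column is named in this thread; the "editid" and "class" column names, the header row, and the function name `reverted_but_regular` are assumptions for illustration, and the sample rows are made-up placeholders, not corpus data.

```python
import csv
import io


def reverted_but_regular(csv_file):
    """Return rows labeled 'regular' whose 'revertdiffurl' is nonempty.

    Assumes a header row with 'class' and 'revertdiffurl' columns.
    """
    reader = csv.DictReader(csv_file)
    return [row for row in reader
            if row["class"] == "regular" and row["revertdiffurl"].strip()]


# Placeholder rows only, to show the expected shape of the input.
SAMPLE = io.StringIO(
    "editid,class,revertdiffurl\n"
    "1,regular,\n"
    "2,regular,http://example.org/diff\n"
    "3,vandalism,http://example.org/diff2\n"
)

for row in reverted_but_regular(SAMPLE):
    print(row["editid"], row["revertdiffurl"])
```

With the real file, the `io.StringIO` sample would be replaced by `open(path, newline="")`.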

-- Regards, Dmitry

Martin Potthast

Jun 21, 2010, 3:23:00 AM6/21/10

to pan-works...@googlegroups.com

Thanks, Dmitry! We will look into this!

Best,
Martin
