> This edit ('newrevisionid': '329972485', 'editid': '19204') was
> identified as regular in the corpus, but you can see
> http://en.wikipedia.org/w/index.php?diff=329972485&oldid=329950401
> that in the next revision it was reverted as redundant and unnecessary
> wikilinks, in the next few revisions of the article user had continued
> redundant wikilinking and was reverted again and again. On the talk
> page of the user you can see http://en.wikipedia.org/wiki/User_talk:24.37.41.212
> that the user had a block history for overlinking.
In this case it seems the editor who included the intra-Wikipedia
links got carried away a bit by the fact that for many words used in the
text there are in fact articles available. Maybe this editor was
simply unaware of Wikipedia's policy not to link everything that could
be linked but only the important things. Maybe he or she is just
stubborn. In any case, even if the editor does those kinds of things
repeatedly, edit wars are not vandalism, and otherwise I see no
indication of meaning harm.
In general, I'd say that there will be many debatable cases: we ended
up with 70 edits on which up to 30 people were completely unsure
whether the editor meant harm or not. And I'm inclined to think that
there are many more such cases we didn't find. However, our budget was
too limited to get that many votes for every edit. ;-)
However, if the features you implement are sensitive enough to detect
the above cases, we may try to get more votes on those edits where you
disagree with the gold standard to see what's happening, to further
reduce the error rate, and to judge your algorithm right. But keep an
eye on your false positive rate, because most edits are certainly
regular.
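To illustrate why the false positive rate matters when most edits are
regular, here is a minimal Python sketch. The 6% prevalence, the 80%
recall and the three false positive rates are made-up illustrative
values, and `precision` is a hypothetical helper, not part of any
corpus tooling.

```python
def precision(prevalence: float, recall: float, fpr: float) -> float:
    """Fraction of flagged edits that are truly vandalism.

    prevalence: share of all edits that are vandalism (assumed value)
    recall:     share of vandalism edits the classifier flags
    fpr:        share of regular edits the classifier flags by mistake
    """
    true_pos = prevalence * recall
    false_pos = (1.0 - prevalence) * fpr
    flagged = true_pos + false_pos
    if flagged == 0:
        return 0.0  # nothing was flagged, so precision is undefined; report 0
    return true_pos / flagged


# Illustrative numbers only: 6% vandalism, 80% recall.
for fpr in (0.01, 0.05, 0.10):
    print(f"FPR={fpr:.2f} -> precision={precision(0.06, 0.80, fpr):.2f}")
```

With these assumed numbers, precision drops quickly as the false
positive rate grows, because regular edits outnumber vandalism edits
by a wide margin.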
First of all, thank you for all your hard work on this task, and
thanks also for sharing! I guess that the other participants will
find your evaluation helpful; I certainly do!
> I've done more throughout analysis of the training corpus quality,
> evaluating approximately 20% of it.
That's quite a lot, thank you!
How did you experience manual annotation? How long did it take you?
How much could you do in a row without getting tired? I'm asking
because I noticed that getting tired happens quite quickly, and that
this makes one's annotations a little more prone to error.
> In brief, I can tell that it have very high quality, and very few
> 'false positives' (~ 0.1%).
> The rate of 'false negatives' however is not as good (~15% ?), with
This sounds good. The corpus annotation process on MTurk was designed
to ensure a low false positive ratio. The false negative percentage,
however, appears a bit high to me. I.e., if this number were true, it
would contradict what has been reported time and again in the
literature that vandalism makes up about 5-7% of all edits. This may
very well be the case...
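One way to see the tension is a small calculation. The quoted ~15% does
not say what it is a percentage of, so the Python sketch below tries
two readings. The 6% share of edits labeled vandalism is an assumption
picked from the 5-7% range above, not a corpus statistic, and the ~0.1%
is read here as the share of vandalism labels that are wrong. Both
function names are hypothetical.

```python
LABELED_VANDALISM = 0.06  # assumed share of edits labeled 'vandalism'
FALSE_POSITIVE = 0.001    # ~0.1%, read as share of wrong 'vandalism' labels
FALSE_NEGATIVE = 0.15     # ~15%, denominator not specified in the thread


def prevalence_if_fn_of_regular(labeled: float, fp: float, fn: float) -> float:
    """Reading A: fn is the share of 'regular' labels that are really vandalism."""
    return labeled * (1.0 - fp) + (1.0 - labeled) * fn


def prevalence_if_fn_of_vandalism(labeled: float, fp: float, fn: float) -> float:
    """Reading B: fn is the share of true vandalism that was labeled 'regular'."""
    return labeled * (1.0 - fp) / (1.0 - fn)


a = prevalence_if_fn_of_regular(LABELED_VANDALISM, FALSE_POSITIVE, FALSE_NEGATIVE)
b = prevalence_if_fn_of_vandalism(LABELED_VANDALISM, FALSE_POSITIVE, FALSE_NEGATIVE)
print(f"Reading A: {a:.1%} of all edits would be vandalism")
print(f"Reading B: {b:.1%} of all edits would be vandalism")
```

Under reading A the implied vandalism share is several times the 5-7%
from the literature; under reading B it stays close to that range. So
which denominator was meant makes a large difference.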
> the most problematic areas including:
>
> * 'Tests': 31 -- a test edit (not really v., but article is ruined
> anyway.);
> * 'NPOV': 23 -- violation of the neutral pov. (only from users with
> npov history);
> * 'Sneaky Vandalism': 17 -- edit is a part or several edits by the
> same user resulting in v;
> * 'Notability Guidelines Violation': 10
> * 'Misinformation': 9 -- deliberate misinformation;
> * 'Blanking': 9 -- edit blanking an article;
> * 'Graffiti': 7 -- graffiti left by a user;
Some of these are easily detected automatically, e.g. blanking. I
noticed during the MTurk annotation process that people somehow
overlooked page blanking more often than other cases.
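As a rough illustration of how blanking could be detected
automatically, here is a minimal Python sketch. The function name, the
5% size threshold and the redirect exception are all assumptions made
for this example; none of them describe the corpus tooling or any
participant's system.

```python
def looks_like_blanking(old_text: str, new_text: str,
                        max_ratio: float = 0.05) -> bool:
    """Flag an edit that removes (almost) all of a non-empty page.

    max_ratio is a hypothetical threshold: the new revision must keep
    at most this fraction of the old revision's length to be flagged.
    """
    old_len = len(old_text.strip())
    new_stripped = new_text.strip()
    if old_len == 0:
        return False  # the page was already empty, nothing to blank
    if new_stripped.upper().startswith("#REDIRECT"):
        return False  # turning a page into a redirect is usually legitimate
    return len(new_stripped) <= max_ratio * old_len


print(looks_like_blanking("Some article text. " * 50, ""))
print(looks_like_blanking("Some article text. " * 50, "Some article text. " * 45))
```

A length check like this will miss blanking that replaces the page with
a comparable amount of junk text, so it can only be one feature among
several.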
I would not consider Tests and NPOV as part of the vandalism class,
though they may at times look like vandalism. The former is
specifically excluded from vandalism in the vandalism definition.
But then again, the different sub-classes of vandalism suggested on
Wikipedia are not very discriminative.
> So here are some more annotations from the training corpus that look
> questionable ('false negatives'). These 62 edits are from 3045 edits
> selected randomly (*) from the training corpus, analyzed (using
> 'future' information) automatically, identified as false negatives and
> reviewed by me. There are also 2 ('false positives') similarly
> analyzed and identified.
>
> http://pan-workshop-series.googlegroups.com/web/pan-wvc-10-training-set-evaluation-3k.zip
Thank you! I will be sure to review this data before compiling the
final corpus. Moreover, as mentioned before, I plan on including the
machine results in the final corpus if they appear to be viable, and
I will re-annotate edits if the machines frequently disagree with the
corpus annotations. This way, it may be possible to bootstrap an even
better vandalism corpus.
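The re-annotation idea could be sketched in Python as follows. This is
only an illustration under stated assumptions: the data layout (one
dict of edit id to label per classifier run), the function name and the
50% disagreement threshold are hypothetical, and the edit ids in the
example are made up.

```python
def edits_to_reannotate(gold, predictions, min_disagree=0.5):
    """Return edit ids where more than min_disagree of the classifier
    runs that judged the edit contradict the gold label.

    gold:        dict mapping edit id -> gold label
    predictions: list of dicts, one per classifier run, edit id -> label
    """
    flagged = []
    for edit_id, gold_label in gold.items():
        votes = [run[edit_id] for run in predictions if edit_id in run]
        if not votes:
            continue  # no classifier judged this edit
        disagree = sum(1 for v in votes if v != gold_label) / len(votes)
        if disagree > min_disagree:
            flagged.append(edit_id)
    return sorted(flagged)


# Made-up example data.
gold = {1: "regular", 2: "vandalism", 3: "regular"}
runs = [
    {1: "vandalism", 2: "vandalism", 3: "regular"},
    {1: "vandalism", 2: "vandalism"},
    {1: "regular", 2: "regular", 3: "regular"},
]
print(edits_to_reannotate(gold, runs))
```

Only the flagged edits would then go back to human annotators, which
keeps the extra annotation cost proportional to the disagreement.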
Best,
Martin
> --
> You received this message because you are subscribed to the Google Group "PAN".
> Visit this group at http://groups.google.com/group/pan-workshop-series
> To unsubscribe send email to pan-workshop-se...@googlegroups.com.
On Wed, Jun 9, 2010 at 11:38 AM, Dmitry Chichkov <dchi...@gmail.com> wrote:
> Looks like some kind of bug/failure on the google storage network.
> Other files could not be found too:
> http://groups.google.com/group/pan-workshop-series/files
Any chance you can upload it somewhere else or send it through email?
I'm currently doing manual review of errors from my classifier and
also found some false negatives (I'll share them soon). Also found a
good number of test/bad edits that look like vandalism in every regard
except they seem to be in good faith, so I'm trying to get them out of
the training corpus. Surely your findings will be of great help for
that.
Every time I look at the problem, I'm more convinced that, in the
future, we will need a wider range of labels (probably multilabels) in
the corpus. Seeing every "non-bad-faith-vandalism" edit as a good edit
is an added obstacle for building a good model. In particular, test and
spam categories would be really helpful.
Thank you,
In fact, your labels very much look like tags to me! It shouldn't be a
problem to include those in the corpus to provide additional
meta information about the edits. The problem, of course, is to supply
such tags for all edits.
Best,
Martin
On Fri, Jun 11, 2010 at 2:04 AM, Dmitry Chichkov <dchi...@gmail.com> wrote:
>
> So I've re-uploaded the file as .tar.gz:
>
> [...]
>
Thank you!
Your classification matches the most obvious false negatives and
false positives that I found. I manually reviewed around 200 edits [*]
that seemed to negatively affect the performance of my system and
found these:
[*] I might have missed some of them. I did 100+ in a row. I can't
tell how long it took, since I lost track of time in the
process ;-)
Incorrectly (or debatably) tagged as 'regular':
-----------------------------------------------
Clearly vandalism: 42097, 9012, 2321, 5364
Tests that might be vandalism: 38947
Tests that might be in good faith: 20606, 16104, 3949

Incorrectly tagged as 'vandalism':
----------------------------------
Regular: 17102
Part of a legitimate partial revert series: 11732, 8780
Debatable guidelines violation but otherwise looks 'regular': 18529
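The lists above could be turned into label corrections along these
lines. This is a minimal Python sketch: the edit ids are copied from
the lists, but the shortened category names, the `corrections` helper
and the choice to relabel only the unambiguous categories are
assumptions made for illustration.

```python
# Edit ids copied from the lists above; category names are paraphrased.
REVIEW = {
    "clearly vandalism": [42097, 9012, 2321, 5364],
    "test, possibly vandalism": [38947],
    "test, possibly good faith": [20606, 16104, 3949],
    "regular": [17102],
    "legitimate partial revert": [11732, 8780],
    "debatable guideline violation": [18529],
}

# Assumption: only unambiguous categories trigger a relabel;
# debatable cases and tests are left for further votes.
RELABEL = {
    "clearly vandalism": "vandalism",
    "regular": "regular",
    "legitimate partial revert": "regular",
}


def corrections(review, relabel):
    """Map edit id -> corrected label for categories listed in relabel."""
    return {
        edit_id: relabel[category]
        for category, ids in review.items()
        if category in relabel
        for edit_id in ids
    }


fixed = corrections(REVIEW, RELABEL)
print(len(fixed), "edits would be relabeled")
```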
>
> I use some internal classification based on these works and some other
> ideas. If you interested, you can find a draft with labels descriptions in
> the: pan-wvc-10-training-set-evaluation-labels-15k.csv (and there is also a
> helper python module available that manages labels, allows to
> read/write/merge datasets, etc).
I'm totally out of time to do anything useful with the additional
tags, but the discrepancies in the main regular/vandalism labels are a
great help.
Best regards,
Best,
Martin