Task 2: deleted article


Bo

Jun 10, 2010, 11:20:47 PM
to PAN Workshop Series. Uncovering Plagiarism, Authorship, and Social Software Misuse.
Hi there,

While looking over the final set of edits to be evaluated, I
discovered that one of the pages has been deleted:

"69.203.132.155",327353991,327420472,"http://en.wikipedia.org/w/index.php?diff=327420472&oldid=327353991","2009-11-23T05:19:07Z","/* Media and publications */",4384143,"Jewish Task Force"


The revision text is of course still available in the pan2010
download, but I thought that I'd pass it on since it could be
important to entrants using outside signals.
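
For anyone handling such records programmatically, here is a minimal sketch of parsing the line above with Python's `csv` module. The column names in `FIELDS` are assumptions made for illustration; the message itself does not name the columns.

```python
import csv
import io

# Column names are assumed for illustration; the thread does not name them.
FIELDS = ["editor", "oldrevisionid", "newrevisionid", "diffurl",
          "edittime", "editcomment", "articleid", "articletitle"]

def parse_edit_record(line):
    """Parse one quoted, comma-separated edit record into a dict."""
    row = next(csv.reader(io.StringIO(line)))
    record = dict(zip(FIELDS, row))
    # The revision and article identifiers are numeric.
    for key in ("oldrevisionid", "newrevisionid", "articleid"):
        record[key] = int(record[key])
    return record

line = ('"69.203.132.155",327353991,327420472,'
        '"http://en.wikipedia.org/w/index.php?diff=327420472&oldid=327353991",'
        '"2009-11-23T05:19:07Z","/* Media and publications */",'
        '4384143,"Jewish Task Force"')
record = parse_edit_record(line)
print(record["articletitle"], record["newrevisionid"])
```

The record has to be on a single line (or the wrapped pieces rejoined) before parsing, since the line breaks introduced by mail wrapping fall inside quoted fields.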

-Bo

Martin Potthast

Jun 11, 2010, 3:27:45 AM
to pan-works...@googlegroups.com
Hi Bo,

thanks for pointing this out!

Between the time we recorded the live edit log and the time we decided
to download the corresponding articles, we had already lost more than
3000 articles/edits. I think there will be a lot more dead links as
time goes by, since Wikipedia doesn't stop evolving. So downloading
all article revisions was a good choice, because it preserves the
basic usefulness of the corpus in the future.

Martin

> --
> You received this message because you are subscribed to the Google Group "PAN".
> Visit this group at http://groups.google.com/group/pan-workshop-series
> To unsubscribe send email to pan-workshop-se...@googlegroups.com.
>

--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de --- www.netspeak.cc

Dmitry Chichkov

Jun 11, 2010, 4:13:58 AM
to pan-works...@googlegroups.com
Another option is to keep a copy of enwiki-20100130-pages-meta-history.xml.7z (30 GB, en-wiki dump); it contains all article revisions from the training corpus (with complete article histories and much more).
The following revisions from the test corpus seem to be missing from it: [326893407, 326893471, 327839049, 327887694, 327964100, 328143999, 328561560, 328625401, 328782264, 329017705, 329749020, 329800579]
I'm not sure why these revisions are missing.
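
A check like this can be sketched roughly as follows: stream through the decompressed dump and tick off each wanted revision ID as it shows up. This is only an illustration, not the script I used; the function names and the tiny `SAMPLE` document are made up, and the sketch assumes the usual MediaWiki export layout where each `<revision>` has a direct `<id>` child.

```python
import io
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]

def revision_ids_in_dump(stream):
    """Yield the id of every <revision> element in an XML export stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) == "revision":
            # Only direct children: the contributor's <id> is nested deeper.
            for child in elem:
                if local_name(child.tag) == "id":
                    yield int(child.text)
                    break
            elem.clear()  # drop the revision text to limit memory use

def missing_revisions(wanted, stream):
    """Return the sorted ids from `wanted` that never appear in the stream."""
    remaining = set(wanted)
    for rev_id in revision_ids_in_dump(stream):
        remaining.discard(rev_id)
        if not remaining:
            break
    return sorted(remaining)

# Hypothetical miniature dump, just to show the call.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page>
    <title>Example</title>
    <id>1</id>
    <revision>
      <id>326893407</id>
      <contributor><username>A</username><id>7</id></contributor>
      <text>hello</text>
    </revision>
    <revision>
      <id>100</id>
      <contributor><ip>10.0.0.1</ip></contributor>
      <text>world</text>
    </revision>
  </page>
</mediawiki>"""

print(missing_revisions([326893407, 326893471], io.BytesIO(SAMPLE)))
# prints [326893471]
```

For the real dump, `stream` would be the decompressed XML opened in binary mode. A full pass over a dump of that size takes a while, and a more careful version would also release finished `<page>` elements.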

By the way, in case any of you are using this Wikipedia dump: I did some analysis on it and found it sound, although some revisions in it have missing text (the text is present in the live wiki; here is an example: http://en.wikipedia.org/w/index.php?oldid=9450068). Revisions between 2005-01-14 and 2005-05-14 are particularly affected. I couldn't identify any revisions with missing text in enwiki-20100312-pages-meta-history.xml.7z, but that dump is incomplete; it cuts off at revision N=184986173.

I've also tried filtering and plotting empty-text revisions using the following criteria: the comment starts with '/*' (a section edit) AND the edit is not an IP edit. The idea is that section edits generally do not delete the complete article text, and registered users tend to vandalize less. Consequently, we can get a rough picture of which revisions' text went missing due to the backup. Here are the resulting plots:
http://lists.wikimedia.org/pipermail/xmldatadumps-admin-l/attachments/20100517/15a9bed2/attachment.png
http://lists.wikimedia.org/pipermail/xmldatadumps-admin-l/attachments/20100517/15a9bed2/attachment-0001.png
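
The filter behind those plots can be sketched along these lines. This is a rough reconstruction rather than the exact code: the dictionary keys (`timestamp`, `comment`, `is_ip`, `text`) and the sample revisions are hypothetical, and grouping by month is just one possible choice for the plot axis.

```python
from collections import Counter

def looks_like_lost_text(rev):
    """True for an empty-text section edit made by a registered user."""
    text = rev.get("text")
    has_text = bool(text and text.strip())
    comment = rev.get("comment") or ""
    is_section_edit = comment.startswith("/*")
    return (not has_text) and is_section_edit and (not rev["is_ip"])

def count_by_month(revisions):
    """Count matching revisions per 'YYYY-MM' prefix of the timestamp."""
    counts = Counter()
    for rev in revisions:
        if looks_like_lost_text(rev):
            counts[rev["timestamp"][:7]] += 1
    return dict(counts)

# Hypothetical revisions, only to exercise the filter.
revisions = [
    {"timestamp": "2005-02-01T10:00:00Z", "comment": "/* History */ typo",
     "is_ip": False, "text": ""},          # matches
    {"timestamp": "2005-02-03T11:00:00Z", "comment": "/* Plot */",
     "is_ip": True, "text": ""},           # IP edit, skipped
    {"timestamp": "2005-03-09T12:00:00Z", "comment": "blanked",
     "is_ip": False, "text": ""},          # not a section edit, skipped
    {"timestamp": "2005-03-10T13:00:00Z", "comment": "/* Cast */",
     "is_ip": False, "text": None},        # matches
    {"timestamp": "2005-03-11T14:00:00Z", "comment": "/* Cast */",
     "is_ip": False, "text": "content"},   # has text, skipped
]

print(count_by_month(revisions))
# prints {'2005-02': 1, '2005-03': 1}
```

The criteria are a heuristic: a registered user can still blank a page through a section edit, so the counts are an upper-bound-ish signal rather than an exact list of lost revisions.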

-- Dmitry