[PAN'10] Wikipedia edit counters


Santiago M. Mola

May 12, 2010, 7:25:01 AM
to pan-works...@googlegroups.com
Hello,

I have a question about edit counters. I guess many of you use them as
a feature for identifying vandalism, but how do you extract these edit
counters?

The easy answer for me would be to use the original edit counter [1],
which is really nice because it gives some statistical info like
average edits per page, edited namespaces, etc. But, of course, this
isn't valid because it includes information from after the edit was made.

So I'm left with parsing the Special:Contributions page for each user
and counting the edits made before the corpus was acquired. Are
you using this method? Or something else? Any advice?
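
A minimal sketch of the counting step described above, assuming the contributions have already been collected as (user, timestamp) pairs. The sample data and the names `build_index` and `edits_before` are hypothetical; fetching and parsing Special:Contributions is not shown. The cutoff could be the corpus acquisition date or the timestamp of the edit being judged.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime


def build_index(contributions):
    """Map each user to a sorted list of that user's edit timestamps."""
    index = defaultdict(list)
    for user, timestamp in contributions:
        index[user].append(timestamp)
    for stamps in index.values():
        stamps.sort()
    return index


def edits_before(index, user, cutoff):
    """Count the edits by `user` made strictly before `cutoff`."""
    # .get() avoids creating an empty entry for unknown users.
    return bisect_left(index.get(user, []), cutoff)


# Hypothetical sample data, for illustration only.
contributions = [
    ("Alice", datetime(2009, 11, 3, 12, 0)),
    ("Alice", datetime(2010, 1, 15, 8, 30)),
    ("Alice", datetime(2010, 4, 2, 19, 45)),
    ("Bob", datetime(2010, 3, 1, 10, 0)),
]

index = build_index(contributions)
cutoff = datetime(2010, 2, 1)
print(edits_before(index, "Alice", cutoff))
print(edits_before(index, "Bob", cutoff))
```

Keeping the timestamps sorted per user means the count for any cutoff is a single binary search, so the same index can serve every edit in the corpus.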

Also, in the future, it would be useful to include this per-user
information in the corpus.

[1] http://toolserver.org/~river/cgi-bin/count_edits

Best regards,
--
Santiago M. Mola
Jabber ID: cool...@gmail.com

--
You received this message because you are subscribed to the Google Group "PAN".
Visit this group at http://groups.google.com/group/pan-workshop-series

Martin Potthast

May 12, 2010, 4:47:50 PM
to pan-works...@googlegroups.com
Hi Santiago,

there is always a difference between enacting a rule, and then interpreting it. ;-)

In this case I'd say you should stick with the simple solution of using the information provided by the service you linked. When judging an edit, the information obtained from there possibly includes information from the edit's future, but I'd say the effects will be negligible.

This is not as if, say, you were directly analysing the edited article's revision history right _after_ the edit took place. That would definitely be out of bounds.

Best,
Martin


--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de  ---  www.netspeak.cc

Santiago M. Mola

May 12, 2010, 5:50:26 PM
to pan-works...@googlegroups.com
Hi Martin,

On Wed, May 12, 2010 at 10:47 PM, Martin Potthast
<martin....@uni-weimar.de> wrote:
>
> In this case I'd say you should stick with the simple solution of using the
> information provided by the service you linked. When judging an edit, the
> information obtained from there possibly includes information from the
> edit's future, but I'd say the effects will be negligible.
>

Fair enough ;-)

I can think of concrete cases where using this information will be an
advantage (legitimate users who were newbies then but aren't now) as
well as a disadvantage (an anonymous vandal whose IP is later used by
a legitimate anonymous user). And then, using the actual counters at
the time of the edit has its own complementary pros and cons. So, hey,
let's see the results :-)

Thank you,
--
Santiago M. Mola
Jabber ID: cool...@gmail.com

Dmitry Chichkov

May 12, 2010, 6:08:56 PM
to pan-works...@googlegroups.com
Hi Santiago,

A few words of caution from [1]:

Things to note:

  • A user's edit count does not reflect on the value of their contributions to Wikipedia.
  • 'Project' and 'Project talk' refer to the project namespace: Wikipedia, Wiktionary, Meta, etc.
  • The total edit count does not include deleted edits.
  • Editcountitis can be fatal.
  • Automated mass querying of the edit counter is not allowed.
  • Problems can be reported to river [at] attenuate [dot] org.

You may want to drop an e-mail to the toolserver maintainer before hitting it with your script :)

-- Cheers, Dmitry

Martin Potthast

May 12, 2010, 6:27:41 PM
to pan-works...@googlegroups.com
Good point, Dmitry!

In the training set there will be well above 100000 edits to be analyzed. So, everyone, please don't overrun services that are not prepared to handle load, and don't violate anyone's terms and conditions.
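
One way to avoid overrunning a service is to enforce a minimum delay between requests. The following is only a sketch under stated assumptions: the `Throttle` class and the interval are hypothetical, no real service is contacted, and throttling is no substitute for permission where a service's terms forbid automated querying.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock   # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last = None     # time of the previous request, if any

    def wait(self):
        """Block until the minimum interval since the last call has elapsed."""
        if self.last is not None:
            remaining = self.min_interval - (self.clock() - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()


# Usage: call wait() before each request. The interval here is tiny so the
# demo finishes quickly; a real value should follow the service's policy.
throttle = Throttle(min_interval=0.01)
for editor in ["ExampleUserA", "ExampleUserB", "ExampleUserC"]:
    throttle.wait()
    print("would query the service for", editor)
```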

Best,
Martin
--
Martin Potthast
Bauhaus-Universität Weimar
www.webis.de  ---  www.netspeak.cc

Dmitry Chichkov

May 12, 2010, 11:21:07 PM
to pan-works...@googlegroups.com
Martin,

Would you be willing to contact the toolserver maintainer and facilitate the data extraction?
Here is a small code snippet that would read the corpus, extract editor names and download the data from the toolserver.

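import csv, http.client, urllib.parse

# Collect the distinct editor names from the corpus (edits.csv, UTF-8).
with open("edits.csv", newline="", encoding="utf-8") as f:
    editors = sorted({row["editor"] for row in csv.DictReader(f)})
print(editors)

c = http.client.HTTPConnection("toolserver.org")
c.set_debuglevel(1)
# Only the first five editors, as a test run.
for e in editors[:5]:
    params = urllib.parse.urlencode({"machread": "1", "user": e, "dbname": "enwiki_p"})
    c.request("GET", "/~river/cgi-bin/count_edits?" + params)
    # Save each response under the percent-encoded name, since names may contain "/".
    with open(urllib.parse.quote(e, safe=""), "wb") as out:
        out.write(c.getresponse().read())
c.close()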

-- Regards, Dmitry

Martin Potthast

May 13, 2010, 5:39:49 AM
to pan-works...@googlegroups.com
Dear all,

I've sent a mail to the toolserver owner to ask for access.

Please do not use the toolserver in a way which would violate its terms, e.g., by doing "Automated mass querying of the edit counter [...]".

By the way: the same holds for all other kinds of internet services which you wish to use. Please read the terms of those services carefully, and if they disallow automatic usage, the only way to use them that way is to ask for permission.

I'll get back to you as soon as I hear from the guy who offers this service.

Best,
Martin

Santiago M. Mola

May 13, 2010, 6:41:47 AM
to pan-works...@googlegroups.com
On Thu, May 13, 2010 at 11:39 AM, Martin Potthast
<martin....@uni-weimar.de> wrote:
>
> I've sent a mail to the toolserver owner to ask for access.

Thanks Dmitry and Martin.

Regards,