plug -- data story on wikipedia abuse in India


Shijith Kunhitty

Dec 21, 2021, 3:37:46 AM
to datameet
Hi, my name is Shijith, and I'm a freelance data journalist. (Worked previously at Hindustan Times and IndiaSpend, have also contributed to datameet.org in the past.)

Just wanted to plug a data story I did recently about Wikipedia abuse in India. Such abuse is an old problem, but it's getting more media attention now that users are distorting facts on pages about the Delhi riots or the farmer protests. Sometimes users engage in outright vandalism, deleting whole sections from a page.

I tried to determine which Wikipedia pages faced the most abuse this year, and I also introduced a Twitter account that lets people track Wikipedia abuse weekly.

This is the link to the story: https://shijith.com/blog/wikipedia-page-abuse/

This is the Twitter account for tracking Wikipedia abuse every week: http://twitter.com/abuse_checker

And here's the Python code I used for the project: https://github.com/shijithpk/wikipedia_abuse_checker

(Am in the process of reworking the code. Right now it queries the Wikipedia API every week for the edit histories of over 150k articles, and the whole run takes 2 days. Discovered an API endpoint for recent changes that should make things more efficient.)

If you have any questions or feedback, do let me know!
Thanks, Shijith
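
For readers wondering what the recent-changes approach mentioned above might look like, here is a minimal Python sketch. It is not the code from the repository linked above. It assumes that the MediaWiki change tags for undo, rollback and manual reverts are a reasonable proxy for "abuse", and the function names (recent_changes_params, count_reverts) are made up for illustration. The idea is to pull one stream of recent edits for the whole wiki and filter it against the tracked pages, instead of requesting each page's history separately.

```python
from collections import Counter

API_URL = "https://en.wikipedia.org/w/api.php"

# Change tags that MediaWiki attaches to reverting edits.
# Assumption: counting these is a fair stand-in for counting reverts.
REVERT_TAGS = {"mw-undo", "mw-rollback", "mw-manual-revert"}


def recent_changes_params(start, end, limit=500):
    """Build query parameters for one list=recentchanges request.

    Results run from newest to oldest by default, so `start` is the
    later timestamp and `end` the earlier one.
    """
    return {
        "action": "query",
        "format": "json",
        "list": "recentchanges",
        "rcstart": start,
        "rcend": end,
        "rcnamespace": 0,  # articles only
        "rctype": "edit",
        "rcprop": "title|ids|timestamp|tags",
        "rclimit": limit,
    }


def count_reverts(changes, tracked_titles):
    """Count revert-tagged edits per page, limited to the tracked pages."""
    counts = Counter()
    for change in changes:
        tags = set(change.get("tags", []))
        if change["title"] in tracked_titles and tags & REVERT_TAGS:
            counts[change["title"]] += 1
    return counts
```

Actually fetching the data would mean sending these parameters to API_URL with any HTTP client and, while the response contains a "continue" object, repeating the request with those values merged in. The entries under query.recentchanges in each response are what count_reverts expects as `changes`.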

Shyamal Lakshminarayanan

Dec 22, 2021, 1:30:20 AM
to data...@googlegroups.com
Dear Shijith,

This is very interesting work, but I think it misses a lot of action because of the rather simple approach to identifying disputes. For example, calls for action are often made in WhatsApp groups and on Twitter, such as "meat puppet" campaigns to change how the entry on "Love Jihad" reads or to have "Adam's Bridge" renamed. Campaigns like these have resulted in those pages being placed under protection, which in turn drastically lowers their revert counts. A script that follows widely followed Twitter handles (OpIndia, for instance) complaining about Wikipedia pages might surface more disputed pages.

best wishes
Shyamal



Shijith Kunhitty

Dec 22, 2021, 2:40:28 AM
to datameet
Oh yeah, the fact that certain Wikipedia pages come under protection definitely lowers their revert counts. I acknowledge that in my blog as well.

Blog excerpt: "This is more of a caveat. Some of these pages will have been placed under protection to stop getting abused. Protection could, for example, mean edits by anonymous users not appearing till they're approved, or if someone has been a registered user for less than 4 days old and has done less than 10 edits, they won't be able to edit certain pages. The page on Narendra Modi is under even stricter protection—only someone who's been a user for more than 30 days and has done over 500 edits can edit the page, and this has been the case since April. Such protections brings down the abuse levels for many pages, so that should be kept in mind."

But then you could turn around and say, hey Shijith, if you're aware of this, why didn't you factor it into your calculations? It's because I think there should be a limit to the level of complexity a journalist aims for in their story. I accept that I posted this on the datameet mailing list, but the story is aimed at a general audience.

Again from my blog: "I've done work in the past that tries to be (conscientiously) rigorous, but I don't think journalism is the place for such work. Academic journals maybe, but not publications meant for the general public. In the pursuit of precision and conclusion validity, a lot of data journalism in India has become completely unreadable. I don't think there's anything wrong in admitting upfront that this post is a best-effort attempt, the conclusions may not be completely valid, but that hopefully this promotes the topic of wikipedia abuse in India as a legitimate avenue of research. And that some academic out there does a better job than I did in their paper. But I don't think a journalist should be the one doing that paper."

As for how far the abuse of certain Wikipedia pages is organised, and how that coordination happens over Telegram and WhatsApp: many of these groups are private, so it's difficult to monitor that activity and find out which pages they are targeting.

Even so, I don't think I'll be missing out on any pages. Right now the script goes through 150k pages (and once I've reworked the code, all the pages) that have been assessed by WikiProject India, the group of editors that maintains pages about India, and they cover almost everything. So there's no scope for missing out on disputed pages.

To your point about following OpIndia or other right-wing handles to find which Wikipedia pages they are targeting, the only issue I have is that it won't capture abuse that isn't the result of coordinated, organised campaigns. Abuse that results from individual actors editing separately, but is still devastating at an aggregate level, is also interesting to me. The right-wing agenda is pretty much all-pervasive now, and people don't need to be prodded by politicians, media etc. to do their bidding; they act of their own volition :(