New round of MediaWiki archivals


Federico Leva (Nemo)

May 16, 2018, 10:06:28 AM5/16/18
to wikiteam...@googlegroups.com
TL;DR: Please help run and patch the current version of dumpgenerator on
Wikia wikis or other wikis which fail.
https://github.com/WikiTeam/wikiteam/blob/7c545d05b7effc240c8f20885dbcd7bad5632c94/dumpgenerator.py

----

This month I've updated the dumps of some 7k MediaWiki wikis on IA.

Out of the 23073 wikis archived on IA, 9651 were found to be alive
according to checkalive.py. I've tried to archive the non-farm wikis.
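For illustration, the aliveness test boils down to probing each wiki's api.php and seeing whether a sane MediaWiki reply comes back. A minimal sketch of that idea (this is my own simplification, not the actual checkalive.py logic):

```python
import json

def looks_alive(api_response_text):
    """Heuristic: a live MediaWiki answers a siteinfo query with JSON
    containing a 'query' -> 'general' section; anything else (HTML error
    pages, parking pages, empty bodies) counts as dead."""
    try:
        data = json.loads(api_response_text)
    except ValueError:
        return False
    return 'general' in data.get('query', {})

# A real checker would fetch something like
#   http://wiki.example.org/api.php?action=query&meta=siteinfo&format=json
# with urllib and pass the response body to looks_alive().
```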

Many of those wikis nevertheless fail to export, due to the usual
never-ending loops or other reasons. To improve the success rate, I've
added an --xmlrevisions option, which uses only the API, and I've
committed my usual skipping hacks as a --failfast option.

I've called this version 0.4 because the changes can be rather radical,
although everything should still behave the same if you don't use the
new options.
https://github.com/WikiTeam/wikiteam/issues/311
https://github.com/WikiTeam/wikiteam/commits/7c545d05b7effc240c8f20885dbcd7bad5632c94/dumpgenerator.py

A list from not-archived.py now returns 4113 wikis. I'd also like to
dump ~240k Wikia wikis, but --xmlrevisions still fails on them and the
list could be improved:
https://github.com/WikiTeam/wikiteam/pull/310

On OVH, which I've used this time, bandwidth and CPU are cheap, so my
bottleneck was mostly the disk (lesson learnt: better to spend a few
dozen euros on a bigger disk than to spend hours fighting disk limits).
I used a small patch to launcher.py and uploader.py so that the 7z files
are written to a separate partition. Can I commit such an option to the
main repository, or would it become too messy?

Federico

Emilio J. Rodríguez-Posada

May 16, 2018, 11:23:43 AM5/16/18
to wikiteam...@googlegroups.com
I am a bit busy, and Wikispaces is getting my attention now, but I hope I can help here in the summer.





Federico Leva (Nemo)

May 21, 2018, 2:52:36 PM5/21/18
to wikiteam...@googlegroups.com, Mark A. Hershberger
The round is mostly complete.
https://archive.org/search.php?query=subject%3A%22MediaWiki%22&sort=-publicdate

Many of the wikis in the list were actually dead. More lists of alive
non-farm wikis to archive would be welcome.

If I count correctly, 387 wikis that had never been archived before have
been archived so far. Some of those were already known, others not. I was
pleased to see this small wiki, which I didn't know about although it has
existed since 2010, because it means we've caught up on 8 years of history:
https://archive.org/details/wiki-vafudcom

This wiki was reported by a user on WikiApiary, so it seems this method
is helping.
https://wikiapiary.com/wiki/Special:Contributions/Lucazeo

Federico