[WARNING] Don't use dumpgenerator.py with API


Federico Leva (Nemo)

Nov 9, 2012, 5:27:03 AM
to wikiteam...@googlegroups.com, archiv...@googlegroups.com, mediaw...@lists.wikimedia.org, pywiki...@lists.wikimedia.org
It's completely broken:
https://code.google.com/p/wikiteam/issues/detail?id=56
It will download only a fraction of the wiki, 500 pages at most per
namespace.
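
For context on the 500-page ceiling: MediaWiki's API returns at most about 500 titles per list=allpages request for normal users, so a client has to follow the continuation parameter to enumerate a whole namespace. Below is a minimal sketch of that loop, assuming a wiki exposing api.php, the requests library, and the newer "continue" style of continuation (older releases use "query-continue"/"apfrom"); fetch_all_titles is an illustrative name, not part of dumpgenerator.py:

    import requests

    def fetch_all_titles(api_url, namespace=0):
        """Yield every page title in a namespace by following API continuation."""
        params = {
            "action": "query",
            "list": "allpages",
            "apnamespace": namespace,
            "aplimit": "max",  # the server caps this at ~500 for normal users
            "format": "json",
        }
        while True:
            data = requests.get(api_url, params=params).json()
            for page in data["query"]["allpages"]:
                yield page["title"]
            cont = data.get("continue")  # absent once the namespace is exhausted
            if not cont:
                break
            params.update(cont)  # carry the continuation token into the next request

Without that loop, a client stops after the first batch, which matches the truncation described above.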

Let me reiterate that
https://code.google.com/p/wikiteam/issues/detail?id=44 is a very urgent
bug and we've seen no work on it in many months. We need an actual
programmer with some knowledge of Python to fix it and make the script
work properly; I know there are several on this list (and elsewhere),
please please help. The last time I, as a non-coder, tried to fix a bug,
I made things worse
(https://code.google.com/p/wikiteam/issues/detail?id=26).

Only after the API mode is implemented/fixed will I be able to re-archive
the 4-5 thousand wikis we've recently archived on archive.org
(https://archive.org/details/wikiteam), and possibly many more. Many of
those dumps contain errors and/or are just partial because of the
script's unreliability, and wikis die on a daily basis. (So, quoting
emijrp, there IS a deadline.)

Nemo

P.s.: Cc'ing some lists out of desperation; sorry for cross-posting.

Hydriz Wikipedia

Nov 9, 2012, 10:21:41 AM
to wikiteam...@googlegroups.com, archiv...@googlegroups.com, mediaw...@lists.wikimedia.org, pywiki...@lists.wikimedia.org
Hi all,

I am beginning work on a port to PHP due to some issues regarding unit testing for another project of mine (if you follow me on GitHub, you will know). I hope to help out with fixing the script, but it would be a good idea to get someone who knows Python (pywikipedia-l people) and the MediaWiki API (mediawiki-api people) to help.
--
Regards,
Hydriz

We've created the greatest collection of shared knowledge in history. Help protect Wikipedia. Donate now: http://donate.wikimedia.org

Hydriz Wikipedia

Nov 9, 2012, 11:50:10 PM
to wikiteam...@googlegroups.com, archiv...@googlegroups.com, mediaw...@lists.wikimedia.org, pywiki...@lists.wikimedia.org
Scott,

Nemo is referring to dumpgenerator.py being broken on MediaWiki versions above 1.20; it should not affect older MediaWiki versions.

You can safely continue with your grab. :)
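
One quick way to check whether a given wiki is affected is to ask its API which MediaWiki version it runs before starting a grab. A minimal sketch, assuming the wiki exposes api.php and using the requests library; mediawiki_version is an illustrative helper, not part of dumpgenerator.py:

    import requests

    def mediawiki_version(api_url):
        """Return the MediaWiki version string reported by the wiki's API."""
        params = {
            "action": "query",
            "meta": "siteinfo",
            "siprop": "general",
            "format": "json",
        }
        data = requests.get(api_url, params=params).json()
        return data["query"]["general"]["generator"]  # e.g. "MediaWiki 1.19.2"

If the reported generator is 1.20 or older, the breakage described above reportedly does not apply.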

On Sat, Nov 10, 2012 at 12:45 PM, Scott Boyd <scot...@gmail.com> wrote:
At this link: https://code.google.com/p/wikiteam/issues/detail?id=56 , at the bottom, there is an entry by project member nemowiki that states:

    Comment 7 by project member nemowiki, Today (9 hours ago)
    Fixed by emijrp in r806. :-)
    Status: Fixed

So does that mean the "It's completely broken" problem is now fixed? I'm running a huge download of 64K+ page titles, and am now using the "r806" version of dumpgenerator.py. (The first 35K+ page titles were downloaded with an older version.) Both versions sure seem to be downloading MORE than 500 pages per namespace, but I'm not sure, since I don't know how you can tell if you are getting them all...

So is it fixed or not?


On Fri, Nov 9, 2012 at 4:27 AM, Federico Leva (Nemo) <nemo...@gmail.com> wrote:
It's completely broken: https://code.google.com/p/wikiteam/issues/detail?id=56
It will download only a fraction of the wiki, 500 pages at most per namespace.



