IMSLP dump


Luiz Augusto

Apr 7, 2014, 12:07:49 PM
to wikiteam...@googlegroups.com
The new IMSLP dump that I started on 6 January finished a few hours ago.

The XML part of the dump finished on 27 February.

The bad news:

* I got a corrupted XML dump:

$ grep "<title>" *.xml -c;grep "<page>" *.xml -c;grep "</page>" *.xml -c;grep "<revision>" *.xml -c;grep "</revision>" *.xml -c
463544
463544
463526
1418977
1418976
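Reading the counts above in order (`<title>`, `<page>`, `</page>`, `<revision>`, `</revision>`): 463544 pages were opened but only 463526 closed, so 18 `<page>` elements are unterminated, and one `<revision>` (1418977 opened, 1418976 closed) is unterminated as well. For anyone who prefers a single pass over the file instead of five greps, here is a minimal Python sketch of the same check. The function names are made up for illustration, and it counts matching lines the way `grep -c` does, not individual tag occurrences.

```python
TAGS = ("<title>", "<page>", "</page>", "<revision>", "</revision>")


def count_tag_lines(lines):
    """Count the lines that contain each tag, mirroring `grep -c`."""
    counts = {tag: 0 for tag in TAGS}
    for line in lines:
        for tag in TAGS:
            if tag in line:
                counts[tag] += 1
    return counts


def imbalance(counts):
    """Return opening-minus-closing line counts for page and revision."""
    return {
        "page": counts["<page>"] - counts["</page>"],
        "revision": counts["<revision>"] - counts["</revision>"],
    }
```

To run it on a dump, pass an open file object, e.g. `count_tag_lines(open(path, encoding="utf-8", errors="replace"))`, where `path` is whatever your XML file is called. A non-zero value from `imbalance` means the dump has unterminated elements.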

* I cannot run a new dump. In fact, I was only waiting for this dump to finish before taking a new wikibreak from WikiTeam tasks (I'm busy both on Wikimedia and in real life; dumping IMSLP is not only resource-intensive, it also needs a lot of babysitting).

By the way, I'm 7-zipping the generated files to upload to the Internet Archive, since they include 90% of valid XML revisions and 319925 media files.

Luiz Augusto

Apr 9, 2014, 12:10:34 AM
to wikiteam...@googlegroups.com
Upload finished: https://archive.org/details/wiki_imslp_org_20140106

Before this I also uploaded more dumps, which are still in the opensource collection. As usual, you can find them by searching for subject:wikiteam

Emilio J. Rodríguez-Posada

Apr 18, 2014, 9:10:00 AM
to wikiteam...@googlegroups.com
Thanks for this work, Luiz. Big dumps are hard to generate. Don't worry about the corrupted parts; if required, they can be trimmed.
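As a rough idea of what such trimming could look like, here is a minimal Python sketch. It is not an existing WikiTeam tool; the function name is hypothetical, and it assumes that `<page>` and `</page>` each sit on their own line, as they normally do in MediaWiki XML exports. It drops any page block that is never closed or whose `<revision>` tags do not balance, and passes everything else through unchanged.

```python
def trim_incomplete_pages(lines):
    """Yield lines, skipping <page> blocks that are unclosed or that
    contain unbalanced <revision> tags (hypothetical helper)."""
    buffer = None  # None means we are currently outside a <page> block
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>":
            # If a page was still open, it was never closed: its buffered
            # lines are discarded by starting a fresh buffer here.
            buffer = [line]
        elif buffer is not None:
            buffer.append(line)
            if stripped == "</page>":
                text = "".join(buffer)
                # Emit the page only if every revision in it was closed.
                if text.count("<revision>") == text.count("</revision>"):
                    yield from buffer
                buffer = None
        else:
            # Header, siteinfo and closing lines outside any page.
            yield line
```

One caveat: if the very last page in the file is truncated, everything after its `<page>` line is dropped, including the closing `</mediawiki>` tag, so that line would have to be re-added by hand. Always write the result to a new file and keep the original dump.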

