Wikia dump


Federico Leva (Nemo)

May 27, 2018, 9:40:02 AM
to wikiteam...@googlegroups.com
I'm almost done archiving all the Wikia wikis... again and not for the
last time.

One of the motives behind
https://github.com/WikiTeam/wikiteam/issues/311 is that dumpgenerator.py
ran out of memory or otherwise crashed on a number of Wikia wikis,
especially the most interesting ones.

So in 2015 we were not able to archive the wikis for which Wikia's own
dump had failed, 213 truncated dumps which Benjamin Mako Hill had listed:
<https://archive.org/details/wikia_dump_20141219>

I plan to archive the dumps in the same format as last time (ZIP files
containing thousands of 7z each), unless there are better ideas.

There's a long tail of small wikis where most data (in bytes) is from
the huge HTML of the main page... here's just the top 10k:
https://paste.fedoraproject.org/paste/UpYghEBr0Tlq-K7nrsiPiQ/raw

I could create individual items for the top 300 wikis or so, which
have about 1 GB of XML each when uncompressed:
https://paste.fedoraproject.org/paste/8Hj-wPoet7hn90YQ8ORYuQ/raw

These were run after
<https://github.com/WikiTeam/wikiteam/commit/73902d39c0d8043c6ebd62abddca377a0feb71b6>.
The earlier dumps may be incomplete. Once we prove that dumpgenerator
can reach the end with all the Wikia wikis, someone with a more powerful
machine may make a new run from scratch.

Federico

Federico Leva (Nemo)

Jun 3, 2018, 5:27:00 AM
to wikiteam...@googlegroups.com
Here we are, with 314310 wikis:
https://archive.org/details/wikia_dump_20180602

Federico

Emilio J. Rodríguez-Posada

Jun 5, 2018, 1:50:15 PM
to wikiteam...@googlegroups.com
Great work, thanks.

Why do you prefer a mega dump instead of creating an item per wiki?




Federico Leva (Nemo)

Jun 5, 2018, 2:43:07 PM
to wikiteam...@googlegroups.com, Emilio J. Rodríguez-Posada
Emilio J. Rodríguez-Posada, 05/06/2018 20:49:
> Why do you prefer a mega dump instead of creating an item per wiki?

It's just what has been done since the first snapshot uploaded by Aaron
Swartz. It should be useful for those who want an update for their
research purposes, such as Benjamin Mako Hill.

I also created items for the top 500 or so wikis, but creating 350k
items, mostly empty, doesn't seem very useful to me. What threshold do
you suggest?

Federico

Emilio J. Rodríguez-Posada

Jun 5, 2018, 4:09:22 PM
to Federico Leva (Nemo), wikiteam...@googlegroups.com
I would create an item per wiki, but that is just my point of view.