Wikipedia dumps downloader

emijrp

Jun 26, 2011, 8:53:15 AM
to wikiteam...@googlegroups.com, xmldata...@lists.wikimedia.org, Wikimedia Foundation Mailing List, Research into Wikimedia content and communities
Hi all;

Can you imagine a day when Wikipedia is added to this list?[1]

WikiTeam has developed a script[2] to download all the Wikipedia dumps (and those of its sister projects) from dumps.wikimedia.org. It sorts the files into folders and checks their md5sums. It only works on Linux (it uses wget).

You will need about 100GB to download all the 7z files.

Save our memory.

Regards,
emijrp

[1] http://en.wikipedia.org/wiki/Destruction_of_libraries
[2] http://code.google.com/p/wikiteam/source/browse/trunk/wikipediadownloader.py
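For a rough idea of what a downloader like [2] does, here is a sketch (not the real wikipediadownloader.py): list one dump's 7z files, fetch them with wget -c into a per-dump folder, and verify each against the published md5sums file. The project/date values and the md5sums filename pattern are illustrative assumptions.

#!/usr/bin/env python3
# Sketch only, NOT the actual wikipediadownloader.py: fetch one dump's .7z
# files with wget -c into a per-dump folder and check them against the
# published md5sums file. Project/date and the md5sums filename pattern
# are illustrative assumptions.
import hashlib
import os
import re
import subprocess
import urllib.request

BASE = "https://dumps.wikimedia.org"

def md5_of(path):
    # md5 of a local file, read in 1 MB chunks
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def download_dump(project, date):
    folder = f"{project}-{date}"
    os.makedirs(folder, exist_ok=True)
    base = f"{BASE}/{project}/{date}/"
    index = urllib.request.urlopen(base).read().decode("utf-8", "replace")
    sums = urllib.request.urlopen(f"{base}{project}-{date}-md5sums.txt").read().decode()
    expected = {}
    for line in sums.splitlines():          # lines look like "<hash>  <filename>"
        parts = line.split()
        if len(parts) == 2:
            expected[parts[1]] = parts[0]
    for name in sorted(set(re.findall(r'href="[^"]*?([^"/]+\.7z)"', index))):
        target = os.path.join(folder, name)
        subprocess.call(["wget", "-c", "-O", target, base + name])  # -c resumes
        if md5_of(target) != expected.get(name):
            print("MD5 mismatch:", name)

if __name__ == "__main__":
    download_dump("enwiki", "20110620")     # illustrative project/date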

emijrp

Jun 27, 2011, 7:07:51 AM
to Richard Farmbrough, wikiteam...@googlegroups.com, xmldata...@lists.wikimedia.org, Research into Wikimedia content and communities, Wikimedia Foundation Mailing List
Hi Richard;

Yes, a distributed project would probably be the best solution, but it is not easy to develop unless you use a library like BitTorrent (or similar) and you have many peers. However, most people don't seed the files for long, so sometimes it is better to depend on a few committed people than on a big but ephemeral crowd.

Regards,
emijrp

2011/6/26 Richard Farmbrough <ric...@farmbrough.co.uk>
It would be useful to have an archive of archives. I have to delete my old data dumps as time passes, for space reasons; however, a team could, between them, maintain multiple copies of every data dump. This would make a nice distributed project.


emijrp

Jun 27, 2011, 7:10:40 AM
to Research into Wikimedia content and communities, xmldata...@lists.wikimedia.org, wikiteam...@googlegroups.com, Wikimedia Foundation Mailing List, Platonides
Hi SJ;

You know that is an old item on our TODO list ;)

I heard that Platonides developed a script for that task a long time ago.

Platonides, are you there?

Regards,
emijrp

2011/6/27 Samuel Klein <sjk...@hcs.harvard.edu>
Thank you, Emijrp!

What about the dump of Commons images? [for those with 10 TB to spare]

SJ



--
Samuel Klein          identi.ca:sj           w:user:sj          +1 617 529 4266


Platonides

Jun 27, 2011, 5:32:48 PM
to emijrp, Research into Wikimedia content and communities, xmldata...@lists.wikimedia.org, wikiteam...@googlegroups.com, Wikimedia Foundation Mailing List
emijrp wrote:
> Hi SJ;
>
> You know that is an old item on our TODO list ;)
>
> I heard that Platonides developed a script for that task a long time ago.
>
> Platonides, are you there?
>
> Regards,
> emijrp

Yes, I am. :)

emijrp

Jun 28, 2011, 3:50:49 AM
to Platonides, Research into Wikimedia content and communities, xmldata...@lists.wikimedia.org, wikiteam...@googlegroups.com, Wikimedia Foundation Mailing List
Can you share your script with us?

2011/6/27 Platonides <plato...@gmail.com>

emijrp

Jun 28, 2011, 1:21:22 PM
to Research into Wikimedia content and communities, wikiteam...@googlegroups.com, xmldata...@lists.wikimedia.org, Wikimedia Foundation Mailing List
Hi;

@Derrick: I don't trust Amazon. Really, I don't trust the Wikimedia Foundation either. They can't and/or they don't want to provide image dumps (which is worse?). The community donates images to Commons, the community donates money every year, and now the community needs to develop software to extract all the images and pack them, and of course host them in a permanent way. Crazy, right?

@Milos: Instead of splitting the image dump by the first letter of the filenames, I thought about splitting it by upload date (YYYY-MM-DD). So the first chunks (2005-01-01) will be tiny, and the recent ones several GB each (for a single day).
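A minimal sketch of that date-based split (illustrative only; the (name, timestamp) pairs could come from the image table or the API):

# Sketch of the YYYY-MM-DD split: group files by upload day so each chunk,
# once written, never changes again. Input pairs are (name, MediaWiki-style
# timestamp "YYYYMMDDhhmmss"); their source is left open here.
from collections import defaultdict

def chunks_by_upload_day(files):
    buckets = defaultdict(list)
    for name, ts in files:
        day = f"{ts[0:4]}-{ts[4:6]}-{ts[6:8]}"  # e.g. "2005-01-01"
        buckets[day].append(name)
    return buckets  # e.g. pack each bucket into commons-<day>.7z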

Regards,
emijrp

2011/6/28 Derrick Coetzee <dcoe...@eecs.berkeley.edu>
As a Commons admin I've thought a lot about the problem of
distributing Commons dumps. As for distribution, I believe BitTorrent
is absolutely the way to go, but the Torrent will require a small
network of dedicated permaseeds (servers that seed indefinitely).
These can easily be set up at low cost on Amazon EC2 "small" instances
- the disk storage for the archives is free, since small instances
include a large (~120 GB) ephemeral storage volume at no additional
cost, and the cost of bandwidth can be controlled by configuring the
BitTorrent client with either a bandwidth throttle or a transfer cap
(or both). In fact, I think all Wikimedia dumps should be available
through such a distribution solution, just as all Linux installation
media are today.

Additionally, it will be necessary to construct (and maintain) useful
subsets of Commons media, such as "all media used on the English
Wikipedia" or "thumbnails of all images on Wikimedia Commons", which are
of particular interest to certain content reusers, since the full set is
far too large for most reusers. It's on this latter point that I want
your feedback: what useful subsets of Wikimedia Commons does the
research community want? Thanks for your feedback.

--
Derrick Coetzee
User:Dcoetzee, English Wikipedia and Wikimedia Commons administrator
http://www.eecs.berkeley.edu/~dcoetzee/
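On the subsets question above, here is a sketch of one way to enumerate "all media used on the English Wikipedia", assuming a toolserver-style SQL replica; the host and database names and the MySQLdb driver are illustrative assumptions.

# Sketch only: enumerate the file names used on the English Wikipedia from
# the `imagelinks` table of a database replica. Host/db names and the
# MySQLdb driver are assumptions; credentials come from ~/.my.cnf.
import os
import MySQLdb

def files_used_on_enwiki(batch=10000):
    db = MySQLdb.connect(host="enwiki-p.db.toolserver.org",  # illustrative host
                         db="enwiki_p",
                         read_default_file=os.path.expanduser("~/.my.cnf"))
    cur = db.cursor()
    last = ""
    while True:
        # il_to holds the target file name; walk it in keyset-paginated batches
        cur.execute("SELECT DISTINCT il_to FROM imagelinks "
                    "WHERE il_to > %s ORDER BY il_to LIMIT %s", (last, batch))
        rows = cur.fetchall()
        if not rows:
            break
        for (name,) in rows:
            yield name
        last = rows[-1][0]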

Platonides

Jun 28, 2011, 4:28:24 PM
to wikiteam...@googlegroups.com, Gregory Maxwell, Research into Wikimedia content and communities, xmldata...@lists.wikimedia.org, Wikimedia Foundation Mailing List
emijrp wrote:
> Hi;
>
> @Derrick: I don't trust Amazon.

I disagree. Note that we only need them to keep a redundant copy of a
file. If they tried to tamper with the file, we could detect it with the
hashes (which should be properly secured; that's no problem).

I'd like to have the hashes for the XML dump contents rather than for the
compressed files, though, so the data could be stored with better
compression without weakening the integrity check.
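A sketch of that idea, assuming the published hash covers the decompressed XML stream rather than the archive file (bz2 shown only for brevity):

# Sketch: hash the *decompressed* dump content so the archive can later be
# recompressed (bz2 -> 7z, etc.) without invalidating the published hash.
import bz2
import hashlib

def sha1_of_decompressed_bz2(path, chunk_size=1 << 20):
    h = hashlib.sha1()
    with bz2.BZ2File(path, "rb") as f:  # stream-decompress, never load it all
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. sha1_of_decompressed_bz2("somewiki-YYYYMMDD-pages-articles.xml.bz2")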

> Really, I don't trust the Wikimedia
> Foundation either. They can't and/or they don't want to provide image
> dumps (which is worse?).

The Wikimedia Foundation has provided image dumps several times in the past,
and also rsync 3 access to some individuals so that they could clone it.
It's like the enwiki history dump: an image dump is complex, and even
less useful.


> The community donates images to Commons, the community donates money
> every year, and now the community needs to develop software to extract
> all the images and pack them,

There's no *need* for that. In fact, such a script would be trivial to
write from the toolserver.

> and of course, host them in a permanent way. Crazy, right?

The WMF also tries hard not to lose images. We want to provide some
redundancy on our own. That's perfectly fine, but it's not a
requirement. Consider if the WMF were automatically deleting page
history older than a month, or images not used in any article. *That*
would be a real problem.


> @Milos: Instead of splitting the image dump by the first letter of the
> filenames, I thought about splitting it by upload date (YYYY-MM-DD).
> So the first chunks (2005-01-01) will be tiny, and the recent ones
> several GB each (for a single day).
>
> Regards,
> emijrp

I like that idea, since it means the dumps are static. They could be
placed on tape inside a safe and would not need to be taken out unless
data loss arises.

emijrp

Jun 28, 2011, 5:10:41 PM
to wikiteam...@googlegroups.com, Gregory Maxwell, Research into Wikimedia content and communities, xmldata...@lists.wikimedia.org, Wikimedia Foundation Mailing List
2011/6/28 Platonides <plato...@gmail.com>

> emijrp wrote:
> > Hi;
> >
> > @Derrick: I don't trust Amazon.
>
> I disagree. Note that we only need them to keep a redundant copy of a
> file. If they tried to tamper with the file, we could detect it with the
> hashes (which should be properly secured; that's no problem).


I didn't mean security problems. I meant files simply being deleted because of weird terms of service. Commons hosts a lot of images which can be problematic, like nudes or copyrighted material in some jurisdictions. They can delete whatever they want and close any account they want, and we will lose the backups. Period.

And we don't only need to keep a copy of every file. We need several copies everywhere, not only in the Amazon coolcloud.
 
> I'd like to have the hashes for the XML dump contents rather than for the compressed files, though, so the data could be stored with better compression without weakening the integrity check.


> > Really, I don't trust the Wikimedia
> > Foundation either. They can't and/or they don't want to provide image
> > dumps (which is worse?).
>
> The Wikimedia Foundation has provided image dumps several times in the
> past, and also rsync 3 access to some individuals so that they could
> clone it.

Ah, OK, that is enough (?). Then you are OK with old-and-broken XML dumps, because people can slurp all the pages with an API scraper.
 
> It's like the enwiki history dump: an image dump is complex, and even less useful.


It is not complex, just resource-consuming. If they need to buy another 10 TB of space and more CPU, they can. $16M was donated last year. They just need to put resources into the relevant things. The WMF always says "we host the 5th largest website in the world"; I say they need to act like it.

Less useful? I hope they don't need such a useless dump for recovering images, as happened in the past.
 

> > The community donates images to Commons, the community donates money
> > every year, and now the community needs to develop software to extract
> > all the images and pack them,
>
> There's no *need* for that. In fact, such a script would be trivial to
> write from the toolserver.

Ah, OK, so only people with a toolserver account may have access to an image dump. And you say it is trivial from the Toolserver but very complex from the Wikimedia main servers.

> > and of course, host them in a permanent way. Crazy, right?
>
> The WMF also tries hard not to lose images.

I hope so, but we remember a case of lost images.
 
> We want to provide some redundancy on our own. That's perfectly fine, but it's not a requirement.

That _is_ a requirement. We can't trust the Wikimedia Foundation. They lost images. They have problems generating English Wikipedia dumps and image dumps. They had a hardware failure some months ago in the RAID that hosts the XML dumps, and they didn't offer those dumps for months while trying to fix the crash.
 
> Consider if the WMF were automatically deleting page history older than
> a month, or images not used in any article. *That* would be a real problem.


You just don't understand how dangerous the current situation is (and it was worse in the past).
 

Platonides

Jun 28, 2011, 6:23:32 PM
to wikiteam...@googlegroups.com, emijrp, Gregory Maxwell, Research into Wikimedia content and communities, xmldata...@lists.wikimedia.org, Wikimedia Foundation Mailing List
emijrp wrote:
> I didn't mean security problems. I meant files simply being deleted
> because of weird terms of service. Commons hosts a lot of images which
> can be problematic, like nudes or copyrighted material in some
> jurisdictions. They can delete whatever they want and close any account
> they want, and we will lose the backups. Period.

Good point.


> And we don't only need to keep a copy of every file. We need several
> copies everywhere, not only in the Amazon coolcloud.

Sure. Relying *just* on Amazon would be very bad.

> The Wikimedia Foundation has provided image dumps several times in the
> past, and also rsync 3 access to some individuals so that they could
> clone it.
>
>
> Ah, OK, that is enough (?). Then you are OK with old-and-broken XML
> dumps, because people can slurp all the pages with an API scraper.

If everyone who wants it can get it, then it's enough. Not in a very
timely manner, though, but that could be fixed. I'm quite confident
that if RedIRIS rang me tomorrow offering 20 TB for hosting Commons
image dumps, it could be managed without too many problems.


> It's like the enwiki history dump. An image dump is complex, and
> even less useful.
>
> It is not complex, just resource-consuming. If they need to buy another
> 10 TB of space and more CPU, they can. $16M was donated last year. They
> just need to put resources into the relevant things. The WMF always says
> "we host the 5th largest website in the world"; I say they need to act like it.
>
> Less useful? I hope they don't need such a useless dump for recovering
> images, as happened in the past.

Yes, that seems sensible. You just need to convince them :)
But note that they are already building another datacenter and developing
a system with which they would keep a copy of every upload in both of
them. They are not so mean.


> The community donates images to Commons, the community donates money
> every year, and now the community needs to develop software to extract
> all the images and pack them,
>
>
> There's no *need* for that. In fact, such a script would be trivial to
> write from the toolserver.
>
> Ah, OK, so only people with a toolserver account may have access to an
> image dump. And you say it is trivial from the Toolserver but very
> complex from the Wikimedia main servers.

Come on. Making a script to download all images is trivial from the
toolserver; it's just not so easy using the API.
The complexity is in making a dump that *anyone* can download. And that's
a matter of resources, not a technical problem.
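A sketch of what that toolserver approach could look like: walk the Commons image table on a SQL replica and derive each file's public URL from MediaWiki's hashed upload layout (/<md5[0]>/<md5[0:2]>/<name>). The host and database names, and the MySQLdb driver, are assumptions.

# Sketch only: list Commons files from the `image` table on a replica and
# build their public URLs from MediaWiki's hashed upload path layout.
# Host/db names and the MySQLdb driver are assumptions; a real script
# would also percent-encode special characters in the file name.
import hashlib
import os
import MySQLdb

UPLOAD_BASE = "https://upload.wikimedia.org/wikipedia/commons"

def commons_file_urls(batch=10000):
    db = MySQLdb.connect(host="commonswiki-p.db.toolserver.org",  # illustrative
                         db="commonswiki_p",
                         read_default_file=os.path.expanduser("~/.my.cnf"))
    cur = db.cursor()
    last = ""
    while True:
        cur.execute("SELECT img_name, img_size, img_timestamp FROM image "
                    "WHERE img_name > %s ORDER BY img_name LIMIT %s",
                    (last, batch))
        rows = cur.fetchall()
        if not rows:
            break
        for name, size, ts in rows:
            if isinstance(name, bytes):
                name = name.decode("utf-8")
            digest = hashlib.md5(name.encode("utf-8")).hexdigest()
            yield f"{UPLOAD_BASE}/{digest[0]}/{digest[:2]}/{name}", size, ts
        last = rows[-1][0]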

> and of course, host them in a permanent way. Crazy, right?
> The WMF also tries hard not to lose images.
> I hope so, but we remember a case of lost images.

Yes. That's a reason for making copies, and I support that. But there's
a difference between "failures happen" and "WMF is not trying to keep
copies".


> We want to provide some redundancy on our own. That's perfectly
> fine, but it's not a requirement.
>
> That _is_ a requirement. We can't trust the Wikimedia Foundation. They lost
> images. They have problems generating English Wikipedia dumps and image
> dumps. They had a hardware failure some months ago in the RAID that
> hosts the XML dumps, and they didn't offer those dumps for months
> while trying to fix the crash.

> You just don't understand how dangerous the current situation is (and it
> was worse in the past).

The big problem is its huge size. If it were 2 MB, everyone and their
grandmother would keep a copy.
