It would be useful to have an archive of archives. I have to delete my old data dumps as time passes, for space reasons; however, a team could, between them, maintain multiple copies of every data dump. This would make a nice distributed project.
Thank you, Emijrp!
What about the dump of Commons images? [for those with 10TB to spare]
SJ
--
Samuel Klein identi.ca:sj w:user:sj +1 617 529 4266
Yes, I am. :)
As a Commons admin I've thought a lot about the problem of
distributing Commons dumps. As for distribution, I believe BitTorrent
is absolutely the way to go, but the torrent will require a small
network of dedicated permaseeds (servers that seed indefinitely).
These can easily be set up at low cost on Amazon EC2 "small" instances:
the disk storage for the archives is free, since small instances
include a large (~120 GB) ephemeral storage volume at no additional
cost, and the cost of bandwidth can be controlled by configuring the
BitTorrent client with either a bandwidth throttle or a transfer cap
(or both). In fact, I think all Wikimedia dumps should be available
through such a distribution solution, just as all Linux installation
media are today.
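For illustration, here is a minimal sketch (Python, standard library only) of how one could build the .torrent metainfo for a single dump chunk; the tracker URL, piece size and file name are placeholders I made up, not anything WMF actually runs:

#!/usr/bin/env python3
# Minimal sketch: build the .torrent metainfo for a single dump chunk.
# The tracker URL, piece size and file name below are placeholders.
import hashlib
import os

PIECE_SIZE = 4 * 1024 * 1024  # 4 MiB pieces keep the metainfo small for multi-GB dumps

def bencode(value):
    """Encode ints, bytes/str, lists and dicts in BitTorrent's bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        items = sorted(value.items())  # the spec requires sorted keys
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))

def make_torrent(path, announce="http://tracker.example.org/announce"):
    """Return the bencoded metainfo for the single file at `path`."""
    pieces = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(PIECE_SIZE)
            if not chunk:
                break
            pieces.append(hashlib.sha1(chunk).digest())  # one SHA-1 per piece
    info = {
        "name": os.path.basename(path),
        "length": os.path.getsize(path),
        "piece length": PIECE_SIZE,
        "pieces": b"".join(pieces),
    }
    return bencode({"announce": announce, "info": info})

if __name__ == "__main__":
    torrent = make_torrent("commonswiki-20110115-images-aa.tar")  # hypothetical chunk
    with open("commonswiki-20110115-images-aa.tar.torrent", "wb") as out:
        out.write(torrent)

A permaseed would then just load the resulting .torrent into whatever client it runs, and the client's own throttle or transfer-cap settings would keep the EC2 bandwidth bill under control.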
Additionally, it will be necessary to construct (and maintain) useful
subsets of Commons media, such as "all media used on the English
Wikipedia" or "thumbnails of all images on Wikimedia Commons", which are
of particular interest to certain content reusers, since the full set is
far too large to be of interest to most reusers. It's on this latter
point that I want your feedback: what useful subsets of Wikimedia
Commons does the research community want? Thanks.
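As a rough illustration of what the "thumbnails" subset could look like, here is a sketch that asks the public API for scaled renderings of one batch of Commons files; the parameter names reflect my reading of the MediaWiki API, and a real job would obviously need continuation handling and the actual downloads:

#!/usr/bin/env python3
# Rough sketch: thumbnail URLs for one batch of Commons files via the API.
# Single batch only; a real subset job would add continuation, rate limiting
# and the actual downloads.
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"

params = {
    "action": "query",
    "format": "json",
    "generator": "allimages",  # walk the File: pages on Commons
    "gailimit": "50",          # one small batch for the sketch
    "prop": "imageinfo",
    "iiprop": "url",
    "iiurlwidth": "256",       # ask the server for 256px-wide renderings
}

req = urllib.request.Request(
    API + "?" + urllib.parse.urlencode(params),
    headers={"User-Agent": "commons-subset-sketch/0.1 (mailing-list example)"},
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

for page in data.get("query", {}).get("pages", {}).values():
    for info in page.get("imageinfo", []):
        # 'thumburl' is the scaled rendering, 'url' the full-size original
        print(page["title"], info.get("thumburl"))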
--
Derrick Coetzee
User:Dcoetzee, English Wikipedia and Wikimedia Commons administrator
http://www.eecs.berkeley.edu/~dcoetzee/
I disagree. Note that we only need them to keep a redundant copy of a
file. If they tried to tamper with the file, we could detect it with the
hashes (which should be properly secured; that's no problem).
I'd like to have hashes of the XML dump content instead of the
compressed file, though, so it could easily be stored with better
compression without weakening the integrity check.
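To make that concrete, a small sketch of what I mean: hash the decompressed stream, so the same digest stays valid no matter how the dump is later recompressed (the file names here are just examples):

#!/usr/bin/env python3
# Sketch: hash the *decompressed* contents of a .bz2 dump, so re-compressing
# the dump later (bz2 -> 7z, etc.) never invalidates the published digest.
import bz2
import hashlib
import sys

def content_sha256(path, chunk_size=1 << 20):
    """SHA-256 of the decompressed contents of a .bz2 dump file."""
    digest = hashlib.sha256()
    with bz2.open(path, "rb") as stream:  # decompress on the fly
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    # e.g. python content_hash.py enwiki-20110115-pages-articles.xml.bz2
    for name in sys.argv[1:]:
        print(content_sha256(name), name)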
> Really, I don't trust the Wikimedia
> Foundation either. They can't and/or they don't want to provide image
> dumps (which is worse?).
The Wikimedia Foundation has provided image dumps several times in the past,
and also rsync access to some individuals so that they could clone it.
It's like the enwiki history dump. An image dump is complex, and even
less useful.
> The community donates images to Commons, the community donates money
> every year, and now the community needs to develop software to extract
> all the images and pack them,
There's no *need* for that. In fact, such a script would be trivial from
the toolserver.
> and of course, host them in a permanent way. Crazy, right?
The WMF also tries hard not to lose images. We want to provide some
redundancy on our own. That's perfectly fine, but it's not a
requirement. Consider that the WMF could be automatically deleting page
history older than a month, or images not used in any article. *That*
would be a real problem.
> @Milos: Instead of splitting the image dump by the first letter of the
> filenames, I thought about splitting it by upload date (YYYY-MM-DD).
> So the first chunks (2005-01-01) would be tiny, and recent ones several
> GB (a single day).
>
> Regards,
> emijrp
I like that idea, since it means the dumps are static. They could be
placed on tape inside a safe and would not need to be taken out unless
data loss arises.
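Something along these lines, for example (the manifest format and directory names are just assumptions made up for the sketch, not an existing dump layout):

#!/usr/bin/env python3
# Sketch: pack already-downloaded media files into one tar per upload day,
# so old chunks never change again. The manifest (CSV: filename,timestamp)
# and the directory names are assumptions made up for this example.
import csv
import os
import tarfile
from collections import defaultdict

MANIFEST = "uploads.csv"   # e.g. "Example.jpg,2005-01-01T12:34:56Z"
MEDIA_DIR = "media"
OUT_DIR = "daily-chunks"

by_day = defaultdict(list)
with open(MANIFEST, newline="") as f:
    for filename, timestamp in csv.reader(f):
        by_day[timestamp[:10]].append(filename)  # group by YYYY-MM-DD

os.makedirs(OUT_DIR, exist_ok=True)
for day, files in sorted(by_day.items()):
    # One static archive per day; a finished day can go straight to tape.
    with tarfile.open(os.path.join(OUT_DIR, "commons-%s.tar" % day), "w") as tar:
        for name in sorted(files):
            tar.add(os.path.join(MEDIA_DIR, name), arcname=name)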
emijrp wrote:

Hi;

@Derrick: I don't trust Amazon. Really, I don't trust the Wikimedia
Foundation either. They can't and/or they don't want to provide image
dumps (which is worse?). The community donates images to Commons, the
community donates money every year, and now the community needs to
develop software to extract all the images and pack them, and of course
host them in a permanent way. Crazy, right?

@Milos: Instead of splitting the image dump by the first letter of the
filenames, I thought about splitting it by upload date (YYYY-MM-DD). So
the first chunks (2005-01-01) would be tiny, and recent ones several GB
(a single day).

Regards,
emijrp
Good point.
> And we don't only need to keep a copy of every file. We need several
> copies everywhere, not only in the Amazon coolcloud.
Sure. Relying *just* on Amazon would be very bad.
> The Wikimedia Foundation has provided image dumps several times in the
> past, and also rsync access to some individuals so that they could
> clone it.
>
>
> Ah, OK, that is enough (?). Then you are OK with old-and-broken XML
> dumps, because people can slurp all the pages using an API scraper.
If everyone who wants it can get it, then it's enough. Not in a very
timely manner, perhaps, but that could be fixed. I'm quite confident
that if RedIRIS rang me tomorrow offering 20 TB for hosting Commons
image dumps, it could be managed without too many problems.
> It's like the enwiki history dump. An image dump is complex, and
> even less useful.
>
> It is not complex, just resource-consuming. If they need to buy another
> 10 TB of space and more CPU, they can. $16M were donated last year. They
> just need to put resources into the relevant stuff. WMF always says "we
> host the 5th website in the world"; I say that they need to act like it.
>
> Less useful? I hope they don't need such a useless dump for recovering
> images, as happened in the past.
Yes, that seems sensible. You just need to convince them :)
But note that they are already building another datacenter and developing
a system with which they would keep a copy of every upload in both of
them. They are not so mean.
> The community donates images to Commons, the community donates money
> every year, and now the community needs to develop software to extract
> all the images and pack them,
>
>
> There's no *need* for that. In fact, such a script would be trivial
> from the toolserver.
>
> Ah, OK, only people with a toolserver account may have access to an
> image dump. And you say it is trivial from the Toolserver and very
> complex from Wikimedia's main servers.
Come on. Making a script to download all images is trivial from the
toolserver. It's just not so easy using the API.
The complexity is in making a dump that *anyone* can download. And it's
a matter of resources, not a technical one.
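For comparison, the API route would look roughly like this sketch, which just walks Commons uploads in timestamp order and prints the original-file URLs; the parameter names are from my reading of the API, and a real mirror would also have to download, verify and resume:

#!/usr/bin/env python3
# Rough sketch of the API route: walk Commons uploads in timestamp order and
# print the original-file URLs. A real mirror job would also download the
# files, verify them against the checksums the API can also report, and
# resume after failures.
import json
import urllib.parse
import urllib.request

API = "https://commons.wikimedia.org/w/api.php"
HEADERS = {"User-Agent": "commons-mirror-sketch/0.1 (mailing-list example)"}

params = {
    "action": "query",
    "format": "json",
    "list": "allimages",
    "aisort": "timestamp",            # walk uploads oldest-first
    "aiprop": "url|timestamp|size",
    "ailimit": "500",
    "continue": "",
}

batches = 0
while True:
    req = urllib.request.Request(API + "?" + urllib.parse.urlencode(params), headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    for img in data["query"]["allimages"]:
        print(img["timestamp"], img["size"], img["url"])
    batches += 1
    if "continue" not in data or batches >= 3:  # stop early; this is only a sketch
        break
    params.update(data["continue"])             # standard API continuation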
> and of course, host them in a permanent way. Crazy, right?
> WMF also tries hard to not lose images.
> I hope so, but we remember a case of lost images.
Yes. That's a reason for making copies, and I support that. But there's
a difference between "failures happen" and "WMF is not trying to keep
copies".
> We want to provide some redundancy on our own. That's perfectly
> fine, but it's not a requirement.
>
> That _is_ a requirement. We can't trust the Wikimedia Foundation. They
> lost images. They have problems generating English Wikipedia dumps and
> image dumps. They had a hardware failure some months ago in the RAID
> that hosts the XML dumps, and they didn't offer those dumps for months
> while trying to fix the crash.
> You just don't understand how dangerous the current situation is (and it
> was worse in the past).
The big problem is its huge size. If it were 2 MB, everyone and their
grandmother would keep a copy.