Generate Metalinks with Google App Engine


Jack Bates

Aug 14, 2012, 2:30:47 AM
to metalink-...@googlegroups.com
Hi, what do you think about a Google App Engine app that generates Metalinks for URLs? Maybe something like this already exists?

The first time you visit, e.g. http://mintiply.appspot.com/http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2 it downloads the content and computes a digest. App Engine has *lots* of bandwidth, so this is snappy. Then it sends a response with "Digest: SHA-256=..." and "Location: ..." headers, similar to MirrorBrain

It also records the digest with Google's Datastore, so on subsequent visits, it doesn't download or recompute the digest

Finally, it also checks the Datastore for other URLs with matching digest, and sends "Link: <...>; rel=duplicate" headers for each of these. So if you visit, e.g. http://mintiply.appspot.com/http://mirror.nexcess.net/apache/trafficserver/trafficserver-3.2.0.tar.bz2 it sends "Link: <http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2>; rel=duplicate"
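
Roughly, the whole flow is just this (a simplified sketch, not the actual mintiply.py -- the Digest model, handler name and route here are made up for illustration):

import base64
import binascii
import hashlib
import webapp2
from google.appengine.api import urlfetch
from google.appengine.ext import ndb

class Digest(ndb.Model):
    # Hypothetical model: the entity's key id is the original URL
    sha256_hex = ndb.StringProperty()

class MetalinkHandler(webapp2.RequestHandler):
    def get(self, url):
        record = Digest.get_by_id(url)
        if record is None:
            # First visit: download the content and compute its digest
            content = urlfetch.fetch(url).content
            record = Digest(id=url, sha256_hex=hashlib.sha256(content).hexdigest())
            record.put()
        # RFC 3230 instance digest header, Base64 encoded, like MirrorBrain
        digest = base64.b64encode(binascii.unhexlify(record.sha256_hex))
        self.response.headers['Digest'] = 'SHA-256=' + digest
        self.response.headers['Location'] = url
        # RFC 6249: advertise every other known URL with the same content
        duplicates = ['<%s>; rel=duplicate' % other.key.id()
                      for other in Digest.query(Digest.sha256_hex == record.sha256_hex)
                      if other.key.id() != url]
        if duplicates:
            self.response.headers['Link'] = ', '.join(duplicates)
        self.response.set_status(302)

app = webapp2.WSGIApplication([webapp2.Route(r'/<url:.+>', MetalinkHandler)])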

The idea is that this could be useful for sites that don't yet generate Metalinks, like SourceForge. You could always prefix a URL that you pass to a Metalink client with "http://mintiply.appspot.com/" to get a Metalink. Alternatively, if a Metalink client noticed that it was downloading a large file without mirror or hash metadata, it could try to get more mirrors from this app, while it continued downloading the file. As long as someone else had previously tried the same URL, or App Engine can download the file faster than the client, then it should get more mirrors in time to help finish the download. Popular downloads should have the most complete list of mirrors, since these URLs should have been tried the most
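
For example, a client could read the metadata with a plain HEAD request, something like the following (assuming the app also answers HEAD requests):

import httplib

# Prefix the original URL with the app's URL and read the response headers,
# without following the redirect or downloading the file
original = 'http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2'
conn = httplib.HTTPConnection('mintiply.appspot.com')
conn.request('HEAD', '/' + original)
response = conn.getresponse()
print response.getheader('Digest')  # e.g. SHA-256=...
print response.getheader('Link')    # rel=duplicate mirrors, if any are known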

Right now it only downloads a URL once, and remembers the digest forever, which assumes that the content at the URL never changes. This is true for many downloads, but in future it could respect cache control headers

Also right now it only generates HTTP Metalinks with a whole file digest. But in future it could conceivably generate XML Metalinks with partial digests

A major limitation of this proof of concept is that I ran into App Engine errors with downloads of any significant size, like Ubuntu ISOs. The App Engine maximum response size is 32 MB. The app works around this by using byte ranges to download files in 32 MB segments. This works on my local machine with the App Engine dev server, but in production Google apparently kills the process after downloading just a few segments, because it uses too much memory. This seems wrong, since the app throws away each segment after adding it to the digest, so if it has enough memory to download one segment, it shouldn't require any more memory for additional segments. Maybe this could be worked around by manually calling the Python garbage collector, or by shrinking the segment size...
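
For reference, the segmented approach is roughly this (a simplified sketch of the idea, not the exact code; the file length would come from an earlier Content-Length check):

import hashlib
from google.appengine.api import urlfetch

SEGMENT = 32 * 1024 * 1024  # App Engine's URL Fetch response size limit

def segmented_digest(url, length):
    # Only one 32 MB segment should ever be in memory at a time, since each
    # segment is discarded as soon as it has been added to the digest
    sha256 = hashlib.sha256()
    offset = 0
    while offset < length:
        end = min(offset + SEGMENT, length) - 1
        result = urlfetch.fetch(url, headers={'Range': 'bytes=%d-%d' % (offset, end)})
        sha256.update(result.content)
        offset = end + 1
    return sha256.hexdigest()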

Also I ran into a second bug with App Engine URL Fetch and downloads of any significant size: http://code.google.com/p/googleappengine/issues/detail?id=7732#c6

Another thought is whether any web crawlers already maintain a database of digests that an app like this could exploit?

Here is the code: https://github.com/jablko/mintiply/blob/master/mintiply.py

What are your thoughts? Maybe something like this already exists, or was already tried in the past...

Bram Neijt

Aug 14, 2012, 4:58:22 PM
to metalink-...@googlegroups.com
Hi Jack,

I once created a similar thing, but it required the "owner" of the
file to host the MD5 he/she thinks it should be. It then generates a
metalink based on all the md5/sha1/sha256 hashes in the database.

The idea is that anybody can step up and start a mirror by hosting the
files and the MD5SUMS and have the service spider the MD5SUMS file.
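
Spidering an MD5SUMS file is simple enough, roughly something like this (a sketch, not the actual dynmirror code):

import urllib2

def spider_md5sums(md5sums_url):
    # Yield (md5, absolute file URL) pairs from an MD5SUMS file, resolving
    # each filename relative to the location of the MD5SUMS file itself
    base = md5sums_url.rsplit('/', 1)[0] + '/'
    for line in urllib2.urlopen(md5sums_url).read().splitlines():
        parts = line.split()
        if len(parts) == 2:
            md5, filename = parts
            yield md5, base + filename.lstrip('*')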

You can find the service at: http://www.dynmirror.net/

It might be a good idea to join up the databases or do some
collaboration somewhere. Let's see what we can do. For instance, I
could add a mintiply url collection or something like that? Or maybe I
could have dynmirror register the hash/link combinations at mintiply?

Let me know what you think. Currently, I think I'm the only user of
dynmirror.net (at http://www.logfish.net/pr/ccbuild/downloads/ ).

I'd also be happy to dig up and publish the code somewhere if I haven't already.

Greets,

Bram

Sundaram Ananthanarayanan

Aug 15, 2012, 7:50:11 AM
to metalink-...@googlegroups.com
Hi Jack!

I really liked your idea. I think it's cool :-)

1. I tried the following URL on your server and it didn't give the correct SHA-256 hash. Maybe it was my mistake. Anyway, do check it out. URL: http://www.abisource.com/downloads/abiword/2.4.6/Windows/abiword-setup-2.4.6.exe. The expected SHA-256 hash is 685a82ca2a9c56861e5ca22b38e697791485664c36ad883f410dac9e96d09f62.

2. I was just wondering if you send a cache-control header when requesting huge files. I didn't find one in your code. Cache usage is probably included in Google's daily quota, and most download managers (to my knowledge) switch off caching during file downloads. It may not apply to your problem, but it's just one line of code, so it wouldn't hurt to try: add a header with the name "Cache-Control" and the value "no-cache". Please ignore this if you have already tried it.
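
With App Engine's urlfetch it should be just one extra argument, something like (untested on my side):

from google.appengine.api import urlfetch

# url is the file being downloaded; the only change is the extra header
result = urlfetch.fetch(url, headers={'Cache-Control': 'no-cache'})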

Thanks.

Neil M.

Aug 15, 2012, 1:07:58 PM
to metalink-...@googlegroups.com
> Another thought is whether any web crawlers already maintain a database of
> digests that an app like this could exploit?
>
> Here is the code:
> https://github.com/jablko/mintiply/blob/master/mintiply.py
>
> What are your thoughts? Maybe something like this already exists, or was
> already tried in the past...

I've written a metalink crawler for .metalink files. It's pretty dumb but
it gets the job done. The code is available here:

http://metalinks.svn.sourceforge.net/viewvc/metalinks/crawler/

You can see the results here:

http://www.nabber.org/projects/metalink/crawler/list.php

I imagine it wouldn't be hard to modify it so that, instead of grabbing the
.metalink files, it parses them and dumps them into your database. One
advantage of this method is that any URLs that are now dead are still captured
in the .metalink files, so your AppEngine code could detect that and redirect a
"dumb" browser to a working download location instead.

As for a hash database, I've been researching options for my Appupdater
project. There are some hash search type sites out there, but I don't think
they will be useful in this case since I haven't seen any that track URLs;
it's usually just file size, version, product name, etc. There seem to be
plenty of datasets out there for installers from the various download
websites, like sourceforge.net, softpedia, oldapps.com, etc. However, from
what I can tell there is no way to download a database from any of these;
you'd have to parse the individual web pages. While possible, that doesn't
seem to be a very efficient way of doing things, since you'd need to customize
it for each website. Probably the better and easier way is to build
a .exe, .msi, etc. crawler, download the files, and compute your own hashes.
It will take a lot of time and bandwidth, but you'd get a really good
dataset that way. In other words, have a crawler that feeds your AppEngine
code URLs to process.

Neil

Jack Bates

Aug 16, 2012, 6:42:27 AM
to metalink-...@googlegroups.com
On Wednesday, August 15, 2012 4:50:11 AM UTC-7, Sundaram Ananthanarayanan wrote:
> Hi Jack!
>
> I really liked your idea. I think it's cool :-)

Hi Sundaram and thanks for your encouragement!

> 1. I tried the following URL on your server and it didn't give the correct SHA-256 hash. Maybe it was my mistake. Anyway, do check it out. URL: http://www.abisource.com/downloads/abiword/2.4.6/Windows/abiword-setup-2.4.6.exe. The expected SHA-256 hash is 685a82ca2a9c56861e5ca22b38e697791485664c36ad883f410dac9e96d09f62.

I think this is because the "Digest: SHA-256=..." header is Base64 encoded. The header I get is "Digest: SHA-256=aFqCyiqcVoYeXKIrOOaXeRSFZkw2rYg/QQ2snpbQn2I=" which I think is equivalent to the hash you expected:

>>> import base64, binascii
>>> binascii.hexlify(base64.b64decode('aFqCyiqcVoYeXKIrOOaXeRSFZkw2rYg/QQ2snpbQn2I='))
'685a82ca2a9c56861e5ca22b38e697791485664c36ad883f410dac9e96d09f62'
>>>

> 2. I was just wondering if you send a cache-control header when requesting huge files. I didn't find one in your code. Cache usage is probably included in Google's daily quota, and most download managers (to my knowledge) switch off caching during file downloads. It may not apply to your problem, but it's just one line of code, so it wouldn't hurt to try: add a header with the name "Cache-Control" and the value "no-cache". Please ignore this if you have already tried it.

I haven't tried this, thanks for suggesting it. However, looking at the "no-cache" directive, I think it means that intermediate caches must not use a cached copy when responding to the request. Since a purpose of the app is to speed up downloads, it needs to get the content and compute the digest as quickly as possible, and maybe an intermediate cache could actually help. Assuming an intermediate cache could possibly accelerate the download or save some bandwidth, wouldn't it be better *not* to send "Cache-Control: no-cache"?

I haven't checked whether App Engine URL Fetch has an associated HTTP cache, and if it does, whether it counts against a quota. But that's a good idea to check

On the subject of "Cache-Control: no-cache", if a client sends this header to the app, I wonder if the app should get the content again and recompute the digest

> Thanks.

Thank you for giving feedback

Jack Bates

Aug 17, 2012, 1:44:19 AM
to metalink-...@googlegroups.com


On Tuesday, August 14, 2012 1:58:22 PM UTC-7, Bram Neijt wrote:
> Hi Jack,
>
> I once created a similar thing, but it required the "owner" of the
> file to host the MD5 he/she thinks it should be. It then generates a
> metalink based on all the md5/sha1/sha256 hashes in the database.
>
> The idea is that anybody can step up and start a mirror by hosting the
> files and the MD5SUMS and have the service spider the MD5SUMS file.
>
> You can find the service at: http://www.dynmirror.net/

Cool! The design of this site is impressive. I like how it shows analytics, like recent downloads, on the front page

> It might be a good idea to join up the databases or do some
> collaboration somewhere. Let's see what we can do. For instance, I
> could add a mintiply url collection or something like that? Or maybe I
> could have dynmirror register the hash/link combinations at mintiply?

Great idea, thanks for suggesting it. The first thing that comes to mind is, how would you like to get data out of Mintiply (and into Dynmirror)? Is there an API that Mintiply could provide that would make this as easy as possible?

> Let me know what you think. Currently, I think I'm the only user of
> dynmirror.net (at http://www.logfish.net/pr/ccbuild/downloads/ ).
>
> I'd also be happy to dig up and publish the code somewhere if I haven't already.
>
> Greets,
>
> Bram

Thanks very much for inviting me to collaborate


Jack Bates

Aug 17, 2012, 4:00:22 AM
to metalink-...@googlegroups.com
On Wednesday, August 15, 2012 10:07:58 AM UTC-7, Neil M. wrote:
> Another thought is whether any web crawlers already maintain a database of
> digests that an app like this could exploit?
>
> Here is the code:
> https://github.com/jablko/mintiply/blob/master/mintiply.py
>
> What are your thoughts? Maybe something like this already exists, or was
> already tried in the past...

> I've written a metalink crawler for .metalink files. It's pretty dumb but
> it gets the job done. The code is available here:
>
> http://metalinks.svn.sourceforge.net/viewvc/metalinks/crawler/
>
> You can see the results here:
>
> http://www.nabber.org/projects/metalink/crawler/list.php
>
> I imagine it wouldn't be hard to modify it so that, instead of grabbing the
> .metalink files, it parses them and dumps them into your database. One
> advantage of this method is that any URLs that are now dead are still captured
> in the .metalink files, so your AppEngine code could detect that and redirect a
> "dumb" browser to a working download location instead.

Interesting idea, and thanks for writing this Metalink crawler

> As for a hash database, I've been researching options for my Appupdater
> project. There are some hash search type sites out there, but I don't think
> they will be useful in this case since I haven't seen any that track URLs;
> it's usually just file size, version, product name, etc. There seem to be
> plenty of datasets out there for installers from the various download
> websites, like sourceforge.net, softpedia, oldapps.com, etc. However, from
> what I can tell there is no way to download a database from any of these;
> you'd have to parse the individual web pages. While possible, that doesn't
> seem to be a very efficient way of doing things, since you'd need to customize
> it for each website. Probably the better and easier way is to build
> a .exe, .msi, etc. crawler, download the files, and compute your own hashes.
> It will take a lot of time and bandwidth, but you'd get a really good
> dataset that way. In other words, have a crawler that feeds your AppEngine
> code URLs to process.

I agree. Thanks a lot for sharing your experience researching options for Appupdater

> Neil

Jack Bates

Aug 19, 2012, 3:58:24 AM
to metalink-...@googlegroups.com
On Thursday, August 16, 2012 10:44:19 PM UTC-7, Jack Bates wrote:
> On Tuesday, August 14, 2012 1:58:22 PM UTC-7, Bram Neijt wrote:
>> Hi Jack,
>>
>> I once created a similar thing, but it required the "owner" of the
>> file to host the MD5 he/she thinks it should be. It then generates a
>> metalink based on all the md5/sha1/sha256 hashes in the database.
>>
>> The idea is that anybody can step up and start a mirror by hosting the
>> files and the MD5SUMS and have the service spider the MD5SUMS file.
>>
>> You can find the service at: http://www.dynmirror.net/
>
> Cool! The design of this site is impressive. I like how it shows analytics, like recent downloads, on the front page
>
>> It might be a good idea to join up the databases or do some
>> collaboration somewhere. Let's see what we can do. For instance, I
>> could add a mintiply url collection or something like that? Or maybe I
>> could have dynmirror register the hash/link combinations at mintiply?
>
> Great idea, thanks for suggesting it. The first thing that comes to mind is, how would you like to get data out of Mintiply (and into Dynmirror)? Is there an API that Mintiply could provide that would make this as easy as possible?

Hi Bram and thanks again for inviting me to collaborate,

As an experiment, I just added a page to export all of the data from Mintiply, in Metalink format. Let me know what you think. Could this be useful to a project like Dynmirror? or would you prefer a different format, or different data?

There isn't much data in the app yet, so dumping everything in one Metalink response works fine. If the amount of data ever gets large, we may need to rethink this

Here is the page: http://mintiply.appspot.com/export
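
Conceptually the export just walks the Datastore records and emits one <file> per digest, something like this (a simplified sketch, not the actual handler code):

from xml.sax.saxutils import escape, quoteattr

def export_metalink(records):
    # records is an iterable of (sha256_hex, [urls]) pairs from the Datastore;
    # each digest becomes one <file> listing every URL known to share it
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<metalink xmlns="urn:ietf:params:xml:ns:metalink">']
    for sha256_hex, urls in records:
        lines.append('  <file name=%s>' % quoteattr(urls[0].rsplit('/', 1)[-1]))
        lines.append('    <hash type="sha-256">%s</hash>' % sha256_hex)
        for url in urls:
            lines.append('    <url>%s</url>' % escape(url))
        lines.append('  </file>')
    lines.append('</metalink>')
    return '\n'.join(lines)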

Bram Neijt

Aug 19, 2012, 5:15:46 PM
to metalink-...@googlegroups.com
A single page export will not work, for sure. But as for that, I was
actually thinking about moving data out of dynmirror into mintiply.

For example, if you don't want to download the complete file before
you have a metalink, you could check at
http://www.dynmirror.net/metalink/?url=http://example.com
to see if dynmirror has any metalink information. You could use
dynmirror as a kind of caching backend for downloads.
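
In Python, that check would be roughly this (a rough sketch, not tested against the live service):

import urllib
import urllib2

def dynmirror_metalink(url):
    # Ask dynmirror whether it already has metalink information for this URL,
    # before downloading and hashing the file yourself
    query = 'http://www.dynmirror.net/metalink/?url=' + urllib.quote(url, safe='')
    try:
        return urllib2.urlopen(query).read()
    except urllib2.HTTPError:
        return None  # nothing known yet; fall back to downloading the file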

Another thing I could do is have dynmirror redirect to mintiply if
there is no hash information available, maybe that would be a good
approach...

I'm not really sure it would add anything, but technically it should
be possible and I think it might be good to get some code commits on
dynmirror anyway ;)

Greets,

Bram

Jack Bates

Aug 21, 2012, 1:45:46 AM
to metalink-...@googlegroups.com
On Sunday, August 19, 2012 2:15:46 PM UTC-7, Bram Neijt wrote:
> A single page export will not work, for sure. But as for that, I was
> actually thinking about moving data out of dynmirror into mintiply.
>
> For example, if you don't want to download the complete file before
> you have a metalink, you could check at
> http://www.dynmirror.net/metalink/?url=http://example.com
> to see if dynmirror has any metalink information. You could use
> dynmirror as a kind of caching backend for downloads.
>
> Another thing I could do is have dynmirror redirect to mintiply if
> there is no hash information available, maybe that would be a good
> approach...
>
> I'm not really sure it would add anything, but technically it should
> be possible and I think it might be good to get some code commits on
> dynmirror anyway ;)

That sounds like a good idea. Please let me know if there's anything I can do to help with this

Cheers


Bram Neijt

Oct 3, 2012, 1:47:06 PM
to metalink-...@googlegroups.com
Hi everybody,

I took the time to look up my code and found out I never published
dynmirror.net.

The code is now online at https://github.com/bneijt/dynmirror.net

I'll still have to publish correct licensing information, etc., and find
a good way to clean up having jinja2 in the git repo as well, but as I
have a few other projects going on, I don't think I'll get to that any
time soon. If you have any questions regarding the code, feel free to
mail me directly.

Greets,

Bram