Hi, what do you think about a Google App Engine app that generates Metalinks for URLs? Maybe something like this already exists?
The first time you visit, e.g.
http://mintiply.appspot.com/http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2, it downloads the content and computes a digest. App Engine has *lots* of bandwidth, so this is snappy. It then sends a response with "Digest: SHA-256=..." and "Location: ..." headers, similar to MirrorBrain.
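
To make that concrete, the heart of the handler is something like this (a simplified sketch, not the actual mintiply.py; the class name and routing are just for illustration, and caching and error handling are left out):

import base64
import hashlib

import webapp2
from google.appengine.api import urlfetch


class Mint(webapp2.RequestHandler):
    def get(self, url):
        # Fetch the target URL and hash its content
        result = urlfetch.fetch(url, deadline=60)
        digest = base64.b64encode(hashlib.sha256(result.content).digest())
        # RFC 3230 instance digest: base64 of the binary SHA-256 hash
        self.response.headers['Digest'] = 'SHA-256=' + digest
        # Redirect the client back to the original URL, like MirrorBrain
        self.response.headers['Location'] = url
        self.response.set_status(302)


app = webapp2.WSGIApplication([('/(.+)', Mint)])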
It also records the digest in Google's Datastore, so on subsequent visits it doesn't download the content or recompute the digest.
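
The Datastore side is basically just a model keyed by the URL; sketched with ndb (the real code may differ), it looks roughly like:

import base64
import hashlib

from google.appengine.api import urlfetch
from google.appengine.ext import ndb


class Resource(ndb.Model):
    # Keyed by URL; digest holds the base64 SHA-256 value
    digest = ndb.StringProperty()


def get_or_compute_digest(url):
    resource = Resource.get_by_id(url)
    if resource is None:
        # First visit: download, hash, and remember the digest
        result = urlfetch.fetch(url, deadline=60)
        digest = base64.b64encode(hashlib.sha256(result.content).digest())
        Resource(id=url, digest=digest).put()
        return digest
    # Subsequent visits: no download, no recomputation
    return resource.digest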
Finally, it also checks the Datastore for other URLs with a matching digest, and sends a "Link: <...>; rel=duplicate" header for each of them. So if you visit, e.g.
http://mintiply.appspot.com/http://mirror.nexcess.net/apache/trafficserver/trafficserver-3.2.0.tar.bz2, it sends "Link: <http://apache.osuosl.org/trafficserver/trafficserver-3.2.0.tar.bz2>; rel=duplicate".
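
Sketched the same way, the duplicate lookup is just a query on the digest property of the Resource model above:

def duplicate_link_header(url, digest):
    # Every other URL we've seen with the same digest becomes a rel=duplicate link
    links = ['<%s>; rel=duplicate' % other.key.id()
             for other in Resource.query(Resource.digest == digest)
             if other.key.id() != url]
    return ', '.join(links)  # one comma-separated Link header value

The handler then sets that string as the Link response header.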
The idea is that this could be useful for sites that don't yet generate Metalinks, like SourceForge. You could always prefix a URL that you pass to a Metalink client with "http://mintiply.appspot.com/" to get a Metalink. Alternatively, if a Metalink client noticed that it was downloading a large file without mirror or hash metadata, it could try to get more mirrors from this app while it continued downloading the file. As long as someone else has previously tried the same URL, or App Engine can download the file faster than the client, it should get more mirrors in time to help finish the download. Popular downloads should have the most complete list of mirrors, since those URLs should have been tried the most.
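
For example, a client could pick up the extra metadata like this, without following the redirect (illustrative only; the target URL here is made up):

import httplib

# Prefix the target URL with http://mintiply.appspot.com/ and read the headers
conn = httplib.HTTPConnection('mintiply.appspot.com')
conn.request('GET', '/http://example.org/pub/some-big-download.tar.gz')
response = conn.getresponse()
print response.getheader('Digest')    # e.g. SHA-256=...
print response.getheader('Link')      # e.g. <http://...>; rel=duplicate, ...
print response.getheader('Location')  # back to the original URL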
Right now it only downloads a URL once and remembers the digest forever, which assumes that the content at the URL never changes. This is true for many downloads, but in the future it could respect cache-control headers.
Also, right now it only generates HTTP Metalinks with a whole-file digest, but in the future it could conceivably generate XML Metalinks with partial digests.
A major limitation of this proof of concept is that I ran into App Engine errors with downloads of any significant size, like Ubuntu ISOs. The App Engine maximum response size is 32 MB, so the app works around it by requesting byte ranges and downloading files in 32 MB segments. This works on my local machine with the App Engine dev server, but in production Google apparently kills the process after downloading just a few segments, because it uses too much memory. This seems wrong, since the app throws away each segment after adding it to the digest, so if it has enough memory to download one segment, it shouldn't need any more memory for additional segments. Maybe this could be worked around by manually calling the Python garbage collector, or by shrinking the segment size...
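
For reference, the segmented download is roughly this (a simplified sketch; the gc.collect() call is the proposed workaround, and I'm not sure it actually helps):

import gc
import hashlib

from google.appengine.api import urlfetch

SEGMENT = 32 * 1024 * 1024  # URL Fetch's maximum response size


def segmented_digest(url):
    # Hash the file 32 MB at a time with Range requests, discarding each
    # segment as soon as it has been added to the digest
    sha = hashlib.sha256()
    offset = 0
    while True:
        result = urlfetch.fetch(url, deadline=60, headers={
            'Range': 'bytes=%d-%d' % (offset, offset + SEGMENT - 1)})
        if result.status_code not in (200, 206):
            break  # e.g. 416 once we've read past the end of the file
        sha.update(result.content)
        offset += len(result.content)
        if result.status_code == 200 or len(result.content) < SEGMENT:
            break  # the whole file came back, or this was the final segment
        del result
        gc.collect()  # proposed workaround: force a collection per segment
    return sha.digest()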
Also I ran into a second bug with App Engine URL Fetch and downloads of any significant size:
http://code.google.com/p/googleappengine/issues/detail?id=7732#c6

Another thought: do any web crawlers already maintain a database of digests that an app like this could exploit?
Here is the code:
https://github.com/jablko/mintiply/blob/master/mintiply.py

What are your thoughts? Maybe something like this already exists, or was already tried in the past...