Apache Traffic Server caching proxy


Jack Bates

Mar 22, 2012, 8:04:48 AM
to Metalink Discussion
Do you know of any discussion or work to support Metalink in the
Apache Traffic Server [1] caching proxy?

I found this Google Summer of Code idea [2] ("Metalink integration
with Proxy/Cache" heading) for intermediate caching proxies to support
Metalink

And I found these two related Metalink discussion threads [3] [4]

I think a minimum viable product would add support for RFC 6249:

1. If the response status code is 3XX
2. Scan "Link: <...>; rel=duplicate" headers for a URL that already
exists in the cache
3. If found, update the "Location: ..." header with this URL and pass
on the response
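In rough C, the decision in steps 2 and 3 might look like the sketch below. is_cached() and the URL list are stand-ins for the real cache lookup and the parsed Link header values:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical cache predicate: a real implementation would consult the
 * proxy's cache. Here a fixed list stands in for the cache contents. */
static const char *cached_urls[] = {
    "http://mirror.example.net/file.iso",
    NULL,
};

static int is_cached(const char *url)
{
    for (const char **p = cached_urls; *p; p++)
        if (strcmp(*p, url) == 0)
            return 1;
    return 0;
}

/* Given the Location URL and the NULL-terminated list of rel=duplicate
 * URLs, return the URL the Location header should carry. */
static const char *pick_location(const char *location,
                                 const char *const *duplicates)
{
    if (is_cached(location))
        return location;  /* already a cache hit, leave it alone */
    for (const char *const *p = duplicates; *p; p++)
        if (is_cached(*p))
            return *p;    /* rewrite to the cached duplicate */
    return location;      /* no cached duplicate: pass on unchanged */
}
```

If no duplicate is cached, the response passes through unchanged, so the worst case is the same as today's behavior.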

I volunteer here at the Agahozo-Shalom Youth Village in Rwanda. We use
Apache Traffic Server to overcome slow/unreliable downloads (and save
bandwidth). But because many sites distribute the same files from
different mirrors, users can't predict whether a download will be
fast (cache hit) or slow (cache miss), which is frustrating:
following the same link may be fast one time and slow the next

I think Apache Traffic Server could achieve better user experience (in
some cases) by implementing RFC 6249

[1] http://trafficserver.apache.org/
[2] http://sourceforge.net/apps/trac/metalinks/wiki/GsocIdeas
[3] http://groups.google.com/group/metalink-discussion/browse_thread/thread/b59e5d99cf879529/f364acba525c7108
[4] http://groups.google.com/group/metalink-discussion/browse_thread/thread/1849486ced99e778/7c2afbe0f3f8c41

Anthony Bryan

Mar 22, 2012, 3:29:47 PM
to metalink-...@googlegroups.com
On Thu, Mar 22, 2012 at 8:04 AM, Jack Bates <jack....@gmail.com> wrote:
> Do you know of any discussion or work to support Metalink in the
> Apache Traffic Server [1] caching proxy?

no, & nothing comes up in searches...

> I found this Google Summer of Code idea [2] ("Metalink integration
> with Proxy/Cache" heading) for intermediate caching proxies to support
> Metalink
>
> And I found these two related Metalink discussion threads [3] [4]
>
> I think a minimum viable product would add support for RFC 6249:
>
>  1. If the response status code is 3XX
>  2. Scan "Link: <...>; rel=duplicate" headers for a URL that already
> exists in the cache

2b. Scan "Digest:" header fields for matches? That is, files with the same hash.

>  3. If found, update "Location: ..." header with this URL and pass on
> response
>
> I volunteer here at the Agahozo-Shalom Youth Village in Rwanda. We use
> Apache Traffic Server to overcome slow/unreliable downloads (and save
> bandwidth). But because many sites distribute the same files from
> different mirrors, it's frustrating for users to predict whether a
> download will be fast (cache hit) or slow (cache miss). Following the
> same link may be fast one time and slow the next
>
> I think Apache Traffic Server could achieve better user experience (in
> some cases) by implementing RFC 6249

do you live in Rwanda? you're currently a college student?

I agree! this would be an excellent solution. please apply! :)

but first, you'll want to contact the project, since no one from
metalink has been in contact with them. we'd need someone from their
project that is familiar w/ it to mentor the project.

try an introductory post on their dev list:

"Impress developers or help others by participating on our dev
discussion list or follow the latest development on our commits list.

Report issues or bring patches to our Bug Tracker"


--
(( Anthony Bryan ... Metalink [ http://www.metalinker.org ]
  )) Easier, More Reliable, Self Healing Downloads

Jack Bates

Mar 24, 2012, 2:01:03 AM
to Metalink Discussion
On Mar 22, 12:29 pm, Anthony Bryan <anthonybr...@gmail.com> wrote:
> On Thu, Mar 22, 2012 at 8:04 AM, Jack Bates <jack.ba...@gmail.com> wrote:
> do you live in Rwanda? you're currently a college student?

Yes, I live here in Rwanda, but only since December. And I am a
college student

> but first, you'll want to contact the project, since no one from
> metalink has been in contact with them. we'd need someone from their
> project that is familiar w/ it to mentor the project

Thank you very much Anthony! I will contact the Apache Traffic Server
dev list

Jack Bates

May 18, 2012, 7:41:58 AM
to Metalink Discussion
Hi, I started work on a plugin for Apache Traffic Server and I would
love any feedback (and maybe implementation advice) from the Metalink
community

Traffic Server is a caching proxy and the goal of this plugin is to
help it work better with files distributed from multiple mirrors or
content distribution networks. Currently, requesting a file from a
mirror other than the one already cached is a cache miss. Many
download sites present users with a simple download button that
doesn't always redirect them to the same mirror, which defeats the
benefit of a caching proxy and frustrates users

I would love to hear any of your thoughts on how caching proxies could
work better with content distribution networks

This first attempt at the plugin takes the approach of RFC 6249,
Metalink/HTTP: Mirrors and Hashes. The plugin listens for responses
that are an HTTP redirect and have "Link: <...>; rel=duplicate"
headers, then scans those URLs for one that already exists in the
cache. If it finds one, it transforms the response, replacing the
"Location: ..." header with the URL that already exists in the cache

The code is up on GitHub [1] and works just enough that, given a
response with a "Location: ..." header that's not cached and a "Link:
<...>; rel=duplicate" header that is cached, it will rewrite the
"Location: ..." header with the cached URL

I would love any feedback on this approach

We are also thinking of using RFC 3230, Instance Digests in HTTP.
Given a response with a "Location: ..." header that's not cached and a
"Digest: ..." header, the plugin would check if another URL with the
same digest already exists in the cache and rewrite the
"Location: ..." header with that URL if so
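As a sketch, the digest lookup could be as simple as a table keyed by the "Digest: SHA-256=..." value. The table, its placeholder entries, and lookup_digest() are all hypothetical; the real plugin would consult whatever digest store it ends up using:

```c
#include <stddef.h>
#include <string.h>

/* Map the value of a "Digest: SHA-256=..." header to a URL already in
 * the cache. The entries here are placeholders, not real digests. */
struct digest_entry {
    const char *digest;  /* base64 SHA-256 value from the Digest header */
    const char *url;     /* cached URL with matching content */
};

static const struct digest_entry table[] = {
    { "placeholder-digest-aaa", "http://mirror.example.net/file.iso" },
    { "placeholder-digest-bbb", "http://mirror.example.org/other.iso" },
};

/* Return the cached URL for a digest, or NULL for a miss. With a real
 * hash-keyed store this is one lookup per response, no matter how many
 * mirrors the CDN lists. */
static const char *lookup_digest(const char *digest)
{
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].digest, digest) == 0)
            return table[i].url;
    return NULL;
}
```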

Still more ideas include:

* Remember URLs for the same file so future requests for any of
these URLs use the same cache key. A problem is how to prevent a
malicious domain from distributing false information about URLs it
doesn't control. This could be addressed with a whitelist of domains

* Making decisions about the best mirror to choose, e.g. one that
is most cost efficient, faster, or more local

* Use content digest to detect or repair download errors

Finally, can anyone in the Metalink community recommend a reusable
C/C++ solution for checking whether a "Link: ..." header has a
"rel=duplicate" parameter? For now I am parsing these headers from
scratch with memchr(), but I expect I am neglecting some accumulated
wisdom on getting all the RFC rules right, and maybe on
interoperating with nonconformant implementations. Please let me
know if you know a better way
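For comparison, here is one way such a check might look in plain C. It handles optional quotes and case-insensitive matching, but it is still not a full RFC 5988 parser (no multi-valued rel="duplicate other", no quoted-string escapes), so it illustrates the problem rather than solving it:

```c
#include <ctype.h>
#include <string.h>
#include <strings.h>

/* Rough check for a rel=duplicate parameter in a single Link header
 * value, e.g.: <http://mirror.example.net/f.iso>; rel=duplicate
 * Not a full RFC 5988 parser: rel="duplicate describedby" and escaped
 * quotes are not handled. */
static int has_rel_duplicate(const char *value)
{
    const char *p = strchr(value, '>');  /* skip the <URI-Reference> */
    if (!p)
        return 0;
    while ((p = strchr(p, ';'))) {       /* each parameter follows a ';' */
        p++;
        while (isspace((unsigned char)*p))
            p++;
        if (strncasecmp(p, "rel=", 4) == 0) {
            p += 4;
            if (*p == '"')               /* rel may be quoted */
                p++;
            return strncasecmp(p, "duplicate", 9) == 0;
        }
    }
    return 0;
}
```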

Here is a similar message [2] on the Traffic Server developers list,
with slightly more detail

We run Traffic Server here at a rural village in Rwanda for faster,
more reliable internet access. I am working on this as part of the
Google Summer of Code

[1] https://github.com/jablko/dedup
[2] http://mail-archives.apache.org/mod_mbox/trafficserver-dev/201205.mbox/%3C4FAE78FB.1070404%40nottheoilrig.com%3E

AR Dahal

May 21, 2012, 3:09:47 AM
to metalink-...@googlegroups.com
Excellent start Jack. I am also a GSoC student this year, working on
the RFC 6249 implementation in KGet. I guess what you have in GitHub
is really great.

> We are also thinking of using RFC 3230, Instance Digests in HTTP.
> Given a response with a "Location: ..." header that's not cached and a
> "Digest: ..." header, the plugin would check if another URL with the
> same digest already exists in the cache and rewrite the
> "Location: ..." header with that URL if so
>
> Still more ideas include:
>
>    * Remember URLs for the same file so future requests for any of
> these URLs use the same cache key. A problem is how to prevent a
> malicious domain from distributing false information about URLs it
> doesn't control. This could be addressed with a whitelist of domains
>
>    * Making decisions about the best mirror to choose, e.g. one that
> is most cost efficient, faster, or more local
>
>    * Use content digest to detect or repair download errors
>
> Finally, can anyone in the Metalink community recommend a reusable
> C/C++ solution for checking if a "Link: ..." header has a
> "rel=duplicate" parameter? For now I am parsing these headers from
> scratch with memchr(), but I expect that I am neglecting some
> accumulated wisdom on getting all the RFC rules right, and maybe
> interoperating with nonconformant implementations. Please let me
> know if you know a better way

I guess aria2 has support for Metalink/HTTP and may have the reusable code you are looking for. I have gone through the code, but as my implementation is Qt/KDE dependent, I am not sure about its use to me. Still, go ahead and have a look at the code. Here's the link: https://github.com/tatsuhiro-t/aria2

All the best. :)

Jack Bates

Jun 2, 2012, 7:59:16 AM
to Metalink Discussion
On May 21, 12:09 am, AR Dahal <dahalaish...@gmail.com> wrote:
> I guess aria2 has support for Metalink/HTTP and may have the reusable code
> you are looking for. I have gone through the code but as my implementation
> is Qt/KDE dependent, I am not sure about its use to me. Still go ahead and
> have a look at the code. Here's the link https://github.com/tatsuhiro-t/aria2
>
> All the best. :)

I am very pleased to meet you AR and wish you all the best with KGet.
Thank you for your encouragement!

Thank you for suggesting the aria2 code. I checked it out, and their
parseMetalinkHttpLink() code looks like it is written from scratch,
using std::find() plus some utility functions. But I could maybe copy
it into the Traffic Server plugin

Thank you again for this pointer

Jack Bates

Jul 3, 2012, 7:44:21 AM
to Metalink Discussion
Hi, this plugin for Apache Traffic Server now checks the "Digest:
SHA-256=..." header. It's still proof of concept but it's up on GitHub
[1]. Here is a post to the Traffic Server developers list with details
[2]

The plugin computes SHA-256 digests for responses from origin servers.
Then, given a response with a "Location: ..." header and a "Digest:
SHA-256=..." header, if the "Location: ..." URL is not already cached
but the digest matches content in the cache, it rewrites the
"Location: ..." header with the cached URL. This should redirect
clients to mirrors that are already cached
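For reference, the value after "SHA-256=" is the base64 encoding of the raw 32-byte digest (per RFC 3230). A minimal encoder for formatting that header value might look like this; the digest bytes themselves would come from a crypto library such as OpenSSL's SHA256():

```c
#include <stddef.h>

/* Encode raw bytes (e.g. a 32-byte SHA-256 digest) as base64, for use
 * in a "Digest: SHA-256=..." header value. out must hold at least
 * 4 * ((len + 2) / 3) + 1 bytes. */
static void base64_encode(const unsigned char *in, size_t len, char *out)
{
    static const char tab[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    size_t i, o = 0;

    for (i = 0; i + 2 < len; i += 3) {          /* whole 3-byte groups */
        out[o++] = tab[in[i] >> 2];
        out[o++] = tab[((in[i] & 3) << 4) | (in[i + 1] >> 4)];
        out[o++] = tab[((in[i + 1] & 15) << 2) | (in[i + 2] >> 6)];
        out[o++] = tab[in[i + 2] & 63];
    }
    if (len - i == 1) {                         /* 1 trailing byte */
        out[o++] = tab[in[i] >> 2];
        out[o++] = tab[(in[i] & 3) << 4];
        out[o++] = '=';
        out[o++] = '=';
    } else if (len - i == 2) {                  /* 2 trailing bytes */
        out[o++] = tab[in[i] >> 2];
        out[o++] = tab[((in[i] & 3) << 4) | (in[i + 1] >> 4)];
        out[o++] = tab[(in[i + 1] & 15) << 2];
        out[o++] = '=';
    }
    out[o] = '\0';
}
```

For a SHA-256 digest, len is always 32, so the header value is always 44 characters ending in "=".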

I'd love any feedback on this approach

Next steps are to check if the code quality is good enough. Does it
tie up the event loop while computing digests? Also the proof of
concept maps digests to cached URLs by storing URLs as objects in the
Traffic Server cache. It works, but other alternatives were
discussed, like extending the core with new APIs, or using KyotoDB or
Memcached. What is the ideal way to store digests?

This proof of concept handles the case where content is already cached
but the cached URL isn't listed among the "Link: <...>; rel=duplicate"
headers, maybe because it was downloaded from a server not
participating in the CDN, or because there are too many mirrors to
list in "Link: <...>; rel=duplicate" headers. Also this potentially
reduces the number of cache reads, because the "Link: <...>;
rel=duplicate" header means scanning URLs until one is found that's
already cached or the list is exhausted, whereas the "Digest:
SHA-256=..." header means a constant number of lookups

RFC 6249 requires a "Digest: SHA-256=..." header; without one, "Link:
<...>; rel=duplicate" headers MUST be ignored:

> If Instance Digests are not provided by the Metalink servers, the
> Link header fields pertaining to this specification MUST be ignored.

> Metalinks contain whole file hashes as described in
> Section 6, and MUST include SHA-256, as specified in [FIPS-180-3].

Other next steps might be to add more Metalink client features, e.g.
downloading segments from multiple origin servers in parallel. A
first step might be, given a "Location: ..." header and "Link: <...>;
rel=duplicate" headers, to transparently request the resource and
replace the whole response instead of rewriting the "Location: ..."
header. Then Metalink features could be added to the transparent
request

I suspect this breaks HTTP in some ways, and is too complicated.
Instead of adding Metalink features to the proxy, how well does the
proxy work with Metalink clients? Is there any outstanding work or
issues related to range requests?

Alex Rousskov pointed out a project for Squid to implement duplicate
transfer detection:

* http://comments.gmane.org/gmane.comp.web.squid.devel/15803
* http://comments.gmane.org/gmane.comp.web.squid.devel/16335
* http://www.hpl.hp.com/techreports/2004/HPL-2004-29.pdf

The goal of this plugin is to address the frustration of users who
click a download button and sometimes get a download that completes
in seconds, when they are redirected to a mirror that is already
cached, and other times one that takes hours, when they are
redirected to a mirror that isn't

Per Jessen is working on another project for Squid with a similar
goal [3], but the design is a bit different

[1] https://github.com/jablko/dedup
[2] http://mail-archives.apache.org/mod_mbox/trafficserver-dev/201206.mbox/%3C4FE82F1D.2010906%40nottheoilrig.com%3E
[3] http://mirrorbrain.org/archive/mirrorbrain/0170.html