I am working on a way to let Squid benefit from metalinks, and I need your help.


Eliezer Croitoru

Jan 25, 2013, 12:04:52 AM
to Metalink Discussion
Hey,

As of now there are a couple of very good clients that make use of metalinks.

I have seen that some MirrorBrain-based servers actually publish
metalink files in their headers.
Fedora just redirects you to the nearest mirror.

I am writing an add-on for the Squid HTTP cache proxy server that will
allow it to take advantage of the data in metalink files for better
caching.

Thanks,
Eliezer

--
Eliezer Croitoru
https://www1.ngtech.co.il
sip:ngt...@sip2sip.info
IT consulting for Nonprofit organizations
eliezer <at> ngtech.co.il

Anthony Bryan

Jan 25, 2013, 1:26:56 AM
to Metalink Discussion
On Fri, Jan 25, 2013 at 12:04 AM, Eliezer Croitoru <eli...@ngtech.co.il> wrote:
> Hey,
>
> As of now there are a couple of very good clients that make use of metalinks.
>
> I have seen that some MirrorBrain-based servers actually publish
> metalink files in their headers.
> Fedora just redirects you to the nearest mirror.
>
> I am writing an add-on for the Squid HTTP cache proxy server that will allow
> it to take advantage of the data in metalink files for better caching.

Eliezer, thanks for writing & apologies for not replying to your other
message sooner... but yes, this is what I was going to say: from
a regular download you can get to a metalink automatically via an HTTP
header, or from a metalink you can get to the canonical link or the file
actually being mirrored (by a whitelist or some such?)

also, check out the short 3-message thread ('Forward proxies and
CDN/mirrors') on the HTTP list:
http://lists.w3.org/Archives/Public/ietf-http-wg/2012AprJun/0409.html

the HTTP list might be the best place to ask, since there aren't a
whole lot of proxy people here (the focus here is more on metalink),
but there are on the HTTP list.


& here are some other links I already sent you:

I don't know enough about squid, but the basic idea is that people are
downloading identical files (same hash/content) from multiple
locations (different URLs). you want these duplicate entries to be
consolidated. sites that use metalinks give you the info to consolidate
them automatically, while other sites don't.

maybe check out the info on the Apache Traffic Server plugin for
metalink: https://cwiki.apache.org/confluence/display/TS/Metalink

it gives some links for squid stuff:

http://thread.gmane.org/gmane.comp.web.squid.devel/15803
http://thread.gmane.org/gmane.comp.web.squid.devel/16335
http://www.hpl.hp.com/techreports/2004/HPL-2004-29.pdf

http://wiki.jessen.ch/index/How_to_cache_openSUSE_repositories_with_Squid

--
(( Anthony Bryan ... Metalink [ http://www.metalinker.org ]
)) Easier, More Reliable, Self Healing Downloads

Bram Neijt

Jan 25, 2013, 11:35:20 AM
to metalink-...@googlegroups.com
Hi Eliezer,

I can't help you much with the details of how you can get Squid to
work with the data in the metalink files.

Maybe I can help with some pointers on what you are trying to do. If
you would care to explain the approach: what data will squid look at,
what will squid then do?

Greets,

Bram

Anthony Bryan

Jan 27, 2013, 10:47:50 PM
to Metalink Discussion
sorry, email sent while sleepy :)

instead of the HTTP list, the squid list is probably another good
place to discuss this

Eliezer Croitoru

Jan 28, 2013, 9:08:54 AM
to metalink-...@googlegroups.com
On 1/25/2013 6:35 PM, Bram Neijt wrote:
> Hi Eliezer,
>
> I can't help you much with the details of how you can get Squid to
> work with the data in the metalink files.
I'm a Squid developer (not a core one), so that's OK; I didn't plan on
anyone here helping me with the Squid code.

> Maybe I can help with some pointers on what you are trying to do. If
> you would care to explain the approach: what data will squid look at,
> what will squid then do?

OK, I will try to describe what we want to happen from the proxy's
point of view and from the client's.

The cache proxy we are talking about is a forward one, which is an
HTTP+FTP ONLY proxy.
I developed a feature called store-id (see details below) which I hope
will get into squid.HEAD and Squid 3.3 in the next few weeks.
This feature allows admins to prevent duplication of HTTP objects in
the cache using a small program which decides, for each request URL,
what its ID is.
This feature is the successor of the store_url_rewrite feature that
existed in Squid 2.7.

Squid is in no way a metalink client for now, and from many security
aspects it's not advisable for a proxy to be one.
Other than that, Squid and other proxies can benefit a lot from
metalinks.
For full metalink clients everything is fine, since the hashes are
available and they support partial content.
With proxies there are other issues that don't exist for full
metalink clients.
Since Squid doesn't implement caching for partial content, the only
benefit Squid gets from metalinks is identifying duplicates of one
object by URL.

The main issue in this case is that metalinks rely on hashes to verify
the downloaded content, while the store-id feature works only on a URL.
This can open a very deep security hole for cache poisoning.
Implementing a same-origin/same-domain policy is not an option, since
the URL objects in metalink files can be from different domains/IPs and
subdirectories.
A same-filename policy also doesn't apply, since simple
scripting/rewriting can fake it.

Another issue, related not directly to metalinks but more to Squid and
maybe some other cache software, is that the relevant metalink data is
supposed to be in the response to the original request, which at this
stage of the download doesn't help much, since the store-id has already
been decided.
I could do another thing, such as using the first link any user tries
as the store-id for the same URLs from the metalink file.

I know this was a bit long and not directly related, but since the RFC
for clients is being written and is in draft mode now, I think it's
good to raise these issues and maybe decide on a way to cover these
gaps for proxies' benefit.

* store-id feature details:
The helper gets the request URL and decides the store-id based on the
admin's algorithms.
Say the admin knows about a CDN/mirror pattern for a URL, such as at
SourceForge (a real-world example):
^http:\/\/.*\.dl\.sourceforge\.net\/(.*)
where all download mirrors in their network have the same URL path but
a different .dl.sourceforge.net subdomain.
All requests for the file /xyx/example.tar.gz can then be retrieved using
http://examplemirro1.dl.sourceforge.net/xyx/example.tar.gz
http://examplemirro2.dl.sourceforge.net/xyx/example.tar.gz
http://examplemirro3.dl.sourceforge.net/xyx/example.tar.gz

In this case the admin can use a store-id such as
"http://dl.sourceforge.net.squid.internal/xyx/example.tar.gz"; this
will cause Squid to store the requests from any of the mirrors as one
unified object.
The result is that if the URL/file/object already exists in the cache
from an older request to one mirror, the current request to another
mirror will be served from the cache rather than from the origin server.
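The SourceForge example above can be sketched as a store-id helper
script. This is a minimal illustration only, assuming the plain
(non-concurrent) helper line protocol of one request URL per input line
and an "OK store-id=..." or "ERR" reply; the mirror hostnames are the
hypothetical ones from the example:

```python
#!/usr/bin/env python3
# Sketch of a store-id helper for the SourceForge mirror pattern above.
import re
import sys

# Any *.dl.sourceforge.net mirror serves the same URL path.
MIRROR = re.compile(r'^http://.*\.dl\.sourceforge\.net/(.*)')

def store_id(url):
    """Return the unified store-id for a mirror URL, or None."""
    m = MIRROR.match(url)
    if m:
        # Collapse all mirrors under one .squid.internal pseudo-domain.
        return 'http://dl.sourceforge.net.squid.internal/' + m.group(1)
    return None

def main():
    # Helper loop: Squid writes one URL (plus optional extras) per line.
    for line in sys.stdin:
        parts = line.split()
        if not parts:
            continue
        sid = store_id(parts[0])
        if sid:
            sys.stdout.write('OK store-id=%s\n' % sid)
        else:
            sys.stdout.write('ERR\n')
        sys.stdout.flush()

if __name__ == '__main__':
    main()
```

The helper would be wired in via Squid's store_id_program directive;
URLs that match no rule get an "ERR" reply and keep their original
store key.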

<SNIP>
Best regards,
Eliezer

Eliezer Croitoru

Jan 28, 2013, 9:13:19 AM
to metalink-...@googlegroups.com
On 1/28/2013 5:47 AM, Anthony Bryan wrote:
> sorry, email sent while sleepy :)
>
> instead of the HTTP list, the squid list is probably another good
> place to discuss this
I do agree on that.

I just posted a full response with the issues I had in mind.
Once the store-id feature is embedded into Squid, I will post a
prototype of a small helper that does some tricks based on metalinks.

Regards,
Eliezer

Bram Neijt

Jan 28, 2013, 12:08:31 PM
to metalink-...@googlegroups.com
Hi Eliezer,

I'm not yet clear on how Squid would come to know about the metalink.
Will an admin add the metalink to a list, so you have a list of
supported/trusted metalinks, or will Squid detect the download of a
metalink and do something with it? Because I was in the mood, I've
added a section on both scenarios.

==== If Squid detects a metalink download and tries to do something smart
I don't really see a way of having metalinks interact with the mirror
path identifier/store-id features. The big problem, I think, is that
you have to be sure the client requesting any of the metalink URLs
actually has the metalink and will verify the integrity of the download
afterwards. Otherwise the client could just be visiting one of the
URLs mentioned in a metalink that happened to pass through the proxy,
and get the wrong data.

One thing I could think of was having Squid treat the URLs in a
metalink file as probably almost static, and raise the cache time
regardless of what the real HTTP server hosting the file would respond
with in the future.

==== If Squid trusts the metalink content, for example it was added by an admin
Then Squid could use the URLs to generate a regex for the store-id
extraction, and you could have Squid consider the different URLs as
equal using your plugin. You might need to add some functionality to
your plugin (for example, use a search-and-replace regex instead of
"group 1 determines the store-id", which I gathered from your example).
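For illustration, the two rule styles mentioned here could be
contrasted like this sketch (the rules are hypothetical, reusing the
SourceForge pattern from earlier in the thread):

```python
import re

# "Group 1 determines the store-id": a fixed prefix plus the captured path.
GROUP1 = re.compile(r'^http://.*\.dl\.sourceforge\.net/(.*)')

def store_id_group1(url):
    m = GROUP1.match(url)
    return 'http://dl.sourceforge.net.squid.internal/' + m.group(1) if m else None

# Search-and-replace: a pattern plus a replacement template, which can
# reshape any part of the URL, not only append a captured suffix.
SEARCH = re.compile(r'^http://[^/]+\.dl\.sourceforge\.net/')
REPLACE = 'http://dl.sourceforge.net.squid.internal/'

def store_id_replace(url):
    new, count = SEARCH.subn(REPLACE, url)
    return new if count else None
```

For this simple pattern both styles produce the same store-id; the
search-and-replace form only pays off once a rule has to rewrite the
middle of a URL.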


Good luck with the plugin!

Bram

Eliezer Croitoru

Jan 28, 2013, 1:02:37 PM
to metalink-...@googlegroups.com, Anthony Bryan
Thanks Bram,
Notes in the text below.

On 1/28/2013 7:08 PM, Bram Neijt wrote:
> Hi Eliezer,
>
> I'm not yet clear on how Squid would come to know about the metalink.
> Will an admin add the metalink to a list, so you have a list of
> supported/trusted metalinks, or will squid detect the download of a
> metalink and do something with it? Because I was in the mood, I've
> added a section on both scenarios.

This is negotiable, since metalinks are not in use in Squid yet, nor in
any competing forward proxy I know of.

Squid should come to know of an exact metalink file from the headers of
the download, or from a download redirection (301/302).
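For reference, MirrorBrain-style servers advertise the metalink in a
Link response header (rel=describedby with the metalink4 media type, as
in RFC 6249); a proxy could pick it out with something like this
sketch, where the header-matching regex is my own simplification:

```python
import re

# Matches e.g.:
#   Link: <http://example.com/file.meta4>; rel=describedby;
#         type="application/metalink4+xml"
LINK_RE = re.compile(
    r'<([^>]+)>[^,]*rel="?describedby"?[^,]*'
    r'type="application/metalink4\+xml"')

def metalink_url(link_header):
    """Return the advertised metalink URL from a Link header, or None."""
    m = LINK_RE.search(link_header or '')
    return m.group(1) if m else None
```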

> ==== If Squid detects a metalink download and tries to do something smart
> I don't really see a way of having metalinks interact with the mirror
> path identifier/store-id features. The big problem, I think, is that
> you have to be sure the client requesting any of the metalink URLs
> actually has the metalink and will verify the integrity of the download
> afterwards. Otherwise the client could just be visiting one of the
> URLs mentioned in a metalink that happened to pass through the proxy,
> and get the wrong data.

Downloading a metalink should never, ever result in the proxy
downloading the files/URLs.
This is one of the big security holes that could be opened. (I pray for
the sake of that proxy writer's life.)

> One thing I could think of was having Squid treat the URLs in a
> metalink file as probably almost static, and raise the cache time
> regardless of what the real HTTP server hosting the file would respond
> with in the future.
This is similar to one of my ideas: using the metalink file from a
download header to build up a small metalink-based DB, which will use
the first matched URL from the metalink file as the store-id for this
file/object.
For that, all mirrors should have a header pointing to the metalink
file of the download.
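Such a DB could be sketched as follows, parsing a Metalink/XML document
(RFC 5854, namespace urn:ietf:params:xml:ns:metalink) and mapping every
mirror URL of a file to the first listed URL. This is an illustrative
sketch, not the planned helper:

```python
import xml.etree.ElementTree as ET

NS = {'ml': 'urn:ietf:params:xml:ns:metalink'}

def store_id_map(metalink_xml):
    """Map every mirror URL in a metalink document to the first listed
    URL of its file, so all mirrors share one store-id."""
    root = ET.fromstring(metalink_xml)
    mapping = {}
    for file_el in root.findall('ml:file', NS):
        urls = [u.text.strip() for u in file_el.findall('ml:url', NS)
                if u.text]
        for url in urls:
            mapping[url] = urls[0]
    return mapping
```

A helper could then answer store-id lookups from this mapping, falling
back to the request URL itself when it is absent.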


> ==== If Squid trusts the metalink content, for example it was added by an admin
> Then Squid could use the URLs to generate a regex for the store-id
> extraction, and you could have Squid consider the different URLs as
> equal using your plugin. You might need to add some functionality to
> your plugin (for example, use a search-and-replace regex instead of
> "group 1 determines the store-id", which I gathered from your example).
Squid, or any software that reads the metalink file/content, should be
reliable, or else it won't be used at all.
The feature (plugin) is an interface to Squid which allows other
software to do all the other needed work, such as regexes etc.
The logic I am seeking is either a helper that does what's needed
before the download starts, or Squid doing it internally.

In order to use a regex of any kind I need to know what to match for.
This is where I want to use metalinks.
MirrorBrain makes life simpler here, since I can learn about a list of
mirrors with the exact same content and add that list of mirrors to one
store-id match.

This is not really a direct metalink feature, though.
Maybe there is an option to extend metalinks with something that allows
more than just:
<url type="http">http://mirror.aptus.co.tz/pub/ubuntu/12.10/ubuntu-12.10-desktop-i386.iso</url>

but a more plural way to define it for static CDN networks, such as:
<url type="http" domain="mirror.aptus.co.tz" base_path="pub/ubuntu/">ubuntu/12.10/ubuntu-12.10-desktop-i386.iso</url>

or any other form.
I know, for example, that Fedora uses a specific path structure.
This is the local mirror here in Israel:
http://mirror.isoc.org.il/pub/fedora/releases/17/Fedora/x86_64/iso/Fedora-17-x86_64-netinst.iso

When I download from:
http://download.fedoraproject.org/pub/fedora/linux/releases/17/Fedora/x86_64/iso/Fedora-17-x86_64-netinst.iso

I get redirected to my local mirror, which, as you can see, has a
specific path structure that can be described in a much simpler way
than writing the whole URL in the metalink file.

If the CDN network is managed by one node and the metalinks are
compiled with a simple algorithm, why not present it to the client in
the same simple way the main node sees it?
In any case, most metalinks are dynamic by AS or country code in a big
CDN network.
I also remember that a path can be used in the filename, giving a
smaller file with fewer unneeded strings in it.

Maybe there could be another file format than metalink for this
specific purpose. What do you think?

> Good luck with the plugin!
>
> Bram
>
Thanks,
Eliezer

Jack Bates

Jan 29, 2013, 8:36:58 AM
to Metalink Discussion

On Jan 28, 2013 4:10 PM, "Eliezer Croitoru" <eli...@ngtech.co.il> wrote:
> The main issue in this case is that metalinks rely on hashes to verify the downloaded content, while the store-id feature works only on a URL.
> This can open a very deep security hole for cache poisoning.
> Implementing a same-origin/same-domain policy is not an option, since the URL objects in metalink files can be from different domains/IPs and subdirectories.
> A same-filename policy also doesn't apply, since simple scripting/rewriting can fake it.

If the proxy computes the hash itself, from the content, does that solve the cache poisoning vulnerability?
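A sketch of the check Jack describes, assuming the metalink publishes a
SHA-256 digest and that the proxy can buffer and hash the full body
before committing it to the shared cache entry:

```python
import hashlib

def content_matches(body, expected_sha256_hex):
    """True if the downloaded body hashes to the digest published in
    the metalink; only then is it safe to file it under a store-id
    shared by all mirrors."""
    return hashlib.sha256(body).hexdigest() == expected_sha256_hex.lower()
```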
