Re: [squid-dev] Digest related question.

Eliezer Croitoru

unread,

Mar 15, 2015, 10:03:31 PM3/15/15

to squi...@lists.squid-cache.org, Jack Bates, metalink-...@googlegroups.com

Hey Jack,

Thanks for the links and the help.
I have started working on a pesudo for a way to test\implement some of
the ideas but it is still very far from reality.

The basic code is at:
http://www1.ngtech.co.il/paste/1292/

Which means that if I do trust the source domain I will try to analyze a
head request first before any decision.
(there should be a way to override the request redirection or any other
thing we would want to do with it)

An option would be to add the url into a DB with one to many and many to
one relationship.
And since the pesudo code should be the first step before StoreID helper
run-time, StoreID in place can apply an internal address per a digest of
choice or even multiple choices.

My main concern until now is that if squid would have the cached object
in a digest url form such as:
http://digest.squid.internal/MD5/xyz123"

Squid would in many cases try to verify against the origin server that
the cached object has the same ETAG and MODIFICATION time.

I am unsure yet on how to "resolve" this issue in an elegant way which
would not violate the RFCs.

Another subject is the Link headers which I am still considering how to
use compared to Digest which is simple enough.

Any ideas on a direction how to solve any of the above issues(even if I
have not seen them until now) are welcomed at
squi...@lists.squid-cache.org.

Eliezer

On 15/03/2015 04:02, Jack Bates wrote:
> Some of the sites listed here send Digest headers:
> http://www.metalinker.org/samples.html#externalsites
>
> e.g.
> $ curl -v download.documentfoundation.org/libreoffice/stable/4.4.1/rpm/x86/LibreOffice_4.4.1_Linux_x86_rpm.tar.gz > /dev/null
>
> < HTTP/1.1 302 Found
> < Link:<http://tdf.mirror.rafal.ca/libreoffice/stable/4.4.1/rpm/x86/LibreOffice_4.4.1_Linux_x86_rpm.tar.gz>; rel=duplicate; pri=1; geo=ca
> < Link:<http://mirror.nexcess.net/tdf/libreoffice/stable/4.4.1/rpm/x86/LibreOffice_4.4.1_Linux_x86_rpm.tar.gz>; rel=duplicate; pri=2; geo=us
> < Digest: SHA-256=Ggb4nPDpd2T92wVi8k7+w+KND6GjGv6q1biDAWE8snY=
> < Location:http://tdf.mirror.rafal.ca/libreoffice/stable/4.4.1/rpm/x86/LibreOffice_4.4.1_Linux_x86_rpm.tar.gz
>
> Here is an actual example of rewriting the response
> so a client gets a file that's already cached:
> https://groups.google.com/d/msg/metalink-discussion/ww5fFdLfaZo/cSo4Viss85kJ

Jack Bates

unread,

Mar 16, 2015, 1:49:21 PM3/16/15

to Eliezer Croitoru, squi...@lists.squid-cache.org, metalink-...@googlegroups.com

On 15/03/15 07:03 PM, Eliezer Croitoru wrote:
> Hey Jack,
>
> Thanks for the links and the help.
> I have started working on a pesudo for a way to test\implement some of
> the ideas but it is still very far from reality.
>
> The basic code is at:
> http://www1.ngtech.co.il/paste/1292/
>
> Which means that if I do trust the source domain I will try to analyze a
> head request first before any decision.
> (there should be a way to override the request redirection or any other
> thing we would want to do with it)

Is trust an issue?
Assuming you populate the database
with digests that you compute yourself.
If an origin sends a Digest header
and you rewrite the response so a client gets a matching file,
do you need to trust the origin?

> An option would be to add the url into a DB with one to many and many to
> one relationship.
> And since the pesudo code should be the first step before StoreID helper
> run-time, StoreID in place can apply an internal address per a digest of
> choice or even multiple choices.
>
> My main concern until now is that if squid would have the cached object
> in a digest url form such as:
> http://digest.squid.internal/MD5/xyz123"
>
> Squid would in many cases try to verify against the origin server that
> the cached object has the same ETAG and MODIFICATION time.
>
> I am unsure yet on how to "resolve" this issue in an elegant way which
> would not violate the RFCs.

Good point.
When the Traffic Server plugin checks if the file is already cached,
I don't know off hand if it also checks if it's valid.
The plugin could quite possibly redirect a client
to a URL that must be revalidated with the origin.
But maybe this is still an improvement?
Maybe it's good enough, at least for a first iteration?
The plugin does check that the original URL is *not* cached
before rewriting the response.
In your pseudo code,
can you check that the original URL is *not* cached,
before querying the database with the digest?

Eliezer Croitoru

unread,

Mar 18, 2015, 4:59:17 PM3/18/15

to Henrik Nordström, squi...@lists.squid-cache.org, metalink-...@googlegroups.com, Jack Bates

Thanks Henrik,

I have been wondering about what you have mentioned.

I am still considering couple things about Deep Response Inspection
while it's might cost some more CPU time.

Eliezer

On 18/03/2015 10:23, Henrik Nordström wrote:

> mån 2015-03-16 klockan 04:03 +0200 skrev Eliezer Croitoru:
>
>> My main concern until now is that if squid would have the cached object
>> in a digest url form such as:
>> http://digest.squid.internal/MD5/xyz123"
>>
>> Squid would in many cases try to verify against the origin server that
>> the cached object has the same ETAG and MODIFICATION time.
>

> The Digest alone is only on the body, and says nothing about header
> authority. You need to get trusted object headers from somewhere else,
> i.e. the requested origin. Once you have the authoritative headers you
> can splice in the digest verified response body.
>
> This is in some sense similar to the header merging needed in ETag based
> variant handling on a single URL, but even more so as you must not take
> headers from one random URL and apply them to another requested URL
> without permission unless the requested URL permits this. Violating
> this opens a range of security concerns where headers may be injected
> giving a different result than intended by the origin.
>
> Regards
> Henrik
>

Eliezer Croitoru

unread,

Mar 18, 2015, 5:46:27 PM3/18/15

to squi...@lists.squid-cache.org, metalink-...@googlegroups.com

Hey Jack,

I will try to write down here instead of as notes.

About the trust issues:
In the Internet I am used to trusted sources since indeed the Internet
is a great place.
But like life and very far from the movies there are people out-there in
the world that just likes to benefit more then they deserve or they
think they should benefit more for any other reason.
There are cases which indeed the work should be rewarded and even
con-artists deserve more then a penny for their time.

I would not run to do the math about power consumption for others but I
do it for myself and in many cases I would prefer to run a super CPU
intensive task then relying on what ever "event" that will might trigger
a set of tasks.
The above is due to the basic fact that I have seen cases which an event
would just not come and a pending task was staying there for ages.
Back to Digest, For squid in the current state of things it would
require a bit of juggling to "re-compile" the StoreID after the file was
downloaded.(lat time I was reading the code)
It will not block collapse-forwarding but it will require more then just
CPU.
Also what it means is that as a proxy the only option to send the client
a verified download is after I downloaded the full file and there for an
issue.
So the above limits squid into a specific Database structure and a way
of operation.
It is not a bad way to do things but since Digest headers RFC exist and
Digest DBs do exist for most sane large file storage systems(even if not
available to the end user throw an API) I would prefer to first consider
them and then Digesting all sorts of files.

About a self made Digest DB:
I will trust it but it will be very limited and will be of a great help
for a non trusted source rather then a trusted one.
Trusting the origin service means that I trust a part of a very big
system that tries to ensure that I as a human will not be exploited.
There are places on the planet and on the Internet which the situation
is that the user cannot trust the service provider(s).
I learned that If I do not trust the ISP I am trying another one but I
cannot switch the Ethernet cables from my PC to the wall.
If a bunch of users will prove that the situation is that they cannot
trust their ISP for a 3-7MB RPM signed by me on-top of http I would be
glad to sit and hear their assumption and will hear the arguments about
how to first collide MD5 and then each of the SHA Digest family.
I must admit that I have encountered some engineer(which has never been
in the squid list) that have tried to make me believe that he can
replace my RPM with a fake one and the users will not notice about it.
Sorry for him he is still trying...

About the improvement, it's good to re-validate that the file have the
same creation\modification time and it should reflect that the Digest
was not changed and vice versa.
I assume that in the future when data will take less place or will
consume more then it is today and a RPM such as the one I created could
be(if at all) at will re-calculated to match a colliding Digest, another
way to verify data integrity will be there.
Until then the internet is safe enough that a simple single encrypted
"control" channel or two would be enough to verify the integrity of the
data.

And since I am human sometimes I think about something and I am sure I
did it and it happens that it was not written even if I invested lots of
time in reviewing the pesudo code line by line.
Ho well, I assume that this how humans are.. they are allowed to have a
mistake or two.

Verifying that the object is cached or not before running any logic is a
good choice if the DB is not too big and I will add it to the pesudo
later on with a new version.
HTCP would be my first choice to do that and I will make efforts and
re-learn it to make sure that the pesudo can run for testing.

Eliezer Croitoru

* I have seen some of the great series Hustel due to two recommendations
of friends and family:
http://www.imdb.com/title/tt0379632/

Reply all

Reply to author

Forward