[pkg-discuss] Code review: making transport deal with poisonous web caches


Tim Foster

Mar 5, 2012, 2:22:16 AM
to pkg discuss
Hi there,

I've got a change here that improves pkg(1) when sitting behind a web
cache that has somehow been populated with corrupted content.

I wasn't sure whether t_pkg_install was the best place for the test
here, but couldn't see anywhere else in the testsuite that specifically
addresses transport changes - I'm happy to move it if there's a better place.

https://cr.opensolaris.org/action/browse/pkg/timf/poisoned-cache/poisoned-cache-webrev/

Comments welcome,

cheers,
tim
_______________________________________________
pkg-discuss mailing list
pkg-d...@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/pkg-discuss

Shawn Walker

Mar 5, 2012, 11:50:51 AM
to Tim Foster, pkg discuss
On 03/04/12 23:22, Tim Foster wrote:
> Hi there,
>
> I've got a change here that improves pkg(1) when sitting behind a web
> cache that has somehow been populated with corrupted content.
>
> I wasn't sure whether t_pkg_install was the best place for the test
> here, but couldn't see anywhere else in the testsuite that specifically
> addresses transport changes - I'm happy to move it if there's a better place.
>
> https://cr.opensolaris.org/action/browse/pkg/timf/poisoned-cache/poisoned-cache-webrev/

src/modules/client/transport/repo.py:
line 1034: it should set header["Pragma"] = "no-cache" as well;
that's also true for line 513; necessary for HTTP 1.0 caches [1]
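A minimal sketch of the cache-busting headers being asked for here (the function name and dict-of-headers style are illustrative, not the actual repo.py code):

```python
def no_cache_headers(header=None):
    """Return request headers that ask both HTTP/1.1 and HTTP/1.0
    caches to revalidate instead of serving a stored copy
    (RFC 2616, sections 14.9 and 14.32)."""
    header = dict(header or {})
    header["Cache-Control"] = "no-cache"  # honoured by HTTP/1.1 caches
    header["Pragma"] = "no-cache"         # honoured by HTTP/1.0 caches
    return header
```

HTTP/1.0 intermediaries predate Cache-Control, which is why Pragma is needed as well.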

lines 213, 1037, 1752: need extra newline

lines 1750, 2019: I'd make these docstrings the same and something
like: """Pointless to attempt refetch of corrupt content for
this protocol."""

src/modules/client/transport/transport.py:
line 917: this comment is subtly different for each retry case

lines 918-926: so the problems I see with this approach are (and it
applies to most of the other cases):

* it relies on the number of retries left being sufficient
to retrieve the content (i.e. retry_count) that's probably ok,
but there should be a comment

* if this is the last retry already, then the failure
won't be added to failedreqs / failures (oops), a simple
fix for that would be to track the number of retries and
skip the redownload if retries == retry_count

* i'm not sure why it's doing "errlist.append(e)" as
errlist will be re-assigned at the next iteration of the loop
and entries in flist are sufficient to prevent a successful
return
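The "track the number of retries" fix from the second bullet could look roughly like this (hypothetical helper; `retry_count`, `flist`, and `failures` are stand-ins for the transport.py names mentioned above):

```python
def handle_corrupt_item(item, retries, retry_count, flist, failures):
    """On a corrupt download: queue a refetch only while retries
    remain; on the final attempt, record the failure instead, so it
    is not silently dropped."""
    if retries >= retry_count:
        failures.append(item)  # last attempt: report the failure
    else:
        flist.append(item)     # retries left: refetch with no-cache headers
```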

src/tests/cli/t_pkg_install.py:
lines 784-788: instead of this, you can grab the fmri from the list
that pkgsend_bulk returns on line 755

lines 790-791: actually, simply doing a forced image-create again
doesn't clear the content, but our image_create() routine
automatically calls the unit test image_destroy() method

line 800: s/"w"/"wb"/
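The "wb" nit matters because the test writes raw junk into a cached file, and text mode could translate line endings or otherwise mangle the bytes. A self-contained illustration (file names and junk payload are made up, not the test's actual data):

```python
import os
import tempfile

JUNK = b"\x00corrupt\r\n\x00"

def poison_cache_file(path):
    """Overwrite a cached file with junk bytes.  Binary mode ("wb")
    guarantees the bytes land on disk verbatim; text mode ("w")
    could translate the \r\n in the payload."""
    with open(path, "wb") as f:
        f.write(JUNK)

fd, path = tempfile.mkstemp()
os.close(fd)
poison_cache_file(path)
with open(path, "rb") as f:
    data = f.read()
os.remove(path)
```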


-Shawn

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.4

Tim Foster

Mar 5, 2012, 4:08:00 PM
to Shawn Walker, pkg discuss
On 03/ 6/12 05:50 AM, Shawn Walker wrote:
>> https://cr.opensolaris.org/action/browse/pkg/timf/poisoned-cache/poisoned-cache-webrev/

Thanks for taking a look,

> src/modules/client/transport/repo.py:
> line 1034: it should set header["Pragma"] = "no-cache" as well;
> that's also true for line 513; necessary for HTTP 1.0 caches [1]

Good point - I've fixed both of those.

> lines 213, 1037, 1752: need extra newline

For 213, it's the same class - other places where we switch from
defining normal vs. static methods don't seem to do that. I've fixed
the others though.

> lines 1750, 2019: I'd make these docstrings the same and something
> like: """Pointless to attempt refetch of corrupt content for
> this protocol."""

Sounds good.

> src/modules/client/transport/transport.py:
> line 917: this comment is subtly different for each retry case

I've fixed that.

> lines 918-926: so the problems I see with this approach are (and it
> applies to most of the other cases):
>
> * it relies on the number of retries left being sufficient
> to retrieve the content (i.e. retry_count) that's probably ok,
> but there should be a comment

Yes, I realised that at the time, and sort of threw my hands up in the
air at what to do in the face of several failed attempts followed by
one attempt that returns corrupt content... I've added comments.

> * if this is the last retry already, then the failure
> won't be added to failedreqs / failures (oops), a simple
> fix for that would be to track the number of retries and
> skip the redownload if retries == retry_count

Good point - thanks. Yes that seems reasonable.

> * i'm not sure why it's doing "errlist.append(e)" as
> errlist will be re-assigned at the next iteration of the loop
> and entries in flist are sufficient to prevent a successful
> return

Oops.

> src/tests/cli/t_pkg_install.py:
> lines 784-788: instead of this, you can grab the fmri from the list
> that pkgsend_bulk returns on line 755

Great. I still need to grab the version component of the FMRI in order
to quote it properly, but that's still nicer than parsing pkg info
output, thanks.

> lines 790-791: actually, simply doing a forced image-create again
> doesn't clear the content, but our image_create() routine
> automatically calls the unit test image_destroy() method

Sure - I'll add a comment.

> line 800: s/"w"/"wb"/

Fixed.

I've got an incremental, and new webrev at:
https://cr.opensolaris.org/action/browse/pkg/timf/poisoned-cache-v2/

cheers,
tim

Shawn Walker

Mar 5, 2012, 4:30:00 PM
to Tim Foster, pkg discuss
...

> I've got an incremental, and new webrev at:
> https://cr.opensolaris.org/action/browse/pkg/timf/poisoned-cache-v2/

So, I'm sorry for not realising this before, but each loop through is
not another retry since there may be multiple origins involved.

The cleanest way to resolve this (but it will cause a lot more churn) is
to have __gen_repo yield 'i' as part of the tuple it returns so you know
the number of the attempt.

The only other way I can think of doing it is something like this:

first_repo = None
retries = 0
for d, v in self.__gen_repo(...):
        if not first_repo:
                first_repo = d
                retries += 1
        elif first_repo == d:
                retries += 1

...as I said, the cleaner way is to have __gen_repo() yield the attempt number.
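That generator-based approach might be sketched like this (simplified stand-in for the real __gen_repo(); origins are just strings here):

```python
def gen_repo(origins, retry_count):
    """Yield (attempt, origin) pairs: one full pass over the
    configured origins per attempt, so the caller always knows
    which retry it is on, however many origins are involved."""
    for attempt in range(1, retry_count + 1):
        for origin in origins:
            yield attempt, origin
```

A caller seeing `attempt > 1` knows it is on a genuine retry rather than merely the next origin in the same pass.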

-Shawn

joha...@opensolaris.org

Mar 5, 2012, 5:17:44 PM
to Tim Foster, pkg discuss
On Tue, Mar 06, 2012 at 10:08:00AM +1300, Tim Foster wrote:
> I've got an incremental, and new webrev at:
> https://cr.opensolaris.org/action/browse/pkg/timf/poisoned-cache-v2/

On lines 1286-1298 and 1887-1899 you're not recording the fact that
you've seen a content error, which is going to skew the repository
statistics for future choices.

In general, I'm not really thrilled with this approach. It might be
simpler to set the cache-clearing headers on any subsequent attempt to
connect to a repository that has already given us a content error.

To do this, you'd want to add the iteration count to gen_repos, as Shawn
suggested. You'll also want to add a content_errors property to the
repo statistics, in stats.py. Then, instead of doing a big dance with
adjusting the retry count on the fly, simply record a content error when
verification fails. At the beginning of the 'for d, v in
self.__gen_repo()' loop, you can check whether you're on a subsequent
iteration, and if so, look at the repostats to determine if you've seen
a content error.
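The stats.py change being suggested could be as simple as the following (a toy sketch; the real RepoStats class has far more state than this):

```python
class RepoStats(object):
    """Per-repository counters; content errors are tracked
    separately from other (e.g. transient transport) errors."""

    def __init__(self, url):
        self.url = url
        self.errors = 0
        self.content_errors = 0

    def record_error(self, content=False):
        self.errors += 1
        if content:
            self.content_errors += 1
```

At the top of the 'for d, v in self.__gen_repo()' loop, a later iteration could then check whether `content_errors > 0` for that repo and add the cache-clearing headers before refetching.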

-j

Tim Foster

Mar 5, 2012, 5:55:42 PM
to pkg discuss
On 03/ 6/12 11:17 AM, joha...@opensolaris.org wrote:
> On Tue, Mar 06, 2012 at 10:08:00AM +1300, Tim Foster wrote:
>> I've got an incremental, and new webrev at:
>> https://cr.opensolaris.org/action/browse/pkg/timf/poisoned-cache-v2/
>
> On lines 1286-1298 and 1887-1899 you're not recording the fact that
> you've seen a content error, which is going to skew the repository
> statistics for future choices.

Fair point - I'm hand-wringing a bit here, as to whether a fast cache
that occasionally returns bad content should be treated differently from
a slow or unavailable repository...

> In general, I'm not really thrilled with this approach. It might be
> simpler to set the cache-clearing headers on any subsequent attempt to
> connect to a repository that has already given us a content error.
>
> To do this, you'd want to add the iteration count to gen_repos, as Shawn
> suggested.

That sounds reasonable too, thanks for the suggestion. I'm already part
of the way through updating __gen_repos after Shawn's recent suggestion,
so I'll factor this in too.

> You'll also want to add a content_errors property to the
> repo statistics, in stats.py. Then, instead of doing a big dance with
> adjusting the retry count on the fly, simply record a content error when
> verification fails. At the beginning of the 'for d, v in
> self.__gen_repo()' loop, you can check whether you're on a subsequent
> iteration, and if so, look at the repostats to determine if you've seen
> a content error.

I'll give that a go. Recording content errors seems like a good start,
and perhaps might allow us to take different actions on repositories
with content errors in the future.

Apart from the bug report, I don't know how prevalent this problem is in
the wild, but it certainly felt like something we should be addressing.
I'll go back to the drawing board, and send another webrev when I'm
done. Thanks for the suggestions.

cheers,
tim

joha...@opensolaris.org

Mar 5, 2012, 6:20:18 PM
to Tim Foster, pkg discuss
On Tue, Mar 06, 2012 at 11:55:42AM +1300, Tim Foster wrote:
> On 03/ 6/12 11:17 AM, joha...@opensolaris.org wrote:
> >On Tue, Mar 06, 2012 at 10:08:00AM +1300, Tim Foster wrote:
> >>I've got an incremental, and new webrev at:
> >>https://cr.opensolaris.org/action/browse/pkg/timf/poisoned-cache-v2/
> >
> >On lines 1286-1298 and 1887-1899 you're not recording the fact that
> >you've seen a content error, which is going to skew the repository
> >statistics for future choices.
>
> Fair point - I'm hand-wringing a bit here, as to whether a fast
> cache that occasionally returns bad content should be treated
> differently to a slow or unavailable repository..

I'm not sure what to make of this either. I wondered if this should be
a per-publisher policy option. If the user has configured more than one
repository for a particular publisher, then we'd probably want to choose
the one that's fastest and has the lowest error rate. However, I
understand that you're seeing this problem for system repositories,
where another location may not be configured or available. Another
option might be to only implement that policy for the system repo, but
it does seem useful in other situations.

I also wondered if it would just be simpler to provide an override
option on the command line, but that would require users to give up and
re-try the entire installation, which isn't great either.

> >You'll also want to add a content_errors property to the
> >repo statistics, in stats.py. Then, instead of doing a big dance with
> >adjusting the retry count on the fly, simply record a content error when
> >verification fails. At the beginning of the 'for d, v in
> >self.__gen_repo()' loop, you can check whether you're on a subsequent
> >iteration, and if so, look at the repostats to determine if you've seen
> >a content error.
>
> I'll give that a go. Recording content errors seems like a good
> start, and perhaps might allow us to take different actions on
> repositories with content errors in the future.

We do sort of take a different action now, in the sense that a content
error is weighted differently than a transient error. Repositories that
constantly give us content errors eventually fall out of favor, but I
suppose we could further differentiate between incorrectly cached
content and persistently corrupted content.

> Apart from the bug report, I don't know how prevalent this problem
> is in the wild, but it certainly felt like something we should be
> addressing. I'll go back to the drawing board, and send another
> webrev when I'm done. Thanks for the suggestions.

Me either really. Shawn and I discussed this a bit, long ago. But I
can't remember if it was in response to a problem we saw, or if it was
just theoretical.

Sorry to throw a monkey wrench into things this late in the review
process. My hope is that this will actually simplify the code.

-j

Tim Foster

Mar 6, 2012, 11:39:11 PM
to pkg discuss
On 03/ 6/12 11:17 AM, joha...@opensolaris.org wrote:
> On Tue, Mar 06, 2012 at 10:08:00AM +1300, Tim Foster wrote:
>> I've got an incremental, and new webrev at:
>> https://cr.opensolaris.org/action/browse/pkg/timf/poisoned-cache-v2/
>
> On lines 1286-1298 and 1887-1899 you're not recording the fact that
> you've seen a content error, which is going to skew the repository
> statistics for future choices.
>
> In general, I'm not really thrilled with this approach. It might be
> simpler to set the cache-clearing headers on any subsequent attempt to
> connect to a repository that has already given us a content error.

Ok, I've had another go at this, and have a webrev at:

https://cr.opensolaris.org/action/browse/pkg/timf/poisoned-cache-v3

I'm still trying to track down an intermittent problem with the new test
I'm introducing here: Apache seems not to cache a URL fast enough when
we're running lots of tests at the same time[1], so the cache returns
good data when we're expecting corrupt data. (I'm using urllib2 as the
client to verify the data is corrupt, so it's not the pkg code at fault
here.)

Apart from that though, comments welcome,

cheers,
tim


[1] it's possible it hasn't had a chance to flush its cache to disk, and
so my code to corrupt the cache doesn't take effect

joha...@opensolaris.org

Mar 7, 2012, 2:06:07 PM
to Tim Foster, pkg discuss
On Wed, Mar 07, 2012 at 05:39:11PM +1300, Tim Foster wrote:
> Ok, I've had another go at this, and have a webrev at:
>
> https://cr.opensolaris.org/action/browse/pkg/timf/poisoned-cache-v3

Thanks for making these modifications. I like this change. Just a few
nits:

stats.py:

- line 320: typo "operaiton" should be "operation"

transport.py:

- lines 1870-1874: The TransportFailures object is designed to
accumulate multiple exceptions, and pretty-print them once we reach
the maximum number of retries. Unless I've misunderstood what
you're trying to accomplish, it looks like this could be simplified
to the following:

except tx.InvalidContentException, e:
repostats.record_error(content=True)
failures.append(apx.InvalidDepotResponseException(
d.get_url(), "Unable to parse "
"repository's versions/0 response"))

-j

Tim Foster

Mar 7, 2012, 10:54:03 PM
to pkg discuss
On 03/ 8/12 08:06 AM, joha...@opensolaris.org wrote:
> On Wed, Mar 07, 2012 at 05:39:11PM +1300, Tim Foster wrote:
>> Ok, I've had another go at this, and have a webrev at:
>>
>> https://cr.opensolaris.org/action/browse/pkg/timf/poisoned-cache-v3
>
> Thanks for making these modifications. I like this change.

No problem at all - the changes you suggested did result in nicer code imho.

> Just a few nits:
>
> stats.py:
>
> - line 320: typo "operaiton" should be "operation"

Thanks, fixed.

> transport.py:
>
> - lines 1870-1874: The TransportFailures object is designed to
> accumulate multiple exceptions, and pretty-print them once we reach
> the maximum number of retries. Unless I've misunderstood what
> you're trying to accomplish, it looks like this could be simplified
> to the following:
>
> except tx.InvalidContentException, e:
> repostats.record_error(content=True)
> failures.append(apx.InvalidDepotResponseException(
> d.get_url(), "Unable to parse "
> "repository's versions/0 response"))

Yep, that looks ok, and I'll make the change, but it might actually be
overkill. I was trying to harden our retrieval of versions/ as much as
the other operations, but unfortunately it looks like getting a bad
versions/ response anywhere really ruins pkg(1)'s day.

Very early on, we look for it in the middle of __gen_repo(..) and bail
out on the first instance of a corrupt version response, skipping all
subsequent retries and all of the additional logic I'm introducing here :-(

I'm inclined not to worry too much about this particular case, unless
folks feel we ought to be doing better here (as is, the system
repository won't try to cache versions/ responses, so that base should
be covered)

cheers,
tim

ps. found and fixed the test problem I was facing yesterday - my test
was doing "pkg install foo" but not taking into account that when other
tests had been run, the latest version of foo in the repo wasn't
necessarily the one I had published.
