Intent to Implement: Stale-While-Revalidate


Dave Tapuska

Jun 21, 2018, 1:04:19 PM
to blink-dev

Contact emails

dtap...@chromium.org


Explainer & Design Doc

Design Doc


Tag Review not required as this isn't web exposed.


Summary

Implement stale-while-revalidate processing in the Cache-Control header, allowing stale resources to be served from the cache while they are asynchronously revalidated.


Motivation

Allowing websites to find a balance between rapid deployment and improved time to load is important. A website might use JavaScript bootstrapping code that it wants to deploy with a short max-age (for rapid deployment) while still allowing it to be served stale for a longer duration. Allowing it to be served stale to a page removes the need for the resource to block the load of the page when the rest of the resources are in the cache. Authors expect that the resource will be revalidated shortly thereafter.
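
For example (illustrative values only, not a recommendation), such a resource could be served with:

  Cache-Control: max-age=600, stale-while-revalidate=604800

i.e. ten minutes of freshness for rapid deployment, plus a week during which a stale copy may still be used while a background revalidation picks up any update.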


Risks

Interoperability and Compatibility

Service Worker API will see the revalidation requests. This might complicate some scenarios.


Edge: No signals

Firefox: No signals

Safari: No signals

Web developers: Unknown


Ergonomics

N/A


Activation

An HTTP server change would be needed. Some resources, such as CSS fonts and AMP bootstrap code, are already served with this cache directive.
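
For example, for a site fronted by nginx the change could be as small as the following (illustrative values; any server able to set response headers works the same way):

  add_header Cache-Control "max-age=600, stale-while-revalidate=604800" always;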


Debuggability

DevTools will show an additional resource request when a resource is being revalidated.


Will this feature be supported on all six Blink platforms (Windows, Mac, Linux, Chrome OS, Android, and Android WebView)?

Yes


Link to entry on the feature dashboard

https://www.chromestatus.com/feature/5050913014153216


Requesting approval to ship?

No, will conduct origin trial. An Intent to Experiment will be sent once the code has landed.



PhistucK

Jun 21, 2018, 1:11:21 PM
to Dave Tapuska, blink-dev
> Tag Review not required as this isn't web exposed.
Sounds pretty web exposed to me, even if not through an API.
The page will have stale resources, which can affect its behavior; it can know that it has stale resources if there is a mixture, for example.

PhistucK



Ben Kelly

Jun 22, 2018, 9:30:22 AM
to PhistucK, Dave Tapuska, blink-dev
On Thu, Jun 21, 2018 at 1:10 PM, PhistucK <phis...@gmail.com> wrote:
> Tag Review not required as this isn't web exposed.
Sounds pretty web exposed to me, even if not through an API.
The page will have stale resources, which can affect its behavior; it can know that it has stale resources if there is a mixture, for example.

Also, the intent mentions that the requests will be directly visible to service workers, which is quite web exposed.

Dave, what is your plan for standardizing this feature?  (Stealing Anne's usual question here...)

Since it's implemented at a high level, and not just in the HTTP cache, it seems like it will need to be added to the Fetch spec at a minimum.  Other specs, like HTML, may also need to be modified to set the new stale-while-revalidate value when they initiate loading through the Fetch spec.

For example, see how the Fetch spec handles other cache-control machinery:

  https://fetch.spec.whatwg.org/#concept-request-cache-mode
  https://fetch.spec.whatwg.org/#requestcache

I do see you reference this:

  https://tools.ietf.org/html/rfc5861

That seems to focus on HTTP cache-level changes, but your design suggests you are implementing it at a different layer in the browser, in a way that will be observable.

Thanks.

Ben

David Benjamin

Jun 22, 2018, 3:56:18 PM
to Ben Kelly, PhistucK, Dave Tapuska, blink-dev
I'm going to ask a question then answer it, since I already know and am happy with the answer, but I feel it should be mentioned in the thread somewhere... :-)

(Wearing my question-asking hat)

In the past, when we've looked at stale-while-revalidate, we had trouble trying to use it sanely in a browser. My sense is that it was originally designed with more a CDN-like use in mind. Typically in a CDN, connectivity to the origin is reliable, there is no difference between an out-of-band and in-band request, and you expect a single cache to serve many clients talking to a site. The CDN and origin also typically have some kind of relationship, so there is much less worry about whether the CDN is willing to make a request at this time.

In a browser, none of those hold. Out-of-band and in-band requests are quite different. Requests may trigger user interaction (auth prompts), and there is an expectation that a site "stops doing things" when one closes a tab. The client may be offline or have flaky connectivity, so the revalidation may fail. Or perhaps we have some local policy (extension?) that rejects such revalidations. Moreover, an async revalidation is inherently predictive. It extends the max-age and stale-while-revalidate window for future requests. If no future request hits that window, the revalidation is useless. (It is also predictive in a CDN, but as the CDN's cache services many clients, it's a very solid prediction.) The browser's HTTP cache serves not just the site developer, but also the user, who may be visiting another site or have limited resources. Being predictive, the case for dropping those under load becomes very strong.

All together, this means we must strongly consider revalidations failing on the client. In a naive stale-while-revalidate implementation, a failed revalidation acts as if we had written a larger max-age. If the site author was okay with that, why didn't they set the larger max-age? This is an apparent contradiction. The spec nominally allows for this (it only says revalidation is a SHOULD), but max-age=<1 day>, stale-while-revalidate=<1 week> means very different things depending on whether the revalidations will happen or not. We need clear semantics here between the client and the server.

(Wearing my question-answering hat)

The problem is stale-while-revalidate's semantics are tied to revalidation success. We need to decouple those. The proposed design is to clamp the stale-while-revalidate period on use: Let T be some small grace period, say one minute. If we use a resource in the stale-while-revalidate period, update the end of the period to min(currentEndpoint, now + T). Then return the resource stale, but also request that the upper layers asynchronously revalidate the resource.

This effectively implements a revalidation timeout of T. This preserves the good properties of stale-while-revalidate. If revalidation completes within now+T, cache behavior is better and we behave as a naive stale-while-revalidate implementation in the good case. At the same time, it avoids the bad properties. If revalidation fails, we start requiring a revalidation, gracefully decaying to the pre-stale-while-revalidate behavior. (For completeness, if revalidation completes, but misses the now+T timeout, that is also okay. Requests after the revalidation will still hit the cache.)
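
In sketch form (illustrative Python, not the actual cache code; CacheEntry, GRACE_PERIOD and friends are made-up names for the example):

  import time

  GRACE_PERIOD = 60  # T: the small grace period, e.g. one minute

  class CacheEntry:
      def __init__(self, response, max_age, swr):
          now = time.time()
          self.response = response
          self.fresh_until = now + max_age         # end of the max-age window (M)
          self.swr_until = self.fresh_until + swr  # end of the SwR window (M+S)

      def lookup(self):
          """Return (response, needs_async_revalidation), or None when a
          blocking revalidation is required."""
          now = time.time()
          if now < self.fresh_until:
              return self.response, False          # fresh: use as-is
          if now < self.swr_until:
              # Serve stale, but clamp the window end to min(current end, now + T)
              # so a failed or dropped revalidation only buys T more seconds.
              self.swr_until = min(self.swr_until, now + GRACE_PERIOD)
              return self.response, True           # ask upper layers to revalidate
          return None                              # past M+S: must revalidate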

This gives clear semantics for stale-while-revalidate. The author can tune Cache-Control based on their requirements. The semantics of max-age=M and stale-while-revalidate=S are:

If the resource is within M, just use it stale. I am okay with having to wait M for an update to be rolled out, in exchange for M worth of cache. Past M, if it's still within M+S, I'm dubiously okay with it being used stale. I really want M+S worth of cache because M is too small, but I still want to know updates are rolled out after M. This is impossible, so I will concede waiting M+S for an update to be completely rolled out. But I want the update to be mostly rolled out after M. Thus there must be some bound on using resources in the [M, M+S) range. Finally, past M+S, you must revalidate. M+S is the oldest resource I'm willing to consider.

At the same time, these semantics do not depend on revalidation success, satisfying the other requirement.


Jeffrey Yasskin

Jun 22, 2018, 4:28:33 PM
to Mark Nottingham, bke...@mozilla.com, davidben, Dave Tapuska, blink-dev
FYI to mnot, the author of https://tools.ietf.org/html/rfc5861, in case any of the details here argue for updating that RFC.

It's definitely worth talking to Anne about how Fetch should integrate this behavior for stale-while-revalidate. Actually getting those changes into the spec might take a lot of elaboration of the existing cache behavior, though.

Jeffrey

Dave Tapuska

Jun 22, 2018, 4:35:53 PM
to Jeffrey Yasskin, Mark Nottingham, bke...@mozilla.com, davidben, blink-dev
Yes, there is stuff to figure out with the Fetch API. Perhaps we don't let it know about the revalidation request at all, since it really is just a cache-level operation, and any later request for the same resource will go through the Fetch API.

Dave


Mark Nottingham

Jun 24, 2018, 7:02:32 PM
to Jeffrey Yasskin, bke...@mozilla.com, davidben, Dave Tapuska, blink-dev
Thanks Jeffrey / Hi David,

> On 23 Jun 2018, at 6:28 am, Jeffrey Yasskin <jyas...@chromium.org> wrote:
>
> FYI to mnot, the author of https://tools.ietf.org/html/rfc5861, in case any of the details here argue for updating that RFC.
>
> It's definitely worth talking to Anne about how Fetch should integrate this behavior for stale-while-revalidate. Actually getting those changes into the spec might take a lot of elaboration of the existing cache behavior, though.
>
> Jeffrey
>
> On Fri, Jun 22, 2018 at 12:56 PM David Benjamin <davi...@chromium.org> wrote:
> I'm going to ask a question then answer it, since I already know and am happy with the answer, but I feel it should be mentioned in the thread somewhere... :-)
>
> (Wearing my question-asking hat)
>
> In the past, when we've looked at stale-while-revalidate, we had trouble trying to use it sanely in a browser. My sense is that it was originally designed with more a CDN-like use in mind. Typically in a CDN, connectivity to the origin is reliable, there is no difference between an out-of-band and in-band request, and you expect a single cache to serve many clients talking to a site. The CDN and origin also typically have some kind of relationship, so there is much less worry about whether the CDN is willing to make a request at this time.

The original motivation was back-end API caching at Yahoo!, but yes that's closer to CDN than browser.

> In a browser, none of those hold. Out-of-band and in-band requests are quite different. Requests may trigger user interaction (auth prompts), and there is an expectation that a site "stops doing things" when one closes a tab. The client may be offline or have flaky connectivity, so the revalidation may fail. Or perhaps we have some local policy (extension?) that rejects such revalidations.

Just to make sure we're on the same page -- nothing about SwR requires a cache to revalidate in any given situation; not only would that be backwards-incompatible with caches that don't implement it, but it also is against the whole nature of caching as an optimisation. Caches are free to treat it like a hint and use additional information / heuristics to help decide when and how to revalidate.

> Moreover, an async revalidation is inherently predictive. It extends the max-age and stale-while-revalidate window for future requests. If no future request hits that window, the revalidation is useless.

Yes. Effectively, a hit during the SwR window is used as an indication that it's worth trying a revalidation.

> (It is also predictive in a CDN, but as the CDN's cache services many clients, it's a very solid prediction.)

Depending on the nature of the content; while CDN hit rates are generally higher than browsers and forward caches, lots of little-used content goes through CDNs. Regardless, this actually shouldn't matter too much; if the revalidation ends up not getting used, no more requests were issued than would have been without SwR.

> The browser's HTTP cache serves not just the site developer, but also the user, who may be visiting another site or have limited resources. Being predictive, the case for dropping those under load becomes very strong.

As per above, this is *additional* efficiency that the cache can eke out if it decides it wants to drop SwR revalidations -- which would not be available without it.

> All together, this means we must strongly consider revalidations failing on the client. In a naive stale-while-revalidate implementation, a failed revalidation acts as if we had written a larger max-age.

I don't think anyone has implemented it that way; generally people will either immediately consider the object "truly" stale, or retry (with some sort of backoff heuristic).

> If the site author was okay with that, why didn't they set the larger max-age? This is an apparent contradiction. The spec nominally allows for this (it only says revalidation is a SHOULD), but max-age=<1 day>, stale-while-revalidate=<1 week> means very different things depending on whether the revalidations will happen or not. We need clear semantics here between the client and the server.

See above. SwR is designed the way it is so that it's backwards-compatible with caches that don't implement it.

> (Wearing my question-answering hat)
>
> The problem is stale-while-revalidate's semantics are tied to revalidation success.

Can you dig in here a bit? Why do you say this?

> We need to decouple those. The proposed design is to clamp the stale-while-revalidate period on use: Let T be some small grace period, say one minute. If we use a resource in the stale-while-revalidate period, update the end of the period to min(currentEndpoint, now + T). Then return the resource stale, but also request that the upper layers asynchronously revalidate the resource.
>
> This effectively implements a revalidation timeout of T. This preserves the good properties of stale-while-revalidate. If revalidation completes within now+T, cache behavior is better and we behave as a naive stale-while-revalidate implementation in the good case. At the same time, it avoids the bad properties. If revalidation fails, we start requiring a revalidation, gracefully decaying to the pre-stale-while-revalidate behavior. (For completeness, if revalidation completes, but misses the now+T timeout, that is also okay. Requests after the revalidation will still hit the cache.)
>
> This gives clear semantics for stale-while-revalidate.

This is certainly a valid approach (if I understand you correctly). I don't think we can say that it provides clarity about SwR's semantics, given that other implementations have taken other, equally valid approaches.

> The author can tune Cache-Control based on their requirements. The semantics of max-age=M and stale-while-revalidate=S are:
>
> If the resource is within M, just use it stale.

If it's within Cache-Control: max-age, how is it stale (assuming that the rest of the freshness algorithm lets it be fresh)?

> I am okay with having to wait M for an update to be rolled out, in exchange for M worth of cache. Past M, if it's still within M+S, I'm dubiously okay with it being used stale. I really want M+S worth of cache because M is too small, but I still want to know updates are rolled out after M. This is impossible, so I will concede waiting M+S for an update to be completely rolled out. But I want the update to be mostly rolled out after M. Thus there must be some bound on using resources in the [M, M+S) range. Finally, past M+S, you must revalidate. M+S is the oldest resource I'm willing to consider.

That is roughly the current semantics of SwR (delta the point about CC: max-age).

Cheers,




--
Mark Nottingham https://www.mnot.net/

David Benjamin

Jun 25, 2018, 11:10:12 AM
to Mark Nottingham, Jeffrey Yasskin, bke...@mozilla.com, Dave Tapuska, blink-dev
On Sun, Jun 24, 2018 at 7:02 PM Mark Nottingham <mn...@mnot.net> wrote:
Thanks Jeffrey / Hi David,

> On 23 Jun 2018, at 6:28 am, Jeffrey Yasskin <jyas...@chromium.org> wrote:
>
> FYI to mnot, the author of https://tools.ietf.org/html/rfc5861, in case any of the details here argue for updating that RFC.
>
> It's definitely worth talking to Anne about how Fetch should integrate this behavior for stale-while-revalidate. Actually getting those changes into the spec might take a lot of elaboration of the existing cache behavior, though.
>
> Jeffrey
>
> On Fri, Jun 22, 2018 at 12:56 PM David Benjamin <davi...@chromium.org> wrote:
> I'm going to ask a question then answer it, since I already know and am happy with the answer, but I feel it should be mentioned in the thread somewhere... :-)
>
> (Wearing my question-asking hat)
>
> In the past, when we've looked at stale-while-revalidate, we had trouble trying to use it sanely in a browser. My sense is that it was originally designed with more a CDN-like use in mind. Typically in a CDN, connectivity to the origin is reliable, there is no difference between an out-of-band and in-band request, and you expect a single cache to serve many clients talking to a site. The CDN and origin also typically have some kind of relationship, so there is much less worry about whether the CDN is willing to make a request at this time.

The original motivation was back-end API caching at Yahoo!, but yes that's closer to CDN than browser.

> In a browser, none of those hold. Out-of-band and in-band requests are quite different. Requests may trigger user interaction (auth prompts), and there is an expectation that a site "stops doing things" when one closes a tab. The client may be offline or have flaky connectivity, so the revalidation may fail. Or perhaps we have some local policy (extension?) that rejects such revalidations.

Just to make sure we're on the same page -- nothing about SwR requires a cache to revalidate in any given situation; not only would that be backwards-incompatible with caches that don't implement it, but it also is against the whole nature of caching as an optimisation. Caches are free to treat it like a hint and use additional information / heuristics to help decide when and how to revalidate.

The problem is the behavior of stale-while-revalidate is very different depending on whether that revalidation happens at all. As you say, it's the cache's choice whether and how to revalidate. However, the exact behavior here affects the semantics significantly. Consider these two implementations:

A: The cache always serves the resource stale in the SwR window, but the revalidations basically always work because connectivity between cache and backend is solid.

B: The cache always serves the resource stale in the SwR window, but it never bothers to revalidate [or revalidations always fail]. This is fine per spec, as it's merely a SHOULD.

These two behave very differently. (A) is the intuitive semantics one would expect out of SwR. (B) is just a larger max-age, which is presumably not what the author wanted. Now, (B) is rather absurd of an implementation, but the problem is failed revalidations decay to (B), while the author expected (A)'s staleness behavior.

The effect is amplified when you consider SwR windows wide enough to be of any use for the browser. The RFC's example of Cache-Control: max-age=600, stale-while-revalidate=30 makes (A) and (B) roughly the same. But it will never be hit in a browser where, unlike a backend cache shared by multiple clients, one usually does not expect continuous access of a resource by just one client. More plausible for a browser would be a SwR window measured in weeks or days. Now (A) and (B) are very meaningfully different.
 
> Moreover, an async revalidation is inherently predictive. It extends the max-age and stale-while-revalidate window for future requests. If no future request hits that window, the revalidation is useless.

Yes. Effectively, a hit during the SwR window is used as an indication that it's worth trying a revalidation. 

The point about predictiveness is to emphasize the possibility of dropping the revalidation.

> (It is also predictive in a CDN, but as the CDN's cache services many clients, it's a very solid prediction.)

Depending on the nature of the content; while CDN hit rates are generally higher than browsers and forward caches, lots of little-used content goes through CDNs. Regardless, this actually shouldn't matter too much; if the revalidation ends up not getting used, no more requests were issued than would have been without SwR.

> The browser's HTTP cache serves not just the site developer, but also the user, who may be visiting another site or have limited resources. Being predictive, the case for dropping those under load becomes very strong.

As per above, this is *additional* efficiency that the cache can eke out if it decides it wants to drop SwR revalidations -- which would not be available without it.

I think the "no more requests" reduction is a bit more nuanced. Background and foreground requests are different. Rate limiting is usually accounted based on live requests and live tabs. (This corresponds to a user expectation: closing the tab must make it go away.) The SwR revalidations detach from that and thus need to be separately rate-limited.
 
> All together, this means we must strongly consider revalidations failing on the client. In a naive stale-while-revalidate implementation, a failed revalidation acts as if we had written a larger max-age.

I don't think anyone has implemented it that way; generally people will either immediately consider the object "truly" stale, or retry (with some sort of backoff heuristic).

To clarify, you mean that people will generally, if the revalidation fails/times out, go back and invalidate the cache entry? It sounds like we're on the same page then. However, the specification mentions nothing of the sort. We would generally consider this a critical specification flaw. This provision is needed for clear, predictable behavior.

Indeed, I believe the first implementation attempt here neglected this detail, which is a data point that the specification was lacking.

> If the site author was okay with that, why didn't they set the larger max-age? This is an apparent contradiction. The spec nominally allows for this (it only says revalidation is a SHOULD), but max-age=<1 day>, stale-while-revalidate=<1 week> means very different things depending on whether the revalidations will happen or not. We need clear semantics here between the client and the server.

See above. SwR is designed the way it is so that it's backwards-compatible with caches that don't implement it.

> (Wearing my question-answering hat)
>
> The problem is stale-while-revalidate's semantics are tied to revalidation success.

Can you dig in here a bit? Why do you say this?

See above, the difference between scenarios (A) and (B).
 
> We need to decouple those. The proposed design is to clamp the stale-while-revalidate period on use: Let T be some small grace period, say one minute. If we use a resource in the stale-while-revalidate period, update the end of the period to min(currentEndpoint, now + T). Then return the resource stale, but also request that the upper layers asynchronously revalidate the resource.
>
> This effectively implements a revalidation timeout of T. This preserves the good properties of stale-while-revalidate. If revalidation completes within now+T, cache behavior is better and we behave as a naive stale-while-revalidate implementation in the good case. At the same time, it avoids the bad properties. If revalidation fails, we start requiring a revalidation, gracefully decaying to the pre-stale-while-revalidate behavior. (For completeness, if revalidation completes, but misses the now+T timeout, that is also okay. Requests after the revalidation will still hit the cache.)
>
> This gives clear semantics for stale-while-revalidate.

This is certainly a valid approach (if I understand you correctly). I don't think we can say that it provides clarity about SwR's semantics, given that other implementations have taken other, equally valid approaches.

See above. The specification failed to detail this.
 
> The author can tune Cache-Control based on their requirements. The semantics of max-age=M and stale-while-revalidate=S are:
>
> If the resource is within M, just use it stale.

If it's within Cache-Control: max-age, how is it stale (assuming that the rest of the freshness algorithm lets it be fresh)?

s/just use it stale/just use it; it's not stale/. Terminology mismatch. :-)
 
> I am okay with having to wait M for an update to be rolled out, in exchange for M worth of cache. Past M, if it's still within M+S, I'm dubiously okay with it being used stale. I really want M+S worth of cache because M is too small, but I still want to know updates are rolled out after M. This is impossible, so I will concede waiting M+S for an update to be completely rolled out. But I want the update to be mostly rolled out after M. Thus there must be some bound on using resources in the [M, M+S) range. Finally, past M+S, you must revalidate. M+S is the oldest resource I'm willing to consider.

That is roughly the current semantics of SwR (delta the point about CC: max-age).

Right, these are the intuitive semantics. But they are not achieved without some clear provision for failed revalidations.

Mark Nottingham

Jun 26, 2018, 1:06:11 AM
to David Benjamin, Jeffrey Yasskin, bke...@mozilla.com, Dave Tapuska, blink-dev
(I've tried to recreate the quoting below; apologies for any mistakes. Many thanks to whoever can hunt down the appropriate PM for Gmail and "persuade" them to change this behaviour.)


On 26 Jun 2018, at 1:09 am, David Benjamin <davi...@chromium.org> wrote:

>> Just to make sure we're on the same page -- nothing about SwR requires a cache to revalidate in any given situation; not only would that be backwards-incompatible with caches that don't implement it, but it also is against the whole nature of caching as an optimisation. Caches are free to treat it like a hint and use additional information / heuristics to help decide when and how to revalidate.
>
> The problem is the behavior of stale-while-revalidate is very different depending on whether that revalidation happens at all. As you say, it's the cache's choice whether and how to revalidate. However, the exact behavior here affects the semantics significantly. Consider these two implementations:
>
> A: The cache always serves the resource stale in the SwR window, but the revalidations basically always work because connectivity between cache and backend is solid.
>
> B: The cache always serves the resource stale in the SwR window, but it never bothers to revalidate [or revalidations always fail]. This is fine per spec, as it's merely a SHOULD.
>
> These two behave very differently. (A) is the intuitive semantics one would expect out of SwR. (B) is just a larger max-age, which is presumably not what the author wanted. Now, (B) is rather absurd of an implementation, but the problem is failed revalidations decay to (B), while the author expected (A)'s staleness behavior.

There's also:

C: The cache doesn't receive any requests during the stale window until the very end, where it serves the stale content, makes the async request successfully but never uses the refreshed response before it becomes stale.

From the standpoint of the author, B and C are very similar, and C isn't that uncommon (even on a CDN or reverse proxy; we have unpopular content too, and we also have connectivity problems to the origin). The author *really* has to be comfortable with that stale content being used for its entire window, even if popular content won't exercise that in practice all the time.


> The effect is amplified when you consider SwR windows wide enough to be of any use for the browser. The RFC's example of Cache-Control: max-age=600, stale-while-revalidate=30 makes (A) and (B) roughly the same. But it will never be hit in a browser where, unlike a backend cache shared by multiple clients, one usually does not expect continuous access of a resource by just one client. More plausible for a browser would be a SwR window measured in weeks or days. Now (A) and (B) are very meaningfully different.

So, this table shows SwR values in the HTTP Archive - first column is the SwR value, second is count of that value seen in the latest run:
https://docs.google.com/spreadsheets/d/1bV32i_KvJ7_ywTPWApxGjQE_Ovb6uRD3-BEAgC5nd90/edit?usp=sharing

The biggest spike there is at seven days. Playing with the query a bit, it seems like almost all of the upper-end values are JS and CSS.

I'm sure some of those values are set with Chrome in mind, but given that SwR is pretty widely supported by CDNs and reverse proxies, I strongly suspect there's a good number targeting intermediary caches specifically.

But yes, the traffic profile for a CDN or reverse proxy is going to be very different from what you see. Do you think what you suggest will work for them equally well?


>> > (It is also predictive in a CDN, but as the CDN's cache services many clients, it's a very solid prediction.)
>>
>> Depending on the nature of the content; while CDN hit rates are generally higher than browsers and forward caches, lots of little-used content goes through CDNs. Regardless, this actually shouldn't matter too much; if the revalidation ends up not getting used, no more requests were issued than would have been without SwR.
>>
>> > The browser's HTTP cache serves not just the site developer, but also the user, who may be visiting another site or have limited resources. Being predictive, the case for dropping those under load becomes very strong.
>>
>> As per above, this is *additional* efficiency that the cache can eke out if it decides it wants to drop SwR revalidations -- which would not be available without it.
>
> I think the "no more requests" reduction is a bit more nuanced. Background and foreground requests are different. Rate limiting is usually accounted based on live requests and live tabs. (This corresponds to a user expectation: closing the tab must make it go away.) The SwR revalidations detach from that and thus need to be separately rate-limited.

I see; that makes sense from a browser standpoint, but isn't really relevant to an intermediary cache or the origin.


>> > We need to decouple those. The proposed design is to clamp the stale-while-revalidate period on use: Let T be some small grace period, say one minute. If we use a resource in the stale-while-revalidate period, update the end of the period to min(currentEndpoint, now + T). Then return the resource stale, but also request that the upper layers asynchronously revalidate the resource.
>> >
>> > This effectively implements a revalidation timeout of T. This preserves the good properties of stale-while-revalidate. If revalidation completes within now+T, cache behavior is better and we behave as a naive stale-while-revalidate implementation in the good case. At the same time, it avoids the bad properties. If revalidation fails, we start requiring a revalidation, gracefully decaying to the pre-stale-while-revalidate behavior. (For completeness, if revalidation completes, but misses the now+T timeout, that is also okay. Requests after the revalidation will still hit the cache.)
>> >
>> > This gives clear semantics for stale-while-revalidate.
>>
>> This is certainly a valid approach (if I understand you correctly). I don't think we can say that it provides clarity about SwR's semantics, given that other implementations have taken other, equally valid approaches.
>
> See above. The specification failed to detail this.

I'm having a hard time seeing this as a significant improvement in clarity; we go from "I'm OK with the response being served stale during this window, while you attempt to revalidate it in the background" to "I'm OK with it being served stale in this window, while an attempt is made to revalidate it in the background, but if that revalidation attempt fails (for some definition of failure), suddenly it's not OK to serve it stale." It introduces a new factor into whether something is allowed to be served stale.

I.e., if there's no traffic for a week and it's OK to serve a response stale, why is it not OK to serve that same response stale at the same time if there was a failed revalidation driven by a previous request sometime during that week? A transient network or server failure is not likely to affect the resource's state.

It seems like what you really want to do here is to specify how stale responses can or cannot be used when a fresh response can't be obtained. By default, HTTP caches are allowed to do this with all storable responses unless CC: must-revalidate or no-cache are present; we introduced CC: stale-if-error to provide some finer granularity there. Perhaps that's what you're looking for?

E.g., a one-week SwR window with the semantics you define might look like:

Cache-Control: max-age=3600, stale-while-revalidate=604800, stale-if-error=60

If that's interesting, I could see updating the spec to clarify the relationship between SwR and SiE.

If you do want to try to add additional constraints to error handling for SwR independently, the discussion really needs to happen on a list like ietf-h...@w3.org, since lots of caches have implemented it and they're going to want to weigh in.

Cheers,

Kinuko Yasuda

Jun 26, 2018, 3:09:39 AM
to dtap...@chromium.org, Jeffrey Yasskin, mn...@mnot.net, Ben Kelly, David Benjamin, blink-dev
Reg: Fetch and other upper layer spec integration

My current take is that it doesn't look strictly necessary to make this observable to the Fetch or Service Worker API (given revalidations are basically a cache-level operation).  For Service Workers I think a potentially sane behavior could be to make this apply only to requests that actually hit the HTTP cache (i.e. network fallback requests or requests that hit the network from Service Workers); then they are not observable from Service Workers.  That said, I agree that we should have corresponding spec text on how each feature should (or should not) interact with this.

Also: one of the offline discussions we had about this feature is how a stale response in a preload cache (which is yet to be spec'ed) should behave.  Our current tentative conclusion is that once a resource is preloaded it should be kept available even when the real load happens after the allowed stale window (as long as it's from the same page), but this probably needs better clarification in (some) spec.

Kenji Baheux

Jun 26, 2018, 3:10:04 AM
to Mark Nottingham, David Benjamin, Jeffrey Yasskin, Ben Kelly, Dave Tapuska, blink-dev
I don't know if this has been captured in the discussion but:

This feature makes a lot of sense for popular resources, i.e. third parties.
Operators of third party services need some guarantees about the swr timeframe.

For instance, if we allow the use of stale resources for the whole max-age + swr period of time, those operators will have no choice but to use a tiny swr timeframe (e.g. a few % of max-age) which reduces the benefit of the feature. They might as well not use it.

They also don't want to make things worse for browsers that don't support the feature, i.e. better to leave max-age as-is.

When I reached out to key third party services owners, they were happy to set a large swr timeframe (e.g. max-age: 1 hour, swr: 1 week for Google Fonts CSS) if we could guarantee that a stale, and potentially broken, asset would be used at most once (or a couple of times with edge cases). In other words, they were fine with a "broken for up to 15 minutes + 1 use but non blocking for 15 minutes + 7 days, yeah better performance!" (In Google Fonts' case, the 1 week SWR would allow them to achieve 90+% async validations IIRC).
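
(For concreteness, the Google Fonts example above would be a header along these lines, using the 1 hour / 1 week figures:)

  Cache-Control: max-age=3600, stale-while-revalidate=604800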

To get the most out of SWR, the async swr timeframe of popular resources needs to be set so that it can cover as much in-between access blanks as possible:
  • intra-day blanks (morning browsing => blank => lunch browsing => blank => evening browsing): several hours.
  • interday blanks (sleeping, non daily access, etc): half a day ~ several days.



--
Kenji BAHEUX
Product Manager - Chrome
Google Japan

Mark Nottingham

Jun 26, 2018, 7:30:45 PM
to Kenji Baheux, David Benjamin, Jeffrey Yasskin, Ben Kelly, Dave Tapuska, blink-dev
Hi Kenji,

> On 26 Jun 2018, at 5:09 pm, Kenji Baheux <kenji...@google.com> wrote:
>
> I don't know if this has been captured in the discussion but:
>
> This feature makes a lot of sense for popular resources, i.e. third parties.
> Operators of third party services need some guarantees about the swr timeframe.
>
> For instance, if we allow the use of stale resources for the whole max-age + swr period of time, those operators will have no choice but to use a tiny swr timeframe (e.g. a few % of max-age) which reduces the benefit of the feature. They might as well not use it.
>
> They also don't want to make things worse for browsers that don't support the feature, i.e. better to leave max-age as-is.
>
> When I reached out to key third party services owners, they were happy to set a large swr timeframe (e.g. max-age: 1 hour, swr: 1 week for Google Fonts CSS) if we could guarantee that a stale, and potentially broken, asset would be used at most once (or a couple of times with edge cases). In other words, they were fine with a "broken for up to 15 minutes + 1 use but non blocking for 15 minutes + 7 days, yeah better performance!" (In Google Fonts' case, the 1 week SWR would allow them to achieve 90+% async validations IIRC).

Restricting the number of times a stale response is used is problematic; if SwR is being honoured on e.g., a CDN node, it might be used thousands of times even if the first revalidation attempt is successful.

Combining it with SiE (see previous e-mail) would give them the ability to bound the period revalidation could be broken for (if we clarify that); would that work for the cases you're aware of?


> To get the most out of SWR, the async swr timeframe of popular resources needs to be set so that it can cover as much in-between access blanks as possible:
> • intra-day blanks (morning browsing => blank => lunch browsing => blank => evening browsing): several hours.
> • interday blanks (sleeping, non daily access, etc): half a day ~ several days.

Sorry, you've lost me here; when you say "blank" do you mean "period without requests for that URL"? If so, by "cover", do you mean that the stale window is greater than these periods?

Dave Tapuska

Jun 26, 2018, 9:12:48 PM
to Mark Nottingham, Kenji Baheux, davidben, Jeffrey Yasskin, Ben Kelly, blink-dev
I don't think stale-if-error is really what we are after, since its time origin is the time of the original reply. What we are trying to bound is what to do with failed revalidations, collapsing those to the case without stale-while-revalidate instead of allowing it for the whole staleness period. If we were to use stale-if-error this way, then basically this amounts to determining how long after an attempted revalidation you'd apply the stale-if-error condition (i.e., defining the same timeout we already are).

A large SwR window is meant to cover/exceed the in-between periods when most users don't visit a page that uses the URI. Kenji calls these "blanks".

Yes, restricting the resource to being used only one more time after a stale return from the cache is also difficult in our HTTP cache, due to multiple processes, iframes, debuggers, etc. So we chose a small time period during which it can continue to be returned stale.

Since we have limited bandwidth on mobile devices, scheduling resources at the correct time is important to page load. This timeout is more or less a contract between the validator and the cache: if the validation doesn't complete in that window, the cache marks the entry stale, and a subsequent request will be a blocking revalidation. I consider this an implementation detail, because it is the cache's prerogative to return something or not.
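
In rough sketch form (made-up names, not the actual cache implementation, and following the clamping formulation from earlier in the thread), the completion side of that contract might look like:

  import time

  def finish_revalidation(entry, result):
      """Sketch: what the cache does when an async revalidation comes back.
      'entry' carries fresh_until/swr_until timestamps; 'result' carries the
      outcome and the new freshness lifetimes."""
      now = time.time()
      if result.ok:
          # Fresh headers arrived: reset the windows so later requests hit
          # the cache again (fine even if the agreed timeout was missed).
          entry.fresh_until = now + result.max_age
          entry.swr_until = entry.fresh_until + result.swr
      # On failure there is nothing to do: the window was already clamped
      # when the stale response was served, so the next request past that
      # point falls back to a normal, blocking revalidation.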

Dave



Mark Nottingham

Jun 26, 2018, 9:19:22 PM
to Dave Tapuska, Kenji Baheux, davidben, Jeffrey Yasskin, Ben Kelly, blink-dev


> On 27 Jun 2018, at 11:12 am, Dave Tapuska <dtap...@chromium.org> wrote:
>
> I don't think stale-if-error is really what we are after, since its time origin is the time of the original reply. What we are trying to bound is what to do with failed revalidations, collapsing those to the case without stale-while-revalidate instead of allowing it for the whole staleness period. If we were to use stale-if-error this way, then basically this amounts to determining how long after an attempted revalidation you'd apply the stale-if-error condition (i.e., defining the same timeout we already are).

*sigh* true - good point.

David Benjamin

Jun 27, 2018, 4:00:21 PM
to Mark Nottingham, Jeffrey Yasskin, bke...@mozilla.com, Dave Tapuska, blink-dev

On Tue, Jun 26, 2018 at 1:06 AM Mark Nottingham <mn...@mnot.net> wrote:
(I've tried to recreate the quoting below; apologies for any mistakes. Many thanks for whoever can hunt down the appropriate PM for Gmail and "persuade" them to change this behaviour).


On 26 Jun 2018, at 1:09 am, David Benjamin <davi...@chromium.org> wrote:

>> Just to make sure we're on the same page -- nothing about SwR requires a cache to revalidate in any given situation; not only would that be backwards-incompatible with caches that don't implement it, but it also is against the whole nature of caching as an optimisation. Caches are free to treat it like a hint and use additional information / heuristics to help decide when and how to revalidate.
>
> The problem is the behavior of stale-while-revalidate is very different depending on whether that revalidation happens at all. As you say, it's the cache's choice whether and how to revalidate. However, the exact behavior here affects the semantics significantly. Consider these two implementations:
>
> A: The cache always serves the resource stale in the SwR window, but the revalidations basically always work because connectivity between cache and backend is solid.
>
> B: The cache always serves the resource stale in the SwR window, but it never bothers to revalidate [or revalidations always fail]. This is fine per spec, as it's merely a SHOULD.
>
> These two behave very differently. (A) is the intuitive semantics one would expect out of SwR. (B) is just a larger max-age, which is presumably not what the author wanted. Now, (B) is rather absurd of an implementation, but the problem is failed revalidations decay to (B), while the author expected (A)'s staleness behavior.

There's also:

C: The cache doesn't receive any requests during the stale window until the very end, where it serves the stale content, makes the async request successfully but never uses the refreshed response before it becomes stale.

From the standpoint of the author, B and C are very similar, and C isn't that uncommon (even on a CDN or reverse proxy; we have unpopular content too, and we also have connectivity problems to the origin). The author *really* has to be comfortable with that stale content being used for its entire window, even if popular content won't exercise that in practice all the time.

I think C is a different type of distinction from A and B. C is a request pattern, while A and B are cache behaviors. Rather, C's request pattern is where A and B are indistinguishable. So it's not useful in telling us whether B's differences from A are a problem.

Agreed that, no matter what the semantics are, the author must ultimately be okay with stale content being used for the SwR window. That is fundamental to caching. It is also true that the author must be okay with stale content being used for the max-age window. The question is what "okay" means in each of those claims.

Did you intend that the degree of comfort with stale content is the same between the SwR window and max-age window?

If yes, I think SwR was misdesigned. If we believe that max-age and SwR actually imply the same degree of comfort with stale content, then SwR should have been subtractive. Let max-age continue to describe the acceptable staleness window, and then SwR/fixed describes a suffix of it which predictively triggers asynchronous revalidations. Indeed I advocated for this initially. But the feedback was that this wouldn't be useful. Rather, people wanted to set SwR much *much* higher than the max-age values they were comfortable with. They wanted to express a second *larger* window where they still would tolerate the stale responses, but were less okay with it than in the shorter max-age window. They wanted a better overall update rate than the higher max-age would give. And they believed SwR had those semantics.

Indeed, on the face of it, SwR smells like it does. If one believes revalidations always succeed in a timely manner, the SwR window expresses some fuzzy comfort level where one concedes the worst case, but still wishes to get the common case. I assume these are the semantics SwR actually intended. You yourself seemed to agree with those semantics in your previous message.

This falls over when one realizes revalidations may fail. And that failed revalidations decay to treating the SwR window as a max-age window. But we assumed SwR represents a weaker degree of staleness tolerance, so this violates the server's requests. SwR's design does not work in an environment where revalidations fail. To fix this, we need some kind of bound on how much a stale resource will be used in the SwR window. *That* (plus suggestions on how to achieve the goal) is the piece that was critically missing from the specification.

> The effect is amplified when you consider SwR windows wide enough to be of any use for the browser. The RFC's example of Cache-Control: max-age=600, stale-while-revalidate=30 makes (A) and (B) roughly the same. But it will never be hit in a browser where, unlike a backend cache shared by multiple clients, one usually does not expect continuous access of a resource by just one client. More plausible for a browser would be a SwR window measured in weeks or days. Now (A) and (B) are very meaningfully different.

So, this table shows SwR values in the HTTP Archive - first column is the SwR value, second is count of that value seen in the latest run:
  https://docs.google.com/spreadsheets/d/1bV32i_KvJ7_ywTPWApxGjQE_Ovb6uRD3-BEAgC5nd90/edit?usp=sharing

The biggest spike there is at seven days. Playing with the query a bit, it seems like almost all of the upper-end values are JS and CSS.

I'm sure some of those values are set with Chrome in mind, but given that SwR is pretty widely supported by CDNs and reverse proxies, I strongly suspect there's a good number targeting intermediary caches specifically.

But yes, the traffic profile for a CDN or reverse proxy is going to be very different from what you see. Do you think what you suggest will work for them equally well?

By "what [I] suggest", do you mean the revalidation timeout behavior? Yes. I think the revalidation timeout behavior is mostly a no-op for a CDN or reverse proxy. Though you seemed to suggest earlier they indeed implemented something of that nature. (The formulation with the grace period is literally a revalidation timeout. It's the same thing.) But I'm not advocating the revalidation timeout be a MUST-level requirement. Rather it should be a suggested behavior, with the actual requirement being the semantics.
 
>> > (It is also predictive in a CDN, but as the CDN's cache services many clients, it's a very solid prediction.)
>>
>> Depending on the nature of the content; while CDN hit rates are generally higher than browsers and forward caches, lots of little-used content goes through CDNs. Regardless, this actually shouldn't matter too much; if the revalidation ends up not getting used, no more requests were issued than would have been without SwR.
>>
>> > The browser's HTTP cache serves not just the site developer, but also the user, who may be visiting another site or have limited resources. Being predictive, the case for dropping those under load becomes very strong.
>>
>> As per above, this is *additional* efficiency that the cache can eke out if it decides it wants to drop SwR revalidations -- which would not be available without it.
>
> I think the "no more requests" reduction is a bit more nuanced. Background and foreground requests are different. Rate limiting is usually accounted based on live requests and live tabs. (This corresponds to a user expectation: closing the tab must make it go away.) The SwR revalidations detach from that and thus need to be separately rate-limited.

I see; that makes sense from a browser standpoint, but isn't really relevant to an intermediary cache or the origin.

Well, yes, that's kind of the point. Cache-Control directives are some API contract between the server and the cache. Different caches have different needs, so they'll implement that contract in different ways. The specification is not clear on what exactly the requirements are. Instead it specifies one possible behavior, implicitly baking in many assumptions of intermediary caches and origins that are just not valid when revalidations might fail or be dropped.

Failing revalidations are an unavoidable fact of life, so we need to hammer down what exactly the server wanted out of those revalidations to figure out how to translate that semantics beyond the very narrow scope that the specification is currently suitable for.
 
>> > We need to decouple those. The proposed design is to clamp the stale-while-revalidate period on use: Let T be some small grace period, say one minute. If we use a resource in the stale-while-revalidate period, update the end of the period to min(currentEndpoint, now + T). Then return the resource stale, but also request that the upper layers asynchronously revalidate the resource.
>> >
>> > This effectively implements a revalidation timeout of T. This preserves the good properties of stale-while-revalidate. If revalidation completes within now+T, cache behavior is better and we behave as a naive stale-while-revalidate implementation in the good case. At the same time, it avoids the bad properties. If revalidation fails, we start requiring a revalidation, gracefully decaying to the pre-stale-while-revalidate behavior. (For completeness, if revalidation completes, but misses the now+T timeout, that is also okay. Requests after the revalidation will still hit the cache.)
>> >
>> > This gives clear semantics for stale-while-revalidate.
>>
>> This is certainly a valid approach (if I understand you correctly). I don't think we can say that it provides clarity about SwR's semantics, given that other implementations have taken other, equally valid approaches.
>
> See above. The specification failed to detail this.

I'm having a hard time seeing this as a significant improvement in clarity; we go from "I'm OK with the response being served stale during this window, while you attempt to revalidate it in the background" to "I'm OK with it being served stale in this window, while an attempt is made to revalidate it in the background, but if that revalidation attempt fails (for some definition of failure), suddenly it's not OK to serve it stale." It introduces a new factor into whether something is allowed to be served stale.
 
I.e., if there's no traffic for a week and it's OK to serve a response stale, why is it not OK to serve that same response stale at the same time if there was a failed revalidation driven by a previous request sometime during that week? A transient network or server failure is not likely to affect the resource's state.

The difference is we're in the SwR window where (I'm positing) the server wants some reasonable effort be made to revalidate the resource. In an environment where background requests are treated differently from foreground requests and the background revalidations may fail, that means we need to give up on the background ones at some point. Otherwise we risk degrading SwR to a larger max-age, which violates the server expectations.
 
It seems like what you really want to do here is to specify how stale responses can or cannot be used when a fresh response can't be obtained. By default, HTTP caches are allowed to do this with all storable responses unless CC: must-revalidate or no-cache are present; we introduced CC: stale-if-error to provide some finer granularity there. Perhaps that's what you're looking for?

E.g., a one-week SwR window with the semantics you define might look like:

Cache-Control: max-age=3600, stale-while-revalidate=604800, stale-if-error=60

If that's interesting, I could see updating the spec to clarify the relationship between SwR and SiE.

From later in the thread, it sounds like stale-if-error doesn't actually provide this. (If it did, I think it'd be rather poor to separate them because SwR on its own is not meaningful without the bound.)

But, no, what I'm looking for isn't a particular mechanism per se. As you note, HTTP caches are already allowed significant leeway in how they do things. This is great since intermediate caches and clients have different needs. Even different browsers will have different opinions on how to do things. The cost is we need to be clear on the bounds of that flexibility, otherwise headers have no meaning.
 
If you do want to try to add additional constraints to error handling for SwR independently, the discussion really needs to happen on a list like ietf-h...@w3.org, since lots of caches have implemented it and they're going to want to weigh in.

Happy to weigh in on ietf-http-wg, but I've found such things work much better after some initial discussion to make sure we're all talking about roughly the same thing. I am still puzzled by your responses on this thread which seem to simultaneously advocate for these semantics while also rejecting them. I'm probably horribly misunderstanding something, so I'd like to get to the bottom of that first. :-)

Mark Nottingham

unread,
Jun 28, 2018, 12:43:05 AM6/28/18
to David Benjamin, Jeffrey Yasskin, bke...@mozilla.com, Dave Tapuska, blink-dev
On 28 Jun 2018, at 6:00 am, David Benjamin <davi...@chromium.org> wrote:
>
>> On Tue, Jun 26, 2018 at 1:06 AM Mark Nottingham <mn...@mnot.net> wrote:
[...]
>> There's also:
>>
>> C: The cache doesn't receive any requests during the stale window until the very end, where it serves the stale content, makes the async request successfully, but never uses the refreshed response before it becomes stale.
>>
>> From the standpoint of the author, B and C are very similar, and C isn't that uncommon (even on a CDN or reverse proxy; we have unpopular content too, and we also have connectivity problems to the origin). The author *really* has to be comfortable with that stale content being used for its entire window, even if popular content won't exercise that in practice all the time.
>
> I think C is a different type of distinction from A and B. C is a request pattern, while A and B are cache behaviors. Rather, C's request pattern is where A and B are indistinguishable. So it's not useful in telling us whether B's differences from A are a problem.
>
> Agreed that, no matter what the semantics are, the author must, ultimately, be okay with stale content being used for the SwR window. That is fundamental to caching. It is also true that the author must be okay with stale content being used for the max-age window. The question is what "okay" means in each of those claims.

I'm confused here, and I think it's because you're using the word "stale" to refer to content that's fresh as per max-age.

> Did you intend that the degree of comfort with stale content is the same between the SwR window and max-age window?

If my understanding above is correct, I think this could be rephrased as:

> Did you intend that the degree of comfort with a fresh-but-not-firsthand* response being served is the same as serving a stale response within the SwR window?


To answer, I don't think we explicitly considered that; a fresh response is fresh and the server has to live with that, unless they have some invalidation mechanism for the cache. It's a tradeoff, and not always a comfortable one.

SwR similarly is a tradeoff; it has some different qualities, especially around the area that you're highlighting (i.e., *usually* the stored response will be refreshed soon after the first hit in the SwR period, but that happens at different times depending on the nature of the cache and the request stream, and failure adds muddiness).

* "firsthand" is a term from 2616 that refers to a response that hasn't spent any time in caches in the chain from the origin server to the recipient; alas, it was removed from 723x.

> If yes, I think SwR was misdesigned. If we believe that max-age and SwR actually imply the same degree of comfort with stale content, then SwR should have been subtractive.

That seems like a pretty large leap...

> Let max-age continue to describe the acceptable staleness

See above re: terminology

> window, and then SwR/fixed describes a suffix of it which predictively triggers asynchronous revalidations. Indeed I advocated for this initially.

We considered that, but discarded it, because it has a number of unfortunate characteristics; not only is the stale window capped by the max-age, but you end up making more requests than you otherwise would have if you didn't support SwR. Furthermore, you get into situations where an upstream cache can be starved of requests by downstream caches that don't implement SwR, meaning async validation doesn't happen when it should. Also, if your upstream cache doesn't support SwR, it'll just return the fresh response, which doesn't help.
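[For concreteness, the two readings can be summarised roughly as below. This is an illustrative sketch only: "additive" is the behaviour RFC 5861 actually specifies, "subtractive" is the discarded suffix-of-max-age design mentioned above; the function names are invented.]

    #include <cstdint>

    // Additive (RFC 5861): the SwR window sits *after* the freshness window.
    // The response stays usable without blocking while age <= max_age + swr,
    // and an async revalidation is triggered once the response goes stale.
    bool UsableAdditive(int64_t age, int64_t max_age, int64_t swr) {
      return age <= max_age + swr;
    }
    bool TriggerAsyncRevalidationAdditive(int64_t age, int64_t max_age,
                                          int64_t swr) {
      return age > max_age && age <= max_age + swr;
    }

    // Subtractive (discarded): SwR names a suffix *inside* the freshness
    // window. The response is only usable while fresh, so the serve-without-
    // blocking window is capped by max-age, and a cache that ignores SwR can
    // simply answer with the fresh response, starving upstream revalidation.
    bool UsableSubtractive(int64_t age, int64_t max_age) {
      return age <= max_age;
    }
    bool TriggerAsyncRevalidationSubtractive(int64_t age, int64_t max_age,
                                             int64_t swr) {
      return age > max_age - swr && age <= max_age;
    }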

> But the feedback was that this wouldn't be useful. Rather, people wanted to set SwR much *much* higher than the max-age values they were comfortable with. They wanted to express a second *larger* window where they still would tolerate the stale responses, but were less okay with it than in the shorter max-age window. They wanted a better overall update rate than the higher max-age would give. And they believed SwR had those semantics.
>
> Indeed, on the face of it, SwR smells like it does. If one believes revalidations always succeed in a timely manner, the SwR window expresses some fuzzy comfort level where one concedes the worst case, but still wishes to get the common case. I assume these are the semantics SwR actually intended. You yourself seemed to agree with those semantics in your previous message.
>
> This falls over when one realizes revalidations may fail. And that failed revalidations decay to treating the SwR window as a max-age window. But we assumed SwR represents a weaker degree of staleness tolerance, so this violates the server's requests. SwR's design does not work in an environment where revalidations fail.

Do you have any data for the failure rates (in the browser case) here? It seems like it would have to be a fairly large proportion to affect things in the manner you're talking about.

> To fix this, we need some kind of bound on how much a stale resource will be used in the SwR window. *That* (plus suggestions on how to achieve the goal) is the piece that was critically missing from the specification.

As I said, AFAICT you can achieve what you want within the current design of SwR; it's just that it doesn't call out the issue explicitly.

In other words, SwR's current specification assumes that the engineer implementing the cache will inevitably deal with this issue, just as they do for all other HTTP requests; it wasn't felt worthy of calling out at the time. If we wrote the RFC today, I suspect we'd go into more detail.

[...]
>> But yes, the traffic profile for a CDN or reverse proxy is going to be very different from what you see. Do you think what you suggest will work for them equally well?
>
> By "what [I] suggest", do you mean the revalidation timeout behavior?

Yes.

> Yes. I think the revalidation timeout behavior is mostly a no-op for a CDN or reverse proxy. Though you seemed to suggest earlier they indeed implemented something of that nature. (The formulation with the grace period is literally a revalidation timeout. It's the same thing.) But I'm not advocating the revalidation timeout be a MUST-level requirement. Rather it should be a suggested behavior, with the actual requirement being the semantics.

Could you propose some actual text? I'm having a hard time understanding what you actually want to see here.


>> >> > (It is also predictive in a CDN, but as the CDN's cache services many clients, it's a very solid prediction.)
>> >>
>> >> Depending on the nature of the content; while CDN hit rates are generally higher than browsers and forward caches, lots of little-used content goes through CDNs. Regardless, this actually shouldn't matter too much; if the revalidation ends up not getting used, no more requests were issued than would have been without SwR.
>> >>
>> >> > The browser's HTTP cache serves not just the site developer, but also the user, who may be visiting another site or have limited resources. Being predictive, the case for dropping those under load becomes very strong.
>> >>
>> >> As per above, this is *additional* efficiency that the cache can eke out if it decides it wants to drop SwR revalidations -- which would not be available without it.
>> >
>> > I think the "no more requests" reduction is a bit more nuanced. Background and foreground requests are different. Rate limiting is usually accounted based on live requests and live tabs. (This corresponds to a user expectation: closing the tab must make it go away.) The SwR revalidations detach from that and thus need to be separately rate-limited.
>>
>> I see; that makes sense from a browser standpoint, but isn't really relevant to an intermediary cache or the origin.
>
> Well, yes, that's kind of the point. Cache-Control directives are some API contract between the server and the cache. Different caches have different needs, so they'll implement that contract in different ways. The specification is not clear on what exactly the requirements are. Instead it specifies one possible behavior, implicitly baking in many assumptions of intermediary caches and origins that are just not valid when revalidations might fail or be dropped.
>
> Failing revalidations are an unavoidable fact of life, so we need to hammer down what exactly the server wanted out of those revalidations to figure out how to translate those semantics beyond the very narrow scope that the specification is currently suitable for.

You say "exactly", but you seem to indicate there's a significant amount of wiggle room above and elsewhere...


[...]
>> I'm having a hard time seeing this as a significant improvement in clarity; we go from "I'm OK with the response being served stale during this window, while you attempt to revalidate it in the background" to "I'm OK with it being served stale in this window, while an attempt is made to revalidate it in the background, but if that revalidation attempt fails (for some definition of failure), suddenly it's not OK to serve it stale." It introduces a new factor into whether something is allowed to be served stale.
>>
>> I.e., if there's no traffic for a week and it's OK to serve a response stale, why is it not OK to serve that same response stale at the same time if there was a failed revalidation driven by a previous request sometime during that week? A transient network or server failure is not likely to affect the resource's state.
>
> The difference is we're in the SwR window where (I'm positing) the server wants some reasonable effort be made to revalidate the resource. In an environment where background requests are treated differently from foreground requests and the background revalidations may fail, that means we need to give up on the background ones at some point. Otherwise we risk degrading SwR to a larger max-age, which violates the server expectations.

I was with you until you said "violates" -- that seems unusually strong, unless the failure rate is very high.

[...]
>> If you do want to try to add additional constraints to error handling for SwR independently, the discussion really needs to happen on a list like ietf-h...@w3.org, since lots of caches have implemented it and they're going to want to weigh in.
>
> Happy to weigh in on ietf-http-wg, but I've found such things work much better after some initial discussion to make sure we're all talking about roughly the same thing. I am still puzzled by your responses on this thread which seem to simultaneously advocate for these semantics while also rejecting them. I'm probably horribly misunderstanding something, so I'd like to get to the bottom of that first. :-)

Oh no, I'm pretty good at both being confused and confusing others.

I'm starting to suspect you want a paragraph or two that explains this issue and suggests a non-mandatory solution. That's fine, but opening up an RFC to inject what amounts to advice is unfortunately still a somewhat heavyweight process (although better than it used to be). If there are other changes to make to SwR/SiE I'd be up for it, but on its own I'm not yet seeing the value (but could be persuaded otherwise, particularly if others weigh in).

The other option would be to file an erratum and put it into the "hold for update" state; then it would at least be recorded. That's a very lightweight process.

But I'm jumping ahead. If you could propose some text I think we could make more concrete progress.

Yoav Weiss

unread,
Jul 5, 2018, 3:19:28 AM7/5/18
to David Benjamin, Mark Nottingham, Jeffrey Yasskin, bke...@mozilla.com, Dave Tapuska, blink-dev
Can you detail what you mean by "failing revalidations"? I can think of a few scenarios:
a) The revalidation request got back a 4XX/5XX error response.
b) The revalidation request got back an eventual 200 response with a different resource than the one in the cache
c) The revalidation request hangs and eventually times out

Are there other options I'm not considering?

It's true that these scenarios are not detailed out in the RFC. At the same time, I'd expect:
a) The resource will be removed from the cache and will no longer be used stale, unless `stale-if-error` is specified for the resource (and implemented by the cache).
b) The new resource will replace the older, stale resource in the cache, and be cached according to the new resource's cache directives.
c) Not sure, but after the timeout, can probably treat this the same as a)

Does that make sense? Would such a detailed definition resolve the issue you have with SwR being an "extended max-age"?
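[A sketch of what those expectations might look like as cache logic -- a reading of (a)-(c) above, with the usual 304 revalidation case from RFC 7234 added, and (c) folded into (a). The type and function names are invented for illustration.]

    #include <cstdint>

    enum class RevalidationResult {
      kNotModified,  // 304: the stored response is still valid (RFC 7234)
      kNewResponse,  // a full response was received
      kError,        // 4xx/5xx, timeout, or network failure
    };

    struct StoredEntry {
      bool has_stale_if_error = false;  // directive present on the response

      void RefreshFreshness() { /* update freshness from the 304's headers */ }
      void Replace()          { /* store the new response and its directives */ }
      void Evict()            { /* stop using this entry, stale or otherwise */ }
    };

    void OnRevalidationDone(StoredEntry& entry, RevalidationResult result) {
      switch (result) {
        case RevalidationResult::kNotModified:
          entry.RefreshFreshness();
          break;
        case RevalidationResult::kNewResponse:
          // (b): the new resource replaces the stale one and is cached
          // according to its own directives.
          entry.Replace();
          break;
        case RevalidationResult::kError:
          // (a) and (c): the entry is no longer served stale, unless
          // stale-if-error applies (and the cache implements it).
          if (!entry.has_stale_if_error)
            entry.Evict();
          break;
      }
    }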
 
 

Mark Nottingham

unread,
Jul 5, 2018, 7:56:04 PM7/5/18
to Yoav Weiss, David Benjamin, Jeffrey Yasskin, bke...@mozilla.com, Dave Tapuska, blink-dev
My .02 -

On 5 Jul 2018, at 5:19 pm, Yoav Weiss <yo...@yoav.ws> wrote:
>
> Can you detail what you mean by "failing revalidations"? I can think of a few scenarios:
> a) The revalidation request got back a 4XX/5XX error response.

It depends. 4xx and 5xx responses can be cached, and are; they replace the stored response. A cache *could* interpret some 5xx errors as "transient server problem, maybe the stale one is better." But that's the explicit semantics of stale-if-error, not stale-while-revalidate.

> b) The revalidation request got back an eventual 200 response with a different resource than the one in the cache

This is revalidation working -- the stored response is updated. It's already specified in <https://httpwg.org/specs/rfc7234.html#validation.response>

> c) The revalidation request hangs and eventually times out

yes

> Are there other options I'm not considering?

d) various "network" errors (disconnected, RST, DNS lookup fails, trash on the wire).

> It's true that these scenarios are not detailed out in the RFC. At the same time, I'd expect:
> a) The resource will be removed from the cache and will no longer be used stale, unless `stale-if-error` is specified for the resource (and implemented by the cache).
> b) The new resource will replace the older, stale resource in the cache, and be cached according to the new resource's cache directives.
> c) Not sure, but after the timeout, can probably treat this the same as a)
>
> Does that make sense? Would such a detailed definition resolve the issue you have with SwR being an "extended max-age"?
