[Web facing change PSA] Heads-up: Faster assets with the HTTP caching extension "stale-while-revalidate"


Kenji Baheux

May 28, 2014, 5:04:17 AM5/28/14
to Chromium-dev, blink-dev

Blinkeuses, Blinkeurs

Chromeuses, Chromeurs


Barring any unexpected events, we’ll start the implementation work for one of the HTTP caching extensions proposed by Mark Nottingham: specifically, the “stale-while-revalidate” extension to the Cache-Control header.


This extension gives webmasters the ability to let the cache serve slightly stale content, as long as it refreshes things in the background.


[Figure: flow diagram for stale-while-revalidate]



This should result in faster page loads for regularly visited websites and for sites that use popular third-party web services (e.g. analytics, ads, social, web fonts...).




Improvements

With a reasonably sized stale-while-revalidate window(*), the following improvements would be achieved:


1. Fewer blocking requests

Re-validations of blocking assets (e.g. some CSS/JS assets, web fonts) would no longer block the page, since those re-validation requests would happen asynchronously after the page is loaded.

2. Improved network usage efficiency

Asynchronous re-validations reduce opportunity costs. In other words, more bandwidth and more simultaneous connections remain available for other requests.




Example

With an HTTP response containing the following header:


Cache-Control: max-age=86400, stale-while-revalidate=604800


  • max-age indicates that the asset is fresh for 1 day,

  • and stale-while-revalidate indicates that the asset may continue to be served stale for up to an additional 7 days while an asynchronous re-validation is attempted.

If the re-validation is inconclusive, or if there isn’t any traffic to trigger it, then after those 7 days the stale-while-revalidate window ends and the cached response becomes "truly" stale (i.e. the next request will block and be handled normally).
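To spell out the decision a cache makes for such a response, here is a minimal sketch in Python (purely illustrative; this is not Chromium code):

  import time

  MAX_AGE = 86400                    # fresh for 1 day
  STALE_WHILE_REVALIDATE = 604800    # usable-while-stale for 7 more days

  def classify(stored_at, now=None):
      # How a cache may use a stored response, per the semantics described above.
      now = time.time() if now is None else now
      age = now - stored_at
      if age <= MAX_AGE:
          return "fresh"                   # serve from cache, no request at all
      if age <= MAX_AGE + STALE_WHILE_REVALIDATE:
          return "stale-while-revalidate"  # serve stale now, revalidate in the background
      return "truly stale"                 # the next request blocks on a revalidation

  # A response stored 3 days ago is served immediately, with an async revalidation.
  print(classify(stored_at=time.time() - 3 * 86400))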



References

  • RFC 5861.
  • crbug.com/348877: Feel free to star it to show your support or to simply follow along.
  • Support for these HTTP cache-control extensions can already be found in proxies like Squid.



Best,



(*): I'll follow up with extra thoughts about how to select a reasonable value for the stale-while-revalidate window in different scenarios. Obviously, the longer you can afford, the better :)

Yoav Weiss

May 28, 2014, 3:10:32 PM5/28/14
to Kenji Baheux, Chromium-dev, blink-dev
That's awesome! Looking forward to it.

Do you know if other browsers have plans to support these extensions?
Also - I noticed that the proposal draft has long expired. Is there a current draft of these proposals? If not, can we revive the old draft?




Kenji Baheux

May 29, 2014, 3:59:39 AM5/29/14
to Chromium-dev, blink-dev

Note for Yoav: Thanks for the feedback! I posted a reply but it either got lost or it's not showing up yet; I'll wait a day or so and re-post if needed.



follow up, part 1:


How to choose a good value for stale-while-revalidate

I've spent some time thinking this through, but it's quite possible that I missed something. Don't hesitate to point out any oversights or better solutions :)


In order to maximize the benefits of this header, one should pick a value that minimizes the number of synchronous re-validations across as many access patterns as one can afford to cover (e.g. daily access, weekly access).


Part 1: assets that are served by-and-for the visited website

Perhaps a slightly contrived example, but let's consider a website that is accessed every morning, mainly on weekdays. Let's also assume an original max-age of 43200 seconds (12 hours).



A bad choice would be:

  Cache-Control: max-age=43200, stale-while-revalidate=43200


This gives a maximal lifetime of 1 day and incorrectly assumes that your website (and its assets) is accessed precisely every 24 hours. If a user deviates even slightly from that pattern, they will suffer extra blocking re-validations. Also, the Monday commute is guaranteed to be comparatively bad.



A decent choice would be:

  Cache-Control: max-age=43200, stale-while-revalidate=172800


This gives a maximal lifetime of 2 days and 12 hours. This will work for users who access the site religiously every weekday, as well as for those who visit almost every weekday. The only downside: the Monday commute will still be comparatively bad (Friday morning to Monday morning > 60 hours).



A better choice would then be:

  Cache-Control: max-age=43200, stale-while-revalidate=259200


This gives a maximal lifetime of 3 days and 12 hours. This would then solve the poor user experience on the Monday commute.
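To make the trade-off concrete, here is a small simulation of my own (not from the reasoning above, just an illustration of it) that replays a weekday-morning access pattern, with a few visits slipping from 8am to 9am, and counts the blocking re-validations for each candidate window:

  MAX_AGE = 43200  # 12 hours, as above

  def blocking_revalidations(swr, visit_hours):
      # Count visits that land beyond max-age + swr since the copy was last (re)validated.
      blocking = 0
      last_validated = visit_hours[0]
      for t in visit_hours[1:]:
          age = (t - last_validated) * 3600
          if age > MAX_AGE + swr:
              blocking += 1            # truly stale: the user waits on the revalidation
          if age > MAX_AGE:
              last_validated = t       # any revalidation (sync or async) refreshes the copy
      return blocking

  # Two weeks of weekday-morning visits (hours since Monday 8am); some happen at 9am instead.
  visits = [0, 25, 48, 73, 96, 168, 193, 216, 241, 264]
  for swr in (43200, 172800, 259200):
      print(swr, blocking_revalidations(swr, visits))
  # Prints 5, 1 and 0 blocking re-validations for the three candidates, respectively.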



Comments welcomed!

In Part 2, I will try to address the other use case where the asset is served for other websites (e.g. analytics, ads, social, web fonts...).

William Chan (陈智昌)

May 29, 2014, 5:05:43 AM5/29/14
to Kenji Baheux, Chromium-dev, blink-dev
Did our caching folk agree to this? The last comment from one of our cache implementers on the bug thread is from Ricardo saying "This doesn't look to me as being ready to be implemented." I don't know if that's a stale comment, but if so, then it'd be great to update the bug thread accordingly.

Kenji Baheux

May 29, 2014, 5:27:42 AM5/29/14
to William Chan (陈智昌), Chromium-dev, blink-dev
Just added a comment on the bug.

Paraphrasing:

We discussed ideas relevant to Ricardo's point ("automatically revalidate resources after reusing them if the expiration date is within some threshold. That has the advantage of not requiring any change outside of the browser to be effective."). 

Both approaches have pros and cons and the heuristic/pro-active approach definitely needs more work.

However, from talking to different folks, I believe that there is enough traction behind this header to make it happen. The RFC feels complete and clear. So far, the only questions that have been raised could be resolved in a follow-up (e.g. how this would show up in the Resource Timing API, if at all).

Best,

Kenji Baheux

May 29, 2014, 7:37:00 PM5/29/14
to Yoav Weiss, Chromium-dev, blink-dev
(reposting)
Thanks Yoav for the feedback and enthusiasm!


Do you know if other browsers have plans to support these extensions?

Firefox: a bug has been filed but plans remain unknown at this point.
Safari: unknown

I also found out that support for these headers is available in Varnish (*), Apache Traffic Server, and HTTPclient.

*: at least a good enough approximation it seems.



 Also - I noticed that the proposal draft has long expired. Is there a current draft of these proposals? If not, can we revive the old draft?

An RFC doesn't expire, drafts do. You might have read this and got confused. 

My understanding: 

Kenji Baheux

Jun 1, 2014, 10:12:12 PM6/1/14
to Chromium-dev, blink-dev

follow up, part 2:


Continuation of How to choose a good value for stale-while-revalidate

Part 2: the asset is served for other websites (e.g. analytics, ads, social, web fonts...)


Determine your “window of comfort”:

  • Determine how long you are willing to have stale assets used in the wild (e.g. backward compatibility burden). When you only had max-age at your disposal, things were straightforward but with the introduction of stale-while-revalidate it’s a little bit more involved. 

    • Unless you do something fancy on the server side(1), you will need to think about assets created 2 x (max-age + stale-while-revalidate) seconds ago. 
    • The 2x factor might seem surprising but it's actually there in the max-age only case.

(1): In Part 3, I will share extra thoughts about this topic and explain the 2x factor.




Generic goals:

  1. maximize stale-while-revalidate to minimize synchronous re-validation requests

  2. maximize max-age to minimize any kind of traffic

  3. minimize max-age to minimize time to deliver updates

  4. minimize (max-age + stale-while-revalidate) in order to meet your “window of comfort” requirement



Concrete example:

With a service that other websites use via the inclusion of a JS asset, let’s assume the following conditions:

  1. original max-age of 1 day

  2. an effectiveness of 90% is desired from the stale-while-revalidate header

  3. within any given 3-hour daytime timeframe, the probability that a user visits a website using the asset consistently crosses the 90% target(2)

  4. a night-time timeframe of 6 hours should be excluded in order to guarantee point 3.

  5. the service can cope with obsolete-by-a-week versions of the asset


(2): true for any day of the week and any time of the year.



A bad choice would be:

  Cache-Control: max-age=86400, stale-while-revalidate=10800


This gives a maximal lifetime of 1 day and 3 hours. It’s highly unlikely that the stale-while-revalidate window would be large enough for a significant number of users. For instance, if the asset enters its stale-while-revalidate period near the beginning of the night, we would end up with a truly stale asset on the next morning.



A reasonable choice would be:

  Cache-Control: max-age=86400, stale-while-revalidate=32400


This gives a maximal lifetime of 1 day and 9 hours (3+6), which is enough to bridge the night and still get a 90% hit within a 3-hour window the next morning.



Now let’s assume that:

    1. the service is developing new features and wants the ability to push updates about twice as fast in case something goes wrong.
    2. the service can handle twice as much re-validation traffic


A good choice would then be:

  Cache-Control: max-age=43200, stale-while-revalidate=82800


The maximal lifetime is roughly unchanged (1 day and 11 hours) but max-age has been cut in half for the benefit of the stale-while-revalidate window.
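Purely as an illustration, here is how one could sanity-check a candidate pair against the requirements above (the thresholds are my reading of conditions 1-5; the helper itself is not from the post):

  DAY, HOUR = 86400, 3600

  def check(max_age, swr, comfort=7 * DAY, night_plus_morning=9 * HOUR, update_delay=DAY):
      # Which of the example's requirements a Cache-Control candidate satisfies.
      return {
          "swr bridges the night plus the next 3h morning window": swr >= night_plus_morning,
          "updates propagate within the accepted delay": max_age <= update_delay,
          "served copies stay inside the window of comfort": max_age + swr <= comfort,
      }

  print(check(86400, 10800))    # the "bad" choice: fails the bridging requirement
  print(check(86400, 32400))    # the "reasonable" choice: everything passes
  print(check(43200, 82800, update_delay=12 * HOUR))  # the "good" choice with faster updates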



Closing remark on fine-tuning needs
In order to help web services further optimize their choice of max-age and stale-while-revalidate, we might want to consider marking asynchronous revalidations with a specific header.

This would help webmasters differentiate them from the synchronous revalidations of “truly” stale assets and compare their expectations with reality.
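To make the idea concrete, here is a rough sketch of what such a marker would enable on the server side. The header name below is purely hypothetical; nothing of the sort has been decided:

  from collections import Counter

  # Hypothetical: assume the browser tagged background revalidations with a request
  # header such as "X-Async-Revalidation: 1" (the name is illustrative only).
  def bucket(request_headers):
      # Classify a request so max-age / stale-while-revalidate can be tuned from logs.
      if "if-none-match" not in request_headers and "if-modified-since" not in request_headers:
          return "unconditional"       # cache miss, or no validators stored
      if request_headers.get("x-async-revalidation") == "1":
          return "async revalidation"  # the user was served the stale copy and never waited
      return "sync revalidation"       # a user actually waited on this request

  log = [
      {"if-none-match": '"v42"', "x-async-revalidation": "1"},
      {"if-none-match": '"v42"'},
      {},
  ]
  print(Counter(bucket(headers) for headers in log))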

Opinions on this particular question would be highly appreciated!


Best,

Kenji Baheux

Jun 9, 2014, 1:21:23 AM6/9/14
to Chromium-dev, blink-dev

Follow-up, Part 3:


Preface

stale-while-revalidate doesn’t introduce any new problems. If you are familiar with the implications of max-age, there isn’t anything new per se. For instance, when it comes to solving the “instant update” problem, the same technique of revving the resource name via a fingerprint or version number would be used. The problem outlined in the next section is specific to the few resources that must keep the same URL.


If you are still a bit confused, head over to Mark’s blog: he recently published a great post about the stale-while-revalidate header in the context of web browsers.




Deep dive on the “window of comfort”

In the second follow-up installment, I introduced the “window of comfort” as an initial consideration before thinking about the value of stale-while-revalidate and max-age (note the errata*):


Determine your “window of comfort”:

  • Determine how long you are willing to have stale assets used in the wild (e.g. backward compatibility burden). When you only had max-age at your disposal, things were straightforward but with the introduction of stale-while-revalidate it’s a little bit more involved.

  • Unless you do something fancy on the server side(1), you will need to think about all the assets that were actively served max-age + stale-while-revalidate seconds ago.




In a max-age (= X) only setup, consider the following scenario:


[Figure: timeline for a max-age-only setup, showing the period during which several versions of Asset A coexist]


As you can see, there is a period of time, marked α, during which the service is expected to work with all 3 versions of Asset A (1.0, 1.1, 1.2).


And similarly, with a max-age=M, stale-while-revalidate=S setup:

[Figure: the same timeline with max-age=M, stale-while-revalidate=S]
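As a rough reconstruction of the reasoning behind the diagrams (my own sketch, not the original figures): a copy fetched just before a new version ships can keep being served for up to max-age + stale-while-revalidate seconds, so at any time the service must support every version whose last possible fetch is still within that lifetime:

  def versions_in_the_wild(releases, t, max_age, swr):
      # releases: list of (version, launch_time), oldest first.
      lifetime = max_age + swr
      alive = []
      for i, (version, launched) in enumerate(releases):
          retired = releases[i + 1][1] if i + 1 < len(releases) else float("inf")
          # A copy fetched any time before `retired` can live another `lifetime` seconds.
          if launched <= t < retired + lifetime:
              alive.append(version)
      return alive

  releases = [("1.0", 0), ("1.1", 100000), ("1.2", 180000)]
  # Shortly after 1.2 ships, cached copies of 1.0 and 1.1 can still be in use (the α period).
  print(versions_in_the_wild(releases, t=190000, max_age=43200, swr=82800))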




How to minimize the burden of supporting older versions of an asset

While the issue isn’t specific to stale-while-revalidate (a max-age only setup has the same issues), addressing it can help further maximize the stale-while-revalidate window.


Maybe this is a widely known method, but just in case, here is how one could in theory support only the latest version of a given asset at any given time:

[Figure: the pre-launch window trick for stale-while-revalidate]


As soon as you have a new version of your asset (e.g. V1.1), you can set up a pre-launch window whose length is equal to max-age + stale-while-revalidate. During that window, your server should use a dynamically computed Cache-Control header for its 200 and 304 responses for asset A:


Cache-Control: max-age=M+S-λ


Where λ represents how many seconds have passed since the beginning of the pre-launch window.


Theoretically, this will guarantee that any V1.0 in the wild will be stale by the time you launch V1.1. In practice, you might want to keep support for V1.0 around for a little while in order to deal with the usual shenanigans.
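A minimal sketch of that pre-launch trick, as I read it (the helper and its names are mine; note that the stale-while-revalidate directive is dropped during the window so that no cached copy can outlive the launch):

  import time

  MAX_AGE, SWR = 43200, 82800   # M and S

  def cache_control_for_asset_a(launch_time, now=None):
      # During the pre-launch window (which opens M + S seconds before launch), shrink
      # max-age so that every cached copy of V1.0 is truly stale by launch time.
      now = time.time() if now is None else now
      seconds_to_launch = launch_time - now          # equals M + S - λ inside the window
      if 0 < seconds_to_launch <= MAX_AGE + SWR:
          return "max-age=%d" % seconds_to_launch
      return "max-age=%d, stale-while-revalidate=%d" % (MAX_AGE, SWR)

  # Ten hours before V1.1 launches, responses for asset A carry roughly max-age=36000 and nothing else.
  print(cache_control_for_asset_a(launch_time=time.time() + 10 * 3600))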




*: about the errata: the 2x factor came from a scenario that assumed that the lifetime of an asset was always equal to max-age.

Kenji Baheux

Jun 9, 2014, 9:23:02 PM6/9/14
to Chromium-dev, blink-dev
Let me share the draft PRD for stale-while-revalidate support in Chrome.

Some highlights:
  • opportunity assessment
  • scenarios/use cases (RFC compliant implementation, custom HTTP header for monitoring/fine tuning the values, integration with devtools, integration with Resource Timing API)
  • launch plan. In particular our intent is to experiment/iterate before deciding to stick with the feature.

Comments welcomed!

In particular, there are a couple of unresolved decisions for which we would love to hear your feedback (e.g. high priority: thoughts about the custom HTTP header's design and name proposals; next phase: how this should be represented in DevTools, and how this should be integrated with the Resource Timing API).


Best,

Chris Bentzel

Jun 10, 2014, 4:53:21 PM6/10/14
to Kenji Baheux, net-dev, Chromium-dev, blink-dev
Looks cool, thanks for pushing on this.

I am interested in when you are planning on doing the revalidation step. For example - if the main HTML resource for a page has stale-while-revalidate but there is no network connection at present, do you queue up revalidations when that happens? Also - do you try to queue up revalidations so that they don't wake up the radio if all resources for a navigation are in cache?

Initial versions could just do revalidation at the time the stale cache entry is returned, and we aren't any worse off than the current state of affairs (and better, since the user is saved the latency), but I'm just wondering if there are plans beyond that.

Kenji Baheux

Jun 10, 2014, 9:29:06 PM6/10/14
to Chris Bentzel, net-dev, Chromium-dev, blink-dev
2014-06-11 5:52 GMT+09:00 Chris Bentzel <cben...@chromium.org>:
Looks cool, thanks for pushing on this.

Thanks!


I am interested in when you are planning on doing the revalidation step. For example - if the main HTML resource for a page has stale-while-revalidate but there is no network connection at present, do you queue up revalidations when that happens? Also - do you try to queue up revalidations so that they don't wake up the radio if all resources for a navigation are in cache?

Initial versions could just do revalidation at the time the stale cache entry is returned, and we aren't any worse off than the current state of affairs (and better, since the user is saved the latency), but I'm just wondering if there are plans beyond that.

Yes, initially I think we should keep it simple. If it's not too difficult, I was thinking that we could do the revalidations after the page is loaded (load event), so that other resources can fully take advantage of the yield. As for the "no network connection" case, I believe that initially we could just drop the ball and rely on the next opportunity if any (IINM, this would be spec compliant). 

If the experiment phase is successful, we would definitely seek to improve the implementation to further increase its effectiveness. Some rough ideas:

1. opportunistic scheduling of revalidations
Say resource A is served with max-age=1d,s-w-r=1d but it's only used every 3 days or so. If we just did the revalidation right after the resource is used, it would be truly stale the next time it's used.

If we could schedule the revalidation just before its next use, we would be able to avoid this issue (see the rough sketch after this list).

 
2. take care of mobile battery/cell-plan concerns by
  • scheduling as much as possible when plugged-in and/or on WiFi
  • avoiding taking risky bets (e.g. issue a HEAD request for infrequently used resources, abandon if you get a 200 OK for a large resource given that it's not used all that much)

3. use heuristics to cover other assets that are not served with a stale-while-revalidate header.

....
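Here is the rough sketch promised above for idea 1 (entirely illustrative, nothing like this exists today): predict the next use of a resource from its access history and schedule the background revalidation shortly before that, instead of firing it right after the current use:

  DAY = 86400

  def pick_revalidation_time(access_times, max_age, swr, slack=3600):
      # access_times: past uses of the resource, oldest first; assume the copy was
      # (re)validated at the most recent use. Returns when to revalidate next.
      last_use = access_times[-1]
      if len(access_times) < 2:
          return last_use                        # no history: revalidate right away
      gaps = [b - a for a, b in zip(access_times, access_times[1:])]
      predicted_next_use = last_use + sum(gaps) / len(gaps)
      hard_deadline = last_use + max_age + swr   # after this the copy is truly stale
      return min(predicted_next_use - slack, hard_deadline)

  # Resource used roughly every 3 days but served with max-age=1d, s-w-r=1d:
  uses = [0, 3 * DAY, 6 * DAY]
  print(pick_revalidation_time(uses, max_age=DAY, swr=DAY) / DAY)
  # -> 8.0: revalidating on day 8 keeps the copy fresh for the expected use on day 9.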

In general, I would like to have the right metrics in place in order to better prioritize the ideas. For instance, having an outcome UMA histogram with a "no network connection" category sounds valuable. Similarly, we could also count how often we ended up waking up the radio.

Let me update the PRD with these questions and thoughts.

Chris Bentzel

Jun 11, 2014, 8:57:34 AM6/11/14
to Kenji Baheux, net-dev, Chromium-dev, blink-dev
Definitely put the UMA in and drive engineering work based on the frequency of revalidation failures.

Christian Biesinger

Jun 16, 2014, 11:16:35 AM6/16/14
to Ben Maurer, net...@chromium.org, kenji...@chromium.org, chromium-dev, blink-dev
I would think that user expectations are that an explicit reload will always check with the server. In fact, from my time on the Firefox Networking stack, sometimes users expect not just a validation but a full load of all resources when they click reload...

-christian


On Sun, Jun 15, 2014 at 2:25 PM, Ben Maurer <ben.m...@gmail.com> wrote:
Hey,

I'm curious, how do you guys see this interacting with voluntary revalidates -- eg when the user explicitly refreshes the page?

I've been studying the behavior of Facebook's CDN recently. Most of our resources are served with a long term TTL and are invalidated by changing the name of the resource. On Chrome, even for resources that are maintained this way, we see roughly 60% of requests to our CDN result in a 304 not modified. This is largely because Chrome has a peculiar behavior on pages that are redirected to after a POST request (https://code.google.com/p/chromium/issues/detail?id=294030). Even on other browsers without this behavior, we see roughly 15-20% of requests to our CDN result in a 304 not modified.

Our static resource system NEVER changes a resource after it is created. We'd really like to make it so that a user who refreshes Facebook's home page (likely wanting to see an updated feed) doesn't send redundant requests to our CDN -- or at the very least doesn't block the page rendering.

If we sent a stale-while-revalidate on our static resources, would Chrome respect this header on an explicit refresh?

-b

Ricardo Vargas

Jun 16, 2014, 2:04:59 PM6/16/14
to Ben Maurer, Christian Biesinger, net...@chromium.org, Kenji Baheux, chromium-dev, blink-dev
As far as I understand it, the measurements for bug 294030 all relate to a very common use case that doesn't involve users doing anything to dictate how the page should be loaded. I think that is very different from a user explicitly asking the browser to reload the current page.

The expectation of a reload or force-reload is that the browser should issue network requests again because something looks wrong with the current page. The user is most likely trying to fix an issue and I don't think it is a good idea to get in the way. I don't think this is really a case of trying to optimize for speed.


On Mon, Jun 16, 2014 at 8:54 AM, Ben Maurer <ben.m...@gmail.com> wrote:
Hey,

I totally agree that the expectation of the user is that when they refresh all resources are up to date with the latest version from the server.

However, it's common for applications to know that URLs are truly static. For example, Facebook's static resource system generates urls like: https://static.xx.fbcdn.net/rsrc.php/v2/yV/r/l_C8JMZfIdK.js. There is no circumstance which would ever cause the content of this URL to change. If we want to change the content of this file, our system will generate a new URL and change any references.

If a website like Facebook knows that a given URL will never change, it should be able to communicate that to chrome. When the user refreshes the page, they will still see up to date versions of every resource since the server is telling the client that those URLs are never going to change.

In https://code.google.com/p/chromium/issues/detail?id=294030 we measured the impact of working around Chrome's behavior of sending conditional responses for static resources on pages that get a redirect from a POST. We saw a substantial increase in a number of our engagement metrics. Therefore, we'd really like to get the number of conditional requests down to zero for any resource that uses unique URLs.

Ben Maurer

Jun 16, 2014, 2:19:34 PM6/16/14
to Ricardo Vargas, Christian Biesinger, net...@chromium.org, Kenji Baheux, chromium-dev, blink-dev
On Facebook's CDN we see about 15-20% of our requests for static resources result in a 304 (this is from non-chrome browsers, so they are not affected by 294030). As far as we can tell, the only explanation for this is users refreshing the page. We suspect that some users may wish to see an updated news feed and have learned to use the refresh button to do this. Even if the user does see an issue on the page, doing revalidation against our static resources won't help -- we never change the resources in question and therefore would always send a 304 response. It's much more likely the issue was in the actual HTML we served the user (which would be refreshed). If there was an issue with the static resources (eg, a corrupt cache of some sort) the browser would need to apply the shift+reload behavior to resolve it.

15-20% is a substantial chunk of our CDN usage. Eliminating this would likely have a noticeable impact on our performance and engagement metrics.

Darin Fisher

Jun 16, 2014, 2:51:09 PM6/16/14
to Ben Maurer, Ricardo Vargas, Christian Biesinger, net...@chromium.org, Kenji Baheux, chromium-dev, blink-dev
You should take a look at ServiceWorker. It allows web developers to intercept and handle a normal reload locally. (It does not allow you to intercept shift+reload however.)


Ben Maurer

Jun 16, 2014, 2:53:54 PM6/16/14
to Ilya Grigorik, Ricardo Vargas, Christian Biesinger, net...@chromium.org, Kenji Baheux, chromium-dev, blink-dev
Yep, when we worked around 294030 we saw a statistically significant engagement increase due to the increased performance -- I suspect that stale-while-revalidate would have a similar effect for users who reload the page, assuming that it deferred refreshes caused by explicit user reloads. This also assumes that the revalidate was scheduled in such a way that it would not interfere with other networking traffic.

That said, we'd also love to get these requests off our CDN -- even though 304s use minimal bandwidth, it's a non-trivial usage of CPU resources.


On Mon, Jun 16, 2014 at 11:45 AM, Ilya Grigorik <igri...@google.com> wrote:

On Mon, Jun 16, 2014 at 11:19 AM, Ben Maurer <ben.m...@gmail.com> wrote:
15-20% is a substantial chunk of our CDN usage. Eliminating this would likely have a noticeable impact on our performance and engagement metrics.

I'm guessing engagement implies "faster page load --> more user activity"? In which case, stale-while-revalidate should still deliver that because the browser wouldn't block the page render on the revalidation request -- right? It wouldn't reduce the load on the CDN, but that's a separate concern.

ig

Ben Maurer

Jun 16, 2014, 3:01:37 PM6/16/14
to Darin Fisher, Ricardo Vargas, Christian Biesinger, net...@chromium.org, Kenji Baheux, chromium-dev, blink-dev
Yep, some other folks from the Chrome team pointed that out to me -- it's a very interesting spec. That said, if one of the goals of reloading is trying to help the user fix an issue on the page (as Ricardo pointed out), the more site specific logic we add to the reload event the higher the risk of us having bugs that cause the reload not to work. I suspect only developers know about shift+reload.

Serving a static resource with a long TTL is an explicit best practice (eg https://developers.google.com/speed/articles/caching suggests "If your resources change more often than that, you can change the names of the resources. A common way to do that is to embed a version number into the URLs. The main HTML page can then refer to the new versions as needed."). It'd be great if Chrome could increase the performance of reloads for people who follow this advice without requiring the use of Service Worker.

Darin Fisher

Jun 16, 2014, 3:07:12 PM6/16/14
to Ben Maurer, Ricardo Vargas, Christian Biesinger, net...@chromium.org, Kenji Baheux, chromium-dev, blink-dev
On Mon, Jun 16, 2014 at 12:01 PM, Ben Maurer <ben.m...@gmail.com> wrote:
Yep, some other folks from the Chrome team pointed that out to me -- it's a very interesting spec. That said, if one of the goals of reloading is trying to help the user fix an issue on the page (as Ricardo pointed out), the more site specific logic we add to the reload event the higher the risk of us having bugs that cause the reload not to work. I suspect only developers know about shift+reload.

"With great power comes great responsibility." The service worker should take care to honor the user's intent with reload.

 

Serving a static resource with a long TTL is an explicit best practice (eg https://developers.google.com/speed/articles/caching suggests "If your resources change more often than that, you can change the names of the resources. A common way to do that is to embed a version number into the URLs. The main HTML page can then refer to the new versions as needed."). It'd be great if Chrome could increase the performance of reloads for people who follow this advice without requiring the use of Service Worker.

By bringing up Service Worker, I didn't mean to imply an either / or scenario. Service Worker provides savvy web developers with a tool to address the reload concern. That feels like part of the solution. It is at least something we can be confident about. Changing the behavior of reload applied to static resources is a much more complicated proposition. I don't feel as confident in such a change. We need to think carefully about that. There may be some factors outside the web developer's control that lead to broken sites that can only be repaired via reload. I don't know for sure.

-Darin

Ben Maurer

Jun 16, 2014, 3:25:46 PM6/16/14
to Darin Fisher, Ricardo Vargas, Christian Biesinger, net...@chromium.org, Kenji Baheux, chromium-dev, blink-dev
Fair enough -- if a site screws something up using service worker, at least they knew what they were getting into :-).

Do you have thoughts on how we could evaluate a change to the behavior of reloading static resources? Three potential options for doing this that I mentioned in this thread are (1) using stale-while-revalidate (2) Using a new cache control header stating that the resource is immutable (3) not revalidating far-in-the-future TTLs. What are the risks you are worried about and how might they be mitigated for these options.

One risk you bring up is you would never want to create a situation where the user needs to clear their cache (or shift reload) to fix a broken site and the developer is unable to do anything to help them. In most cases, the developer has a simple recourse -- they can change the name of the resource in question. This seems like what they should do anyways.

The one case in which they can't do this is if the cached resource was the page the viewer was using. Imagine for example that through a freak accident we served an error page for www.facebook.com that had headers set in a way the browser chose not to revalidate the resource. This could create a situation where the user couldn't use Facebook again until the TTL expired or they cleared the cache.

This suggests that perhaps any measure that caused the browser not to revalidate a resource on reload should only apply to subresources, not the root document. As long as we always revalidate the root document, the developer can always fix their server to return a 200 response and then rename any subresources on the page.

-b

Kenji Baheux

Jun 16, 2014, 9:20:14 PM6/16/14
to Ben Maurer, Darin Fisher, Ricardo Vargas, Christian Biesinger, net...@chromium.org, chromium-dev, blink-dev
Hi Ben, thanks for the feedback!


Do you have thoughts on how we could evaluate a change to the behavior of reloading static resources? Three potential options for doing this that I mentioned in this thread are (1) using stale-while-revalidate


Ilya said:

I'm guessing engagement implies "faster page load --> more user activity"? In which case, stale-while-revalidate should still deliver that because the browser wouldn't block the page render on the revalidation request -- right? It wouldn't reduce the load on the CDN, but that's a separate concern.


It depends on how we handle the revalidations for resources served with the stale-while-revalidate header.

1. Triggerring
Typically, the async revalidation only kicks in when the Age of the resource is greater than max-age but smaller than max-age+stale-while-revalidate. 

However, in the regular max-age-only case, when the user asks for an explicit reload we already kick off revalidations regardless of the resource's Age. For consistency, we should do the same for resources served with a stale-while-revalidate directive.

2. Implications of performing async revalidations 
Let's assume that we perform async revalidations for the resources served with stale-while-revalidate.
One implication would be that if something went wrong with the page, the user would have to hit reload twice to get back into a sane state:
  1. the first reload would kick off sync and async revalidations
  2. the second reload would use the new resources obtained via the async revalidations and kick off another round of sync and async revalidations.

3. Motivations for hitting reload
I'm wondering what are the main motivations for hitting reload:
  1. some network connectivity issues (taking too long, lost connection)
  2. issues with some of the resources
  3. want to see the latest updates

If we could measure each of these, we would be able to make an informed decision about these async revalidations. Strawman:

  • #1: reload happened while Chrome was still busy with network requests
  • #2: one of the revalidation for the immutable/fresh assets got a 2XX response
  • #3: all the rest?

Note: based on our metrics, I believe that:
  • RegularReload is used on the order of 35 times per 10k page loads.  
  • IgnoreCacheReload is used about once every 20K page loads. 


(2) Using a new cache control header stating that the resource is immutable (3) not revalidating far-in-the-future TTLs. What are the risks you are worried about and how might they be mitigated for these options.

I like the third option. I imagine that we could use some of the following items as hints to drive the decision to revalidate and select between async and sync:
  • Expires, max-age, stale-while-revalidate values
  • (now - last-modified)
  • history of past responses
  • # of successive reload in a short timeframe as a proxy for user frustration
  • ...
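Purely as a strawman (the thresholds and the "skip" bucket below are mine, not a proposal), a heuristic combining those hints might look like this:

  DAY = 86400

  def reload_strategy(age, max_age, swr, since_last_modified, reloads_last_minute):
      # Decide how to treat a cached subresource on a user-initiated reload.
      if reloads_last_minute >= 2:
          return "sync"     # repeated reloads hint at user frustration: check everything
      if max_age >= 30 * DAY and since_last_modified >= 365 * DAY:
          return "skip"     # looks effectively immutable: don't revalidate at all
      if age <= max_age + swr:
          return "async"    # serve from cache, revalidate in the background
      return "sync"         # truly stale: block on revalidation, as today

  print(reload_strategy(age=3600, max_age=365 * DAY, swr=0,
                        since_last_modified=2 * 365 * DAY, reloads_last_minute=0))
  # -> "skip" for a fingerprinted, long-TTL static resource.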


This suggests that perhaps any measure that caused the browser not to revalidate a resource on reload should only apply to subresources, not the root document. As long as we always revalidate the root document, the developer can always fix their server to return a 200 response and then rename any subresources on the page.

I think that this would not work for the third party services integrated on a page. Say, your favorite analytics solution served a wonky response with a max-age=1year. It would be painful for them to get all of their customers to fix their integration. With the ideas above, we should be able to come up with a failsafe solution.

Ben Maurer

Jun 16, 2014, 10:12:50 PM6/16/14
to Kenji Baheux, Darin Fisher, Ricardo Vargas, Christian Biesinger, net...@chromium.org, chromium-dev, blink-dev
On Mon, Jun 16, 2014 at 6:19 PM, Kenji Baheux <kenji...@chromium.org> wrote:
3. Motivations for hitting reload
I'm wondering what are the main motivations for hitting reload:
  1. some network connectivity issues (taking too long, lost connection)
  2. issues with some of the resources
  3. want to see the latest updates

If we could measure each of these, we would be able to make an informed decision about these async revalidations. Strawman:

  • #1: reload happened while Chrome was still busy with network requests
  • #2: one of the revalidation for the immutable/fresh assets got a 2XX response
  • #3: all the rest?

One other metric to look at here might be reload time minus original view time. If this is long, it suggests #3. It'd be really interesting to get more stats on this.

 
  • RegularReload is used on the order of 35 times per 10k page loads.  
  • IgnoreCacheReload is used about once every 20K page loads. 
That seems kind of low given the % of our responses that are 304s. Do you have any metrics on the overall rate of 304s chrome gets? Or maybe metrics from Google's CDN?

 
This suggests that perhaps any measure that caused the browser not to revalidate a resource on reload should only apply to subresources, not the root document. As long as we always revalidate the root document, the developer can always fix their server to return a 200 response and then rename any subresources on the page.

I think that this would not work for the third party services integrated on a page. Say, your favorite analytics solution served a wonky response with a max-age=1year. It would be painful for them to get all of their customers to fix their integration. With the ideas above, we should be able to come up with a failsafe solution.

Can we maybe assume here that anybody smart enough to be a huge 3rd party analytics service is also smart enough not to serve their JS with a crazy TTL? Chances are that if the 3rd party analytics script was served with a long TTL it wouldn't actually break the page, and users wouldn't know to refresh.

Imagine that Google Analytics or the Facebook Like Button accidentally served a version of our JS with a 1 year TTL. Even with the rules browsers use today, we'd be pretty screwed.

Kenji Baheux

Jun 17, 2014, 12:39:37 AM6/17/14
to Ben Maurer, Darin Fisher, Ricardo Vargas, Christian Biesinger, net...@chromium.org, chromium-dev, blink-dev
On Mon, Jun 16, 2014 at 6:19 PM, Kenji Baheux <kenji...@chromium.org> wrote:
3. Motivations for hitting reload
I'm wondering what are the main motivations for hitting reload:
  1. some network connectivity issues (taking too long, lost connection)
  2. issues with some of the resources
  3. want to see the latest updates

If we could measure each of these, we would be able to make an informed decision about these async revalidations. Strawman:

  • #1: reload happened while Chrome was still busy with network requests
  • #2: one of the revalidation for the immutable/fresh assets got a 2XX response
  • #3: all the rest?

One other metric to look at here might be reload time minus original view time. If this is long, it suggests #3.

This might be tricky. I imagine that we could calculate #3 by doing (number_of_reloads - reloads_driven_by_#2 - reload_driven_by_#1).
 
 
It'd be really interesting to get more stats on this.


Definitely! I just filed a bug, let's continue this part of the discussion there and see if we can get this going.


 
 
  • RegularReload is used on the order of 35 times per 10k page loads.  
  • IgnoreCacheReload is used about once every 20K page loads. 
That seems kind of low given the % of our responses that are 304s. Do you have any metrics on the overall rate of 304s chrome gets? Or maybe metrics from Google's CDN?

You're right it feels low...
I'll deep dive a bit more, look for other signals and report back on the bug.


 

 
This suggests that perhaps any measure that caused the browser not to revalidate a resource on reload should only apply to subresources, not the root document. As long as we always revalidate the root document, the developer can always fix their server to return a 200 response and then rename any subresources on the page.

I think that this would not work for the third party services integrated on a page. Say, your favorite analytics solution served a wonky response with a max-age=1year. It would be painful for them to get all of their customers to fix their integration. With the ideas above, we should be able to come up with a failsafe solution.

Can we maybe assume here that anybody smart enough to be a huge 3rd party analytics service is also smart enough not to serve their JS with a crazy TTL? Chances are that if the 3rd party analytics script was served with a long TTL it wouldn't actually break the page, and users wouldn't know to refresh.

Imagine that Google Analytics or the Facebook Like Button accidentally served a version of our JS with a 1 year TTL. Even with the rules browsers use today, we'd be pretty screwed.


Oops, the 1-year max-age was indeed a poor choice, but I think the conclusion holds:
  • let's imagine that Google Analytics or the Facebook Like Button accidentally served a wonky version of their JS with a not-too-crazy max-age (e.g. a couple of hours, a day).
  • If the regular Reload doesn't trigger any revalidation, the user would have to either
    • Shift+Reload or nuke the cache to get the fixed version, or
    • wait at most max-age seconds and then access the site again.

If we can prove that Reloads are the main cause of the remaining 15-20% (as seen on other browsers which are not affected by crbug.com/294030) then it seems worth considering improving Reload's behavior. 

The other cause I can think of is that max-age seconds have passed and we issue a revalidation. But I expect this to explain a relatively small fraction of the requests.

I believe that the custom header proposed in the PRD would help us put numbers behind each cause.

Ben Maurer

Jun 17, 2014, 1:12:43 AM6/17/14
to Kenji Baheux, Darin Fisher, Ricardo Vargas, Christian Biesinger, net...@chromium.org, chromium-dev, blink-dev
On Mon, Jun 16, 2014 at 9:38 PM, Kenji Baheux <kenji...@chromium.org> wrote:
If we can prove that Reloads are the main cause of the remaining 15-20% (as seen on other browsers which are not affected by crbug.com/294030) then it seems worth considering improving Reload's behavior. 

I haven't been able to find any other explanation, and the rates are pretty similar in all other browsers. Once we fix this Chrome bug, I'll get you guys updated stats on our 304 rate to Chrome.
 
The other cause I can think of is that max-age seconds have passed and we issue a revalidation. But I expect this to explain a relatively small fraction of the requests.

We serve our resources with a 1 year TTL. Virtually none of our resources last this long. We see this rate of 304s on resources which haven't existed for a year.