Proposal for caching pushed streams

Alek Storm

May 23, 2012, 10:42:40 PM
to spdy...@googlegroups.com
Hi all,

Some others (https://groups.google.com/d/topic/spdy-dev/xe37J7EgcoE/discussion, https://groups.google.com/d/topic/spdy-dev/KpDGLQYCsk0/discussion, https://groups.google.com/d/topic/spdy-dev/Lu8XpKhl2h8/discussion) have discussed the need for fine-grained cache control for SPDY pushed streams. The client currently has no way to tell the server which subresources it has cached, independently of its cache-validation headers for the actual request. This problem is compounded by the fact that subresources (JavaScript, CSS, etc.) are often shared between multiple resources on a web site, all of which, in turn, may try to push content the client already has. In addition, some user agents (like wget or web crawlers) may be uninterested in having content pushed to them, and doing so would be a waste of bandwidth.

As a first step toward addressing these issues, I've drafted some additions to the HTTP/1.1 caching mechanisms (http://www.w3.org/Protocols/rfc2616/rfc2616.html, sections 14.9 and 14.24-14.28), which are given in rough form below. I'm not sure where these would fit into either the HTTP or SPDY specs, but they're formatted mostly in the style of RFC 2616. There is also some language in the last section that covers rules for proxy behavior. I'd love any feedback, especially additions, deletions, criticisms, or questions - I'm particularly unsure about the naming.

Thanks,
Alek Storm

Cache-Validating Headers
-------------------------------------

Subresources-If-None-Match
Subresources-If-None-Match = "Subresources-If-None-Match" ":" ( "*" | 1#entity-tag )
For each entity that would have been pushed in response to a similar GET request (without the Subresources-If-None-Match) on that resource, if:
(1) its entity tag is included in the list of entity tags; or
(2) (a) "*" is given; and
    (b) any current entity exists for that resource
then the server MUST NOT push that entity, unless required to do so because the entity's modification date fails to match that supplied in a Subresources-If-Modified-Since header field in the request. Instead, if the request method was GET or HEAD, the server SHOULD push a 304 (Not Modified) response, including the cache-related header fields (particularly ETag) of one of the entities that matched.

Comments: Clients should include entity tags for resources which included the requested URI in their Subresource-Of header when they were last fetched, or all entity tags from the requested domain for Content-Types likely to be pushed (application/javascript, text/css).
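
A minimal sketch of the Subresources-If-None-Match rule above, seen from the server side. The function names are hypothetical, the header is split naively on commas (which would break on an entity tag that itself contains a comma), and the exception for a mismatching Subresources-If-Modified-Since date is left out.

```python
def parse_entity_tags(header_value):
    """Parse a Subresources-If-None-Match value into "*" or a set of entity tags."""
    value = header_value.strip()
    if value == "*":
        return "*"
    # Naive split: assumes no entity tag contains a comma.
    return {tag.strip() for tag in value.split(",") if tag.strip()}


def should_push(candidate_etag, header_value):
    """Return False when the header tells the server to suppress this push.

    candidate_etag is the entity tag of the entity the server would push,
    or None if no current entity exists for that resource.
    """
    if header_value is None:
        return True  # header absent: default push behavior applies
    tags = parse_entity_tags(header_value)
    if tags == "*":
        # "*" suppresses the push whenever any current entity exists.
        return candidate_etag is None
    # Otherwise suppress only entities whose tag the client listed.
    return candidate_etag not in tags
```

When should_push returns False, the proposal says the server SHOULD push a 304 (Not Modified) response in place of the entity.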

Subresources-If-Modified-Since
Subresources-If-Modified-Since = "Subresources-If-Modified-Since" ":" HTTP-date
For each entity that would have been pushed in response to a similar GET request (without the Subresources-If-Modified-Since) on that resource, if:
(1) the given date is valid; and
(2) (a) the entity has not been modified since the given date; or
    (b) the entity was not related to the requested resource at the given date; or
    (c) (i) it is not known whether the entity was related to the requested resource at the given date; and
        (ii) the requested resource has not been modified since the given date
then the server MUST NOT push that entity. Instead, if the request method was GET or HEAD, the server SHOULD push a 304 (Not Modified) response. If Subresources-If-Modified-Since is not present, or if the given date is invalid, the If-Modified-Since date may be used instead, if present.

Comments: Although If-Modified-Since is also a good heuristic (and can be used if Subresources-If-Modified-Since is missing), clients will not send it if they were instructed not to cache the original URI the last time it was fetched, even if the pushed resources were cacheable.
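
A rough sketch of the date handling described above, assuming the server knows each entity's last-modified time. It models only the fallback from Subresources-If-Modified-Since to If-Modified-Since and the "entity has not been modified since the given date" condition; the conditions about whether the entity was related to the requested resource are not modeled. The function names are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_http_date(value):
    """Return an aware datetime, or None if the header is missing or invalid."""
    if not value:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if parsed.tzinfo is None:
        # Treat dates without a usable zone as UTC so comparisons stay valid.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def should_push(entity_last_modified, subresources_if_modified_since=None,
                if_modified_since=None):
    """Decide whether to push, using only the 'not modified since' condition."""
    since = (parse_http_date(subresources_if_modified_since)
             or parse_http_date(if_modified_since))
    if since is None:
        return True  # no usable date: default push behavior applies
    return entity_last_modified > since
```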

Analogous Subresources-If-Match and Subresources-If-Unmodified-Since headers are omitted here for brevity.

Response Headers
---------------------------

Subresource-Of
Subresource-Of = "Subresource-Of" ":" 1#relativeURI
Allows the server, or intermediate caches, to specify the request URIs the resource would likely be pushed in response to.

Comments: This is actually more powerful in the hands of intermediate caches, which have more information about how different resources are interrelated.

Cache-control Directives
-----------------------------------

Default behavior: The server and any intermediate caches MAY push streams in response to a request.

no-push (client)
Indicates that both the server and any intermediate caches MUST NOT push any streams in response to this request.

no-proxy-push (server)
Indicates that any intermediate caches MUST NOT push any streams in response to this request.

push-if-original-invalidated (client)
Indicates that both the server and any intermediate caches MUST NOT push any streams in response to this request if the request is successfully revalidated (i.e. a 304 Not Modified response is sent).

push-if-invalidated (client)
Indicates that both the server and any intermediate caches MUST NOT push any stream in response to this request if the pushed resource is successfully revalidated (i.e. a 304 Not Modified would be sent).

push-max-age, push-min-fresh, push-max-stale (client)
Identical to their variants without "push-", but only apply to pushed streams.
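
One way the client-side directives above might be evaluated by a server or cache, as a sketch only. The parser is simplified (it does not handle commas inside quoted arguments), the function names are hypothetical, and the push-max-age, push-min-fresh, and push-max-stale directives are parsed but not enforced.

```python
def parse_cache_control(value):
    """Parse a Cache-Control value into a {directive: argument-or-None} dict."""
    directives = {}
    for part in (value or "").split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip()] = arg.strip().strip('"') or None
    return directives


def push_allowed(request_cache_control, original_revalidated, pushed_revalidated):
    """Apply no-push, push-if-original-invalidated and push-if-invalidated.

    original_revalidated: the request itself is answered with 304.
    pushed_revalidated: the candidate pushed resource would be answered with 304.
    """
    directives = parse_cache_control(request_cache_control)
    if "no-push" in directives:
        return False
    if "push-if-original-invalidated" in directives and original_revalidated:
        return False
    if "push-if-invalidated" in directives and pushed_revalidated:
        return False
    return True  # default behavior: pushing is permitted
```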

Server Push Restrictions
-----------------------------------

Pushed streams must have one of the following response codes: 200, 204, 206, 301, 302, 303, 304, 307. A response with a different response code MUST NOT be pushed.

Servers and intermediate caches MUST NOT push two streams associated with the same URL, or a stream associated with the URL of the original request, unless both streams have response code 206 and contain non-overlapping byte ranges.

Comment: It shouldn't be possible to create a pushed stream associated with another pushed stream, but this should probably be addressed at the framing layer.
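
The two restrictions above could be enforced with a small per-request tracker, sketched below. The class and method names are hypothetical, byte ranges are modeled as inclusive (first, last) tuples, and the response to the original request is treated as a stream already associated with the request URL.

```python
PUSHABLE_STATUS_CODES = frozenset({200, 204, 206, 301, 302, 303, 304, 307})


def ranges_overlap(a, b):
    """True if two inclusive (first, last) byte ranges share any byte."""
    return a[0] <= b[1] and b[0] <= a[1]


class PushTracker:
    """Tracks the streams sent in response to one request."""

    def __init__(self, request_url, response_status, response_range=None):
        # The original response counts as a stream for the request URL.
        self._streams = {request_url: [(response_status, response_range)]}

    def try_push(self, url, status, byte_range=None):
        """Record and allow the push if it satisfies both restrictions."""
        if status not in PUSHABLE_STATUS_CODES:
            return False
        for prev_status, prev_range in self._streams.get(url, []):
            both_partial = status == 206 and prev_status == 206
            if not (both_partial and byte_range and prev_range
                    and not ranges_overlap(byte_range, prev_range)):
                return False
        self._streams.setdefault(url, []).append((status, byte_range))
        return True
```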

Proxy Behavior
----------------------

Proxies SHOULD multiplex multiple downstream SPDY or HTTP sessions to various clients into one upstream session.

Proxies MAY track which resources are pushed in response to requests for certain URIs.
If a proxy decides that a certain resource is related to the requested resource (either through tracking or scanning a cached copy):
(1) If the related resource is in cache and meets the cache-control restrictions, the proxy MAY push the resource's cached entry.
(2) If the related resource is not in cache, would meet the cache-control restrictions, and the proxy decides that the server does not normally push the resource (either through tracking previous requests or because the proxy is an HTTP gateway), the proxy MAY make a separate request to the server for the resource (with a no-push cache control directive), and forward the response downstream as a pushed stream. The proxy MAY follow 3xx redirects.
In either case, the proxy modifies the cache-validating and cache-control headers of the original request to exclude the pushed resource, and re-processes the request (serves from cache or forwards upstream).

If a proxy receives a push that violates a cache-control header or that the proxy has already pushed, or if the downstream framing layer is HTTP, it MUST NOT forward the push downstream, and SHOULD send RST_STREAM with status code CANCEL to the server.
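
The forwarding rule in the last paragraph can be sketched as a single decision function. The enum and function names are hypothetical, and the cache-control check is assumed to have been computed elsewhere and passed in as a boolean.

```python
from enum import Enum


class PushAction(Enum):
    FORWARD = "forward"
    CANCEL = "rst_stream_cancel"  # send RST_STREAM with status CANCEL upstream


def handle_upstream_push(url, already_pushed, violates_cache_control,
                         downstream_is_http):
    """Decide what a proxy does with a push received from the server.

    already_pushed is a mutable set of URLs the proxy has pushed downstream
    for this request; it is updated when the push is forwarded.
    """
    if violates_cache_control or url in already_pushed or downstream_is_http:
        return PushAction.CANCEL
    already_pushed.add(url)
    return PushAction.FORWARD
```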

Mike Belshe

May 30, 2012, 11:19:06 AM
to spdy...@googlegroups.com
Hi Alek - 

This is good thinking; it's clever and it does indeed solve the problem.

However, if I understand it right, this basically introduces a new index for HTTP client caches - the "subresource-of". It's a lot of management. In order to issue any request, the cache now needs to look up the exhaustive list of potential subresources, find their tags, and send them up in a big list. This is a large, non-trivial amount of work, and it may not be too fast either ;-) Is there a limit to how many subresources a resource can have? Given that we've got pages with hundreds of subresources, do all of them become subresources in this scheme? If they did, wouldn't that mean that sending a request would require N client-side cache lookups (or a database to index etags by subresources)? Can a single resource be a subresource of multiple other resources (e.g. a common style sheet across multiple resources)?

Overall, I think this works, and we did consider such schemes early on.  However, the practical concerns about client-side cache management made us reject it before really trying.

Mike

PS - the server management of the resource/subresource relationship is also tricky, but probably automatable.

Alek Storm

Jun 1, 2012, 12:18:01 AM
to spdy...@googlegroups.com
Hi Mike,

On Wed, May 30, 2012 at 10:19 AM, Mike Belshe <mbe...@chromium.org> wrote:
However, if I understand it right, this basically introduces a new index for HTTP client caches - the "subresource-of". It's a lot of management. In order to issue any request, the cache now needs to look up the exhaustive list of potential subresources, find their tags, and send them up in a big list. This is a large, non-trivial amount of work, and it may not be too fast either ;-) Is there a limit to how many subresources a resource can have? Given that we've got pages with hundreds of subresources, do all of them become subresources in this scheme?

That depends on what you mean by "subresource". Since you mention "hundreds" of them, I assume you mean any resource referenced by a page; I meant only resources pushed by the server (certainly most of the resources needed by a page can wait to be loaded until after the HTML referencing them arrives). The beauty of HTTP cache management is that much of its implementation is left to browsers - there is nothing forcing the client to send a long Subresources-If-None-Match header if it feels that performance would be impacted negatively; it could send only the first ten etags it finds, or none at all. The browser cache could be cleared at any time by the user anyway, so it would be useless to mandate which headers "must" be sent.

If they did, wouldn't that mean that sending a request would require N client-side cache lookups (or a database to index etags by subresources)?

I think you're not giving browser developers enough credit :). To my understanding, caches are already implemented as lightweight databases; I don't see how having each cache entry store a list of its subresources would be a non-negligible performance burden. Even if it were, don't the bandwidth savings of not having to retransmit a resource outweigh the cost of looking up its etag in memory?
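
As a rough illustration of the index being discussed, the sketch below keeps an in-memory map from each page URL to the subresources that named it in Subresource-Of, and builds a capped Subresources-If-None-Match value from it. The class name, the method names, and the default cap of ten tags are assumptions for illustration, not a description of any browser's cache.

```python
from collections import defaultdict


class PushAwareCache:
    """Toy client cache indexed by the Subresource-Of relationship."""

    def __init__(self):
        self._etags = {}                       # subresource URL -> entity tag
        self._subresources = defaultdict(set)  # page URL -> subresource URLs

    def store(self, url, etag, subresource_of=()):
        """Cache a resource and record the pages it was declared a subresource of."""
        self._etags[url] = etag
        for parent in subresource_of:
            self._subresources[parent].add(url)

    def if_none_match_for(self, page_url, limit=10):
        """Build a Subresources-If-None-Match value, sending at most `limit` tags."""
        urls = sorted(self._subresources.get(page_url, ()))
        tags = [self._etags[u] for u in urls if u in self._etags][:limit]
        return ", ".join(tags) if tags else None
```

Each request then costs one dictionary lookup for the page plus one per recorded subresource, and the client remains free to lower the cap or omit the header entirely.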

Can a single resource be a subresource of multiple other resources (e.g. a common style sheet across multiple resources)?

Yes, and the Subresource-Of header specifically provides servers/proxies with a way to notify the client of these relationships, so that Subresources-If-* headers can be sent on subsequent requests to new, un-cached resources whose subresources would otherwise be unknown.

Overall, I think this works, and we did consider such schemes early on.  However, the practical concerns about client-side cache management made us reject it before really trying.

In the time since posting this, I've discovered the HTTPbis working group, which would be a more appropriate place for this proposal; I plan to submit it there for consideration in a day or two. Are you on that mailing list as well?

PS - the server management of the resource/subresource relationship is also tricky, but probably automatable.

Yes, I believe the Jetty team is exploring different strategies for doing this, including pre-defined config files and relationship discovery through the Referer header. Tornado, the Python web framework I'm implementing SPDY support for, will have several methods for both implicitly and explicitly discovering pushable resources.

Alek

Mike Belshe

Jun 7, 2012, 1:12:04 AM
to spdy...@googlegroups.com
On Thu, May 31, 2012 at 9:18 PM, Alek Storm <alek....@gmail.com> wrote:
I think you're not giving browser developers enough credit :). To my understanding, caches are already implemented as lightweight databases; I don't see how having each cache entry store a list of its subresources would be a non-negligible performance burden. Even if it were, don't the bandwidth savings of not having to retransmit a resource outweigh the cost of looking up its etag in memory?

Maybe you're right.

Can a single resource be a subresource of multiple other resources (e.g. a common style sheet across multiple resources)?

Yes, and the Subresource-Of header specifically provides servers/proxies with a way to notify the client of these relationships, so that Subresources-If-* headers can be sent on subsequent requests to new, un-cached resources whose subresources would otherwise be unknown.

I don't know quite how this works; if we want to push the subresource for the first page access, and then not push it again subsequently, the subresource would have to be marked as 'subresource-of' for all the subsequent pages too to avoid being re-pushed?  I suppose other logic could be employed to handle this, but it seems like it misses the most common push case?

In the time since posting this, I've discovered the HTTPbis working group, which would be a more appropriate place for this proposal; I plan to submit it there for consideration in a day or two. Are you on that mailing list as well?

Yes, that's fine.

Alek Storm

Jun 7, 2012, 1:45:49 AM
to spdy...@googlegroups.com
On Thu, Jun 7, 2012 at 12:12 AM, Mike Belshe <mbe...@chromium.org> wrote:
On Thu, May 31, 2012 at 9:18 PM, Alek Storm <alek....@gmail.com> wrote:
On Wed, May 30, 2012 at 10:19 AM, Mike Belshe <mbe...@chromium.org> wrote:
Can a single resource be a subresource of multiple other resources (e.g. a common style sheet across multiple resources)?

Yes, and the Subresource-Of header specifically provides servers/proxies with a way to notify the client of these relationships, so that Subresources-If-* headers can be sent on subsequent requests to new, un-cached resources whose subresources would otherwise be unknown.

I don't know quite how this works; if we want to push the subresource for the first page access, and then not push it again subsequently, the subresource would have to be marked as 'subresource-of' for all the subsequent pages too to avoid being re-pushed?  I suppose other logic could be employed to handle this, but it seems like it misses the most common push case?

I think you're right - for sites with many pages and one subresource shared between them, transmitting every URL in the Subresource-Of header would be problematic. Perhaps it should be converted to something like the "path" attribute in Set-Cookie, with each value interpreted as a prefix that applies to every URL beginning with it. So a resource sent with "Subresource-Of: /static/, /foo.html" would be a subresource of /foo.html and everything in the /static/ directory.
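
A small sketch of the prefix interpretation suggested above, assuming a plain string-prefix match on the request path. The function names are hypothetical.

```python
def parse_subresource_of(value):
    """Split a Subresource-Of value into its comma-separated path prefixes."""
    return [prefix.strip() for prefix in value.split(",") if prefix.strip()]


def is_subresource_of(header_value, request_path):
    """True if request_path begins with any prefix listed in the header."""
    return any(request_path.startswith(prefix)
               for prefix in parse_subresource_of(header_value))
```

Note that a pure prefix match also treats /foo.html as covering a path such as /foo.html.bak, so a real definition would need to say whether that is intended.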

Alek

Guy Bedford

Aug 28, 2013, 3:04:31 AM
to spdy...@googlegroups.com
Hi,

I've been doing some experiments with node-spdy to check how push streams are cached, and have noticed two small things that seem like non-ideal cache behaviour.

1. When creating a push stream with a public Cache-Control max-age header, the max-age is respected for refreshes of the page, but if I load another page with a push stream by the same name, the browser accepts the push stream again. Ideally it would know it has a copy from the previous page and be able to refuse the push stream.

2. When creating a push stream with an etag, ideally that would be enough information for the browser to know whether to accept or deny the push stream. There is no need for an "if-none-match" stage, as it can compare the etag to what it has, and then decide to deny or accept the push stream.

If both of the above were supported, it would make push streams ideal for common assets shared between pages, but without these two it can still be more beneficial to have separate requests to ensure the correct cache behaviour.
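
The two behaviours asked for above could look roughly like the following on the client side. This is a sketch under assumptions: the cache is modeled as a plain dict keyed by URL, each entry holds a hypothetical "etag" string and an "expires" timestamp, and how the client actually refuses the stream is left out.

```python
import time


def accept_push(cache, url, pushed_etag, now=None):
    """Decide whether a client should accept a pushed stream.

    cache maps url -> {"etag": str or None, "expires": float (epoch seconds)}.
    """
    now = time.time() if now is None else now
    entry = cache.get(url)
    if entry is None:
        return True   # nothing cached: accept the push
    if entry["expires"] > now:
        return False  # point 1: a fresh copy exists, whichever page pushed it
    if pushed_etag is not None and entry["etag"] == pushed_etag:
        return False  # point 2: stale copy, but the entity tag still matches
    return True
```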

Any feedback on those appreciated.

Thanks,

Guy

William Chan (陈智昌)

Aug 28, 2013, 8:03:15 AM
to spdy...@googlegroups.com

What client are you testing? Chrome? If so, yes it's known that (1) and (2) are not implemented yet. The push support is very naive. Patches welcome.

Guy Bedford

Aug 28, 2013, 8:30:43 AM
to spdy...@googlegroups.com
Thanks for the quick response. Yes this is in Chrome.

It is great to know that these may be supported. I'm working on a JavaScript module delivery server, and it would change the architecture considerably if these work out.

So do you think, given enough time, these would be expected features then?

William Chan (陈智昌)

Aug 29, 2013, 6:06:17 AM
to spdy...@googlegroups.com
Yep.