Downstream Caches


Hugues Alary

Apr 1, 2014, 4:45:49 PM
to mod-pagesp...@googlegroups.com
Hi there!

As of 1.6.29.3, mod_pagespeed implements an experimental feature that solves the problem of downstream caches caching non-optimized pages.

This is a great feature, however, I was wondering if the mod_pagespeed team had investigated a specific feature that Varnish provides and which prevents browsers from ever having to wait on a page to be generated.

This feature, specific to Varnish, is called "req.hash_always_miss". 

The Varnish website explains the feature this way:

"When using a script to refresh or warm the cache, it can be desirable to make sure it is your script that is stuck waiting for the backend fetch, and not the next client visiting your site. You can use the req.hash_always_miss control variable to make sure this request will result in a cache miss and fetch from the backend."

Basically, this lets you regenerate objects in your cache instead of purging them. The difference is that when you purge an object, the next client to visit is blocked until the object is regenerated. When you regenerate the object with "req.hash_always_miss", the old object keeps being served until the new one has been generated and is available, so no client ever waits for the new object.
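
To make the contrast concrete, here is a minimal sketch in Varnish 3 style VCL. The PURGE handlers follow the usual purge pattern and evict the object. The REFRESH method is a hypothetical name used only to keep the two cases apart, and the "refreshers" ACL with its 127.0.0.1 entry is likewise an assumption about where the requests come from.

```vcl
# Hypothetical ACL: hosts allowed to purge or refresh cached objects.
acl refreshers {
    "127.0.0.1";
}

sub vcl_recv {
    # Classic purge: look the object up so vcl_hit / vcl_miss can evict it.
    if (req.request == "PURGE") {
        if (!client.ip ~ refreshers) {
            error 405 "Not allowed.";
        }
        return (lookup);
    }

    # Regenerate: hypothetical REFRESH verb, handled as a GET that is
    # forced to miss so the backend response replaces the cached object.
    if (req.request == "REFRESH") {
        if (!client.ip ~ refreshers) {
            error 405 "Not allowed.";
        }
        set req.request = "GET";
        set req.hash_always_miss = true;
    }
}

sub vcl_hit {
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged.";
    }
}

sub vcl_miss {
    if (req.request == "PURGE") {
        purge;
        error 200 "Purged.";
    }
}
```

With the PURGE path the object is gone until some client triggers a new fetch. With the REFRESH path the request that asked for the refresh is the one that waits on the backend.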

At Betabrand we are big fans of this feature, since it allows us to never purge objects from our cache.

When we want to regenerate a page, we make a request to the page with a specific header. Varnish reads that header and if it contains the value "regenerate", it will do a req.hash_always_miss.

Would it be possible for mod_pagespeed to use this feature instead of sending "PURGE" requests? 

Let's say I'm requesting http://www.betabrand.com/mens/pants.html

- mod_pagespeed starts optimizing the page
- sends the non-optimized version
- finishes optimizing
- sends a GET request to http://www.betabrand.com/mens/pants.html with a header "Pagespeed-Optimized: 1"
- varnish reads the header
- varnish sets req.hash_always_miss = true;
- varnish fetches the optimized version of the page and puts it in its cache.

The VCL would look like this:

sub vcl_recv
{
    # Force a cache miss when mod_pagespeed announces the optimized page
    if(req.http.Pagespeed-Optimized == "1")
    {
         set req.hash_always_miss = true;
    }
}

Thanks,
-Hugues

Jud Porter

Apr 1, 2014, 5:34:57 PM
to mod-pagesp...@googlegroups.com
Could you do this already by redefining the semantics of PURGE in your varnish config to actually perform a GET with req.hash_always_miss?

if(req.request == "PURGE")
{
    set req.request = "GET";
    set req.hash_always_miss = true;
}

Or if you didn't want to redefine PURGE, you could define a new request type, and use ModPagespeedDownstreamCachePurgeMethod to have mod_pagespeed send it.

It likely wouldn't be too difficult for us to add the ability for mod_pagespeed to send a specific header on a purge request either. That way you could set ModPagespeedDownstreamCachePurgeMethod to GET, and have an option like ModPagespeedDownstreamCachePurgeHeader that you could use to add the Pagespeed-Optimized header. But a solution that works with the existing options would be best.


--
You received this message because you are subscribed to the Google Groups "mod-pagespeed-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mod-pagespeed-di...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/mod-pagespeed-discuss/CAL_utwZLanHtHjVwPbfXkin4GF2TbM0V10hvOsfofAwi8DPC8Q%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Hugues Alary

Apr 1, 2014, 5:40:07 PM
to mod-pagesp...@googlegroups.com
Hi Jud, 

This solution makes perfect sense. I think I had explored it in the past and couldn't get it to work. However, since it's been a while I'll give it a new try.

I'll post my findings here.

Thanks!
-Hugues




Hugues Alary

Apr 1, 2014, 7:43:10 PM
to mod-pagesp...@googlegroups.com
Hi Jud, 

So here are some of my findings. TL;DR: I got it to work, but there are a few gotchas and a small bug (my last paragraph).

In the past I had tried setting ModPagespeedDownstreamCachePurgeMethod MYOWNHTTPVERB

This doesn't seem to be valid: mod_pagespeed never sends anything to varnish. I couldn't find any error message in my apache logs.

I'm not sure whether PURGE and GET are the only verbs allowed, or whether any verb defined in the RFC is valid.

---------------

I then tried ModPagespeedDownstreamCachePurgeMethod PURGE

with the VCL we talked about earlier:

sub vcl_recv {

if(req.request == "PURGE")
{
    set req.request = "GET";
    set req.hash_always_miss = true;
}

}

This time the PURGE request is received by varnish. 

Varnish then correctly sends a GET request to the URL, but the page received and put in cache is not optimized.

By inspecting the request made by mod_pagespeed, I realized that it includes a header "X-PSA-Purge-Request: 1", which I assumed is used by mod_pagespeed to know the request is coming from itself.

---------------

I then tried this VCL:

sub vcl_recv {
if(req.request == "PURGE")
{
    set req.request = "GET";
    unset req.http.X-PSA-Purge-Request;
    set req.hash_always_miss = true;
}
}

This still didn't work. The PURGE request is received by Varnish. Varnish makes a GET request to the URL, but the response is not optimized.

By inspecting the HTTP request again I noticed that mod_pagespeed also appends its own token to the User-Agent string: mod_pagespeed/1.6.29.7-3566 in my case.

---------------

I then tried this VCL:

sub vcl_recv {
if(req.request == "PURGE")
{
    set req.request = "GET";
    unset req.http.X-PSA-Purge-Request;
    set req.http.User-Agent = "fake user agent";
    set req.hash_always_miss = true;
}
}

It works!

The PURGE request is sent to Varnish, which in turn sends a GET request to mod_pagespeed, which in turn sends the optimized version of the page.

I'm however unclear about the potential side effects of removing mod_pagespeed user-agent string and X-PSA-Purge-Request header.

---------------

While debugging I noticed a small bug:
The PURGE request sent by mod_pagespeed has a double slash in it:

14 RxRequest    c PURGE
14 RxURL        c //mens/denim.html
14 RxProtocol   c HTTP/1.1
14 RxHeader     c Host: redacted.host.tld
14 RxHeader     c Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
14 RxHeader     c Pragma: no-cache
14 RxHeader     c DNT: 1
14 RxHeader     c Referer: http://redacted.host.tld/
14 RxHeader     c Accept-Language: en-US,en;q=0.8,de;q=0.6,fr;q=0.4
14 RxHeader     c PS-CapabilityList: ll,ii,dj:
14 RxHeader     c X-Device: desktop
14 RxHeader     c X-Varnish: 717879493
14 RxHeader     c Accept-Encoding: gzip
14 RxHeader     c X-PSA-Purge-Request: 1
14 RxHeader     c User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36 mod_pagespeed/1.6.29.7-3566

It so happens that my VCL already removes double slashes, for reasons specific to my website, so the double slash doesn't affect me. However, I believe mod_pagespeed shouldn't be sending purge requests with a double slash.

Here's the code I use in my VCL to remove the double slash:

# We remove 'double-slashes' from the requested URL, it causes bugs with varnish
if(req.url ~ "^(.*)//(.*)$")
{
    set req.url = regsub(req.url,"^(.*)//(.*)$","\1/\2");
}
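
A single regsub call rewrites only one occurrence, so a URL with several double slashes keeps some of them. As a hedged variant (a sketch, not what runs on the site described here), regsuball can collapse every run of slashes in one pass. Note the assumption that no "//" needs to survive anywhere in req.url: this would also rewrite a "//" inside a query string, for example a URL passed as a parameter.

```vcl
sub vcl_recv {
    # Collapse every run of two or more slashes into a single slash.
    # Caveat: this also touches the query string part of req.url.
    set req.url = regsuball(req.url, "//+", "/");
}
```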

I'll file a bug for this.

-Hugues

Anupama Dutta

Apr 1, 2014, 10:14:04 PM
to mod-pagespeed-discuss
Thanks, Hugues, for the excellent feedback on this feature. This req.hash_always_miss definitely appears promising. A few comments inline ...


On Tue, Apr 1, 2014 at 7:43 PM, Hugues Alary <hug...@betabrand.com> wrote:

In the past I had tried setting ModPagespeedDownstreamCachePurgeMethod MYOWNHTTPVERB

This doesn't seem to be valid: mod_pagespeed never sends anything to varnish. I couldn't find any error message in my apache logs.

I'm not sure whether PURGE and GET are the only verbs allowed, or whether any verb defined in the RFC is valid.


I am pretty sure that we use a defined list of HTTP methods only, and other methods won't be handled correctly. So, what you observed is probably intended.
 
---------------


By inspecting the request made by mod_pagespeed, I realized that it includes a header "X-PSA-Purge-Request: 1", which I assumed is used by mod_pagespeed to know the request is coming from itself.


You should try to leave the X-PSA-Purge-Request header in. As you correctly deduced, this header is present to prevent infinite loops wherein the pagespeed server might decide that the optimization for this re-sent request is not sufficient (i.e. below the threshold) and issue a PURGE again, and do this over and over again. Since the PURGE request is usually sent after all of the optimization work has finished, it is unlikely that this kind of loop will occur. However, it can still happen in a worst case scenario where all the resources in the HTML have say, very low caching TTLs and are expiring constantly and triggering PURGEs. So, keeping the header will be useful to avoid problems in your setup.

 
---------------


I'm however unclear about the potential side effects of removing mod_pagespeed user-agent string and X-PSA-Purge-Request header.


This "fake user-agent string" seems like the most risky part of this change. Since the pagespeed server responds with different kinds of optimizations (like webp vs non-webp etc.) for different classes of User-Agents, using a "fake user-agent" will (I think) cause you to populate the default cache fragment and not ones that represent other cache variations. See point 2 under this "hash-key" section of our documentation to see the different cache fragments that are recommended for this feature. It might be possible to pass along the original User-Agent as a separate request header for use at this point, though I am not very sure of the pros and cons of doing this.

 
---------------

While debugging I noticed a small bug:
The PURGE request sent by mod_pagespeed has a double slash in it:

14 RxRequest    c PURGE
14 RxURL        c //mens/denim.html

I'll file a bug for this.


One thought around this bug: Does removing the ending "/" from your ModPagespeedDownstreamCachePurgeLocationPrefix directive help eliminate the double-slash problem?

Thanks,
Anupama. 
 


Hugues Alary

Apr 1, 2014, 11:00:13 PM
to mod-pagesp...@googlegroups.com
Hi Anupama,

Thank you for your answer.

- Changing the User-Agent to "fake user-agent" is indeed a bad idea, it was more a quick hypothesis test than a real solution. What I did in the end is just replace mod_pagespeed by something else, leaving the original User-Agent string alone.

i.e. Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36 mod_pagespeed/1.6.29.7-3566 gets transformed into Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36 mod_pagefast/1.6.29.7-3566

if(req.request ~ "PURGE")
{
        set req.request = "GET";
        set req.http.User-Agent = regsuball(req.http.User-Agent,"mod_pagespeed","mod_pagefast");
        set req.hash_always_miss = true;
}

- I also removed the config that was stripping the X-PSA-Purge-Request and the optimization still happens so that's good for me. 

- removing the trailing slash from ModPagespeedDownstreamCachePurgeLocationPrefix fixes the double slash

All in all, this setup currently runs on my development server with no issues, this sounds promising.

On a related note: 
I don't know the internals of mod_pagespeed, so I'm unclear as to why mod_pagespeed won't send me an optimized page unless I replace "mod_pagespeed" with "mod_pagefast" in the User-Agent string. Could you enlighten me (without necessarily going into too much detail)?

Cheers,
-Hugues


Anupama Dutta

Apr 2, 2014, 10:30:48 AM
to mod-pagespeed-discuss, Maksim Orlovich, Jud Porter, Joshua Marantz
On Tue, Apr 1, 2014 at 11:00 PM, Hugues Alary <hug...@betabrand.com> wrote:

All in all, this setup currently runs on my development server with no issues, this sounds promising.

I am glad your configuration works! 


On a related note: 
I don't know the internals of mod_pagespeed, so I'm unclear as to why mod_pagespeed won't send me an optimized page unless I replace "mod_pagespeed" with "mod_pagefast" in the User-Agent string. Could you enlighten me (without necessarily going into too much detail)?

I don't know why this would be the case either. Adding a few folks who might know what is going on here. 

One other point regarding your new approach: Do make sure that no external request can force you to take this PURGE path and regenerate optimized HTML. Your ACLs for PURGE should be taking care of this.
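
A minimal sketch of such a guard, in Varnish 3 style VCL, combined with the PURGE-to-GET rewrite from earlier in the thread. The ACL name "pagespeed_purgers" and its 127.0.0.1 entry are assumptions for illustration; list whichever hosts actually run mod_pagespeed.

```vcl
# Hypothetical ACL: only these hosts may trigger regeneration.
acl pagespeed_purgers {
    "127.0.0.1";    # assumption: mod_pagespeed runs on the same host
}

sub vcl_recv {
    if (req.request == "PURGE") {
        # Reject PURGE from anyone outside the ACL.
        if (!client.ip ~ pagespeed_purgers) {
            error 405 "Not allowed.";
        }
        # Turn the purge into a forced refresh instead of an eviction.
        set req.request = "GET";
        # Rename the token so mod_pagespeed rewrites the response.
        set req.http.User-Agent = regsuball(req.http.User-Agent, "mod_pagespeed", "mod_pagefast");
        set req.hash_always_miss = true;
    }
}
```

The ACL check comes first so that an outside client cannot force a backend fetch and re-optimization just by sending PURGE.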
 


Jan-Willem Maessen

Apr 2, 2014, 10:35:46 AM
to mod-pagesp...@googlegroups.com, Maksim Orlovich, Jud Porter, Joshua Marantz
On Wed, Apr 2, 2014 at 10:30 AM, Anupama Dutta <anu...@google.com> wrote:

On Tue, Apr 1, 2014 at 11:00 PM, Hugues Alary <hug...@betabrand.com> wrote:

On a related note: 
I don't know the internals of mod_pagespeed, so I'm unclear as to why mod_pagespeed won't send me an optimized page unless I replace "mod_pagespeed" with "mod_pagefast" in the User-Agent string. Could you enlighten me (without necessarily going into too much detail)?

I don't know why this would be the case either. Adding a few folks who might know what is going on here. 

pagespeed tries to protect itself against fetching loops by refusing to rewrite content it's fetching from itself.  Rewriting the PURGE to a GET makes pagespeed think it's fetching data in response to itself – and in some sense that's true, but varnish is intervening here to keep things reasonably sane.

The most obvious scenario here is when we do in-place resource optimization, we don't want the loopback fetch for the unoptimized resource to itself trigger re-optimization!  But there are other situations where this can occur, particularly on multi-server setups, that predate the existence of in-place optimization.

-Jan

Hugues Alary

Apr 2, 2014, 2:00:45 PM
to mod-pagesp...@googlegroups.com, Maksim Orlovich, Jud Porter, Joshua Marantz
Thanks for your explanations, everyone.

