Preventing Google Web Accelerator from prefetching


Richie Hindle

Nov 16, 2005, 5:16:23 PM
to django...@googlegroups.com
Hi,

I've written a small piece of middleware to prevent Google Web Accelerator
(or any other prefetching client) from prefetching URLs. Since this is my
first piece of middleware, I'd appreciate it if those more experienced
than me could tell me whether it looks sensible, or whether it's flawed in
some way. The intended behaviour is to return a 403 Forbidden if the
request carries an "X-Moz: prefetch" header. Here's the code:

    from django.utils.httpwrappers import HttpResponseForbidden

    class NoPrefetchMiddleware:
        """Prevents prefetching clients (eg. Google Web Accelerator) from
        prefetching URLs. If your site ever changes state in response to a
        GET request (eg. with a Logout link rather than a Logout button), you
        need to suppress prefetch."""

        def process_request(self, request):
            if 'prefetch' in request.META.get('HTTP_X_MOZ', '').lower():
                return HttpResponseForbidden()
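
For readers unfamiliar with the META lookup above: the "X-Moz" request header shows up under the CGI-style key HTTP_X_MOZ. The following is a minimal standalone sketch of that check, written without Django so it can be run on its own. The helper names meta_key and is_prefetch are hypothetical and the metadata is modelled as a plain dict.

```python
def meta_key(header_name):
    """Convert an HTTP header name to its CGI-style metadata key."""
    return "HTTP_" + header_name.upper().replace("-", "_")


def is_prefetch(meta):
    """Return True if the metadata carries an X-Moz header mentioning prefetch."""
    return "prefetch" in meta.get(meta_key("X-Moz"), "").lower()


print(meta_key("X-Moz"))
print(is_prefetch({"HTTP_X_MOZ": "prefetch"}))
print(is_prefetch({"HTTP_USER_AGENT": "Mozilla"}))
```

The check only catches clients that announce themselves with this header; a prefetcher that omits it passes straight through.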

Thanks!

--
Richie Hindle
ric...@entrian.com


Jacob Kaplan-Moss

Nov 16, 2005, 6:10:33 PM
to django...@googlegroups.com
On Nov 16, 2005, at 4:16 PM, Richie Hindle wrote:
> I've written a small piece of middleware to prevent Google Web
> Accelerator
> (or any other prefetching client) from prefetching URLs. Since
> this is my
> first piece of middleware, I'd appreciate it if those more experienced
> than me could tell me whether it looks sensible, or whether it's
> flawed in
> some way. The intended behaviour is to return a 403 Forbidden if the
> request carries an "X-Moz: prefetch" header. Here's the code:

The code looks good and isn't flawed in any way.

However... the concept is. Developers shouldn't be blocking GWA; we
should be programming web apps that conform to expected HTTP
behavior. GWA *only* issues GET requests, and if an app modifies
data based on a GET, then the app should be considered broken.

As far as Django is concerned, this means your non-idempotent views
should check that they're not being called with GET;
django.views.decorators.http contains a set of easy view decorators
that will check for a given method transparently.
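
As a rough illustration of what such a method-checking decorator does, here is a standalone sketch. It is not Django's implementation: require_methods, FakeRequest and the tuple return values are hypothetical stand-ins so the example runs without the framework.

```python
from functools import wraps


class FakeRequest:
    """Stand-in for a request object; only the method matters here."""

    def __init__(self, method):
        self.method = method


def require_methods(*allowed):
    """Reject requests whose method is not in `allowed` (hypothetical helper)."""
    def decorator(view):
        @wraps(view)
        def wrapper(request, *args, **kwargs):
            if request.method not in allowed:
                # A real framework would return a 405 response object here.
                return (405, "Method Not Allowed")
            return view(request, *args, **kwargs)
        return wrapper
    return decorator


@require_methods("POST")
def logout(request):
    # State-changing work would happen here.
    return (200, "Logged out")


print(logout(FakeRequest("GET")))
print(logout(FakeRequest("POST")))
```

With a guard like this, a prefetcher that only issues GET requests never reaches the state-changing code.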

Jacob

Luke Plant

Nov 16, 2005, 6:14:54 PM
to django...@googlegroups.com
On Wed, 16 Nov 2005 22:16:23 +0000 Richie Hindle wrote:

>
> Hi,
>
> I've written a small piece of middleware to prevent Google Web
> Accelerator (or any other prefetching client) from prefetching URLs.

Can I ask first of all why you are doing this? If you are trying to
conserve your bandwidth or similar, fine, but I know some people want
to use links (i.e. HTTP GET requests) which have side effects, which is
Bad.

Secondly, there may be some things to consider with web caches. I
think you should add a vary header to indicate that the response will
vary depending on the value of the HTTP_X_MOZ header, or some such.
There is some relevant documentation here:
http://www.djangoproject.com/documentation/cache/

Luke

--
"Mistakes: It could be that the purpose of your life is only to serve
as a warning to others." (despair.com)

Luke Plant || L.Plant.98 (at) cantab.net || http://lukeplant.me.uk/

Richie Hindle

Nov 16, 2005, 7:21:29 PM
to django...@googlegroups.com

[Luke]
> I know some people want to use links (i.e. HTTP GET requests) which
> have side effects, which is Bad.

[Jacob]
> if an app modifies
> data based on a GET, then the app should be considered broken.

"Logout" is often a link, like it or not. (Amazon, Gmail, Yahoo...)

And yes, server resources are another issue. And Evil, that's another
issue. 8-)

> Secondly, there may be some things to consider with web caches. I
> think you should add a vary header to indicate that the response will
> vary depending on the value of the HTTP_X_MOZ header, or some such.

Good point, thanks:

    def process_request(self, request):
        if 'prefetch' in request.META.get('HTTP_X_MOZ', '').lower():
            response = HttpResponseForbidden()
            response['Vary'] = 'x-moz'
            return response

--
Richie Hindle
ric...@entrian.com


Simon Willison

Nov 17, 2005, 3:47:50 AM
to django...@googlegroups.com

On 16 Nov 2005, at 23:10, Jacob Kaplan-Moss wrote:

> However... the concept is. Developers shouldn't be blocking GWA;
> we should be programming web apps that conform to expected HTTP
> behavior. GWA *only* issues GET requests, and if an app modifies
> data based on a GET, then the app should be considered broken.

I'm afraid I just don't buy this. It holds for most cases, but there
are some significant ones where it doesn't. My favourite example is
Flickr's internal message system (or any other Webmail). It tells you
at the top of the page if you have any unread messages, and when you
view your inbox it shows unread messages in bold. The act of viewing
a message (by following a GET link) marks that message as read.

Sure, you could require people to click a "mark as read" button that
does a POST, or even have the interface to select a message to read
use POST buttons. That would suck though - it would break the ability
to open a bunch of messages in a new tab for one thing.

Meanwhile, GWA hits your inbox and instantly marks all your unread
messages as read! (That's assuming Flickr doesn't block it - I'll
have to check).

HTTP purity is a nice ideal, but until the HTML form model contains
better support for calling HTTP verbs that reflect what you are
actually trying to do it just isn't practical in every case. It's
those edge cases that make GWA's behaviour a bad idea.

Cheers,

Simon

Jeremy Dunck

Nov 17, 2005, 10:59:35 AM
to django...@googlegroups.com
On 11/17/05, Simon Willison <swil...@gmail.com> wrote:
> HTTP purity is a nice ideal, but until the HTML form model contains
> better support for calling HTTP verbs that reflect what you are
> actually trying to do it just isn't practical in every case. It's
> those edge cases that make GWA's behaviour a bad idea.

To pile on here, another "if only" bit is that if app-level auth was
done through HTTP, then GWA could just not prefetch on any page that
would have required auth headers. As it is, GWA can't know what
cookie-based auth is doing.

Following that line, I think GWA could be safer by just not
prefetching any request that would pass along HTTP auth or -any-
cookie. The down-side is obviously less pre-fetching, but it wouldn't
be dangerous.

And if you build a non-safe operation that general robots will trip
over, well, too bad. ;-)

This still leaves open sites which pass auth info in the URL, though.
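
The policy suggested above could be sketched roughly as follows. This is purely hypothetical client-side logic, not anything GWA is known to implement; the function name safe_to_prefetch and the dict of outgoing request headers are assumptions made for illustration.

```python
def safe_to_prefetch(request_headers):
    """Hypothetical policy: skip any request that would carry credentials."""
    names = {name.lower() for name in request_headers}
    # Header names are case-insensitive, so compare in lower case.
    if "cookie" in names or "authorization" in names:
        return False
    return True


print(safe_to_prefetch({"User-Agent": "prefetcher"}))
print(safe_to_prefetch({"Cookie": "sessionid=abc"}))
print(safe_to_prefetch({"authorization": "Basic dXNlcjpwdw=="}))
```

As noted, a check like this cannot see credentials embedded in the URL itself.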

Luke Plant

Nov 17, 2005, 1:44:57 PM
to django...@googlegroups.com
On Thu, 17 Nov 2005 00:21:29 +0000 Richie Hindle wrote:

>
>
> [Luke]
> > I know some people want to use links (i.e. HTTP GET requests) which
> > have side effects, which is Bad.
>
> [Jacob]
> > if an app modifies
> > data based on a GET, then the app should be considered broken.
>
> "Logout" is often a link, like it or not. (Amazon, Gmail, Yahoo...)

If you have to make it appear as a link, I would try these alternatives
first:

1) have an <a> link which actually does a javascript submit of a POST
form, and a <noscript> block which has an <input type=submit> which
does the same thing (most people will never browse the site with
javascript off so it doesn't matter that it doesn't look as good)

2) have an <input type=image> that looks like a link but as it is
really an input button it can do a POST form submit.

But I know that developers are not always given the freedom to do the
right thing. At work I was forced to implement a non-idempotent GET
request recently, despite my protests. At the time I didn't have the
example of Google Web Accelerator to make my point more forcefully, or
I might have won the argument. So now, if anyone browses the site we
developed with GWA installed, they will mysteriously find themselves
subscribed to every page they visit...

Luke

--
"My capacity for happiness you could fit into a matchbox without taking
out the matches first." (Marvin the paranoid android)

Eugene Lazutkin

Nov 18, 2005, 6:28:13 PM
to django...@googlegroups.com
Inline.

"Richie Hindle" <ric...@entrian.com> wrote in message
news:mhgnn152qi1itgej3...@4ax.com...
>
>
> [Luke]
>> I know some people want to use links (i.e. HTTP GET requests) which
>> have side effects, which is Bad.
>
> [Jacob]
>> if an app modifies
>> data based on a GET, then the app should be considered broken.
>
> "Logout" is often a link, like it or not. (Amazon, Gmail, Yahoo...)

FWIW, Gmail's "Refresh" link is not a link:

<span class="lk" id="refresh">Refresh</span>

It is styled exactly like a link ("lk") with underlined blue text. There is
code that handles the onclick event on this pseudo-link.

"Sign out" is a link with parameters:
http://mail.google.com/mail/?logout&hl=en --- GWA doesn't follow links with
parameters.

My point is it is possible to prevent GWA and similar systems from following
your links without server-side support.

Thanks,

Eugene



hugo

Nov 19, 2005, 6:02:17 AM
to Django users
> behavior. GWA *only* issues GET requests, and if an app modifies
> data based on a GET, then the app should be considered broken.

Actually the problem goes deeper: GWA can crawl areas that normally
can't be crawled, because they are behind logins. So GWA will hit pages
that were never meant to be hit by bots - private pages. But pages that
aren't meant for public consumption have different requirements: you
design them more often for convenience than for "good HTTP behaviour".
So you will find GET-with-side-effects more often behind logins than
before logins (because those with side-effects on GET will already be
hit by public bots).

GWA is a very bad idea, and it is done in a very bad way. I can't think
of any other google project where they fucked up that often (the last
one being to drop the header that designates that some request is done
by GWA instead of the browser itself). Even if you code your app to
expected HTTP behaviour, GWA itself doesn't always behave that way. And
we can't code our apps around the HTTP brokenness of other apps ...

So especially because of its problems, it is an absolutely valid request
to know how to block it from a web site.

bye, Georg

hugo

Nov 19, 2005, 6:06:28 AM
to Django users
>> Secondly, there may be some things to consider with web caches. I
>> think you should add a vary header to indicate that the response will
>> vary depending on the value of the HTTP_X_MOZ header, or some such.
>
>Good point, thanks:
>
>    def process_request(self, request):
>        if 'prefetch' in request.META.get('HTTP_X_MOZ', '').lower():
>            response = HttpResponseForbidden()
>            response['Vary'] = 'x-moz'
>            return response

Actually that will break all vary-header handling of Django. Better to
hook into the existing vary-header code. Look at the
django.views.decorators.vary stuff. It's mostly using the
patch_vary_headers from django.utils.cache.
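
The difference between overwriting the header and patching it can be sketched like this. The merge_vary function below is a hypothetical stand-in that only mirrors the general idea of adding names to an existing Vary value; it is not the code of patch_vary_headers, and the response headers are modelled as a plain dict.

```python
def merge_vary(headers, new_names):
    """Add header names to headers['Vary'] without dropping existing ones."""
    existing = [v.strip() for v in headers.get("Vary", "").split(",") if v.strip()]
    seen = {v.lower() for v in existing}
    for name in new_names:
        # Header names are case-insensitive, so avoid duplicates like Cookie/cookie.
        if name.lower() not in seen:
            existing.append(name)
            seen.add(name.lower())
    headers["Vary"] = ", ".join(existing)
    return headers


headers = {"Vary": "Cookie, Accept-Language"}
merge_vary(headers, ["X-Moz", "cookie"])
print(headers["Vary"])
```

A plain assignment such as response['Vary'] = 'x-moz' would instead discard whatever value was already set on that response.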

bye, Georg

Richie Hindle

Nov 19, 2005, 11:58:46 AM
to django...@googlegroups.com

[Richie]
>    def process_request(self, request):
>        if 'prefetch' in request.META.get('HTTP_X_MOZ', '').lower():
>            response = HttpResponseForbidden()
>            response['Vary'] = 'x-moz'
>            return response

[Georg]
> Actually that will break all vary-header handling of Django. Better to
> hook into the existing vary-header code. Look at the
> django.views.decorators.vary stuff. It's mostly using the
> patch_vary_headers from django.utils.cache.

I wondered about that, but will the request ever get anywhere near the
views/decorators code? It will get as far as this middleware and be
rejected. Or have I misunderstood the architecture?
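
For reference, the middleware contract as documented is that a response returned from process_request stops further request processing, so the view and any decorators wrapped around it are not called. A toy model of that short-circuit is sketched below; the handle function, the Forbidden class and the dict-based request are hypothetical simplifications, not Django's handler code.

```python
class Forbidden:
    """Stand-in for a 403 response object."""
    status = 403


class NoPrefetch:
    def process_request(self, request):
        if "prefetch" in request.get("HTTP_X_MOZ", "").lower():
            return Forbidden()
        return None  # let processing continue


def handle(request, middleware, view):
    """Run request middleware in order; the first response returned wins."""
    for mw in middleware:
        response = mw.process_request(request)
        if response is not None:
            return response  # the view is never called
    return view(request)


calls = []


def view(request):
    calls.append(request)
    return "view response"


result = handle({"HTTP_X_MOZ": "prefetch"}, [NoPrefetch()], view)
print(result.status, calls)
```

Under that model a view decorator never sees the rejected request, though a helper function that patches the Vary header could presumably still be called directly on the response inside the middleware.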

--
Richie Hindle
ric...@entrian.com

