The state of per-site/per-view middleware caching in Django


Jim Dalton

Oct 20, 2011, 8:45:23 AM
to django-d...@googlegroups.com
I spent the better part of yesterday mucking around in the dregs of Django's cache middleware and related modules, and in doing so I've come to the conclusion that, due to an accumulation of hindrances and minor bugs, the per-site and per-view caching mechanisms are effectively broken for many fairly typical usage patterns.

Let me demonstrate by fictional example, with what I would consider to be a pretty typical configuration and use case for the per-site cache:

Let's pretend I'm developing a blog powered by Django. I'm using memcached, and I would like to cache pages on that blog for anonymous users, who are going to make up the vast majority of my site's visitors. Ideally, I will serve the exact same cached version of a blog post to every single anonymous visitor to my site, which will help keep server load under control, particularly when I get slashdotted/reddited/what-have-you.

Like any blog, a typical page view features the content primarily (e.g. a blog post). It also has some "auth" stuff at the top right, which will say "Log in / Register" for users who are not logged in but show a username and welcome message for logged-in users. Each blog post also has an empty comment form at the bottom of it where users can leave comments on the post. Like 99% of the websites out there, I will be using Google Analytics to track my visitors etc.

Pretty straightforward, right?

Let me count the ways that Django's cache middleware will muck up my goals in the above scenario.

First, I'm going to try to use the per-site cache. Here's what's going to go wrong for me:

* It's going to be virtually impossible for me to avoid my cache varying by cookie and thus by visitor. Because in my templates I am checking to see if the current user is logged in, I'm touching the session, which is now going to set the Vary: Cookie header. That means if there is any difference in the cookies users are requesting my pages with, I'm going to be sending each user a separate cached page, keyed off the session cookie (SESSION_COOKIE_NAME), whose value is unique for every visitor.

* Even if I avoid touching the request user somehow, the CSRF middleware presents the same issue. Because I have a comment form on every page, I have a unique CSRF token for each visitor. Thankfully Django doesn't let me completely shoot myself in the foot by caching the page with one user's token and serving it to everybody else. At least it helpfully sets a CSRF token cookie and varies on it to prevent this. However, that cookie is different for every unique user. That triggers the same problem as above. I again cannot avoid caching a unique page for each unique visitor.

* Unfortunately, my troubles are not over, even if I resign myself to having a cache that varies per visitor. You see, Google Analytics actually sets a handful of other cookies with each page request. And guess what? The values for those cookies are unique *for each request*. This means... I'm actually not caching at all. Cookies are unique for each and every page request thanks to Google Analytics. My per-site cache configuration is totally and completely inoperable, all because I'm using a tracking service that pretty much *everybody* uses.
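
To make the effect of the three points above concrete, here is a rough sketch of how a cache key that honors the Vary header ends up different for every visitor. It is a simplified model, not Django's actual key-generation code; the `cache_key` function, the key format, and the cookie values are all made up for illustration.

```python
import hashlib


def cache_key(path, request_headers, vary_on):
    """Build a key from the path plus every request header listed in Vary."""
    digest = hashlib.md5()
    for name in sorted(vary_on):
        digest.update(request_headers.get(name, "").encode("utf-8"))
    return "page:%s:%s" % (path, digest.hexdigest())


# The response carried "Vary: Cookie", so the Cookie header feeds the key.
vary_on = ["Cookie"]
alice = {"Cookie": "sessionid=aaa111"}
bob = {"Cookie": "sessionid=bbb222"}

key_alice = cache_key("/blog/1/", alice, vary_on)
key_bob = cache_key("/blog/1/", bob, vary_on)
print(key_alice == key_bob)  # False: each visitor gets a separate entry
```

Any cookie whose value changes on every request has the same effect on the key as a new visitor would, which is why a per-request analytics cookie means nothing is ever served from the cache.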

Since that didn't work, I wonder if it'll work if I do per-view caching? It shouldn't work at all, should it, since it's not like any of the factors I outlined above are different if I'm using the @cache_page decorator to do my caching vs the per-site cache.

Well, the sad news is caching does "work" when I use cache_page, and that's not a good thing:

* @cache_page caches the direct output of the view/render function. It skips over the middleware that might have very good reason to introduce vary headers and doesn't introduce any vary headers of its own. So now, with this applied, I *am* serving a cached version of this page even though I absolutely should not be. Some poor user's token is now being sent to everybody. My only chance of redemption is if I happen to have read the docs and discovered that this incantation is required to prevent having cache_page improperly cache the page:

   @cache_page(60 * 15)
   @csrf_protect
   def my_view(request):
       # ...
       
Of course, the above just puts me right back where I started at the per-site level. There was never any chance of making cache_page work any differently from the per-site cache, but it certainly proved to be a temptation if I'm a hurried developer, frustrated by why my per-site cache wasn't working and "thankful" for the fact that I could get the cache to start "working" with the cache_page decorator.
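
The failure mode is easy to reproduce in a toy model. The sketch below is not Django's cache_page; `naive_cache_page`, the dict-based request, and the token strings are hypothetical stand-ins. It only shows what happens when a decorator caches the raw view output keyed on the path, without seeing any Vary information.

```python
import functools

_cache = {}


def naive_cache_page(view):
    """Cache the raw view output, keyed only on the path."""
    @functools.wraps(view)
    def wrapper(request):
        if request["path"] not in _cache:
            _cache[request["path"]] = view(request)
        return _cache[request["path"]]
    return wrapper


@naive_cache_page
def blog_post(request):
    # The per-visitor token is baked into the rendered HTML.
    return "<form><input name='token' value='%s'></form>" % request["token"]


first = blog_post({"path": "/blog/1/", "token": "token-of-visitor-1"})
second = blog_post({"path": "/blog/1/", "token": "token-of-visitor-2"})
print(second == first)  # True: visitor 2 receives visitor 1's token
```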

Hopefully the above example really makes it clear to you guys how all of the seemingly minor bugs and imperfections really do add up to a broken situation for someone coming to this with a pretty standard set of expectations and requirements.

Anyhow, the good news is that a good portion of what I have written about already has open tickets which in some cases are close to being ready for checkin:

* Google Analytics is a known issue with a proposed patch: https://code.djangoproject.com/ticket/9249

* CSRF is known to not play nicely with caching, it's documented at least: https://docs.djangoproject.com/en/dev/ref/contrib/csrf/#caching

* The actual underlying cache_page issue is ticketed: https://code.djangoproject.com/ticket/15855

Still, I can't help but feel that, to an extent, these are band aids. There is still an exceptionally narrow set of circumstances that would allow me to serve a single cached page to all anonymous visitors to my site: namely, I can't touch request.user and I can't use CSRF. Quite honestly, I'm not even sure you should be using a framework like Django if most of your pages don't have logic pertaining to a logged in vs. anonymous user, or have some kind of form on them which requires CSRF protection. Even if all of the above tickets got fixed, it seems like we're still in kind of a bad place.

I don't know that I have good solutions to any of this (though I am very much willing to contribute work toward such a solution). I do have a few ideas/questions to pose to conclude with here:

* Is it reasonable to set as a goal that Django should attempt to support per site caching for the scenario I described above? I mean, am I wrong in thinking that in an ideal world, it should be possible to serve the same cached page to all anonymous users most of the time, even if there are forms or anonymous vs. logged in user logic on it?

* Is an embedded token the only form CSRF protection can come in? Why can't the token be set as a cookie and the value of that cookie serve as the CSRF verification (without varying on it in the cache, obviously)? Or perhaps there's a way to dynamically generate a CSRF token via ajax after the page load? I'm certain someone much smarter and more knowledgeable than I will point out why these are dreadfully horrible, unworkable ideas, but the embedded token is sort of a deal breaker for effective caching, and these days many, many sites have forms on almost every page (e.g. a hidden login form that's revealed when you press login, comment form, etc.).

* Why does the cookie have to vary if the request user object is touched on the template even though it's not authenticated? If the sessionid isn't even in the request cookie (i.e. for a first time visitor), then it doesn't require a real "check" of the session. And correct me if I'm wrong, but doesn't the session key get cycled when a user logs in anyway? In other words, a session key that represents an anonymous user will *always* represent an anonymous user. Perhaps there's a way to keep track of those anonymous session ids so the same anonymous cached view can be served to them all. What a waste to generate the entire page dynamically for each individual anonymous user all because of one simple key lookup. Again, this is probably a hopelessly naive idea with a sensible, obvious rebuttal, but perhaps there is some merit in coming up with a creative solution?
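
For the cookie-based CSRF question above, the general technique usually goes by the name "double-submit cookie": the token lives only in a cookie, client-side script copies it into the submitted request, and the server compares the two. The following is a minimal sketch of that comparison, assuming hypothetical helper names (`issue_csrf_cookie`, `is_valid_submission`); it is not how Django's CSRF middleware is implemented, and it says nothing about whether the approach is acceptable for Django.

```python
import hmac
import secrets


def issue_csrf_cookie():
    """Return a fresh random token to be sent as a cookie value."""
    return secrets.token_hex(16)


def is_valid_submission(cookie_token, submitted_token):
    """Accept the POST only if the submitted value matches the cookie."""
    if not cookie_token or not submitted_token:
        return False
    # Constant-time comparison to avoid leaking the token via timing.
    return hmac.compare_digest(cookie_token, submitted_token)


token = issue_csrf_cookie()
print(is_valid_submission(token, token))  # True
print(is_valid_submission(token, "forged"))  # False
```

Because the token never appears in the HTML, the page body itself could be identical for every visitor; the cost is that the form needs script support to copy the cookie value into the request.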

I have to guess some of you have already spent some brain cycles thinking about the above issues I've raised, in whole or in part, and I apologize if I'm re-hashing an old debate or am so totally off-base that I've wasted your time if you made it this far. My intent, again, is not to complain, but to see if others agree that the current state of the per-site cache is not so great, and if so, to elicit some ideas on how to best address it. It also seems to me that there is more than just one problem standing in the way of things, so "success" might require something of a coordinated effort.

Please do let me know if my concerns make sense, if my goal is a legitimate one, if I'm wrong in part or in whole, etc. etc. As I said earlier, if there's a path forward on any of the above I am happy to contribute to the effort.

Thanks for listening.

Niran Babalola

Oct 20, 2011, 1:26:36 PM
to django-d...@googlegroups.com
On Thu, Oct 20, 2011 at 7:45 AM, Jim Dalton <jim.d...@gmail.com> wrote:
> There
> is still an exceptionally narrow set of circumstances that would allow me to
> serve a single cached page to all anonymous visitors to my site: namely, I
> can't touch request.user and I can't use CSRF.

This problem is inherent to page caching. Workarounds to avoid varying
by cookie for anonymous users are conceptually incorrect. If a single
URL can give different responses depending on who's viewing it, then
it varies by cookie. Preventing CSRF is inherently session-variable as
well. Loading the token via a separate AJAX call is possible, but
there are simpler solutions.

If you want to cache pages with small portions that vary by user, then
you want edge side includes and something like Varnish to process
them. If you want a much slower, pure-python solution that doesn't
require a separate service running somewhere, then you want
armstrong.esi[1].

- Niran

[1] <https://github.com/texastribune/armstrong.esi>. armstrong.esi
isn't part of Armstrong proper yet, but if you want to know more about
the project, head to <http://armstrongcms.org/> and
<https://github.com/armstrong/armstrong>.
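
As a rough illustration of the edge side include idea mentioned above: the page is cached once with an `<esi:include>` tag where the per-user portion goes, and a processor in front of the cache fills that tag in on every request. The sketch below is a toy processor, not armstrong.esi or Varnish; the fragment URL `/fragments/auth/` and the `render_for` helper are invented for the example.

```python
import re

ESI_INCLUDE = re.compile(r'<esi:include\s+src="([^"]+)"\s*/>')


def process_esi(cached_html, render_fragment):
    """Replace each <esi:include> tag with the fragment for its src."""
    return ESI_INCLUDE.sub(lambda m: render_fragment(m.group(1)), cached_html)


cached_page = '<div><esi:include src="/fragments/auth/"/></div><p>Post body</p>'


def render_for(username):
    def render_fragment(src):
        # Hypothetical fragment renderer; a real proxy would fetch src.
        if src == "/fragments/auth/":
            return "Welcome, %s" % username if username else "Log in / Register"
        return ""
    return render_fragment


print(process_esi(cached_page, render_for(None)))
print(process_esi(cached_page, render_for("jim")))
```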

Jim Dalton

Oct 20, 2011, 4:04:18 PM
to django-d...@googlegroups.com
On Oct 20, 2011, at 10:26 AM, Niran Babalola wrote:

> This problem is inherent to page caching. Workarounds to avoid varying
> by cookie for anonymous users are conceptually incorrect. If a single
> URL can give different responses depending on who's viewing it, then
> it varies by cookie. Preventing CSRF is inherently session-variable as
> well. Loading the token via a separate AJAX call is possible, but
> there are simpler solutions.

You may in fact be correct, but I'm not convinced by what you're saying here (not that there is any onus on you to convince me of anything of course).

I'm suggesting that all anonymous users *could* receive an identical page from the server, theoretically, since the same URL does *not* need to return a different response depending on which (anonymous) user is viewing it. CSRF is obviously a trickier problem, and it's not really worth solving the anonymous user problem if CSRF isn't solved as well. But if both problems were somehow solvable, then we're in a position where per-site cache would be viable for many common scenarios such as the one I described in my original post.

If these two problems are in fact unsolvable or not worth solving because simpler alternatives exist, that's fine and understandable. Perhaps per-site/per-view caching are indeed exceptionally limited tools that are beneficial in a very limited number of use cases, and perhaps the "solution" here is tidying up the outstanding bugs and perhaps clarifying the documentation as needed to make the limitations more explicit.


>
> If you want to cache pages with small portions that vary by user, then
> you want edge side includes and something like Varnish to process
> them. If you want a much slower, pure-python solution that doesn't
> require a separate service running somewhere, then you want
> armstrong.esi[1].


Thanks. This post wasn't really about what *I* need btw; I can definitely sort out my caching strategies in other areas as I need to. The post only relates to "me" because I sat down yesterday and said, "Gee, I wonder if I could make use of Django's per-site caching feature for this project I'm working on." I turned it "on" to test it out and then spent the next 6 hours delving into the source code, IRC, ticket tracker, Google etc. to figure out why it wasn't working at all and why @cache_page was, and then after finally sorting it out and grokking all of the moving parts etc, realizing that there was extraordinarily limited value in a per-site/view caching strategy that caches per unique visitor, which is pretty much unavoidable for most common usage patterns.

So yeah, maybe it's me and I'm looking at things the wrong way, but needless to say it wasn't a particularly pleasant or worthwhile experience. Not looking for pity btw, but just wondering what I/we can or should do to make it better.

Jim

Jens Diemer

Oct 20, 2011, 5:00:18 PM
to django-d...@googlegroups.com
Hi...

For PyLucid I made a simple cache middleware [1] similar to Django's per-site
cache middleware [2]. But it doesn't vary on cookies and doesn't cache cookies. It
simply caches only the response content.

Of course: This doesn't solve the problem if "csrfmiddlewaretoken" is in the content.

Here is some pseudo code from [1]:
-----------------------------------------------------------------
# self.use_cache() and timeout are assumed to be defined elsewhere
# in the middleware.
from django.core.cache import cache
from django.http import HttpResponse
from django.utils.cache import patch_response_headers

def process_request(self, request):
    if not self.use_cache(request):
        return

    # Look up by path only, matching the key used in process_response
    response = cache.get(request.path)
    if response is not None:
        return response

def process_response(self, request, response):
    if not self.use_cache(request):
        return response

    # Cache only the raw content
    response2 = HttpResponse(
        content=response._container, status=200,
        content_type=response['Content-Type']
    )

    patch_response_headers(response2, timeout)

    cache.set(request.path, response2, timeout)

    return response

-----------------------------------------------------------------

[1]
https://github.com/jedie/PyLucid/blob/master/pylucid_project/middlewares/cache.py
[2] https://docs.djangoproject.com/en/1.3/topics/cache/#the-per-site-cache


Mfg.

Jens D.

Carl Meyer

Oct 20, 2011, 9:02:34 PM
to django-d...@googlegroups.com
Hi Jim,

This is a really useful summary of the current state of things, thanks
for putting it together.

Re the anonymous/authenticated issue, CSRF token, and Google Analytics
cookies, it all boils down to the same root issue. And Niran is right,
what we currently do re setting Vary: Cookie is what we have to do in
order to be correct with respect to HTTP and upstream caches. For
instance, we can't just remove Vary: Cookie from unauthenticated
responses, because then upstream caches could serve that unauthenticated
response to anyone, even if they are actually authenticated.

Currently the Django page caching middleware behaves pretty much just
like an upstream cache in terms of the Vary header. Apart from the
CACHE_MIDDLEWARE_ANONYMOUS_ONLY setting, it just looks at the response,
it doesn't make use of any additional "inside information" about what
your Django site did to generate that response in order to decide what
to cache and how to cache it.

This approach is pretty attractive, because it's conceptually simple,
consistent with upstream HTTP caching, and conservative (quite unlikely
to serve the wrong cached content).

It might be possible to make it "smarter" in certain cases, and allow it
to cache more aggressively than an upstream cache can. #9249 is one
proposal to do this for cookies that aren't used on the server, either
via explicit setting or (in a recently-added proposal) via tracking
which cookie values are accessed. If we did that, plus special-cased the
session cookie if the user is unauthenticated and the session isn't used
outside of contrib.auth, I think that could possibly solve the
unauthenticated-users and GA issues.
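
The "ignore cookies that the server never reads" idea can be illustrated roughly as follows. This is only a sketch under stated assumptions: the `IGNORED_COOKIES` set and the `cookie_key_part` function are hypothetical names, the cookie values are made up, and none of this is the patch attached to #9249.

```python
import hashlib
from http.cookies import SimpleCookie

# Hypothetical list of client-side-only cookies (Google Analytics names).
IGNORED_COOKIES = {"__utma", "__utmb", "__utmc", "__utmz"}


def cookie_key_part(cookie_header, ignored=IGNORED_COOKIES):
    """Hash only the cookies the server actually cares about."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    kept = sorted(
        (name, morsel.value) for name, morsel in jar.items()
        if name not in ignored
    )
    return hashlib.md5(repr(kept).encode("utf-8")).hexdigest()


first = cookie_key_part("__utma=111.222; __utmb=333")
second = cookie_key_part("__utma=999.888; __utmb=777")
print(first == second)  # True: analytics cookies no longer split the cache
```

A session cookie that is not in the ignored set still changes the key, so this kind of filtering only affects the internal cache; the Vary: Cookie header sent to upstream caches has to stay as it is.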

However, this (especially the latter) would come with the cost of making
the cache middleware implementation more fragile and coupled to other
parts of the framework. And it still doesn't help with CSRF, which is a
much tougher nut to crack, because every response for pages using CSRF
comes with a Set-Cookie header and probably with a CSRF token embedded in
the response content; and those both mean that response really can't be
re-used for anyone else. (Getting rid of the token embedded in the HTML
means forms couldn't ever POST without JS help, which is not an option
as the documented default approach). You can mark some form-using views
that are available to anonymous users as csrf-exempt, which exposes you
potentially to CSRF-based spam, but isn't a security issue if you aren't
treating authenticated submissions any differently from
non-authenticated ones.

Generally, I come down on the side of skepticism that introducing these
special cases into the cache middleware really buys enough to be worth
the added complexity (though I could be convinced that #9249 is worth it).

I do think we should improve the cache middleware documentation so its
limitations are outlined more clearly upfront, and point people towards
existing solutions for caching mostly-but-not-entirely-anonymous pages:
edge-side-includes, two-phase-render, and JS/AJAX fetch.

#15855, on the other hand, is a bug that really does need to be fixed. I
still don't see a better fix than the one I outlined in the ticket
description: requiring some middleware to be in MIDDLEWARE_CLASSES for
the cache_page decorator to work, and not doing the actual caching until
we hit that middleware. Or alternatively, adding an implicit "cache any
responses that had cache_page used on them" phase to response
processing, after all middleware. I think those are both ugly fixes,
though; maybe someone has a better idea. The last time I know of that
this was discussed in-depth was in
http://groups.google.com/group/django-developers/browse_frm/thread/f96e982254fbe5c3/2b02361fd6e706f4

Carl

Jim Dalton

Oct 21, 2011, 9:02:31 AM
to django-d...@googlegroups.com

Thanks Carl. This is definitely a good, clarifying response to what I was mulling over.

A few thoughts of my own to add here:

* You and Niran are certainly right about upstream caches. Regardless of what we do here, we'll have to vary by cookie in the response header. This makes sense for a site that offers authentication: Django needs to check on every page view to see if the user is authenticated, so we can't have the upstream cache holding on to a page for us.

* Agreed about how the "smartness" comes at the cost of brittleness if the implementations are too tightly coupled. That said, I can squint and sort of see an implementation that could thread the needle here. It would require something like:

- An API in the cache middleware instructing it to ignore certain cookies for the purposes of caching (i.e. something along the lines of #9249).

- Some kind of "pre-fetch" hook in the cache middleware. Whether it's a flag in the request object, a signal or something else, give other systems the ability to look at a request before it hits the FetchFromCacheMiddleware and either allow or prevent the response from being pulled from the cache. E.g. if there was a flag request.invalidate_cache that defaults to False, the contrib.auth app could, in combination with the above, pull the session id from consideration in the cache key and do an authentication check on its own, invalidating the cache on its own if the user is authenticated. The core idea is what you already suggested; I'm more illustrating here that this can conceivably be implemented as an API, making it less brittle.

- Some kind of "post-fetch" hook in the cache middleware, combined with a retooling of the CSRF middleware. This is getting in the clouds here a bit, but a hook on the opposite end of the fetch operation could allow the CSRF app to add its token after the response was pulled from the cache. I say we're in the clouds here because for something like this to work the CSRF would have to do a little two-step dance. Before the UpdateCache step the CSRF would have to insert something that looked like a server-side template tag, which gets cached, and then after that step the CSRF would have to insert its actual value. On the fetch side, the CSRF would have to make use of the post-fetch hook to pull the cached page rendered with the server-side template tag thingy and then add the correct value on its way out the door. Essentially, we're talking about a poor man's two-phase rendering system.

This barely qualifies as a thought exercise let alone a proposal, but my main underlying suggestion here is that if the cache middleware correctly implemented hooks of some kind in the right locations, it might well be possible for systems like auth and CSRF to do what they would need to do without coupling all these systems together in a giant ball of twine.


> I do think we should improve the cache middleware documentation so its
> limitations are outlined more clearly upfront, and point people towards
> existing solutions for caching mostly-but-not-entirely-anonymous pages:
> edge-side-includes, two-phase-render, and JS/AJAX fetch.
>
> #15855, on the other hand, is a bug that really does need to be fixed. I
> still don't see a better fix than the one I outlined in the ticket
> description: requiring some middleware to be in MIDDLEWARE_CLASSES for
> the cache_page decorator to work, and not doing the actual caching until
> we hit that middleware. Or alternatively, adding an implicit "cache any
> responses that had cache_page used on them" phase to response
> processing, after all middleware. I think those are both ugly fixes,
> though; maybe someone has a better idea. The last time I know of that
> this was discussed in-depth was in
> http://groups.google.com/group/django-developers/browse_frm/thread/f96e982254fbe5c3/2b02361fd6e706f4
>
> Carl

My thinking right now as far as moving forward:

1. Fixing #9249 and #15855. I hear your philosophical concerns about #9249 but the ubiquity of Google Analytics means we must find some way to fix it (IMO). Addressing these two tickets would at least ensure page caching wasn't actually broken. I'll try to jump in on those if I have time later next week. #9249 in particular seems quite close.

2. Clarifying the documentation. I think an admonition in the page caching section of the docs which outlined the present challenges a developer might face implementing it would probably have done the trick for me when I was first glancing at it. I can open a ticket on that next week, again if I have time.

It'd be great if these two got in for 1.4.

3. Addressing the other stuff is I guess for now a sort of "some day" goal. I continue to feel strongly that it's a worthy goal, particularly given that CSRF and contrib.auth are such fundamental parts of most projects and that they really are the only two things that stand in the way of page caching being a viable option in many projects. If anyone else gets inspired by this goal let me know, otherwise I'm content for the time being to let it stew.

Thanks all for listening.

Kääriäinen Anssi

Oct 21, 2011, 11:04:19 AM
to django-d...@googlegroups.com
I do not know nearly enough about caching to participate fully in this discussion. But it strikes me that the attempt to have CSRF protected anonymous page cached is not that smart. If you have an anonymous submittable form, why bother with CSRF protection? I mean, what is it protecting against? Making complex arrangements in the caching layer for this use case seems like wasted effort. Or am I missing something obvious?

The following is from the stupid ideas department: Maybe there could be a "reverse cache" template tag, such that you would mark the places where you want changing content as non-cacheable. You would need two views for this, one which would construct the "base content" and then another which would construct the dynamic parts. Something like:

page_cached.html:
... expensive to generate content ...
{% block "login_logout" non_cacheable %}
{% endblock %}
... expensive to generate content ...

You would generate the base page by a cached render view:

def page_view_cached(request, id):
    if cached(id):
        return cached_content
    else:
        ... expensive queries ...
        return cached_render("page_cached.html", context, ...)

The above view would not be directly usable at all, you would need to use a wrapper view which would render the non-cacheable parts:

def page_view(request, id):
    # Below would return quickly from cache most of the time
    cached_portions = page_view_cached(request, id)
    return render_to_response("page.html", {"cached": cached_portions, "user": request.user})

where page.html would be:
{% extends cached %}
{% block login_logout %}
{% if user.is_authenticated %}
Hello, user!
{% else %}
<a href="login.html">login</a>
{% endif %}
{% endblock %}

That seems to be what is really wanted in this situation. The idea is quite simply to extend the block syntax to caching. A whole other issue is how to make this easy enough to be actually usable, and fast enough to be actually worth it.

- Anssi


Jim Dalton

Oct 21, 2011, 2:17:51 PM
to django-d...@googlegroups.com
On Oct 21, 2011, at 8:04 AM, Kääriäinen Anssi wrote:

> I do not know nearly enough about caching to participate fully in this discussion. But it strikes me that the attempt to have CSRF protected anonymous page cached is not that smart. If you have an anonymous submittable form, why bother with CSRF protection? I mean, what is it protecting against? Making complex arrangements in the caching layer for this use case seems like wasted effort. Or am I missing something obvious?

First issue is that CSRF can matter for anonymous users. From here http://www.squarefree.com/securitytips/web-developers.html#CSRF:

Attacks can also be based on the victim's IP address rather than cookies:

• Post an anonymous comment that is shown as coming from the victim's IP address.
...
• Perform a distributed password-guessing attack without a botnet. (This assumes they have a way to tell whether the login succeeded, perhaps by submitting a second form that isn't protected against CSRF.)

So two very common use cases for anonymous forms are log in forms and anonymous comment forms, both of which are potentially vulnerable. I guess I feel like it's quite common to have forms on a page these days even for anonymous users.

Second, and I'm not sure about this, I don't know how well CSRF handles authentication conditionally. Like if I have a page and let's say that page has forms in it for logged in users but nothing for anonymous users, can I conditionally exempt the formless page from CSRF? I have no idea, but by default I presume it's on and I presume the cache is varying on it.

So, yes, you could probably optimize a lot of this to sort of skip around the CSRF issue and it's not a deal breaker. But my main argument has been the ubiquity of CSRF + user authentication in Django projects to me means a solution to both of these is a requirement for page caching to become easy and applicable in most scenarios.

>
> The following is from the stupid ideas department: Maybe there could be a "reverse cache" template tag, such that you would mark the places where you want changing content as non-cacheable. You would need two views for this, one which would construct the "base content" and then another which would construct the dynamic parts. Something like:
>

Your idea sounds a lot like the "server side include" or "two phased template rendering" approach that I know some people are doing. Here's an excellent example of this approach being used in EveryBlock (from two years ago):

http://www.holovaty.com/writing/django-two-phased-rendering/

And looks like some core devs have been involved at some point in this implementation of that concept:

https://github.com/codysoyland/django-phased

That looks almost exactly like your idea: "django-phased contains a templatetag, phased, which defines blocks that are to be parsed during the second phase. A middleware class, PhasedRenderMiddleware, processes the response to render the parts that were skipped during the first rendering."

I guess that was sort of what I was hinting at in my previous discussion about how to handle CSRF. In the link he is taking it to the next level (where even logged in users get the page and that stuff is done after).

Anyhow it's obviously a sensible conceptual approach. It would be a stretch to fit that into the existing page caching approach of Django obviously.

Carl Meyer

Oct 21, 2011, 2:39:05 PM
to django-d...@googlegroups.com
On 10/21/2011 07:02 AM, Jim Dalton wrote:
> 1. Fixing #9249 and #15855. I hear your philosophical concerns about
> #9249 but the ubiquity of Google Analytics means we must find some
> way to fix it (IMO). Addressing these two tickets would at least
> ensure page caching wasn't actually broken. I'll try to jump in on
> those if I have time later next week. #9249 in particular seems quite
> close.
>
> 2. Clarifying the documentation. I think an admonition in the page
> caching section of the docs which outlined the present challenges a
> developer might face implementing it would probably have done the
> trick for me when I was first glancing at it. I can open a ticket on
> that next week, again if I have time.
>
> It'd be great if these two got in for 1.4.

Agreed - any work you're able to put in on any of these is very welcome.

> 3. Addressing the other stuff is I guess for now a sort of "some day"
> goal. I continue to feel strongly that it's a worthy goal,
> particularly given that CSRF and contrib.auth are such fundamental
> parts of most projects and that they really are the only two things
> that stand in the way of page caching being a viable option in many
> projects. If anyone else gets inspired by this goal let me know,
> otherwise I'm content for the time being to let it stew.

I take your point that it might be possible to do a cache-tweaking API
that could allow the cache to be more aggressive around auth and CSRF
with less coupling (though you'd still end up sprinkling cache-specific
stuff into auth and CSRF with your approach). I remain pretty skeptical
about whether this is a good idea; it seems like it could significantly
increase the surface area for bugs in the cache middleware
implementation, and just generally make the implementation harder to
maintain with correct behavior. (I have some painful experience in this
area: CACHE_MIDDLEWARE_ANONYMOUS_ONLY is the one existing, and
relatively simple, instance of the type of enhanced caching logic you're
talking about, and I made some fixes to it in the 1.3 cycle that I then
later had to fix again due to unanticipated side effects of the first
change). But at this point this is all kind of hand-waving without code
to look at.

You might also consider what's possible to do outside of core as a
third-party alternative to Django's caching middleware. When you're
proposing major and somewhat experimental changes, that can be a
powerful way to demonstrate that the idea is workable, and makes it a
lot easier to pick up users and advocates; people are generally more
willing to try out a third-party tool than to run or test with a patched
Django.

Carl

h3

unread,
Oct 22, 2011, 12:13:14 AM10/22/11
to Django developers
I think that, for the moment, the easy fix for anonymous forms is either
to put them on a different page or to load them with Ajax.

This way the forms, and thus the tokens, get generated only when
needed.

If caching and performance are a big concern, I think those
alternatives are win/win solutions.

You solve your problem and remove load.
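
To make the split concrete, here is a rough, framework-free sketch. Everything in it is hypothetical (the names page_cache, get_page and get_comment_form, and the /comment-form/ URL); it is not Django's API, and a real CSRF token would also be tied to the visitor's cookie. The point is only that the cached page carries no token, while the form fragment is built on demand:

```python
import secrets

page_cache = {}  # shared cache: URL -> rendered HTML (hypothetical stand-in for memcached)


def render_post(url):
    """Render the cacheable page; it contains no per-visitor data."""
    return (
        "<article>Post at %s</article>"
        '<div id="comment-form" data-src="/comment-form/"></div>' % url
    )


def get_page(url):
    """Serve the page from the shared cache, rendering it on first request."""
    if url not in page_cache:
        page_cache[url] = render_post(url)
    return page_cache[url]


def get_comment_form():
    """Never cached: a fresh token is generated only when the form is requested."""
    token = secrets.token_hex(16)
    html = (
        '<form method="post">'
        '<input type="hidden" name="csrfmiddlewaretoken" value="%s" />'
        "</form>" % token
    )
    return html, token
```

Every anonymous visitor gets the same bytes from get_page(), and only those who actually open the form cost a call to get_comment_form().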

My 2¢

Anssi Kääriäinen

unread,
Oct 22, 2011, 7:24:42 PM10/22/11
to Django developers
On Oct 21, 9:17 pm, Jim Dalton <jim.dal...@gmail.com> wrote:
> On Oct 21, 2011, at 8:04 AM, Kääriäinen Anssi wrote:
>
> > I do not know nearly enough about caching to participate fully in this discussion. But it strikes me that the attempt to have CSRF protected anonymous page cached is not that smart. If you have an anonymous submittable form, why bother with CSRF protection? I mean, what is it protecting against? Making complex arrangements in the caching layer for this use case seems like wasted effort. Or am I missing something obvious?
>
> First issue is that CSRF can matter for anonymous users. From here http://www.squarefree.com/securitytips/web-developers.html#CSRF:
>
> Attacks can also be based on the victim's IP address rather than cookies:
>
>         • Post an anonymous comment that is shown as coming from the victim's IP address.
> ...
>         • Perform a distributed password-guessing attack without a botnet. (This assumes they have a way to tell whether the login succeeded, perhaps by submitting a second form that isn't protected against CSRF.)
>
> So two very common use cases for anonymous forms are log in forms and anonymous comment forms, both of which are potentially vulnerable. I guess I feel like it's quite common to have forms on a page these days even for anonymous users.
>
> Second is -- and I don't know about this -- but I don't know how well CSRF handles authentication conditionally. Like if I have a page and let's say that page has forms in it for logged in users but nothing for anonymous users, can I conditionally exempt the formless page from CSRF? I have no idea, but by default I presume it's on and I presume the cache is varying on it.
>
> So, yes, you could probably optimize a lot of this to sort of skip around the CSRF issue and it's not a deal breaker. But my main argument has been the ubiquity of CSRF + user authentication in Django projects to me means a solution to both of these is a requirement for page caching to become easy and applicable in most scenarios.

I can see how the above mentioned cases are useful, and as you say,
they probably are common in real world usage.

I took a different approach to phased template rendering in
[https://github.com/akaariai/django/tree/rewritable_content]. I hope it
will give some insight into solving the rewriting of already rendered
content containing csrf_token.

The idea is that template.render(context) returns a subclass of
SafeUnicode instead of just SafeUnicode. The subclass knows the
positions of rewritable parts of the content (csrf token values, for
example), and also how to rewrite those parts of the content. So, from
a template {% csrf_token %} you could get something like this back:

>>> rendered = tmpl.render(Context({'csrf_token': 'CSRF_TOKEN_VALUE'}))
>>> str(rendered)
<input type="hidden" value="CSRF_TOKEN_VALUE" /> # (pseudoish...)

>>> rendered.rewritable_parts
{'csrf_token': [(27, 42)]} # a dictionary of rewritable name -> list of str positions where that block exists.
>>> rendered.rewrite({'csrf_token': 'NEW_VALUE'})
<input type="hidden" value="NEW_VALUE" />

There are some tests in the github branch. Those tests are the best
documentation currently available.
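
For readers who want to play with the idea without checking out the branch, here is a minimal standalone sketch. It is not the code from the branch: it subclasses plain str rather than SafeUnicode, and the names RewritableStr and render_csrf_input are made up for illustration. It only shows a string that carries the offsets of its rewritable parts and can rebuild itself with new values:

```python
class RewritableStr(str):
    """A str that remembers where its rewritable parts are (illustrative only)."""

    def __new__(cls, value, rewritable_parts=None):
        obj = super().__new__(cls, value)
        # name -> list of (start, end) offsets into the string
        obj.rewritable_parts = dict(rewritable_parts or {})
        return obj

    def rewrite(self, new_values):
        """Return a new RewritableStr with the named parts replaced."""
        # Collect every tracked span and walk them left to right.
        spans = sorted(
            (start, end, name)
            for name, positions in self.rewritable_parts.items()
            for start, end in positions
        )
        out, new_parts = [], {}
        cursor = 0  # position in the original string
        length = 0  # length of the output built so far
        for start, end, name in spans:
            # Names missing from new_values keep their current content.
            replacement = new_values.get(name, self[start:end])
            out.append(self[cursor:start])
            length += start - cursor
            out.append(replacement)
            # Record where the part now lives so it can be rewritten again.
            new_parts.setdefault(name, []).append((length, length + len(replacement)))
            length += len(replacement)
            cursor = end
        out.append(self[cursor:])
        return RewritableStr("".join(out), new_parts)


def render_csrf_input(token):
    """Hypothetical stand-in for what the csrf_token tag would produce."""
    prefix = '<input type="hidden" value="'
    html = prefix + token + '" />'
    return RewritableStr(html, {"csrf_token": [(len(prefix), len(prefix) + len(token))]})
```

Because rewrite() recomputes the offsets, the result can be rewritten again with a value of a different length.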

Rewritable rendered templates should be usable in automatic handling
of csrf_token when solving the caching problem. If you do no caching,
the user will get a normal response. If you do caching, then you will
need a hook to do the response.rewrite for the csrf_token in cache
fetching. This has been discussed already, and seems to be solvable.
The actual rewrite of the content would be easy; it is just
response.rewrite({'csrf_token': 'new_csrf_token_value'}). This way it
could be possible to cache pages containing csrf_token transparently
to the user.
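
As a rough illustration of what such a cache-fetch hook might look like, here is a framework-free sketch. The cache dict and the store/fetch functions are hypothetical, not Django's cache middleware; the assumption is simply that the cache keeps the token offsets next to the rendered content and splices in the current visitor's token on every hit:

```python
cache = {}  # URL -> (html, [(start, end), ...]) where the spans mark token positions


def store(url, html, token_spans):
    """Cache the rendered page together with the offsets of its token(s)."""
    cache[url] = (html, sorted(token_spans))


def fetch(url, new_token):
    """Return the cached page with every recorded span replaced by new_token."""
    html, spans = cache[url]
    out, cursor = [], 0
    for start, end in spans:
        out.append(html[cursor:start])  # unchanged content before the token
        out.append(new_token)           # the visitor's own token
        cursor = end
    out.append(html[cursor:])
    return "".join(out)
```

The cached entry itself is never modified, so one stored page can serve any number of visitors, each with a different token.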

The github branch also implements a {% rewritable some_name %}
{% endrewritable %} tag, but as it stands it is not very usable. For example,
rewriting the login/logout part of the page would be much easier using
a real two-phase rendering implementation. The already mentioned
Jannis Leidel's django-phased seems to fit this task much better than
my hack.

As far as I can tell there isn't any large performance hit (actually,
using djangobench, I could not measure any difference). This might be
just a failure on my part, as that result is a bit surprising.

The biggest problem with the approach is that the csrf_token tag must
be rendered as part of the nodelist. If it isn't, the tracking of
start_pos, end_pos of the rewritable content will get out of sync. This
alone might be a show-stopper. I would not be surprised if there are
other non-solvable problems with the approach. All I know is that it
seems to work with include and block tags in simple templates.
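
To show the kind of drift meant here, a small sketch under my own assumptions: each node is modelled as a (text, spans) pair and join_nodes is a hypothetical helper, not anything from Django's template engine. As long as the join itself shifts the offsets they stay valid; anything that changes the string afterwards without going through the join leaves them stale:

```python
def join_nodes(rendered_nodes):
    """Join (text, spans) pairs, shifting each node's spans by its offset."""
    parts, spans, offset = [], [], 0
    for text, node_spans in rendered_nodes:
        # A span local to this node moves right by the length of all text before it.
        spans.extend((start + offset, end + offset) for start, end in node_spans)
        parts.append(text)
        offset += len(text)
    return "".join(parts), spans


nodes = [
    ("<p>Hello</p>", []),
    ('<input value="TOKEN" />', [(14, 19)]),  # offsets local to this node
]
html, spans = join_nodes(nodes)
start, end = spans[0]
assert html[start:end] == "TOKEN"  # offsets are correct after the join

# Wrapping the output outside of the join does not update the spans:
wrapped = "<div>" + html + "</div>"
assert wrapped[start:end] != "TOKEN"  # the old offsets now point at the wrong text
```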

The current implementation is just a quick hack. As said above, it is
possible, if not likely, that this approach is a dead-end.

- Anssi