Last-Modified headers


Joe Devon

Nov 6, 2010, 2:13:46 AM
to web-per...@googlegroups.com
I thought I'd start a discussion on the If-Modified-Since headers.

It's a no-brainer to cache static content permanently. Then if
CSS/JS or images change, version the filenames so you're sure to break the cache.

But what about dynamic content?

Here's a discussion:
http://www.webscalingblog.com/performance/caching-http-headers-last-modified-and-etag.html

And a code snippet for a Zend Framework implementation:
http://www.zfsnippets.com/snippets/view/id/67

Curious what your thoughts are?

By hashing the body content, you're sure to know whether you have the same
content... but what a shame that you still have to generate the page and
don't save any server resources; you do save bandwidth, though.

OTOH, the concept and the reality of caching are totally different. How
many times have you had a bug because you have so many caching
layers? ESI, cached database calls, cached images, etc. It can get
pretty tricky, and it's easy to overlook one.

Personally, I'd shy away from sending those headers on dynamic
content, but I did wonder if any of you have played with it in
practice.

Julien Wajsberg

Nov 6, 2010, 3:50:03 AM
to web-per...@googlegroups.com
On 6 November 2010 07:13, Joe Devon <joed...@gmail.com> wrote:
>
> Curious what your thoughts are?

I'd say:
- do not use ETag
- but you can easily compute Last-Modified without doing a lot of work.
Of course, on a dynamic site it can depend on anything; it depends on
your functionality. On a blog it could be the datetime of the last
post (or of the last comment on a post page). In general it can be the
maximum datetime of all dynamically generated parts of your page (see
the sketch below).
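
Something along these lines, as a rough Python sketch (the helper name is
mine, and it's framework-agnostic, unlike the Zend snippet linked earlier):
the Last-Modified value is simply the newest of the parts' timestamps, and
the request is answered with 304 when the client's copy is at least that
recent, so you skip rendering entirely.

    from email.utils import formatdate, parsedate_to_datetime

    def handle_conditional_get(if_modified_since_header, part_datetimes):
        """Sketch only: part_datetimes are timezone-aware UTC datetimes of the
        dynamically generated parts of the page (last post, last comment, ...)."""
        last_modified = max(part_datetimes)
        lm_header = formatdate(last_modified.timestamp(), usegmt=True)

        if if_modified_since_header:
            try:
                ims = parsedate_to_datetime(if_modified_since_header)
            except (TypeError, ValueError):
                ims = None
            # HTTP dates have one-second resolution, so compare at that granularity.
            if ims is not None and ims >= last_modified.replace(microsecond=0):
                # Client copy is current: skip rendering, just send the header back.
                return 304, {"Last-Modified": lm_header}

        # Otherwise render the page as usual and attach the header.
        return 200, {"Last-Modified": lm_header}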

--
Julien

Ryan Witt

Nov 6, 2010, 1:53:23 PM
to web-per...@googlegroups.com

Thanks for the snippet! Good discussion topic.

> Personally, I'd shy away from sending those headers on dynamic
> content, but I did wonder if any of you have played with it in
> practice.


Django has nice built-in middleware that does ETags/Last-Modified for dynamic content. I've turned it on for some dynamic sites and it's a big win for pages where the content rarely changes. It's great for our CMS.

Code snippets backatcha:

http://docs.djangoproject.com/en/dev/topics/cache/#other-optimizations
http://code.djangoproject.com/browser/django/trunk/django/middleware/common.py#L111
http://code.djangoproject.com/browser/django/trunk/django/middleware/http.py
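
For completeness, enabling it is just a settings change; something like the
following (from memory, so double-check it against the docs linked above for
your Django version):

    # settings.py (Django ~1.2 era)
    USE_ETAGS = True  # CommonMiddleware then adds an ETag (MD5 of the response content)

    MIDDLEWARE_CLASSES = (
        'django.middleware.http.ConditionalGetMiddleware',  # answers 304 for If-None-Match / If-Modified-Since
        'django.middleware.common.CommonMiddleware',
        # ... (ordering per the middleware docs)
    )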

You got me wondering exactly how much of a CPU hit I'm taking for this feature, so I wrote a little test program to see what the cost of ETag hashing vs. gzip is (at least the way Django does them): https://gist.github.com/665570

Using the same methods as Django's ETag and gzip features, calculating the MD5 for ETags turned out to be an *order of magnitude faster* than gzip. Given this, I'd have to recommend using it!
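
If you'd rather not click through, a stripped-down version of the idea looks
roughly like this (not the gist itself; plain hashlib/zlib calls stand in for
what the middleware does):

    import hashlib
    import time
    import zlib

    body = b"<p>lorem ipsum dolor sit amet</p>\n" * 2000   # ~66 KB of HTML-ish bytes
    N = 1000

    start = time.perf_counter()
    for _ in range(N):
        hashlib.md5(body).hexdigest()      # roughly what an ETag costs
    md5_time = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(N):
        zlib.compress(body, 6)             # roughly what gzip at the default level costs
    gzip_time = time.perf_counter() - start

    print(f"md5:  {md5_time:.3f}s for {N} bodies")
    print(f"gzip: {gzip_time:.3f}s for {N} bodies")

Exact numbers will obviously vary with hardware and body size, but the hashing
loop consistently comes in far cheaper than the compression loop.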

I think you could get fancy with Last-Modified headers as well if you stored content freshness dates in your database, but given how cheap ETags are, it might not be worth the complexity.

Do you guys agree with my methodology?

--
Ryan Witt
http://whichloadsfaster.com/


Sergey Chernyshev

Nov 6, 2010, 9:50:17 PM
to web-per...@googlegroups.com
I think the problem is that there is no completely static or completely dynamic content in the world.
We've kind of simplified the whole model by saying that images, scripts and so on are static while HTML pages, XML and JSON are dynamic, but in reality the lifetime of the data is always somewhere in between.

Images, CSS and JS change from time to time (even company logos), and almost all results of scripting represented as HTML, XML and JSON stay the same for some period of time, even if it's as small as a few minutes.

Moreover, for some systems it is completely appropriate to rethink the time parameters for dynamic results: Google's results don't have to change until the page rankings for particular keywords change, and even if both change constantly, it doesn't mean that a particular user cares to receive newly re-ordered results as fast as they come through the ranking pipeline.

It might be different for stock-trading or security systems or something like that, but for the majority of applications on the web it is very likely that caching (at all levels) can be done without harm to the product, while at the same time introducing significant improvements in user experience, which builds user loyalty and allows for much better scalability.

All that being said, it requires some additional information at the storage layer and "framework"-type code to enable data-lifetime tracking for all request components (pages often contain multiple objects, XML and JSON contain trees of multiple objects, and so on).

The good news is that truly uncontrollable dynamic stuff, like ads for example, is implemented using JavaScript, and the parts that aren't can be.

On another note, the real problem of setting and checking Last-Modified and If-Modified-Since is only with HTML pages, because they are the only ones that cannot have different URLs; for the rest of the content, the calling code can create version-based URLs and use infinite expiration.

The simplest case is CSS/JS/images, where you can probably just use file version numbers, e.g. from Subversion using my SVN Assets library or a similar project.

For dynamic content, the system still needs to keep version numbers and use them in GET calls from HTML pages instead of handling the resource's Last-Modified/If-Modified-Since headers.

For example, in ShowSlow I track the last update date for the URL and use Last-Modified/If-Modified-Since headers for the HTML page itself, but within the page I use update dates for specific components to create the request URLs when loading data into the graph and data tables, which allows me to put infinite expires on them as well.
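
Roughly like this (illustrative names, not actual ShowSlow code): the page
embeds each component's last-update timestamp in the URL it uses to fetch the
data, so those responses can carry far-future expiration and still never go
stale.

    from urllib.parse import urlencode

    def data_url(base, component, last_updated_ts):
        # The timestamp acts as a version: when the component changes, the URL
        # changes, so the previously cached copy is simply never requested again.
        return f"{base}?{urlencode({'component': component, 'v': int(last_updated_ts)})}"

    # In the HTML page (which itself uses Last-Modified/If-Modified-Since):
    graph_src = data_url("/data/measurements", "graph", 1289100000)
    # -> "/data/measurements?component=graph&v=1289100000"
    # The handler for that URL can then safely respond with
    #   Cache-Control: public, max-age=31536000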

It's probably worth describing this in a blog post ;)

Sergey Chernyshev

Nov 6, 2010, 9:55:03 PM
to web-per...@googlegroups.com
Speaking of CPU usage: much more CPU is used on the servers and in the browsers if data is transmitted over the network unnecessarily.

I believe every effort should be made to cache as aggressively as possible, to match or even "out-cache" the real data lifetime.

The same logic goes for gzip: network time is much more precious than CPU time, so gzip is worth having. Moreover, it's an easy layer to abstract compared to many other layers of the web pie ;)

         Sergey

Jonathan Klein

Nov 7, 2010, 9:48:07 AM
to web-per...@googlegroups.com
Caching dynamic content is pretty challenging. We built an in-house solution similar to Varnish, which basically does HTML output caching. We have a ton of really complex rules around how long pages get cached for, what events trigger invalidation, how the purging happens, etc. Granted, this is slightly different from setting cache headers, but it's still on the same topic of caching dynamic pages.

One of the big issues is the potential to cache bugs.  What if someone pushes out a bad CSS file, or a bad sprite?  Normally you would just fix it, version the filename, and you would be all set.  If you are caching your HTML you have to worry about how many pages have gotten cached with the bad filename.  The only way to be sure that you've fixed it is to purge the entire cache.

Obviously caching can have huge benefits; like Sergey mentioned, you just have to be careful about which layers you cache at and how long you set your TTLs. For high-traffic sites, or even just high-traffic pages, you can get big benefits even with a ~1 hour cache life. Most things can be cached for an hour without too much concern for stale content, and if you are getting hundreds or thousands of hits an hour, that amounts to a big processing/bandwidth offload.
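
As a toy illustration of the idea (nothing like our real system, which has
shared storage and much richer invalidation rules), an output cache with a
one-hour TTL boils down to something like:

    import time

    class OutputCache:
        """Toy HTML output cache keyed by URL, with a fixed TTL."""

        def __init__(self, ttl_seconds=3600):
            self.ttl = ttl_seconds
            self._store = {}                      # url -> (expires_at, html)

        def get(self, url):
            entry = self._store.get(url)
            if entry and entry[0] > time.time():
                return entry[1]                   # cache hit: skip page generation
            return None

        def put(self, url, html):
            self._store[url] = (time.time() + self.ttl, html)

        def purge(self, url=None):
            # Purging everything is the blunt but safe fix for the
            # "bad CSS filename baked into cached pages" problem above.
            if url is None:
                self._store.clear()
            else:
                self._store.pop(url, None)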

It's hard to give a straight answer; it really depends on your platform and the topology of your site, but in general I think there are benefits to caching pretty much all content for some period of time (unless you are dealing with user-specific data, of course).

Hope this helps.

-Jonathan

Joe Devon

Nov 7, 2010, 2:27:54 PM
to web-per...@googlegroups.com
Really enjoying this discussion.

Thanks for the code snippet Ryan.

Sergey, good point on CPU. I didn't think about that...

Jonathan, so true re: a bad CSS or JS file being cached...
--
I tend to be pretty cautious with stuff that brings a modest
performance improvement at the potential cost of major bugs.

So many weird things happen. I once had to dynamically generate zipped
files containing screensavers and the like for Mac and Windows, cross-browser,
and then have them cached by a third-party ESI provider. Through
some arcane caching configuration, it broke in production for some
files and not others. It was so hard to really nail it down.

Sergey Chernyshev

Nov 7, 2010, 8:25:09 PM
to web-per...@googlegroups.com
I think one other problem is that the word "cache" is used all over the
place without distinguishing the various layers where different caches
can be implemented.

This causes people to lose track of the simple logic that goes into
caching when they try to go through all the rules across many layers
instead of looking at the layers separately. In reality they are quite
independent and operate on different objects, which actually makes them
relatively easy to plan and manage.

Sergey


gekkstah

Nov 16, 2010, 4:27:25 PM
to Web Performance
The main reason against "caching dynamic content" for me in my job is
"tracking": you may end up caching the "page impression" as well, and
then your business department gets annoyed.

I haven't invested much time in this topic.
Do you think it could be worth it?
What do you think about tracking information vs. caching?

Greetings from Germany
Björn


Sergey Chernyshev

Nov 16, 2010, 5:54:52 PM
to web-per...@googlegroups.com
In reality, people don't do tracking using code within the pages themselves; usually it's done with external "pixels". You can very well cache the pages and the content within them, but keep those pixels uncacheable (all external tools do that anyway).

So I don't really see a conflict between the two.
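
For illustration, a beacon endpoint is basically just this (a sketch, not any
particular tool's code):

    import base64

    # A commonly used 1x1 transparent GIF payload.
    PIXEL = base64.b64decode(b"R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7")

    def pixel_response():
        # The pixel itself is never cached, so every page view still reaches the
        # tracking endpoint even when the surrounding HTML page is cached.
        headers = {
            "Content-Type": "image/gif",
            "Cache-Control": "no-store, no-cache, must-revalidate, max-age=0",
            "Pragma": "no-cache",
            "Expires": "0",
        }
        return 200, headers, PIXEL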

         Sergey

Ryan Witt

Nov 16, 2010, 6:03:37 PM
to web-per...@googlegroups.com
It's certainly true that RUM (real user monitoring) solutions that either process logs or capture traffic could be affected by caching.

Some people use RUM extensively, though the trend seems to be toward scripts, with pixels as a noscript fallback.

--Ryan

Mike Brittain

Nov 17, 2010, 3:45:40 PM
to web-per...@googlegroups.com
JavaScript tracking pixels, or image beacons embedded in your generated HTML, are typically considered the best way to track page views, because aggressive intermediate proxies might still cache a page that you consider short-lived or completely dynamic. That said, I've been at plenty of companies that still process raw logs for page views, and that completely defeats the opportunity for page caching.

The arguments to be made for page caching, and moving to a pixel-based measurement, include:

- dynamic pages take more processing to build = more servers
- building pages for every request = slower responses
- serving static pixels is faaast and requires few servers, relative to what your web/app servers are doing

If you want to start caching dynamic pages, the case can be made in terms of the number of servers you need to buy and operate (along with the engineering cost of moving to pixel-based tracking).

Mike


Sergey Chernyshev

Nov 17, 2010, 8:07:45 PM
to web-per...@googlegroups.com
Definitely: not caching at all layers just for the sake of monitoring is
not a good idea.

In any case, even if you don't cache page requests, all static files
still need to be aggressively cached, with URL versioning for
invalidation (not in the query string but as part of the file/folder
name, to avoid stupid caching heuristics).
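
A rough sketch of what I mean (illustrative code; my SVN Assets library uses
Subversion revision numbers rather than a content hash):

    import hashlib
    import os

    def versioned_asset_path(static_root, rel_path):
        # Put the version in the path, not the query string, because some
        # caches refuse to store URLs that contain a query string.
        with open(os.path.join(static_root, rel_path), "rb") as f:
            version = hashlib.md5(f.read()).hexdigest()[:8]
        name, ext = os.path.splitext(rel_path)
        return f"/static/{name}.{version}{ext}"

    # versioned_asset_path("/var/www/static", "css/site.css")
    #   -> e.g. "/static/css/site.a1b2c3d4.css"
    # The web server rewrites that back to the real file and serves it with a
    # far-future Expires header; changing the file changes the URL.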

--
