Varnish and scaling cache invalidation


Kelly Sommers

Jul 17, 2012, 11:08:29 PM
to distsys...@googlegroups.com
I saw a tweet from Camille Fournier this evening mentioning Varnish, which is a caching HTTP reverse proxy. I should mention I don't have much experience with caching yet.

I saw a feature that lets you invalidate cached content based on a URL, and this seems to fit really well with RESTful services (or perhaps any HTTP service), but I wonder what implications it has for scalability. I thought the behaviour of the web was to be more TTL-based because that is much easier to scale?

I'm thinking of a write-heavy system where updates are happening at a high rate. It seems to me that a push-based cache invalidation mechanism would become a new scalability pain point? The problem is the backend may not be aware of what the cache has cached, so it would constantly be communicating with the cache service on every single update. Am I correct in this assumption?

Kelly Sommers

Jul 18, 2012, 4:38:51 PM
to distsys...@googlegroups.com
Another thought on scaling cache invalidation: wouldn't a write-heavy system that invalidates the cache frequently, at a high rate, cause contention issues within the cache service itself while it is simultaneously trying to respond to requests?

John Meagher

Jul 18, 2012, 8:08:04 PM
to distsys...@googlegroups.com
TTL-based caching is aimed at intermediate servers that handle caching for you, rather than a cache you run yourself. I haven't used Varnish, but other tools like it apply TTLs via the Expires headers. Supporting invalidation lets you force your local cache to grab a fresh version. Invalidation is a very cheap operation to call, since it just sets a flag on the cached URL telling the cache that the next time someone requests that URL, it needs to fetch a fresh copy.
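
Roughly, the two styles look like this. This is only a sketch: the port, the URL scheme, and the assumption that Varnish's VCL has been set up to accept PURGE requests from this host are all illustrative.

import requests

VARNISH = "http://localhost:6081"  # assumed proxy address

def render_article(article_id):
    """TTL style: the origin stamps an expiry and lets caches age it out."""
    body = "<h1>Article %s</h1>" % article_id
    headers = {"Cache-Control": "public, max-age=300"}  # fresh for 5 minutes
    return body, headers

def invalidate_article(article_id):
    """Invalidation style: mark the cached URL stale right now.
    The cache just flags the object; the fresh copy is fetched
    lazily on the next request for this URL."""
    url = "%s/articles/%s" % (VARNISH, article_id)
    resp = requests.request("PURGE", url, timeout=2)
    return resp.status_code  # typically 200 if purged, 404 if it wasn't cached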

In write-heavy applications, the need for the cache itself should be questioned. If every request to the cache is a miss, then the cache is only adding processing time and complexity. Some benchmarking on your specific application is needed to check whether the cache improves things or not.
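
One rough way to sanity-check that is to replay a sample of real request URLs through the proxy and estimate the hit ratio. A sketch: it assumes the proxy sets an Age header (Varnish does by default) and treats Age > 0 as a hit, so hits within the first second are undercounted.

import requests

def estimate_hit_ratio(urls):
    hits = 0
    for url in urls:
        resp = requests.get(url, timeout=5)
        if int(resp.headers.get("Age", "0")) > 0:
            hits += 1
    return float(hits) / len(urls) if urls else 0.0

# Replay a sample of real traffic and see whether the cache earns its keep.
sample = ["http://localhost:6081/articles/%d" % i for i in range(100)]
print("approximate hit ratio: %.0f%%" % (100 * estimate_hit_ratio(sample)))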



Kelly Sommers

Jul 19, 2012, 12:57:01 PM
to distsys...@googlegroups.com
John,

Your response made me think that there's probably a difference between how you handle cache invalidation for updates and how you handle it for inserts.

Is it fair to say that if you are receiving a heavy dose of inserts, there is no need to spam the cache service with invalidation requests, because a read will hit the server for the data anyway since it's new? The reason I say this is that if you have 200M inserts/day, you don't really want or need to flood the cache service, right?

I imagine your concern about cache hits would apply more to writes that are updates rather than inserts?


John Meagher

Jul 19, 2012, 8:29:17 PM
to distsys...@googlegroups.com
Kelly,

I agree with everything there. When inserting, there is no need to talk to the caching layer since nothing should be cached yet. The cache won't be loaded until a read happens (well, there are some cases where you'd want to pre-load the cache, but those are rare). So insert-heavy apps won't need to invalidate the cache. Update-heavy apps and high-cardinality, low-repeat read apps are where caching can wreck performance.
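
The write path would look something like this. Just a sketch: the dict stands in for a real datastore, and the proxy address and URL scheme are made up.

import requests

CACHE = "http://localhost:6081"
db = {}  # stand-in for the real datastore

def save_article(article_id, body):
    is_new = article_id not in db
    db[article_id] = body
    if not is_new:
        # Update: a stale copy may already be cached, so invalidate it.
        url = "%s/articles/%s" % (CACHE, article_id)
        requests.request("PURGE", url, timeout=2)
    # Insert: skip the purge; the first read misses and fills the cache.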

I get a lot of good architecture ideas from http://highscalability.com/.  They regularly post "How does site XYZ work?" articles that cover the whole stack.  




Hoop

Aug 15, 2012, 6:03:54 PM
to distsys...@googlegroups.com
Sorry for jumping on this thread so long after it started. I don't have experience with Varnish, but I've used Akamai and played around with Azure's Caching. One popular thing to do is pre-populating cache nodes when you know you have something that will result in high demand. That's the one case where I might want to invalidate on insert, especially where you have lots of content but some content is hotter than others. In that case, you either need varying TTLs based on content or invalidation.
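
Warming is usually just requesting the content through the proxy before real traffic arrives. A sketch, with a made-up proxy address and paths:

import requests

PROXY = "http://localhost:6081"

def warm(paths):
    for path in paths:
        # The GET itself pulls the object from the origin into the cache.
        requests.get(PROXY + path, timeout=10)

warm(["/breaking-news", "/breaking-news/photos"])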
 
I'm always nervous about large TTLs.
 
Mark Nottingham (chair of the IETF HTTPbis working group) has a caching tutorial covering this in some detail: http://www.mnot.net/cache_docs/
 
A few years ago he posted about some proposed caching extensions that would allow servers to define behavior when cached data goes stale, which I found interesting: http://www.mnot.net/blog/2007/12/12/stale
 
Combining a short TTL with support for 304 status codes (Not Modified responses) and the proposed stale-while-revalidate would allow sites and services to scale well in the face of frequent updates to a single entity. Unfortunately, I don't know whether the proposals went anywhere.
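
On the origin side, that combination looks roughly like this. A sketch only: the ETag scheme and handler shape are illustrative.

import hashlib

def respond(current_body, if_none_match=None):
    etag = '"%s"' % hashlib.sha1(current_body.encode("utf-8")).hexdigest()[:16]
    headers = {
        "ETag": etag,
        # Fresh for 5s; for the next 60s a cache may serve the stale copy
        # while it revalidates in the background (stale-while-revalidate).
        "Cache-Control": "max-age=5, stale-while-revalidate=60",
    }
    if if_none_match == etag:
        return 304, headers, ""  # no body resent; the cache reuses its copy
    return 200, headers, current_body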
 
-hoop