CacheManager

David Cramer

Jul 5, 2007, 2:14:22 AM

to Django developers

So I had a cool little idea: build a CacheManager that lets our
team collaborate on caching controls without duplicating data in the
cache.

So, here's a rough draft that I hammered out in 15 minutes.

http://www.davidcramer.net/code/50/django-cachemanager.html

Criticism welcomed!

Jeremy Dunck

Jul 5, 2007, 2:25:39 AM

to django-d...@googlegroups.com

On 7/5/07, David Cramer <dcr...@gmail.com> wrote:
> http://www.davidcramer.net/code/50/django-cachemanager.html
>
> Criticism welcomed!

I think the order of bits returned from _get_sql_clause is dependent
on how the queryset is built up, so that you'll cache equivalent
result sets repeatedly.

And there's the issue of cache invalidation.

But definitely useful.

David Cramer

Jul 5, 2007, 3:10:24 AM

to Django developers

Ya cache invalidation is something you'll always have a problem with.
The clean() method can be used (at the end of a queryset, or on the
Manager itself) to force the invalidation.

As for _get_sql_clause I guess I could make it just pull from filters/
extra args to build the key -- not as clean but it's the only other
way I can think of.

Adrian Holovaty

Jul 5, 2007, 2:22:12 PM

to django-d...@googlegroups.com

Oooh, this is something I've wanted to do for ages...See
http://code.djangoproject.com/ticket/5 , which was marked as wontfix
by Jacob (but I'd still like to see this feature).

Some feedback:

* I'd suggest making "key_prefix" and "expire" actual arguments to
CacheManager.__init__() rather than assuming they're in kwargs. That
way, you can set defaults in the normal way, and you get the normal
Python required-argument functionality (i.e., Python will complain if
you don't pass the argument).

* Same goes for CachedQuerySet.__init__().

* As Jeremy pointed out, there's no guaranteed order for the result of
_get_sql_clause(). You might want to order the contents of
_get_sql_clause() explicitly (alphabetically, or something), so you
get consistent cache keys. Coincidentally (or not), this is the
original non-starter that prevented me from working on this feature
myself a year or two ago. :-)

* This might be out of the scope of this code, but it might be worth
allowing the client code to set the expire time at runtime somehow,
not just when instantiating CacheManager.

* CachedQuerySet.clean() is missing a "self" in its function definition.
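
A minimal sketch of the argument and ordering suggestions above, not the code from the paste. The names make_cache_key and CacheManagerSketch and the expire=300 default are hypothetical, and the key is built from sorted filter arguments rather than from _get_sql_clause():

```python
import hashlib


def make_cache_key(key_prefix, filters, extra=None):
    """Build a cache key that ignores the order in which filters were added."""
    # Sort both groups of arguments so equivalent querysets hash identically.
    parts = sorted((name, repr(value)) for name, value in filters.items())
    parts += sorted(
        ("extra:" + name, repr(value)) for name, value in (extra or {}).items()
    )
    digest = hashlib.sha256(repr(parts).encode("utf-8")).hexdigest()
    return "%s:%s" % (key_prefix, digest)


class CacheManagerSketch:
    """Hypothetical stand-in: key_prefix is required, expire has a default."""

    def __init__(self, key_prefix, expire=300):
        self.key_prefix = key_prefix
        self.expire = expire

    def key_for(self, **filters):
        return make_cache_key(self.key_prefix, filters)
```

With explicit arguments, CacheManagerSketch() without a key_prefix raises TypeError, and two managers called with the same filters in a different order produce the same key.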

I'm looking forward to seeing how this matures -- thanks to you and
your team for writing it, David!

Adrian

--
Adrian Holovaty
holovaty.com | djangoproject.com

James Bennett

Jul 5, 2007, 2:31:51 PM

to django-d...@googlegroups.com

On 7/5/07, Adrian Holovaty <holo...@gmail.com> wrote:
> Oooh, this is something I've wanted to do for ages...See
> http://code.djangoproject.com/ticket/5 , which was marked as wontfix
> by Jacob (but I'd still like to see this feature).

I'd also be interested to get input from the SoC student who's working
on (somewhat) similar stuff; if the two projects can cross-pollinate
ideas with each other, it'd be nice not to have to worry about
multiple competing implementations when the time comes ;)

--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."

Honza Král

Jul 5, 2007, 2:41:38 PM

to django-d...@googlegroups.com

On 7/5/07, David Cramer <dcr...@gmail.com> wrote:
> Ya cache invalidation is something you'll always have a problem with.
> The clean() method can be used (at the end of a queryset, or on the
> Manager itself) to force the invalidation.

We have been working on this issue recently and came up with a
mechanism that solves this problem for us: when creating a cache entry, we
register a model, a test and a cache key with a CacheInvalidator object.
CacheInvalidator then listens for post_save signals and, for every
saved model, checks the registered tests. If a test passes, the cache key
associated with the test is deleted. So if you can construct a test
based on the QuerySet's filters etc., you will be able to invalidate
just the querysets actually affected by the change.
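
A rough sketch of the registry described above, under stated assumptions: it is not the code from our project, a plain dict stands in for the cache, on_post_save is a hypothetical method you would call from a post_save handler, and Article is a made-up stand-in for a model. Dropping the registration once its key is deleted is also an assumption.

```python
class CacheInvalidator:
    """Maps model classes to (test, cache_key) pairs and purges matching keys."""

    def __init__(self, cache):
        self.cache = cache      # dict-like store standing in for the real cache
        self.registry = {}      # model class -> list of (test, cache_key)

    def register(self, model, test, cache_key):
        self.registry.setdefault(model, []).append((test, cache_key))

    def on_post_save(self, instance):
        """Run the registered tests for the saved instance's model."""
        remaining = []
        for test, cache_key in self.registry.get(type(instance), []):
            if test(instance):
                # Drop only the cache entry that depends on this object.
                self.cache.pop(cache_key, None)
            else:
                remaining.append((test, cache_key))
        self.registry[type(instance)] = remaining


class Article:
    """Made-up stand-in for a model class."""

    def __init__(self, section):
        self.section = section


cache = {"news_list": ["a", "b"], "sport_list": ["c"]}
invalidator = CacheInvalidator(cache)
invalidator.register(Article, lambda a: a.section == "news", "news_list")
invalidator.register(Article, lambda a: a.section == "sport", "sport_list")
invalidator.on_post_save(Article("news"))   # only "news_list" is removed
```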

We haven't tested it for performance (we are building a high-volume
site) yet and we still haven't figured out how to deal with multiple
web servers connecting to one cache (our working version includes
propagating the post_save signal via some asynchronous communication
channel like Apache ActiveMQ, but we might end up with a separate
server just for the cache invalidation).

Is anybody interested in this?

--
Honza Král
E-Mail: Honza...@gmail.com
ICQ#: 107471613
Phone: +420 606 678585

David Cramer

Jul 5, 2007, 5:06:44 PM

to Django developers

Sorry, new link: http://dpaste.com/hold/13668/


Jeremy Dunck

Jul 5, 2007, 7:07:27 PM

to django-d...@googlegroups.com

On 7/5/07, Honza Král <honza...@gmail.com> wrote:
> We haven't tested it for performance (we are building a high-volume
> site) yet and we still haven't figured out how to deal with multiple
> web servers connecting to one cache (our working version includes
> propagating the post_save signal via some asynchronous communication
> channel like apache's ActiveMQ, but we might end up with a separate
> server just for the cache invalidation).

FWIW, on the memcached list right now, they're writing agenda for an
upcoming hackathon.

Regex-based key purging is on the drawing board. That sounds nuts,
but Brad is a genius. The proposed approach is to have any mass
delete not actually delete immediately, but increment a generation
counter. Any get on a key with an older generation would then be
tested against newer-generation delete patterns, and matching keys would be
expired at that time.
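
To make the idea concrete, here is a toy in-process sketch of lazy pattern purging as I understand the proposal. It is not memcached code, and LazyPurgeCache and its methods are hypothetical names:

```python
import re


class LazyPurgeCache:
    """Toy model of generation-based purging; deletes happen on get()."""

    def __init__(self):
        self.generation = 0
        self.data = {}      # key -> (generation when stored, value)
        self.purges = []    # list of (generation, compiled pattern)

    def set(self, key, value):
        self.data[key] = (self.generation, value)

    def purge(self, pattern):
        # Nothing is deleted here: bump the counter and remember the pattern.
        self.generation += 1
        self.purges.append((self.generation, re.compile(pattern)))

    def get(self, key, default=None):
        if key not in self.data:
            return default
        stored_generation, value = self.data[key]
        for purge_generation, pattern in self.purges:
            # Only purges issued after the value was stored can expire it.
            if purge_generation > stored_generation and pattern.match(key):
                del self.data[key]
                return default
        return value
```

A mass delete is then O(1), and the cost of matching is paid only by later reads of old keys.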

You might go see if you could use that and contribute to it if so.

> Is anybody interested in this?

Yes indeed. I'm going to be building something fairly high-volume and
low-latency, and am trying to come up with an efficient way of mass
invalidation, too.

David Cramer

Jul 8, 2007, 7:11:04 PM

to Django developers

Here's about as final as we're going to get it for now: http://dpaste.com/hold/13884/

Nebojša Đorđević

Jul 11, 2007, 8:25:17 AM

to django-d...@googlegroups.com

On 09.07.2007., at 01:11, David Cramer wrote:

> Here's about as final as we're going to get it for now:
> http://dpaste.com/hold/13884/

Great work :)

About cache invalidation: you could use something like this
http://dpaste.com/hold/14104/

and just add:

class Foo(models.Model):
    objects = CacheManager()

track_cache(Foo)

Full code is at http://dpaste.com/hold/14105/

--
Nebojša Đorđević - nesh, ICQ#43799892, http://www.linkedin.com/in/neshdj
Studio Quattro - Niš - Serbia
http://studioquattro.biz/ | http://code.google.com/p/django-utils/
Registered Linux User 282159 [http://counter.li.org]

Honza Král

Jul 11, 2007, 8:49:07 AM

to django-d...@googlegroups.com

This is a rough version of our cache invalidator, including a sample
decorator and a cached template tag:

http://dpaste.com/14107/

If you have multiple web servers talking to one cache (which is our
case), you can use a message-passing system (like ActiveMQ) to
propagate the signals to the other web servers.

Nebojša Đorđević

Jul 11, 2007, 11:31:25 AM

to django-d...@googlegroups.com

On 11.07.2007., at 14:25, Nebojša Đorđević wrote:

> Full code is at http://dpaste.com/hold/14105/

Here is my final take on this problem: http://dpaste.com/hold/14122/

I changed the caching so that all of the queries related to one model
are stored inside a single cache key. This way I can invalidate all of
them when a change is detected.

The downside is that if any of the model's rows changes, all data is
invalidated, so the next queries will hit the DB again. OTOH, this way I can be
sure that I'll always get the latest data from the DB.
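
Roughly, the idea looks like the sketch below. This is a simplified illustration and not the pasted code: ModelQueryCache is a hypothetical name, a plain dict stands in for the cache back-end, and "myapp_foo" is a made-up model key.

```python
class ModelQueryCache:
    """Keeps every cached query result for one model under a single key."""

    def __init__(self, cache, model_key):
        self.cache = cache          # dict-like stand-in for the cache back-end
        self.model_key = model_key  # one key per model, e.g. "myapp_foo"

    def get(self, query_key):
        return self.cache.get(self.model_key, {}).get(query_key)

    def set(self, query_key, rows):
        # Copy the bucket, add the result, and write the whole bucket back.
        bucket = dict(self.cache.get(self.model_key, {}))
        bucket[query_key] = rows
        self.cache[self.model_key] = bucket

    def clean(self):
        # One delete drops every cached query for this model.
        self.cache.pop(self.model_key, None)
```

Calling clean() from a save/delete hook is what makes every later query for that model go back to the DB.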

I tried to make a new Model subclass to avoid the need to add
track_changes, but then I started to get some weird errors about
missing table names, so I returned to this approach. (I know, I know,
model subclassing doesn't work, yet ;) )

I tried this code on one of my projects and it's working nicely
(no real testing, though).

David Cramer

Jul 11, 2007, 1:27:43 PM

to Django developers

That's an interesting solution. My main reason for not doing that is
because I've been told the dispatcher sucks (it's slow) and we're
going for speed.

I don't see the reasoning for adding QUERY_ to the cache_key. By
default the cache_key is your db_table. So if your model is
myapp.HelloWorld it will most likely be myapp_helloworld. But with
memcached and most caching solutions at the moment you can't do simple
invalidation with the engine. However, based on what they were talking
about with memcached's idea, you could set a generation counter in
memory to force a refresh of the entire cache, or parts of the cache,
or even handle it like the memcached proposal would.
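
Something like this sketch is what I mean by a generation counter. It is only an illustration: GenerationCache is a hypothetical name, a dict stands in for memcached, and keeping the counter in the cache itself under a "gen:" key is an assumption.

```python
class GenerationCache:
    """Puts a per-namespace generation number into every cache key."""

    def __init__(self, cache):
        self.cache = cache  # dict-like stand-in for memcached

    def _generation(self, namespace):
        return self.cache.get("gen:" + namespace, 0)

    def _key(self, namespace, name):
        return "%s:%d:%s" % (namespace, self._generation(namespace), name)

    def get(self, namespace, name):
        return self.cache.get(self._key(namespace, name))

    def set(self, namespace, name, value):
        self.cache[self._key(namespace, name)] = value

    def invalidate(self, namespace):
        # Old entries are never deleted; they simply become unreachable
        # and are left for the cache to evict on its own.
        self.cache["gen:" + namespace] = self._generation(namespace) + 1
```

Using the db_table (myapp_helloworld) as the namespace would refresh everything for one model with a single counter bump.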

Honza: do you not use a memcached pool? That's how we handle our
caching which would in turn keep the cache valid on all servers when
it's changed on one. Maybe your solution could go beyond a single pool
though, which would be interesting.

Honza Král

Jul 11, 2007, 3:49:23 PM

to django-d...@googlegroups.com

On 7/11/07, David Cramer <dcr...@gmail.com> wrote:
>
> That's an interesting solution. My main reason for not doing that is
> because I've been told the dispatcher sucks (it's slow) and we're
> going for speed.
>
> I don't see the reasoning for adding QUERY_ to the cache_key. By
> default the cache_key is your db_table. So if your model is
> myapp.HelloWorld it will most likely be myapp_helloworld. But with
> memcached and most caching solutions at the moment you can't do simple
> invalidation with the engine. However, based on what they were talking
> about with memcached's idea, you could set a generation counter in
> memory to force a refresh of the entire cache, or parts of the cache,
> or even handle it like the memcached proposal would.
>
> Honza: do you not use a memcached pool? That's how we handle our
> caching which would in turn keep the cache valid on all servers when
> it's changed on one. Maybe your solution could go beyond a single pool
> though, which would be interesting.

To be honest, we are not in production yet... But the problem is not with
memcached, it's with Django: if you change an object, how do you know
which cache_keys to delete? Sure, you could probably delete all the
caches for the given model or something similar, but our solution
allows you to delete only the caches that actually depended on
that given object.
You could drop the cache for the day's listing, but only if you have
cached the listing before.

That's why we need some messaging system to spread the post_save
signal across the web servers (or one dedicated cache management
server, which would also host the CacheDeleter instance). Plus if we
make the message dispatching asynchronous, we won't have to worry
about the signals slowing our requests that much - it will simply be
done in the background (or even on the dedicated server). Also the
signal is only for post_save and we, being a typical CMS site, don't
have that many writes when compared to reads.

Nebojša Đorđević

Jul 11, 2007, 6:41:41 PM

to django-d...@googlegroups.com

On 11.07.2007., at 19:27, David Cramer wrote:

> That's an interesting solution. My main reason for not doing that is
> because I've been told the dispatcher sucks (it's slow) and we're
> going for speed.

Using CachedModel as a base class would be ideal, but ... this will
not work until model subclassing is fixed. And I really hate to write
save/delete methods just to call the managers' clean() :)

But if speed is critical, one can always choose not to use the
track_cache helper and instead add the appropriate clean() calls to the
model's save/delete methods, avoiding the dispatcher overhead.

> I don't see the reasoning for adding QUERY_ to the cache_key. By
> default the cache_key is your db_table. So if your model is
> myapp.HelloWorld it will most likely be myapp_helloworld.

QUERY_ is removed in the last version: http://dpaste.com/hold/14122/

Now there is a CQS_ prefix added to the cache keys just to (a little)
increase key uniqueness.

> But with
> memcached and most caching solutions at the moment you can't do simple
> invalidation with the engine. However, based on what they were talking
> about with memcached's idea, you could set a generation counter in
> memory to force a refresh of the entire cache, or parts of the cache,
> or even handle it like the memcached proposal would.

(if I understood this correctly)

With the current cache back-ends there is no way to do mass key
deletion based on some criteria. It would be great if I could do
something like this:

[cache.delete(c) for c in cache.keys() if c.startswith('<key_prefix>')]

There is also no way to know which keys to delete when clean() is called
(well, I could make *another* global registry and some fancy way to
register cache keys there, but it would still be per thread/process only).

Because of that I chose to keep all of the related caches under one
cache key (CQS_<app_name>_<model_name>) so I can delete (and
invalidate) all of them when data is changed. The downside of this
approach is that if you have some big rows and they are cached, it can
suck up a lot of memory due to duplicated data. Maybe I can just
store key names there instead of the full data -- I'll try this in
the morning.
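
An untested sketch of that "key names only" variant, under assumptions: KeyIndexCache is a hypothetical name, a dict stands in for the cache back-end, and the index key follows the CQS_<app_name>_<model_name> pattern.

```python
class KeyIndexCache:
    """Stores each result under its own key plus an index of key names."""

    def __init__(self, cache, index_key):
        self.cache = cache          # dict-like stand-in for the cache back-end
        self.index_key = index_key  # e.g. "CQS_myapp_foo"

    def get(self, query_key):
        return self.cache.get(query_key)

    def set(self, query_key, rows):
        self.cache[query_key] = rows
        # The per-model entry holds only the names of the cached queries.
        names = set(self.cache.get(self.index_key, ()))
        names.add(query_key)
        self.cache[self.index_key] = names

    def clean(self):
        # Delete every key listed in the index, then the index itself.
        for name in self.cache.pop(self.index_key, ()):
            self.cache.pop(name, None)
```

The index stays small because the rows live under their own keys, at the price of one delete per cached query in clean().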

The locmem back-end (implemented as thread-local storage, IIRC) is not
safe to use with this. Well, nothing bad will happen, but there is a
possibility of getting stale data from the cache, because some other
process can update data without invalidating the cache in the current
process.

Other back-ends are safe from this IMHO.

To be on the safe side one can make models like this:

class Foo(models.Model):
    objects = models.Manager() # AKA _default_manager
    cached = CachedManager()
    ...

track_cache(Foo) # lazy and slower way of doing cache invalidation ;)