GSoC 2007 - Object-Level Caching


Paul Collier

May 29, 2007, 3:37:50 AM
to Django developers
Hello all!

Continuing with the string of posts regarding this year's GSoC, I'm
pleased to be working on new caching functionality for Django's ORM
under the mentorship of Gary Wilson[1]! Big thanks to him and all the
people from the community who made this possible. Now I've just got to
hold up my end of the bargain ;)

An updated version of my original proposal is available on Django's
wiki[2] and the project repository is available from Google Project
Hosting[3].

This project aims to condense common caching idioms into simple
QuerySet methods that deal with cached data intelligently, refill
expired cache, track instance modification/deletion, descend through
relationships, and so on. I've tried to make it accommodate the most
common use cases, be it dumping front-page news articles to cache or
keeping user profile data on hand when rendering a more dynamic site,
but I'd definitely love to hear about how cached data is really used
out in the field.

Currently, three QuerySet methods, .cache(), .cache_related(),
and .cache_set() handle individual, relation, and aggregate data
caching, respectively. Each can take arguments to specify options like
expiry time. Signals are registered behind the scenes to sync the
cache, and there's also a proposal for transactional caching. Generic
relations will also benefit greatly from .cache_related().
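
Roughly, usage might look something like this; the method names are from the
proposal, but the models and the exact argument names below are just
placeholders, nothing final:

# Cache each matching Article individually, for ten minutes.
articles = Article.objects.filter(published=True).cache(timeout=600)

# Also keep the related Author of each article in the cache.
articles = Article.objects.all().cache_related('author', timeout=600)

# Cache the evaluated result set as a single unit under one key.
front_page = Article.objects.order_by('-pub_date').cache_set('front_page')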

More detail is available on the wiki page, but lots of the little
issues that surround the implementation are still bouncing around in
my head; the page is going to be expanding and mutating throughout the
next few months, I'm sure. The timeline is a little underdeveloped,
for one. Comments and criticisms are always welcome!

The project itself is currently a single module which has a class
derived from QuerySet with all of the new functionality, and a little
hack to inject the .cache(), etc. methods right back into QuerySet. It
also ships with a project in which I'll be writing proper tests. Some
sort of miniature cache-testing framework may be in order too.
Eventually, perhaps it'll become fit for inclusion in Django
proper--but either way it doesn't require a huge amount of core
modification.
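
In spirit, the injection hack boils down to something like the following,
simplified here to a single aggregate-caching method (the real module does a
fair bit more bookkeeping):

from django.core.cache import cache
from django.db.models.query import QuerySet

def cache_set(self, key, timeout=300):
    # Evaluate the queryset once and store the whole result list under one key.
    results = cache.get(key)
    if results is None:
        results = list(self)
        cache.set(key, results, timeout)
    return results

# The "little hack": graft the new method straight onto the stock QuerySet,
# so every manager picks it up without any changes to core.
QuerySet.cache_set = cache_set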

I'm also curious about the progress of QuerySet refactoring, which
would understandably have a huge impact on the code I'm writing.
Either way, this new code applies directly to a Django project I've
been working on for the last while, so I have the motivation to
maintain this project wherever it may go!

Thanks,
Paul Collier

[1] http://gdub.wordpress.com/
[2] http://code.djangoproject.com/wiki/ObjectLevelCaching
[3] http://code.google.com/p/django-object-level-caching/

Malcolm Tredinnick

May 29, 2007, 5:38:10 AM
to django-d...@googlegroups.com
On Tue, 2007-05-29 at 07:37 +0000, Paul Collier wrote:
> Hello all!
>
> Continuing with the string of posts regarding this year's GSoC, I'm
> pleased to be working on new caching functionality for Django's ORM
> under the mentorship of Gary Wilson[1]! Big thanks to him and all the
> people from the community who made this possible. Now I've just got to
> hold up my end of the bargain ;)
>
> An updated version of my original proposal is available on Django's
> wiki[2] and the project repository is available from Google Project
> Hosting[3].
>
> This project aims to condense common caching idioms into simple
> QuerySet methods that deal with cached data intelligently, refill
> expired cache, track instance modification/deletion, descend through
> relationships, and so on. I've tried to make it accommodate the most
> common use cases, be it dumping front-page news articles to cache or
> keeping user profile data on hand when rendering a more dynamic site,
> but I'd definitely love to hear about how cached data is really used
> out in the field.
>
> Currently, three QuerySet methods, .cache(), .cache_related(),
> and .cache_set() handle individual, relation, and aggregate data
> caching, respectively. Each can take arguments to specify options like
> expiry time. Signals are registered behind the scenes to sync the
> cache, and there's also a proposal for transactional caching. Generic
> relations will also benefit greatly from .cache_related().

One concern I've had in the past, when people have proposed this sort
of work, is how object updates will interact with cache consistency
and correctness.

The simplest example (solve this and you've solved them all, I suspect)
is if the cache contains information about a QuerySet qs and part of
that queryset is object o1. If I update o1 in some other part of the
code, what assumptions are made about qs? Possibilities are:

(1) qs is invalidated from the cache and will be requeried from the
database the next time it is needed (consistency + correctness)

(2) qs doesn't change until it expires. The cache returns consistent
results (in the sense that you can predict what will be returned if you
know the state of the system), but not necessarily correct ones, because
some or all of the objects in the queryset might have changed.

(3) qs is updated somehow so that only necessary changes are requeried
on the next access. I suspect this option doesn't scale well, but it's
one possibility.
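
In code, the situation I have in mind is roughly this (Article and o1 are
arbitrary, and .cache() is your proposed method):

qs = Article.objects.filter(published=True).cache()   # cached queryset; o1 is in it

# ... later, in some unrelated piece of code ...
o1.title = 'New headline'
o1.save()   # which of (1)/(2)/(3) decides what the cached qs now returns?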

Is this one of the things bouncing in your head?


[...]


> I'm also curious about the progress of QuerySet refactoring, which
> would understandably have a huge impact on the code I'm writing.

Actually I think this is going to have a lot less impact than you think.
The elevator pitch for the QuerySet refactor is that it removes the SQL
query construction code from the QuerySet class and moves it all into
another class (that is an attribute of QuerySet). However, all the
Django-specific logic of QuerySets, including methods like filter(),
exclude(), etc., will not change. So a QuerySet class will pass off the
SQL work and retrieval from the database to another class, but all
user code (including other Django code) still interacts with the
QuerySet class itself, because that is responsible for things like
turning SQL results back into objects or values or whatever.

I would suspect you can avoid messing with the SQL construction that
much (I just read through your code so far). What we can do is add a
reliable __hash__ method or something similar to the Query class (the
thing that does the SQL construction) so that instead of you having to
replace QuerySet.iterator() and look at the query as it is constructed,
you can just ask the QuerySet.query attribute what its hash is and use
that to ascertain identity.
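
That is, once the refactor lands, a cache key for any queryset could be
derived without touching SQL construction at all. Purely illustrative
(nothing like this helper exists yet):

def queryset_cache_key(qs, prefix='olc'):
    # Assumes QuerySet.query grows a stable __hash__ after the refactor.
    opts = qs.model._meta
    return '%s:%s.%s:%s' % (prefix, opts.app_label, opts.object_name, hash(qs.query))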

I'm not sure what the comments about "working with relations" mean in
your iterator() implementation. Again, it probably won't be hard to do that
work by intercepting the QuerySet methods that are called, rather than
worrying about the query construction.

The good news/bad news (are you a glass half-full or glass half-empty
kind of guy?) is that the QuerySet refactor is now my major project, as
the unicode branch is winding down and we have to get this sucker done.

Regards,
Malcolm

Paul Collier

May 29, 2007, 12:26:54 PM
to Django developers
> If I update o1 in some other part of the
> code, what assumptions are made about qs?
Hmm, yeah... I haven't focused enough attention on .cache_set() yet,
heheh. I was definitely just going to implement (1) first, and then
worry about something more advanced once I got to the "smart"
functionality (which I might just be making default behaviour now). At
some point I'll be writing a class with which cached queries register
so that pre/post_save signals can freshen the cache; that should be
lightweight enough to scale by itself, but to implement (3) I guess
it'd also have to track which PKs are present. I'm not really sure if
(2) is desirable.
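
Very roughly, the registry I'm picturing would look something like this (all
names provisional, and the signal hookup is left out):

from django.core.cache import cache

class CacheRegistry(object):
    """Tracks, per model, the cache keys of querysets we've stored."""

    def __init__(self):
        self._keys = {}

    def register(self, model, key):
        self._keys.setdefault(model, set()).add(key)

    def invalidate(self, model):
        # Option (1): saving/deleting any instance drops every cached
        # queryset for that model; pre/post_save handlers would call this.
        for key in self._keys.pop(model, set()):
            cache.delete(key)

registry = CacheRegistry()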

> What we can do is add a
> reliable __hash__ method or something similar to the Query class (the
> thing that does the SQL construction) so that instead of you having to
> replace QuerySet.iterator() and look at the query as it is constructed,
> you can just ask the QuerySet.query attribute what its hash is and use
> that to ascertain identity.

Sounds great. I was wondering if there was some way to automatically
generate cache keys for .cache_set() as well. Perhaps this could tie
in quite nicely. Also, the approach I'm taking currently with the
overridden QuerySet worries me because I think it only really works
right now if .cache*() is the last method on the chain.
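
For example (Entry is just a stand-in model):

# Fine: caching is the last step in the chain.
Entry.objects.filter(pub_date__year=2007).cache()

# Shaky at the moment: anything chained after .cache() isn't guaranteed
# to keep the caching behaviour.
Entry.objects.cache().filter(pub_date__year=2007)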

> The good news/bad news (are you a glass half-full or glass half-empty
> kind of guy?) is that the QuerySet refactor is now my major project, as
> the unicode branch is winding down and we have to get this sucker done.

It sounds good to me! I've had a good experience migrating to new
Django features in the past (thank you newforms!) and this sounds like
it's definitely the way to go. I'm just wondering now if I should
maintain two versions of the project... guess I'll have to see once it
rolls around.

Thanks for the comments!
Paul

Brian Rosner

May 29, 2007, 12:49:41 PM
to django-d...@googlegroups.com
On 2007-05-29 01:37:50 -0600, Paul Collier
<pac...@gmail.com> said:

It looks like you are working on some really neat stuff, Paul! I came
across ticket #17 last week, thought it was an interesting ticket, and
started on some prototyping. In that prototyping it seemed that the
project you are working on sort of overlaps with some of my ideas. They
appear to actually be two different things, but kind of not.

#17 looks like it tries to solve having multiple instances of the same
object in memory. When thinking about implementing this, it starts to
work itself into the QuerySet and how it would select an object, since
it would need to look in memory first and then hit the database.
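
In miniature, what I was prototyping for #17 amounts to an in-process identity
map, something like this (purely illustrative, not a real API):

# Toy identity map: fetching the same row twice yields the same Python object.
_instances = {}

def get_cached_instance(model, pk):
    key = (model, pk)
    if key not in _instances:
        _instances[key] = model._default_manager.get(pk=pk)
    return _instances[key]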

I guess what I am trying to get at is: how do your project and ticket
#17 correlate, or better yet, is this something you have thought about?
Your project seems to be for when objects need to be cached across
several requests, which could mean multiple instances of Django in
memory at different times.

--
Brian Rosner
http://www.brosner.com/blog


Malcolm Tredinnick

May 29, 2007, 9:01:41 PM
to django-d...@googlegroups.com
On Tue, 2007-05-29 at 16:26 +0000, Paul Collier wrote:
> > If I update o1 in some other part of the
> > code, what assumptions are made about qs?
> Hmm, yeah... I haven't focused enough attention on .cache_set() yet,
> heheh. I was definitely just going to implement (1) first, and then
> worry about something more advanced once I got to the "smart"
> functionality (which I might just be making default behaviour now). At
> some point I'll be writing a class with which cached queries register
> so that pre/post_save signals can freshen the cache; that should be
> lightweight enough to scale by itself, but to implement (3) I guess
> it'd also have to track which PKs are present. I'm not really sure if
> (2) is desirable.

I think your first option has the same problem that you spotted in
number (3). Here's the possibly hidden problem with (1) -- which was "if
o1 changes, invalidate all cached querysets containing o1" -- every
time an object is saved, you have to work out which cached querysets
contain that object. This would be something like
O(n * avg_queryset_size) if you don't come up with a fancy hash or bitmap
or something to work out which querysets are dirty.

Not a bad first implementation, but check that it scales (I suspect the
naïve approach will struggle to do so). I wouldn't worry about solving
this up front, since it's a localised thing that you can tweak later
once the bulk of the functionality is there, but it sounds like a fun
problem to think about.
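
For what it's worth, one shape the fancier bookkeeping could take is a reverse
index from object identity to queryset cache keys, so that a save only touches
the querysets that actually contain the object. Nothing Django-specific in
this sketch:

from collections import defaultdict

# (model, pk) -> set of cache keys for querysets known to contain that object
_index = defaultdict(set)

def note_membership(model, pk, qs_key):
    _index[(model, pk)].add(qs_key)

def pop_dirty_keys(model, pk):
    # Look up exactly the querysets invalidated by this save, instead of
    # scanning every cached queryset.
    return _index.pop((model, pk), set())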

Cheers,
Malcolm

Gary Wilson

Jun 26, 2007, 1:54:46 AM
to django-d...@googlegroups.com
Getting back into the swing of things...

Paul Collier wrote:
>> If I update o1 in some other part of the
>> code, what assumptions are made about qs?
> Hmm, yeah... I haven't focused enough attention on .cache_set() yet,
> heheh. I was definitely just going to implement (1) first, and then
> worry about something more advanced once I got to the "smart"
> functionality (which I might just be making default behaviour now). At
> some point I'll be writing a class with which cached queries register
> so that pre/post_save signals can freshen the cache; that should be
> lightweight enough to scale by itself, but to implement (3) I guess
> it'd also have to track which PKs are present. I'm not really sure if
> (2) is desirable.

I could see how (2) would be desirable if you are expecting
qs.cache_set() to work like cache.set('myqs', qs), which is how I
interpreted your proposal. Though, I could also understand someone
wanting one of the other implementations too.

So what were your implementation thoughts about cache_set()? It appears
that you did not mean for it to simply be a shortcut for doing
cache.set('my_qs', qs). Could you please clarify this?
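
That is, whether calling qs.cache_set('my_qs') amounts to little more than
(give or take evaluating the queryset first):

from django.core.cache import cache

cache.set('my_qs', list(qs), 3600)   # stash the evaluated results under one key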

As far as cache correctness and the pre_save/post_save signals for
refreshing/invalidating the cache, there is still the possibility of
objects in the database getting altered outside of the currently-running
Django instance. In this case, your cache won't be correct, even after
going through all the trouble of trying to keep it updated. This is
where I think option (2) is the desired choice again, for simplicity and
explicitness. Objects are cached until they time out or until I
explicitly refresh them. Thoughts?

Also, all this object caching seems to me like something that would tie
in nicely with the models themselves. Maybe something similar to the
ModelAdmin class of newforms-admin. Cache refreshing/invalidating
behavior could even be an option if it is implemented.

class Author:
    name = CharField()

class AuthorCacheOptions(cache.ModelOptions):
    # Cache Author objects?
    cache = True
    # How long to cache Author objects?
    seconds = 3600
    # Try to keep cached objects in sync with database?
    # If False, cached objects are not updated until they expire.
    sync = True

Now, Author query sets using .cache() would use the AuthorCacheOptions
by default when no parameters are passed.
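
So, hypothetically, these two calls would then be equivalent (the argument
names just mirror the options above and are equally hypothetical):

Author.objects.all().cache()                          # falls back to AuthorCacheOptions
Author.objects.all().cache(seconds=3600, sync=True)   # explicit equivalent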

Gary
