Continuing with the string of posts regarding this year's GSoC, I'm pleased to be working on new caching functionality for Django's ORM under the mentorship of Gary Wilson[1]! Big thanks to him and all the people from the community who made this possible. Now I've just got to hold up my end of the bargain ;)
An updated version of my original proposal is available on Django's wiki[2] and the project repository is available from Google Project Hosting[3].
This project aims to condense common caching idioms into simple QuerySet methods that deal with cached data intelligently, refill expired cache, track instance modification/deletion, descend through relationships, and so on. I've tried to make it accommodate the most common use cases, be it dumping front-page news articles to cache or keeping user profile data on hand when rendering a more dynamic site, but I'd definitely love to hear about how cached data is really used out in the field.
Currently, three QuerySet methods, .cache(), .cache_related(), and .cache_set(), handle individual, relation, and aggregate data caching, respectively. Each can take arguments to specify options like expiry time. Signals are registered behind the scenes to sync the cache, and there's also a proposal for transactional caching. Generic relations will also benefit greatly from .cache_related().
More detail is available on the wiki page, but lots of the little issues that surround the implementation are still bouncing around in my head; the page is going to be expanding and mutating throughout the next few months, I'm sure. The timeline is a little underdeveloped, for one. Comments and criticisms are always welcome!
The project itself is currently a single module containing a class derived from QuerySet with all of the new functionality, and a little hack to inject .cache() and the other methods right back into QuerySet. It also ships with a project in which I'll be writing proper tests. Some sort of miniature cache-testing framework may be in order too. Eventually, perhaps it'll become fit for inclusion in Django proper--but either way it doesn't require a huge amount of core modification.
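The injection hack described above might look roughly like the following. This is only a sketch in modern Python: QuerySet here is a stand-in class, not Django's, and CachingQuerySet and its timeout argument are hypothetical names rather than the project's actual code.

```python
class QuerySet:
    """Stand-in for Django's QuerySet; not the real class."""

    def __init__(self, items=()):
        self.items = list(items)


class CachingQuerySet(QuerySet):
    """Hypothetical subclass that carries the new caching methods."""

    def cache(self, timeout=None):
        # Re-wrap the same data in the caching subclass and record the option.
        clone = CachingQuerySet(self.items)
        clone.cache_timeout = timeout
        return clone


# The "little hack": copy the new method back onto the base class so that
# every QuerySet gains .cache() without being constructed as the subclass.
QuerySet.cache = CachingQuerySet.cache

qs = QuerySet([1, 2, 3]).cache(timeout=60)
```

Because the method is attached to the base class, existing code that builds plain QuerySet objects picks it up without any change to the core.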
I'm also curious about the progress of QuerySet refactoring, which would understandably have a huge impact on the code I'm writing. Either way, this new code applies directly to a Django project I've been working on for the last while, so I have the motivation to maintain this project wherever it may go!
On Tue, 2007-05-29 at 07:37 +0000, Paul Collier wrote:
> Hello all!
> Continuing with the string of posts regarding this year's GSoC, I'm pleased to be working on new caching functionality for Django's ORM under the mentorship of Gary Wilson[1]! Big thanks to him and all the people from the community who made this possible. Now I've just got to hold up my end of the bargain ;)
> An updated version of my original proposal is available on Django's wiki[2] and the project repository is available from Google Project Hosting[3].
> This project aims to condense common caching idioms into simple QuerySet methods that deal with cached data intelligently, refill expired cache, track instance modification/deletion, descend through relationships, and so on. I've tried to make it accommodate the most common use cases, be it dumping front-page news articles to cache or keeping user profile data on hand when rendering a more dynamic site, but I'd definitely love to hear about how cached data is really used out in the field.
> Currently, three QuerySet methods, .cache(), .cache_related(), and .cache_set() handle individual, relation, and aggregate data caching, respectively. Each can take arguments to specify options like expiry time. Signals are registered behind the scenes to sync the cache, and there's also a proposal for transactional caching. Generic relations will also benefit greatly from .cache_related().
One concern I've had in the past, when people have proposed this sort of work, is how object updates will interact with cache consistency and correctness.
The simplest example (solve this and you've solved them all, I suspect) is if the cache contains information about a QuerySet qs and part of that queryset is object o1. If I update o1 in some other part of the code, what assumptions are made about qs? Possibilities are:
(1) qs is invalidated from the cache and will be requeried from the database the next time it is needed (consistency + correctness)
(2) qs doesn't change until it expires. The cache returns consistent results (in the sense that you can predict what will be returned if you know the state of the system), but not necessarily correct ones, because some or all of the objects in the queryset might have changed.
(3) qs is updated somehow so that only necessary changes are requeried on the next access. I suspect this option doesn't scale well, but it's one possibility.
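To make the difference between (1) and (2) concrete, here is a toy sketch using a plain in-memory dictionary rather than Django's cache framework. QuerySetCache, object_saved, and the "database" dictionary are all hypothetical names invented for illustration.

```python
class QuerySetCache:
    """Toy in-memory cache of query results, keyed by a query name."""

    def __init__(self, invalidate_on_save):
        self.invalidate_on_save = invalidate_on_save
        self.entries = {}  # key -> list of (pk, value) rows

    def get(self, key, run_query):
        # Run the query only on a cache miss.
        if key not in self.entries:
            self.entries[key] = run_query()
        return self.entries[key]

    def object_saved(self, pk):
        # Option (1): drop every cached result that contains the saved object.
        # Option (2): do nothing and let entries live until they expire.
        if not self.invalidate_on_save:
            return
        dirty = [key for key, rows in self.entries.items()
                 if any(row_pk == pk for row_pk, _ in rows)]
        for key in dirty:
            del self.entries[key]


database = {1: "old title"}


def run_query():
    return [(pk, title) for pk, title in sorted(database.items())]


strict = QuerySetCache(invalidate_on_save=True)   # option (1)
lazy = QuerySetCache(invalidate_on_save=False)    # option (2)
strict.get("qs", run_query)
lazy.get("qs", run_query)

database[1] = "new title"   # update o1 in some other part of the code
strict.object_saved(1)
lazy.object_saved(1)

fresh = strict.get("qs", run_query)   # requeried, sees the new title
stale = lazy.get("qs", run_query)     # predictable, but no longer correct
```

Note that object_saved scans every cached result here, which is the naive approach; how well that scales is a separate question.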
Is this one of the things bouncing in your head?
[...]
> I'm also curious about the progress of QuerySet refactoring, which would understandably have a huge impact on the code I'm writing.
Actually I think this is going to have a lot less impact than you think. The elevator pitch for the QuerySet refactor is that it removes the SQL query construction code from the QuerySet class and moves it all into another class (that is an attribute of QuerySet). However, all the Django-specific logic of QuerySets, including the methods like filter(), exclude(), etc., will not change. So a QuerySet class will pass off the SQL work and retrieval from the database to another class, but all user code (including other Django code) still interacts with the QuerySet class itself, because that is responsible for things like turning SQL results back into objects or values or whatever.
I would suspect you can avoid messing with the SQL construction that much (I just read through your code so far). What we can do is add a reliable __hash__ method or something similar to the Query class (the thing that does the SQL construction) so that instead of you having to replace QuerySet.iterator() and look at the query as it is constructed, you can just ask the QuerySet.query attribute what its hash is and use that to ascertain identity.
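As a rough illustration of the idea, the sketch below derives a stable key from the generated SQL and its parameters. The Query class is a stand-in invented for this example (not Django's), and as_sql() and cache_key() are assumed names. A digest is used instead of Python's built-in hash() because string hashes are randomised per process in current Python versions, which would make them useless as keys in a cache shared between processes.

```python
import hashlib


class Query:
    """Stand-in for the class that builds SQL; not Django's real Query."""

    def __init__(self, table, filters=None):
        self.table = table
        self.filters = dict(filters or {})

    def filter(self, **kwargs):
        # Return a new Query so that chained calls never mutate the original.
        merged = dict(self.filters)
        merged.update(kwargs)
        return Query(self.table, merged)

    def as_sql(self):
        # Sort the filters so that equivalent queries yield identical SQL.
        names = sorted(self.filters)
        where = " AND ".join("%s = %%s" % name for name in names)
        sql = "SELECT * FROM %s" % self.table
        if where:
            sql += " WHERE " + where
        return sql, [self.filters[name] for name in names]

    def cache_key(self):
        # Hash the SQL text together with its parameters.
        sql, params = self.as_sql()
        raw = repr((sql, params)).encode("utf-8")
        return hashlib.sha1(raw).hexdigest()


a = Query("author").filter(name="Paul", active=True)
b = Query("author").filter(active=True).filter(name="Paul")
c = Query("author").filter(name="Gary")
```

Here a and b produce the same key even though their filters were chained in a different order, while c produces a different one.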
I'm not sure what the comments about "working with relations" mean in your iterator() implementation. Again, it probably won't be hard to do that work by intercepting the QuerySet methods that are called, rather than worrying about the query construction.
The good news/bad news (are you a glass half-full or glass half-empty kind of guy?) is that the QuerySet refactor is now my major project, as the unicode branch is winding down and we have to get this sucker done.
> If I update o1 in some other part of the code, what assumptions are made about qs?
Hmm, yeah... I haven't focused enough attention on .cache_set() yet, heheh. I was definitely just going to implement (1) first, and then worry about something more advanced once I got to the "smart" functionality (which I might just be making default behaviour now). At some point I'll be writing a class with which cached queries register so that pre/post_save signals can freshen the cache; that should be lightweight enough to scale by itself, but to implement (3) I guess it'd also have to track which PKs are present. I'm not really sure if (2) is desirable.
> What we can do is add a reliable __hash__ method or something similar to the Query class (the thing that does the SQL construction) so that instead of you having to replace QuerySet.iterator() and look at the query as it is constructed, you can just ask the QuerySet.query attribute what its hash is and use that to ascertain identity.
Sounds great. I was wondering if there was some way to automatically generate cache keys for .cache_set() as well. Perhaps this could tie in quite nicely. Also, the approach I'm taking currently with the overridden QuerySet worries me because I think it only really works right now if .cache*() is the last method on the chain.
> The good news/bad news (are you a glass half-full or glass half-empty kind of guy?) is that the QuerySet refactor is now my major project, as the unicode branch is winding down and we have to get this sucker done.
It sounds good to me! I've had a good experience migrating to new Django features in the past (thank you newforms!) and this sounds like it's definitely the way to go. I'm just wondering now if I should maintain two versions of the project... guess I'll have to see once it rolls around.
It looks like you are working on some really neat stuff, Paul! I came across ticket #17 last week, thought it was interesting, and started some prototyping. While prototyping, it seemed that the project you are working on overlaps somewhat with my ideas. They appear to be two different things, but not entirely.
#17 looks like it tries to solve the problem of having multiple instances of the same object in memory. When thinking about implementing this, it starts to work its way into the QuerySet and how it would select an object, since it would need to look in memory first and then in the database.
I guess what I am trying to get at is: how do your project and ticket #17 correlate? Or better yet, is this something you have thought about? Your project seems to target objects that need to be cached over several requests, which could mean multiple instances of Django in memory at different times.
On Tue, 2007-05-29 at 16:26 +0000, Paul Collier wrote:
> > If I update o1 in some other part of the code, what assumptions are made about qs?
> Hmm, yeah... I haven't focused enough attention on .cache_set() yet, heheh. I was definitely just going to implement (1) first, and then worry about something more advanced once I got to the "smart" functionality (which I might just be making default behaviour now). At some point I'll be writing a class with which cached queries register so that pre/post_save signals can freshen the cache; that should be lightweight enough to scale by itself, but to implement (3) I guess it'd also have to track which PKs are present. I'm not really sure if (2) is desirable.
I think your first option has the same problem that you spotted in number (3). Here's the possibly hidden problem with (1) -- which was "if o1 changes, invalidate all querysets that are cached containing o1" -- every time an object is saved, you have to work out which cached querysets contain that object. This would be something like O(n*avg_size_querysets), where n is the number of cached querysets, if you don't come up with a fancy hash or bitmap or something to work out which are the dirty querysets.
Not a bad first implementation, but check that it scales (I suspect the naïve approach will struggle to do so). I wouldn't worry about solving this up front, since it's a localised thing that you can tweak later once the bulk of the functionality is there, but it sounds like a fun problem to think about.
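One possible shape for the "fancy hash" is an inverted index from each object to the cache keys whose results contain it, so that a save only touches the querysets that are actually dirty. The sketch below is an assumption about how that could work, not something from the project; InvalidationIndex and its methods are hypothetical names.

```python
from collections import defaultdict


class InvalidationIndex:
    """Hypothetical registry mapping each object to the cached queries holding it."""

    def __init__(self):
        self.keys_by_object = defaultdict(set)   # (model, pk) -> cache keys
        self.objects_by_key = {}                 # cache key -> {(model, pk), ...}

    def register(self, cache_key, model, pks):
        # Called when a queryset result is stored in the cache.
        members = {(model, pk) for pk in pks}
        self.objects_by_key[cache_key] = members
        for member in members:
            self.keys_by_object[member].add(cache_key)

    def invalidate(self, model, pk):
        """Return the cache keys made dirty by saving (model, pk) and forget them."""
        dirty = self.keys_by_object.pop((model, pk), set())
        for cache_key in dirty:
            for member in self.objects_by_key.pop(cache_key, ()):
                # The saved object's own entry was already popped above.
                if member in self.keys_by_object:
                    self.keys_by_object[member].discard(cache_key)
        return dirty


index = InvalidationIndex()
index.register("front_page", "Article", [1, 2, 3])
index.register("sidebar", "Article", [3, 4])
index.register("authors", "Author", [1])

dirty = index.invalidate("Article", 3)   # both article querysets contain pk 3
```

Finding the dirty set is then a single dictionary lookup per save, at the cost of extra memory and of keeping the index itself consistent, which gets harder when the cache is shared by several processes.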
Paul Collier wrote:
>> If I update o1 in some other part of the code, what assumptions are made about qs?
> Hmm, yeah... I haven't focused enough attention on .cache_set() yet, heheh. I was definitely just going to implement (1) first, and then worry about something more advanced once I got to the "smart" functionality (which I might just be making default behaviour now). At some point I'll be writing a class with which cached queries register so that pre/post_save signals can freshen the cache; that should be lightweight enough to scale by itself, but to implement (3) I guess it'd also have to track which PKs are present. I'm not really sure if (2) is desirable.
I could see how (2) would be desirable if you are expecting qs.cache_set() to work like cache.set('myqs', qs), which is how I interpreted your proposal. Though, I could also understand someone wanting one of the other implementations too.
So what were your implementation thoughts about cache_set()? It appears that you did not mean for it to simply be a shortcut for doing cache.set('my_qs', qs). Could you please clarify this?
As far as cache correctness and the pre_save/post_save signals for refreshing/invalidating the cache go, there is still the possibility of objects in the database getting altered outside of the currently running Django instance. In this case, your cache won't be correct, even after going through all the trouble of trying to keep it updated. This is where I think option (2) is the desired choice again, for simplicity and explicitness. Objects are cached until they time out or until I explicitly refresh them. Thoughts?
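A minimal sketch of that "cached until timeout or explicit refresh" behaviour follows, with an injectable clock so the expiry can be exercised without waiting. ExpiringCache and its method names are hypothetical and do not correspond to Django's cache API.

```python
import time


class ExpiringCache:
    """Toy cache for option (2): entries live until they time out or are refreshed."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}  # key -> (expires_at, value)

    def get(self, key, load, timeout):
        hit = self.entries.get(key)
        if hit is not None and hit[0] > self.clock():
            return hit[1]
        return self.refresh(key, load, timeout)

    def refresh(self, key, load, timeout):
        # Explicit refresh: reload from the source and restart the timer.
        value = load()
        self.entries[key] = (self.clock() + timeout, value)
        return value


now = [0.0]                                  # fake clock for the example
cache = ExpiringCache(clock=lambda: now[0])
database = {"author:1": "Paul"}


def load():
    return database["author:1"]


first = cache.get("author:1", load, timeout=3600)
database["author:1"] = "Paul C."             # changed by another process, no signal fires
before_expiry = cache.get("author:1", load, timeout=3600)
now[0] = 3601.0
after_expiry = cache.get("author:1", load, timeout=3600)
```

The stale read before expiry is exactly the behaviour the caller signed up for, which is what makes this option easy to reason about.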
Also, all this object caching seems to me like something that would tie in nicely to the models themselves. Maybe something similar to the ModelAdmin class of newforms-admin. Cache refreshing/invalidating behavior could even be an option if it is implemented.
    class Author:
        name = CharField()

    class AuthorCacheOptions(cache.ModelOptions):
        # Cache Author objects?
        cache = True
        # How long to cache Author objects?
        seconds = 3600
        # Try to keep cached objects in sync with database?
        # If False, cached objects are not updated until they expire.
        sync = True
Now, Author query sets using .cache() would use the AuthorCacheOptions by default when no parameters are passed.
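How those defaults might be resolved when .cache() is called can be sketched in plain Python. Everything here is hypothetical: the registry, the resolve_cache_options() helper, and the base-class default values (300 seconds, caching and syncing off) are assumptions made for illustration, not part of any proposal.

```python
class ModelOptions:
    """Hypothetical base class for per-model cache options (defaults assumed)."""
    cache = False
    seconds = 300
    sync = False


class AuthorCacheOptions(ModelOptions):
    cache = True
    seconds = 3600
    sync = True


# Hypothetical registry; a real design would need some registration hook.
registry = {"Author": AuthorCacheOptions}


def resolve_cache_options(model_name, **overrides):
    """Merge explicit .cache() arguments over the model's registered defaults."""
    options = registry.get(model_name, ModelOptions)
    resolved = {name: getattr(options, name) for name in ("cache", "seconds", "sync")}
    unknown = set(overrides) - set(resolved)
    if unknown:
        raise TypeError("unknown cache option(s): %s" % ", ".join(sorted(unknown)))
    resolved.update(overrides)
    return resolved


defaults = resolve_cache_options("Author")
short_lived = resolve_cache_options("Author", seconds=60)
unregistered = resolve_cache_options("Book")
```

Explicit arguments win over the options class, and a model with no options class falls back to the base defaults.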