Now that the dust has settled on the descriptor work, I've turned back
to bulk delete.
I've attached a new patch to the original ticket -
http://code.djangoproject.com/ticket/1219 - which is updated to
reflect the descriptor changes.
Notes:
1) As with the last version, the joins_allowed parameter to
_get_sql_clause is no longer required.
2) The DELETE_ALL protection mechanism has been removed. Rather than
have a special keyword, I have opted to remove the delete method from
the manager. The accidental obliteration case of
'Articles.objects.delete()' is no longer possible - you would need to
execute 'Articles.objects.all().delete()' to destroy everything. Since
this pretty much reads as an explicit 'destroy all', I didn't think
the safety mechanism was required.
3) As noted in the last attempt, the SQL that is generated for a
single object delete is almost identical to the SQL that is generated
for a bulk delete. The changes to manipulators mean that the
manipulator is no longer a good place for the common implementation,
so it now sits in a utility method in query.py.
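The trade-off in note 2 can be illustrated with a toy sketch (plain Python, not Django's actual classes): the manager exposes all() but deliberately has no delete(), so the only spelling that destroys everything is the explicit all().delete() chain.

```python
class QuerySet:
    def __init__(self, rows):
        self._rows = rows

    def delete(self):
        # bulk delete: destroy every row matched by this query set
        self._rows.clear()

class Manager:
    def __init__(self, rows):
        self._rows = rows

    def all(self):
        return QuerySet(self._rows)
    # deliberately no delete() here: Model.objects.delete() would raise
    # AttributeError; only Model.objects.all().delete() works

rows = ['article-1', 'article-2']
objects = Manager(rows)
objects.all().delete()   # explicit 'destroy all'
```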
Comments? Have I got it right this time?
Russ Magee %-)
> Comments? Have I got it right this time?
It looks OK to me. I think the main issue is whether we want methods
that can alter data on the QuerySet directly. This kind of relates to
a change I mentioned before - having the 'RelatedManager' classes act
as QuerySets (probably by inheritance). This would change whether you
would have to do things like reporter.article_set.all(), or just
reporter.article_set. Also, without your patch, you could do
reporter.article_set.delete() (whether this was desired or not), with
it you would have to do reporter.article_set.all().delete().
Personally I'm for removing the need to do .all() with the
RelatedManagers, in both cases.
On the safety issue, there is still the problem of being able to do
{{ reporters.delete }} in the template (assuming you've set a QuerySet
of reporters as 'reporters' in the context). This can be easily fixed
by setting the 'alters_data' attribute on the delete() function.
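The alters_data convention Luke mentions is the standard Django idiom of setting an attribute on the function after its def; the template variable resolver then refuses to call anything so flagged. A minimal non-Django sketch of the mechanism:

```python
class QuerySet:
    def count(self):
        return 3

    def delete(self):
        raise RuntimeError("data gone!")
    # flag the method so the template engine will never invoke it
    delete.alters_data = True

def resolve(obj, attr):
    # simplified stand-in for what the template variable resolver does
    value = getattr(obj, attr)
    if callable(value):
        if getattr(value, 'alters_data', False):
            return ''   # silently refuse to call data-altering methods
        return value()
    return value
```

With this in place, {{ reporters.delete }} in a template resolves to an empty string instead of deleting anything, while harmless methods like count still work.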
Nitpicking: these lines in the tests need to be changed:
# the article is represented by "<Article object>", because we haven't
# given the Article model a __repr__() method.
Regards,
Luke
--
Life is complex. It has both real and imaginary components.
Luke Plant || L.Plant.98 (at) cantab.net || http://lukeplant.me.uk/
I don't follow - why is direct modification of the query set a bad
thing? Is the problem one of expected behaviour, or is it an internal
operation problem?
Side Note - while I was thinking about this comment, I found a problem
with my patch. The QuerySet.delete() method needs to have
'self._result_cache = None' added to the end to ensure that the cache
is cleared after the delete is performed. This means that if the query
set that was deleted is reused, it will be re-evaluated (as an empty
set). I've made this change locally, and modified the last
many-to-many unit test to test for it.
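The cache-clearing fix described above can be sketched like this (a toy stand-in for QuerySet, with the database as a plain list): without resetting _result_cache at the end of delete(), a reused query set would keep serving stale rows.

```python
class QuerySet:
    def __init__(self, db):
        self._db = db
        self._result_cache = None

    def _fetch(self):
        # lazy evaluation: hit the "database" only on the first access
        if self._result_cache is None:
            self._result_cache = list(self._db)
        return self._result_cache

    def delete(self):
        self._db.clear()
        self._result_cache = None   # the fix: force re-evaluation next time

db = [1, 2, 3]
qs = QuerySet(db)
qs._fetch()          # populates the cache with [1, 2, 3]
qs.delete()          # reusing qs now re-evaluates as an empty set
```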
> This would change whether you
> would have to do things like reporter.article_set.all(), or just
> reporter.article_set.
I thought the implied .all() thing was knocked on the head a few weeks
back because of problems with caching.
Is this change still on the table?
> Also, without your patch, you could do
> reporter.article_set.delete() (whether this was desired or not), with
> it you would have to do reporter.article_set.all().delete().
> Personally I'm for removing the need to do .all() with the
> RelatedManagers, in both cases.
Intentional, because of safety. If you want to bulk delete, you have
to explicitly nominate that you want to delete all() the elements of
the set. This is especially important in avoiding accidental confusion
between reporter.article_set.clear() and reporter.article_set.delete()
(which isn't outside the realm of possibility with new users, or old
users who haven't had enough sleep :-).
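The clear()/delete() confusion is easy to see in a toy model (names and storage here are illustrative, not Django's): clear() only breaks the relation, while delete() destroys the related rows themselves.

```python
class RelatedSet:
    def __init__(self, related_rows, all_rows):
        self._related = related_rows   # this reporter's articles
        self._all = all_rows           # every article in the "database"

    def clear(self):
        # disassociate: the articles still exist, just unlinked
        self._related.clear()

    def delete(self):
        # bulk delete: the articles are removed from the database entirely
        for row in list(self._related):
            self._all.remove(row)
        self._related.clear()

all_articles = ['a1', 'a2', 'a3']
reporters_articles = ['a1', 'a2']

RelatedSet(reporters_articles, all_articles).clear()
# all_articles still holds 'a1' and 'a2'; only the link was removed
```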
> This can be easily fixed
> by setting the 'alters_data' attribute on the delete() function.
> Nitpicking: these line in the tests need to be changed:
I've added these two fixes, and updated the patch on the ticket with
these two changes, plus the cache clearing modification.
Russ Magee %-)
> > I think the main issue is whether we want methods
> > that can alter data on the QuerySet directly
>
> I don't follow - why is direct modification of the query set a bad
> thing? Is the problem one of expected behaviour, or is it an internal
> operation problem?
I wasn't saying it was a bad thing, just highlighting the separation
that exists at the moment between the 'Manager' class and the
'QuerySet' class. But I was wrong anyway - the current implementation
does have delete() on QuerySets, just not add(), remove() and clear().
It wouldn't be possible to have add() on QuerySet (at least not
if .filter() was used), but remove() and clear() would be possible I
think.
> > This would change whether you
> > would have to do things like reporter.article_set.all(), or just
> > reporter.article_set.
>
> I thought the implied .all() thing was knocked on the head a few
> weeks back because of problems with caching:
The example I gave was a related object lookup, and so the caching
problem doesn't apply. The Manager objects persist since they are
attached to the model class, but the RelatedManager objects aren't.
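That lifetime difference can be shown with a minimal descriptor sketch (illustrative only): the manager is one shared object on the class, while a related-manager is manufactured afresh on every attribute access, so nothing persists between lookups.

```python
class RelatedDescriptor:
    def __get__(self, instance, owner):
        # a brand-new related-manager object on every access
        return object()

class Reporter:
    article_set = RelatedDescriptor()
    objects = object()   # stand-in for a Manager: one shared instance

r = Reporter()
# Reporter.objects is the same object every time;
# r.article_set is a fresh object per access, so a cache on it can't persist
```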
> > Also, without your patch, you could do
> > reporter.article_set.delete() (whether this was desired or not),
> > with it you would have to do reporter.article_set.all().delete().
> > Personally I'm for removing the need to do .all() with the
> > RelatedManagers, in both cases.
>
> Intentional, because of safety. If you want to bulk delete, you have
> to explicitly nominate that you want to delete all() the elements of
> the set. This is especially important in avoiding accidental
> confusion between reporter.article_set.clear() and
> reporter.article_set.delete() (which isn't outside the realm of
> possibility with new users, or old users who haven't had enough sleep
> :-).
For the delete() case, I guess I'm not that bothered either way about
having to do .all(), since I don't need to do bulk delete that often.
However, whether for the Manager or the RelatedManager, it's a bit of a
deviation from the other methods - for all the others, the Manager has
proxy methods to the QuerySet. It might be confusing to miss this one
out. Having a special case in a core API simply to make something
*harder* to do seems a bit ugly, and fairly un-pythonic too - IMO the
'alters_data' attribute protects non-Python developers and that is
enough.
I don't actually find the whole safety argument very convincing. I keep
a backup of my database, and if I were to accidentally issue a command
like Articles.objects.delete(), I would have absolutely no-one to blame
but myself - what did I expect it to do? Even without reading the API
docs, I could have guessed, and if I didn't bother to read the docs for
an API that is used to manage my database I'm doubly to blame. The
only time I'm likely to execute .delete() in an I-wonder-what-this-does
kind of way is when I'm messing around learning the API, which I'm not
really going to do with a real, un-backed-up database.
With .delete() on a set of related objects, I can see there is slightly
more room for confusion, but then again it can't get much simpler than
"delete() deletes things from the database."
I'm actually more bothered by the .all() required every time you use
related lookups to actually get the objects (which isn't what this
thread is really about, I know). There is no reason for it technically
(the reason for Managers was caching and persistence), and it kind of
breaks the name - it is no longer an 'article_set', it is something
that gives you an article set if you do 'all()' on it. Since
'article_set' could easily be accessed in templates (I have lots of
instances in my own project), I think using it needs to be simpler.
I still think this would be a lot easier if we didn't use the name
'all()' to mean 'give me a cached version of this query'. '.cached()'
would seem more sensible. Managers would be uncached query sets by
default, whereas related object sets would reasonably be cached by
default (a class is going to stick around between requests; an instance
isn't unless you take heroic measures to ensure it does).
So I would suggest:
Article.objects:
is a query set that supports all the normal operations, but is uncached
Article.objects.cached():
is a query set that is cached. Otherwise the same as Article.objects
my_article.author_set
a query set that is cached. If there were a convincing use case, maybe
a .uncached() method would be sensible.
Anyway, I think .all() is a very confusing term for this, and I hope it
goes away...
The other solution to this is to have a 'context' variable in the vein
of PEP 343, which the caches would be attached to (rather than being in
the caches themselves).
This variable would be set at the beginning/end of requests, and also
when switching threads in a multi threaded server, if any exist... This
way the .cached() would be unnecessary - the cache would be scoped to
the request in the web context, and caching contexts could be explicitly
set in other contexts.
Er, I meant 'in the query sets themselves'...
> So I would suggest:
>
> Article.objects:
>
> is a query set that supports all the normal operations, but is
> uncached
>
> Article.objects.cached():
>
> is a query set that is cached. Otherwise the same as Article.objects
In this case, would Article.objects.filter(foo=bar) also automatically
turn caching on, or would you have to do
Article.objects.cached().filter(foo=bar)
?
I've done something like this with my fluid environment stuff for
Python. Quite handy and can easily be used outside the context of web
requests, too. And it could be used for some other parts in Django, too
- we do already have some places where we have threadlocals (like the
current language) which could be expressed easily with fluid
(dynamically scoped) variables.
bye, Georg
I was imagining that any call that created a new query set (ie modified
some search parameter) would turn caching on. Of course, this wouldn't
cache across calls that happened to end up with the same query. (That
would necessitate hanging the cache off a context variable as I talked
about before.)
The only real 'trouble' situation is the manager that is stored in the
class, I think.
> I was imagining that any call that created a new query set (ie modified
> some search parameter) would turn caching on. Of course, this wouldn't
> cache across calls that happened to end up with the same query. ( That
> would necessitate hanging the cache off a context variable as I talked
> about before).
I think I'm missing something here - modulo the dislike of all(), how
does this differ from what happens right now?
Article.objects.all() (name problems notwithstanding) returns a
QuerySet with a cache that is populated the first time the query is
interrogated. If you apply a filter, you get a new QuerySet. If you
don't interrogate for data, there is no cache populated.
Admittedly, the downside here is that end user doesn't have much
control over the clearing/repopulation of the cache, or disabling use
of the cache altogether, on any particular QuerySet. However, the user
can always clone a query to get a fresh query with a clean cache;
alternatively, we could add QuerySet.clear_cache(),
QuerySet(cache=False), and similar mechanisms to QuerySet.
> The only real 'trouble' situation is the manager that is stored in the
> class, I think.
Which brings us back to why all() was introduced in the first place -
problems with the lifespan of the cache of the 'manager as QuerySet'
object.
While I concede that Article.objects -> set of all objects is a very
nice mental model, I can see us working ourselves into knots trying to
overcome caching problems on an Article.objects 'manager QuerySet'.
all() isn't quite as nice as a mental model, but it certainly doesn't
offend me, and it seems to have a lot fewer problems (or at least is
restricted to problems inherent in any caching mechanism).
Russ Magee %-)
> While I concede that Article.objects -> set of all objects is a very
> nice mental model, I can see us working ourselves into knots trying to
> overcome caching problems on an Article.objects 'manager QuerySet'.
> all() isn't quite as nice as a mental model, but it certainly doesn't
> offend me, and it seems to have a lot fewer problems (or at least is
> restricted to problems inherent in any caching mechanism).
You seem to have just repeated what I said .. I'm trying to work out
what point you are making.
My position is :
The name '.all()' is in no way indicative of the purpose of the call.
It is an accidental name chanced upon during implementation, as far as
I can see.
I explained a way that would allow us not to have to use
'.all()'/'.cached()' ( storing the caches in a global dict keyed by a
tuple of (db_context,queryspec) rather than in the queryset itself).
This is clearly doable.
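The externalised cache described above might look roughly like this (all names are illustrative): results live in a global dict keyed by (db_context, queryspec) instead of on each query set, so two query sets describing the same query share one cache entry.

```python
_query_cache = {}

def fetch(db_context, queryspec, run_query):
    # results are stored globally, keyed by context and query, not
    # inside any particular query set object
    key = (db_context, queryspec)
    if key not in _query_cache:
        _query_cache[key] = run_query()
    return _query_cache[key]

calls = []
def run():
    calls.append(1)       # count actual database hits
    return ['row']

fetch('request-1', "headline='xyz'", run)
fetch('request-1', "headline='xyz'", run)   # served from the shared cache
```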
When I said 'the only real trouble situation', I was explaining why this
was needed at all, not saying 'this doesn't really matter'.
So basically the choices are:
a) .all(), I think it is fairly confusing in mapping the name to the
semantics
b) .cached(), the same semantics but at least it does what it says..
c) externalise the caching, but this means decisions have to be made
about how to set the db_context etc etc. Logically this should be done
via events or middleware... and if transactions/unitofwork gets looked
at that would be the logical db_context I'm talking about above...
but I'm sure there will be cries of 'coupling' whatever decision is made
here... sigh.
> You seem to have just repeated what I said .. I'm trying to work out
> what point you are making.
> So basically the choices are:
> a) .all(), I think it is fairly confusing in mapping the name to the
> semantics
> b) .cached(), the same semantics but at least it does what it says..
> c) externalise the caching, but this means decisions have to be made
> about how to set the db_context etc etc. Logically this should be done
> via events or middleware... and if transactions/unitofwork gets looked
> at that would be the logical db_context I'm talking about above...
> but I'm sure there will be cries of 'coupling' whatever decision is made
> here... sigh.
Er, yeah. That was kind of assumed. I'm still working from the
assumption that we want to make .objects a queryset.
> i.e, cached() has the same operation as filter() -
> it returns a clone, but with caching enabled. Example usage:
>
> - Article.objects is the only way to get a clean query set, caching disabled
> by default
> - Article.objects.cached() takes the base query set, and returns a clone
> with caching enabled
> - Article.objects.filter(...) returns a filtered, uncached query,
> - Article.objects.filter(...).cached() is the same query, but with a cache.
Yes, this is exactly what I intended to convey... sorry if I didn't make
myself clear...
> - q = Article.objects.filter(...).cached(); q.cached() returns a second
> version of the filter query, with a clean cache
Hm, not really sure if that is obvious. I was assuming cached would be
something like:
def cached(self):
    if not self._cached:
        other = self.clone()
        other._cached = True
        return other
    else:
        return self
Ie, it just returns itself if it will "do". Query sets can be passed
around, and the caller is intending to say "make sure this is cached".
For your intention, I imagine
p = q.clone()
p.reset_cache()
conveys the meaning a bit better. Does that make sense?
> This also removes the need to proxy the QuerySet methods through the
> manager. This really appeals to me, because the Article.objects.filter()
> notation in the current implementation bugs me - to me, it should be
> Article.objects.all().filter().
And then the .all() really starts to look meaningless.
> It might also be a good idea to add a 'reset_cache()' method to allow
> developers to reset the cache on a QuerySet (no-op if the query is not
> cached), and maybe even a 'non_cached()' method to return a cloned QuerySet
> with the cache disabled;
Yes, I think these are necessary.
> Externalized caching is an interesting idea, but worries me slightly.
> Caching is one of those areas where the exact behaviour that you want can be
> highly application and situation dependent. I am more comfortable with
> making the developer take responsibility for what they want to
> cache, when they want to cache it, and the lifespan of that cache.
> Particularly if we can make the default behaviour 'don't cache', with
> caching being an opt-in behaviour (which the approach I described earlier on
> this mail would do).
Also remember that related objects have a queryset attached to them - eg
my_article.reporters. The assumption so far has been that this will be
cached by default, ie you would use .non_cached() on them to get a
non-caching version. This does make it harder to understand whether
caching is on or not (but it also kind of makes sense, as there are very
few situations where you want this uncached).
The only thing that was assumed to be uncached was the .objects as it is
in a class which sticks around.
So without externalising, the choices are:
a) inconsistency with managers and related objects having different
default caching behaviour.
b) the need to have a lot of .cached() being used on related objects,
when that is likely to be what people want in 95% of situations.
> A global cache that automagically works out the most recent version of a
> query has all sorts of potential for expectation to differ from
> implementation, (i.e., developer writes their app, doesn't get the result
> they expect because the caching implementation cleared a cache when they
> were not expecting it to be, and complains "why doesn't it work").
Externalising caching means that consistent *and* non-ugly behaviour can
be offered for all "query set entry points" - managers and related objects.
The choice of which consistent behaviour is:
a) All caches last the http request. In a non-http context, use these
boundary functions (or a with statement in Python 2.5).
b) Caches last the length of a db transaction. In an http context, the
db transaction will last the length of an http request unless explicitly
changed using these view decorators.
In both:
Use .non_cached() to get a non caching query set. Use .reset_cache()
to reset the cache in a query set.
> To boot,
> it will be a little hairy to implement, there will be cries of 'coupling',
> etc... it seems like asking for a lot of headaches without a whole lot of
> benefit at the end of the day.
I think it is clearly a better solution, but I don't have the time to do
it myself. The first solution is better than the current situation though.
> Yes, this is exactly what I intended to convey... sorry if I didn't make
> myself clear...
It might take me a few attempts to get the point, but as long as we all
end up on the same page... :-)
> > - q = Article.objects.filter(...).cached(); q.cached() returns a second
> > version of the filter query, with a clean cache
>
> Hm, not really sure if that is obvious. I was assuming cached would be
> something like:
...
> Ie, it just returns itself if it will "do". Query sets can be passed
> around, and the caller is intending to say "make sure this is cached".
I was assuming that cached() would _always_ clone. If the source
QuerySet is non-cached, you get a clone with caching enabled; if the
source is cached, you get a new QuerySet with a clean cache. The use
case I can see is:
p = Article.objects.filter(...) # Original, uncached query
q = p.cached() # Copied, cache version 1
for obj in q: ... # evaluate cache 1
#Add an object that would match q
r = q.cached() # Copy of cached query, cache version 2
for obj in r: ... # evaluate cache 2; new object is in list
for obj in q: ... # Iterate over cache 1; new object not in list
This way, the cache becomes a store of what was in the database when
the query was executed, rather than there being a unique cache for any
given query. More on this later...
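The always-clone semantics described above could be sketched like so (a toy QuerySet, not the real one): each cached() call returns a fresh clone with an empty cache, so a populated cache becomes a snapshot of the database at evaluation time.

```python
import copy

class QuerySet:
    def __init__(self, db):
        self._db = db
        self._result_cache = None

    def cached(self):
        # always clone, always with a clean cache
        other = copy.copy(self)
        other._result_cache = None
        return other

    def __iter__(self):
        # evaluation fills the cache; later iterations reuse the snapshot
        if self._result_cache is None:
            self._result_cache = list(self._db)
        return iter(self._result_cache)

db = ['a']
q = QuerySet(db).cached()
list(q)              # evaluate: cache 1 holds the snapshot ['a']
db.append('b')       # something else changes the database
r = q.cached()       # clean cache: evaluating r will see the new row
```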
> And then the .all() really starts to look meaningless.
Yup. Hey, this understanding each other thing is fun! :-)
> So without externalising, the choices are:
>
> a) inconsistency with managers and related objects having different
> default caching behaviour.
> b) the need to have a lot of .cached() being used on related objects,
> when that is likely to be what people want in 95% of situations.
I would tend to go with (b), with a chorus of 'fix this in the
documentation'. i.e., make the documentation very clear about the fact
that a cache is available, and might be a good way to optimize
performance in some cases.
Also - 95%? Really? The performance hit of non-caching only matters if
you iterate over any given QuerySet more than once per http request.
Maybe I'm being unimaginative, but thinking over my common use cases,
multiple iterations over a QuerySet per http request would be the
exception, rather than the rule (or at the very least, nowhere near
95% of all use cases).
There is also an option (c): make the manager a cached query, and
document the need to use non_cached(). If multiple iterations over a
query set really is the 95% use case, it would make sense to me to
cache across the board.
> Externalising caching means that consistent *and* non-ugly behaviour can
> be offered for all "query set entry points" - managers and related objects.
Well - for your definition of consistent and non-ugly, anyway :-)
I think we may have different ideas on the cardinality of the
query-cache relationship. Consider:
p = Article.objects.filter(headline='xyz').cached()
q = Article.objects.filter(headline='xyz').cached()
If I am understanding your caching model correctly, you would ideally
like to see p and q using the same cache. This model is almost
impossible to achieve without externalization; the mapping of this
caching model onto the existing framework (where p and q have
different caches unless q is a clone of p) is very much ugly. Ergo,
pro externalization.
My problem with this model is that p.reset_cache() would clear q's
cache, too. On top of that, there is the question of when the system
automagically flushes the cache (per http request and per transaction
being two reasonable suggestions).
I would argue that p and q should have independent caches. This way,
once you have iterated over p, you know exactly what is there; if you
iterate over p again, you will always get exactly the same result,
regardless of what you have done to q. This makes a populated cache a
snapshot of the state of the database at a given time. No need to
worry if someone/something else has changed the database - my snapshot
is always the same.
It's a slightly different model of caching, but it is a consistent and
easily explainable model, IMHO non-ugly, and doesn't require
externalization to achieve - it is entirely covered by the existing
framework, plus the modifications we have been discussing.
Russ Magee %-)
So far, the only criticism seems to revolve around the significance of
all(), and exactly how the cache works/should work. However, bulk
delete is not the only operation that would be affected by changes to
the caching implementation.
Are there any objections to committing the bulk delete patch as it
currently stands (accepting that there may be changes if the cache
mechanism changes)?
Russ Magee %-)
Hm, the issue I would have with that is that the execution of the query
is lazy, ie where the .cached() call happens does not determine what is
being cached. (And I think the laziness is more important than the
caching ;-) So people end up doing iter(q) just to fill the cache, which
seems a bit strained: if you want to compare the contents, why not do
list(q)? You know that will stay the same.
ie.
p = Article.objects.filter(...) # Original, uncached query
q = list(p)
#Add an object that would match q
r = list(p)
for obj in r: ... # new object is in list
for obj in q: ... # new object not in list
I dunno, this seems clearer to me for that particular use case. Don't
care too much either way, it just seems your model gives .cached() quite
a lot of responsibilities. But I think your model is something everyone
can get used to, and can become idiomatic. Also it means less churn when
people want to do more than just iterate through a list, eg call a
custom method etc. So either way is fine with me.
>> So without externalising, the choices are:
>>
>> a) inconsistency with managers and related objects having different
>> default caching behaviour.
>> b) the need to have a lot of .cached() being used on related objects,
>> when that is likely to be what people want in 95% of situations.
>
> I would tend to go with (b), with a chorus of 'fix this in the
> documentation'. i.e., make the documentation very clear about the fact
> that a cache is available, and might be a good way to optimize
> performance in some cases.
>
> Also - 95%? Really? The performance hit of non-caching only matters if
> you iterate over any given QuerySet more than once per http request.
> Maybe I'm being unimaginative, but thinking over my common use cases,
> multiple iterations over a QuerySet per http request would be the
> exception, rather than the rule (or at the very least, nowhere near
> 95% of all use cases).
Yes, I was thinking of '95% of cases where caching matters at all' ;-)
I agree a lot of the time it doesn't matter. I also agree that b) is
probably preferable to a)...
Also I guess we need to take into account that doing eg
p = list(Article.objects)
would cost more with caching than without, and it probably covers a lot
of cases (I hope).
> There is also an option (c): make the manager a cached query, and
> document the need to use non_cached(). If multiple iterations over a
> query set really is the 95% use case, it would make sense to me to
> cache across the board.
I think this probably would end up confusing and surprising. Docs can
only go so far ;-)
>> Externalising caching means that consistent *and* non-ugly behaviour can
>> be offered for all "query set entry points" - managers and related objects.
>
> Well - for your definition of consistent and non-ugly, anyway :-)
I just meant not having .cached() sprinkled around everywhere, really.
> I think we may have different ideas on the cardinality of the
> query-cache relationship. Consider:
>
> p = Article.objects.filter(headline='xyz').cached()
> q = Article.objects.filter(headline='xyz').cached()
>
> If I am understanding your caching model correctly, you would ideally
> like to see p and q using the same cache. This model is almost
> impossible to achieve without externalization; the mapping of this
> caching model onto the existing framework (where p and q have
> different caches unless q is a clone of p) is very much ugly. Ergo,
> pro externalization.
Yes, that was the idea.
> My problem with this model is that p.reset_cache() would clear q's
> cache, too. On top of that, there is the question of when the system
> automagically flushes the cache (per http request and per transaction
> being two reasonable suggestions).
Yes, in that case I would make reset_cache() a function, eg
reset_cache(q)
I think this conveys that the cache is global.
> I would argue that p and q should have independent caches. This way,
> once you have iterated over p, you know exactly what is there; if you
> iterate over p again, you will always get exactly the same result,
> regardless of what you have done to q. This makes a populated cache a
> snapshot of the state of the database at a given time. No need to
> worry if someone/something else has changed the database - my snapshot
> is always the same.
Yeah, I'm not convinced this is natural with a lazily fetched thing
though. I'm not sure that a huge number of people will internalise the
'throwaway iter() to cache' idiom...
> It's a slightly different model of caching, but it is a consistent and
> easily explainable model, IMHO non-ugly, and doesn't require
> externalization to achieve - it is entirely covered by the existing
> framework, plus the modifications we have been discussing.
>
> Russ Magee %-)
Yes, I am happy with either model. I do tend to think that the existing
model will end up with a lot of .cached() everywhere, which might be
distracting (probably a better term than ugly in this case).
This change, at a glance, looks good. Go ahead and commit it.
As for renaming all() to cached() and/or making SomeModel.objects a
QuerySet directly and having to deal with cached vs. non-cached query
sets -- we've already had this discussion, decided on it and moved on.
It's SomeModel.objects.all(), and there's no having to mess with
cached vs. non-cached query sets. The way it's currently handled in
magic-removal is great, and a fine improvement. Let's move forward.
Adrian
--
Adrian Holovaty
holovaty.com | djangoproject.com | chicagocrime.org
> As for renaming all() to cached() and/or making SomeModel.objects a
> QuerySet directly and having to deal with cached vs. non-cached query
> sets -- we've already had this discussion, decided on it and moved
> on. It's SomeModel.objects.all(), and there's no having to mess with
> cached vs. non-cached query sets. The way it's currently handled in
> magic-removal is great, and a fine improvement. Let's move forward.
I think the discussion we had before centred on Article.objects.all()
(i.e. Managers), but the discussion I brought up again was specifically
to do with RelatedManagers (it then digressed into Managers in
general). Specifically, I now have this in some of my templates (my
app has 'camps' which have zero or more 'leaders'):
{% for leader in camp.leaders.all %}
<li>{{ leader.name }}:<br/>
{{ leader.info|escape|linebreaksbr }}</li>
{% endfor %}
Before I had this:
{% for leader in camp.get_leader_list %}
<li>{{ leader.name }}:<br/>
{{ leader.info|escape|linebreaksbr }}</li>
{% endfor %}
My point was I don't think this is a 'fine improvement' from a template
author's point of view - it's a definite step backwards IMO (and
s/all/cached would be just as bad, if not worse), especially
considering the docs that are available in 'admin/doc/models/', which
gave you everything you needed to know before. No other lists of
objects that are populated into the context will work like this (e.g.
objects which don't come from the ORM, or ORM objects with a property
that returns some custom list), it's just a wart of implementation that
the template authors shouldn't have to see. And there isn't a
technical reason for it either - it should be relatively easy to make
it go away.
> And there isn't a technical reason for it either - it should be
> relatively easy to make it go away.
The attached patch does so - the .all() call is now optional on related
objects (it's still there due to inheritance, but I've fixed it so it
does the right thing). The patch is probably pretty rough, but it
passes all the tests that were passing before (i.e. all but 4), and
I've removed all (or almost all I think) instances of .all() on related
objects in the tests. I don't have any more time to work on this
though!
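Though the patch itself is attached to the original mail, the core idea might be sketched like this (class and method names are made up for illustration): if the related-objects attribute is itself iterable, delegating to the underlying query, then templates can say camp.leaders rather than camp.leaders.all.

```python
class RelatedObjects:
    def __init__(self, rows):
        self._rows = rows

    def get_query_set(self):
        # stand-in for building the real related QuerySet
        return list(self._rows)

    def __iter__(self):
        # delegating iteration makes the .all() call optional
        return iter(self.get_query_set())

leaders = RelatedObjects(['ann', 'bob'])
names = [name for name in leaders]   # no .all() needed
```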
> The attached patch does so - the .all() call is now optional on
> related objects (it's still there due to inheritance, but I've fixed
> it so it does the right thing).
I looked at the code a bit more carefully, and couldn't let this lie -
the attached patch is a much cleaner implementation than the first, and
doesn't allow .all() on the related objects, since it no longer
inherits from Manager. As before, it's probably still rough around the
edges. All tests still passing though.
The name 'objects' was predicated on the manager being a query set and
really makes very little sense anymore. It isn't a collection of
objects, it's just some arbitrary thing that happens to have a method
that returns a query set... right?
FWIW I also find 'all' a bit ugly.
It's not so bad in Python. My objection is the point that Luke has
raised: template authors shouldn't be bothered by it. It is such a
common operation in Django templates to want to iterate over a set of
objects that it seems worth making the syntax as nice as possible.
Kieran
> This change, at a glance, looks good. Go ahead and commit it.
> cached vs. non-cached query sets. The way it's currently handled in
> magic-removal is great, and a fine improvement. Let's move forward.
I agree with this 100%... (it can happen!)
After your commit, I updated my patch that removes '.all()' for related
objects, to take account of these changes -- it's attached.
Regards,
Luke
--
"Making it up? Why should I want to make anything up? Life's bad enough
as it is without wanting to invent any more of it." (Marvin the
paranoid android)