I finally got fed up with the common performance problem of ending up
doing O(n) DB queries when you need the (many) related objects of a list
of objects, and did something about it:
https://code.djangoproject.com/ticket/16937
I'm pretty sure the concept is something desirable, the problem is the
implementation.
It is much uglier than I'd like, but I think the current way that
QuerySets/Managers/related managers interact make this inevitable.
The current behaviour also makes any kind of 3rd party monkey patching
solution extremely hard. For example, the related object descriptors
make it pretty hard to decorate instances of the queryset with some
proxy objects, because they intercept any assignment of attributes of
the same name. (i.e. if you have foo.bar_set, you can't do foo.bar_set
== some_precalculated_objects as a performance hack). And the way the
descriptors work mean that you get new manager objects (even new manager
*classes*) every time you access the property, and this really works
against any kind of monkey patching.
If you want to implement this outside core, and you don't want to change
templates and other code, I think you'd have to create a proxy for the
entire model instance, and I imagine it would be a total pain.
So, given how common this performance problem is, and how hard it is to
implement outside core, I think we should consider even an ugly
implementation. I think it is also inevitable that it is going to
involve QuerySet/related managers hacking each other's internals, or
duplicating to some extent. In my implementation, I put most of the
ugliness into QuerySet, and a few hooks into related managers. It might
be possible to clean it up a bit.
There is also a fairly nasty use of .extra() which is needed so that
after we retrieve M2M related objects in bulk we know which row on the
primary table they refer to.
I've got tests and docs in the patch on the ticket. I would say it is
more than proof-of-concept at this stage, but probably less than
finished patch - I may have forgotten some cases that we need to cover.
Feedback welcome!
Regards,
Luke
--
"I spilled spot remover on my dog. Now he's gone." (Steven Wright)
Luke Plant || http://lukeplant.me.uk/
--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-d...@googlegroups.com.
To unsubscribe from this group, send email to django-develop...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en.
> I'm not a fan of this, for a few reasons, firstly: because it feels
> wrong for a QuerySet to execute multiple queries. This isn't a deal
> breaker, just something that struck my conceptually initially. Second I
> disagree that it's difficult to do outside of core, it's not as easy as
> it ought to be, but it's very possible (citation: I've done it :)).
Would you like to share your solution? I found it pretty difficult to
come up with anything that:
1) could be done on a per-query basis and
2) didn't require changes to the code that would use the QuerySet
objects i.e. fully API compatible.
The one avenue I didn't explore yet was proxying the entire model
instance, which I'm sure would work, but could have lots of annoying
corner cases with Python magic methods etc.
Just replying to your other objections, now I've had some time to think
about them:
On 27/09/11 03:23, Alex Gaynor wrote:
> I'm not a fan of this, for a few reasons, firstly: because it feels
> wrong for a QuerySet to execute multiple queries. This isn't a deal
> breaker, just something that struck my conceptually initially.
For me, QuerySet is at a level of abstraction where I don't think it
guarantees a single query. We have quite a bit of precedent here I think:
- some methods produce new QuerySets that only execute queries
when you iterate over them.
- some methods execute queries immediately, like get(), exists() and
count(), and *some* slicing operations.
- of the latter type, some are optimised to *not* to do queries if
we can get the answer from the result cache. (This is not even
explicitly documented).
So I think QuerySet is at a level where it is expected to give you some
more intelligent, optimised tools for executing queries. The actual
number of queries executed is non-trivial to work out by looking at the
code in front of you, but the documentation, combined with analysis of
how the QuerySet was built will give you an upper bound.
You can also consider select_related() in this light - code as simple as
'map(unicode, queryset)' can do 1 query or N queries, depending on how
the QuerySet was constructed.
> Finally (for now ;)) it doesn't feel right for something inside core to
> have caveats like "only works if you use .all()", there's a very good
> technical reason for this, but something about it is irking me the wrong
> way.
The 'only works if you use all()' bit is nothing other than the normal
behaviour of QuerySets - if you evaluate one QuerySet and fill its
result cache, those results will not be used by any QuerySets based off
the first via filter() etc. I probably worded it badly. In the
implementation, I only modified the RelatedManager.all() methods because
if you are doing anything else, you will end up with a clone that has an
unfilled QuerySet anyway (if I'm thinking correctly), which is
presumably what you want if you do that.
For me, the most annoying limitation is the 'single depth' thing. I
think it is entirely possible to extend this work and fix that, but that
can be added later as an enhancement and the single level version is a
useful enough optimisation in its own right.
> On 27/09/11 03:23, Alex Gaynor wrote:
>
>> I'm not a fan of this, for a few reasons, firstly: because it feels
>> wrong for a QuerySet to execute multiple queries. This isn't a deal
>> breaker, just something that struck my conceptually initially. Second I
>> disagree that it's difficult to do outside of core, it's not as easy as
>> it ought to be, but it's very possible (citation: I've done it :)).
>
> Would you like to share your solution? I found it pretty difficult to
> come up with anything that:
>
> 1) could be done on a per-query basis and
> 2) didn't require changes to the code that would use the QuerySet
> objects i.e. fully API compatible.
I don't believe (2) is an important requirement. It may even be counter-productive.
For me, the ideal solution would allow to filter or annotate the prefetched QuerySet – and thus, obj.relmanager.all() would be a bad API to access the prefetched objects.
I'd like to propose something that works more like annotations:
>>> Foo.objects.prefetch_related(related_list=Foo.related_objects.filter(x__gt=42))
This would require related object descriptors to return something like a ParametrizedQuerySet object when accessed via the model class, e.g.:
>>> Foo.related_set.filter(x__gt=42).apply(*foo_objects)
# equivalent to: Related.objects.filter(foo__in= foo_objects).filter(x__gt=42)
Or, simpler, but less generic:
>>> Foo.related_set.filter(x__gt=42)
'foo', Related.objects.filter(x__gt=42)
This syntax would also make it somewhat obvious that you're going to use more than one db query.
Finally, and to support Alex point, if you drop your second requirement, it's well possible to implement this (modulo my proposed API) with the public QuerySet API + some introspection.
__
Johannes
If you are trying to improve the performance of existing code, it's
quite an important requirement. It's analogous to select_related - if
you had to use different code in order to get the benefits from
select_related it wouldn't be nearly so useful. Template authors
shouldn't have to learn a new set of attributes in order to benefit from
either of these performance related APIs.
> For me, the ideal solution would allow to filter or annotate the
> prefetched QuerySet – and thus, obj.relmanager.all() would be a bad
> API to access the prefetched objects. I'd like to propose something
> that works more like annotations:
This is a much bigger feature, and introduces a whole bunch of changes
and additions which I don't think are realistic until we've got basic
functionality in place.
Regards,
I'd just like to chime in to say this should definitely be part of
core - it's a common requirement, and whilst it could be a third party
app, it certainly feels much more at home in core.
On Sep 27, 1:13 pm, Luke Plant <L.Pla...@cantab.net> wrote:
> For me, QuerySet is at a level of abstraction where I don't think it
> guarantees a single query. We have quite a bit of precedent here I think:
I also agree on this - many ORMs give the option to fetch objects
using a separate select (I'm thinking for example hibernates
"FetchMode.SELECT"). I think if you explicitly use this then you
probably need to understand what it does like anything else.
--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To view this discussion on the web visit https://groups.google.com/d/msg/django-developers/-/cQKYc4OPQaAJ.
To post to this group, send email to django-d...@googlegroups.com.
To unsubscribe from this group, send email to django-develop...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en.
It does work fine with pagination because it jumps into action only when
the QuerySet is evaluated, and then proceeds only via forcing evaluation
of the existing query.
The patch is going to need reworking, especially after some more
cleanups/changes have just been committed. I think the result will have
to be cleaner, with more explicit support in the related manager code
and better separation of concerns.
I also have a case where I really need it more than single depth, which
may motivate me to solve the problem more fully - but don't know when I
will get time for that!
Luke
--
"I think you ought to know I'm feeling very depressed." (Marvin the
paranoid android)
Luke Plant || http://lukeplant.me.uk/
> When I did this externally a number of years ago, I basically subclassed
> ManyToManyField, overrode a bunch of code (quite a bit of copy paste as
> I recall), and it's related manager and made it return a custom
> queryset, which used a cache off of the object, which was populated by
> another custom queryset which overrode iterator() in order to do the
> bulk loading.
I've had another go, and I think it should be a lot more convincing now.
https://code.djangoproject.com/attachment/ticket/16937/prefetch_3.diff
It supports arbitrary depth, so you can do things like:
User.objects.prefetch_related('groups__permissions')
and it will also traverse singly-related objects (foreign key and
one-to-one) in the chain.
The implementation is significantly nicer, as it splits up
responsibilities between QuerySet and the related managers the way it
ought to be. One piece of evidence that the separation is good is that I
implemented support for GenericRelations as a final step, and did so
without touching the core 'prefetch_related' code at all - I simply
added some methods to the manager produced by GenericRelation. This also
means that the implementation could provide support for things like
GenericRelation but in 3rd party apps.
The implementation can also cope with the presence of a
prefetch_related() fields already on any prefetched QuerySets (which can
happen if a default manager has used prefetch_related), and will simply
fold in the work to optimize the queries.
With these things in place, it was literally a couple of lines to take
one of my admin pages from 176 queries to 8.
I think the combination of features here makes it a *very* compelling
alternative to doing it outside core. Have I convinced you Alex? :-)
I would like at some point to tackle the ability to prefetch objects
with custom filters, as discussed on the ticket, rather than just the
'all()' case. However, that can wait, and I'd like some people to have a
bash on this and find the bugs. It would be really nice if this could
get in before the 1.4 alpha, because it has turned out to be a pretty
neat feature I think.
Regards,
Luke
--
"It is a truth universally acknowledged, that a single man in
possession of a good fortune, must be in want of a wife." (Jane
Austen)
Luke Plant || http://lukeplant.me.uk/
The patch for this is now ready, as far as I'm concerned, but I'd like
to bring it up here again before committing, mainly because Alex Gaynor
expressed some doubts.
The latest patch is on the ticket:
https://code.djangoproject.com/ticket/16937
It is much longer than before, but mainly because of docs and tests.
Are there any objections to this? Anssi (akaariai) has done a lot of
review of the code, added a bunch of tests and contributed some big
performance improvements (although more could still be done), and
various bugs have been ironed out. To summarise, it now:
- supports related objects of related objects, to arbitrary depth
- supports GenericRelation
- fully supports non-autocreated 'through' models, which
can have FK to non-PK fields.
- works as expected with .exists() and .count(), as well
as .all(), plus any other QuerySet method that gets its
results from a populated result cache if available.
All of this means that you can see big performance improvements simply
by adding a single prefetch_related() call at the right place.
Anssi is planning further extensions, but I think these should wait
until the main functionality has been adopted.
Regards,
Luke
--
"I washed a sock. Then I put it in the dryer. When I took it out,
it was gone." (Steven Wright)
Luke Plant || http://lukeplant.me.uk/
Thanks for the feedback, sorry for the length of my reply, I tried to be
detailed to avoid you having to read the code.
On 03/10/11 20:44, Carl Meyer wrote:
> My only real concern is one I'm a bit surprised hasn't been raised
> already: API and naming sanity. If I'm a new user coming to this API,
> as far as I'm concerned "select_related" and "prefetch_related" may as
> well be synonyms - they don't tell me anything useful about how the
> functions differ. And even knowing the internal details, any
> justification for the naming seems tortured at best: both methods use
> selects, and both prefetch something that would otherwise have to be
> fetched separately later.
To me, 'prefetch' does suggest a separate phase of fetching, whereas
'select' does suggest we are including some other things within a SQL
select. I do think they need to be separate methods, for reasons given
below, and if anyone has a better name for 'prefetch_related' I'm all ears.
> Given that this and select_related operate on entirely disjoint sets
> of fields, is there a good technical reason not to simply merge this
> functionality into select_related? Sure, the underlying implementation
> is quite different, but philosophically the ORM doesn't generally aim
> to be a transparent reflection of what's happening at the SQL layer,
> it aims to present a coherent object API. At that level I think
> select_related and prefetch_related are pretty much indistinguishable
> from each other. A note in the docs that using select_related on an
> m2m will result in an extra query (but many fewer than if you didn't
> use it and iterated over the m2m related objects for each row!) seems
> to me a quite adequate nod to the implementation difference.
Given that both these methods exist solely for performance reasons, I
don't think it is necessarily a good thing to attempt to hide what kind
of thing is actually going on. The developer who chooses to add either
of these needs to know the typical performance characteristics that are
invoked by choosing one or the other, so I don't think it actually helps
to hide this. Yes, at the higher level there isn't much distinction, but
if you are using either method you have already abandoned the higher
level, because you are worrying about exactly what you are asking the
database to do, and we should make things easy for such people.
> As the builder of the bikeshed, you get to paint it, so if you've
> already considered all this and haven't come up with any better API
> naming options, I'll withdraw my concern. But it seems to me it's at
> least worth some public discussion before locking ourselves into an
> API.
There are some quite big semantic and technical differences between the
two that really make merging them a bad idea IMO.
First, prefetch_related causes a QuerySet to effectively lose the
benefits of chunking that normally happen, because in order to get the
next level we have to get all the PKs (or whatever) from the first
level. This means we must fully populate the cache as soon as the
QuerySet is evaluated, so that we can do prefetching before any of the
instances are used. This is a documented side effect.
But we really don't want to do that if select_related is called, since
there is no need. Documenting the side effect would be harder if we
merged the two methods, and isolating the side effect could be possibly
much harder.
We cannot work out from the fields specified whether the user means
'prefetch_related' or 'select_related', because prefetch_related works
after the query is evaluated, and *has* to if we want it to be as
powerful as it is, and select_related works before the query is
evaluated, and *has* to.
For example, prefetch_related supports stepping over *any* singly
related object, including those from properties - see:
https://code.djangoproject.com/attachment/ticket/16937/prefetch_7.diff
tests/modeltests/prefetch_related/models.py L141
tests/modeltests/prefetch_related/tests.py L392
The support for GenericForeignKey also works this way.
I've also written prefetch_related to hopefully be usable by 3rd party
'many-to-many-like' relationships, as might be provided by something
like TaggableManager in django-taggit. The support for GenericRelation
works this way - there is no specific support for anything in
contrib.content_type in the prefetch code - instead the GenericRelation
manager code just implements the same API that the core related manager
code implements.
So prefetch_related works entirely by traversing attributes, and seeing
if it can find things that quack like a ManyRelatedManager, and I think
this is a strength of the implementation.
But we don't know what attributes are available, or what type of thing
they return, before we construct the objects. This makes is pretty hard
to decide whether something is a prefetch_related thing or
select_related thing. I guess we could assume that everything that we
can't interpret as a 'select_related' is a 'prefetch_related'. That's
not currently particularly easy, as the lower-level code supporting
select_related at the moment silently swallows bad fields.
But even if this were fixed, I think it would make the whole API more
difficult to predict - you would have to say 'if your lookups include
any of these elements, you get this behaviour and these side-effects,
otherwise this other set of behaviour'. And predicting behaviour is
crucial here, because controlling the number and nature of the queries
is actually the whole reason these methods exist.
Finally, there is also the fact that 'select_related' currently has API
issues of its own that would make merging very hard. For instance, what
would 'depth=N' mean? Does it now include m2m? That could be a *massive*
performance regression in *many* cases. There are issues about
chaining/clearing select_related() which would only be confused by
adding prefetch_related to the mix. In fact, in some of my own code that
I've adapted to use prefetch_related, I've already come across examples
where I want to clear prefetch_related behaviour, and leave
select_related alone. This makes me think that merging the APIs, at
least from the current starting point, would be a bad move.
Regards,
Luke
--
"I was sad because I had no shoes, until I met a man who had no
feet. So I said, "Got any shoes you're not using?" (Steven Wright)
Luke Plant || http://lukeplant.me.uk/
Hi Luke,
Thanks for the thorough reply! I'm convinced that it wouldn't make sense
to merge select_related and prefetch_related; I wish I had a better
suggestion for a name for prefetch_related, but I don't. Consider my
concern withdrawn.
Carl
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
iEYEARECAAYFAk6KbYUACgkQ8W4rlRKtE2cUAQCeMBOXsB+m7olr0IECRGaTFy4/
liMAoMYGJJYllcptL40PsOMdJK5Kzgdr
=C1sv
-----END PGP SIGNATURE-----