prefetch_related - new feature suggestion

Luke Plant

unread,

Sep 26, 2011, 9:50:46 PM9/26/11

to django-d...@googlegroups.com

Hi all,

I finally got fed up with the common performance problem of ending up
doing O(n) DB queries when you need the (many) related objects of a list
of objects, and did something about it:

https://code.djangoproject.com/ticket/16937

I'm pretty sure the concept is something desirable, the problem is the
implementation.

It is much uglier than I'd like, but I think the current way that
QuerySets/Managers/related managers interact make this inevitable.

The current behaviour also makes any kind of 3rd party monkey patching
solution extremely hard. For example, the related object descriptors
make it pretty hard to decorate instances of the queryset with some
proxy objects, because they intercept any assignment of attributes of
the same name. (i.e. if you have foo.bar_set, you can't do foo.bar_set
== some_precalculated_objects as a performance hack). And the way the
descriptors work mean that you get new manager objects (even new manager
*classes*) every time you access the property, and this really works
against any kind of monkey patching.

If you want to implement this outside core, and you don't want to change
templates and other code, I think you'd have to create a proxy for the
entire model instance, and I imagine it would be a total pain.

So, given how common this performance problem is, and how hard it is to
implement outside core, I think we should consider even an ugly
implementation. I think it is also inevitable that it is going to
involve QuerySet/related managers hacking each other's internals, or
duplicating to some extent. In my implementation, I put most of the
ugliness into QuerySet, and a few hooks into related managers. It might
be possible to clean it up a bit.

There is also a fairly nasty use of .extra() which is needed so that
after we retrieve M2M related objects in bulk we know which row on the
primary table they refer to.

I've got tests and docs in the patch on the ticket. I would say it is
more than proof-of-concept at this stage, but probably less than
finished patch - I may have forgotten some cases that we need to cover.

Feedback welcome!

Regards,

Luke

--
"I spilled spot remover on my dog. Now he's gone." (Steven Wright)

Luke Plant || http://lukeplant.me.uk/

Alex Gaynor

unread,

Sep 26, 2011, 10:23:50 PM9/26/11

to django-d...@googlegroups.com

--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-d...@googlegroups.com.
To unsubscribe from this group, send email to django-develop...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en.

I'm not a fan of this, for a few reasons, firstly: because it feels wrong for a QuerySet to execute multiple queries. This isn't a deal breaker, just something that struck my conceptually initially. Second I disagree that it's difficult to do outside of core, it's not as easy as it ought to be, but it's very possible (citation: I've done it :)). Finally (for now ;)) it doesn't feel right for something inside core to have caveats like "only works if you use .all()", there's a very good technical reason for this, but something about it is irking me the wrong way.

Needs more thought by me'ly yours,

Alex

--
"I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire)
"The people's good is the highest law." -- Cicero

Luke Plant

unread,

Sep 26, 2011, 11:18:04 PM9/26/11

to django-d...@googlegroups.com

On 27/09/11 03:23, Alex Gaynor wrote:

> I'm not a fan of this, for a few reasons, firstly: because it feels
> wrong for a QuerySet to execute multiple queries. This isn't a deal
> breaker, just something that struck my conceptually initially. Second I
> disagree that it's difficult to do outside of core, it's not as easy as
> it ought to be, but it's very possible (citation: I've done it :)).

Would you like to share your solution? I found it pretty difficult to
come up with anything that:

1) could be done on a per-query basis and
2) didn't require changes to the code that would use the QuerySet
objects i.e. fully API compatible.

The one avenue I didn't explore yet was proxying the entire model
instance, which I'm sure would work, but could have lots of annoying
corner cases with Python magic methods etc.

koenb

unread,

Sep 27, 2011, 5:33:38 AM9/27/11

to Django developers

On 27 sep, 05:18, Luke Plant <L.Plant...@cantab.net> wrote:
> On 27/09/11 03:23, Alex Gaynor wrote:
>
> Would you like to share your solution? I found it pretty difficult to
> come up with anything that:
>
> 1) could be done on a per-query basis and
> 2) didn't require changes to the code that would use the QuerySet
> objects i.e. fully API compatible.
>
> The one avenue I didn't explore yet was proxying the entire model
> instance, which I'm sure would work, but could have lots of annoying
> corner cases with Python magic methods etc.
>
> Luke
>

I once created django-selectreverse [1] to do something alike, but
that did require changing your code (no API compat.) and was also
somewhat limited.

It would be very nice to have something in core for this.

Koen

1: http://code.google.com/p/django-selectreverse/

Luke Plant

unread,

Sep 27, 2011, 8:13:04 AM9/27/11

to django-d...@googlegroups.com

Hi Alex,

Just replying to your other objections, now I've had some time to think
about them:

On 27/09/11 03:23, Alex Gaynor wrote:

> I'm not a fan of this, for a few reasons, firstly: because it feels
> wrong for a QuerySet to execute multiple queries. This isn't a deal
> breaker, just something that struck my conceptually initially.

For me, QuerySet is at a level of abstraction where I don't think it
guarantees a single query. We have quite a bit of precedent here I think:

- some methods produce new QuerySets that only execute queries
when you iterate over them.

- some methods execute queries immediately, like get(), exists() and
count(), and *some* slicing operations.

- of the latter type, some are optimised to *not* to do queries if
we can get the answer from the result cache. (This is not even
explicitly documented).

So I think QuerySet is at a level where it is expected to give you some
more intelligent, optimised tools for executing queries. The actual
number of queries executed is non-trivial to work out by looking at the
code in front of you, but the documentation, combined with analysis of
how the QuerySet was built will give you an upper bound.

You can also consider select_related() in this light - code as simple as
'map(unicode, queryset)' can do 1 query or N queries, depending on how
the QuerySet was constructed.

> Finally (for now ;)) it doesn't feel right for something inside core to
> have caveats like "only works if you use .all()", there's a very good
> technical reason for this, but something about it is irking me the wrong
> way.

The 'only works if you use all()' bit is nothing other than the normal
behaviour of QuerySets - if you evaluate one QuerySet and fill its
result cache, those results will not be used by any QuerySets based off
the first via filter() etc. I probably worded it badly. In the
implementation, I only modified the RelatedManager.all() methods because
if you are doing anything else, you will end up with a clone that has an
unfilled QuerySet anyway (if I'm thinking correctly), which is
presumably what you want if you do that.

For me, the most annoying limitation is the 'single depth' thing. I
think it is entirely possible to extend this work and fix that, but that
can be added later as an enhancement and the single level version is a
useful enough optimisation in its own right.

Peter

unread,

Sep 27, 2011, 5:37:06 PM9/27/11

to Django developers

I'd just like to chime in to say this should definitely be part of
core - it's a common requirement, and whilst it could be a third party
app, it certainly feels much more at home in core.

On Sep 27, 1:13 pm, Luke Plant <L.Plant...@cantab.net> wrote:
> For me, QuerySet is at a level of abstraction where I don't think it
> guarantees a single query. We have quite a bit of precedent here I think:

I also agree on this - many ORMs give the option to fetch objects
using a separate select (I'm thinking for example hibernates
"FetchMode.SELECT"). I think if you explicitly use this then you
probably need to understand what it does like anything else.

I've only had a brief look at the patch, but no doubt even if it needs
some polish it'll be a good start!

Johannes Dollinger

unread,

Sep 27, 2011, 7:07:06 PM9/27/11

to django-d...@googlegroups.com

Am 27.09.2011 um 05:18 schrieb Luke Plant:

> On 27/09/11 03:23, Alex Gaynor wrote:
>
>> I'm not a fan of this, for a few reasons, firstly: because it feels
>> wrong for a QuerySet to execute multiple queries. This isn't a deal
>> breaker, just something that struck my conceptually initially. Second I
>> disagree that it's difficult to do outside of core, it's not as easy as
>> it ought to be, but it's very possible (citation: I've done it :)).
>
> Would you like to share your solution? I found it pretty difficult to
> come up with anything that:
>
> 1) could be done on a per-query basis and
> 2) didn't require changes to the code that would use the QuerySet
> objects i.e. fully API compatible.

I don't believe (2) is an important requirement. It may even be counter-productive.
For me, the ideal solution would allow to filter or annotate the prefetched QuerySet – and thus, obj.relmanager.all() would be a bad API to access the prefetched objects.
I'd like to propose something that works more like annotations:

>>> Foo.objects.prefetch_related(related_list=Foo.related_objects.filter(x__gt=42))

This would require related object descriptors to return something like a ParametrizedQuerySet object when accessed via the model class, e.g.:

>>> Foo.related_set.filter(x__gt=42).apply(*foo_objects)
# equivalent to: Related.objects.filter(foo__in= foo_objects).filter(x__gt=42)

Or, simpler, but less generic:

>>> Foo.related_set.filter(x__gt=42)
'foo', Related.objects.filter(x__gt=42)

This syntax would also make it somewhat obvious that you're going to use more than one db query.
Finally, and to support Alex point, if you drop your second requirement, it's well possible to implement this (modulo my proposed API) with the public QuerySet API + some introspection.

__
Johannes

Luke Plant

unread,

Sep 28, 2011, 3:58:25 AM9/28/11

to django-d...@googlegroups.com

On 28/09/11 00:07, Johannes Dollinger wrote:
>
> Am 27.09.2011 um 05:18 schrieb Luke Plant:
>
>> 1) could be done on a per-query basis and
>> 2) didn't require changes to the code that would use the QuerySet
>> objects i.e. fully API compatible.
>
> I don't believe (2) is an important requirement. It may even be
> counter-productive.

If you are trying to improve the performance of existing code, it's
quite an important requirement. It's analogous to select_related - if
you had to use different code in order to get the benefits from
select_related it wouldn't be nearly so useful. Template authors
shouldn't have to learn a new set of attributes in order to benefit from
either of these performance related APIs.

> For me, the ideal solution would allow to filter or annotate the
> prefetched QuerySet – and thus, obj.relmanager.all() would be a bad
> API to access the prefetched objects. I'd like to propose something
> that works more like annotations:

This is a much bigger feature, and introduces a whole bunch of changes
and additions which I don't think are realistic until we've got basic
functionality in place.

Regards,

Florian Apolloner

unread,

Sep 29, 2011, 4:40:52 PM9/29/11

to django-d...@googlegroups.com

On Tuesday, September 27, 2011 11:37:06 PM UTC+2, Peter wrote:

I'd just like to chime in to say this should definitely be part of
core - it's a common requirement, and whilst it could be a third party
app, it certainly feels much more at home in core.

+1, especially if it works with pagination (which is often not the case for 3rd party apps which fall back to raw sql or so…)

On Sep 27, 1:13 pm, Luke Plant <L.Pla...@cantab.net> wrote:
> For me, QuerySet is at a level of abstraction where I don't think it
> guarantees a single query. We have quite a bit of precedent here I think:

I also agree on this - many ORMs give the option to fetch objects
using a separate select (I'm thinking for example hibernates
"FetchMode.SELECT"). I think if you explicitly use this then you
probably need to understand what it does like anything else.

Jupp, btw don't we even already have some cases where eg assertNumQueries fails since mysql executes more queries? (I remember something like that on #django-dev -- or I am completely dumb :))

Cheers,
Florian

Alex Gaynor

unread,

Sep 29, 2011, 4:43:53 PM9/29/11

to django-d...@googlegroups.com

--

You received this message because you are subscribed to the Google Groups "Django developers" group.

To view this discussion on the web visit https://groups.google.com/d/msg/django-developers/-/cQKYc4OPQaAJ.

To post to this group, send email to django-d...@googlegroups.com.
To unsubscribe from this group, send email to django-develop...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en.

When I did this externally a number of years ago, I basically subclassed ManyToManyField, overrode a bunch of code (quite a bit of copy paste as I recall), and it's related manager and made it return a custom queryset, which used a cache off of the object, which was populated by another custom queryset which overrode iterator() in order to do the bulk loading.

Luke Plant

unread,

Sep 30, 2011, 6:43:02 AM9/30/11

to django-d...@googlegroups.com

On 29/09/11 21:40, Florian Apolloner wrote:
>
>
> On Tuesday, September 27, 2011 11:37:06 PM UTC+2, Peter wrote:
>
> I'd just like to chime in to say this should definitely be part of
> core - it's a common requirement, and whilst it could be a third party
> app, it certainly feels much more at home in core.
>
>
> +1, especially if it works with pagination (which is often not the case
> for 3rd party apps which fall back to raw sql or so…)

It does work fine with pagination because it jumps into action only when
the QuerySet is evaluated, and then proceeds only via forcing evaluation
of the existing query.

The patch is going to need reworking, especially after some more
cleanups/changes have just been committed. I think the result will have
to be cleaner, with more explicit support in the related manager code
and better separation of concerns.

I also have a case where I really need it more than single depth, which
may motivate me to solve the problem more fully - but don't know when I
will get time for that!

Luke

--
"I think you ought to know I'm feeling very depressed." (Marvin the
paranoid android)

Luke Plant || http://lukeplant.me.uk/

Luke Plant

unread,

Sep 30, 2011, 10:42:17 PM9/30/11

to django-d...@googlegroups.com

On 29/09/11 21:43, Alex Gaynor wrote:

> When I did this externally a number of years ago, I basically subclassed
> ManyToManyField, overrode a bunch of code (quite a bit of copy paste as
> I recall), and it's related manager and made it return a custom
> queryset, which used a cache off of the object, which was populated by
> another custom queryset which overrode iterator() in order to do the
> bulk loading.

I've had another go, and I think it should be a lot more convincing now.

https://code.djangoproject.com/attachment/ticket/16937/prefetch_3.diff

It supports arbitrary depth, so you can do things like:

User.objects.prefetch_related('groups__permissions')

and it will also traverse singly-related objects (foreign key and
one-to-one) in the chain.

The implementation is significantly nicer, as it splits up
responsibilities between QuerySet and the related managers the way it
ought to be. One piece of evidence that the separation is good is that I
implemented support for GenericRelations as a final step, and did so
without touching the core 'prefetch_related' code at all - I simply
added some methods to the manager produced by GenericRelation. This also
means that the implementation could provide support for things like
GenericRelation but in 3rd party apps.

The implementation can also cope with the presence of a
prefetch_related() fields already on any prefetched QuerySets (which can
happen if a default manager has used prefetch_related), and will simply
fold in the work to optimize the queries.

With these things in place, it was literally a couple of lines to take
one of my admin pages from 176 queries to 8.

I think the combination of features here makes it a *very* compelling
alternative to doing it outside core. Have I convinced you Alex? :-)

I would like at some point to tackle the ability to prefetch objects
with custom filters, as discussed on the ticket, rather than just the
'all()' case. However, that can wait, and I'd like some people to have a
bash on this and find the bugs. It would be really nice if this could
get in before the 1.4 alpha, because it has turned out to be a pretty
neat feature I think.

Regards,

Luke

--
"It is a truth universally acknowledged, that a single man in
possession of a good fortune, must be in want of a wife." (Jane
Austen)

Luke Plant || http://lukeplant.me.uk/

drakkan

unread,

Oct 1, 2011, 6:42:08 AM10/1/11

to Django developers

This look great! I think this approach is better than the one's used
by other orms that force a single, heavy, query to retrieve the whole
many to many object graphs, would be nice to see it in 1.4

thanks
Nicola

Luke Plant

unread,

Oct 3, 2011, 11:04:44 AM10/3/11

to django-d...@googlegroups.com

Hi all,

The patch for this is now ready, as far as I'm concerned, but I'd like
to bring it up here again before committing, mainly because Alex Gaynor
expressed some doubts.

The latest patch is on the ticket:

https://code.djangoproject.com/ticket/16937

It is much longer than before, but mainly because of docs and tests.

Are there any objections to this? Anssi (akaariai) has done a lot of
review of the code, added a bunch of tests and contributed some big
performance improvements (although more could still be done), and
various bugs have been ironed out. To summarise, it now:

- supports related objects of related objects, to arbitrary depth

- supports GenericRelation

- fully supports non-autocreated 'through' models, which
can have FK to non-PK fields.

- works as expected with .exists() and .count(), as well
as .all(), plus any other QuerySet method that gets its
results from a populated result cache if available.

All of this means that you can see big performance improvements simply
by adding a single prefetch_related() call at the right place.

Anssi is planning further extensions, but I think these should wait
until the main functionality has been adopted.

Regards,

Luke

--
"I washed a sock. Then I put it in the dryer. When I took it out,
it was gone." (Steven Wright)

Luke Plant || http://lukeplant.me.uk/

Carl Meyer

unread,

Oct 3, 2011, 3:44:28 PM10/3/11

to Django developers

Hi Luke,

On Oct 3, 9:04 am, Luke Plant <L.Plant...@cantab.net> wrote:
> The patch for this is now ready, as far as I'm concerned, but I'd like
> to bring it up here again before committing, mainly because Alex Gaynor
> expressed some doubts.
>
> The latest patch is on the ticket:
>
> https://code.djangoproject.com/ticket/16937

I know I'm a bit late to the party here (had a work deadline last
Friday). I think this functionality is valuable and makes sense to
have in the ORM, thanks for all the work on it. Though I don't have
time to go through the patch in detail, it seems like its gotten
several pairs of eyes over it closely already.

My only real concern is one I'm a bit surprised hasn't been raised
already: API and naming sanity. If I'm a new user coming to this API,
as far as I'm concerned "select_related" and "prefetch_related" may as
well be synonyms - they don't tell me anything useful about how the
functions differ. And even knowing the internal details, any
justification for the naming seems tortured at best: both methods use
selects, and both prefetch something that would otherwise have to be
fetched separately later.

Given that this and select_related operate on entirely disjoint sets
of fields, is there a good technical reason not to simply merge this
functionality into select_related? Sure, the underlying implementation
is quite different, but philosophically the ORM doesn't generally aim
to be a transparent reflection of what's happening at the SQL layer,
it aims to present a coherent object API. At that level I think
select_related and prefetch_related are pretty much indistinguishable
from each other. A note in the docs that using select_related on an
m2m will result in an extra query (but many fewer than if you didn't
use it and iterated over the m2m related objects for each row!) seems
to me a quite adequate nod to the implementation difference.

As the builder of the bikeshed, you get to paint it, so if you've
already considered all this and haven't come up with any better API
naming options, I'll withdraw my concern. But it seems to me it's at
least worth some public discussion before locking ourselves into an
API.

Carl

Anssi Kääriäinen

unread,

Oct 3, 2011, 4:11:10 PM10/3/11

to Django developers

On Oct 3, 6:04 pm, Luke Plant <L.Plant...@cantab.net> wrote:
> Hi all,
>
> The patch for this is now ready, as far as I'm concerned, but I'd like
> to bring it up here again before committing, mainly because Alex Gaynor
> expressed some doubts.

I have done a review of the patch, and I think it is commit ready.
There has been some bug squashing, but I feel it is very good at the
moment. Hope to see it soon in trunk!

About merging prefetch_related and select_related: That does make
sense. APIs are not my strongest area of expertise, so I wont try to
discuss about this more.

> Anssi is planning further extensions, but I think these should wait
> until the main functionality has been adopted.

The real reason for this post is that I would like to talk already at
this point about the extensions I have planned and also about the API
for the extensions. There are two extension currently planned:
- Allow fetching into different variable than .related_manager.all(),
done by:
Book.objects.prefetch_related(R('authors',
to_attr='authors_prefetched'))
- Allow usage of custom queryset
qs = Book.objects.prefetch_related(
R('authors', to_attr='young_authors',
qs=Author.objects.filter(age__lte=30)).order_by('-age')
)
Now qs[0].young_authors will have all the authors of the first book in
the qs ordered by age.

There is also the possibility to use prefetched objects in a chain:

qs = Author.objects.prefetch_related(
R('books', to_attr='django_books',
qs=Book.objects.filter(name__icontains='django')),
R('django_books__authors', to_attr='young_authors',
qs=Author.objects.filter(age__lte=30))
)
Now qs[0].django_books will have all the books about Django written by
the author, and qs[0].django_books[0].young_authors will have all the
young authors of the first django book in django_books.

Using this feature it would be pretty easy to fetch discussion threads
and for each thread posts made today in two SQL queries by just
issuing .prefetch_related. Or even fetch forums, today's threads, and
all the posts made today per thread in three SQL queries, again by
just issuing .prefetch_related.

Fetching latest n posts per thread is hard to do efficiently in SQL
but that would be a killer feature. Window functions can do that, but
they are not efficient (at least not in PostgreSQL) and not available
in SQLite or MySQL.

But I am (once again) getting carried away. There is a working patch
in the ticket mentioned above implementing the features. Although it
is a bit stale compared to the latest work in the ticket. I will
create a separate ticket for this when prefetch_related lands trunk.

I hope to get the extensions included in 1.4. When is feature freeze
coming?

Thoughts?

- Anssi

Anssi Kääriäinen

unread,

Oct 3, 2011, 5:31:24 PM10/3/11

to Django developers

On Oct 3, 10:44 pm, Carl Meyer <c...@oddbird.net> wrote:
> Hi Luke,
>
> On Oct 3, 9:04 am, Luke Plant <L.Plant...@cantab.net> wrote:
>
> > The patch for this is now ready, as far as I'm concerned, but I'd like
> > to bring it up here again before committing, mainly because Alex Gaynor
> > expressed some doubts.
>
> > The latest patch is on the ticket:
>
> >https://code.djangoproject.com/ticket/16937
>
> I know I'm a bit late to the party here (had a work deadline last
> Friday). I think this functionality is valuable and makes sense to
> have in the ORM, thanks for all the work on it. Though I don't have
> time to go through the patch in detail, it seems like its gotten
> several pairs of eyes over it closely already.
>
> My only real concern is one I'm a bit surprised hasn't been raised
> already: API and naming sanity. If I'm a new user coming to this API,
> as far as I'm concerned "select_related" and "prefetch_related" may as
> well be synonyms - they don't tell me anything useful about how the
> functions differ. And even knowing the internal details, any
> justification for the naming seems tortured at best: both methods use
> selects, and both prefetch something that would otherwise have to be
> fetched separately later.

One convincing reason for merging select_related and prefetch_related
is that if you have a setup like:
Author -> foreign key: favorite_book -> Book -> m2m: authors -> Author

If you do Author.objects.prefetch_related('favorite_book__authors')
without issuing .select_related you will currently run a query for
each author to fetch his favorite book. Then the authors would be
fetched in a single query using prefetch. This is of course not what
the user wants. If the above prefetch_related call results in
automatic select_related, it would be a good idea to have the method
called .select_related.

The only problem I can think of is that if you for some reason want to
clear prefetch_related but not select_related you can't do that. But
that is not too bad of a problem.

- Anssi

Luke Plant

unread,

Oct 3, 2011, 9:48:39 PM10/3/11

to django-d...@googlegroups.com

Hi Carl,

Thanks for the feedback, sorry for the length of my reply, I tried to be
detailed to avoid you having to read the code.

On 03/10/11 20:44, Carl Meyer wrote:

> My only real concern is one I'm a bit surprised hasn't been raised
> already: API and naming sanity. If I'm a new user coming to this API,
> as far as I'm concerned "select_related" and "prefetch_related" may as
> well be synonyms - they don't tell me anything useful about how the
> functions differ. And even knowing the internal details, any
> justification for the naming seems tortured at best: both methods use
> selects, and both prefetch something that would otherwise have to be
> fetched separately later.

To me, 'prefetch' does suggest a separate phase of fetching, whereas
'select' does suggest we are including some other things within a SQL
select. I do think they need to be separate methods, for reasons given
below, and if anyone has a better name for 'prefetch_related' I'm all ears.

> Given that this and select_related operate on entirely disjoint sets
> of fields, is there a good technical reason not to simply merge this
> functionality into select_related? Sure, the underlying implementation
> is quite different, but philosophically the ORM doesn't generally aim
> to be a transparent reflection of what's happening at the SQL layer,
> it aims to present a coherent object API. At that level I think
> select_related and prefetch_related are pretty much indistinguishable
> from each other. A note in the docs that using select_related on an
> m2m will result in an extra query (but many fewer than if you didn't
> use it and iterated over the m2m related objects for each row!) seems
> to me a quite adequate nod to the implementation difference.

Given that both these methods exist solely for performance reasons, I
don't think it is necessarily a good thing to attempt to hide what kind
of thing is actually going on. The developer who chooses to add either
of these needs to know the typical performance characteristics that are
invoked by choosing one or the other, so I don't think it actually helps
to hide this. Yes, at the higher level there isn't much distinction, but
if you are using either method you have already abandoned the higher
level, because you are worrying about exactly what you are asking the
database to do, and we should make things easy for such people.

> As the builder of the bikeshed, you get to paint it, so if you've
> already considered all this and haven't come up with any better API
> naming options, I'll withdraw my concern. But it seems to me it's at
> least worth some public discussion before locking ourselves into an
> API.

There are some quite big semantic and technical differences between the
two that really make merging them a bad idea IMO.

First, prefetch_related causes a QuerySet to effectively lose the
benefits of chunking that normally happen, because in order to get the
next level we have to get all the PKs (or whatever) from the first
level. This means we must fully populate the cache as soon as the
QuerySet is evaluated, so that we can do prefetching before any of the
instances are used. This is a documented side effect.

But we really don't want to do that if select_related is called, since
there is no need. Documenting the side effect would be harder if we
merged the two methods, and isolating the side effect could be possibly
much harder.

We cannot work out from the fields specified whether the user means
'prefetch_related' or 'select_related', because prefetch_related works
after the query is evaluated, and *has* to if we want it to be as
powerful as it is, and select_related works before the query is
evaluated, and *has* to.

For example, prefetch_related supports stepping over *any* singly
related object, including those from properties - see:

https://code.djangoproject.com/attachment/ticket/16937/prefetch_7.diff

tests/modeltests/prefetch_related/models.py L141
tests/modeltests/prefetch_related/tests.py L392

The support for GenericForeignKey also works this way.

I've also written prefetch_related to hopefully be usable by 3rd party
'many-to-many-like' relationships, as might be provided by something
like TaggableManager in django-taggit. The support for GenericRelation
works this way - there is no specific support for anything in
contrib.content_type in the prefetch code - instead the GenericRelation
manager code just implements the same API that the core related manager
code implements.

So prefetch_related works entirely by traversing attributes, and seeing
if it can find things that quack like a ManyRelatedManager, and I think
this is a strength of the implementation.

But we don't know what attributes are available, or what type of thing
they return, before we construct the objects. This makes is pretty hard
to decide whether something is a prefetch_related thing or
select_related thing. I guess we could assume that everything that we
can't interpret as a 'select_related' is a 'prefetch_related'. That's
not currently particularly easy, as the lower-level code supporting
select_related at the moment silently swallows bad fields.

But even if this were fixed, I think it would make the whole API more
difficult to predict - you would have to say 'if your lookups include
any of these elements, you get this behaviour and these side-effects,
otherwise this other set of behaviour'. And predicting behaviour is
crucial here, because controlling the number and nature of the queries
is actually the whole reason these methods exist.

Finally, there is also the fact that 'select_related' currently has API
issues of its own that would make merging very hard. For instance, what
would 'depth=N' mean? Does it now include m2m? That could be a *massive*
performance regression in *many* cases. There are issues about
chaining/clearing select_related() which would only be confused by
adding prefetch_related to the mix. In fact, in some of my own code that
I've adapted to use prefetch_related, I've already come across examples
where I want to clear prefetch_related behaviour, and leave
select_related alone. This makes me think that merging the APIs, at
least from the current starting point, would be a bad move.

Regards,

Luke

--
"I was sad because I had no shoes, until I met a man who had no
feet. So I said, "Got any shoes you're not using?" (Steven Wright)

Luke Plant || http://lukeplant.me.uk/

Carl Meyer

unread,

Oct 3, 2011, 10:20:53 PM10/3/11

to django-d...@googlegroups.com

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Luke,

Thanks for the thorough reply! I'm convinced that it wouldn't make sense
to merge select_related and prefetch_related; I wish I had a better
suggestion for a name for prefetch_related, but I don't. Consider my
concern withdrawn.

Carl
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk6KbYUACgkQ8W4rlRKtE2cUAQCeMBOXsB+m7olr0IECRGaTFy4/
liMAoMYGJJYllcptL40PsOMdJK5Kzgdr
=C1sv
-----END PGP SIGNATURE-----

Reply all

Reply to author

Forward