select_related() and reverse foreign keys -OR- how to I get a blog post and its comments?

3,835 views
Skip to first unread message

sphogan

unread,
Feb 23, 2009, 12:40:06 PM2/23/09
to Django users
I'm surprised that this hadn't come up for me before, but nonetheless,
here I am.

My problem is that I want to get related items (in a similar situation
to getting blog articles with their comments) without looping a second
lookup (since it would be a couple thousand items and building a page
off a couple thousand queries isn't a good idea).

Basically, the schema is like this:

Article(models.Model):
#fields for article

Comment(models.Model):
#fields for comment
article = models.ForeignKey(Article)

Simple enough. If I wanted to get all the comments and their related
article, I could do:

Comment.objects.select_related('article').all()

However, select_related() only works on the object the foreign key is
declared on (http://docs.djangoproject.com/en/dev/ref/models/querysets/
#id4; specifically, "You can only refer to ForeignKey relations in the
list of fields passed to select_related.") Presumably this is because
it does an INNER JOIN rather than a LEFT OUTER JOIN and would
therefore miss articles with no comments.

What I would like to be able to do is:

Article.objects.select_related('comment_set').all()

This seems like a somewhat simple case and there seems to be no
mention of it in the documentation anywhere.

Sean.

sphogan

unread,
Feb 23, 2009, 12:40:06 PM2/23/09
to Django users

Alex Gaynor

unread,
Feb 23, 2009, 3:22:11 PM2/23/09
to django...@googlegroups.com
That's not possible, but that doesn't really affect anything, you still have that accessor available to you, but it does an extra SQL query, that is to say select_related does not affect the data avilable to you, it is a performance optimization.

Alex

--
"I disapprove of what you say, but I will defend to the death your right to say it." --Voltaire
"The people's good is the highest law."--Cicero

Torsten Bronger

unread,
Feb 23, 2009, 3:49:15 PM2/23/09
to django...@googlegroups.com
Hallöchen!

Alex Gaynor writes:

> On Mon, Feb 23, 2009 at 12:40 PM, sphogan <sph...@gmail.com> wrote:
>

>> [...]


>>
>> However, select_related() only works on the object the foreign
>> key is declared on
>> (http://docs.djangoproject.com/en/dev/ref/models/querysets/ #id4

>> <http://docs.djangoproject.com/en/dev/ref/models/querysets/%0A#id4>;


>> specifically, "You can only refer to ForeignKey relations in the
>> list of fields passed to select_related.") Presumably this is
>> because it does an INNER JOIN rather than a LEFT OUTER JOIN and
>> would therefore miss articles with no comments.
>>
>> What I would like to be able to do is:
>>
>> Article.objects.select_related('comment_set').all()
>>
>> This seems like a somewhat simple case and there seems to be no
>> mention of it in the documentation anywhere.
>

> That's not possible,

By principle, or just because Django can't do it (yet)? The related
feature request can be found at
<http://code.djangoproject.com/ticket/2238>. It it, I refer to
http://rubynugs.blogspot.com/2008/07/couple-of-stupid-things-about-djano.html
where a Rails fan claims that Rails can do it.

I don't know much about SQL so I really would like to know whether
this could be realised.

Tschö,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
Jabber ID: torsten...@jabber.rwth-aachen.de

Alex Gaynor

unread,
Feb 23, 2009, 3:55:49 PM2/23/09
to django...@googlegroups.com
I have no idea if it's possible, by my inclination is that it isn't since no one seems to have ponied up the SQL to make it happen :) .  Further SQL generally returns a list of tuples of scalars(to use there python datatypes) SQL doesn't really ever return a collection, so my inclination is that it can't.  Someone better versed in SQL can probably say for sure.  Regardless it's still easy to get the data:

p =Post.object.get(pk=2)
p.comment_set.all() # all the comments for that post.

Sean Patrick Hogan

unread,
Feb 23, 2009, 4:10:38 PM2/23/09
to django...@googlegroups.com
It isn't *an* extra SQL query that I'm worried about.  It's that it's going to be about 2,000 extra SQL queries in a loop.  If it were just constructing a page with one item and then its related items, that's not a problem to do two queries, but when you're looping over 2,000 items and running an extra query for all 2,000, it becomes a large performance issue.  When you're dealing with that many items, it does affect the viability of your application.

Rails can do this.  However, the problem isn't a simple one to solve.  I'll probably look at their code sometime later this week.  Going back to my example, SQL is going to give you the list of tuples of scalars as Aaron notes.  This isn't a problem that SQL can solve since an SQL result set will look like a spreadsheet and (for obvious reasons) you just can't have many related objects shown in a spreadsheet.

BUT, you can iterate over objects in python.  So, let's say I had this result set:
Article1, Comment1
Article 1, Comment 4
Article 1, Comment 8
Article 2, Comment 3
Article 2 Comment5

Ugly!  But I could loop over that and aggregate them together as such:

final_results = []
for item in results:
  if item.pk != last_item.pk:
    final_results.append(item)
  item.comment_set = []
  item.comment_set.append(#a comment object from those columns)
  last_item = item

Basically, each loop you look to see if it's the same article and so you realise that you should only be appending the comment to the comments list.  That seems easy enough, but then what do you do when you have multiple relations you're trying to get in the select_related?  It gets a lot harder.

Anyway, I'll look over the Rails source sometime later this week and see if there is a simple enough pattern to port or maybe just hack a simple case through for single calls for the reverse relationship which isn't so hard.

One thing that would be nice, however, is if select_related() would raise an exception when you tried to call it and passed in a name that wasn't a foreign key (and therefore it couldn't get) so that you'd get an error message rather than having that error message pass silently.
Errors should never pass silently.
Unless explicitly silenced.
~Zen of Python

Well, I guess I have a project for this week!
Sean.

Alex Gaynor

unread,
Feb 23, 2009, 4:14:50 PM2/23/09
to django...@googlegroups.com
Sean,

If you figured out how exactly the SQL that makes the eager loading of reverse relations possible I'm sure there would be quite a few people interested in making it possible in Django(myself among them, though not because I have a use, I just like puzzles :) ).

Karen Tracey

unread,
Feb 23, 2009, 4:26:20 PM2/23/09
to django...@googlegroups.com
On Mon, Feb 23, 2009 at 12:40 PM, sphogan <sph...@gmail.com> wrote:

Given an Article instance a, a.comment_set.all() will get you the related comment in one SQL query.  If you want that query to pull in all related stuff, then use a.comment_set.select_related().all().  You can see the effect via some simple shell experiments.  Below I substitute 'Puzzles' for 'Article' and 'Clues' for 'Comments', as for my DB they have the same one-many relationship you are describing, and Clues have another related field that may be of interest when looping over a clues set.

Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from crossword.models import Puzzles
>>> from django.db import connection
>>> p = Puzzles.objects.get(pk=2834)
>>> p.clues_set.count()
74L
>>> len(connection.queries)
2
>>> for c in p.clues_set.all():
...    print c.EntryID,
...
<output deleted>
>>> len(connection.queries)
77

Pulling the clues_set and the EntryID for each clue required 75 queries: 1 for the set itself, 1 for each clue to retrieve the related EntryID value.  However you can cut that down to one query by simply specifying select_related() as appropriate.  Using a different puzzle just to guarantee no cached results are used:

>>> p = Puzzles.objects.get(pk=8643)
>>> p.clues_set.count()
78L
>>> len(connection.queries)
79
>>> for c in p.clues_set.select_related().all():
...    print c.EntryID,
...
<output deleted>
>>> len(connection.queries)
80
>>>

In this case the loop only took one query, which retrieved the set of clues and all their related objects.

Maybe this helpful for you to reduce the number of queries you see you code generating?  Might not get down to 1 if you are looping over many Articles, but 1 per Article is better than #Comments referenced by the set of Articles?

Karen

Malcolm Tredinnick

unread,
Feb 23, 2009, 6:11:58 PM2/23/09
to django...@googlegroups.com
On Mon, 2009-02-23 at 21:49 +0100, Torsten Bronger wrote:
> Hallöchen!
>
> Alex Gaynor writes:
>
> > On Mon, Feb 23, 2009 at 12:40 PM, sphogan <sph...@gmail.com> wrote:

[...]


> >> What I would like to be able to do is:
> >>
> >> Article.objects.select_related('comment_set').all()
> >>
> >> This seems like a somewhat simple case and there seems to be no
> >> mention of it in the documentation anywhere.
> >

> > That's not possible,
>

> By principle, or just because Django can't do it (yet)? The related
> feature request can be found at
> <http://code.djangoproject.com/ticket/2238>.

Actually, this is really just a disguised version of #5768, which is
talking about values() -- since that contains this problem plus the
presentation issue subtlety. Read the comments there for information
(I've repeated the basic hard part of this, below).

> It it, I refer to
> http://rubynugs.blogspot.com/2008/07/couple-of-stupid-things-about-djano.html
> where a Rails fan claims that Rails can do it.
>
> I don't know much about SQL so I really would like to know whether
> this could be realised.

It's possible, but it's not at all trivial. Firstly, there's a
representation issue for values(): there are multiple *_set rows
returned for any individual result in the original query. So it requires
a bunch more munging of results.

Secondly, there's the "don't shoot yourself in the foot" issue, in that
it's *very* easy to create a horribly inefficient query. The problem is
when you start querying for both foo_set and bar_set. If they return X
and Y results, respectively, for a given row, we need to make sure we
only get back O(X + Y) results from the database, not O(X * Y) results.
the simplest ways to construct the query for these multi-valued returns
leads to the second case, so extra care has to be put into the situation
with multiple multi-valued return results.

We've always said we'll put this in once there's an excellent patch that
avoids the second problem and has a decent return format for values()

Hopefully that should put to rest all the debate in this thread about
what's going on. We'll do, but somebody has to write the patch and it's
not trivial. (That somebody might be me one day, but I also about a few
hundred other things to work on in Django, so it's prioritised
accordingly).

Regards,
Malcolm

Sean Patrick Hogan

unread,
Feb 24, 2009, 11:34:55 AM2/24/09
to django...@googlegroups.com
Thanks for the input!  So, I'm assuming by "easiest case" you mean just calling LEFT OUTER JOINs.  That's Rails behavior and then it deals with it in Ruby.  I'm not too familiar with SQLAlchemy, but it looks like that's what they're doing in their EagerLoader as well.

The only way I can think of overcoming this would be to do an SQL UNION with the LEFT OUTER JOIN in the union.  That way you're doing one join per UNION clause and should end up with X + Y rows (which is better in cases of many multiple related objects, but in the case that there is 1 or none, would actually add more to the return set).  It's ugly because you have to specify null fillers for the things that won't exist. As I realise how silly it was to do an OUTER JOIN there, I retract and change it to INNER JOINs within the UNIONs (since we don't need to get duplicate rows for relations that don't exist).

That would seem to allow for decent iteration over something to propagate the appropriate *_set list attributes.

To illustrate using the blog example (extended with an authors m2m relationship so that an article can have many authors):

SELECT * FROM articles /*fill nulls for other columns, cut for readability*/
UNION
SELECT * FROM articles /*could also possibly just fill nulls for articles to make the iteration easier*/
INNER JOIN comments ON comments.article_id=articles.id
UNION
SELECT * FROM articles
INNER JOIN authors_m2m_table ON authors_m2m_table.article_id=articles.id
INNER JOIN authors ON authors_m2m_table.author_id=authors.id


The problem with that is that any filters would have to be repeated multiple times - like if you just wanted articles written by Malcom, you'd have to add the explicit joins on each clause to the union.

Well, my brain has checked out for lunch so I think I'll follow it.  Maybe someone here has a better idea.

Sean.

Malcolm Tredinnick

unread,
Feb 24, 2009, 9:31:08 PM2/24/09
to django...@googlegroups.com
On Tue, 2009-02-24 at 11:34 -0500, Sean Patrick Hogan wrote:
> Thanks for the input! So, I'm assuming by "easiest case" you mean
> just calling LEFT OUTER JOINs. That's Rails behavior and then it
> deals with it in Ruby. I'm not too familiar with SQLAlchemy, but it
> looks like that's what they're doing in their EagerLoader as well.
>
> The only way I can think of overcoming this would be to do an SQL
> UNION with the LEFT OUTER JOIN in the union. That way you're doing
> one join per UNION clause and should end up with X + Y rows (which is
> better in cases of many multiple related objects, but in the case that
> there is 1 or none, would actually add more to the return set). It's
> ugly because you have to specify null fillers for the things that
> won't exist. As I realise how silly it was to do an OUTER JOIN there,
> I retract and change it to INNER JOINs within the UNIONs (since we
> don't need to get duplicate rows for relations that don't exist).

Yes, that's the idea.

>
> That would seem to allow for decent iteration over something to
> propagate the appropriate *_set list attributes.
>
> To illustrate using the blog example (extended with an authors m2m
> relationship so that an article can have many authors):
>
> SELECT * FROM articles /*fill nulls for other columns, cut for
> readability*/
> UNION
> SELECT * FROM articles /*could also possibly just fill nulls for
> articles to make the iteration easier*/
> INNER JOIN comments ON comments.article_id=articles.id
> UNION
> SELECT * FROM articles
> INNER JOIN authors_m2m_table ON
> authors_m2m_table.article_id=articles.id
> INNER JOIN authors ON authors_m2m_table.author_id=authors.id

This is almost correct, except that you will have different numbers of
fields in each union sub-query (or possibly the same number). So, in
order to be able to differentiate which rows are for which results,
you'd also select a constant value (1, 2, 3, ...) to indicate which
sub-query it's coming from.


>
>
> The problem with that is that any filters would have to be repeated
> multiple times - like if you just wanted articles written by Malcom,
> you'd have to add the explicit joins on each clause to the union.

Yes, they're effectively all separate queries that have the results
jammed together. It's still slightly (but only slightly, in most cases)
more efficient than doing the same thing in Python. Whether all the
added complexity is worth it, though, or whether we just restrict it to
one field and say "do the rest in Python" remains to be seen from how
complicated the patch ends up looking.

You've certainly understand the details of the solution. You can see
that it gets a little complex. Not technically difficult, just fiddly to
implement. Which is kind of why it hasn't been done yet (at least by
me). The payback on effort required is lower than for a bunch of other
things that are biting people.

Regards,
Malcolm


Reply all
Reply to author
Forward
0 new messages