subqueries returning more than one row?

Eric Abrahamsen

unread,

Jul 27, 2008, 4:28:10 AM7/27/08

to django...@googlegroups.com

Short version: I'm trying to create an extra() subquery that returns a
list of pks from a M2M related model. Right now the SQL complains that
the subquery returns more than one row (which is what I want, of
course). The end goal is that each instance in the queryset would have
a new attribute which is a list of M2M pks. Is this possible with pure
SQL? I'm more interested in the convenience of the attribute than in
saving (probably negligible amounts of) time in reducing related
lookup queries, so I'll do it with python in the view logic if
necessary.

The long version:

I've got a Book model, linked by M2M field called 'categories' to a
DataCategory model. In the database, this is dbase_book linked to
dbase_datacategory via dbase_book_categories.

One view outputs all the books, and I'd like to use only the pk of
each category associated with each book. Right now I'm doing that by
looping over book.categories.all and outputting the pk, but this seems
wasteful and is also an inconvenient format for how I'm using it in
the template.

So I thought of using an extra() call when the original list of Books
is created. I thought the corresponding QuerySet would be this:

books = Book.objects.extra(select={'cats': "SELECT
dbase_datacategory.id FROM dbase_datacategory JOIN
dbase_book_categories ON (dbase_datacategory.id =
dbase_book_categories.datacategory_id) WHERE
dbase_book_categories.book_id = dbase_book.id"})

This produces an empty set in Django, presumably because of the SQL
error "Subquery returns more than 1 row". I'm sadly ignorant about SQL
– is it possible to have this subquery return some sort of a list? If
so, is it going to be more work than it's worth?

Thanks,
Eric

Eric Abrahamsen

unread,

Jul 27, 2008, 5:18:54 AM7/27/08

to Django users

And a related question...

If this has to be done in Python, it can be done pretty simply like
this:

for b in books:
b.cats = b.categories.values_list('id', flat=True)

I'd be happy with this, except that the list of all books gets
filtered and excluded into several different context variables later
in the view, and the new cats attribute doesn't 'stick' when that
happens. I thought that iterating over a queryset evaluated it, and
that that would mean that the python objects were 'present' and
modifiable. Clearly I'm wrong about something... What's going on
there?

Thanks again,
Eric

Malcolm Tredinnick

unread,

Jul 27, 2008, 12:56:57 PM7/27/08

to django...@googlegroups.com

On Sun, 2008-07-27 at 16:28 +0800, Eric Abrahamsen wrote:
> Short version: I'm trying to create an extra() subquery that returns a
> list of pks from a M2M related model. Right now the SQL complains that
> the subquery returns more than one row (which is what I want, of
> course). The end goal is that each instance in the queryset would have
> a new attribute which is a list of M2M pks. Is this possible with pure
> SQL? I'm more interested in the convenience of the attribute than in
> saving (probably negligible amounts of) time in reducing related
> lookup queries, so I'll do it with python in the view logic if
> necessary.

Well, give yourself a reward of choice for a clear problem description.
It's nice to know in advance that you realise the trade-offs and why
you're wanting to work at the Python level. :-)

At the SQL level, the result of any query is a sequence of rows, each
row containing a collection of output columns (the values you are
selecting). So if you were wanting to collect a bunch of results all
related to a single value in some other column, you would get back a
sequence of rows that looks like this:

value1, related_value_1
value1, related_value_2
value1, related_value_3

There's no way for an SQL result set to somehow merge all of
related_value_1, ..., related_value_3 into a single set of results and
return them in a single row. But that's what you're trying to do, which
is why you are seeing the error.

Because of the complexities of merging a result set like the above into
something that looks like the object for value1 with a list of objects
for the related_values hanging off it, Django doesn't support that style
of query in any form (the pattern crops up in a few places). It's
certainly not something we can support through extra(), in any case.
That would require parsing whatever you pass into extra(), which would
ultimately require a full SQL parser in Django. Not going to happen! :-)

So anything you pass into extra(), for any parameter, is treated as an
opaque string. It's entirely up to you to make sure it fits in nicely
with the query. Really, extra() is just for some edge cases, not a tool
of first choice. I realise you probably realise that, since you seem to
understand the error you're seeing, but I just wanted to make it clear.
This isn't "wait a while and we'll fix it" territory -- it's a case of
"don't use that approach".

>
> The long version:
>
> I've got a Book model, linked by M2M field called 'categories' to a
> DataCategory model. In the database, this is dbase_book linked to
> dbase_datacategory via dbase_book_categories.
>
> One view outputs all the books, and I'd like to use only the pk of
> each category associated with each book. Right now I'm doing that by
> looping over book.categories.all and outputting the pk, but this seems
> wasteful and is also an inconvenient format for how I'm using it in
> the template.

The approach you mention in your second mail in this thread, using a
single values() call looks like the most efficient approach to me -- one
db call per book -- if you only want to use the ORM. If you were
comfortable with raw SQL, you could write a query that select all the
book ids and category ids where the books ids are in your set and then
order by book id. That would give you a result set like the one I lay
out above and you could convert that, in Python, to a list of primary
keys. There's no way to do that directly with the Django ORM, but it's
not too painful (half a dozen lines or so) to do it with a
cursor.execute() call and a Python loop.

Regards,
Malcolm

Malcolm Tredinnick

unread,

Jul 27, 2008, 1:01:36 PM7/27/08

to django...@googlegroups.com

On Sun, 2008-07-27 at 02:18 -0700, Eric Abrahamsen wrote:
> And a related question...
>
> If this has to be done in Python, it can be done pretty simply like
> this:
>
> for b in books:
> b.cats = b.categories.values_list('id', flat=True)
>
> I'd be happy with this, except that the list of all books gets
> filtered and excluded into several different context variables later
> in the view, and the new cats attribute doesn't 'stick' when that
> happens. I thought that iterating over a queryset evaluated it, and
> that that would mean that the python objects were 'present' and
> modifiable. Clearly I'm wrong about something... What's going on
> there?

Every time you apply a new method call to a queryset -- for example if
you wrote books.filter(...) -- it returns a *new* queryset. This means
that your original queryset will still have the objects in cache, but
the new queryset will talk to the database again to get the results. It
would introduce quite a lot of extra complexity to try and first filter
over the cache and only then talk to the database. It would also
introduce inconsistent results at times. So we intentionally don't even
attempt that.

Regards,
Malcolm

Eric Abrahamsen

unread,

Jul 27, 2008, 9:30:04 PM7/27/08

to django...@googlegroups.com

Thanks Malcolm, this was a really helpful explanation. I've been
saving cursor.execute() for a rainy day, but maybe it's time to start
practicing. In the meantime, I'll probably go with one query for the
book set, then loop for each book creating 'cats', then turn the
queryset into a list and do further filtering using python instead of
the ORM. That ought to keep the number of queries down, and keep the
object instances around, until I get more familiar with sql (it was a
minor triumph just writing the extra() call and figuring out why it
didn't work).

Thanks again,
Eric

Eric Abrahamsen

unread,

Jul 28, 2008, 12:27:10 PM7/28/08

to django...@googlegroups.com

So of course once I had done that much I couldn't go to sleep until
I'd figured out cursor.execute(), and it turned out to be ridiculously
easy. The same view is used by three equivalent models, with the model
name in the variable 'kind':

cursor.execute("SELECT %s_id, datacategory_id from dbase_%s_categories
ORDER BY %s_id" % (kind,kind,kind))
cat_nums = {}
for k, g in itertools.groupby(cursor, lambda x: x[0]):
cat_nums[k] = [x[1] for x in g]
for item in bigList:
item.cats = cat_nums.get(item.pk, [])

That reduced the number of queries from several hundred to four, and
cut the SQL time to about 20% of what it was (previously this was the
ugliest view I'd ever written). Groupby is probably unnecessary,
except that I have a mild fixation on itertools.

</obsession>

On Jul 28, 2008, at 1:01 AM, Malcolm Tredinnick wrote:

Malcolm Tredinnick

unread,

Jul 28, 2008, 1:36:26 PM7/28/08

to django...@googlegroups.com

On Tue, 2008-07-29 at 00:27 +0800, Eric Abrahamsen wrote:
> So of course once I had done that much I couldn't go to sleep until
> I'd figured out cursor.execute(), and it turned out to be ridiculously
> easy. The same view is used by three equivalent models, with the model
> name in the variable 'kind':
>
> cursor.execute("SELECT %s_id, datacategory_id from dbase_%s_categories
> ORDER BY %s_id" % (kind,kind,kind))

Well, the SQL looks good, but there's a bigger problem. :-(

This is exactly how SQL injection attacks happen. Let the database
substitute your parameters, don't do it in python.

cursor.execute("SELECT %s, ....", (kind, kind, kind))

The execute() method takes two arguments: a string containing format
parameters and the parameters as a list or tuple. Have a read of Python
PEP 249 if you want some more information about it. The reason doing it
this way and not your original approach is that if "kind" contains SQL
unsafe characters like quotes or double-quotes or semicolons, the DB
wrapper will escape them properly. In this case you're lucky that you
know what is in "kind" (at least for now) and it's not too complicated.
But it's a bad habit to get into.

Regards,
Malcolm

Eric Abrahamsen

unread,

Jul 28, 2008, 9:55:01 PM7/28/08

to django...@googlegroups.com

I had made a desultory effort to avoid that, but while I was testing
it this line:

cursor.execute("SELECT %s, datacategory_id from %s ORDER BY %s",
('book_id','dbase_book_categories', 'book_id'))

produced a SQL complaint for ''dbase_book_categories' ORDER BY
'book_id''. So I just thought to hell with it.

How far wrong could I be with that line?

Thanks again,
Eric

Eric Abrahamsen

unread,

Jul 29, 2008, 12:18:13 AM7/29/08

to django...@googlegroups.com

Got this to work with mapping key string formatting:

cursor.execute("SELECT %(id)s, datacategory_id from %(table)s ORDER BY
%(id)s" % ({'id':'%s_id'%kind, 'table':'dbase_%s_categories'%kind}))

On Jul 29, 2008, at 1:36 AM, Malcolm Tredinnick wrote:

Reply all

Reply to author

Forward