The long version:
I've got a Book model, linked by M2M field called 'categories' to a
DataCategory model. In the database, this is dbase_book linked to
dbase_datacategory via dbase_book_categories.
One view outputs all the books, and I'd like to use only the pk of
each category associated with each book. Right now I'm doing that by
looping over book.categories.all and outputting the pk, but this seems
wasteful and is also an inconvenient format for how I'm using it in
the template.
So I thought of using an extra() call when the original list of Books
is created. I thought the corresponding QuerySet would be this:
books = Book.objects.extra(select={'cats': "SELECT
dbase_datacategory.id FROM dbase_datacategory JOIN
dbase_book_categories ON (dbase_datacategory.id =
dbase_book_categories.datacategory_id) WHERE
dbase_book_categories.book_id = dbase_book.id"})
This produces an empty set in Django, presumably because of the SQL
error "Subquery returns more than 1 row". I'm sadly ignorant about SQL
– is it possible to have this subquery return some sort of a list? If
so, is it going to be more work than it's worth?
Thanks,
Eric
Well, give yourself a reward of choice for a clear problem description.
It's nice to know in advance that you realise the trade-offs and why
you're wanting to work at the Python level. :-)
At the SQL level, the result of any query is a sequence of rows, each
row containing a collection of output columns (the values you are
selecting). So if you were wanting to collect a bunch of results all
related to a single value in some other column, you would get back a
sequence of rows that looks like this:
value1, related_value_1
value1, related_value_2
value1, related_value_3
There's no way for an SQL result set to somehow merge all of
related_value_1, ..., related_value_3 into a single set of results and
return them in a single row. But that's what you're trying to do, which
is why you are seeing the error.
Because of the complexities of merging a result set like the above into
something that looks like the object for value1 with a list of objects
for the related_values hanging off it, Django doesn't support that style
of query in any form (the pattern crops up in a few places). It's
certainly not something we can support through extra(), in any case.
That would require parsing whatever you pass into extra(), which would
ultimately require a full SQL parser in Django. Not going to happen! :-)
So anything you pass into extra(), for any parameter, is treated as an
opaque string. It's entirely up to you to make sure it fits in nicely
with the query. Really, extra() is just for some edge cases, not a tool
of first choice. I realise you probably realise that, since you seem to
understand the error you're seeing, but I just wanted to make it clear.
This isn't "wait a while and we'll fix it" territory -- it's a case of
"don't use that approach".
>
> The long version:
>
> I've got a Book model, linked by M2M field called 'categories' to a
> DataCategory model. In the database, this is dbase_book linked to
> dbase_datacategory via dbase_book_categories.
>
> One view outputs all the books, and I'd like to use only the pk of
> each category associated with each book. Right now I'm doing that by
> looping over book.categories.all and outputting the pk, but this seems
> wasteful and is also an inconvenient format for how I'm using it in
> the template.
The approach you mention in your second mail in this thread, using a
single values() call looks like the most efficient approach to me -- one
db call per book -- if you only want to use the ORM. If you were
comfortable with raw SQL, you could write a query that select all the
book ids and category ids where the books ids are in your set and then
order by book id. That would give you a result set like the one I lay
out above and you could convert that, in Python, to a list of primary
keys. There's no way to do that directly with the Django ORM, but it's
not too painful (half a dozen lines or so) to do it with a
cursor.execute() call and a Python loop.
Regards,
Malcolm
Every time you apply a new method call to a queryset -- for example if
you wrote books.filter(...) -- it returns a *new* queryset. This means
that your original queryset will still have the objects in cache, but
the new queryset will talk to the database again to get the results. It
would introduce quite a lot of extra complexity to try and first filter
over the cache and only then talk to the database. It would also
introduce inconsistent results at times. So we intentionally don't even
attempt that.
Regards,
Malcolm
Thanks again,
Eric
cursor.execute("SELECT %s_id, datacategory_id from dbase_%s_categories
ORDER BY %s_id" % (kind,kind,kind))
cat_nums = {}
for k, g in itertools.groupby(cursor, lambda x: x[0]):
cat_nums[k] = [x[1] for x in g]
for item in bigList:
item.cats = cat_nums.get(item.pk, [])
That reduced the number of queries from several hundred to four, and
cut the SQL time to about 20% of what it was (previously this was the
ugliest view I'd ever written). Groupby is probably unnecessary,
except that I have a mild fixation on itertools.
</obsession>
On Jul 28, 2008, at 1:01 AM, Malcolm Tredinnick wrote:
Well, the SQL looks good, but there's a bigger problem. :-(
This is exactly how SQL injection attacks happen. Let the database
substitute your parameters, don't do it in python.
cursor.execute("SELECT %s, ....", (kind, kind, kind))
The execute() method takes two arguments: a string containing format
parameters and the parameters as a list or tuple. Have a read of Python
PEP 249 if you want some more information about it. The reason doing it
this way and not your original approach is that if "kind" contains SQL
unsafe characters like quotes or double-quotes or semicolons, the DB
wrapper will escape them properly. In this case you're lucky that you
know what is in "kind" (at least for now) and it's not too complicated.
But it's a bad habit to get into.
Regards,
Malcolm
I had made a desultory effort to avoid that, but while I was testing
it this line:
cursor.execute("SELECT %s, datacategory_id from %s ORDER BY %s",
('book_id','dbase_book_categories', 'book_id'))
produced a SQL complaint for ''dbase_book_categories' ORDER BY
'book_id''. So I just thought to hell with it.
How far wrong could I be with that line?
Thanks again,
Eric
cursor.execute("SELECT %(id)s, datacategory_id from %(table)s ORDER BY
%(id)s" % ({'id':'%s_id'%kind, 'table':'dbase_%s_categories'%kind}))
On Jul 29, 2008, at 1:36 AM, Malcolm Tredinnick wrote: