Make sure QuerySet.get() does not fetch more rows than it absolutely needs

1,423 views
Skip to first unread message

Patryk Zawadzki

unread,
Jun 3, 2013, 8:50:10 AM6/3/13
to django-d...@googlegroups.com
Here's the ticket:

https://code.djangoproject.com/ticket/6785

tl;dr: Calling .get() on a badly filtered queryset can result in lots
of database rows being fetched and wrapped in Python objects for no
gain.

Currently .get() logic is as follows: clone the queryset, prepopulate
its full cache, check if more than one row was returned and if so,
raise MultipleObjectsReturned.

This works fine if the query returns two or five rows. It's not so
fine if the query results in thousands or millions of rows being
returned. While working on a large online store we had one of the
production servers die while trying to fetch twenty something million
rows because of a programming error that resulted in wrong Q object
being used as a filter.

The only reason all rows are fetched is to return an accurate counter
as part of the exception message. This counter is not accessible
programmatically and having the exact number there is of little value.
The ticket proposes to add a safeguard by limiting the database to at
most two results. The logic then is simple: if zero rows are returned,
raise ObjectDoesNotExist, if two, raise MultipleObjectsReturned. The
latter could possibly execute a second COUNT query if DEBUG is enabled
and the counter is deemed important enough to keep.

Of course you can work around the whole problem by not writing buggy
code but we all know how it works. In the mentioned case even unit
tests did not help as there were not enough records in the database to
trigger MultipleObjectsReturned.

Related pull request: https://github.com/django/django/pull/1139

--
Patryk Zawadzki
I solve problems.

Shai Berger

unread,
Jun 3, 2013, 5:22:20 PM6/3/13
to django-d...@googlegroups.com
On Monday 03 June 2013, Patryk Zawadzki wrote:
> Here's the ticket:
>
> https://code.djangoproject.com/ticket/6785
>
> tl;dr: Calling .get() on a badly filtered queryset can result in lots
> of database rows being fetched and wrapped in Python objects for no
> gain.
>

tl;dr: There's a general, valuable optimization to be made here, but it should
be implemented at a lower level.

I have raised a related issue about a year and a half ago[0], thinking
(mistakenly) that it was mostly an Oracle issue. Ian Kelly had pointed me in
the right direction, but then life happened... The problem is that for all
single-row queries except aggregates, Django uses the same strategy as used in
get (and which you kept using in your patch): limit the query (usually)
properly, then fetch until no more records are retrieved (fetching in done in
chunks using fetchmany(), and not using fetchall()).

For single-row queries, this means two fetches; an unnecessary network
roundtrip, for each such query, unless the backend or underlying driver take
care to prevent it. On Oracle, at least, nobody does.

What we should do instead, in my opinion, is stop fetching as soon as a chunk
arrives that isn't full. This alone should reduce the number of network
accesses significantly on most Django projects.

While we're doing that, we could also define a new mode of sql execution (see
SQLCompiler.results_iter(), in django.db.models.sql.compiler): Next to the
current SINGLE (which uses fetchone()) and MULTI (the chunked loop described
above) we could have SINGLE_VERIFY, which always fetches a single chunk of
size 2. Make get() use that, and #6785 is fixed, while improving performance --
rather than hurting it on correct usage, like the proposed patch[1] does.

Assuming nobody raises objections, I intend to propose a patch along these
lines sometime in the coming weeks (it's one of a few things on my to-do
list), but I wouldn't mind at all if someone beat me to it.

Hope this helps,
Shai.

[0] https://groups.google.com/d/msg/django-developers/SQ0Pltt_f0M/3ccQn0aauiEJ
[1] https://github.com/django/django/pull/1139

Anssi Kääriäinen

unread,
Jun 4, 2013, 2:48:25 AM6/4/13
to Django developers
> [0]https://groups.google.com/d/msg/django-developers/SQ0Pltt_f0M/3ccQn0a...
> [1]https://github.com/django/django/pull/1139

Note that how queryset iteration happens has been changed[1]. Except
for .iterator() call the rows in the queryset are converted to objects
immediately. This isn't a big change for other core backends than
Oracle (they always fetched all rows in one go), but for Oracle users
this could mean changes in amount of rows fetched from DB.

So, with this in mind you could change the compiler's iterator to use
fetchall() except when .iterator() is used.

As for .get() - I don't find the number of duplicates in the error
message that useful. If you got more than one object you failed. How
badly you failed isn't interesting. I can't think of any case where
the amount of objects you got actually matters. If somebody feels
strongly about this, then we could limit the amount of fetched objects
to some smallish amount (21 for example). If you actually get 21
objects back then you say "it returned more than 20" in the error
message.

- Anssi

[1]https://code.djangoproject.com/ticket/18702

Jacob Kaplan-Moss

unread,
Jun 4, 2013, 10:02:25 AM6/4/13
to django-developers
On Tue, Jun 4, 2013 at 1:48 AM, Anssi Kääriäinen
<anssi.ka...@thl.fi> wrote:
> As for .get() - I don't find the number of duplicates in the error
> message that useful.

Yeah, I'd agree with that. It's another one of those things that goes
WAY back into the misty reaches of Django's history, but I don't think
there's a particularly good reason it's stuck around. I think a LIMIT
2 and an error message like "… returned more than one" would be fine.
If people are particularly interested, they could re-run the query
with a .count() instead of a .get().

Jacob

Marc Tamlyn

unread,
Jun 4, 2013, 10:21:18 AM6/4/13
to django-d...@googlegroups.com
I'd be inclined to agree with Anssi that we could do something like LIMIT 21 to just remove the issues with large numbers. I have found it quite helpful to have the exact number when it's small - especially when debugging issues with strange joins etc.



--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.



Reply all
Reply to author
Forward
0 new messages