QuerySet: "if obj in queryset" can be very slow


Thomas Guettler

Nov 2, 2011, 7:42:00 AM
to Django users
Hi,

I just discovered that "if obj in queryset" can be very slow:

queryset = MyModel.objects.all()
obj = MyModel.objects.get(id=319526)

# if obj in queryset:  # Very slow if the queryset is big: every row in the queryset gets turned into a Python object
#     print 'in'

if queryset.filter(id=obj.id):  # Fast: the check is done inside the DB.
    print 'in'

What is the best way to do "if obj in queryset"?

Thomas

--
Thomas Guettler, http://www.thomas-guettler.de/
E-Mail: guettli (*) thomas-guettler + de

Flavia Missi

Nov 2, 2011, 7:57:48 AM
to django...@googlegroups.com
That's because your queryset is being evaluated when you compare. Maybe if you explain your problem, we can give you a better solution than using `in`.

Take a look at the docs about when querysets are evaluated:

https://docs.djangoproject.com/en/1.3/ref/models/querysets/#when-querysets-are-evaluated

[]'s


--
Flávia Missi

Jirka Vejrazka

Nov 2, 2011, 9:35:21 AM
to django...@googlegroups.com
> queryset=MyModel.objects.all()
> obj=MyModel.objects.get(id=319526)

Thomas,

if the second line works for you (i.e. does not raise
MyModel.DoesNotExist exception), then obj is in queryset by
definition.

What are you trying to achieve?

Jirka

Thomas Guettler

Nov 2, 2011, 11:25:26 AM
to django...@googlegroups.com
Here is a better example:

def get_special_objects():
    # Some are special. But the result can still be huge!
    return MyModel.objects.filter(....)

obj = MyModel.objects.get(....)

# Is this object "special"?
if obj in get_special_objects():  # This is very slow if there are many rows in the result.
    # yes, it is special

# This is my current solution
if get_special_objects().filter(pk=obj.pk).count():
    # yes, it is special


Until now I sometimes used "if obj in queryset" but never realized that
this evaluates the whole queryset. I thought it was lazy.

I have no big problem with this, since I found a solution.

Is there a reason why "if obj in queryset" is executed in Python code and not
in the DB?

Thomas

Tom Evans

Nov 2, 2011, 11:45:48 AM
to django...@googlegroups.com
On Wed, Nov 2, 2011 at 3:25 PM, Thomas Guettler <h...@tbz-pariv.de> wrote:
> Here is a better example:
>
> def get_special_objects():
>    # Some are special. But the result can still be huge!
>    return MyModel.objects.filter(....)
>
> obj=MyModel.objects.get(....)
>
> # is this object "special?"
> if obj in get_special_objects(): # This is very slow if there are many rows in the result.
>    # yes, it is special
>
> # This is my current solution
> if get_special_objects().filter(pk=obj.pk).count():
>    # yes, it is special
>
>
> Up to now I sometimes used "if obj in queryset" but never realized, that
> this evaluates the whole queryset. Up to now I thought this is lazy.
>
> I have no big problem with this, since I found a solution.
>
> Is there a reason why "if obj in queryset" is executed in python code, and not
> in the DB?
>
>  Thomas
>
>

Choice and consistency.

You've picked an edge case where it does make more sense to query the
database, but I could contrive any number of examples where it would
be wrong to filter in the database. For example, if I had a long list
of objects that I wanted to test are in the queryset AND I wanted the
items in the queryset constructed anyway, I'd use the 'in' operator.
That is the choice.

For consistency, querying the database is performed by methods on a
model's manager class or related queryset, whereas Python functions
operating on a queryset do not perform database queries. E.g. len(qs)
evaluates the queryset and counts the number of instances in Python,
whereas qs.count() performs a DB query to obtain the result. Neither
is wrong; they are appropriate in different cases.
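
To make the distinction concrete (the model and field name here are made up for illustration):

qs = MyModel.objects.filter(special=True)   # 'special' is a hypothetical field

n = len(qs)      # evaluates the queryset: fetches every row, builds model instances, counts them in Python
n = qs.count()   # runs a SELECT COUNT(*) in the database; no instances are created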

Cheers

Tom

Ian Clelland

Nov 2, 2011, 1:30:53 PM
to django...@googlegroups.com
On Wed, Nov 2, 2011 at 8:25 AM, Thomas Guettler <h...@tbz-pariv.de> wrote:
> # This is my current solution
> if get_special_objects().filter(pk=obj.pk).count():
>     # yes, it is special


I can't speak to the "why" of this situation; it seems to me that this could always be converted into a more efficient database query without any unexpected side-effects (and if I really wanted the side effects, I would just write "if obj in list(qs)" instead). In this case, though, I would usually write something like this:

if get_special_objects().filter(pk=obj.pk).exists():
   # yes, it is special

I believe that in some cases the exists() query can be optimized to return faster than a count() aggregation, and the intent of the code is clearer.
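
As a rough illustration (the exact SQL is an assumption on my part and will vary by backend):

get_special_objects().filter(pk=obj.pk).count()    # roughly: SELECT COUNT(*) ... WHERE ... AND id = %s
get_special_objects().filter(pk=obj.pk).exists()   # roughly: SELECT ... WHERE ... AND id = %s LIMIT 1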

Ian

--
Regards,
Ian Clelland
<clel...@gmail.com>

Tom Evans

Nov 2, 2011, 1:46:12 PM
to django...@googlegroups.com

OK, take this example. I have a Django model table with 70 million
rows in it. Doing any kind of query on this table is slow, and
typically the query is restricted by date, which MySQL will use as the
optimum key, meaning any further filtering is a table scan over the
date-filtered rows.

Pulling a large query (say, all logins in a month, ~1 million rows)
takes only a few seconds longer than counting the number of rows the
query would find: after all, the database still has to do precisely
the same amount of work; it just doesn't have to deliver the data.

Say I have n entries that I want to test are in that result set, and I
also want to iterate through the list, calculating some data and
printing out each row. I can do the existence tests either in Python
or in the database. If I do it in the database, I have n+1 expensive
queries to perform. If I do it in Python, I have 1 expensive query to
perform and (worst case) n+1 full scans of the data retrieved (and I
avoid locking the table for n+1 expensive queries).

Depending on the size of the data set, as the developer I have the
choice of which will be more appropriate for my needs. Sometimes I
need "if qs.filter(pk=obj.pk).exists()", sometimes I need "if obj in
qs".

Cheers

Tom

Furbee

Nov 2, 2011, 2:06:21 PM
to django...@googlegroups.com
Thanks Tom, that's a great explanation!

Furbeenator

Ian Clelland

Nov 2, 2011, 2:06:38 PM
to django...@googlegroups.com
I agree that there are situations where you want, or need, to pull the data in to Python for processing, to avoid a lot of database overhead. That's why we have select_related, as well: sometimes you really do need to just grab as much as possible all at once. 

The trouble is that querysets are *supposed* to be lazy; just evaluating as much as necessary, as late as possible, to do the job. I think that this behaviour violates the principle of least surprise, by instantiating a (potentially very large) queryset as a side-effect of a simple inclusion test.

Any other time that you want a queryset instantiated, the idiomatic way to do it is to construct a Python list based on it:

# Get all objects at once from database
objs = list(qs)
# Now use that list multiple times in a method

or

for obj in list(qs):
  # qs is evaluated once, list members may be manipulated as needed in Python

or, by extension,

if obj in list(qs):
  # stuff

I wouldn't rely on the behaviour of the in operator to evaluate the queryset for me; it doesn't look right to me, it's not obvious to anyone else looking at the code, and I don't think it's documented behaviour.

I would prefer that `in` did an exists() query, but since there are explicit ways to force either behaviour, in practice I use one of those explicit ways rather than leave the code looking ambiguous.

Ian Clelland

Nov 2, 2011, 2:26:31 PM
to django...@googlegroups.com
On Wed, Nov 2, 2011 at 10:46 AM, Tom Evans <teva...@googlemail.com> wrote:
> OK, take this example. I have a Django model table with 70 million
> rows in it. Doing any kind of query on this table is slow, and
> typically the query is restricted by date, which MySQL will use as the
> optimum key, meaning any further filtering is a table scan over the
> date-filtered rows.
>
> Pulling a large query (say, all logins in a month, ~1 million rows)
> takes only a few seconds longer than counting the number of rows the
> query would find: after all, the database still has to do precisely
> the same amount of work; it just doesn't have to deliver the data.
>
> Say I have n entries that I want to test are in that result set, and I
> also want to iterate through the list, calculating some data and
> printing out each row. I can do the existence tests either in Python
> or in the database. If I do it in the database, I have n+1 expensive
> queries to perform. If I do it in Python, I have 1 expensive query to
> perform and (worst case) n+1 full scans of the data retrieved (and I
> avoid locking the table for n+1 expensive queries).
>
> Depending on the size of the data set, as the developer I have the
> choice of which will be more appropriate for my needs. Sometimes I
> need "if qs.filter(pk=obj.pk).exists()", sometimes I need "if obj in
> qs".


Just looking at the source of QuerySet (finally), and it looks like the __contains__ method actually does something different from this: it evaluates the whole QuerySet in bulk at the database level and starts creating model instances based on that, but only until it finds a matching one. So, after running "if obj in qs", you might end up with one object created, or you might end up with 70M objects, or anywhere in between.
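
Roughly, the idea is something like this (a simplified sketch of the behaviour, not Django's actual source; the helper names are invented):

class SketchQuerySet(object):
    def __contains__(self, val):
        cache = []
        for row in self._execute_sql():         # one bulk query against the database
            obj = self._row_to_instance(row)    # model instances are built one at a time...
            cache.append(obj)
            if obj == val:                      # ...but only until a matching one is found
                self._result_cache = cache
                return True
        self._result_cache = cache              # no match: every row got instantiated
        return False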

Again: odd, undocumented, and potentially surprising behaviour, and I'd recommend explicit over implicit, especially in this case.

Thomas Guettler

Nov 3, 2011, 2:37:25 AM
to django...@googlegroups.com
Hi,

I created a ticket:

https://code.djangoproject.com/ticket/17156

{{{
The documentation should explain what happens in this code:

if obj in queryset:
    ....

If queryset is huge, the above code can be very slow.

If you want to run the code in the database, you should use this:

if queryset.filter(pk=obj.pk).exists():
    ....

Related discussion: http://groups.google.com/group/django-users/browse_thread/thread/5a8196ff1086f4ae/530548a127a6e3ca

Can someone please provide a patch for the documentation? I am not a native speaker and won't find the right words.
}}}

Tom Evans

Nov 3, 2011, 8:31:09 AM
to django...@googlegroups.com
On Wed, Nov 2, 2011 at 6:26 PM, Ian Clelland <clel...@gmail.com> wrote:
> Just looking at the source to QuerySet (finally), and it looks like the
> __contains__ method actually does something different than this: It
> evaluates the whole QuerySet in bulk at the database level, and starts
> creating model instances based on that, but only until it finds a matching
> one. So, after running "if obj in qs", you might end up with one object
> created, or you might end up with 70M objects, or anywhere in between.
> Again: odd, undocumented, and potentially surprising behaviour, and I'd
> recommend explicit over implicit, especially in this case.
>

I disagree. To my mind, all of this behaviour is documented:

It is documented here when querysets are evaluated:

https://docs.djangoproject.com/en/1.3/ref/models/querysets/#when-querysets-are-evaluated

It's documented here that iterating through a queryset will cache the
objects created:

https://docs.djangoproject.com/en/1.3/ref/models/querysets/#iterator

It's documented here that testing for results is slightly quicker than
evaluating and testing the queryset:

https://docs.djangoproject.com/en/1.3/ref/models/querysets/#exists

It's documented here that caching is optimistic - it caches the
results that have been requested:

https://docs.djangoproject.com/en/1.3/topics/db/queries/#caching-and-querysets

Do we really need to specifically point out that the __contains__
method iterates through the queryset and stops when it finds an
answer?

Cheers

Tom
