I just discovered that "if obj in queryset" can be very slow:
queryset = MyModel.objects.all()
obj = MyModel.objects.get(id=319526)

# if obj in queryset:  # Very slow if the queryset is big: every row
#     print 'in'       # gets fetched and turned into a Python object.

if queryset.filter(id=obj.id):  # Fast: the check is done inside the DB.
    print 'in'
What is the best way to do "if obj in queryset"?
Thomas
--
Thomas Guettler, http://www.thomas-guettler.de/
E-Mail: guettli (*) thomas-guettler + de
--
Thomas,
if the second line works for you (i.e. it does not raise a
MyModel.DoesNotExist exception), then obj is in the queryset by
definition.
What are you trying to achieve?
Jirka
def get_special_objects():
    # Some objects are special. But the result can still be huge!
    return MyModel.objects.filter(....)

obj = MyModel.objects.get(....)

# Is this object "special"?
if obj in get_special_objects():  # Very slow if there are many rows in the result.
    # yes, it is special

# This is my current solution
if get_special_objects().filter(pk=obj.pk).count():
    # yes, it is special
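The count() variant makes the database count every matching row; for a pure existence check the database can stop at the first match, which is what QuerySet.exists() does (it appends LIMIT 1). A rough illustration with the stdlib sqlite3 module, table name made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE special (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO special (id) VALUES (?)",
                 [(i,) for i in range(1, 10001)])

pk = 5

# filter(pk=obj.pk).count(): the DB counts all matching rows.
n = conn.execute(
    "SELECT COUNT(*) FROM special WHERE id = ?", (pk,)
).fetchone()[0]
is_special_via_count = bool(n)

# filter(pk=obj.pk).exists(): the DB stops at the first match.
is_special_via_exists = conn.execute(
    "SELECT 1 FROM special WHERE id = ? LIMIT 1", (pk,)
).fetchone() is not None

print(is_special_via_count, is_special_via_exists)  # -> True True
```

For a primary-key lookup both are cheap index probes; exists() pays off when the filter can match many rows.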
Up to now I sometimes used "if obj in queryset" but never realized that
it evaluates the whole queryset; I thought it was lazy.
This is not a big problem for me, since I found a solution.
Is there a reason why "if obj in queryset" is executed in Python code, and not
in the DB?
Thomas
Choice and consistency.
You've picked an edge case where it does make more sense to query the
database, but I could contrive any number of examples where it would be
wrong to filter in the database. For example, if I had a long list of
objects I wanted to test for membership in the queryset AND I wanted
the queryset's items constructed anyway, I'd use the 'in' operator.
That is the choice.
For consistency: querying the database is performed by methods on a
model's manager class or related queryset, whereas Python functions
operating on a queryset do not perform database queries. E.g., len(qs)
evaluates the queryset and counts the number of instances in Python,
whereas qs.count() performs a DB query to obtain the result. Neither is
wrong; they are appropriate in different cases.
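That len(qs) vs qs.count() split can be sketched the same way, with the stdlib sqlite3 module standing in for the ORM:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE qs (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO qs (id) VALUES (?)",
                 [(i,) for i in range(500)])

# len(qs): evaluate the queryset, then count the objects in Python.
rows = conn.execute("SELECT id FROM qs").fetchall()   # all 500 rows fetched
n_python = len(rows)

# qs.count(): ask the database for the count; no rows are fetched.
n_db = conn.execute("SELECT COUNT(*) FROM qs").fetchone()[0]

print(n_python, n_db)  # -> 500 500
```

Same answer either way; len() is the right tool when you need the objects anyway, count() when you only need the number.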
Cheers
Tom
# This is my current solution
if get_special_objects().filter(pk=obj.pk).count():
    # yes, it is special
OK, take this example. I have a Django model table with 70 million
rows in it. Doing any kind of query on this table is slow, and
typically the query is date-constrained - which MySQL will use as the
optimum key, meaning any further filtering is a table scan over the
date-filtered rows.
Pulling a large query (say, all logins in a month, ~1 million rows)
takes only a few seconds longer than counting the number of rows the
query would find - after all, the database still has to do precisely
the same amount of work, it just doesn't have to deliver the data.
Say I have n entries I want to test for membership in that result set,
and I also want to iterate through the list, calculating some data and
printing out each row. I can do the existence tests either in Python or
in the database. If I do it in the database, I have n+1 expensive
queries to perform. If I do it in Python, I have 1 expensive query to
perform and (worst case) n+1 full scans of the data already retrieved
(and I avoid locking the table for n+1 expensive queries).
Depending on the size of the data set, as the developer I have the
choice of which will be more appropriate for my needs. Sometimes I
need "if qs.filter(pk=obj.pk).exists()", sometimes I need "if obj in
qs".
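The "1 expensive query, then test in Python" pattern can be sketched like this; Login, get_month_logins, and the pk values are made-up stand-ins for a real model and queryset:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Login:
    """Stand-in for a model instance."""
    pk: int


def get_month_logins():
    """Stand-in for the one expensive queryset evaluation."""
    return [Login(pk) for pk in range(1, 100_001, 7)]


# 1 expensive query: evaluate once, keep the rows, cache the keys.
month_logins = get_month_logins()
seen_pks = {login.pk for login in month_logins}   # O(1) membership tests

# n cheap in-memory tests instead of n more expensive DB queries.
to_check = [Login(1), Login(2), Login(8)]
present = [login for login in to_check if login.pk in seen_pks]

print([login.pk for login in present])  # -> [1, 8]
```

The cached rows are still available for the iterating/calculating pass, which is exactly the case where 'in' on an evaluated queryset beats n separate exists() queries.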
Cheers
Tom
I created a ticket:
https://code.djangoproject.com/ticket/17156
{{{
The documentation should explain what happens in this code:
if obj in queryset:
....
If queryset is huge, the above code can be very slow.
If you want to run the code in the database, you should use this:
if queryset.filter(pk=obj.pk).exists():
....
Related discussion: http://groups.google.com/group/django-users/browse_thread/thread/5a8196ff1086f4ae/530548a127a6e3ca
Can someone please provide a patch for the documentation? I am not a native speaker and won't find the right words.
}}}
I disagree. To my mind, all of this behaviour is documented:
It is documented here when querysets are evaluated:
https://docs.djangoproject.com/en/1.3/ref/models/querysets/#when-querysets-are-evaluated
It's documented here that iterating through a queryset will cache the
objects created:
https://docs.djangoproject.com/en/1.3/ref/models/querysets/#iterator
It's documented here that testing for results is slightly quicker than
evaluating and testing the queryset:
https://docs.djangoproject.com/en/1.3/ref/models/querysets/#exists
It's documented here that caching is optimistic - it caches the
results that have been requested:
https://docs.djangoproject.com/en/1.3/topics/db/queries/#caching-and-querysets
Do we really need to specifically point out that the __contains__
method iterates through the queryset and stops when it finds an
answer?
Cheers
Tom