QuerySet performance on large datasets

53 views
Skip to first unread message

Juergen H

unread,
Jan 12, 2021, 6:42:42 AM1/12/21
to Django users
Dear django users,

when working on very large datasets (millions of items) and performing queries on it, I have a few questions regaring the performance.

  • From my understanding using `bool(qs)` is preferred over using `qs.exists()` when the queryset is used later on, as `bool(qs)` already evaluates and caches the qs, which `qs.exists()` does not. Is that correct?

  • Is the above assumption also true, when the qs that has bee evaluated by using `bool(qs)` afterwards is accessed by e.g. `qs.earliest(some_property)`,?
    Or does `qs.earliest(some_property)` generate a new qs and hence there is not really an advantage of having the original qs already evaluated by `bool(qs)`?

    E.g. using `.filter()` will surely create and return a new qs, as stated in the docu and in this case I am pretty sure, the advantage of evaluating and caching the qs by using `bool(qs)` is lost here.
    But for `earliest()` or `latest()` the docu only states, that an object is returned. It does not say anything about a new queryset being created.
    Hence my questions..

Thanks a lot and happy coding
Juergen

Peter of the Norse

unread,
Feb 12, 2021, 7:51:23 PM2/12/21
to django...@googlegroups.com
bool(qs) is recommended when the queryset has already been run.  If you do something like 

    val = qs.earliest(some_property) if bool(qs) else None

then there isn’t a performance gain because it still has to run qs.exists() to even see if there is a value.  Later things like that will be cached though.

Although I might be wrong about this.  I have not dug enough to be certain.

--
You received this message because you are subscribed to the Google Groups "Django users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-users/571d5cb6-bd53-4db4-aad1-2a820d71782en%40googlegroups.com.

Peter of the Norse

unread,
Apr 11, 2021, 11:59:29 AM4/11/21
to django...@googlegroups.com
Follow up to my own message.  I found a way of testing this that I think everyone should look at:

$ ./manage.py shell
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import logging
>>> l = logging.getLogger('django.db.backends')
>>> l.setLevel(logging.DEBUG)
>>> l.addHandler(logging.StreamHandler())

This will allow you to see what SQL is actually being run.  That way you can do things like comparing which statement is faster when you do things side by side.

>>> from django.contrib.auth.models import User
>>> qs = User.objects.all()
>>> val = qs.earliest('date_joined') if qs.exists() else None
(0.000) SELECT (1) AS "a" FROM "auth_user" LIMIT 1; args=()
(0.026) SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name", "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff", "auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user" ORDER BY "auth_user"."date_joined" ASC LIMIT 1; args=()
>>> qs = User.objects.all()  # Don’t use the cache
>>> val = qs.earliest('date_joined') if bool(qs) else None
(0.024) SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name", "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff", "auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user"; args=()
(0.000) SELECT "auth_user"."id", "auth_user"."password", "auth_user"."last_login", "auth_user"."is_superuser", "auth_user"."username", "auth_user"."first_name", "auth_user"."last_name", "auth_user"."email", "auth_user"."is_staff", "auth_user"."is_active", "auth_user"."date_joined" FROM "auth_user" ORDER BY "auth_user"."date_joined" ASC LIMIT 1; args=()

This is a new install with only one user.  Your numbers are likely to be much higher and have significant differences.
Reply all
Reply to author
Forward
0 new messages