Potential performance related bug in django`s Queryset.get().

101 views
Skip to first unread message

Anudeep Samaiya

unread,
Jan 2, 2020, 8:43:47 AM1/2/20
to Django developers (Contributions to Django itself)
Hi everyone,

Happy New Year!!

Ok so I found that Querset.get() is very slow for large datasets when multiple objects exists in very big numbers. I did following changes in my local copy of django code and it improved the performance for very large data sets significantly (like in a blink of second). Didn't had any obvious effects for a table with like 10K records or so. I don't have proper stats to prove the performance.

So what was the issue?
Queryset.get() raises two exceptions
1. DoesNotExist
2. MultipleObjectsFound

In case Multiple objects are found, Querset.get() raises an error with how many objects are found. To do this it was evaluating query to find length by iterating over the queryset which was creating a bottle-neck. For small datasets this was not abovious but for large datasets with more than 1 million recors this was slow.

So Instead I tried changing the method of counting using the Queryset.count(). If count == 1,  only then evaluated the query by calling Querset._fetch_all(). The results were much than before.

So do you think this is right way? Should I raise a pr for the patch?


diff --git a/django/db/models/query.py b/django/db/models/query.py
index
38c1358..e442384 100644
--- a/django/db/models/query.py
+++ b/django/db/models/query.py
@@ -420,8 +420,9 @@ class QuerySet:
         
if not clone.query.select_for_update or connections[clone.db].features.supports_select_for_update_with_limit:
             limit
= MAX_GET_RESULTS
             clone
.query.set_limits(high=limit)
-        num = len(clone)
+        num = clone.count()
         
if num == 1:
+            clone._fetch_all()
             
return clone._result_cache[0]
         
if not num:
             
raise self.model.DoesNotExist(


Thanks

Anudeep Samaiya

Adam Johnson

unread,
Jan 2, 2020, 11:21:00 AM1/2/20
to django-d...@googlegroups.com
Hi Anudeep

Your change makes get() perform an extra query for count() before it . This would be a regression for most uses of get().

get() is not intended for use "when multiple objects exists in very big numbers". You may want to perform a single query, something like Model.objects.filter(id__in=[1, 2, 3, 4, ...]) , and sort out your duplicates in a loop in Python.

Hope that helps,

Adam

charettes

unread,
Jan 2, 2020, 11:21:45 AM1/2/20
to Django developers (Contributions to Django itself)
Hey there!

The current code assumes that .get() will likely match one result which should be the case
most of the time and limit the number of possible results to prevent catastrophic matches

Your patch has the side effect of performing an additional COUNT query for every single
QuerySet.get() call. It would also likely only perform better under certain circumstances
(e.g. large columns retrieved, possibility of using index only scan on COUNT). For these
reasons I don't think this is a good idea.

In summary QuerySet.get is currently optimized for a correct usage of its API while limiting
nefarious side effects of misuses and this approach would optimize for the uncommon case.

Cheers,
Simon

Anudeep Samaiya

unread,
Jan 2, 2020, 11:46:48 AM1/2/20
to django-d...@googlegroups.com
Thanks for the revert, I completely agree with both of your explanations.

Thanks
Anudeep Samaiya

-- 
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/9b9e1842-b888-4ff9-a1b7-1d2d0ebc5b8b%40googlegroups.com.

Reply all
Reply to author
Forward
0 new messages