I'm using Django Rest Framework and have the following situation:
I have two models, `Article` and `Comment`:
{{{
class Article(models.Model):
# ...
class Comment(models.Model):
article = models.ForeignKey(Article, related_name='comments')
# ...
}}}
And an articles viewset:
{{{
class Articles(viewsets.ModelViewSet):
model = Article
serializer_class = ArticleSerializer
paginate_by = 50
paginate_by_param = 'limit'
def get_queryset(self):
return Article.objects.annotate(comments_count=Count('comments'))
}}}
When doing a simple listing, the database will be queried two times: one
to count the result set, the other to fetch the results:
{{{
SELECT
COUNT(*)
FROM (
SELECT
`article`.`id` AS `id`,
COUNT(`article_comment`.`id`) AS `comments_count`
FROM `article`
LEFT OUTER JOIN `article_comment` ON (`article`.`id` =
`article_comment`.`article_id`)
GROUP BY `article`.`id`
ORDER BY NULL
)
subquery;
}}}
And
{{{
SELECT
`article`.`id` AS `id`,
COUNT(`article_comment`.`id`) AS `comments_count`
FROM `article`
LEFT OUTER JOIN `article_comment` ON (`article`.`id` =
`article_comment`.`article_id`)
GROUP BY `article`.`id`
ORDER BY `article`.`id` DESC
LIMIT 50;
}}}
respectively.
If you look at (and `EXPLAIN`) the first query, the one that does the
counting, you'll notice it loads up *all* records and then starts counting
them, a process which may take up all resources and eventually crash the
machine. It did so in my case, with only 50000 records.
I figured this has something to do with Django, not DRF.
--
Ticket URL: <https://code.djangoproject.com/ticket/25375>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* needs_better_patch: => 0
* needs_tests: => 0
* needs_docs: => 0
Comment:
Could you please add a test case that demonstrates the problem without
DRF?
--
Ticket URL: <https://code.djangoproject.com/ticket/25375#comment:1>
Comment (by charettes):
Wonder if this can be related to the changes that caused #25230?
@alexei, are you using MySQL?
--
Ticket URL: <https://code.djangoproject.com/ticket/25375#comment:2>
* status: new => closed
* resolution: => needsinfo
--
Ticket URL: <https://code.djangoproject.com/ticket/25375#comment:3>
Comment (by adamchainz):
I think DRF is probably mutating your QuerySet to use a subquery to get
the count when it's not necessary. If you are using MySQL, these tend to
be slow - newer versions of MySQL and MariaDB have improvements that make
them faster.
--
Ticket URL: <https://code.djangoproject.com/ticket/25375#comment:4>
* status: closed => new
* cc: alexei (added)
* resolution: needsinfo =>
Comment:
Hello,
Sorry for the delay. I didn't get notified when you guys replied, and I
completely forgot about it.
Yes, in my previous example I was using MySQL.
Hoewever, I created a test app to see how it goes using the default
settings (including SQLite db). The result is as follows:
{{{
$ python manage.py shell
>>> from django.core.paginator import Paginator
>>> from django.db import connections
>>> from django.db.models import Count
>>> from bug.models import Article
>>> a = Article.objects.annotate(comments_count=Count('comments'))
>>> p = Paginator(a, 25)
>>> print p.object_list
[]
>>> print connections['default'].queries
[{u'time': u'0.001', u'sql': u'QUERY = u\'SELECT "bug_article"."id",
"bug_article"."title", "bug_article"."body", COUNT("bug_comment"."id") AS
"comments_count" FROM "bug_article" LEFT OUTER JOIN "bug_comment" ON (
"bug_article"."id" = "bug_comment"."article_id" ) GROUP BY
"bug_article"."id", "bug_article"."title", "bug_article"."body" LIMIT 21\'
- PARAMS = ()'}]
>>> print p.count
0
>>> print connections['default'].queries
[{u'time': u'0.001', u'sql': u'QUERY = u\'SELECT "bug_article"."id",
"bug_article"."title", "bug_article"."body", COUNT("bug_comment"."id") AS
"comments_count" FROM "bug_article" LEFT OUTER JOIN "bug_comment" ON (
"bug_article"."id" = "bug_comment"."article_id" ) GROUP BY
"bug_article"."id", "bug_article"."title", "bug_article"."body" LIMIT 21\'
- PARAMS = ()'}, {u'time': u'0.000', u'sql': u'QUERY = u\'SELECT COUNT(*)
FROM (SELECT "bug_article"."id" AS "id", "bug_article"."title" AS "title",
"bug_article"."body" AS "body", COUNT("bug_comment"."id") AS
"comments_count" FROM "bug_article" LEFT OUTER JOIN "bug_comment" ON (
"bug_article"."id" = "bug_comment"."article_id" ) GROUP BY
"bug_article"."id", "bug_article"."title", "bug_article"."body")
subquery\' - PARAMS = ()'}]
>>>
}}}
This time I'm again on Django 1.7.1.
For a second look, I updated Django to 1.8.4. The result is the same.
As you can see, it's Paginator's `count` that wraps the query resulting
from the QuerySet in a subquery.
--
Ticket URL: <https://code.djangoproject.com/ticket/25375#comment:5>
Comment (by timgraham):
The queries in your latest comment are difficult to read given they are
not formatted as nicely as the ticket description. Perhaps you could edit
the description and move your latest comment there.
After that, could you explain what the expected query is? Are you able to
propose a fix? Not knowing these particular internal details of Django
offhand, it's a bit difficult for me to triage the ticket otherwise.
Thanks!
--
Ticket URL: <https://code.djangoproject.com/ticket/25375#comment:6>
* status: new => closed
* resolution: => needsinfo
--
Ticket URL: <https://code.djangoproject.com/ticket/25375#comment:7>
Comment (by Stefan Wehrmeyer):
The Paginator needs to make two queries: the count query and the query for
the page objects.
If the queryset you use in your Paginator satisfies
[https://github.com/django/django/blob/0af14b2eaa0bf0821a8aacc8486489a1eb348397/django/db/models/sql/query.py#L412-L413
one of these conditions] the count query will run as a subquery, so this
query:
{{{
SELECT COUNT(*) FROM (SELECT ...);
}}}
instead of
{{{
SELECT COUNT(*) AS "__count" FROM ...;
}}}
In my case the first query takes double the amount of time of the second
query.
The row count can be independent of annotations. So if you make the
Paginator do the count query on a queryset without annotations (resulting
in a more performant count query) and then doing the actual pagination
with the annotated queryset may result in a small performance win.
Annotations can influence row count (e.g. when filtering by annotation)
and you still may have `groupby()` or `distinct()` on your queryset that
influences counts and forces a count with a subquery. So I'm not sure if
there's a general way for the Paginator (or any `.count()` call) to run
the more performant query.
--
Ticket URL: <https://code.djangoproject.com/ticket/25375#comment:8>
Comment (by Ramiro Morales):
This is similar to #23771
--
Ticket URL: <https://code.djangoproject.com/ticket/25375#comment:9>