Optimizing out unused annotations from count queries

2,147 views
Skip to first unread message

Tom Forbes

unread,
Aug 19, 2017, 1:11:13 PM8/19/17
to django-d...@googlegroups.com
Hello,
I think there is potential room for improvement with how Django handles count() queries, specifically relating to annotations. Currently any annotations that are added to a queryset are included in the SQL statement that is generated in the count() query, including all joins and SQL function calls - I've personally been bitten by this with DRF and a query with a particularly complex set of annotations where the count query took longer than fetching the data itself, but was semantically equivalent to a much faster, plain Model.objects.count() call.

I think in most cases annotations can be stripped, for example in the two queries below the annotation is not filtered on so it can be safely removed from the query without affecting the result:

one = Book.objects.annotate(Count('chapters')).count()
two = Book.objects.count()

I wanted to gather some feedback from the group as to whether this always holds true: if an annotation is not filtered (or ordered on) it can always be safely removed from a count query? I wouldn't be surprised if there is a case where this isn't safe, but I cannot think of one.

I've made an initial merge request that removes annotations from querysets that have no filter or ordering that depends on annotations here: https://github.com/django/django/pull/8928. It seems like Django also has all the information to optimize out annotations that are not directly or indirectly used by filters, but I wanted to gather some feedback before I attempted this.

Tom

Josh Smeaton

unread,
Aug 19, 2017, 7:10:50 PM8/19/17
to Django developers (Contributions to Django itself)
I'd like to see this provided all bases were covered. I'll just list below the cases where I think it wouldn't be safe.

- Filtered annotations
- Annotations that join to anything other than a non-null foreign key: .annotate(local=F('related__field'))
- Annotations that have a GROUP BY on fields that are not just the PK or entire field set (with .values())

The group by one I'm unsure of, because I forget if a count is called with an aggregate query as a subquery or if it's an error.

I think an optimisation is available if:

1. No filters ref an annotation
2. values() has not been called
3. There are no joins at all

Further optimisation scenarios may be available with stricter rules (such as join types), but we could look at doing that separately if it was difficult.

Tom Forbes

unread,
Aug 19, 2017, 7:41:43 PM8/19/17
to django-d...@googlegroups.com
Thanks for your reply Josh. Can you elaborate on why optimizing out '.annotate(local=F('related__field'))' would not be safe?

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-developers+unsubscribe@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/506d0f2c-51e8-400f-a4fb-66c2b8597aeb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Josh Smeaton

unread,
Aug 19, 2017, 7:48:23 PM8/19/17
to Django developers (Contributions to Django itself)
Thanks for making me elaborate, because I'm probably wrong. I'll step through what my thinking was though.

'.annotate(local=F('related__field'))' will join to the related table. If `related` is a nullable foreign key, then a join would do an implicit filter, removing the row from the result set if it had no relation, and reducing the count. 

But Django models a nullable foreign key join as a LEFT JOIN which wouldn't filter the row out and wouldn't reduce the count.

Multivalue relationships (reverse foreign key and m2m joins) would need some care though I think. I'm not sure if the same annotation above works on m2m joins or not.

On Sunday, 20 August 2017 09:41:43 UTC+10, Tom Forbes wrote:
Thanks for your reply Josh. Can you elaborate on why optimizing out '.annotate(local=F('related__field'))' would not be safe?
On 20 Aug 2017 00:10, "Josh Smeaton" <josh.s...@gmail.com> wrote:
I'd like to see this provided all bases were covered. I'll just list below the cases where I think it wouldn't be safe.

- Filtered annotations
- Annotations that join to anything other than a non-null foreign key: .annotate(local=F('related__field'))
- Annotations that have a GROUP BY on fields that are not just the PK or entire field set (with .values())

The group by one I'm unsure of, because I forget if a count is called with an aggregate query as a subquery or if it's an error.

I think an optimisation is available if:

1. No filters ref an annotation
2. values() has not been called
3. There are no joins at all

Further optimisation scenarios may be available with stricter rules (such as join types), but we could look at doing that separately if it was difficult.

On Sunday, 20 August 2017 03:11:13 UTC+10, Tom Forbes wrote:
Hello,
I think there is potential room for improvement with how Django handles count() queries, specifically relating to annotations. Currently any annotations that are added to a queryset are included in the SQL statement that is generated in the count() query, including all joins and SQL function calls - I've personally been bitten by this with DRF and a query with a particularly complex set of annotations where the count query took longer than fetching the data itself, but was semantically equivalent to a much faster, plain Model.objects.count() call.

I think in most cases annotations can be stripped, for example in the two queries below the annotation is not filtered on so it can be safely removed from the query without affecting the result:

one = Book.objects.annotate(Count('chapters')).count()
two = Book.objects.count()

I wanted to gather some feedback from the group as to whether this always holds true: if an annotation is not filtered (or ordered on) it can always be safely removed from a count query? I wouldn't be surprised if there is a case where this isn't safe, but I cannot think of one.

I've made an initial merge request that removes annotations from querysets that have no filter or ordering that depends on annotations here: https://github.com/django/django/pull/8928. It seems like Django also has all the information to optimize out annotations that are not directly or indirectly used by filters, but I wanted to gather some feedback before I attempted this.

Tom

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.

Anssi Kääriäinen

unread,
Aug 21, 2017, 1:58:16 AM8/21/17
to Django developers (Contributions to Django itself)
On Sunday, August 20, 2017 at 2:48:23 AM UTC+3, Josh Smeaton wrote:
Thanks for making me elaborate, because I'm probably wrong. I'll step through what my thinking was though.

'.annotate(local=F('related__field'))' will join to the related table. If `related` is a nullable foreign key, then a join would do an implicit filter, removing the row from the result set if it had no relation, and reducing the count. 

But Django models a nullable foreign key join as a LEFT JOIN which wouldn't filter the row out and wouldn't reduce the count.

Multivalue relationships (reverse foreign key and m2m joins) would need some care though I think. I'm not sure if the same annotation above works on m2m joins or not.

Interestingly enough, just doing a .filter(m2m_relation__foo='bar') might change the results. For m2m relations, any non-aggregate annotation is not safe to remove.

I believe our test cases are not fully covering here, so removing the annotations will be a bit risky. If there are cases where removal is known to be safe, it's of course a good idea to remove anything extra from the query.

 - Anssi 

Tom Forbes

unread,
Aug 21, 2017, 4:28:35 AM8/21/17
to django-d...@googlegroups.com
Interestingly enough, just doing a .filter(m2m_relation__foo='bar') might change the results

Is that not because it's a filter? Would ". annotate(x=F('m2m_relation__foo'))" change the results in any way? Sorry if I'm not following.



--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-developers+unsubscribe@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.

Anssi Kääriäinen

unread,
Aug 21, 2017, 7:14:27 AM8/21/17
to Django developers (Contributions to Django itself)
On Monday, August 21, 2017 at 11:28:35 AM UTC+3, Tom Forbes wrote:
Interestingly enough, just doing a .filter(m2m_relation__foo='bar') might change the results

Is that not because it's a filter? Would ". annotate(x=F('m2m_relation__foo'))" change the results in any way? Sorry if I'm not following.

What's happening here is that Django generates an unconstrained join to the m2m_relation. Unsurprisingly that generates duplicate rows to the queryset. For example:

class Publisher:
    name = CharField()

class Book:
     publisher = ForeignKey(Publisher, related_name='books')

print(Publisher.objects.annotate(book_id=F('books__id')).query)
OUT: SELECT publisher.id, book.id as book_id FROM publisher JOIN book ON book.publisher_id = publisher.id

Now, if there are two books fo any publisher, that publisher will be outputted twice in the resulting queryset.

Normally you wouldn't want to do this, but even if you do this by accident, count() and len(qs) should return the same number. And, for .values() querysets this can be actually useful in some cases.

Note that it's possible to remove the select entry for the annotation, even if the join can't be removed in this case.

 - Anssi

Adam Johnson

unread,
Aug 21, 2017, 7:34:57 AM8/21/17
to django-d...@googlegroups.com
I'm not sure how much QuerySet optimization django should really be doing. Databases' optimizers are really good and as we've seen there are a huge number of edge cases in trying to do the work in Djangoland, as QuerySets don't really correspond to the underlying data model.

Here's an example in MySQL 5.7 where a subselect with an expression that looks like a django annotation is being optimized away:

adam@localhost [7]> show create table t;
+-------+--------------------------------------------------------------------------------+
| Table | Create Table                                                                   |
+-------+--------------------------------------------------------------------------------+
| t     | CREATE TABLE `t` (
  `a` text,
  `b` text
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
+-------+--------------------------------------------------------------------------------+
1 row in set (0.00 sec)

adam@localhost [8]> explain select count(*) from (select char_length(a) from t) tt;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
|  1 | SIMPLE      | t     | NULL       | ALL  | NULL          | NULL | NULL    | NULL |    2 |   100.00 | NULL  |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)

adam@localhost [9]> show warnings;
+-------+------+--------------------------------------------------------------+
| Level | Code | Message                                                      |
+-------+------+--------------------------------------------------------------+
| Note  | 1003 | /* select#1 */ select count(0) AS `count(*)` from `test`.`t` |
+-------+------+--------------------------------------------------------------+
1 row in set (0.00 sec)

The last note is the rewritten SQL - as you can see it figured the char_length(a) and subselect were unnecessary and removed them. Most databases will have these kind of optimizations in place.

Tom, have you got the SQL of problematic counts you've experienced in the past? It might be easier to fix this at a different level, e.g. the SQL that Django generates, that try adding an optimization "layer".

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-developers+unsubscribe@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.

For more options, visit https://groups.google.com/d/optout.



--
Adam
Reply all
Reply to author
Forward
0 new messages