Strange query when using annotate and count

320 views
Skip to first unread message

Cristiano Coelho

unread,
Nov 14, 2017, 8:54:23 PM11/14/17
to Django users
I'm getting some very odd query when combining annotate with count. See the following:

>>> q = Vulnerability.objects.annotate(score=WordTrigramCustomSimilarity('test','summary'))
>>> q.count()
3094
>>> print connection.queries[-1]
'SELECT COUNT(*) 
FROM (
    SELECT "vulnerabilities_vulnerability"."id" AS Col1, custom_word_similarity(\'test\', "vulnerabilities_vulnerability"."summary") AS "score" 
    FROM "vulnerabilities_vulnerability" 
    GROUP BY "vulnerabilities_vulnerability"."id", custom_word_similarity(\'test\', "vulnerabilities_vulnerability"."summary")
) subquery
>>> q2 = Vulnerability.objects.filter(summary__icontains='test')
>>> q2.count()
33
>>> print connection.queries[-1]
'SELECT COUNT(*) AS "__count" 
FROM "vulnerabilities_vulnerability" 
WHERE UPPER("vulnerabilities_vulnerability"."summary"::text) LIKE UPPER(\'%test%\')


Custom function code, is this what's causing the odd count behavior? Did I miss anything?

class WordTrigramCustomSimilarity(Func):
    function = 'custom_word_similarity' 
    def __init__(self, string, expression, **extra):
        if not hasattr(string, 'resolve_expression'):
            string = Value(string)
        super(WordTrigramCustomSimilarity, self).__init__(string, expression, output_field=FloatField(), **extra)

I would expect for the query to be a simple count, rather than a nested query with a useless group by (correct me if I'm wrong).
The issue gets even worse if the function is expensive, since it gets called when it's not needed at all, more than once.
Also the issue behaves pretty much the same if the queryset includes filtering and ordering but I didn't include it here for simplicity.

Using Django 1.11.7 + postgres (psycopg) backend.

Cristiano Coelho

unread,
Nov 16, 2017, 8:39:54 AM11/16/17
to Django users
So far the only work around I could find is to either use extra and select instead of annotate, which leads to the right count query but makes the queryset really unreliable due to the hardcoded column names that might break if django decides to alias the tables.

Is there any flag that needs to be set on a Func subclass so it doesn't add that useless group by clause when counting?

Cristiano Coelho

unread,
Nov 21, 2017, 8:46:21 AM11/21/17
to Django users
Hmm, should I try with the dev mailing list? Guess it's something no one faced before?


El martes, 14 de noviembre de 2017, 22:54:23 (UTC-3), Cristiano Coelho escribió:

Simon Charette

unread,
Nov 23, 2017, 10:05:29 PM11/23/17
to Django users
Hello Cristiano,

I understand your frustration but please avoid using the developer mailing list
as a second tier support channel. I suggest you try the IRC #django channel if
you need to want to get faster support.

What's happening here is that annotate() really means "select this field" while
in your other case you use a lookup (summary__icontains) which are only going to
be added to the WHERE clause of your query.

I'm not sure why you are annotating your queryset without referring to it in
a filter clause later on but the ORM cannot simply ignore it when you are
performing your `count()` because some annotations could interfere with grouping
somehow.

There is an open ticket[0] to add support for an `alias()` method that would
allow the ORM to clear/ignore the specified expressions if it's not referenced
in the query.

In the mean time I think the best approach would be to avoid annotating the
queryset if your don't need to reference the score.

Cheers,
Simon

[0] https://code.djangoproject.com/ticket/27719

Cristiano Coelho

unread,
Nov 23, 2017, 10:41:41 PM11/23/17
to Django users
Hello Simon, thanks for the response.

The above code is just an example, the reason behind the annotate because there's some complicated code that builds a queryset and annotates it so it can easily be reused. It works fine 99% of the time except when there's a count involved and it ends up being redundant. The solution would be to not annotate anything and replace the code in multiple places to add the annotate call (or similar using a custom queryset or manager I guess), but that's quite painful and will end up with a lot of duplicated code since there's also an order_by that follows the annotate that needs to be moved over as well

Isn't there a way (even if it's hackish) to simply clear the annotations (and order by)? I know querysets are smart enough to not include the order by clause if there's a count.

You could also suggest using two separate calls or a flag to pass down to the internal code so it doesn't include the additional stuff, but that wouldn't work since paginators accept only one query set for example and internall uses it for both count and results.

Simon Charette

unread,
Nov 23, 2017, 11:12:07 PM11/23/17
to Django users
Hello Cristiano,


> Isn't there a way (even if it's hackish) to simply clear the annotations (and order by)? I know querysets are smart enough to not include the order by clause if there's a count.

The ordering is cleared on `count()` because it shouldn't interfere with the result in any way. You can clear annotations using `queryset.query.annotations.clear()` but be aware that it is a private API that could change under your feet. Make sure to make a copy of the queryset (e.g. copy = queryset.all()) before performing this change as it will alter the queryset in place.

Best,
Simon

Cristiano Coelho

unread,
Nov 24, 2017, 8:14:10 AM11/24/17
to Django users
Hello Simon,

That private API is good to know, but now that I think of it would still not work for me, since my queryset is passed to a paginator and that's the class that does both the count and actual queryset execution, so need a queryset that can have both the annotation but also clears it if count is called so it doesn't create the redundant sub queries.

I'm wondering what's better, should I try to resolve this at the manager level overriding count? I feel like this might cause issues if the annotation is actually legit (ie with an aggregate) and it needs the subquery after all.
The other option is to subclass the paginator class with a special one that does this annotation clearing before running count.

Even with these cases, if the annotated value is used later with a filter query I can't really simply removed, but the sub queries and extra function calls really doesn't make sense to be there when doing a count, so it seems that all the options are quite bad and hackish.

Any other options?

Simon Charette

unread,
Nov 24, 2017, 12:39:27 PM11/24/17
to Django users
Hello Cristiano,

If you are solely using this annotation for querying purpose, that means you are
not expecting a `score` attribute to be attached to your `Vulnerability` instances,
you could try defining your function as a lookup on CharField/TextField instead.

You could then use it only when required by passing a tuple of values to it.

e.g.

Vulnerability.objects.filter(summary__wordsimilarity_gt=('test', threshold))

If you need the value annotated I'm afraid the best solution would be to use
a custom paginator class. You could give a try at detecting whether or not
an annotation is referenced before removing it from query.annotations and
report your finding on the aforementioned ticket but I expect this to be
relatively hard to do correctly. Still it would make implementing `alias()`
way easier and provide help for users in the situation as you.

Best,
Simon

Cristiano Coelho

unread,
Nov 24, 2017, 4:04:21 PM11/24/17
to Django users
Hello Simon,

You are right, the score is really not meant to be attached so a custom lookup might work, if it wasn't for the issue that an order_by clause would fail without the annotation.

So it seems it's either update and duplicate a lot of code or find a very ugly work around, I was hoping I missed some flag or implementation detail on the custom function to tell the query builder to avoid all the grouping and subquery stuff, will research more...

Cristiano Coelho

unread,
Dec 6, 2017, 8:28:22 AM12/6/17
to Django users
After testing for a while, there really doesn't seem to be a good way to do this. .

Annotation is required since the query is filtered based on the annotated value, so any attempt to clear annotations would fail. Although a lookup could be used for the filtering.
But a custom lookup is not possible since the query also sorts by the annotated value, so an annotation is required.

This means that I would need to use a custom lookup for count queries, and annotation for result queries, which would mean a lot of duplicated code and also that the Django Paginator class can't be used since it expects only one queryset.


Are there really no options to make a count with a custom function to skip the group by clause (and so the inner query)? I mean there's no group by at all with the original query, it's just when count is used.

To summarize, the original query is a bit more complex than the one described on top, reason annotate is needed, and it looks more like:

q = Vulnerability.objects.annotate(score=WordTrigramCustomSimilarity('test','summary')).filter(score__gt=0.6).order_by('-score')

Which is then sent to a paginator, that ends up almost adding 3 times more database query time do to the horrid count query generated.

Cristiano Coelho

unread,
Dec 6, 2017, 2:20:15 PM12/6/17
to Django users
Update. Found a work around that gets rid of the unnecessary group by (and hence speeds up the query, from ~200ms to ~100ms in my use case). Simply adding a .values('pk').
Bad thing, every model needs a custom manager and it will still use an inner query, still trying to figure out the side effects of it, nothing so far...

class FasterCountQuerySet(QuerySet):
    def count(self):
        return super(FasterCountQuerySet, self.values('pk')).count()


Makes me wonder if the original count code can be made "smarter" in order to prevent the unnecessary group by, or if this simply a bug
Reply all
Reply to author
Forward
0 new messages