A simple `.count()` query results in a fast SQL (execution time in seconds
on the left):
{{{(0.015) SELECT COUNT(*) AS "__count" FROM "table";}}}
When we add `.distinct()`, a subquery is created with all columns
SELECTed:
{{{(0.178) SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1, ...
(15 columns) FROM "table") subquery;}}}
When instead of `.distinct()` we write `.distinct('id')` and add
`.order_by('id', 'col1', 'col2')`, the subquery is additionally ORDERed:
{{{(0.151) SELECT COUNT(*) FROM (SELECT DISTINCT ON ("table"."id")
"table"."id" AS Col1, ... (15 columns) FROM "table" ORDER BY "table"."id"
ASC, "table"."col1" ASC, "table"."col2" ASC) subquery;}}}
Funny thing is that without `.distinct('id')` we can write
`.order_by('non_existing_column')` and it works without any exception.
After adding `.values('id')` and an empty `.order_by()`, the query is as
fast as it can be with DISTINCT:
{{{(0.053) SELECT COUNT(*) FROM (SELECT DISTINCT ON ("table"."id")
"table"."id" AS Col1 FROM "table") subquery;}}}
I think that the subquery for `count()` should never contain additional
columns nor be ordered (and the same probably goes for other
aggregations).
--
Ticket URL: <https://code.djangoproject.com/ticket/30685>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* status: new => closed
* resolution: => invalid
Comment:
Hi Adam,
From the description here it just looks like the expected behaviour, as
per the
[https://docs.djangoproject.com/en/2.2/ref/models/querysets/#django.db.models.query.QuerySet.distinct
various `Note` callouts in the `distinct()` documentation].
(If you really think there's a bug here, can I ask you to narrow it down
to something more specific? What are the exact models and ORM calls in
play. What is the exact SQL generated? Vs What do you expect and why?
Thanks.)
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:1>
Comment (by adamsol):
I don't think it's a bug, but I think there is a place for optimization. I
can't see a reason why all columns are SELECTed in a subquery just to
perform a simple count. I've attached the SQLs generated and what I would
expect.
So, for any model, a query like this:
{{{Model.objects.distinct().count()}}}
should produce an SQL like this:
{{{SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1 FROM
"table") subquery;}}}
instead of:
{{{SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1, ... (all
other columns) FROM "table") subquery;}}}
which, depending on the number of columns and size of the table, can be
several times slower (at least on PostgreSQL).
If you think this is not worth the effort, then OK, I won't insist.
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:2>
Comment (by Carlton Gibson):
I think the issue is that there's no way for the `distinct()` call to know
that you're going to follow up with a `count()` — and that code to special
case for that isn't going to worth the effort.
If you know you only want to select a single field tell the ORM via
[https://docs.djangoproject.com/en/2.2/ref/models/querysets/#django.db.models.query.QuerySet.only
`only()`]. This'll give you want you need, I think:
`Model.objects.only('id').distinct().count()`.
(Maybe the ORM experts can tell you more, but I suspect that be more or
less the end of the story...)
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:3>
Comment (by Simon Charette):
It should feasible to adapt `sql.Query.get_count()` to perform these
optimizations by setting a `.values` mask when no columns are specified.
You might have to tweak the suquery pushdown logic a bit as that's what
kicks in when `distinct` is used but it should be feasible.
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:4>
* status: closed => new
* resolution: invalid =>
* stage: Unreviewed => Accepted
Comment:
OK, fine, thank you for the input Simon. Let’s accept on that basis. 👍
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:5>
Comment (by Adam Sołtysik):
Thank you for accepting. I have added `.values('pk').order_by()` before my
`.count()` queries, but I'll be glad to be able to get rid of it.
Just to add a bit more information. The "problem" is not only with
`.distinct()`, also any `.annotate()` makes `count()` produce a subquery
with the redundant annotations inside. For example, a query like this
results in SQL that is 4 times slower for my data than the bare count:
{{{Model.objects.annotate(id2=F('id')).count()}}}
{{{(0.057) SELECT COUNT(*) FROM (SELECT "table"."id" AS Col1, "table"."id"
AS "id2" FROM "table" GROUP BY "table"."id") subquery;}}}
Adding `.values('pk')` of course solves this as well.
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:6>
Comment (by Carlton Gibson):
> Adding .values('pk') of course solves this as well.
Yes… and for the same reason, right? 🙂
Fancy taking on a PR as per Simon’s suggestion? (We’re happy to advise…)
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:7>
Comment (by Adam Sołtysik):
Actually, wouldn't it be as simple as adding these two lines after `obj =
self.clone()` here:
https://github.com/django/django/blob/05964b2198e53a8d66e34d83d9123e3051720b28/django/db/models/sql/query.py#L505-L514
{{{
obj.set_values(['pk'])
obj.clear_ordering(force_empty=True)
}}}
I tried it and it seems to be working (didn't run tests, though). But I
think I don't understand the second part of Simon's advice.
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:8>
Comment (by Simon Charette):
The `annotation` part is tracked in #28477 so I'll rename this ticket to
only mention `distinct`.
I'm afraid we can't simply clear the values and ordering, else quite bit
of code in `get_aggregate`
[https://github.com/django/django/blob/65e86948b80262574058a94ccaae3a9b59c3faea/django/db/models/sql/query.py#L440-L444
would be rendundant].
One case that these changes don't consider is the fact that
`.values('foo').distinct()` could influence the `.count()`
e.g.
{{{#!sql
SELECT COUNT(*) FROM (SELECT DISTINCT author_id FROM book)
}}}
Is not equivalent to
{{{#!sql
SELECT COUNT(*) FROM (SELECT DISTINCT book_id FROM book)
}}}
I suggest you open a PR with your changes so CI can report what exactly
that breaks.
I'm getting three failures locally on SQLite; one related to the
aforementioned `.values().distinct()` issue, one related to
`.datetimes().distinct()` and one related to `order_by` spawning a
multiple valued relationship.
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:9>
Comment (by Ivaylo Donchev):
Hi :)
Is this ticket free now ? Do you mind me start playing with an
implemention of it ^
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:10>
Comment (by Adam Sołtysik):
I probably won't find time for a PR in the nearest time, so as far as I'm
concerned, feel free to take it :)
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:11>
Comment (by Ivaylo Donchev):
Okay, thanks :) . Will play with it these days.
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:12>
* status: new => assigned
* owner: nobody => Ivaylo Donchev
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:13>
* owner: Ivaylo Donchev => (none)
* status: assigned => new
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:14>
Comment (by Simon Charette):
Now that #28477 has been fixed the location where this should be handled
should be made clearer as the elidable annotation part of the problem has
been solved.
This issue has a lot of overlap with #25230 so I wonder if both should be
merged (possibly have #25230 merged into this one?)
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:15>
* owner: (none) => Simon Charette
* needs_better_patch: 0 => 1
* has_patch: 0 => 1
* status: new => assigned
--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:16>