[Django] #30685: Suboptimal QuerySet.distinct().count()

16 views
Skip to first unread message

Django

unread,
Aug 6, 2019, 5:23:20 AM8/6/19
to django-...@googlegroups.com
#30685: Suboptimal QuerySet.distinct().count()
-------------------------------------+-------------------------------------
Reporter: adamsol | Owner: nobody
Type: | Status: new
Cleanup/optimization |
Component: Database | Version: 2.2
layer (models, ORM) |
Severity: Normal | Keywords:
Triage Stage: | Has patch: 0
Unreviewed |
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
-------------------------------------+-------------------------------------
I have a PostgreSQL table with 100 000 records and 15 columns.

A simple `.count()` query results in a fast SQL (execution time in seconds
on the left):

{{{(0.015) SELECT COUNT(*) AS "__count" FROM "table";}}}

When we add `.distinct()`, a subquery is created with all columns
SELECTed:

{{{(0.178) SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1, ...
(15 columns) FROM "table") subquery;}}}

When instead of `.distinct()` we write `.distinct('id')` and add
`.order_by('id', 'col1', 'col2')`, the subquery is additionally ORDERed:

{{{(0.151) SELECT COUNT(*) FROM (SELECT DISTINCT ON ("table"."id")
"table"."id" AS Col1, ... (15 columns) FROM "table" ORDER BY "table"."id"
ASC, "table"."col1" ASC, "table"."col2" ASC) subquery;}}}

Funny thing is that without `.distinct('id')` we can write
`.order_by('non_existing_column')` and it works without any exception.

After adding `.values('id')` and an empty `.order_by()`, the query is as
fast as it can be with DISTINCT:

{{{(0.053) SELECT COUNT(*) FROM (SELECT DISTINCT ON ("table"."id")
"table"."id" AS Col1 FROM "table") subquery;}}}

I think that the subquery for `count()` should never contain additional
columns nor be ordered (and the same probably goes for other
aggregations).

--
Ticket URL: <https://code.djangoproject.com/ticket/30685>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,
Aug 6, 2019, 5:54:05 AM8/6/19
to django-...@googlegroups.com
#30685: Suboptimal QuerySet.distinct().count()
-------------------------------------+-------------------------------------
Reporter: adamsol | Owner: nobody
Type: | Status: closed
Cleanup/optimization |
Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution: invalid
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Carlton Gibson):

* status: new => closed
* resolution: => invalid


Comment:

Hi Adam,

From the description here it just looks like the expected behaviour, as
per the
[https://docs.djangoproject.com/en/2.2/ref/models/querysets/#django.db.models.query.QuerySet.distinct
various `Note` callouts in the `distinct()` documentation].

(If you really think there's a bug here, can I ask you to narrow it down
to something more specific? What are the exact models and ORM calls in
play. What is the exact SQL generated? Vs What do you expect and why?
Thanks.)

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:1>

Django

unread,
Aug 6, 2019, 6:19:17 AM8/6/19
to django-...@googlegroups.com
#30685: Suboptimal QuerySet.distinct().count()
-------------------------------------+-------------------------------------
Reporter: adamsol | Owner: nobody
Type: | Status: closed
Cleanup/optimization |
Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution: invalid
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by adamsol):

I don't think it's a bug, but I think there is a place for optimization. I
can't see a reason why all columns are SELECTed in a subquery just to
perform a simple count. I've attached the SQLs generated and what I would
expect.

So, for any model, a query like this:

{{{Model.objects.distinct().count()}}}

should produce an SQL like this:

{{{SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1 FROM
"table") subquery;}}}

instead of:

{{{SELECT COUNT(*) FROM (SELECT DISTINCT "table"."id" AS Col1, ... (all
other columns) FROM "table") subquery;}}}

which, depending on the number of columns and size of the table, can be
several times slower (at least on PostgreSQL).

If you think this is not worth the effort, then OK, I won't insist.

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:2>

Django

unread,
Aug 6, 2019, 6:54:49 AM8/6/19
to django-...@googlegroups.com
#30685: Suboptimal QuerySet.distinct().count()
-------------------------------------+-------------------------------------
Reporter: adamsol | Owner: nobody
Type: | Status: closed
Cleanup/optimization |
Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution: invalid
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Carlton Gibson):

I think the issue is that there's no way for the `distinct()` call to know
that you're going to follow up with a `count()` — and that code to special
case for that isn't going to worth the effort.

If you know you only want to select a single field tell the ORM via
[https://docs.djangoproject.com/en/2.2/ref/models/querysets/#django.db.models.query.QuerySet.only
`only()`]. This'll give you want you need, I think:
`Model.objects.only('id').distinct().count()`.

(Maybe the ORM experts can tell you more, but I suspect that be more or
less the end of the story...)

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:3>

Django

unread,
Aug 6, 2019, 9:52:35 AM8/6/19
to django-...@googlegroups.com
#30685: Suboptimal QuerySet.distinct().count()
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: nobody

Type: | Status: closed
Cleanup/optimization |
Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution: invalid
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Simon Charette):

It should feasible to adapt `sql.Query.get_count()` to perform these
optimizations by setting a `.values` mask when no columns are specified.

https://github.com/django/django/blob/05964b2198e53a8d66e34d83d9123e3051720b28/django/db/models/sql/query.py#L505-L514

You might have to tweak the suquery pushdown logic a bit as that's what
kicks in when `distinct` is used but it should be feasible.

https://github.com/django/django/blob/05964b2198e53a8d66e34d83d9123e3051720b28/django/db/models/sql/query.py#L421-L457

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:4>

Django

unread,
Aug 6, 2019, 12:11:13 PM8/6/19
to django-...@googlegroups.com
#30685: Suboptimal QuerySet.distinct().count()
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: nobody
Type: | Status: new
Cleanup/optimization |

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Carlton Gibson):

* status: closed => new
* resolution: invalid =>
* stage: Unreviewed => Accepted


Comment:

OK, fine, thank you for the input Simon. Let’s accept on that basis. 👍

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:5>

Django

unread,
Aug 7, 2019, 4:24:46 AM8/7/19
to django-...@googlegroups.com
#30685: Suboptimal QuerySet.distinct().count()
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: nobody
Type: | Status: new
Cleanup/optimization |

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Adam Sołtysik):

Thank you for accepting. I have added `.values('pk').order_by()` before my
`.count()` queries, but I'll be glad to be able to get rid of it.

Just to add a bit more information. The "problem" is not only with
`.distinct()`, also any `.annotate()` makes `count()` produce a subquery
with the redundant annotations inside. For example, a query like this
results in SQL that is 4 times slower for my data than the bare count:

{{{Model.objects.annotate(id2=F('id')).count()}}}
{{{(0.057) SELECT COUNT(*) FROM (SELECT "table"."id" AS Col1, "table"."id"
AS "id2" FROM "table" GROUP BY "table"."id") subquery;}}}

Adding `.values('pk')` of course solves this as well.

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:6>

Django

unread,
Aug 7, 2019, 4:36:19 AM8/7/19
to django-...@googlegroups.com
#30685: Suboptimal QuerySet.distinct().count()
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: nobody
Type: | Status: new
Cleanup/optimization |

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Carlton Gibson):

> Adding .values('pk') of course solves this as well.

Yes… and for the same reason, right? 🙂

Fancy taking on a PR as per Simon’s suggestion? (We’re happy to advise…)

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:7>

Django

unread,
Aug 7, 2019, 6:10:36 AM8/7/19
to django-...@googlegroups.com
#30685: Optimize QuerySet.count() with distinct() or annotate().
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: nobody
Type: | Status: new
Cleanup/optimization |

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Adam Sołtysik):

Actually, wouldn't it be as simple as adding these two lines after `obj =
self.clone()` here:
https://github.com/django/django/blob/05964b2198e53a8d66e34d83d9123e3051720b28/django/db/models/sql/query.py#L505-L514

{{{
obj.set_values(['pk'])
obj.clear_ordering(force_empty=True)
}}}

I tried it and it seems to be working (didn't run tests, though). But I
think I don't understand the second part of Simon's advice.

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:8>

Django

unread,
Aug 7, 2019, 11:02:15 AM8/7/19
to django-...@googlegroups.com
#30685: Optimize QuerySet.count() with distinct()
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: nobody
Type: | Status: new
Cleanup/optimization |

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Simon Charette):

The `annotation` part is tracked in #28477 so I'll rename this ticket to
only mention `distinct`.

I'm afraid we can't simply clear the values and ordering, else quite bit
of code in `get_aggregate`
[https://github.com/django/django/blob/65e86948b80262574058a94ccaae3a9b59c3faea/django/db/models/sql/query.py#L440-L444
would be rendundant].

One case that these changes don't consider is the fact that
`.values('foo').distinct()` could influence the `.count()`

e.g.

{{{#!sql
SELECT COUNT(*) FROM (SELECT DISTINCT author_id FROM book)
}}}

Is not equivalent to

{{{#!sql
SELECT COUNT(*) FROM (SELECT DISTINCT book_id FROM book)
}}}

I suggest you open a PR with your changes so CI can report what exactly
that breaks.

I'm getting three failures locally on SQLite; one related to the
aforementioned `.values().distinct()` issue, one related to
`.datetimes().distinct()` and one related to `order_by` spawning a
multiple valued relationship.

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:9>

Django

unread,
Nov 4, 2019, 3:21:57 PM11/4/19
to django-...@googlegroups.com
#30685: Optimize QuerySet.count() with distinct()
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: nobody
Type: | Status: new
Cleanup/optimization |

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Ivaylo Donchev):

Hi :)
Is this ticket free now ? Do you mind me start playing with an
implemention of it ^

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:10>

Django

unread,
Nov 4, 2019, 4:00:37 PM11/4/19
to django-...@googlegroups.com
#30685: Optimize QuerySet.count() with distinct()
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: nobody
Type: | Status: new
Cleanup/optimization |

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Adam Sołtysik):

I probably won't find time for a PR in the nearest time, so as far as I'm
concerned, feel free to take it :)

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:11>

Django

unread,
Nov 5, 2019, 6:44:28 AM11/5/19
to django-...@googlegroups.com
#30685: Optimize QuerySet.count() with distinct()
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: nobody
Type: | Status: new
Cleanup/optimization |

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Ivaylo Donchev):

Okay, thanks :) . Will play with it these days.

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:12>

Django

unread,
Nov 5, 2019, 6:44:42 AM11/5/19
to django-...@googlegroups.com
#30685: Optimize QuerySet.count() with distinct()
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: Ivaylo
Type: | Donchev
Cleanup/optimization | Status: assigned

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Ivaylo Donchev):

* status: new => assigned
* owner: nobody => Ivaylo Donchev


--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:13>

Django

unread,
May 31, 2021, 3:49:56 PM5/31/21
to django-...@googlegroups.com
#30685: Optimize QuerySet.count() with distinct()
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: (none)
Type: | Status: new
Cleanup/optimization |

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Mariusz Felisiak):

* owner: Ivaylo Donchev => (none)
* status: assigned => new


--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:14>

Django

unread,
Nov 9, 2022, 1:02:41 PM11/9/22
to django-...@googlegroups.com
#30685: Optimize QuerySet.count() with distinct()
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: (none)
Type: | Status: new
Cleanup/optimization |

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Simon Charette):

Now that #28477 has been fixed the location where this should be handled
should be made clearer as the elidable annotation part of the problem has
been solved.

This issue has a lot of overlap with #25230 so I wonder if both should be
merged (possibly have #25230 merged into this one?)

--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:15>

Django

unread,
Mar 5, 2023, 2:57:32 PM3/5/23
to django-...@googlegroups.com
#30685: Optimize QuerySet.count() with distinct()
-------------------------------------+-------------------------------------
Reporter: Adam Sołtysik | Owner: Simon
Type: | Charette
Cleanup/optimization | Status: assigned

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 1

Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Simon Charette):

* owner: (none) => Simon Charette
* needs_better_patch: 0 => 1
* has_patch: 0 => 1


* status: new => assigned


--
Ticket URL: <https://code.djangoproject.com/ticket/30685#comment:16>

Reply all
Reply to author
Forward
0 new messages