[Django] #25230: Change to Query.get_count() causes big performance hit

Django

unread,

Aug 5, 2015, 7:54:15 PM8/5/15

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit
----------------------------------------------+--------------------
Reporter: dexity | Owner: nobody
Type: Cleanup/optimization | Status: new
Component: Database layer (models, ORM) | Version: 1.8
Severity: Normal | Keywords:
Triage Stage: Unreviewed | Has patch: 0
Easy pickings: 0 | UI/UX: 0
----------------------------------------------+--------------------
https://code.djangoproject.com/ticket/23875
Recent pull merge
https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233?diff=unified
generates SQL which causes severe performance hit when using distinct
parameter.
Instead of getting something like:
{{{SELECT DISTINCT COUNT('*') FROM some_model;}}}

it generates SQL query which looks:
{{{SELECT COUNT('*') FROM (SELECT DISTINCT * FROM some_model) subquery;}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/25230>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,

Aug 5, 2015, 7:55:19 PM8/5/15

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------
Changes (by dexity):

* needs_better_patch: => 0
* needs_tests: => 0
* needs_docs: => 0

Old description:

> https://code.djangoproject.com/ticket/23875
> Recent pull merge
> https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233?diff=unified
> generates SQL which causes severe performance hit when using distinct
> parameter.
> Instead of getting something like:
> {{{SELECT DISTINCT COUNT('*') FROM some_model;}}}
>
> it generates SQL query which looks:
> {{{SELECT COUNT('*') FROM (SELECT DISTINCT * FROM some_model)
> subquery;}}}

New description:

Recent pull merge
https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233?diff=unified
generates SQL which causes severe performance hit when using distinct
parameter.
Instead of getting something like:
{{{SELECT DISTINCT COUNT('*') FROM some_model;}}}

it generates SQL query which looks:
{{{SELECT COUNT('*') FROM (SELECT DISTINCT * FROM some_model) subquery;}}}

--

--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:1>

Django

unread,

Aug 5, 2015, 10:25:29 PM8/5/15

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit

-------------------------------------+-------------------------------------
Reporter: dexity | Owner: nobody

Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------
Changes (by jarshwah):

* stage: Unreviewed => Accepted

Comment:

The exact line causing the performance regression is:
https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233
#diff-0edd853580d56db07e4020728d59e193R339

The `or self.distinct` component now forces a subquery. Might need Anssi
to confirm if that part is needed or not, and whether we can be smarter in
a restricted set of use cases.

--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:2>

Django

unread,

Aug 6, 2015, 1:24:06 AM8/6/15

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit

-------------------------------------+-------------------------------------
Reporter: dexity | Owner: nobody

Needs tests: 0 | Patch needs improvement: 0

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------

Comment (by akaariai):

The above queries are identical because the distinct isn't actually doing
anything. It won't be doing anything if we can guarantee there is some
unique column in the query. For example, when there are no joins and the
primary key of the main table is included in the query, then the distinct
is redundant. We could be even smarter and detect when there are no
multivalued joins in the query (and the pk is present).

Not sure how much trouble we should push to fixing this - the original
distinct in the query is redundant, so it could be removed from the query
in the reporter's application, at least for the count query.

--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:3>

Django

unread,

Aug 6, 2015, 1:31:25 PM8/6/15

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit

-------------------------------------+-------------------------------------
Reporter: dexity | Owner: nobody

Needs tests: 0 | Patch needs improvement: 0

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------

Comment (by dexity):

For our database the newly generated query (with subquery) takes 30-40 sec
to perform, the original query took 0.1 sec. Huge difference!

--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:4>

Django

unread,

Aug 6, 2015, 2:48:54 PM8/6/15

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit

-------------------------------------+-------------------------------------
Reporter: dexity | Owner: nobody

Needs tests: 0 | Patch needs improvement: 0

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------

Comment (by aaugustin):

MySQL?

--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:5>

Django

unread,

Aug 6, 2015, 5:56:01 PM8/6/15

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit

-------------------------------------+-------------------------------------
Reporter: dexity | Owner: nobody

Needs tests: 0 | Patch needs improvement: 0

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------

Comment (by charettes):

I do get a non negligible slowdown on PostgreSQL with a fully analyzed
table containing ~30M rows.

{{{
database=# EXPLAIN ANALYZE SELECT COUNT('*') FROM (SELECT DISTINCT * FROM
schema.table) subquery;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=6213779.77..6213779.78 rows=1 width=0) (actual
time=88610.804..88610.804 rows=1 loops=1)
-> Unique (cost=5290067.41..5858505.79 rows=28421919 width=117)
(actual time=53265.590..86513.151 rows=28176350 loops=1)
-> Sort (cost=5290067.41..5361122.20 rows=28421919 width=117)
(actual time=53265.588..76718.379 rows=28176350 loops=1)
Sort Key: table.col0, table.col1, table.col2, table.col2,
table.col3, table.col4, table.col5
Sort Method: external merge Disk: 1296744kB
-> Seq Scan on table (cost=0.00..522350.19 rows=28421919
width=117) (actual time=0.840..15830.723 rows=28176350 loops=1)
Total runtime: 88695.934 ms
(7 rows)

database=# EXPLAIN ANALYZE SELECT COUNT(*) FROM schema.table;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=593404.99..593405.00 rows=1 width=0) (actual
time=19139.072..19139.073 rows=1 loops=1)
-> Seq Scan on table (cost=0.00..522350.19 rows=28421919 width=0)
(actual time=5.683..17387.798 rows=28176350 loops=1)
Total runtime: 19139.108 ms
(3 rows)
}}}

Should we escalate this to release blocker?

--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:6>

Django

unread,

Aug 6, 2015, 10:13:53 PM8/6/15

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit

-------------------------------------+-------------------------------------
Reporter: dexity | Owner: nobody

Needs tests: 0 | Patch needs improvement: 0

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------

Comment (by jarshwah):

akaariai's argument against fixing this makes sense to me.

The distinct in `Model.objects.distinct().count()` is redundant, because
the `pk` enforces that each row will be distinct anyway. For the cases
where `distinct` will actually change the result, then a subquery is
required.

So even if we decide to fix this, I don't think it warrants release
blocker status. Fix the queryset by removing the redundant `distinct`. If
we can demonstrate that changing the queryset to be correct is non-
obvious, then I think that might push the case for implementing a fix.

Really though, distinct is very rarely useful, and is indicative of a
poorly constructed query (in sql or in django). Obviously there are useful
distinct queries, but they are the exception in my experience.

--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:7>

Django

unread,

Aug 7, 2015, 1:15:12 AM8/7/15

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit

-------------------------------------+-------------------------------------
Reporter: dexity | Owner: nobody

Needs tests: 0 | Patch needs improvement: 0

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------

Comment (by dexity):

We use MySQL. {{{Model.objects.distinct().count()}}} is a simple query to
make the problem obvious. There might be intermediate filters which
apparently make {{{distinct()}}} more relevant.

--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:8>

Django

unread,

Aug 7, 2015, 3:49:27 AM8/7/15

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit

-------------------------------------+-------------------------------------
Reporter: dexity | Owner: nobody

Needs tests: 0 | Patch needs improvement: 0

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------

Comment (by akaariai):

I'm not against fixing this. We could in general remove distinct from
queries when we can prove it is redundant.

But, I don't think this is that high priority issue. I probably will get
to fixing this at some point, and I'm definitely willing to review
patches.

--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:9>

Django

unread,

Aug 24, 2015, 9:10:17 AM8/24/15

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit

-------------------------------------+-------------------------------------
Reporter: dexity | Owner: nobody

Needs tests: 0 | Patch needs improvement: 0

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------

Comment (by shaib):

Replying to [comment:7 jarshwah]:

> Really though, distinct is very rarely useful, and is indicative of a
poorly constructed query (in sql or in django). Obviously there are useful
distinct queries, but they are the exception in my experience.

It becomes necessary if you use backreferences for FKs in your filter; SQL
is stupid and multiplies results in that case.

--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:10>

Django

unread,

Mar 5, 2023, 2:57:38 PM3/5/23

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit

-------------------------------------+-------------------------------------
Reporter: dexity | Owner: Simon
Type: | Charette
Cleanup/optimization | Status: assigned

Component: Database layer | Version: 1.8
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 1

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------
Changes (by Simon Charette):

* owner: nobody => Simon Charette
* needs_better_patch: 0 => 1
* has_patch: 0 => 1
* status: new => assigned

--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:11>

Django

unread,

Jan 17, 2025, 7:01:29 PMJan 17

to django-...@googlegroups.com

#25230: Change to Query.get_count() causes big performance hit

-------------------------------------+-------------------------------------
Reporter: dexity | Owner: (none)
Type: | Status: new
Cleanup/optimization |

Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------
Changes (by Simon Charette):

* owner: Simon Charette => (none)
* status: assigned => new

Comment:

Not actively working on this.
--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:12>

Reply all

Reply to author

Forward