it generates SQL query which looks:
{{{SELECT COUNT('*') FROM (SELECT DISTINCT * FROM some_model) subquery;}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/25230>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* needs_better_patch: => 0
* needs_tests: => 0
* needs_docs: => 0
Old description:
> https://code.djangoproject.com/ticket/23875
> Recent pull merge
> https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233?diff=unified
> generates SQL which causes severe performance hit when using distinct
> parameter.
> Instead of getting something like:
> {{{SELECT DISTINCT COUNT('*') FROM some_model;}}}
>
> it generates SQL query which looks:
> {{{SELECT COUNT('*') FROM (SELECT DISTINCT * FROM some_model)
> subquery;}}}
New description:
Recent pull merge
https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233?diff=unified
generates SQL which causes severe performance hit when using distinct
parameter.
Instead of getting something like:
{{{SELECT DISTINCT COUNT('*') FROM some_model;}}}
it generates SQL query which looks:
{{{SELECT COUNT('*') FROM (SELECT DISTINCT * FROM some_model) subquery;}}}
Related ticket: https://code.djangoproject.com/ticket/23875
--
--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:1>
* stage: Unreviewed => Accepted
Comment:
The exact line causing the performance regression is:
https://github.com/django/django/commit/c7fd9b242d2d63406f1de6cc3204e35aaa025233
#diff-0edd853580d56db07e4020728d59e193R339
The `or self.distinct` component now forces a subquery. Might need Anssi
to confirm if that part is needed or not, and whether we can be smarter in
a restricted set of use cases.
--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:2>
Comment (by akaariai):
The above queries are identical because the distinct isn't actually doing
anything. It won't be doing anything if we can guarantee there is some
unique column in the query. For example, when there are no joins and the
primary key of the main table is included in the query, then the distinct
is redundant. We could be even smarter and detect when there are no
multivalued joins in the query (and the pk is present).
Not sure how much trouble we should push to fixing this - the original
distinct in the query is redundant, so it could be removed from the query
in the reporter's application, at least for the count query.
--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:3>
Comment (by dexity):
For our database the newly generated query (with subquery) takes 30-40 sec
to perform, the original query took 0.1 sec. Huge difference!
--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:4>
Comment (by aaugustin):
MySQL?
--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:5>
Comment (by charettes):
I do get a non negligible slowdown on PostgreSQL with a fully analyzed
table containing ~30M rows.
{{{
database=# EXPLAIN ANALYZE SELECT COUNT('*') FROM (SELECT DISTINCT * FROM
schema.table) subquery;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=6213779.77..6213779.78 rows=1 width=0) (actual
time=88610.804..88610.804 rows=1 loops=1)
-> Unique (cost=5290067.41..5858505.79 rows=28421919 width=117)
(actual time=53265.590..86513.151 rows=28176350 loops=1)
-> Sort (cost=5290067.41..5361122.20 rows=28421919 width=117)
(actual time=53265.588..76718.379 rows=28176350 loops=1)
Sort Key: table.col0, table.col1, table.col2, table.col2,
table.col3, table.col4, table.col5
Sort Method: external merge Disk: 1296744kB
-> Seq Scan on table (cost=0.00..522350.19 rows=28421919
width=117) (actual time=0.840..15830.723 rows=28176350 loops=1)
Total runtime: 88695.934 ms
(7 rows)
database=# EXPLAIN ANALYZE SELECT COUNT(*) FROM schema.table;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=593404.99..593405.00 rows=1 width=0) (actual
time=19139.072..19139.073 rows=1 loops=1)
-> Seq Scan on table (cost=0.00..522350.19 rows=28421919 width=0)
(actual time=5.683..17387.798 rows=28176350 loops=1)
Total runtime: 19139.108 ms
(3 rows)
}}}
Should we escalate this to release blocker?
--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:6>
Comment (by jarshwah):
akaariai's argument against fixing this makes sense to me.
The distinct in `Model.objects.distinct().count()` is redundant, because
the `pk` enforces that each row will be distinct anyway. For the cases
where `distinct` will actually change the result, then a subquery is
required.
So even if we decide to fix this, I don't think it warrants release
blocker status. Fix the queryset by removing the redundant `distinct`. If
we can demonstrate that changing the queryset to be correct is non-
obvious, then I think that might push the case for implementing a fix.
Really though, distinct is very rarely useful, and is indicative of a
poorly constructed query (in sql or in django). Obviously there are useful
distinct queries, but they are the exception in my experience.
--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:7>
Comment (by dexity):
We use MySQL. {{{Model.objects.distinct().count()}}} is a simple query to
make the problem obvious. There might be intermediate filters which
apparently make {{{distinct()}}} more relevant.
--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:8>
Comment (by akaariai):
I'm not against fixing this. We could in general remove distinct from
queries when we can prove it is redundant.
But, I don't think this is that high priority issue. I probably will get
to fixing this at some point, and I'm definitely willing to review
patches.
--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:9>
Comment (by shaib):
Replying to [comment:7 jarshwah]:
> Really though, distinct is very rarely useful, and is indicative of a
poorly constructed query (in sql or in django). Obviously there are useful
distinct queries, but they are the exception in my experience.
It becomes necessary if you use backreferences for FKs in your filter; SQL
is stupid and multiplies results in that case.
--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:10>
* owner: nobody => Simon Charette
* needs_better_patch: 0 => 1
* has_patch: 0 => 1
* status: new => assigned
--
Ticket URL: <https://code.djangoproject.com/ticket/25230#comment:11>