{{{#!python
Model.objects.all().distinct('foo').order_by('bar')
}}}
The above generates the following SQL:
{{{#!sql
SELECT DISTINCT ON ("app_model"."foo") <lots of columns here>
FROM "app_model"
ORDER BY "app_model"."bar" ASC;
}}}
I am not sure about all backends however the above syntax is not allowed
in PostgreSQL which produces the following error:
{{{#!sql
ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY
expressions
}}}
Here are [http://www.postgresql.org/docs/9.0/static/sql-select.html#SQL-
DISTINCT PostgreSQL docs] explaining why that is not allowed:
DISTINCT ON ( expression [, ...] ) keeps only the first row of each set
of rows where the given expressions evaluate to equal. [...] Note that the
"first row" of each set is unpredictable unless ORDER BY is used to ensure
that the desired row appears first. [...] The DISTINCT ON expression(s)
must match the leftmost ORDER BY expression(s).
This ticket proposes to use subqueries in such situations which would use
SQL:
{{{#!sql
SELECT *
FROM (
SELECT DISTINCT ON ("app_model"."foo") <lots of columns here>
FROM "app_model"
) result
ORDER BY "app_model"."bar" ASC;
}}}
The above is perfectly valid SQL and produces expected results (please
note that {{{ORDER_BY}}} is in the outer query to guarantee that distinct
results are correctly sorted).
I created a simple [https://gist.github.com/miki725/ce26d8b6ee2f3075884a
patch] by overwriting few things in {{{SQLCompiler.as_sql()}}} which seems
to work pretty well. The patch only creates subquery when the above dilema
is encountered which should not have any negative side-effects on existing
queries (since such queries were not allowed by SQL). The patch also works
on the {{{.count()}}} queries since Django then strips any ordering hence
the subquery is never created.
--
Ticket URL: <https://code.djangoproject.com/ticket/24218>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* needs_better_patch: => 0
* needs_tests: => 0
* needs_docs: => 0
Old description:
New description:
I created a simple proof-of-concept
[https://gist.github.com/miki725/ce26d8b6ee2f3075884a patch] by
overwriting few things in {{{SQLCompiler.as_sql()}}} which seems to work
pretty well. The patch only creates subquery when the above dilema is
encountered which should not have any negative side-effects on existing
queries (since such queries were not allowed by SQL). The patch also works
on the {{{.count()}}} queries since Django then strips any ordering hence
the subquery is never created.
--
--
Ticket URL: <https://code.djangoproject.com/ticket/24218#comment:1>
* needs_docs: 0 => 1
* needs_tests: 0 => 1
* stage: Unreviewed => Accepted
Comment:
Thanks for the patch! The feature idea generally makes sense to me -
taking something that currently results in an error and making it instead
return sane results. Tentatively accepting based on that. Some thoughts:
1. We need to check what the other core backends currently do with this
query. If there are backends in which that query is currently an error,
but this patch would change their results, that might be a problem.
2. If I understand the Postgres docs correctly, your proposed subquery SQL
results in an undefined/unpredictable selection of which actual row is
chosen of each group where the `DISTINCT ON` query returns the same value.
I guess if this is a problem, the answer is "make your `ORDER BY` match
the `DISTINCT ON`", but it seems like a non-obvious subtlety that might
bite people.
If nobody else sees any blockers that I'm missing, the things that would
still be needed here are:
a) Add a test demonstrating the new behavior.
b) Add/update the documentation as needed (I think this is a behavior that
ought to be mentioned in the docs somewhere).
c) Turn it into a pull request so the CI runs on it and we can see if it
breaks any existing tests.
Once those are done there may be some comments on code style or
implementation choices. The POC code seems clear enough, but I'd defer to
others who know the ORM better on whether it's actually implemented at the
right level / in the right place.
--
Ticket URL: <https://code.djangoproject.com/ticket/24218#comment:2>
Comment (by ramiro):
Per the Postgres documentation snipped, shouldn't the `ORDER BY` actually
be applied to the sub-query to ensure a predictable result?
Also, make sure to test behavior with both bare `distinct()` calls and
calls where joins are involved (e.g.
`Model.dictinct('fkfield__related_model_field')`)
--
Ticket URL: <https://code.djangoproject.com/ticket/24218#comment:3>
Comment (by carljm):
Replying to [comment:3 ramiro]:
> Per the Postgres documentation snipped, shouldn't the `ORDER BY`
actually be applied to the sub-query to ensure a predictable result?
I think the idea is that in this case the user has supplied an `ORDER BY`
which _can't_ be applied to the sub-query (because it doesn't match the
`DISTINCT ON`), and so we accept unpredictable results in order to provide
some results which match what the user requested.
This business of unpredictable results is what makes me most uncomfortable
with the patch.
Would it be better to auto-generate an `ORDER BY` for the subquery
matching the `DISTINCT ON`?
If we did that, the subquery isn't even needed: we could get the same
results by just automatically prepending the `ORDER BY` such that it
matches the `DISTINCT ON`.
Not sure if that's a good idea, but it seems like perhaps at least as good
an idea as the one proposed here.
--
Ticket URL: <https://code.djangoproject.com/ticket/24218#comment:4>
Comment (by carljm):
Actually that wouldn't give the same results, never mind.
--
Ticket URL: <https://code.djangoproject.com/ticket/24218#comment:5>
Comment (by miki725):
Thank you for such quick responses!
This business of unpredictable results is what makes me most
uncomfortable with the patch.
I completely agree with that statement. Let me explain my-use case and
then maybe possible solutions.
{{{#!python
class ModelFoo(models.Model):
bars = models.ManyToManyField(ModelBar, related_name='foos')
class ModelBar(models.Model):
latlon = models.PointField(spatial_index=True)
# query foos where they have related bars within certain distance
# and sort by increasing distance
(ModelFoo.objects
# filter by distance
.filter(bars__latlon__distance_lte=(point, Distance(mi=1)))
# add distance annotation
.distance(point, field_name='bars__latlon')
# sort by distance
.extra(order_by=['distance'])
# distinct foos
.distinct('id'))
}}}
The reason I needed distinct here is because otherwise whenever foo had
multiple bars, I would get duplicate foo models due to the join. Since in
my case I simply used bars for filtering foos, I didnt care about
"unpredictable" order at all since I only cared about foos. As for
sorting, initially I also thought that {{{ORDER BY}}} should be inside the
inner query and {{{DISTINCT}}} on the outer query however for some reason
when I did that, foos were not sorted by distance anymore.
This however is just my use-case and I do see how this can cause issues in
other scenarios since you cannot be sure what comes first in the
{{{DISTINCT}}}.
Here are some possible solutions:
1. It seems that whenever a pattern of many-guys is used to filter single
guys on one/many-to-many and sorting is required, this pattern might be
useful. So maybe this functionality should only be enabled in those cases.
2. What about if this behavior will be configurable? More specifically
what about if the user will be able to control:
* explicitly enable this functionary since otherwise Django cannot
guarantee data integrity for all use-cases. This will force the user to
explicitly acknowledge they want to do this hence eliminating the burden
from Django to always provide data-integrity (and explain all the cases
when it is not provided).
* what goes to inner and outer subquery ({{{ORDER BY}}} or
{{{DISTINCT}}}). Maybe even allow {{{ORDER BY}}} in both queries. So the
user will pick some sort columns for both inner and outer query hence
solving "predictability" of {{{DISTINCT}}} and allowing to sort overall
results.
I do like the idea of making these things customizable however not sure
what would be an API to do those customizations. Is there precedent like
that in Django ORM for query customizations?
3. Perhaps the right solution is try not to solve my particular use-case
but provide a more generic solution to use nested queries. Maybe use API
something like:
{{{#!python
Model.objects.nested(
Model.objects.filter(...).distinct(...).order_by(...) # inner
).order_by(...) # outer
}}}
Let me know your thoughts.
--
Ticket URL: <https://code.djangoproject.com/ticket/24218#comment:6>
Comment (by akaariai):
We need to go wirh option 2, specifically the solution where the inner and
outer query have differen order by clauses. This way you can order the
results the way you want, yet pick the distinct element as you wish.
I'vw thought of a solution something like this
{{{
qs.distict('foo').order_by('foo').barrier().order_by('bar')
}}}
The idea is that the barrier call effectively forces everything left of
the call to go into subquery, then things after the barrier go into outer
query. This construct would be hugely useful in other use cases, too.
--
Ticket URL: <https://code.djangoproject.com/ticket/24218#comment:7>
Comment (by carljm):
@miki725 - In a case like yours, I've generally just done `.distinct()`
instead of `.distinct('id')`. Since you're only selecting columns from
foo, that should give the same results, and it makes Postgres happy with
any `ORDER BY` clause you like, with no need for a subquery. So I'm not
sure that your particular case actually offers a use case for this feature
- in this case our support for `DISTINCT ON` is actually an "attractive
nuisance."
@akaariai - that API doesn't feel quite right to me. `order_by()` in every
other case determines the ordering of the returned results. Just because
in this case we are also applying a SQL `ORDER BY` clause in the subquery
(an implementation detail) doesn't mean that overloading `order_by()` is
the right API for it. For addressing this particular use case, I think a
more intuitive API would be to introduce an ordering argument to
`.distinct()` to allow customization of the ordering used to pick the
distinct row. Of course, that's a less general API - but I think we would
need a more thorough list of the cases where subqueries are needed in
order to see what a generic subqueries API should look like (e.g. are you
intending to limit this API to cases where the outer query is simply a
`SELECT *` from the subquery?). It feels to me that a generic subqueries
API should involve passing one queryset to another, or wrapping one in
another, much like the current case where passing a queryset to an `__in`
filter results in a subquery.
--
Ticket URL: <https://code.djangoproject.com/ticket/24218#comment:8>
Comment (by akaariai):
Yeah, the barrier() API is more a generic tool. No objection to adding a
specific API for distinct.
--
Ticket URL: <https://code.djangoproject.com/ticket/24218#comment:9>
* cc: Harro (added)
Comment:
This is exactly what I need right now.
My proposal would be to add the distinct ordering to the distinct itself
as a kwarg.
{{{#!python
Model.objects.all().distinct('foo', order_by=('-baz',)).order_by('bar')
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/24218#comment:10>
Comment (by Giovanni Totaro - aka Vanni):
**Workaround with
[https://docs.djangoproject.com/en/dev/ref/models/expressions/#subquery-
expressions Subquery() expressions]** (Django >= 1.11):
The same result of the really nice proposal by Carl and Harro:
{{{ Model.objects.all().distinct('foo',
order_by=('-baz',)).order_by('bar') }}}
can be currently obtained with something like this:
{{{
from django.db.models import Subquery
Model.objects.filter(
pk__in=Subquery(
Model.objects.all().distinct('foo').order_by('-baz').values('pk')
)
).order_by('bar')
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/24218#comment:11>