[Django] #33682: SQL generation bug in `.distinct()` when supplied fields go through multiple many-related tables

Django

unread,

May 5, 2022, 2:15:30 PM5/5/22

to django-...@googlegroups.com

#33682: SQL generation bug in `.distinct()` when supplied fields go through
multiple many-related tables
-------------------------------------+-------------------------------------
Reporter: Robert | Owner: nobody
Leach |
Type: Bug | Status: new
Component: Database | Version: 3.2
layer (models, ORM) |
Severity: Normal | Keywords: sql, distinct,
Triage Stage: | Has patch: 0
Unreviewed |
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
-------------------------------------+-------------------------------------
I have a rather complex database and an advanced search interface that
creates complex queries. It’s been working great for over a year now.

I recently added a feature to count distinct related table records in the
joined results. When I added fields to these many-related tables to
`.distinct()` (and to `.order_by()`), I couldn’t get the test to execute
without hitting an `InvalidColumnReference` error. And though I’m
supplying the same expanded fields list to `.order_by()` that I am to
`.distinct()`, the error claims that `SELECT DISTINCT ON expressions must
match initial ORDER BY expressions`…

When I print the SQL, the place where it notes a difference has a weird
`T8` reference where the model name should be in an `order by` clause.
The corresponding `distinct` clause has the full table name instead of the
reference, which is what I suspect is triggering the exception.

I was able to create a set of toy models and a test that minimally
reproduces the exception...

toymodels.py:
{{{
from django.db.models import Model, CharField, AutoField, ForeignKey,
ManyToManyField, CASCADE

class TestPeak(Model):
id = AutoField(primary_key=True)
name = CharField(max_length=10)
compounds = ManyToManyField(
to="TestCompound",
related_name="testpeaks",
)
class Meta:
verbose_name = "testpeak"
verbose_name_plural = "testpeaks"
ordering = ["name"]

class TestCompound(Model):
id = AutoField(primary_key=True)
name = CharField(max_length=10)
class Meta:
verbose_name = "testcompound"
verbose_name_plural = "testcompounds"
ordering = ["name"]

class TestSynonym(Model):
name = CharField(max_length=10, primary_key=True)
compound = ForeignKey(
TestCompound, related_name="testsynonyms", on_delete=CASCADE
)
class Meta:
verbose_name = "testsynonym"
verbose_name_plural = "testsynonyms"
ordering = ["compound", "name"]
}}}

test_bug.py:
{{{
from DataRepo.tests.tracebase_test_case import TracebaseTestCase
from DataRepo.models.toymodels import TestPeak, TestCompound, TestSynonym
from django.db.models import Q

class DjangoSQLBug(TracebaseTestCase):
maxDiff = None

@classmethod
def setUpTestData(cls):
TestCompound.objects.create(name="testcpd")
cpd = TestCompound.objects.get(id__exact=1)
TestSynonym.objects.create(name="testsyn",compound=cpd)
TestPeak.objects.create(name="testpk")
pk = TestPeak.objects.get(id__exact=1)
pk.compounds.add(cpd)

def test_mm_om_query(self):
q_exp = Q(name__iexact="testpk")
distinct_fields = ['name', 'pk',
'compounds__testsynonyms__compound', 'compounds__testsynonyms__name',
'compounds__testsynonyms__pk', 'compounds__name', 'compounds__pk']
qs =
TestPeak.objects.filter(q_exp).order_by(*distinct_fields).distinct(*distinct_fields)
self.assertEqual(qs.count(), 1)
}}}

`python manage.py test` output:
{{{
Creating test database for alias 'default'...
Creating test database for alias 'validation'...
System check identified no issues (0 silenced).
E
======================================================================
ERROR: test_mm_om_query (DataRepo.tests.sqlbugtest.test_bug.DjangoSQLBug)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/rleach/PROJECT-
local/TRACEBASE/tracebase/.venv/lib/python3.9/site-
packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
psycopg2.errors.InvalidColumnReference: SELECT DISTINCT ON expressions
must match initial ORDER BY expressions
LINE 1: ...peak"."id", "DataRepo_testsynonym"."compound_id", "DataRepo_...
^

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/Users/rleach/PROJECT-
local/TRACEBASE/tracebase/DataRepo/tests/sqlbugtest/test_bug.py", line 21,
in test_mm_om_query
self.assertEqual(qs.count(), 1)
File "/Users/rleach/PROJECT-
local/TRACEBASE/tracebase/.venv/lib/python3.9/site-
packages/django/db/models/query.py", line 412, in count
return self.query.get_count(using=self.db)
File "/Users/rleach/PROJECT-
local/TRACEBASE/tracebase/.venv/lib/python3.9/site-
packages/django/db/models/sql/query.py", line 519, in get_count
number = obj.get_aggregation(using, ['__count'])['__count']
File "/Users/rleach/PROJECT-
local/TRACEBASE/tracebase/.venv/lib/python3.9/site-
packages/django/db/models/sql/query.py", line 504, in get_aggregation
result = compiler.execute_sql(SINGLE)
File "/Users/rleach/PROJECT-
local/TRACEBASE/tracebase/.venv/lib/python3.9/site-
packages/django/db/models/sql/compiler.py", line 1175, in execute_sql
cursor.execute(sql, params)
File "/Users/rleach/PROJECT-
local/TRACEBASE/tracebase/.venv/lib/python3.9/site-
packages/django/db/backends/utils.py", line 66, in execute
return self._execute_with_wrappers(sql, params, many=False,
executor=self._execute)
File "/Users/rleach/PROJECT-
local/TRACEBASE/tracebase/.venv/lib/python3.9/site-
packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
return executor(sql, params, many, context)
File "/Users/rleach/PROJECT-
local/TRACEBASE/tracebase/.venv/lib/python3.9/site-
packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
File "/Users/rleach/PROJECT-
local/TRACEBASE/tracebase/.venv/lib/python3.9/site-
packages/django/db/utils.py", line 90, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/Users/rleach/PROJECT-
local/TRACEBASE/tracebase/.venv/lib/python3.9/site-
packages/django/db/backends/utils.py", line 84, in _execute
return self.cursor.execute(sql, params)
django.db.utils.ProgrammingError: SELECT DISTINCT ON expressions must
match initial ORDER BY expressions
LINE 1: ...peak"."id", "DataRepo_testsynonym"."compound_id", "DataRepo_...
^

----------------------------------------------------------------------
Ran 1 test in 0.018s

FAILED (errors=1)
Destroying test database for alias 'default'...
Destroying test database for alias 'validation'...
gen-rl-macbookair[2022-05-05 12:34:44]:~/PROJECT-
local/TRACEBASE/tracebase$
}}}

I posted on Django Users about this, if you would like more information:

https://forum.djangoproject.com/t/is-this-a-bug-in-djangos-sql-creation-
through-multiple-many-to-many-tables/13508/9

We’re on Django 3.2 using Postgres.

--
Ticket URL: <https://code.djangoproject.com/ticket/33682>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,

May 5, 2022, 11:20:15 PM5/5/22

to django-...@googlegroups.com

#33682: SQL generation bug in `.distinct()` when supplied fields go through
multiple many-related tables
-------------------------------------+-------------------------------------

Keywords: sql, distinct, | Triage Stage:

| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Simon Charette):

The `QuerySet.distinct` and `order_by` methods
[https://docs.djangoproject.com/en/4.0/ref/models/querysets/#django.db.models.query.QuerySet.order_by
don't resolve references to foreign key the same way] when referenced
models define a `Meta.ordering`.

You can get a similar crash with way less code by doing

{{{#!python
list(TestSynonym.objects.distinct('compound').order_by('compound'))
}}}

As the above actually translates to

{{{#!python
list(TestSynonym.objects.distinct('compound').order_by('compound__name'))
}}}

Due to `TestCompound.Meta.ordering = ('name',)`

------

I would argue that this is ''invalid'' but I'm curious to hear what others
have to say as the resulting error is definitely cryptic there's possibly
ways we could do things better.

Given `DISTINCT ON` usage requires a matching `ORDER BY` I see three
options

1. Do nothing, users should learn the gotchas of `Meta.ordering` before
using it. We've got a precedent against that by making aggregations ignore
`Meta.ordering` in the recent versions.
2. Make `distinct('related_with_ordering')` behave like `order_by` with
regards to ordering expansion to make both APIs coherent given they a
dependent (resulting clause could match). This will require a deprecation
period and spreads the arguable bad related ordering expansion pattern to
another method.
3. Refuse the temptation to guess and make
`distinct('related_with_ordering')` error out loudly so at least users are
not presented this cryptic error. This maintains the arguably confusing
mismatch between both APIs but we stop the spread of `ordering` expansion.

--
Ticket URL: <https://code.djangoproject.com/ticket/33682#comment:1>

Django

unread,

May 6, 2022, 11:20:16 AM5/6/22

to django-...@googlegroups.com

#33682: SQL generation bug in `.distinct()` when supplied fields go through
multiple many-related tables
-------------------------------------+-------------------------------------
Reporter: Robert Leach | Owner: nobody
Type: Bug | Status: new
Component: Database layer | Version: 3.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: sql, distinct, | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Robert Leach):

I figured I might have been missing something. I actually have code to
avoid the gotcha, but apparently, I only did it for the model from which
the query came from (i.e. the "root model"):
{{{
# If there are any split_rows manytomany related tables, we will
need to prepend the ordering (and pk) fields of
# the root model
if len(distinct_fields) > 0:
distinct_fields.insert(0, "pk")
tmp_distincts =
self.getOrderByFields(model_name=self.rootmodel.__name__)
tmp_distincts.reverse()
for fld_nm in tmp_distincts:
distinct_fields.insert(0, fld_nm)

if order_by is not None and order_by not in distinct_fields:
distinct_fields.insert(0, order_by)
}}}

{{{
def getOrderByFields(self, mdl_inst_nm=None, model_name=None):
"""
Retrieves a model's default order by fields, given a model
instance name.
"""
... brevity edit ...

# Get a model object
mdl = apps.get_model("DataRepo", mdl_nm)

if "ordering" in mdl._meta.__dict__:
return mdl._meta.__dict__["ordering"]
return []
}}}

I know this ticket system is not the place for getting support, but if
you'll indulge me... would prepending all the meta ordering fields avoid
the gotcha if I inserted the meta ordering field(s) before any other
fields? (In my use case, the order is unimportant - only the distinct
is). I'd found that it did in the case of the "root model".

--
Ticket URL: <https://code.djangoproject.com/ticket/33682#comment:2>

Django

unread,

May 6, 2022, 12:11:00 PM5/6/22

to django-...@googlegroups.com

#33682: SQL generation bug in `.distinct()` when supplied fields go through
multiple many-related tables
-------------------------------------+-------------------------------------
Reporter: Robert Leach | Owner: nobody
Type: Bug | Status: new
Component: Database layer | Version: 3.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: sql, distinct, | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Robert Leach):

I believe that the note at the bottom of the distinct section of this doc:

[https://docs.djangoproject.com/en/4.0/ref/models/querysets/#distinct]

answers my above question, but to dissect the verbiage it tells you *what*
to do without the *why* fully described. It tells you that adding the
`_id` field would be necessary to **make the expressions match** (which is
"a" reason why to add the `id` field), but it doesn't explicitly explain
why that makes them match, or say whether the `id` field is precisely
required or if any field will do.

If I'd better understood the *why* in that doc, I might have coded the
right solution to the gotcha and not overlooked the other cases.

My updated understanding is that it seems that the reason *a* related
model field is necessary is because the related model "field" in the model
definition that links to the related model isn't a "field". It's a
reference that gets turned into a field that by default uses the
`meta.ordering`. (I didn't even notice that the distinct clause had
`compound_id` and the order by clause had `name` in that position.) So
I'm guessing that *any*(?) related model field in front of a (non-field)
related model reference (whether it's at the beginning of the distinct
list or "just before" the non-field related model reference) would solve
the issue? Or will *any* explicit inclusion of a non-field related model
reference cause the problem? **Or** perhaps even explicit inclusion of
such a (non) field would cause the problem.

I think these are areas in which the doc could be improved just a bit
more. Understanding the /why/ **better**, I think, could be helpful to
avoid these pitfals, and also help to understand an otherwise cryptic
error message.

--
Ticket URL: <https://code.djangoproject.com/ticket/33682#comment:3>

Django

unread,

May 9, 2022, 1:29:47 AM5/9/22

to django-...@googlegroups.com

#33682: SQL generation bug in `.distinct()` when supplied fields go through
multiple many-related tables
-------------------------------------+-------------------------------------
Reporter: Robert Leach | Owner: nobody
Type: Bug | Status: new
Component: Database layer | Version: 3.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: sql, distinct, | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Mariusz Felisiak):

Robert, Can you propose a documentation improvement via GitHub's PR?

--
Ticket URL: <https://code.djangoproject.com/ticket/33682#comment:4>

Django

unread,

May 9, 2022, 1:31:47 PM5/9/22

to django-...@googlegroups.com

#33682: SQL generation bug in `.distinct()` when supplied fields go through
multiple many-related tables
-------------------------------------+-------------------------------------
Reporter: Robert Leach | Owner: nobody
Type: Bug | Status: new
Component: Database layer | Version: 3.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: sql, distinct, | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Robert Leach):

Replying to [comment:4 Mariusz Felisiak]:

> Robert, Can you propose a documentation improvement via GitHub's PR?

I can certainly give it a shot, though I'm not the best writer when it
comes to brevity.

Also, I don't have a deep understanding of the related Django code, so my
understanding could be empirically correct, but technically flawed (like
Bohr's model of the atom). For example, when the same field reference is
supplied to both `.order_by()` and `.distinct()`, such as in Simon's
example:

{{{
TestSynonym.objects.distinct('compound').order_by('compound')
}}}

...why is the inserted field in each case not coordinated? Why does the
conversion from the reference (`compound`) differ? Simon says it resolves
to:

{{{

list(TestSynonym.objects.distinct('compound').order_by('compound__name'))
}}}

but based on my debug output of another test using that above call, that's
imprecise. It shows:

{{{
QUERY: SELECT DISTINCT ON ("DataRepo_testsynonym"."compound_id")
"DataRepo_testsynonym"."name", "DataRepo_testsynonym"."compound_id" FROM
"DataRepo_testsynonym" INNER JOIN "DataRepo_testcompound" ON
("DataRepo_testsynonym"."compound_id" = "DataRepo_testcompound"."id")
ORDER BY "DataRepo_testcompound"."name" ASC
}}}

which means that the distinct field resolution and order by field
resolutions are:

- `distinct`: `compound_id`
- `order_by`: `name`

When those methods are assessed individually, I understand why those
fields are the preferred solution (e.g. the meta ordering may not be
unique), but given that `distinct` requires the same fields be present at
the beginning of the order-by, I don't know what prevents the code to be
written to have those fields be resolved in a way that is copacetic.
Like, why not convert the reference into 2 additional fields that
together, meet both requirements (`name` AND `compound_id`)? Order-by
would be satisfied and distinct would be satisfied. Or... in my case,
`name` is unique, so distinct could resolve to the meta ordering without
issue...

Is there a technical reason the code doesn't already do this?

--
Ticket URL: <https://code.djangoproject.com/ticket/33682#comment:5>

Django

unread,

May 10, 2022, 1:22:01 AM5/10/22

to django-...@googlegroups.com

#33682: SQL generation bug in `.distinct()` when supplied fields go through
multiple many-related tables
-------------------------------------+-------------------------------------
Reporter: Robert Leach | Owner: nobody
Type: Bug | Status: new
Component: Database layer | Version: 3.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: sql, distinct, | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Mariusz Felisiak):

Replying to [comment:5 Robert Leach]:

> When those methods are assessed individually, I understand why those
fields are the preferred solution (e.g. the meta ordering may not be
unique), but given that `distinct` requires the same fields be present at
the beginning of the order-by, I don't know what prevents the code to be
written to have those fields be resolved in a way that is copacetic.
Like, why not convert the reference into 2 additional fields that
together, meet both requirements (`name` AND `compound_id`)? Order-by
would be satisfied and distinct would be satisfied. Or... in my case,
`name` is unique, so distinct could resolve to the meta ordering without
issue...
>
> Is there a technical reason the code doesn't already do this?

This would be another logic that's implicit and probably unexpected by
users (at least in some cases). As far as I'm aware it's preferable to
fail loudly even with a `ProgrammingError`. I'm not sure how to improve
this note, maybe it's enough to add a correct example:

{{{#!diff
diff --git a/docs/ref/models/querysets.txt b/docs/ref/models/querysets.txt
index a9da1dcf7e..891b8255b0 100644
--- a/docs/ref/models/querysets.txt
+++ b/docs/ref/models/querysets.txt
@@ -565,7 +565,9 @@ Examples (those after the first will only work on
PostgreSQL)::
...wouldn't work because the query would be ordered by ``blog__name``
thus
mismatching the ``DISTINCT ON`` expression. You'd have to explicitly
order
by the relation ``_id`` field (``blog_id`` in this case) or the
referenced
- one (``blog__pk``) to make sure both expressions match.
+ one (``blog__pk``) to make sure both expressions match::
+
+ Entry.objects.order_by('blog_id').distinct('blog_id')

``values()``
~~~~~~~~~~~~
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/33682#comment:6>

Django

unread,

May 10, 2022, 12:21:57 PM5/10/22

to django-...@googlegroups.com

#33682: SQL generation bug in `.distinct()` when supplied fields go through
multiple many-related tables
-------------------------------------+-------------------------------------
Reporter: Robert Leach | Owner: nobody
Type: Bug | Status: new
Component: Database layer | Version: 3.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: sql, distinct, | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Robert Leach):

Replying to [comment:6 Mariusz Felisiak]:

> This would be another logic that's implicit and probably unexpected by
users (at least in some cases). As far as I'm aware it's preferable to
fail loudly even with a `ProgrammingError`. I'm not sure how to improve
this note, maybe it's enough to add a correct example:
>
> {{{#!diff
> diff --git a/docs/ref/models/querysets.txt
b/docs/ref/models/querysets.txt
> index a9da1dcf7e..891b8255b0 100644
> --- a/docs/ref/models/querysets.txt
> +++ b/docs/ref/models/querysets.txt
> @@ -565,7 +565,9 @@ Examples (those after the first will only work on
PostgreSQL)::
> ...wouldn't work because the query would be ordered by
``blog__name`` thus
> mismatching the ``DISTINCT ON`` expression. You'd have to
explicitly order
> by the relation ``_id`` field (``blog_id`` in this case) or the
referenced
> - one (``blog__pk``) to make sure both expressions match.
> + one (``blog__pk``) to make sure both expressions match::
> +
> + Entry.objects.order_by('blog_id').distinct('blog_id')
>
> ``values()``
> ~~~~~~~~~~~~
> }}}

I'm not sure that would have been enough for me to have anticipated or
correctly interpreted the error I encountered. I just corrected our code
base and I now feel I have a clearer picture of how I got here... Let me
explain why. I'd thought my expressions did match:

{{{

distinct_fields = ['name', 'pk',
'compounds__testsynonyms__compound', 'compounds__testsynonyms__name',
'compounds__testsynonyms__pk', 'compounds__name', 'compounds__pk']
qs =
TestPeak.objects.filter(q_exp).order_by(*distinct_fields).distinct(*distinct_fields)
}}}

I was like "but I'm supplying the exact same fields for both order_by and
distinct".

What I think I'd missed was that "referenced field" in the phrase
`explicitly order by the relation _id or referenced field` doesn't just
apply to related model fields in the queried("/root") model (e.g.
`Entry`'s fields such as `blog` in
`Entry.objects.order_by('blog').distinct('blog')`, but applies to all
related models through all my joins - **and** (indirectly/potentially)
their `meta.ordering` values... because I wasn't explicitly adding the
`compounds__testsynonyms__compound` "field" above in my codebase where I
ran into this error - I was implicitly adding them -> I had written code
to straightforwardly grab and prepend the meta.ordering fields in order to
satisfy distinct's requirements and still have things ordered nicely - and
that's where I was running into this problem, because it didn't occur to
me that the fields in meta.ordering could be changed differently by
`order_by` and `distinct`. meta.ordering lets you add those "blog" fields
and I wasn't the developer that had added them to our code base, but I
didn't even think to check whether those fields I was adding would be
subject to the distinct/order_by gotcha.

I mean - all the information is there in the doc for me to have avoided
this. You just have to piece together doc info on `meta.ordering`,
`order_by`, and `distinct` and extend the examples to realize it applies
to joined models' fields - not just the root model. It was just a bit too
complex for me at my level of django experience to anticipate correctly
the implications of what I was doing.

**Maybe the best thing to do would be to anticipate the motivations that
lead me down this path...** I wanted to use distinct on a join query,
which meant that I needed to add fields to order_by that I didn't really
care about - I just needed to add them to meet the requirement that the
fields matched. BUT - I didn't want to change the desired ordering - so I
needed to explicitly add the default orderings so that the IDs wouldn't
override them. So maybe the doc should just point out that if you're
adding fields to order_by solely to be able to use distinct, but you don't
want to change the default ordering specified in meta.ordering - you need
to consider the fact that the meta.ordering fields explicitly added can
make you run into the order_by/distinct gotcha.

--
Ticket URL: <https://code.djangoproject.com/ticket/33682#comment:7>

Reply all

Reply to author

Forward