With the following db model:
{{{
@python_2_unicode_compatible
class CPUJob(models.Model):
account = models.ForeignKey(Account, blank=True, null=True)
username = models.CharField(max_length=50, blank=True, null=True)
project = models.ForeignKey(Project, null=True, blank=True)
machine = models.ForeignKey(Machine, blank=True, null=True)
date = models.DateField(db_index=True, blank=True, null=True)
queue = models.ForeignKey(Queue, blank=True, null=True)
cpu_usage = models.BigIntegerField(blank=True, null=True)
mem = models.BigIntegerField(blank=True, null=True)
vmem = models.BigIntegerField(blank=True, null=True)
ctime = models.DateTimeField(blank=True, null=True)
qtime = models.DateTimeField(blank=True, null=True)
etime = models.DateTimeField(blank=True, null=True)
start = models.DateTimeField(blank=True, null=True)
act_wall_time = models.BigIntegerField(blank=True, null=True)
est_wall_time = models.BigIntegerField(blank=True, null=True)
jobid = models.CharField(max_length=50, blank=True, null=True,
unique=True)
cores = models.BigIntegerField(blank=True, null=True)
list_mem = models.BigIntegerField(blank=True, null=True)
list_pmem = models.BigIntegerField(blank=True, null=True)
list_vmem = models.BigIntegerField(blank=True, null=True)
list_pvmem = models.BigIntegerField(blank=True, null=True)
exit_status = models.BigIntegerField(blank=True, null=True)
jobname = models.CharField(max_length=256, blank=True, null=True)
software = models.ManyToManyField(SoftwareVersion, blank=True,
null=True)
class Meta:
ordering = ['-date']
db_table = 'cpu_job'
}}}
The following on mysql produces a good result:
{{{
q = CPUJob.objects.values('project').annotate(usage=Sum('cpu_usage'),
jobs=Count('id'))
print q.query
}}}
of
{{{
SELECT `cpu_job`.`project_id`, SUM(`cpu_job`.`cpu_usage`) AS `usage`,
COUNT(`cpu_job`.`id`) AS `jobs` FROM `cpu_job` GROUP BY
`cpu_job`.`project_id` ORDER BY `cpu_job`.`date` DESC
}}}
However on Postgresql, with the same data, I get the following query:
{{{
SELECT "cpu_job"."project_id", SUM("cpu_job"."cpu_usage") AS "usage",
COUNT("cpu_job"."id") AS "jobs" FROM "cpu_job" GROUP BY
"cpu_job"."project_id", "cpu_job"."date" ORDER BY "cpu_job"."date" DESC
}}}
Note additional term "cpu_job"."date" in the GROUP BY. I did not ask for
it, it got put there.
I suspect the problem is the sort order on the table. The SQL seems to be
fine on mysql (really?????), but appears to be invalid on Postgresql (this
actually makes more sense to me). So Django appears to be silently
adjusting the request to make it valid on Postgresql.
{{{
karaage=> SELECT "cpu_job"."project_id", SUM("cpu_job"."cpu_usage") AS
"usage", COUNT("cpu_job"."id") AS "jobs" FROM "cpu_job" GROUP BY
"cpu_job"."project_id" ORDER BY "cpu_job"."date" DESC
;
ERROR: column "cpu_job.date" must appear in the GROUP BY clause or be
used in an aggregate function
LINE 1: ...cpu_job" GROUP BY "cpu_job"."project_id" ORDER BY "cpu_job"....
}}}
I would much rather Django gives an error under Postgresql (and maybe even
MYSQL too) rather then silently changing the query, and giving different
results to what I expected.
(for this particular query, I don't need sorting, and had not noticed it
would cause problems - specifying order_by() seems to solve this problem).
Thanks
--
Ticket URL: <https://code.djangoproject.com/ticket/23557>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* needs_better_patch: => 0
* needs_tests: => 0
* needs_docs: => 0
Comment:
The behaviour you're seeing with Postgres is correct, and documented:
https://docs.djangoproject.com/en/dev/topics/db/aggregation/#interaction-
with-default-ordering-or-order-by
Columns in the ORDER BY clause must be added to the GROUP BY clause on
Postgres, and any RDBMS that conforms to the spec. MySQL and sqlite allow
columns in the select list that aren't also in the GROUP BY. Django should
be consistent though - it's weird that different results are returned
based on the underlying engine.
I would propose that the MySQL backend should add the order by columns
into the group by list.
--
Ticket URL: <https://code.djangoproject.com/ticket/23557#comment:1>
Comment (by jarshwah):
Additionally, it may be worth talking about removing the non-standard
behaviour from mysql in django by setting the session to
"ONLY_FULL_GROUP_BY": http://dev.mysql.com/doc/refman/5.0/en/sql-
mode.html#sqlmode_only_full_group_by but that may be backwards
incompatible.
--
Ticket URL: <https://code.djangoproject.com/ticket/23557#comment:2>
* stage: Unreviewed => Accepted
--
Ticket URL: <https://code.djangoproject.com/ticket/23557#comment:3>
Comment (by shaib):
It should be noted that "full" group-by causes significant performance
degradations on mysql; this is related to #17144.
In other respects, @jarshwah's comments are mostly correct -- except for
one point, and that is that the SQL spec is retarded, and an ORM ''can''
know when it's safe to group by a subset of the retrieved fields (which
are unique for the group).
I suspect the only way to resolve this without breaking stuff is to close
this ticket as "wontfix".
--
Ticket URL: <https://code.djangoproject.com/ticket/23557#comment:4>
Comment (by brian):
My personal opinion is that Django should never alter silently alter
queries. If there is something wrong with the query, it should generate an
error rather then attempt to fix the problem by altering the query. The
fact the current Postgresql behaviour is documented doesn't make it any
more correct IMHO. The programmer should be forced to explicitly state the
preferred solution rather then Django guess it.
I would suggest that Django has some sort of legacy mode, enabled by
default, for preserving the current behaviour, and a new mode where errors
such as these trigger an error condition. That way any problems can be
identified and fixed, rather then silently await some change (e.g.
different database, or somebody deciding to add an ordering field where
there was none), and nobody noticing that the results generated are now
wrong.
--
Ticket URL: <https://code.djangoproject.com/ticket/23557#comment:5>
Comment (by jarshwah):
I would be in favour of clearing out the ordering when using a .values
call, if an entry in values does not explicitly reference the ordered
column. I think this would neatly solve the issue. Unfortunately, I think
that'd be backwards incompatible. A flag could be used, but I'm not sure
it'd be very nice. It could also be retired after a deprecation period
where the behaviour was swapped.
{{{
Meta:
ordering = ['col']
values_respects_ordering = False # defaults to True
}}}
But then I'm sure someone else will say django is silently altering the
query by not respecting the Meta.ordering once the behaviour is swapped.
I wish Meta.ordering didn't exist.
--
Ticket URL: <https://code.djangoproject.com/ticket/23557#comment:6>
Comment (by akaariai):
I am with brian here - I think Django's way to calculate the GROUP BY is
complex enough without the GROUP BY clause being altered by .order_by()
calls.
In my opinion Django's group by logic should be as follows:
- Group by the primary key of the table in the query by default. The
group by will need to be appended by functionally dependent columns as
many databases need this. This means that we can add any column from the
same table to the group by, or any column from any table pointed by direct
foreign key or reverse one to one field.
- Group by the user's defined .values() call. Possibly allow extending
it by functionally dependent columns. That is, if some table's primary key
is in the values() list, then allow extension from that primary key along
foreign keys and o2ofields.
- If any other column is needed in the group by, then error out. This
means that:
`CPUJob.objects.values('project').annotate(usage=Sum('cpu_usage'),
jobs=Count('id')).order_by('date')` is an error.
The functionally dependent columns appending means that we need to do a
group by author.id, author.name on some databases even though technically
group by author.id gives the same result. Similarly, we need to group by
author.id, author.name, book.id, book.name when
.select_related('favorite_book') is applied, where favorite_book is a
foreign key from author to book.
The main point here is that we really shouldn't alter the results because
.order_by() was added to the query. Just error out in that case. Resist
the temptation to guess, explicit is better than implicit and so on...
I think we can deprecate the current behavior of silently altering the
group by. Checking if a column is functionally dependent on some subset of
columns in the group by isn't exactly easy, but should be doable at least
for primary key case. Multi-column unique indexes are going to be a bit
harder to support...
--
Ticket URL: <https://code.djangoproject.com/ticket/23557#comment:7>
Comment (by Simon Charette):
I'll note that the issue initially reported issue here is not reproducible
anymore since #14357 which deprecated `Meta.ordering` from being
considered when doing annotation. Even if that was not the case the
optimization of selected primary key grouping in #19259 would also have
solve the
I still think that it would be worth to keep this ticket open though to go
through a deprecation period and then raise an error when `values` is used
for grouping and is paired with an explicit `order_by` that would result
in extra `GROUP BY` entries. This can likely be achieved in
[https://github.com/django/django/blob/897f38fabea5e1b196f11250ff6dadfffa489840/django/db/models/sql/compiler.py#L149-L154
SQLCompiler.get_group_by] by branching off `self.query.group_by is not
True` and comparing the results of calling `self.collapse_group_by` with
expressions originating from `order_by` and ones without and if there is
any difference warn/raise.
--
Ticket URL: <https://code.djangoproject.com/ticket/23557#comment:8>
* cc: Ryan Cheley (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/23557#comment:9>
Comment (by Simon Charette):
Some interesting development in SQL:2023 about the possibility to `ORDER
BY` non-selected columns when using aggregation `F868`.
[http://peter.eisentraut.org/blog/2023/04/04/sql-2023-is-finished-here-is-
whats-new#order-by-in-grouped-table-f868 If I understand correctly
Postgres 16+ should implement this feature].
--
Ticket URL: <https://code.djangoproject.com/ticket/23557#comment:10>
* owner: nobody => Ryan Cheley
* status: new => assigned
--
Ticket URL: <https://code.djangoproject.com/ticket/23557#comment:11>
* owner: Ryan Cheley => (none)
* status: assigned => new
--
Ticket URL: <https://code.djangoproject.com/ticket/23557#comment:12>