GROUP BY

263 views
Skip to first unread message

Glenn Maynard

unread,
Jan 26, 2009, 8:55:23 PM1/26/09
to Django users
How are aggregates grouped on anything other than a simple column?

db/sql/query.py set_group_by suggests that this might work, to (for
example) group together dollar amounts:

Table.objects.extra(select={"dollars": "cents/100"}).values
("dollars").aggregate(...)

That doesn't throw an error (it does if the extra() isn't there), but
it doesn't group as expected, either. (For some reason, it adds a
GROUP BY containing every field in Table.)

Russell Keith-Magee

unread,
Jan 26, 2009, 10:20:59 PM1/26/09
to django...@googlegroups.com
On Tue, Jan 27, 2009 at 10:55 AM, Glenn Maynard <glennf...@gmail.com> wrote:
>
> How are aggregates grouped on anything other than a simple column?

I'm a little bit confused as to what you think is happening here -
what do you mean by "other than a simple column"? The default
aggregate grouping is effectively by object instance; the
implementation means that every non-aggregate field on the model that
is returned by the SELECT is included in the GROUP BY.

> db/sql/query.py set_group_by suggests that this might work, to (for
> example) group together dollar amounts:

That block of code is internal API - if you choose to use it, you do
so at your own risk, and backwards compatibility is not guaranteed.
set_group_by() is an internal setup call; I doubt it will do anything
useful in your case.

The public API for aggregates consists of annotate(), aggregate() and
the interaction of those calls with values().

> Table.objects.extra(select={"dollars": "cents/100"}).values
> ("dollars").aggregate(...)
>
> That doesn't throw an error (it does if the extra() isn't there), but
> it doesn't group as expected, either. (For some reason, it adds a
> GROUP BY containing every field in Table.)

The reason every field in the table is included is because that's how
GROUP BY works - any field that is in the query that isn't an
aggregate needs to be included in the group by. When you call
Table.objects...., this implies SELECT Table.id, Table.name, ... at
the database level. Some other fields are included in the GROUP BY to
accomodate the needs of order_by() and select_related().

The only exception to this is MySQL; MySQL allows for grouping by
primary key, which is an optimization for the simple case of grouping
by every non-aggregate field on a model.

I'd like to provide some more concrete advice, but it's not exactly
clear what you're trying to achieve. If you describe the actual
problem you are trying to solve, what you have tried, and the errors
or unexpected results you have seen, we might be able to provide
better advice.

Yours,
Russ Magee %-)

Glenn Maynard

unread,
Jan 26, 2009, 11:20:46 PM1/26/09
to Django users
On Jan 26, 10:20 pm, Russell Keith-Magee <freakboy3...@gmail.com>
wrote:
> I'm a little bit confused as to what you think is happening here -
> what do you mean by "other than a simple column"? The default
> aggregate grouping is effectively by object instance; the
> implementation means that every non-aggregate field on the model that
> is returned by the SELECT is included in the GROUP BY.

If I do Table.objects.values("date").annotate(Sum("dollars")), I get
the expected "SELECT SUM(dollars) FROM table GROUP BY date" (or so);
easy enough.

I want to do "SELECT sum(dollars) FROM table GROUP BY func(value)", or
any other expression for the GROUP BY. I can do this for ORDER BY
with extra() (though I havn't figured out how to know what the aliases
for the tables are going to be, so it's somewhat brittle).

Just experimenting, if I do:
Table.objects.extra(select={"group": "func(value)"}).values
("group").annotate(Sum("dollars"))
and then manually hack set_group_by to add "group" to self.group_by,
it generates the result I'm expecting, eg. [{"dollars__sum": 10,
"group": 1}, ...]. But, I havn't found any path in the real API to
make that happen.

Malcolm Tredinnick

unread,
Jan 26, 2009, 11:42:32 PM1/26/09
to django...@googlegroups.com
On Mon, 2009-01-26 at 20:20 -0800, Glenn Maynard wrote:
> On Jan 26, 10:20 pm, Russell Keith-Magee <freakboy3...@gmail.com>
> wrote:
> > I'm a little bit confused as to what you think is happening here -
> > what do you mean by "other than a simple column"? The default
> > aggregate grouping is effectively by object instance; the
> > implementation means that every non-aggregate field on the model that
> > is returned by the SELECT is included in the GROUP BY.
>
> If I do Table.objects.values("date").annotate(Sum("dollars")), I get
> the expected "SELECT SUM(dollars) FROM table GROUP BY date" (or so);
> easy enough.
>
> I want to do "SELECT sum(dollars) FROM table GROUP BY func(value)", or
> any other expression for the GROUP BY. I can do this for ORDER BY
> with extra() (though I havn't figured out how to know what the aliases
> for the tables are going to be, so it's somewhat brittle).

So, this is kind of an API extension request. Realise that Django
intentionally does not expose "GROUP BY" or anything like that, in the
ORM. Although the fact that we're over a relational storage backend
sometimes leaks through, the ORM API is fairly SQL-agnostic. Instead, we
expose particular pieces of functionality that happen to be implemented
using "GROUP BY" when it's turned into SQL (and could well be
implemented by some entirely different set of magical fairies with a
different storage system backend).

The aggregate support in Django provided an ORM API for aggregate
functionality. Not an exposure of grouping.

>
> Just experimenting, if I do:
> Table.objects.extra(select={"group": "func(value)"}).values
> ("group").annotate(Sum("dollars"))
> and then manually hack set_group_by to add "group" to self.group_by,
> it generates the result I'm expecting, eg. [{"dollars__sum": 10,
> "group": 1}, ...]. But, I havn't found any path in the real API to
> make that happen.

That's because it doesn't exist. Django isn't meant to completely
replace the need for SQL, for a start.

Also, given the relative infrequency with which this sort of thing crops
up across the entire set of all possible database interactions, it's not
unreasonable that it might, in fact, take more than one line of Python
to do it. You could construct the queryset, then look up the appropriate
alias in the "query" attribute (possibly adding the table, if
necessary), then set up the extra() bit. It would take three or four
line, which could be wrapped in a function if you needed to do it
regularly. Uses internal API all the way through, but simple enough to
write tests for so that you could tell if it broke due to internal
changes in the future.

Finally, if there's some use-case that pops up at the Python level that
might justify an API and if there's a sensible API, then it's something
to consider (not necessarily add, but consider) as an addition to Django
in the future (maybe in the Django 1.2 or 1.3 time-frame). You'll need
to express the problem in terms of something that is a reasonable
situation for a Python ORM, not as "I can do this in SQL, and so it must
be possible in Django as a single command", since that is in conflict
with the stated goals. There's always a slight problem with attempting
to create an API to pass through essentially arbitrary SQL (e.g. the
"func()" part of your raw SQL) via the ORM, since it means nothing at
the Python level, so requires exposing raw SQL as part of the API.
Extra() is already a fairly ugly hack in many respects -- it's
incredibly fragile, in the sense that once you add it to a queryset,
subsquent manipulations of that queryset can fail in ways we cannot hope
to detect or recover from.

I hope that puts the item you're asking about in some perspective. I've
tried to give an idea of why this is a complex area. What you're wanting
to do is probably already possible via a few lines of code that poke at
the internals and possibly won't get any more direct than that in the
near future.

Regards,
Malcolm


Russell Keith-Magee

unread,
Jan 26, 2009, 11:49:15 PM1/26/09
to django...@googlegroups.com
On Tue, Jan 27, 2009 at 1:20 PM, Glenn Maynard <glennf...@gmail.com> wrote:
>
> On Jan 26, 10:20 pm, Russell Keith-Magee <freakboy3...@gmail.com>
> wrote:
>> I'm a little bit confused as to what you think is happening here -
>> what do you mean by "other than a simple column"? The default
>> aggregate grouping is effectively by object instance; the
>> implementation means that every non-aggregate field on the model that
>> is returned by the SELECT is included in the GROUP BY.
>
> If I do Table.objects.values("date").annotate(Sum("dollars")), I get
> the expected "SELECT SUM(dollars) FROM table GROUP BY date" (or so);
> easy enough.

Yes - using values() before an annotate() call specifies that a subset
of the fields is required, which controls the grouping that is output.

> Just experimenting, if I do:
> Table.objects.extra(select={"group": "func(value)"}).values
> ("group").annotate(Sum("dollars"))
> and then manually hack set_group_by to add "group" to self.group_by,
> it generates the result I'm expecting, eg. [{"dollars__sum": 10,
> "group": 1}, ...]. But, I havn't found any path in the real API to
> make that happen.

There have been a couple of reports recently (#10113 and #10127 are
two examples that jump to mind) that describe errors in the exact
composition of the GROUP BY statement constructed by Django. I suspect
this may be one of this family of bugs. In your case, the culprit is a
select added in an extra(). I'll need to spend some time to check the
exact details and implications, but my initial reaction is that your
query:

Table.objects.extra(select={"group":
"func(value)"}).values("group").annotate(Sum("dollars"))

should work as you describe without requiring any hackery. If you
could open a ticket describing this problem (the query and what you
expect as SQL) it will ensure we don't forget about this before we
release v1.1.

Also, keep in mind that Django's SQL layer is an ORM - an _Object_
Relational Mapper. The API is designed to look like you're dealing
with Objects. The fact that it is a relational store underneath is
secondary. This means that there is a non-trival collection of easy
SQL queries that will be difficult (or impossible) to represent in the
ORM. This may not apply in this case specifically, but it is worth
remembering when you start looking for ways to represent esoteric SQL
as ORM statements. Sometimes the right solution is to just use SQL.

Yours,
Russ Magee %-)

Glenn Maynard

unread,
Jan 27, 2009, 2:46:25 AM1/27/09
to Django users
On Jan 26, 11:42 pm, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:
> That's because it doesn't exist. Django isn't meant to completely
> replace the need for SQL, for a start.

I was assuming it was possible because the extra() API already seems
to try to avoid forcing people to drop out of the ORM entirely for
cases like this (which I like, the current brittleness of extra()
aside).

> Also, given the relative infrequency with which this sort of thing crops
> up across the entire set of all possible database interactions, it's not
> unreasonable that it might, in fact, take more than one line of Python
> to do it.

I don't care if it takes more than one line, though there is, in
fact :), a big difference between taking more than one line and having
to bypass the public API.

> Extra() is already a fairly ugly hack in many respects -- it's
> incredibly fragile, in the sense that once you add it to a queryset,
> subsquent manipulations of that queryset can fail in ways we cannot hope
> to detect or recover from.

I've seen some of the issues with extra(). But, I think that ability
(not necessarily that particular method) is critical, to be able to
only partially drop out of the ORM; to do things in SQL's vocabulary
that are out of the ORM's scope, while remaining inside the ORM enough
to still operate on existing QuerySets that you've been handed. It
seems critical to balancing reasonable ORM design (not trying to
reproduce SQL's whole broad vocabulary) with the real-world tasks
requiring that vocabulary.

On Jan 26, 11:49 pm, Russell Keith-Magee <freakboy3...@gmail.com>
wrote:
> Sometimes the right solution is to just use SQL.

I don't mind dropping to raw SQL when needed for complex cases. But,
the ORM encourages a design that sometimes makes this hard--passing
around QuerySets, eg.

>>> good_blue_books = Books.objects.filter(color="blue", rating__gte=8)
>>> get_books_sold_per_day(good_blue_books)
[{"day": 0, "sales": 150}, {"day": 1, "sales": 100}, ...]

That's extremely useful, but falling back on raw SQL when
get_books_sold_per_day needs to do something complex becomes more
difficult, since it needs to operate on the fairly arbitrary QuerySet
it's been given--and the more heavily invested code is in QuerySet,
the bigger the problem will become.

If there are any examples of constructing this sort of thing,
integrating a QuerySet inside custom SQL--rather than extra(), which
is the reverse--I'd like to see it. I'd be more comfortable with that
code investment knowing some way to do that. (I guess I might as well
resign myself to familiarizing myself with the internals and just
figure it out myself...)

> If you could open a ticket describing this problem (the query and what you expect as SQL) it will ensure we don't forget about this before we release v1.1.

http://code.djangoproject.com/ticket/10132

Malcolm Tredinnick

unread,
Jan 27, 2009, 2:50:24 AM1/27/09
to django...@googlegroups.com
On Mon, 2009-01-26 at 23:46 -0800, Glenn Maynard wrote:
[...]

> I don't care if it takes more than one line, though there is, in
> fact :), a big difference between taking more than one line and having
> to bypass the public API.

Only semantically, when you're trying to access something like "group
by", since, pretty much by definition, you're doing stuff that the
public ORM isn't intending to provide.

Regards,
Malcolm

Glenn Maynard

unread,
Jan 27, 2009, 3:13:01 AM1/27/09
to Django users
On Jan 27, 2:50 am, Malcolm Tredinnick <malc...@pointy-stick.com>
wrote:
> > I don't care if it takes more than one line, though there is, in
> > fact :), a big difference between taking more than one line and having
> > to bypass the public API.
>
> Only semantically, when you're trying to access something like "group
> by", since, pretty much by definition, you're doing stuff that the
> public ORM isn't intending to provide.

If Russell's guess is correct, then there's a way to do this within
the public API--despite it beyond beyond what the *abstract* part of
the API is trying to provide. If he's wrong, then there's no way to
do it within the public API. That's not a semantic difference.
Reply all
Reply to author
Forward
0 new messages