How do I manually set the "GROUP BY" for a django queryset?

4,434 views
Skip to first unread message

Greg

unread,
Feb 14, 2010, 5:30:27 PM2/14/10
to Django users
I have a large-ish query which is taking ~10 seconds to run, but I can
get it down to less than a second by changing the group by part of the
query from

GROUP BY `refunds_userprofile`.`ird_number` ,
`auth_user`.`first_name` , `auth_user`.`last_name` ,
`refunds_userprofile`.`user_id` , `auth_user`.`id` ,
`auth_user`.`username` , `auth_user`.`first_name` ,
`auth_user`.`last_name` , `auth_user`.`email` ,
`auth_user`.`password` , `auth_user`.`is_staff` ,
`auth_user`.`is_active` , `auth_user`.`is_superuser` ,
`auth_user`.`last_login` , `auth_user`.`date_joined`

to

GROUP BY `refunds_userprofile`.`ird_number`

(note this doesn't change the query results)

I figured I might have to use some undocumented methods to do this,
but I couldn't figure it out - I've tried explicitly setting the
query.group_by attribute with no success. Does anyone know how I could
do this?

Daniel Roseman

unread,
Feb 14, 2010, 6:15:21 PM2/14/10
to Django users

What ORM query are you running to get the above generated SQL? Looks
like it could be a bug.
--
DR.

Russell Keith-Magee

unread,
Feb 14, 2010, 7:07:16 PM2/14/10
to django...@googlegroups.com

The short answer is "you don't". Django doesn't expose any way to
manually control GROUP BY because that is a relational construct, and
Django is attempting to provide an Object-like wrapper.

As for why dropping GROUP BY clauses is faster - remember, t's always
possible to get the wrong answer in O(1) time :-) Daniel has asked the
more important question - what are you trying to do? Django is
generally pretty careful about adding GROUP BY clauses; arbitrarily
dropping them make me a little nervous. It's possible you're found a
bug, but it's impossible to confirm that without more details.

Yours
Russ Magee %-)

Greg Brown

unread,
Feb 15, 2010, 3:43:46 PM2/15/10
to django...@googlegroups.com
Ok, here goes:

First, some background - UserProfile is the
settings.AUTH_PROFILE_MODULE and the Application model has a
ForeignKey to User. There's ~90,000 users/userprofiles, ~110,000
applications (most have only one application, some have 2). The
database is mysql. If that's not enough info, let me know and I'll see
if I can put together a simple reproduce-able test case.

The queryset giving the problem is

UserProfile.objects.values(
'ird_number', 'user__first_name', 'user__last_name', 'user__pk',
).annotate(
Max('user__application__creation_date'),
).order_by('-user__application__creation_date__max')

I'm beginning to think it might be a bug - because UserProfile and
User are joined by a one-to-one relationship, it's sufficient to group
by one of their primary keys, and I think the ORM should know this.

Should I open a ticket for this?

Greg

> --
> You received this message because you are subscribed to the Google Groups "Django users" group.
> To post to this group, send email to django...@googlegroups.com.
> To unsubscribe from this group, send email to django-users...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/django-users?hl=en.
>
>

--
http://gregbrown.co.nz/

Dj Gilcrease

unread,
Feb 15, 2010, 4:13:27 PM2/15/10
to django...@googlegroups.com
On Mon, Feb 15, 2010 at 1:43 PM, Greg Brown <gregpla...@gmail.com> wrote:
> The queryset giving the problem is
>
> UserProfile.objects.values(
>        'ird_number', 'user__first_name', 'user__last_name', 'user__pk',
>    ).annotate(
>        Max('user__application__creation_date'),
>    ).order_by('-user__application__creation_date__max')


It looks to me like you are trying to get more info from the User
model then then UserProfile model, so maybe change your query to be
based on the User not the UserProfile


User.objects.values('first_name', 'last_name', 'pk').annotate(
Max('application__creation_date'),
).order_by('-application__creation_date__max').select_related()

Greg Brown

unread,
Feb 15, 2010, 4:51:27 PM2/15/10
to django...@googlegroups.com
That's a good point, but then the ird_number field (on UserProfile) is
inaccessible. I'd need to do something like

User.objects.values('first_name', 'last_name', 'pk',
'userprofile__ird_number').annotate(


Max('application__creation_date'),
).order_by('-application__creation_date__max').select_related()

but that raises

FieldError: Invalid field name: 'userprofile__ird_number'

Greg

Russell Keith-Magee

unread,
Feb 16, 2010, 7:42:48 AM2/16/10
to django...@googlegroups.com
On Tue, Feb 16, 2010 at 4:43 AM, Greg Brown <gregpla...@gmail.com> wrote:
> Ok, here goes:
>
> First, some background - UserProfile is the
> settings.AUTH_PROFILE_MODULE and the Application model has a
> ForeignKey to User. There's ~90,000 users/userprofiles, ~110,000
> applications (most have only one application, some have 2). The
> database is mysql. If that's not enough info, let me know and I'll see
> if I can put together a simple reproduce-able test case.
>
> The queryset giving the problem is
>
> UserProfile.objects.values(
>        'ird_number', 'user__first_name', 'user__last_name', 'user__pk',
>    ).annotate(
>        Max('user__application__creation_date'),
>    ).order_by('-user__application__creation_date__max')
>
> I'm beginning to think it might be a bug - because UserProfile and
> User are joined by a one-to-one relationship, it's sufficient to group
> by one of their primary keys, and I think the ORM should know this.

This is only true for MySQL. Every other database that Django supports
requires a complete list of non-aggregate fields in the GROUP BY
clause - which is the behaviour you are seeing.

> Should I open a ticket for this?

It's probably worth opening a ticket, but not because it's a bug. Like
I said, the 'group by pk' think is a MySQL eccentricity; the SQL you
describe is completely normal for other databases (mod the fact that
your values() clause has a lot less fields than the SQL sample you
gave, but I take it that this is was just truncated for clarity). The
query is that is being issued is *correct*, it's just not *optimal*.

However, there might be an improved optimization to be had. My first
impression is that we might be able to allow the 'group by pk'
optimization as long as the pk is included in the list of fields in a
values query. The current code only allows the optimization in the
case that the list of selected columns is the same as the list of
fields on the requesting model (i.e., the non-values() case). Beefing
up the optimization like this will require a bit of thought to make
sure it isn't catching cases it shouldn't (i.e., is there a query that
includes the PK for which the 'group by pk' optimization isn't valid),
but providing we aren't going to break anything,

Yours,
Russ Magee %-)

Reply all
Reply to author
Forward
0 new messages