Re: Suggestion: Aggregate/Grouping/Calculated methods in Django ORM

769 views
Skip to first unread message

Jacob Kaplan-Moss

unread,
Dec 4, 2006, 10:45:01 AM12/4/06
to django...@googlegroups.com, django-d...@googlegroups.com
On 12/4/06 5:57 AM, John Lenton wrote:
> The "max", "min" and other such functions might be a little more
> problematic, unless groupby returned, rather than a generic iterator,
> a special "queryset group" and give _it_ the max/min/etc methods. This
> way it would be clear that max() returns a tuple (value, queryset) (to
> me, at least...). Also, ...groupby('foo').max() would return the same
> result as max(...groupby('foo')), but less efficiently.
>
> Talking through my hat?

No, I think not -- I think that syntax (``queryset.groupby(field).max()``)
actually looks like the best proposal for aggregates I've seen thus far...

I'm taking this to django-dev for more discussion; it'll get seen by more the
right people there.

Thoughts, anyone?

Jacob

DavidA

unread,
Dec 4, 2006, 4:30:50 PM12/4/06
to Django developers

Jacob Kaplan-Moss wrote:
> No, I think not -- I think that syntax (``queryset.groupby(field).max()``)
> actually looks like the best proposal for aggregates I've seen thus far...
>
> Thoughts, anyone?
>
> Jacob

I think it quickly gets more complicated than that syntax would
support. For example, how would you ask for more than one aggregate
value in that syntax? My common use case is grouping a bunch of
financial positions, where the SQL would look something like:

select account, count(*), sum(quantity), sum(total_pnl) from
position
group by account

Would I have to call queryset.groupby(account) three times: once for
count(), once for sum(quantity) and once for sum(total_pnl)?

And what exactly does queryset.groupby() return? In my case, if account
is a ForeignKey from a Position model to an Account model, can I
dereference fields from the result?

account_summary =
Position.objects.filter(date=today).groupby(account)
for summary on account_summary:
print summary.name ### would this work? Is this the name
property of an Account?

And how would I dereference the aggregate fields in the groupby
results? By index? (is account_summary[0][2] the quantity sum of the
first account summary row?)

I've run into all of these issues (multiple aggregate columns,
dereferencing model relations, aggregate alias names) in playing around
with this and I think they are all problems you run into quickly that
make a solution rather complicated.

My idea was a queryset.groupby() could return some sort of dynamic
Django model class where the attributes where the aggregated fields
plus the fields you were grouping by and if you were grouping by a
relation field, it would magically work like any other model relation.

But I don't know how complicated that would be and I haven't thought of
a syntax that works nicely for the more complex cases.

-Dave

Jacob Kaplan-Moss

unread,
Dec 4, 2006, 5:09:41 PM12/4/06
to django-d...@googlegroups.com
On 12/4/06 3:30 PM, DavidA wrote:
> I think it quickly gets more complicated than that syntax would
> support.

Oh, of *course* it does -- that's why I want to discuss it more!

I think, though, that we should be able to find a 80/20 point for aggregates
and support most of the common use cases. That's pretty consistent with the
rest of Django's ORM: we don't try to model *everything* you can do with SQL,
just most of the common stuff.

I'm totally OK with making complex queries be written in SQL; it actually
turns out that SQL's a really good language for handing lots of data :)

As long as the syntax is nice, it seems that any aggregate support would be
better than none.

> For example, how would you ask for more than one aggregate
> value in that syntax? My common use case is grouping a bunch of
> financial positions, where the SQL would look something like:
>
> select account, count(*), sum(quantity), sum(total_pnl) from
> position
> group by account
>
> Would I have to call queryset.groupby(account) three times: once for
> count(), once for sum(quantity) and once for sum(total_pnl)?

My gut would be that that particular case wouldn't be supported, but::

quantity, total_pnl = Position.objects.groupby("account").sum("quantity",
"total_pnl")

would work. So maybe you can only do one aggregate operation at once? That
seems to mesh OK with the majority of aggregate usage *I* do...

> And what exactly does queryset.groupby() return?

I was thinking it would return some proxy object that supports .sum(), .avg(),
etc. methods. By itself I don't think it would be very useful.

> My idea was a queryset.groupby() could return some sort of dynamic
> Django model class where the attributes where the aggregated fields
> plus the fields you were grouping by and if you were grouping by a
> relation field, it would magically work like any other model relation.

I don't think I follow here - can you give me an idea of what you're talking
about?

> But I don't know how complicated that would be and I haven't thought of
> a syntax that works nicely for the more complex cases.

Again, I'm OK with just supporting the simple-ish cases if that's all we can
work out.

Jacob

Russell Keith-Magee

unread,
Dec 5, 2006, 6:56:42 AM12/5/06
to django-d...@googlegroups.com
On 12/4/06, Jacob Kaplan-Moss <ja...@jacobian.org> wrote:
>
> I'm taking this to django-dev for more discussion; it'll get seen by more the
> right people there.
>
> Thoughts, anyone?

Ok; here's my thoughts.

1. Introduction
~~~~~~~~~~~~~~~
Consider the following pseudo-model:

class Book(Model):
name = CharField()
price = FloatField()
pub_date = DateField()

class Order(Model):
description = CharField()
time = DateTimeField()
books = ManyToManyField(Book)

Simple aggregation use cases would be questions like:
a) What is the cheapest book?
b) What is the average book price?
c) Find me all the books with a price greater than the average.
d) What is the total cost of order 1234?

These questions require that you query the database to obtain a single
aggregated value, or a object matching that aggregated value.

However, the more general class of problem is to generate a summary
statistic for each object in a query, with the summary stats coming
from the related objects. For example:

e) What is the total cost of _each_ order?

IMHO, what we need is the ability to annotate summary statistics onto
the results returned by a query set.

2. Proposal
~~~~~~~~~~~
Add an 'annotate()' modifier to QuerySets. The arguments of annotate()
describe the annotations that you want to be applied to the objects
that are returned by the Query Set.

annotate() returns a query set, so it can be used multiple times, be
combined with filters, etc. The argument handling strategy employed in
filter() is reused here; kwargs to annotate() can be decomposed on a
__ boundary to describe table joins, with the last part describing the
aggregate operator to be used. The syntax follows something like:

Model.objects.annotate(field1__field2__aggregate='annotation')

field1__field2 describes the path to get to the field that will be
ultimately aggregated; aggregate is the aggregation function (max,min
etc); the value of the argument is a string that is the name of the
annotate argument.

The objects that are returned when the queryset is executed will be
normal objects, but with additional attributes corresponding to the
annotations.

e.g.,
# Get order 1234, and annotate it a few different ways
>>> order = Order.objects.get(id=1234).annotate(
books__price__sum='total_cost',
books__count='item_count')
# Inspect the order
>>> order.description
"Dad's birthday"
# Annotated orders have a 'total_cost' attribute...
>>> order.total_cost
47.2
# ... and an 'item_count' attribute
>>> order.item_count
3

3. Just the facts, Ma'am
~~~~~~~~~~~~~~~~~~~~~~~~
Ok; so what if you just want _the minimum_, or _the average_? For
this, I propose an aggregates() queryset modifier.

>>> Book.objects.aggregates(price__min='min_price', pub_date__max='last_update')
{'min_price':0.5, 'last_update': 2006-11-22}

aggregates() would expand queries in the same manner as annotate(),
but would be a terminal clause, just like the values() clause.

This is a more verbose notation than the simple 'max()/min()' . I have
discussed my problems with these operators previously; however, if
there is sufficient demand, I don't see any reason that min('field')
couldn't be included in the API as a shorthand for
Model.objects.aggregates(field__min='min')['min'].

4. Comparisons
~~~~~~~~~~~~~~
There is one additional requirement I can see; to perform queries like
(c), you need to be able to compare annotation attributes to object
attributes.

# Annotate a query set with the average price of books
>>> qs = Book.objects.annotate(price__average='avg_price').
# Filter all books with obj.avg_price < obj.price
>>> expensive_books = qs.filter(avg_price__lt=F('price'))

The F() object is a placeholder to let the query language know that
'price' isn't just a string, it's the name of a field. This follows
the example of Q() objects providing query wrappers.

This capability would also be beneficial to the query language in
general; at present, there is no easy way to pull out all objects
where field1 = field2.

5. Implementation
~~~~~~~~~~~~~~~~~
Now the mechanics: What does annotate() do to the SQL?

If there is an annotate clause in a query set:
- All of the base model attributes (the attributes that would normally
be returned by the queryset) are placed in a GROUP BY clause
- Any query on a base model attribute is placed in a HAVING clause
- Any query on the annotation field is placed in a WHERE clause
- If the annotation clause traverses joins, those tables are joined in
the same manner as they would be for filter().

6. Limitations/Problems
~~~~~~~~~~~~~~~~~~~~~~~
- This approach doesn't handle any particularly creative usage of the
GROUP BY clause. I'm open to suggestions, but I'm also happy to put
this problem subset into the 20% "do it in raw SQL" category.

- I'm not overly enamoured with the name aggregates() - it isn't a
particularly intuitive name for the functionality provided; summary()
is the only other option I could think of.

- I haven't tried to implement this, so there are probably some sharp
edges in the SQL generation process.

~~~~~~~~~~~~~~~~~~~~~~~

So - that's my two bits. Comments?

Yours,
Russ Magee %-)

John Lenton

unread,
Dec 5, 2006, 7:01:31 AM12/5/06
to django-d...@googlegroups.com
On 12/4/06, DavidA <david.av...@gmail.com> wrote:
>
> Would I have to call queryset.groupby(account) three times: once for
> count(), once for sum(quantity) and once for sum(total_pnl)?

I hadn't even considered having a multi-parameter tuple-returning
"sum"; I was ok with either calling groupby thrice, or saving the
groupby and calling the different ops in sequence. In either case, a
database roundtrip per call.

> And what exactly does queryset.groupby() return?

quoting from itertools.groupby: «an iterator that returns (key,
sub-iterator) grouped by each value of key(value)»

I had thought that queryset.groupby should behave in the same way
itertools.groupby would behave, i.e. that there would only be
implementation (and performance) differences between

queryset = Position.objects.filter(date=today)
account_summary = itertools.groupby(queryset, operator.attrgetter('account'))

and

account_summary = Position.objects.filter(date=today).groupby('account')

this makes it very easy to explain, and to understand what you'll be
getting from groupby()

> In my case, if account
> is a ForeignKey from a Position model to an Account model, can I
> dereference fields from the result?
>
> account_summary =
> Position.objects.filter(date=today).groupby(account)
> for summary on account_summary:
> print summary.name ### would this work? Is this the name
> property of an Account?

in view of the above: no. To do the above, you'd do this instead:

for account, positions in account_summary:
print account.name

> And how would I dereference the aggregate fields in the groupby
> results? By index? (is account_summary[0][2] the quantity sum of the
> first account summary row?)

positions would be the (lazy) iterator for that purpose, already set
up for you (by this I mean, I don't expect it to be a performance
gain, just a convenience).

> My idea was a queryset.groupby() could return some sort of dynamic
> Django model class where the attributes where the aggregated fields
> plus the fields you were grouping by and if you were grouping by a
> relation field, it would magically work like any other model relation.

that sounds way too magic for my taste :)

--
John Lenton (jle...@gmail.com) -- Random fortune:
The trouble with a lot of self-made men is that they worship their creator.

DavidA

unread,
Dec 5, 2006, 4:24:30 PM12/5/06
to Django developers

Russell Keith-Magee wrote:
> annotate() returns a query set, so it can be used multiple times, be
> combined with filters, etc. The argument handling strategy employed in
> filter() is reused here; kwargs to annotate() can be decomposed on a
> __ boundary to describe table joins, with the last part describing the
> aggregate operator to be used. The syntax follows something like:
>
> Model.objects.annotate(field1__field2__aggregate='annotation')

[snip]

> e.g.,
> # Get order 1234, and annotate it a few different ways
> >>> order = Order.objects.get(id=1234).annotate(
> books__price__sum='total_cost',
> books__count='item_count')
> # Inspect the order
> >>> order.description
> "Dad's birthday"
> # Annotated orders have a 'total_cost' attribute...
> >>> order.total_cost
> 47.2
> # ... and an 'item_count' attribute
> >>> order.item_count
> 3

I like making the aggregate function a part of a keyword argument. It
seems consistent with the Django DB API and offers a lot of
flexibility. Better, in my opinion than individual functions for each
aggregator. The 'annotate' name is a little indirect. Maybe something
like 'calc_fields'?

> 3. Just the facts, Ma'am
> ~~~~~~~~~~~~~~~~~~~~~~~~
> Ok; so what if you just want _the minimum_, or _the average_? For
> this, I propose an aggregates() queryset modifier.
>
> >>> Book.objects.aggregates(price__min='min_price', pub_date__max='last_update')
> {'min_price':0.5, 'last_update': 2006-11-22}
>
> aggregates() would expand queries in the same manner as annotate(),
> but would be a terminal clause, just like the values() clause.
>
> This is a more verbose notation than the simple 'max()/min()' . I have
> discussed my problems with these operators previously; however, if
> there is sufficient demand, I don't see any reason that min('field')
> couldn't be included in the API as a shorthand for
> Model.objects.aggregates(field__min='min')['min'].

This seems good, too, but maybe call it 'calc_values' or something with
the 'values' name in it to be consistent with the existing values()
method. The shortcut is nice but I could live without it.

> 4. Comparisons
> ~~~~~~~~~~~~~~
> There is one additional requirement I can see; to perform queries like
> (c), you need to be able to compare annotation attributes to object
> attributes.
>
> # Annotate a query set with the average price of books
> >>> qs = Book.objects.annotate(price__average='avg_price').
> # Filter all books with obj.avg_price < obj.price
> >>> expensive_books = qs.filter(avg_price__lt=F('price'))
>
> The F() object is a placeholder to let the query language know that
> 'price' isn't just a string, it's the name of a field. This follows
> the example of Q() objects providing query wrappers.

To make it more like Q(), would it be better to do
F(avg_price__lt='price') so you could combine them with | and &?

DavidA

unread,
Dec 5, 2006, 4:33:56 PM12/5/06
to Django developers

John Lenton wrote:
>
> I hadn't even considered having a multi-parameter tuple-returning
> "sum"; I was ok with either calling groupby thrice, or saving the
> groupby and calling the different ops in sequence. In either case, a
> database roundtrip per call.

I'm often grouping thousands of rows for my cases so doing multiple
round trips per field would be painful.

> I had thought that queryset.groupby should behave in the same way
> itertools.groupby would behave, i.e. that there would only be
> implementation (and performance) differences between
>
> queryset = Position.objects.filter(date=today)
> account_summary = itertools.groupby(queryset, operator.attrgetter('account'))
>
> and
>
> account_summary = Position.objects.filter(date=today).groupby('account')

I'm confused. Where do you specify the aggregate functions for
aggregating the specific columns (sum, avg, etc.)? Or am I
misunderstanding what this does?

> in view of the above: no. To do the above, you'd do this instead:
>
> for account, positions in account_summary:
> print account.name

I see. Works for me.

>
> > And how would I dereference the aggregate fields in the groupby
> > results? By index? (is account_summary[0][2] the quantity sum of the
> > first account summary row?)
>
> positions would be the (lazy) iterator for that purpose, already set
> up for you (by this I mean, I don't expect it to be a performance
> gain, just a convenience).

But would there be aliases for these aggregate expessions? That is, the
equivalent of

select sum(quantity) as tot_quantity ...


> > My idea was a queryset.groupby() could return some sort of dynamic
> > Django model class where the attributes where the aggregated fields
> > plus the fields you were grouping by and if you were grouping by a
> > relation field, it would magically work like any other model relation.
>
> that sounds way too magic for my taste :)

True.

Reply all
Reply to author
Forward
0 new messages