Aggregates


Nicolas E. Lara G.

Mar 15, 2008, 1:51:28 PM
to Django developers
With the QuerySet refactoring being almost ready now I have been
thinking of implementing aggregates for django models.

Aggregates for django models have been discussed a lot in the dev
list, so I'm taking most of the ideas from what Honza Král, Russell
Keith-Magee -among others- wrote. I will just try to summarize what I
think are the best approaches and add some examples of how it should
work.


The actual proposal
===================

Models
------

These are the (pseudo)models we'll use as examples::

    class Purchase(models.Model):
        identifier = CharField()
        payment = FloatField()
        quantity = IntegerField()

    class Buyer(models.Model):
        name = CharField()
        age = IntegerField()
        purchases = ManyToMany(Purchase)

General Syntax
--------------

The general syntax I propose combines the filter-like idea [1] with
the function=tuple idea [2], as proposed once in a previous discussion
[3]. That discussion never made it very explicit how to combine them,
so the idea is as follows:

There should be an aggregate() QuerySet modifier, that _always_
returns a QuerySet (probably an AggregateQuerySet subclass of
QuerySet).


Suppose you have::
>>> queryset = Buyer.objects.all()

This modifier receives arguments in two ways:

As function=tuple for the different aggregate functions::
>>> queryset.aggregate(sum=('payment',), avg=('quantity', 'payment', 'age'))

which results in a queryset whose values() would be::

    [
        {
            'name' : 'Peter Smith',
            ... other attributes ...
            'payment' : {'sum': 20000, 'avg': 200},
            'quantity' : {'avg': 1500}
        },
        {
            'name' : 'Jane Doe',
            ... other attributes ...
            'payment' : {'sum': 62000, 'avg': 321},
            'quantity' : {'avg': 12604}
        },
        {
            'age': {'avg': 35}
        }
    ]


and in filter-like syntax::
>>> queryset.aggregate( payment__sum='total_payment',
quantity__avg='normal_request', payment__avg='avg_pay',
age__avg='mean_age')

which results in a queryset whose values() would be::

    [
        {
            'name' : 'Peter Smith',
            ... other attributes ...
            'total_payment' : 20000,
            'avg_pay' : 200,
            'normal_request' : 1500
        },
        {
            'name' : 'Jane Doe',
            ... other attributes ...
            'total_payment' : 62000,
            'avg_pay' : 321,
            'normal_request' : 12604
        },
        {
            'mean_age': 35
        }
    ]


In any of these resulting QuerySets you could filter as either
qs.filter(payment__sum=x) or qs.filter(total_payment=x), depending on
the case.

Note that this result set includes the aggregates that refer to the
actual table you are querying at the same level as the objects, and
the aggregates that refer to the joined table inside the actual
''object''.

When the base model has a field with the same name as the field on the
related model we want to aggregate, we could resolve the ambiguity by
referring to the related field like::

>>> queryset.aggregate( purchase__payment__sum='total_payment')

or::

>>> queryset.aggregate( sum=( 'purchase__payment',) )

otherwise the field on the base model will take priority and become a
model-level aggregate.

There is no way to specify that only the aggregated fields should
appear in the results because this could be done with the modifiers
proposed in [#5420].

When group_by is not specified, grouping is always done by the
ForeignKey (or ManyToMany). When group_by is specified (it is always
specified in the function=tuple format), the objects are grouped by
those fields.

The count function should, IMHO, be handled in the filter-like syntax
so you can specify which field to count on, allowing you to exclude
null values from foreign tables when counting.

This way you could::

>>> queryset.aggregate( payment__count='already_paid')

This would result in only the rows with a non-null payment being
counted. The count would be added as an attribute of each grouped
object, according to the rule specified before.
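As a side note on the SQL behaviour this relies on: COUNT(column) skips
NULL values, while COUNT(*) counts every row. A minimal sqlite3 sketch
(the table and data here are made up purely for illustration):

```python
import sqlite3

# COUNT(column) ignores NULLs; COUNT(*) counts all rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase (identifier TEXT, payment REAL)")
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?)",
    [("a", 100.0), ("b", None), ("c", 250.0)],
)
total_rows, paid_rows = conn.execute(
    "SELECT COUNT(*), COUNT(payment) FROM purchase"
).fetchone()
print(total_rows, paid_rows)  # 3 2
```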

To specify ordering you would use order_by, referring to the alias or
to field__function. That is::

qs.order_by('-payment__sum')

(for the function=tuple syntax)

or::

qs.order_by('-total_payment')

(for the filter-like syntax)

[1] http://groups.google.com/group/django-developers/msg/2f91fa217bc465bb
[2] http://code.djangoproject.com/ticket/3566
[3] http://groups.google.com/group/django-developers/browse_thread/thread/f433edf7b0ebdbcb/9758e11744381e67
[#5420] http://code.djangoproject.com/ticket/5420

SQL
---

As said before, the GROUP BY clause, when not specified, would be
generated by using the related key.

Queries that refer to the aggregated fields are placed in a HAVING
clause.

Other queries are placed in a WHERE clause.
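A minimal sqlite3 sketch of the clause placement described above (the
table, columns, and data are invented for illustration; this is the
plain SQL the ORM would generate, not the proposed API):

```python
import sqlite3

# Filters on plain fields go to WHERE; filters on aggregates go to HAVING;
# the default grouping is on the related key (buyer_id here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase (buyer_id INTEGER, payment REAL)")
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?)",
    [(1, 100.0), (1, 300.0), (2, 50.0), (2, 20.0), (3, 500.0)],
)
rows = conn.execute(
    """
    SELECT buyer_id, SUM(payment) AS total_payment
    FROM purchase
    WHERE payment > 30            -- plain field lookup
    GROUP BY buyer_id             -- default grouping on the related key
    HAVING SUM(payment) > 150     -- lookup on the aggregate
    ORDER BY total_payment DESC
    """
).fetchall()
print(rows)  # [(3, 500.0), (1, 400.0)]
```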

Downsides
---------

The only downside I see with having the two syntax possibilities is
inconsistency. The person doing, for example, the templates would need
to know how the query was executed in order to decide whether to use::

{{ object.total_payment }}

or::

{{ object.payment.sum }}

Of course, this could also be seen as an ''aliasing feature'', but I'm
not thrilled.

Another problem I see is that group_by is too SQL-ish. You have to
think in tables/SQL instead of objects. I still don't see a better way
of allowing the user to specify which fields to group by. Maybe a
simple name change would make it easier.

Extras
------

1- An idea I had, though I am not sure yet how it could be implemented
well, is to allow users to define their own aggregate functions and
provide them to extend the aggregate API.

For example, suppose I want to aggregate on the price of every sold
item (in the models above). I could do something like::

>>> def get_price(payment, quantity):
... return payment / quantity
...

>>> aggregate_user_defined(get_price, 'price')

>>> queryset.aggregate(purchases__price__avg='average_price')

This would result in retrieving avg(payment/quantity).
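In raw SQL terms, the hypothetical purchases__price__avg would boil
down to feeding the per-row expression into AVG(). A small sqlite3
illustration (schema and data made up):

```python
import sqlite3

# avg(payment / quantity): the user-defined expression is computed per
# row, then aggregated, matching the get_price example above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase (payment REAL, quantity INTEGER)")
conn.executemany(
    "INSERT INTO purchase VALUES (?, ?)",
    [(100.0, 4), (90.0, 3), (20.0, 2)],
)
(average_price,) = conn.execute(
    "SELECT AVG(payment / quantity) FROM purchase"
).fetchone()
print(average_price)  # (25 + 30 + 10) / 3
```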

I am still not sure how to extend this idea to allow the user to
retrieve something like ( avg(payment) * quantity ). I don't even
know if it should be allowed. Probably it should just be done in
Python after getting the results. But, as I said, this is just an idea
that could be explored.


2- Another extra thing that could be explored is Russell's idea of
allowing field lookups to compare two fields:
Quote:

There is one additional requirement I can see; to perform
queries like "Find me all the books with a price greater than
the average", you need to be able to compare annotation
attributes to object attributes.

# Annotate a query set with the average price of books
>>> qs = Book.objects.annotate(price__average='avg_price')

# Filter all books with obj.avg_price < obj.price

>>> expensive_books = qs.filter(avg_price__lt=F('price'))

The F() object is a placeholder to let the query language know that
'price' isn't just a string, it's the name of a field. This follows
the example of Q() objects providing query wrappers.

This capability would also be beneficial to the query language in
general; at present, there is no easy way to pull out all objects
where field1 = field2.

I put this as an extra since it involves the whole ORM and not just
the aggregates, though I think it would be a very nice feature to
have.
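To make the quoted idea concrete, here is a minimal sketch of an
F()-style placeholder resolved in plain Python; the matches() helper
and its names are assumptions for illustration, not Django's actual
implementation:

```python
class F:
    """Placeholder marking a value as a field name rather than a literal."""
    def __init__(self, name):
        self.name = name


def matches(obj, **lookups):
    """Check filter-style lookups against one object, resolving F() refs."""
    for key, expected in lookups.items():
        field, _, op = key.partition("__")
        op = op or "exact"
        value = getattr(obj, field)
        # An F() value points at another field on the same object.
        target = getattr(obj, expected.name) if isinstance(expected, F) else expected
        if op == "exact" and value != target:
            return False
        if op == "lt" and not value < target:
            return False
    return True


class Book:
    def __init__(self, price, avg_price):
        self.price = price
        self.avg_price = avg_price


# Per the quoted example: keep books with obj.price < obj.avg_price.
print(matches(Book(price=5, avg_price=10), price__lt=F("avg_price")))   # True
print(matches(Book(price=15, avg_price=10), price__lt=F("avg_price")))  # False
```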

3- Another thing I've been looking into is to allow, for performance
reasons, the aggregates to be kept in a table as is done in
django-simpleaggregation [1]. This could be done if a variable is
specified in settings. This would be very good for performance in
systems that rely a lot on aggregation, but it still seems very
"not the django way" to me. (To the author of simpleaggregation: don't
get me wrong, it seems very "the django way" as a contrib app, but not
for the actual ORM API.)

[1] http://code.google.com/p/django-simpleaggregation/

Justin Fagnani

Mar 15, 2008, 5:41:29 PM
to django-d...@googlegroups.com
Hi Nicolas,

I'm extremely excited about the possibility of aggregation support coming up, since I have a lot more SQL in my projects than I expected when I started using Django.

So I've been lurking and reading the aggregation proposals and some of the syntax proposals so far have seemed a little bit odd to me. They assume a small, fixed set of aggregation functions, without easily allowing user functions. Also, function=tuple seems a bit backwards. I'd like to declare my column names then specify what they are, or just throw functions into a list.

Before using Django I used Java and Hibernate, and while Hibernate can be, in typical Java fashion, way too verbose, I do like the way it handled projections, which included grouping, aggregation and aliasing. My thinking is definitely influenced by Hibernate, and if you strip off the excessive Java syntax you get some nice things.

First, I wonder if we could have aggregation functions as objects, much like Q objects. Then the syntax could be like: 

>>> queryset.aggregate(Sum('purchases__payment'), Average('age'))

or
>>> queryset.aggregate(Sum('purchases__payment'), Average('purchases__quantity'), Average('age'))
or
>>> queryset.aggregate(Sum('purchases__payment'), Average('purchases__quantity', 'purchases__payment', 'age'))

If you want to provide aliases you could use kwargs:

>>> queryset.aggregate(total_payment=Sum('purchases__payment'), avg_age=Average('age'))

I think that reads much clearer, to me at least. It also allows for the addition of aggregation functions that may not be supported by all databases, or user defined functions.

The difference between positional and keyword arguments could be handled by a default naming convention. Sum('field') would be 'sum_field' or 'field_sum' or something in the values() dictionary. If there's a conflict raise an exception or add a suffix.
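That default-alias-with-conflict-suffix convention could be sketched
like this (the naming scheme shown is only one possible choice, not a
settled API):

```python
def default_alias(field, function, taken):
    """Build a 'field_function' alias; append a numeric suffix on conflict."""
    alias = "%s_%s" % (field, function)
    candidate, n = alias, 2
    while candidate in taken:
        candidate = "%s%d" % (alias, n)
        n += 1
    taken.add(candidate)
    return candidate


taken = set()
print(default_alias("payment", "sum", taken))  # payment_sum
print(default_alias("payment", "sum", taken))  # payment_sum2
```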

 >>> queryset.aggregate(Sum('purchases__payment'), Average('purchases__quantity', 'purchases__payment', 'age')).values()

would return something like:

 [
   {
     'name' : 'Peter Smith',
     ... other attributes ...
     'purchases__payment_sum' : 20000,
     'purchases__payment_avg' : 200,
     'purchases__quantity_avg' : 1500
   },
   { 'age_avg': 35 }
 ]

Or better, purchases could be a dictionary:

 [
   {
     'name' : 'Peter Smith',
     ... other attributes ...
     'purchases' : {
       'payment_sum' : 20000,
       'payment_avg' : 200,
       'quantity_avg' : 1500
     }
   },
   { 'age_avg': 35 }
 ]



If aggregate objects behave similarly to Q objects, then you could combine them, like:

>>> queryset.aggregate(avg_price=Sum('purchases__payment') / Sum('purchases__quantity'))
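A minimal sketch of how such combinable aggregate objects might work;
the class names mirror the suggestion above, but the as_sql() rendering
is invented purely for illustration:

```python
class Aggregate:
    """Base class: combining two aggregates yields a composite expression."""
    def __truediv__(self, other):
        return Combined(self, "/", other)


class Combined(Aggregate):
    def __init__(self, lhs, op, rhs):
        self.lhs, self.op, self.rhs = lhs, op, rhs

    def as_sql(self):
        return "(%s %s %s)" % (self.lhs.as_sql(), self.op, self.rhs.as_sql())


class Sum(Aggregate):
    def __init__(self, field):
        self.field = field

    def as_sql(self):
        return "SUM(%s)" % self.field


expr = Sum("purchases__payment") / Sum("purchases__quantity")
print(expr.as_sql())  # (SUM(purchases__payment) / SUM(purchases__quantity))
```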


It may also be possible to use the aggregation functions directly in filters:

>>> cheap_books = queryset.filter(price__lt=Avg('price'))

I think that's a pretty clean way to specify that query. It should be roughly equivalent to the following (using F(), which I think is an absolutely great idea, and annotate() from Russell):

>>> cheap_books = queryset.annotate(price_avg=Avg('price')).filter(price__lte=F('price_avg'))



Just some thoughts...

-Justin

Petite Abeille

Mar 15, 2008, 5:51:22 PM
to django-d...@googlegroups.com

On Mar 15, 2008, at 10:41 PM, Justin Fagnani wrote:

> since I have a lot more SQL in my projects than I expected when I
> started using Django.

Would it be possible to abstract away that hand crafted SQL from
Django by the simple expedient of hiding it behind a database artifact
such as a view, stored procedure, pipelined function or such?

That way Django doesn't need to know about your SQL and your SQL
doesn't need to be embedded in Django.

The best of both worlds really :)

--
PA.
http://alt.textdrive.com/nanoki/

Justin Fagnani

Mar 15, 2008, 6:05:29 PM
to django-d...@googlegroups.com
On Sat, Mar 15, 2008 at 2:51 PM, Petite Abeille <petite....@gmail.com> wrote:
Would it be possible to abstract away that hand crafted SQL from
Django by the simple expedient of hiding it behind a database artifact
such as a view, stored procedure, pipelined function or such?

I think that would be difficult, since I regularly switch databases. I really like using sqlite for development, for instance, and Postgres when I can, but I'm forced to use MySQL on a shared host. So far, either with extras or just using a cursor, my SQL has been completely portable. It's all aggregation though and would completely go away with aggregation support in the DB API.

-Justin

Petite Abeille

Mar 15, 2008, 7:25:14 PM
to django-d...@googlegroups.com

On Mar 15, 2008, at 11:05 PM, Justin Fagnani wrote:

> I think that would be difficult, since I regularly switch databases. I
> really like using sqlite for development, for instance, and Postgres
> when I
> can, but I'm forced to use MySQL on a shared host. So far, either with
> extras or just using a cursor, my SQL has been completely portable.

Out of curiosity, does your project contain any database-specific DDL
or DML? Perhaps some create table, index, partition, constraint,
sequence, analytic, statistic, or such?

Or is it fully driven end-to-end by Django's ORM? E.g. all the
database artifacts, as well as the database itself, are generated and
managed by Django, from cradle to grave?

If your project already has database-specific artifacts, it is but a
small step to re-factor some of the more advanced DML away from Django
and expose it back to Django in an ORM-friendly way, such as a view
or cursor or whatnot.

After all, what's the point of an ORM but to shield you away from
common SQL trivialities?

On the other hand, if an ORM starts mimicking most of SQL's
peculiarities, perhaps something has gone wrong along the way :)

--
PA.
http://alt.textdrive.com/nanoki/

Malcolm Tredinnick

Mar 15, 2008, 7:55:12 PM
to django-d...@googlegroups.com

On Sun, 2008-03-16 at 00:25 +0100, Petite Abeille wrote:
>
> On Mar 15, 2008, at 11:05 PM, Justin Fagnani wrote:
>
> > I think that would be difficult, since I regularly switch databases. I
> > really like using sqlite for development, for instance, and Postgres
> > when I
> > can, but I'm forced to use MySQL on a shared host. So far, either with
> > extras or just using a cursor, my SQL has been completely portable.
>
> Out of curiosity, does your project contain any database-specific DDL
> or DML? Perhaps some create table, index, partition, constraint,
> sequence, analytic, statistic, or such?

Could you not hijack the thread, please? Nicolas started this to talk
about aggregates, not how one particular project uses custom SQL.

Thank you,
Malcolm

--
No one is listening until you make a mistake.
http://www.pointy-stick.com/blog/

Petite Abeille

Mar 15, 2008, 8:14:23 PM
to django-d...@googlegroups.com

On Mar 16, 2008, at 12:55 AM, Malcolm Tredinnick wrote:

> Could you not hijack the thread, please? Nicolas started this to talk
> about aggregates, not how one particular project uses custom SQL.

Fair enough.

On the other hand, one could argue that the subtext of those questions
is something along the lines of:

"Don't bother re-implementing SQL in your favorite language"

Which is pretty much what emerges from your recent "Queryset
Implementation" post:

http://www.pointy-stick.com/blog/2008/03/11/queryset-implementation/

In any case, apologies for disturbing the groupthink :)

--
PA.
http://alt.textdrive.com/nanoki/

Nicolas E. Lara G.

Mar 16, 2008, 8:25:55 AM
to Django developers

Hi Justin,
Your ideas seem very interesting. Some thoughts about them:

> So I've been lurking and reading the aggregation proposals and some of the
> syntax proposals so far have seemed a little bit odd to me. They assume a
> small, fixed set of aggregation functions, without easily allowing user
> functions. Also, function=tuple seems a bit backwards. I'd like to declare
> my column names then specify what they are, or just throw functions into a
> list.

For me the only real reasons to keep the function=tuple syntax are
specifying group by (which I believe would be done with something
similar in your syntax, but let's not speculate) and not tying the
user to selecting aliases when they are not needed. Aliasing is great
functionality, but if my aliases are always the same name as the
function I am calling, I want them to be generated for me.

I believe your syntax does beat the one I'm proposing regarding the
power of user-defined functions, but as general syntax it still seems
a bit inconsistent with the existing API syntax. Also, it would make
user-defined functions be specified directly in the queryset, and
this is bad for re-use.

> First, I wonder if we could have aggregation functions as objects, much like
> Q objects. Then the syntax could be like:

If you would like to mimic Q objects for aggregates, why not specify
some aggregate definition function like:

>>> my_own_aggregate = Aggregate( Sum('field_1') * 'field2', 'default_name')
>>> queryset.aggregate(purchases__price__custom=my_own_aggregate('alias'))

I believe this would allow the same 'expressiveness', keep the
syntax clear for querysets, and allow re-use.

>
> [ {
> 'name' : 'Peter Smith',
> ... other attributes ...
> 'purchases__payment_sum' : 20000,
> 'purchases__payment_avg' : 200,
> 'purchases__quantity_avg' : 1500 },
> { 'age_avg': 35 } ]
>
> Or better, purchases could be a dictionary:
>
> [ {
> 'name' : 'Peter Smith',
> ... other attributes ...
> 'purchases' : {
> 'payment_sum' : 20000,
> 'payment_avg' : 200,
> 'quantity_avg' : 1500 },
> { 'age_avg': 35 } ]
>

As for the result format, I believe there are many possibilities and
they will depend on the final choice of syntax. For me, the flatter
the results are the better.

> It may also be possible to use the aggregation functions directly in
> filters:
>
> >>> cheap_books = queryset.filter(price__lt=Avg('price'))
>
> I think that's an pretty clean way to specify that query. It should be
> roughly equivalent to the following (using the F(), which I think is an
> absolutely great idea, and annotate() from Russell):
>
> >>> cheap_books = queryset.annotate
>
> (price_avg=Avg('price')).filter(price__lte=F('price_avg'))
>

This I don't like that much. Though it would definitely make the
calls simpler, the use of aggregate functions everywhere might come
back to bite you. Also, users will probably not realize that it
represents a more expensive operation than a simple "price__lt=3".

I believe that the F syntax is very useful since it allows not only
comparison with aggregates but with any other field.

> Just some thoughts...

Thanks a lot for your very good ideas =)

I believe that your syntax would be a great replacement for the
function=tuple syntax (and I believe that just lowercasing the
function names would make it look less odd to me), but I would still
keep the filter-like syntax around. Any thoughts on the "aggregate
definition function" idea?

Regards,

Nicolas

Russell Keith-Magee

Mar 16, 2008, 9:11:44 AM
to django-d...@googlegroups.com
On Sun, Mar 16, 2008 at 2:51 AM, Nicolas E. Lara G.
<nicol...@gmail.com> wrote:
>
> With the QuerySet refactoring being almost ready now I have been
> thinking of implementing aggregates for django models.

Hi Nicolas,

This is an ambitious project, so good luck! Long-time readers will
know that I have an interest in aggregates, so I'll be here to help
out however I can.

Here is some feedback on your specific proposals:

> There should be an aggregate() QuerySet modifier, that _always_
> returns a QuerySet (probably an AggregateQuerySet subclass of
> QuerySet).
>
> suppose you have::
> >>> queryset = Buyer.objects.all()
>
> This modifier receives arguments in two ways:
>
> As function=tuple for the different aggregate functions::
> >>> queryset.aggregate( sum=( 'payment',), avg=('quantity',
> 'payment', 'age') )
>
> which results in a queryset who's values() would be::
> [
> {
> 'name' : 'Peter Smith',
> ... other attributes ...
> 'payment' : {'sum': 20000, 'avg' : 200},
> 'quantity' : {'avg' : 1500}
> }
> {
> 'name' : 'Jane Doe',
> ... other attributes ...
> 'payment' : {'sum': 62000, 'avg' : 321},
> 'quantity' : {'avg' : 12604}
> }
> {
> 'age': {'avg' : 35 }
> }
> ]

I'm not sure I'm entirely comfortable with a few aspects of this.

1) The output format: Using the example data structure you provided,
how do you extract the average age? The best I can see is
"values[2]['age']['avg']" - values[2] is not a very helpful access
mechanism.

2) The query mechanism confuses the idea of an aggregate over the
query set itself (e.g., avg age of all buyers) and an aggregate of a
member in the group (e.g., the average quantity purchased by each
individual buyer).

My original proposal had 2 mechanisms: aggregate() to get aggregates
over the entire QS, which returned a dictionary of the resulting
aggregates, and annotate() to augment object instances with aggregate
attributes.

The particular goal of annotate() was to enable you to get a list of
actual objects which have extra attributes, so you could reference
"somebuyer.average_payment".

In your proposal, I suppose annotate() would return the first two
elements of your result set, and aggregate() would return the last
element.

3) How does this syntax extend into multiple-table aggregates (i.e.,
when you are querying over Buyer and need to get an aggregate over
something related to the Purchase table) - in particular the output
format?

> and in filter-like syntax::
> >>> queryset.aggregate( payment__sum='total_payment',
> quantity__avg='normal_request', payment__avg='avg_pay',
> age__avg='mean_age')

Obviously, I like the aliasing format, but the same comments apply
about the output format and the inconsistency between aggregates over
instances and over groups.

> When a field name exists in the base model that has the same name of
> the field on the related model we want to aggregate we could resolve
> the ambiguety by refering to therelated field like:

Add this to the list of problems that are avoided if you separate
aggregates on the model from aggregates on the instance. :-)

> There is no way to specify that only the aggregated fields should
> appear in the results because this could be done with the modifiers
> proposed in [#5420].

Agreed.

I would add to this - if you take the 'annotated object' approach to
the output format, you could use 'values()' to pull out the subset of
values that you want for each object.

> In the case where group_by is not specified it is always grouped by
> the ForeignKey (or ManyToMany). If the group_by is specified (this is
> always specified in the function=tuple format) the object is grouped
> by those fields.

You've lost me here. Where does 'group_by' come from?

> The count function should, IMHO, be handled in the filter-like syntax
> so you can specify on which field to count allowing you to count
> excluding null values for foreign tables.

Agreed.

> To specify ordering you would use the function order_by refering to
> the alias or the field__function. this is::
>
> qs.order_by('-payment__sum')

> qs.order_by('-total_payment')

Agreed. My only comment here is that it would be nice if the
payment__sum format had some analog in the original query syntax. This
is my biggest problem with the unaliased aggregate notation that you
are proposing - it doesn't really share an analog with anything else
in Django, whereas the 'payment__sum' approach and the aliased format
both have immediate and relatively obvious analogs with the syntax
used by filter().

> SQL

I think you're on the right track here, except for:

> As said before, the GROUP BY clause, when not specified, would be
> generated by using the related key.

Again - you seem to have skipped this bit of your proposal.

> Downsides
> ---------
>
> The only downside I see with having the two syntax possibilities is
> inconsistency. The person doing, for example, the templates would need
> to know how the query was executed in order to decide weather use::
>
> {{ object.total_payment }}
>
> or::
>
> {{ object.payment.sum }}
>
> Of course, this could also be seen as a ''aliasing feature'', but I'm
> not thrilled.

I don't see this as a huge problem. Personally, I could live without
the non-aliased format, but it doesn't bother me enough to want to
reject it, and the syntax makes sense in context.

> Another problem I see is that group_by is too SQLish. You gotta think
> in tables/sql instead of objects. I still don't see a better way of
> allowing the user to specify by which fields to group. Maybe a simple
> name change would make it easier.

Again - where does group_by come from?

> Extras
> ------
>
> 1- An idea I had, but I am not sure yet how it could be well
> implemented
> is to allow users to define their own aggregate function and provide
> it to extend the aggregate api.

Interesting idea, but one that I suspect could be added after the
fact, possibly as part of a larger customization project. For example,
GeoDjango could probably make use of some customized filter tags -
whatever approach is used for aggregates should probably be mirrored
in filters.

> 2- Another extra thing that could be explored is Russell's idea of
> allowing field lookups to compare two fields:

Again, this is something that is useful outside aggregates itself, and
could be implemented independently of any aggregate implementation.

> I put this as an extra since it does not involves the whole ORM and
> not just the aggregates though I think would be a very nice to have
> feature.
>
> 3- Another thing I've been looking into is to allow, for performance
> reasons, the aggregates to be kept in a table as is done in
> django-simpleaggregation [1].

I'll need to look into this before I can make a full comment, but my
immediate reaction is that we should concentrate on simple SQL query
construction, rather than optimization strategies.

Yours,
Russ Magee %-)

Nicolas E. Lara G.

Mar 16, 2008, 4:05:08 PM
to Django developers
Hi Russell,

>
> This is an ambitious project, so good luck! Long time readers will
> know that I have an interest in aggregates, so i'll be here to help
> out however I can.
>

Thanks

>
> 1) The output format: Using the example data structure you provided,
> how do you extract the average age? The best I can see is
> "values[2]['age']['avg']" - values[2] is not a very helpful access
> mechanism.
>

This is entirely true. I guess I didn't think that through. Still,
the result format will probably change according to the final syntax
if this is decided to be implemented.

> 2) The query mechanism confuses the idea of an aggregate over the
> query set itself (e.g., avg age of all buyers) and an aggregate of a
> member in the group (e.g., the average quantity purchased by each
> individual buyer).
>
> My original proposal had 2 mechanisms: aggregate() to get aggregates
> over the entire QS, which returned a dictionary of the resulting
> aggregates, and annotate() to augment object instance with aggregate
> attributes.
>

I know, but I was trying to avoid using two different mechanisms.
(more below)

> The particular goal of annotate() was to enable you to get a list
> actual objects which have extra attributes, so you could reference
> "somebuyer.average_payment".
>
> In your proposal, I suppose annotate() would return the first two
> elements of your result set, and aggregate() would return the last
> element.
>

That would be exactly the equivalence of annotate and aggregate.

My idea is to be able to do both things with only one queryset
modifier. I believe that both things are conceptually equal -they are
aggregates- and should be treated as such. Although your separation
makes it easier to implement and get the results, it would be great
to allow the user, in only one use of the aggregate modifier, to get
aggregates over both the entire queryset and the instances.
For now the only solution I see is to return a tuple, but this is
not a very elegant solution (I would prefer the separation into two
modifiers). I'll give it more thought and re-post soon.

> 3) How does this syntax extend into multiple-table aggregates (i.e,
> when you are querying over Buyer and need to get an aggregate over
> something related to the Purchase table) - in particular the output
> format?

It does, by the rule you quoted below. If the name specified is part
of the model we are querying on (i.e. Buyer), it is an aggregate over
the query itself. If it isn't, it aggregates over a member of the
group (the related model that contains a field with that name). This
is not hard to implement and I believe it comes very intuitively to
the user.

> > and in filter-like syntax::
> > >>> queryset.aggregate( payment__sum='total_payment',
> > quantity__avg='normal_request', payment__avg='avg_pay',
> > age__avg='mean_age')
>
> Obviously, I like the aliasing format, but the same comments about
> output format and inconsistency between aggregates over instances and
> over groups.
>
> > When a field name exists in the base model that has the same name of
> > the field on the related model we want to aggregate we could resolve
> > the ambiguety by refering to therelated field like:
>
> Add this to the list of problems that are avoided if you separate
> aggregates on the model from aggregates on the instance. :-)
>

This I don't see as a problem but as a feature. It could be done in a
way that you always need to refer to the related field when calling an
aggregate on an instance, but this is unnecessary if there are no name
clashes. It is equally easy to check the fields on the related table
to find out which one is to be used, and this is a very inexpensive
operation. Still, I wouldn't oppose always specifying the related
field name in aggregates over groups.

> > There is no way to specify that only the aggregated fields should
> > appear in the results because this could be done with the modifiers
> > proposed in [#5420].
>
> Agreed.
>
> I would add to this - if you take the 'annotated object' approach to
> the output format, you could use 'values()' to pull out the subset of
> values that you want for each object.

Good. So there are going to be multiple ways of restricting the
results. No need to implement this a third time.

> > In the case where group_by is not specified it is always grouped by
> > the ForeignKey (or ManyToMany). If the group_by is specified (this is
> > always specified in the function=tuple format) the object is grouped
> > by those fields.
>
> You've lost me here. Where does 'group_by' come from?

It is true that I didn't say much about the group by feature, so here
it goes:
Though group by seems very SQL-ish to me, I understand it is a feature
that many people would like to have since it allows more flexibility
(and nobody likes to have to go and do it in raw SQL). The group_by
attribute could be added (in the function=tuple way) to specify the
fields to go in the GROUP BY clause. When this attribute is present,
the GROUP BY clause would be populated with those fields instead of
the foreign key/id.

>
> > The count function should, IMHO, be handled in the filter-like syntax
> > so you can specify on which field to count allowing you to count
> > excluding null values for foreign tables.
>
> Agreed.
>
> > To specify ordering you would use the function order_by refering to
> > the alias or the field__function. this is::
>
> > qs.order_by('-payment__sum')
> > qs.order_by('-total_payment')
>
> Agreed. My only comment here is that it would be nice if the
> payment__sum format had some analog in the original query syntax. This
> is my biggest problem with the unaliased aggregate notation that you
> are proposing - it doesn't really share an analog with anything else
> in Django, whereas the 'payment__sum' approach and the aliased format
> both have immediate and relatively obvious analogs with the syntax
> used by filter().
>

This is true. And I also mention it in another post. But one thing is
being able to have aliases and another is for aliases to be
compulsory. This is why I believe the function=tuple syntax could stay
around to help in the simple cases, while keeping the filter-like
syntax as the preferred one.


>
> > Downsides
> > ---------
>
> > The only downside I see with having the two syntax possibilities is
> > inconsistency. The person doing, for example, the templates would need
> > to know how the query was executed in order to decide whether to use::
>
> > {{ object.total_payment }}
>
> > or::
>
> > {{ object.payment.sum }}
>
> > Of course, this could also be seen as a ''aliasing feature'', but I'm
> > not thrilled.
>
> I don't see this as a huge problem.

Neither do I. But I felt like I was hiding something if I didn't point
it out.

>
> > Extras
> > ------
>
> > 1- An idea I had, but I am not sure yet how it could be well
> > implemented
> > is to allow users to define their own aggregate function and provide
> > it to extend the aggregate api.
>
> Interesting idea, but one that I suspect could be added after the
> fact, possibly as part of a larger customization project. For example,
> GeoDjango could probably make use of some customized filter tags -
> whatever approach is used for aggregates should probably be mirrored
> in filters.
>

I wasn't very sure of how this could be done, but a previous post by
Justin pointed out some interesting ideas. You can read my new idea for
this in my response to his post. But I agree, this is a further step
that comes after the basics of aggregates are done.

> > 2- Another extra thing that could be explored is Russell's idea of
> > allowing field lookups to compare two fields:
>
> Again, this is something that is useful outside aggregates itself, and
> could be implemented independent of any aggregate implementation.
>
> > I put this as an extra since it involves the whole ORM and not just
> > the aggregates, though I think it would be a very nice feature to
> > have.
>
> > 3- Another thing I've been looking into is to allow, for performance
> > reasons, the aggregates to be kept in a table as is done in
> > django-simpleaggregation [1].
>
> I'll need to look into this before I can make a full comment, but my
> immediate reaction is that we should concentrate on simple SQL query
> construction, rather than optimization strategies.
>

I will. Everything in this 'section' is an extra that could be added
after the basics are done.

> Yours,
> Russ Magee %-)

Thanks for your comments =)

As a conclusion, I will try to come up with a better result syntax or,
if I can't, yield to the separating-into-two-modifiers option. I will
post on this tomorrow probably. I will also see into providing a
better way to specify which fields to group that is less SQLish and
helps the user think in models/objects and not in tables/joins.

Russell Keith-Magee

Mar 16, 2008, 7:13:33 PM
to django-d...@googlegroups.com
On Mon, Mar 17, 2008 at 5:05 AM, Nicolas E. Lara G.
<nicol...@gmail.com> wrote:
>
> > In your proposal, I suppose annotate() would return the first two
> > elements of your result set, and aggregate() would return the last
> > element.
> >
>
> That exactly would be the equivalence of annotate and aggregate.
>
> My idea is to be able to do both things with only one queryset
> modifier. I believe that both things are conceptually equal -they are
> aggregates- and should be treated as such.

But they're not equivalent. One mechanism returns a dictionary of
aggregates. The other returns a list of model instances, each of which
is annotated with aggregates. The only similarity is that 'they are
both about aggregates' - beyond that superficial similarity, they are
quite different. Given that there are two quite different output
modes, I don't think having two modifiers is at all unusual.

Potential name clashes, and needing to introduce a syntax that allows
for multiple ways to reference the same aggregate - this is a feature?

> My point is that it could be done in a way that you always need to
> refer to the related field when calling an aggregate on an instance,
> but this is unnecessary if there are no name clashes. It is equally
> easy to check the fields on the related table to find out which is to
> be used, and this is a very inexpensive operation. Still, I wouldn't
> oppose always specifying the related field name in aggregates over
> groups.

The throwaway line to counter this is "Explicit is better than implicit".

On a practical note, I haven't sat down and worked on an
implementation, but I suspect that you will find that the logic
required to resolve these shortcut aggregates will be more headache
than it is worth.

> > > In the case where group_by is not specified it is always grouped by
> > > the ForeignKey (or ManyToMany). If the group_by is specified (this is
> > > always specified in the function=tuple format) the object is grouped
> > > by those fields.
> >
> > You've lost me here. Where does 'group_by' come from?
>
> It is true that I didn't say much about the group by feature, so here
> it goes:
> Though group by seems very SQL-ish to me, I understand it is a feature
> that many people would like to have since it allows more flexibility
> (and nobody likes to have to go and do it in raw sql). The group_by
> attribute could be added (in the function=tuple way) to specify the
> fields to go in the GROUP BY clause. In the case that this attribute
> is present the group by clause would be populated with those fields
> instead of the foreign key/id.

Remember, as with all Django ORM work, we're only trying to get to
80%; complex problems will always require complex solutions, for which
the right answer is to roll your own SQL.

With that in mind: I'm not convinced of the need for group_by(). Can
you provide a use case where it would actually be useful? It seems to
me that both of the use cases driven by the output formats (aggregate
of entire QS, and annotated aggregates of QS members) pretty much
imply the group_by that is required. If you start playing with
group_by, you're going to get QS output which isn't a clean or easily
interpreted object list.

Yours,
Russ Magee %-)

Nicolas E. Lara G.

Mar 16, 2008, 8:21:17 PM
to Django developers


On Mar 16, 7:13 pm, "Russell Keith-Magee" <freakboy3...@gmail.com>
wrote:
> On Mon, Mar 17, 2008 at 5:05 AM, Nicolas E. Lara G.
>
> <nicolasl...@gmail.com> wrote:
>
> > > In your proposal, I suppose annotate() would return the first two
> > > elements of your result set, and aggregate() would return the last
> > > element.
>
> > That exactly would be the equivalence of annotate and aggregate.
>
> > My idea is to be able to do both things with only one queryset
> > modifier. I believe that both things are conceptually equal -they are
> > aggregates- and should be treated as such.
>
> But they're not equivalent. One mechanism returns a dictionary of
> aggregates. The other returns a list of model instances, each of which
> is annotated with aggregates. The only similarity is that 'they are
> both about aggregates' - beyond that superficial similarity, they are
> quite different. Given that there are two quite different output
> modes, I don't think having two modifiers is at all unusual.
>

We are not understanding each other here. I meant that, yes, annotate
is equivalent to the first two elements and aggregate is equivalent to
the last element.
Annotate and aggregate are not equivalent to each other at all. I just
would like to obtain both functionalities with one modifier. Basically
I would like to be able to do only one query to obtain both model-wise
aggregates and related-field aggregates. Having two different
querysets for each would result in having to do two queries. But, as
you say, if it is *that* important to have only *one* db access, go to
raw sql.
You convinced me with that.
It is true that group by might complicate both the implementation and
the use. I just thought it to be a 'wanted' feature. I believe that
grouping is a very delicate aspect and would make the user abandon the
abstraction and start thinking SQL. This is why I've been trying to
think of a way to provide the functionality without implicitly setting
group_by. But I wouldn't mind, at all, leaving this as a raw sql case
since most of the use cases I could think of could be solved by
aggregating on the related model after filtering.

I still owe you the proposed result format for having only one
modifier. I'll reply on that tomorrow since my time-zone requires me
to go to sleep and all I could think of so far is returning a tuple
(which makes no sense)

Best Regards,

Nicolas

Justin Fagnani

Mar 16, 2008, 8:23:31 PM
to django-d...@googlegroups.com
Hey Nicolas,

For me the only real reasons to keep function=tuple syntax are
specifying group by (which I believe would be done with something
similar in your syntax, but let's not speculate)

I like Russell's idea of using values to set GROUP BY. In my examples, since there was no values() call, GROUP BY is assumed to be on all the fields (or on id). An example grouping would be:

>>> queryset.values('age').aggregate(Sum('purchases__payment'))

Something that comes to mind just from looking at that is that the result will have 'age' and 'purchases__payment_sum', even though values just has 'age'. Maybe values could take Aggregate objects directly so that we could write:

>>> queryset.values('age', Sum('purchases__payment'))

That's definitely clean.

and to not tie the
user to select aliases when they are not needed. Aliasing is a great
functionality but if my aliases are always the same name as the
function I am calling I want them to be generated for me.

This is the point of allowing positional args and keyword args. With positional args you don't specify an alias, and with kwargs you do, but it happens in the same syntax. The example above does not specify aliases.

One difference in my ideas is that instead of the result having a dictionary for each aggregate function, that a default name be provided in the absence of an alias. This gets rid of any difference in the structure of the results for aliases or no aliases.

I believe your syntax does beat the one I'm proposing regarding the
power of user-defined functions, but as for general syntax it still
seems a bit inconsistent with existing API syntax.

I think it's a bit more consistent when you consider the Q syntax, which it mirrors pretty well. The problem I have with the filter syntax is that the aggregate() function itself is responsible for determining what function you want based on parsing the kwargs, which makes adding new functions and supporting user functions difficult. filter() is different because databases and users don't typically add operators.


Also, it would require user-defined functions to be directly specified
in the queryset, and this is bad for re-use.

Since Aggregate objects override arithmetic operators, users could certainly, and very easily, write a custom function. You touch on this later, but I just want to point out that Aggregates as objects allow a great deal of flexibility in how they're used. Look back at my example with a custom function in aggregate():

>>> queryset.aggregate(avg_price=Sum('purchases__payment') / Sum('purchases__quantity'))

Because Aggregate function objects can be combined with operators, it becomes trivial to write and use user functions:

>>> def my_func(value_field, quantity_field):
>>>     return Sum(value_field) / Sum(quantity_field)

>>> queryset.aggregate(avg_price=my_func('purchases__payment', 'purchases__quantity'))

> It may also be possible to use the aggregation functions directly in
> filters:
>
> >>> cheap_books = queryset.filter(price__lt=Avg('price'))
>
> I think that's a pretty clean way to specify that query. It should be
> roughly equivalent to the following (using the F(), which I think is an
> absolutely great idea, and annotate() from Russell):
>
> >>> cheap_books = queryset.annotate(price_avg=Avg('price')).filter(price__lte=F('price_avg'))
>

This I don't like that much. Though it would definitely make the
calls simpler, the use of aggregate functions everywhere might come
back to bite you. Also, users will probably not realize that it
represents a more expensive operation than a simple "price__lt=3".

How is it more expensive? You still have to query the DB to get that value of 3, right? 
 
I believe that your syntax would be a great replacement for the
tuple=function syntax (and I believe that just lowercase-ing the
function names would make it look less odd to me)

The reason why they're uppercase is that they're objects, exactly like Q objects. This is why they can be combined with regular python operators. Maybe that part wasn't clear in my email, but these aggregate objects would be responsible for emitting the correct SQL or relational algebra objects (whatever Malcolm's work requires) for SELECT, GROUP BY, and FROM clauses. The objects would form an expression tree when combined with operators. Here are some very basic example classes:

class Expression(object):
    def __add__(self, rhs):
        return Add(self, rhs)

class Add(Expression):
    def __init__(self, lhs, rhs):
        self.lhs = lhs
        self.rhs = rhs
    def select(self):
        return self.lhs.select() + " + " + self.rhs.select()

class Sum(Expression):
    def __init__(self, col_name):
        self.col_name = col_name
    def select(self):
        return "SUM(%s)" % self.col_name

I hope that makes some things more clear. I think your reservations are mostly already taken care of. However, I have no idea how hard it would be to implement. I'd love to hear Malcolm's thoughts on this, and I guess if I'm really interested in this it's time to start digging into the qs-rf code and see how it works.

Cheers,
  Justin


Justin Fagnani

Mar 17, 2008, 1:22:14 AM
to django-d...@googlegroups.com
Well I started looking at the qs-rf code and it seems like Malcolm already has quite a start on aggregation support. I actually got something sorta working in the last hour, though it's probably a nasty hack.

If you're not familiar with the qs-rf code, this might not make a lot of sense: Basically, in a lot of places sql.Query.select and other lists aren't assumed to contain only strings. If an item isn't a string then .as_sql() is called. So I wrote a test Sum aggregation object with .as_sql() and I made two small changes: one to sql.Query.add_fields() and one to ValuesQuerySet.iterator. Now you can pass the Sum object to values() as a positional arg and it'll work.

So currently, the following works:

# Models:
class Book(models.Model):
  author = models.CharField(max_length=128) 
  name = models.CharField(max_length=128)
  price = models.FloatField()

class Customer(models.Model):
  name = models.CharField(max_length=128)
  age = models.IntegerField()

class Purchase(models.Model):
  customer = models.ForeignKey(Customer)
  book = models.ForeignKey(Book)
  quantity = models.IntegerField()
  payment = models.FloatField()

# Aggregate objects:
class Sum(object):
    def __init__(self, col=None, alias=None):
        self.col = col
        self._alias = alias
    def as_sql(self, quote_func=None):
        return "SUM(%s)" % self.col
    def alias(self):
        return self._alias or (self.col + "_sum")

class Avg(object):
    def __init__(self, col=None, alias=None):
        self.col = col
        self._alias = alias
    def as_sql(self, quote_func=None):
        return "AVG(%s)" % self.col
    def alias(self):
        return self._alias or (self.col + "_avg")


#Input some data...

>>> c = Customer.objects.get(id=1)
>>> c.purchases.values()
[ {'payment': 1.0, 'customer_id': 1, 'id': 1, 'book_id': 1, 'quantity': 1},
  {'payment': 2.0, 'customer_id': 1, 'id': 2, 'book_id': 2, 'quantity': 1},
  {'payment': 20.0, 'customer_id': 1, 'id': 3, 'book_id': 3, 'quantity': 10} ]

>>> c.purchases.values(Sum('payment'), Avg('payment'))
[{'payment_sum': 23.0, 'payment_avg': 7.666666666666667}]

Auto grouping by non-aggregate fields in values() also works:

>>> Book.objects.values()
[ {'price': 1.0, 'id': 1, 'name': u'book 1', 'author': u'Twain'},
  {'price': 2.0, 'id': 2, 'name': u'book 2', 'author': u'Twain'},
  {'price': 2.0, 'id': 3, 'name': u'book 3', 'author': u'Wilde'} ]

>>> Book.objects.values('author',Avg('price'))
[ {'price_avg': 1.5, 'author': u'Twain'},
  {'price_avg': 2.0, 'author': u'Wilde'} ]


Aggregation/grouping across relationships doesn't work. I tried to figure that out, but totally hit the wall in the qs code. Also, using kwargs to specify the alias doesn't work, but the aggregate classes can have an alias set, so this works:

>>> Book.objects.values(Avg('price', alias='average_price'))
[{'average_price': 1.6666666666666667}]


Malcolm also already has aggregation-related classes in sql.datastructures. (My hack won't work with his classes because it wants an .alias() method.) Without his input, I'd guess he has a pretty good idea where he's going with this. I only added 10 lines to sql.Query and 1 to ValuesQuerySet, so it appears extremely close.

I attached my patch if anyone wants to check it out. Supporting arithmetic/logical operations should be pretty straightforward, and I might play with that next.
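As a rough illustration of how that arithmetic support might slot into the same `as_sql()`/`alias()` interface as the aggregate objects above, here is a sketch; the `Add` node and its default alias scheme are hypothetical, not part of the attached patch (a minimal `Sum` is repeated so the sketch stands alone):

```python
class Sum(object):
    # Minimal copy of the aggregate object from the patch above,
    # included here only so the sketch is self-contained.
    def __init__(self, col=None, alias=None):
        self.col = col
        self._alias = alias
    def as_sql(self, quote_func=None):
        return "SUM(%s)" % self.col
    def alias(self):
        return self._alias or (self.col + "_sum")

class Add(object):
    # Hypothetical arithmetic node: it exposes the same two methods,
    # so ValuesQuerySet could treat it like any other aggregate object.
    def __init__(self, lhs, rhs, alias=None):
        self.lhs, self.rhs, self._alias = lhs, rhs, alias
    def as_sql(self, quote_func=None):
        return "(%s + %s)" % (self.lhs.as_sql(quote_func),
                              self.rhs.as_sql(quote_func))
    def alias(self):
        return self._alias or (self.lhs.alias() + "_plus_" + self.rhs.alias())

total = Add(Sum('payment'), Sum('quantity'), alias='grand_total')
print(total.as_sql())   # (SUM(payment) + SUM(quantity))
print(total.alias())    # grand_total
```

Because the combined node keeps the aggregate-object interface, no extra logic would be needed in values() to accept it.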

Malcolm, it'd be great to hear your input since it seems like you have most of the infrastructure in place to support this type of aggregation. The qs-rf code looks great, btw. It was pretty easy to understand (except for maybe sql.Query.setup_joins which I'm trying to wrap my head around :)

Cheers, 
  Justin

django-aggregation.diff

Malcolm Tredinnick

Mar 17, 2008, 5:40:43 AM
to django-d...@googlegroups.com
I don't have time or spare brain cycles to wade through this thread and
give it the attention it deserves at the moment. Russell should have
things well in hand. He's picked up pretty much the same things I
noticed that were problematic in Nicolas's original proposal and then
some more stuff.

I'll just make a few quick comments on your proto-implementation.
You've pretty much found a logical way to fit this in, but you need to
be more imaginative in your test cases to drive your understanding a bit
further.

On Sun, 2008-03-16 at 22:22 -0700, Justin Fagnani wrote:
[...]

> #Aggregate objects:
> class Sum(object):
> def __init__(self, col=None, alias=None):
> self.col = col
> self._alias = alias
> def as_sql(self, quote_func=None):
> return "SUM(%s)" % self.col
> def alias(self):
> return self._alias or (self.col + "_sum")

I'll just consider this one class. The comments apply to both.

You can't just deal with "col" as a string here, since there will also
be a table alias attached to make sure the column is selected from the
correct table. Just using "col" could very easily be ambiguous. For
example, what if I was summing the "id" values? There are many tables
with an "id" column that could be involved in the query.

As a general observation, there are multiple uses of the word alias
floating around here: table aliases and the alias used for the output
column. You might want to be very explicit and call the output alias
something clear like output_alias, since "alias" is used throughout the
Query class to refer to table aliases.

Have a look at how add_filter() works out both the alias and the column
name to use when it's constructing the where-clause entry (look at the
"entry" tuple in add_filter(), for example). You'll need to replace the
name that is passed in with a reference to the (right alias,
db_column_name) pair. Looking at the existing Count() class might give
you some starting points -- particularly how it deals with the two
parts of its "col" attribute, although that is not intended to be a fully
fledged count implementation, since it only counts the entire result
set. However, it shouldn't be too far away from what is needed and the
API should be mostly what is needed.

Any class like this has to implement relabel_aliases(). There are a
number of cases where the aliases might need to change after the
structure has been created, so you need that method. It's passed a
dictionary that maps the old alias name to the new name for any aliases
that are being changed. Again, the existing Count class will give you an
example of what happens there.
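A minimal sketch of what that contract could look like, assuming the aggregate stores its column as an (alias, column) pair as suggested above; this is illustrative only, not the existing Count class:

```python
class Sum(object):
    def __init__(self, col):
        # col is a (table_alias, column_name) pair rather than a bare
        # string, so the column stays unambiguous across joined tables.
        self.col = col

    def as_sql(self, quote_func=None):
        quote = quote_func or (lambda name: name)
        return "SUM(%s.%s)" % (quote(self.col[0]), quote(self.col[1]))

    def relabel_aliases(self, change_map):
        # Rewrite the table alias if the query has renamed it; the
        # Query code passes a dict mapping old alias -> new alias.
        if self.col[0] in change_map:
            self.col = (change_map[self.col[0]], self.col[1])

agg = Sum(('T1', 'payment'))
agg.relabel_aliases({'T1': 'T3'})
print(agg.as_sql())  # SUM(T3.payment)
```

The key point is that relabel_aliases() must touch every stored table alias, since the surrounding query is free to rename tables after the aggregate object has been constructed.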

[...]


>
> Aggregation/grouping across relationships doesn't work. I tried to
> figure that out, but totally hit the wall in the qs code.

This should Just Work(tm) once you start handling the mapping from input
attribute names to (alias, column) pairs correctly. It's not just five
minutes work to understand, but that's because it's pretty much the meat
of the whole query construction process.

Basically, the string you pass in the "name" parameter to setup_joins()
is something that looks like foo__bar__baz (in your current example, it
would be something like "price"). It's the normal Django notation for
referring to a field, either on the current model or on a related model
via the double-underscore connections. The setup_joins() method works
out which tables need to be joined so that you can query the right
columns to get to your information. Remember that the database table and
column names are not the same as the field names you pass in. So you
need to look at the "field" object you get back from setup_joins() to
work out the real column name.

The best way to get a feeling for what's going on there is to print out
what you get back from setup_joins() and poke around at those
structures. It's somewhat in flux, since I'm doing some heavy
optimisation work on that code at the moment, but what you learn now
will still be relevant in any changed version in a day or two. The most
likely change there is that the nested list of joins that is returned
from setup_joins() will be a flat list and there'll be a parameter that
says where the final table join started (each successive model that is
added to the filter could add one or two tables to the join list and we
need to know where the last model started sometimes). Anyway, I'm still
working on that, since the code now behaves correctly, so I can now
modify it to make a few things go faster.

Your change to the iterator() method in ValuesQuerySet looks ugly. The
attribute is called field_names, so it should contain the field names,
not either the field names or some structure that has to be poked at to
get the field names. My point is that modifications to handle these
non-string inputs should go in _setup_query(), rather than iterator().

Note there's a bug in _setup_query() and iterator() that I'll eventually
fix. I only mention it here in case you're wondering how it works at the
moment. If you pass in multi-model values() fields and also have
extra(select=...), the wheels fall off. Also, if you only select some of
the extra(select=...) fields in values(), the result is wrong. That's
easy enough to fix, though. Just a matter of whacking on _setup_query()
some more.

Also, ValuesQuerySets are going to grow a bit more error checking. Right
now,

Entry.objects.filter(....).values(...) |
Entry.objects.filter(...)

doesn't raise an error. It should, since the two halves are different
sorts of objects. It should also raise an error if you combine two
values() queries with different fields in them, most likely (or perhaps
merge the fields). That's still on the TODO list, so don't do that yet,
but eventually you'll need to test what happens when you try to combine
two querysets via "&" or "|". But I haven't implemented specific
versions of __and__ and __or__ on ValuesQuerySets yet.

Okay.. that's a lot to think about. If you really want to understand the
alias stuff and why it's relevant (or get a feeling for what the result
should look like), print out some of these queries
(my_queryset.query.as_sql()) to see what is being produced. If you have
a naked column -- that is, one without a table alias of some kind
attached -- it will lead to bugs down the track because it will be
possible to introduce that column name in more than one table in the
query and that isn't valid SQL.

Regards,
Malcolm

--
Why be difficult when, with a little bit of effort, you could be
impossible.
http://www.pointy-stick.com/blog/

Justin Fagnani

Mar 17, 2008, 3:03:48 PM
to django-d...@googlegroups.com
Thanks for the reply Malcolm,

No worries on limited cycles. I'll have to go through your reply again later today when I get a chance. My hack was definitely ugly, and the aggregate classes... minimal, since I wasn't expecting to write any code, but then saw that it looked like you almost had support for non-string fields.

So I'll push forward on this as much as I can and see where I get. There's a lot to digest to fully understand how everything works. I'll write back with questions as soon as I can.

Thanks,
  Justin

Justin Fagnani

Mar 17, 2008, 9:11:39 PM
to django-d...@googlegroups.com
Ok, I have a chance to work on this more now. I made a new ticket for it at #6811 since it's so different from the other proposed syntax.

-Justin

David Cramer

Mar 18, 2008, 12:29:20 AM
to Django developers
I still prefer our "OMG ITS SQL" patch: http://code.djangoproject.com/ticket/4997

I want to manually specify group by. I don't want to count, or sum, or
average; I want to group by.

On Mar 17, 6:11 pm, "Justin Fagnani" <justin.fagn...@gmail.com> wrote:
> Ok, I have a chance to work on this more now. I made a new ticket for it at
> #6811 since it's so different from the other proposed syntax. http://code.djangoproject.com/ticket/6811
>
> -Justin
>
> On Mon, Mar 17, 2008 at 12:03 PM, Justin Fagnani <justin.fagn...@gmail.com>

Nicolas E. Lara G.

Mar 18, 2008, 4:31:46 PM
to Django developers
Hello again,

I've been digging into the queryset-refactor code and decided to
implement a proof-of-concept of the aggregate method.
This is by no means a final implementation. At the moment there is no
support for following related fields (joins), and table aliases are not
managed (though that's not needed since there are no joins).

As you can see, I started implementing the aggregate modifier according
to Russell's suggestion (separate aggregate and annotate). I've been
thinking about joining both modifiers in one, but came to the
conclusion that it is better to do the separate implementations and,
in case it is necessary to have a modifier that does both (which I
think is the case), add some syntactic sugar (a shortcut) to use both
methods in one modifier call.

Also the "GROUP BY" functionality is implemented using values().

This implementation works like this:

== Models ==

class Purchase(models.Model):
    identifier = models.CharField(max_length=100)
    payment = models.FloatField()
    quantity = models.IntegerField()


class Buyer(models.Model):
    name = models.CharField(max_length=100)
    age = models.IntegerField()
    purchases = models.ManyToManyField(Purchase)

== The rest ==

>>> Buyer.objects.all().values()
[{'age': 21, 'id': 1, 'name': u'Peter'},
 {'age': 35, 'id': 2, 'name': u'John'}]

>>> Buyer.objects.all().aggregate(age__avg='mean_age')
{'mean_age': Decimal("28.0000000000000000")}

>>> Purchase.objects.all().aggregate(quantity__sum='total_adquisitions', payment__avg='mean_pay')
{'total_adquisitions': 579L, 'mean_pay': 1446397.2}

>>> Purchase.objects.all().values()
[{'identifier': u'r123', 'id': 1, 'quantity': 25, 'payment': 25000.0},
 {'identifier': u'r1254', 'id': 2, 'quantity': 56, 'payment': 356841.0},
 {'identifier': u'dger', 'id': 3, 'quantity': 85, 'payment': 6185463.0},
 {'identifier': u'asgvrs', 'id': 4, 'quantity': 68, 'payment': 651161.0},
 {'identifier': u'r123', 'id': 5, 'quantity': 345, 'payment': 13521.0}]

>>> Purchase.objects.all().values('identifier').aggregate(quantity__sum='num_adquisiitions', payment__sum='total_pay')
[{'num_adquisiitions': 370L, 'identifier': u'r123', 'total_pay': 38521.0},
 {'num_adquisiitions': 56L, 'identifier': u'r1254', 'total_pay': 356841.0},
 {'num_adquisiitions': 68L, 'identifier': u'asgvrs', 'total_pay': 651161.0},
 {'num_adquisiitions': 85L, 'identifier': u'dger', 'total_pay': 6185463.0}]

>>> Purchase.objects.all().values('identifier').aggregate(quantity__sum='num_adquisiitions', payment__count='num_payments', payment__avg='average_earnings')
[{'average_earnings': 19260.5,
  'identifier': u'r123',
  'num_adquisiitions': 370L,
  'num_payments': 2L},
 {'average_earnings': 356841.0,
  'identifier': u'r1254',
  'num_adquisiitions': 56L,
  'num_payments': 1L},
 {'average_earnings': 651161.0,
  'identifier': u'asgvrs',
  'num_adquisiitions': 68L,
  'num_payments': 1L},
 {'average_earnings': 6185463.0,
  'identifier': u'dger',
  'num_adquisiitions': 85L,
  'num_payments': 1L}]


(sorry for the data dump...)

I will open a ticket with the proposed syntax and the respective
patches.

Any thoughts?

Regards,

Petite Abeille

Mar 18, 2008, 5:05:07 PM
to django-d...@googlegroups.com

On Mar 18, 2008, at 9:31 PM, Nicolas E. Lara G. wrote:
>
> Any thoughts?

So...

Django:
Buyer.objects.all().aggregate(age__avg='mean_age')

SQL:
select avg( age ) as mean_age
from buyer

Django:
Purchase.objects.all().values('identifier').aggregate(quantity__sum='num_adquisiitions', payment__count='num_payments', payment__avg='average_earnings')

SQL:
select identifier,
sum( quantity ) as num_adquisiitions,
count( payment ) as num_payments,
avg( payment ) as average_earnings
from purchase
group by identifier

Nicolas E. Lara G.

Mar 18, 2008, 5:19:55 PM
to Django developers
The ticket can be found at http://code.djangoproject.com/ticket/6821

On Mar 18, 4:31 pm, "Nicolas E. Lara G." <nicolasl...@gmail.com>
wrote:

Justin Fagnani

Mar 18, 2008, 8:26:33 PM
to django-d...@googlegroups.com
Hey Nicolas,

It seems to be a bad week to have this discussion with all the main devs busy with the PyCon sprint and aggregation not being a top priority, but I want to see if I can clarify my objection to the field__function=alias syntax so that hopefully we can be on the same page and not have duplicated proposals and work. Hopefully the core devs can check this out after this week.

So here goes...

1) My first, and main, objection to the syntax is that it's simply unnecessary

Most developers are used to variable=function(argument) syntax for variable assignment and function application, so it seems like a good idea to stick with that.

The only reason to have argument__function=variable is to stay consistent with QuerySet.filter(). However I'd argue that they are very different cases because there is no way to represent filter operators in the standard (lhs op rhs) syntax. The best you could do is something like filter(price=lt(5)), which isn't any better. And at least with lhs__op=rhs the elements are in the same order, so it almost looks right.

The filter() syntax was borne out of necessity, but there's no parallel need here because we can just write aggregate(alias=function(column))

2) It's magic. Again, the filter syntax is a little magical, but it was necessary.

3) It overloads __ for yet another purpose

__ is already used to traverse relations and to specify query operators, and now it'll be used for functions too. What happens when these uses collide?

For instance, this syntax would mean that you can't have related_names that are the same as function names. While it would be a bad idea to have a relation named 'sum' (just like it's a bad idea to have related_name='lte') I don't see a reason to add more restrictions and gotchas.

There may be other conflicts too, I'm not sure, but the point is that we can sidestep all potential issues by not overloading __ too much.

4) It adds more logic to QuerySet/Query and may be hard to extend.

With aggregate functions as objects, most of the work to implement them is done outside of the core db wrapper. This hopefully reduces chances for bugs and makes the ORM easier to maintain. It definitely means that it's easier to add new functions, both for core and for user functions. If someone wants to implement support for something like MySQL's GROUP_CONCAT() or PostgreSQL's EVERY(), they can do it simply by defining a class.


Now I fully understand that combining values() and aggregate() and doing automatic grouping might be deemed confusing, since as Russell said, "explicit is better than implicit", but I at least think that a standard function application syntax should be considered.


Cheers,
  Justin

Malcolm Tredinnick

unread,
Mar 18, 2008, 8:47:17 PM3/18/08
to django-d...@googlegroups.com

On Tue, 2008-03-18 at 17:26 -0700, Justin Fagnani wrote:
[...]
>
> 2) It's magic. Again, the filter syntax is a little
> magical, but it was necessary.

"Magic" is an emotive and terribly ill-defined term. It has no place in
a technical discussion.

> 3) It overloads __ for yet another purpose
>
>
> __ is already used to traverse relations and to specify query
> operators, and now it'll be used for functions too. What happens when
> these uses collide?

Your argument goes right off the rails here. We already use double
underscores to separate fields and functions -- the lookup types. The
name clash potential is actually a non-issue. The last item in a
sequence is the lookup type (and aggregates just become another sort of
lookup type). If the last item isn't a valid lookup type, we append
__exact to it for filtering purposes. So if you have a field called
"gte" in your model, you can still filter on "foo__gte__exact=6", etc.
The same disambiguation would apply in this case.
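In rough pseudo-Python, that rule is (simplified; the real resolver also has to walk relations, and the set of lookup types is longer):

```python
# Sketch only: the last piece of a lookup string is treated as the lookup
# type only if it's a known one, otherwise '__exact' is implied.
LOOKUP_TYPES = {'exact', 'lt', 'lte', 'gt', 'gte', 'contains', 'in'}

def split_lookup(expr):
    parts = expr.split('__')
    if len(parts) > 1 and parts[-1] in LOOKUP_TYPES:
        return parts[:-1], parts[-1]
    return parts, 'exact'

print(split_lookup('foo__gte'))         # (['foo'], 'gte')
print(split_lookup('foo__gte__exact'))  # field named 'gte': (['foo', 'gte'], 'exact')
```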


Malcolm

--
Plan to be spontaneous - tomorrow.
http://www.pointy-stick.com/blog/

Russell Keith-Magee

unread,
Mar 18, 2008, 9:27:34 PM3/18/08
to django-d...@googlegroups.com
On Tue, Mar 18, 2008 at 1:29 PM, David Cramer <dcr...@gmail.com> wrote:
>
> I still prefer our "OMG ITS SQL" patch: http://code.djangoproject.com/ticket/4997
>
> I want to manually specify group by, i dont want to count, or sum, or
> average, i want to group by.

If you go back to the original discussions (almost 2 and a half years
ago), you'll see why this was rejected. In short, the argument goes
something like this:

No - you don't want to do group_by. You want to achieve some logical
goal, which SQL implements using GROUP BY syntax.

Django isn't in the business of redefining SQL syntax - if we tried to
do this, we would fail. However, Django _is_ in the business of
identifying common object-based use-cases, and building an elegant
syntax for representing those use-cases.

One common logical goal is to evaluate aggregates - Calculating
averages, sums, etc of collections of objects. You obviously have
another use case. If you can quantify that use case in a general way,
and express it as an operation on an object (or group of objects), you
may have the basis for a proposal for an extension to the Django ORM.

Of course, there will always be a subset of valid and useful SQL
queries that can't be distilled into a broader use-case; in those
cases, there will always be a need to drop to raw SQL. I believe that
the QS-refactor will make writing raw SQL easier, but that isn't the
same as 'just exposing group_by()'.

Yours,
Russ Magee %-)

Russell Keith-Magee

unread,
Mar 18, 2008, 9:36:08 PM3/18/08
to django-d...@googlegroups.com
On Wed, Mar 19, 2008 at 9:26 AM, Justin Fagnani
<justin....@gmail.com> wrote:
> Hey Nicolas,
>
> It seems to be a bad week to have this discussion with all the main devs
> busy with the PyCon sprint and aggregation not being a top priority, but I
> want to see if I can clarify my objection to the field__function=alias
> syntax so that hopefully we can be on the same page and not have duplicated
> proposals and work. Hopefully the core devs can check this out after this
> week.

You may not have _all_ the core developers attention, but _some_ of
them are listening - I'm definitely reading everything, and responding
when I get a chance; Malcolm is evidently reading too.

This doesn't mean that aggregates are a low priority. They are
certainly a high priority for me. I'm sure Adrian and Jacob have
noticed the discussion, and they will have also noticed that Malcolm
and I are involved. They probably won't say anything until Malcolm or
I ask them for an official judgement, or until they see something that
Malcolm or I are supporting that offends their sensibilities.

I haven't been as responsive as I was hoping to be on this thread -
I'm in the process of doing a project handover at work, and the
weekend was hectic for me personally; with luck, I should be able to
give you some more in-depth feedback on your proposal in the next day
or so.

> 1) My first, and main, objection to the syntax is that it's simply
> unnecessary
>

> 2) It's magic. Again, the filter syntax is a little magical,
> but it was necessary.
>
> 3) It overloads __ for yet another purpose

I disagree. The usage of the aggregate syntax is no different to the
filter syntax in rendition or interpretation, except that the
'functions' are aggregates rather than comparators. '__' is used for
join specification, or function specification in the terminal case.

The potential for name space clashes is almost nil, since the last
clause in the query term must be the aggregate function. The sum over
a field called sum is sum__sum; the first sum is the field name, the
second is the sum function specifier.

> 4) It adds more logic to QuerySet/Query and may be hard to extend.

This is actually one of the aspects of your proposal that I
particularly like. In particular, it provides an entry point for
complex aggregate cases. The example that springs to mind is
aggregates with complex group_by clauses that work over multiple table
joins.
We may not want to provide out-of-the-box support for these complex
cases in the form of a native operator or syntax, but it would be
helpful to be able to provide a clean mechanism for end users to
define 'custom aggregates' in a similar fashion to custom Q objects.

It also affords all sorts of aggregate math opportunities, as you have
previously indicated.

Yours,
Russ Magee %-)

Justin Fagnani

unread,
Mar 18, 2008, 11:48:08 PM3/18/08
to django-d...@googlegroups.com
On Tue, Mar 18, 2008 at 6:36 PM, Russell Keith-Magee <freakb...@gmail.com> wrote:
This doesn't mean that aggregates are a low priority.

Ah... I phrased that badly. I meant that it wasn't a high priority for this week, during the sprint. No biggie.

> 1) My first, and main, objection to the syntax is that it's simply
> unnecessary
>
>  2) It's magic. Again, the filter syntax is a little magical,
> but it was necessary.
>
> 3) It overloads __ for yet another purpose

I disagree. The usage of the aggregate syntax is no different to the
filter syntax in rendition or interpretation, except that the
'functions' are aggregates rather than comparators. '__' is used for
join specification, or function specification in the terminal case.

I hear you. I guess I see a difference between operators/comparators and functions because Python uses infix notation. The __ syntax for lookup types at least is somewhat similar to Python syntax in that it's roughly infix. The proposed __ syntax for function application is far different than Python. Sure there's really no difference between operators and functions, so I know it's not that big of a deal, but a completely standard Python-looking syntax is available, so I don't see why we shouldn't use it. I also think that the function object syntax is just much cleaner.

The potential for name space clashes is almost nil

Yeah, I was wrong about this, as Malcolm pointed out.
 
>  4) It adds more logic to QuerySet/Query and may be hard to extend.

This is actually one of the aspects of your proposal that I
particularly like. In particular, it provides an entry point for
complex aggregate cases. The example that springs to mind is
aggregates with complex group_by clauses that work over multiple table
joins.

Could you provide some examples? Once I got alias remapping working a huge amount of things started Just Working(tm), like Malcolm said would happen. Grouping over multiple tables is one of those. I don't think that's unique to my proposal though, sql.Query pretty much just does it.
 
For instance:
>>> Purchase.objects.values('customer', 'book__author', Sum('payment'))

results in this SQL:

SELECT "bookstore_customer"."id", "bookstore_book"."author", SUM("bookstore_purchase"."payment") FROM "bookstore_purchase" INNER JOIN "bookstore_customer" ON ("bookstore_purchase"."customer_id" = "bookstore_customer"."id") INNER JOIN "bookstore_book" ON ("bookstore_purchase"."book_id" = "bookstore_book"."id") GROUP BY "bookstore_customer"."id", "bookstore_book"."author"

to give how much each customer spent on each author's books, which is the kind of query I write by hand frequently. I'm not sure if that's the type of query you're talking about.
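If it helps, that query can be sanity-checked against an in-memory SQLite database (schema trimmed down from the bookstore models):

```python
import sqlite3

# Toy stand-in for the bookstore schema (names simplified); this just
# verifies by hand what the generated GROUP BY query computes.
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author TEXT);
    CREATE TABLE purchase (customer_id INTEGER, book_id INTEGER, payment REAL);
    INSERT INTO customer VALUES (1, 'Jane'), (2, 'Peter');
    INSERT INTO book VALUES (1, 'Austen'), (2, 'Tolstoy');
    INSERT INTO purchase VALUES (1, 1, 10.0), (1, 1, 5.0), (2, 2, 7.5);
""")
rows = con.execute("""
    SELECT customer.id, book.author, SUM(purchase.payment)
    FROM purchase
    JOIN customer ON purchase.customer_id = customer.id
    JOIN book ON purchase.book_id = book.id
    GROUP BY customer.id, book.author
""").fetchall()
print(sorted(rows))  # [(1, 'Austen', 15.0), (2, 'Tolstoy', 7.5)]
```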

I think the function=tuple syntax will do the same thing with:

>>> Purchase.objects.values('customer', 'book__author').aggregate(payment__sum='payment_sum')

(maybe that's all TMI for now, if so sorry)

We may not want to provide out-of-the-box support for these complex
cases in the form of a native operator or syntax, but it would be
helpful to be able to provide a clean mechanism for end users to
define 'custom aggregates' in a similar fashion to custom Q objects.

It also affords all sorts of aggregate math opportunities, as you have
previously indicated.


One thing to think about is that expressions, being more open-ended, may be easier to abuse. They'll certainly require very good error checking, otherwise there will be times that the queryset is constructed without complaint, but the query fails when executed. This might not be any different from the situation with Q objects though.


Cheers,
  Justin


Russell Keith-Magee

unread,
Mar 19, 2008, 8:13:48 AM3/19/08
to django-d...@googlegroups.com
On Wed, Mar 19, 2008 at 12:48 PM, Justin Fagnani
<justin....@gmail.com> wrote:
>
> I hear you. I guess I see a difference between operators/comparators and
> functions because Python uses infix notation. The __ syntax for lookup types
> at least is somewhat similar to Python syntax in that it's roughly infix.
> The proposed __ syntax for function application is far different than
> Python. Sure there's really no difference between operators and functions,
> so I know it's not that big of a deal, but a completely standard
> Python-looking syntax is available, so I don't see why we shouldn't use it.
> I also think that the function object syntax is just much cleaner.

Keep in mind that an important part of API usability is consistency
with existing APIs. Django queries have a very distinct flavour.
Wherever possible, extensions and improvements to Django queries
should reflect that flavour. You obviously are not as enamoured with
that flavour as the core devs (which includes myself), but we're not
going to abandon it or contradict it without a very good reason.

> Could you provide some examples? Once I got alias remapping working a huge
> amount of things started Just Working(tm), like Malcolm said would happen.
> Grouping over multiple tables is one of those. I don't think that's unique
> to my proposal though, sql.Query pretty much just does it.

Ok; here's the problem space that got me interested in Django
aggregates in the first place. It's complex enough to require table
joins and some potentially interesting groupings.

A Project consists of Milestones.

A Milestone consists of a list of tasks.

A Task is assigned to a User, and contains an initial estimate (in
hours) of the effort required, and a boolean flag that indicates when
the task is complete.

Effort (measured in hours) can be logged against a given task by any given User.

Now, given that multi-table setup, here are a series of questions to
answer, roughly increasing in complexity:

- Tell me how much effort has been expended by User U?
- Give me a list of tasks, with the total effort that has been spent
on each task
- Give me list of tasks that have exceeded their initial effort estimate
- Give me a list of milestones with the amount of effort outstanding
on each milestone
- Give me the total effort that has been expended on project P
- How much effort is outstanding on project P
- Calculate the average discrepancy (in absolute and in relative
terms) between task estimate and expended effort for completed tasks.
- Calculate the average discrepancy between task estimate and expended
effort on a per-milestone basis

This isn't an exhaustive list, but it should give you an idea of the
sort of things that might be required; you may need to group by
Milestone, Task, or User, in conjunction with queries that span the
same space.

It may not be possible to express some of these problems with a simple
syntax. While it would be nice to have a syntax that could hit all
these queries, I don't necessarily consider them to be strict
acceptance criteria. However, a syntax that allows for easy user
extension would allow me to write a helper aggregate class that
_could_ hit the extreme cases.

> One thing to think about is that expressions, being more open-ended, may be
> easier to abuse. They'll certainly require very good error checking,
> otherwise there will be times that the queryset is constructed without
> complaint, but the query fails when executed. This might not be any
> different from the situation with Q objects though.

We can't prevent dedicated people from making mistakes. However, we
can make sure that the APIs we provide have a single obvious
interpretation. Malcolm wrote a good blog post recently referencing
Rusty Russell's guidelines for API design - it's well worth a read.

Yours,
Russ Magee %-)

Jacob Kaplan-Moss

unread,
Mar 19, 2008, 8:36:58 AM3/19/08
to django-d...@googlegroups.com
On Tue, Mar 18, 2008 at 8:36 PM, Russell Keith-Magee
<freakb...@gmail.com> wrote:
> I'm sure Adrian and Jacob have noticed the discussion [...]

This is me noticing :)

As a point of order, assuming that no responses == no attention is a
bit of a stretch. I read every single word on django-developers, but
only reply where I'm going to help the discussion. In this case Russ
has thought about this issue far more deeply than me; the chance that
I'll disagree with his judgement is nearly nil.

Jacob

[For the record, of course, I agree with everything Russ has said so far :)]

Justin Fagnani

unread,
Mar 19, 2008, 4:16:13 PM3/19/08
to django-d...@googlegroups.com
On Wed, Mar 19, 2008 at 5:13 AM, Russell Keith-Magee <freakb...@gmail.com> wrote:
Keep in mind that an important part of API usability is consistency
with existing APIs. Django queries have a very distinct flavour.
Wherever possible, extensions and improvements to Django queries
should reflect that flavour. You obviously are not as enamoured with
that flavour as the core devs (which includes myself), but we're not
going to abandon it or contradict it without a very good reason.

It's not that I don't like the query syntax, I do. It's a very good solution to the problem given what's possible, but it's only necessary because you can't write a query like filter(relation.field<value) in Python. So (relation.field<value) is transformed into (relation__field__lt=value) and that's quite elegant, and especially nice for the case of (field=value). With functions though we can write (name=Function(argument)), so it's not necessary to write (argument__function='name').

And as an extension to values() it makes sense to me, because the proposed syntax basically comes down to two things: 1) allow values() to take expression objects as arguments, and 2) use kwargs to specify output aliases. (well, three things: automatic grouping by fields)



Thanks a lot for the example queries. I added these to a test project and got them all working, but I need to figure out how to correctly write and submit tests for patches. I think the syntax is nice enough. It's at least simple for simple cases.
 
- Tell me how much effort has been expended by User U? 
>>> u.work_log.values(Sum('hours'))
 
- Give me a list of tasks, with the total effort that has been spent
on each task
>>>  Task.objects.values('id', Sum('work_log__hours'))
 
- Give me list of tasks that have exceeded their initial effort estimate
>>>  Task.objects.values('id', 'estimate', total_hours=Sum('work_log__hours')).filter(total_hours__gt=F('estimate'))

(this actually works right now, but only because of a nasty hack to Query to lookup fields in a column_alias map and output a having clause. I'm pretty sure it'll die on more complex queries, but it's just a proof-of-concept)

- Give me a list of milestones with the amount of effort outstanding
on each milestone
>>>  Milestone.objects.values('id', hours_left=Sum('task__estimate') - Sum('task__work_log__hours'))

- Give me the total effort that has been expended on project P
>>>  p.milestones.values(Sum('tasks__work_log__hours'))
 
- How much effort is outstanding on project P
>>>  p.milestones.values(Sum('tasks__estimate') - Sum('tasks__work_log__hours')) 

- Calculate the average discrepancy (in absolute and in relative
terms) between task estimate and expended effort for completed tasks.
>>>  Task.objects.values(Sum(Abs(F('estimate') - F('work_log__hours'))) / Count()).filter(complete=True)
>>>  Task.objects.values(Sum(F('estimate') - F('work_log__hours')) / Count()).filter(complete=True)
 
- Calculate the average discrepancy between task estimate and expended
effort on a per-milestone basis
>>>  Milestone.objects.values(Sum(Abs(F('tasks__estimate') - F('tasks__effort_log__hours'))) / Count())

I hope I interpreted those right.
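For reference, the "exceeded their estimate" case is the one that forces a HAVING clause; in raw SQL against a toy schema it comes out as:

```python
import sqlite3

# Toy schema for the task/work-log example; the aggregate comparison has
# to live in a HAVING clause, since WHERE can't see SUM().
con = sqlite3.connect(':memory:')
con.executescript("""
    CREATE TABLE task (id INTEGER PRIMARY KEY, estimate REAL);
    CREATE TABLE work_log (task_id INTEGER, hours REAL);
    INSERT INTO task VALUES (1, 8.0), (2, 4.0);
    INSERT INTO work_log VALUES (1, 5.0), (1, 5.0), (2, 3.0);
""")
over = con.execute("""
    SELECT task.id, task.estimate, SUM(work_log.hours) AS total_hours
    FROM task JOIN work_log ON work_log.task_id = task.id
    GROUP BY task.id, task.estimate
    HAVING SUM(work_log.hours) > task.estimate
""").fetchall()
print(over)  # [(1, 8.0, 10.0)] -- task 1 logged 10 hours against an 8 hour estimate
```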



We can't prevent dedicated people from making mistakes. However, we
can make sure that the APIs we provide have a single obvious
interpretation. Malcolm wrote a good blog post recently referencing
Rusty Russell's guidelines for API design - it's well worth a read.

Those articles are good. Hopefully this will fall somewhere around 3.5 on the API design level, as the simplest use should be correct, and if you mess up you'll at least get an error. Right now it's like a 6 because you'll get an SQL error when you execute the query, not an error when you construct it. Luckily (and I've only tested with SQLite so far) the close correspondence between the expression syntax and SQL output means that SQLite's errors make a lot of sense. Some simple checks, like not allowing nesting of aggregate functions, will help a lot.
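That nesting check could be as simple as a constructor-time guard (made-up classes, sketching the idea only):

```python
# Hypothetical classes, just sketching the guard -- none of this is the
# actual patch code.
class Aggregate:
    def __init__(self, expr):
        # Refuse SUM(AVG(...)) and friends at construction time, so the
        # mistake surfaces when the queryset is built, not when it runs.
        if isinstance(expr, Aggregate):
            raise TypeError('aggregate functions cannot be nested')
        self.expr = expr

class Sum(Aggregate):
    pass

class Avg(Aggregate):
    pass

Sum('payment')  # fine
try:
    Sum(Avg('payment'))
except TypeError as e:
    print(e)  # aggregate functions cannot be nested
```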

 
-Justin

Justin Fagnani

unread,
Mar 19, 2008, 4:20:12 PM3/19/08
to django-d...@googlegroups.com
On Wed, Mar 19, 2008 at 5:36 AM, Jacob Kaplan-Moss <jacob.ka...@gmail.com> wrote:
As a point of order, assuming that no responses == no attention is a
bit of a stretch.

Hey Jacob, I'm sorry. Malcolm had told me it was a bad/busy week for him and I assumed that was true for a lot of the devs because of the sprint. I wasn't assuming that no responses == no attention.

Cheers,
  Justin

Russell Keith-Magee

unread,
Mar 19, 2008, 10:46:33 PM3/19/08
to django-d...@googlegroups.com
On Wed, Mar 19, 2008 at 5:31 AM, Nicolas E. Lara G.
<nicol...@gmail.com> wrote:
>
> Hello again,

>
> As you can see I started implementing the aggregate modifier according
> to Russell's suggestion (separate aggregate and annotate). I've been
> thinking about joining both modifiers in one, but got to the
> conclusion that it is better to do the separated implementations and,
> in case that it is necesary to have a modiifer that do both (which I
> think is the case), add some syntactic sugar (shortcut) to use both
> methods in one modifier call.
>
> Also the "GROUP BY" functionality is implemented using values().

Could you please clarify what you mean by this - I can see this as
being either very good, or awful.

The implication of your examples is that annotate/aggregate won't work
unless you provide a values() object. If this is what you are
intending, I see this as a significant weakness. I can't see any
reason that values() should be required.

My concept for aggregates has always been that the GROUP BY clause
would include all non-aggregates fields in the SELECT. This means that
in the default case:

>>> Model.objects.annotate(field1__sum='total')

What you get back is a list of model instances that have an extra
annotated attribute (in this case, total). You have the flexibility to
use the objects as you would any other model instance, except that you
have the summary data coming along for the ride. This sample query
would return _all_ objects, annotated, since it isn't otherwise
filtered or restricted, and the GROUP BY clause would (by necessity)
include all the fields on the model.

This does open an interesting intersection with the way values()
operates. When you use a values() clause, you are restricting a query
to a subset of fields:

>>> Model.objects.values('field2','field3').annotate(field1__sum='total')

This restricts the fields returned by the SELECT, and it would make
sense to reflect this in the GROUP BY clause. The returned values in
this case aren't annotated model objects, since there won't
(necessarily) be 1-1 correspondence between the elements in the
returned data and the underlying model data, so the dictionary output
is appropriate. This strikes me as an elegant coverage of the 80% case
for wanting an explicit GROUP BY clause.

> I will open a ticket with the proposed syntax and the respective
> patches.

On this note - ticket #3566 has been used for a long time to track the
development of aggregates. There is no need to open a new ticket -
many interested eyeballs are focused on the original ticket, so that
is the best place to keep discussion.

Yours,
Russ Magee %-)

Russell Keith-Magee

unread,
Mar 20, 2008, 2:24:43 AM3/20/08
to django-d...@googlegroups.com
On Thu, Mar 20, 2008 at 5:16 AM, Justin Fagnani
<justin....@gmail.com> wrote:
>
> And as an extension to values() it makes sense to me, because the proposed
> syntax basically comes down to two things: 1) allow values() to take an
> expression objects as arguments, and 2) use kwargs to specify output
> aliases. (well three things: automatic grouping by fields)

I don't completely agree that values() is the right place. Firstly, it
ignores the difference between annotate() and aggregate() that I
mentioned earlier. Secondly, it removes the ability to calculate
aggregates, but not include them in the return set. As it currently
stands, values() can be used to reduce the return set to those columns
in which you have an interest; if you overload values() to include
aggregates as well, you won't be able to exclude aggregates from the
result columns.

> > - Tell me how much effort has been expended by User U?
> >>> u.work_log.values(Sum('hours'))

This highlights my earlier point: values() returns a ValuesQuerySet -
a list of dictionaries. What gets returned in this example?

> > - Give me a list of milestones with the amount of effort outstanding
> > on each milestone
> >>> Milestone.objects.values('id', hours_left=Sum('task__estimate) -
> Sum('task__work_log__hours'))

This actually starts to point out a flaw in your proposed syntax - or
at least, an API weakness that we won't be able to avoid. kwargs need
to come after args when you call a Python function, so any aliased
aggregates will need to come later in the argument list. While this
won't break anything, my Magic 8-ball predicts many mailing list
discussions wondering why they are getting argument ordering
exceptions getting thrown from what looks like a perfectly legal
aggregate query.

Using the underscore approach, _all_ aggregate specifiers are
kwargs, so you can define them in whatever order makes sense.

> I hope I interpreted those right.

To a cursory examination, yes.

Some overall comments:

1) For all spruiking of a "cleaner, more Pythonic format", it looks
very brackety and messy to me. To me, filter clauses read much easier.
However, I acknowledge that this is somewhat a matter of personal
taste (although the taste of the core devs will, in general, carry a
little more weight).

2) The ability to do math on aggregate clauses is quite cool. However,
the types of math that are likely to be required will be fairly
consistent on a per-model basis. I'm wondering if the right mechanism
by which to achieve this is to provide a way to define new filter and
aggregate clauses on the QuerySet.

For example, you can define a custom manager that defines a foobar()
method to provide Model.objects.foobar() as an API entry point.
Managers provide a way to override the default QuerySet; if we extend
the filter and aggregate syntax so that '__foobar' in a filter looks
for filter_foobar() as a method on the queryset, or aggregate_foobar()
if used in an aggregate; these then expand out into whatever SQL might
be required.

3) We still haven't hit a use case where the aggregate has unusual
grouping requirements. My original thought is that end users could be
writing custom Aggregate clauses that ultimately turn into:

MyAggregate('fieldname', group_by='field2', other='some argument')

However, given that I can't come up with a use case, this is starting
to get a big YAGNI to me.

Yours,
Russ %-)

Justin Fagnani

unread,
Mar 20, 2008, 4:42:20 AM3/20/08
to django-d...@googlegroups.com
On Wed, Mar 19, 2008 at 11:24 PM, Russell Keith-Magee <freakb...@gmail.com> wrote:
I don't completely agree that values() is the right place.

I'm not 100% convinced values() is the right place either, but on the other hand these are still values that are being returned, just calculated values. For some simple cases, such as selecting a single aggregate value with no joins, I think the usage is pretty obvious-looking.

I've seen people propose something like User.objects.average('age') before, which is simple but flawed. IMO, User.objects.values(Avg('age')) is almost as simple and intuitive without the problems of the .average(), etc.

Of course, this may be because I know SQL. I fully expect that values('age') would return many items and values(Avg('age')) returns one item. Would someone who's never used SQL be confused? Do many Django users really have no SQL experience at all, and if so, would this be hard to explain? I don't know.

Firstly, it ignores the difference between annotate() and aggregate() that I mentioned earlier.

I'm not completely ignoring aggregate/annotate, I am wondering if aggregate() could just be folded into values() so we'd only have values/annotate. I have some other thoughts about annotate() that I'll save for later (I have to get some sleep).
 
Secondly, it removes the ability to calculate
aggregates, but not include them in the return set. As it currently
stands, values() can be used to reduce the return set to those columns
in which you have an interest; if you overload values() to include
aggregates as well, you won't be able to exclude aggregates from the
result columns.

I hadn't really considered this, but when would you want to calculate a value and not include it? If it's for query purposes I'd think that a good way to handle it would be to add expression support directly to filter() so that you could say queryset.filter(field1__lte=Avg('field1')) and not have the aggregate in the results. Or do something similar with a Q-like object.
 


> > - Tell me how much effort has been expended by User U?
> >>> u.work_log.values(Sum('hours'))

This highlights my earlier point: values() returns a ValuesQuerySet -
a list of dictionaries. What gets returned in this example?

It's still a ValuesQuerySet, but because there's no group by clause it's only one item:
[{'hours_sum': 7.5}]

It's possible to set result_type=SINGLE where there's an aggregate but no group by, but I think it'd be better for values() to always return the same type.



> > - Give me a list of milestones with the amount of effort outstanding
> > on each milestone
> >>>  Milestone.objects.values('id', hours_left=Sum('task__estimate') -
> Sum('task__work_log__hours'))

This actually starts to point out a flaw in your proposed syntax - or
at least, an API weakness that we won't be able to avoid... 
...my Magic 8-ball predicts many mailing list

discussions wondering why they are getting argument ordering
exceptions getting thrown from what looks like a perfectly legal
aggregate query.

I don't know if I agree with that concern so much, but your 8-ball is surely better tuned than mine for this type of thing. The SyntaxError in that case is pretty clear, and users can be kindly reminded that they are still writing Python.



 Some overall comments:

1) For all spruiking

Whoa, I learned a new word today :)
 
of a "cleaner, more Pythonic format", it looks
very brackety and messy to me. To me, filter clauses read much easier.
However, I acknowledge that this is somewhat a matter of personal
taste (although the taste of the core devs will, in general, carry a
little more weight).

Some of the use cases you presented are much more complex than filter clauses could ever handle (and including such complex calculations directly in values() probably isn't preferable, though I really like the option). For the simple cases expression objects are at least comparable in 'cleanliness', but in my opinion, easier to understand at a glance.

>>> u.work_log.values(Sum('hours'))
vs
>>> u.work_log.values(hours__sum='hours_sum')

For complex cases, I think this:
>>> Milestone.objects.values(Sum(Abs(F('tasks__estimate') - F('tasks__effort_log__hours'))) / Count())

would look a lot nicer as :
>>> def avg_difference(f1, f2):
>>>    return Sum(Abs(F(f1) - F(f2))) / Count()
>>> Milestone.objects.values(avg_difference('tasks__estimate', 'tasks__effort_log__hours'))

The function definition is messy, but it is a mathematical equation after all and can be off in some other place.
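For what it's worth, the machinery needed for that kind of composition is small; a toy version of the expression objects (hypothetical, just to show the shape) can render the whole tree:

```python
# Operator overloading builds a tree of expression nodes, and as_sql()
# renders it. Sketch only, not the patch's actual code.
class Expr:
    def __sub__(self, other):
        return Binary('-', self, other)
    def __truediv__(self, other):
        return Binary('/', self, other)

class Binary(Expr):
    def __init__(self, op, lhs, rhs):
        self.op, self.lhs, self.rhs = op, lhs, rhs
    def as_sql(self):
        return '(%s %s %s)' % (self.lhs.as_sql(), self.op, self.rhs.as_sql())

class F(Expr):
    def __init__(self, name):
        self.name = name
    def as_sql(self):
        return self.name

class Func(Expr):
    sql = None
    def __init__(self, arg=None):
        self.arg = arg
    def as_sql(self):
        return '%s(%s)' % (self.sql, self.arg.as_sql() if self.arg else '*')

class Sum(Func): sql = 'SUM'
class Abs(Func): sql = 'ABS'
class Count(Func): sql = 'COUNT'

expr = Sum(Abs(F('estimate') - F('hours'))) / Count()
print(expr.as_sql())  # (SUM(ABS((estimate - hours))) / COUNT(*))
```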
 


2) The ability to do math on aggregate clauses is quite cool. However,
the types of math that are likely to be required will be fairly
consistent on a per-model basis. I'm wondering if the right mechanism
by which to achieve this is to provide a way to define new filter and
aggregate clauses on the QuerySet.

But you can also define these functions without having to add a hook to look them up from kwargs. If you want to put them in a custom manager, or the model itself, or just in a module somewhere, it doesn't matter.

 
For example, you can define a custom manager that defines a foobar()
method to provide  Model.objects.foobar() as an API entry point.
Managers provide a way to override the default QuerySet; if we extend
the filter and aggregate syntax so that '__foobar' in a filter looks
for filter_foobar() as a method on the queryset, or aggregate_foobar()
if used in an aggregate; these then expand out into whatever SQL might
be required.

Interesting. My first reaction is that custom Q objects already exist for this, and that it's more complex than just defining an aggregate function anywhere you want. Extendable filter clauses could be useful in general, but maybe it should happen on a wider scope. QuerySets or Managers could have a function to load a filter library, similar to {% load %} for template tags.

 

3) We still haven't hit a use case where the aggregate has unusual
grouping requirements. My original thought is that end users could be
writing custom Aggregate clauses that ultimately turn into:

MyAggregate('fieldname', group_by='field2', other='some argument')

However, given that I can't come up with a use case, this is starting
to get a big YAGNI to me.

I agree, but I also think this is already accommodated by objects with an add_to_query() method. Those objects can do pretty much whatever they want.


I'll write about annotate() later. I have some questions about how certain things would work, but I want to go back and read the discussions and your proposal again.

Cheers,
  Justin

Nicolas E. Lara G.

unread,
Mar 20, 2008, 12:55:17 PM3/20/08
to Django developers


On Mar 19, 10:46 pm, "Russell Keith-Magee" <freakboy3...@gmail.com>
wrote:
> On Wed, Mar 19, 2008 at 5:31 AM, Nicolas E. Lara G.
>
> <nicolasl...@gmail.com> wrote:
>
> > Hello again,
>
> > As you can see I started implementing the aggregate modifier according
> > to Russell's suggestion (separate aggregate and annotate). I've been
> > thinking about joining both modifiers in one, but got to the
> > conclusion that it is better to do the separated implementations and,
> > in case that it is necesary to have a modiifer that do both (which I
> > think is the case), add some syntactic sugar (shortcut) to use both
> > methods in one modifier call.
>
> > Also the "GROUP BY" functionality is implemented using values().
>
> Could you please clarify what you mean by this - I can see this as
> being either very good, or awful.

I mean that the GROUP BY is done automatically, but if values() is used,
the fields specified in values() are added to the GROUP BY clause
instead of all the fields in the model (or the id or related key,
depending on the case).

> The implication of your examples is that annotate/aggregate won't work
> unless you provide a values() object. If this is what you are
> intending, I see this as a significant weakness. I can't see any
> reason that values() should be required.

Not at all! I use aggregate() both with and without values().

>
> My concept for aggregates has always been that the GROUP BY clause
> would include all non-aggregates fields in the SELECT. This means that
> in the default case:
>
> >>> Model.objects.annotate(field1__sum='total')
>
> What you get back is a list of model instances that have an extra
> annotated attribute (in this case, total). You have the flexibility to
> use the objects as you would any other model instance, except that you
> have the summary data coming along for the ride. This sample query
> would return _all_ objects, annotated, since it isn't otherwise
> filtered or restricted, and the GROUP BY clause would (by necessity)
> include all the fields on the model.
>
> This does open an interesting intersection with the way values()
> operates. When you use a values() clause, you are restricting a query
> to a subset of fields:
>
> >>> Model.objects.values('field2','field3').annotate(field1__sum='total')
>
> This restricts the fields returned by the SELECT, and it would make
> sense to reflect this in the GROUP BY clause. The returned values in
> this case aren't annotated model objects, since there won't
> (necessarily) be 1-1 correspondence between the elements in the
> returned data and the underlying model data, so the dictionary output
> is appropriate. This strikes me as an elegant coverage of the 80% case
> for wanting an explicit GROUP BY clause.
>

This is exactly how my code/patch/hack operates.
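The grouping semantics Russell describes can be simulated in pure Python: rows are grouped by the values() fields, and each group is annotated with an aggregate. The data and the "total" alias below are invented for illustration:

```python
# Simulate values('field2', 'field3').annotate(field1__sum='total'):
# group rows by the values() fields, sum field1 within each group.
from itertools import groupby
from operator import itemgetter

rows = [
    {"field2": "a", "field3": 1, "field1": 10},
    {"field2": "a", "field3": 1, "field1": 5},
    {"field2": "b", "field3": 2, "field1": 7},
]

key = itemgetter("field2", "field3")  # the values() fields imply the grouping
annotated = [
    {"field2": f2, "field3": f3, "total": sum(r["field1"] for r in group)}
    for (f2, f3), group in groupby(sorted(rows, key=key), key=key)
]
print(annotated)
# [{'field2': 'a', 'field3': 1, 'total': 15},
#  {'field2': 'b', 'field3': 2, 'total': 7}]
```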

Nicolas E. Lara G.

unread,
Mar 20, 2008, 1:28:47 PM3/20/08
to Django developers


On Mar 20, 2:24 am, "Russell Keith-Magee" <freakboy3...@gmail.com>
wrote:
> On Thu, Mar 20, 2008 at 5:16 AM, Justin Fagnani
>
> <justin.fagn...@gmail.com> wrote:
>
> > And as an extension to values() it makes sense to me, because the proposed
> > syntax basically comes down to two things: 1) allow values() to take an
> > expression objects as arguments, and 2) use kwargs to specify output
> > aliases. (well three things: automatic grouping by fields)
>
> I don't completely agree that values() is the right place. Firstly, it
> ignores the difference between annotate() and aggregate() that I
> mentioned earlier. Secondly, it removes the ability to calculate
> aggregates, but not include them in the return set. As it currently
> stands, values() can be used to reduce the return set to those columns
> in which you have an interest; if you overload values() to include
> aggregates as well, you won't be able to exclude aggregates from the
> result columns.

I mentioned it before, but I believe that reducing the result set
should be managed by something like #3566
(http://code.djangoproject.com/ticket/3566) and aggregates should not
worry about it, since it is a separate problem.
Still, I believe that putting aggregates in values() makes the syntax
confusing. Also, values() is supposed to return "a QuerySet that
evaluates to a list of dictionaries instead of model-instance
objects", and that doesn't say anything about modifying the query in
any way. I think that having something that "evaluates to a list of
dictionaries" and might or might not modify the results of the
queryset is bad for comprehension and API stability.
Have you considered the idea of doing it like this?
>>> my_own_aggregate = Aggregate( sum('field_1') * F('field2') + 5, 'default_name')
>>> queryset.aggregate(purchases__price__custom=my_own_aggregate('alias'))

or (better yet for syntax reuse):

>>> my_own_aggregate = Aggregate( F('field_1__sum') * F('field2') + 5, 'default_name')
>>> queryset.aggregate(purchases__price__custom=my_own_aggregate('alias'))

> For example, you can define a custom manager that defines a foobar()
> method to provide Model.objects.foobar() as an API entry point.
> Managers provide a way to override the default QuerySet; if we extend
> the filter and aggregate syntax so that '__foobar' in a filter looks
> for filter_foobar() as a method on the queryset, or aggregate_foobar()
> if used in an aggregate; these then expand out into whatever SQL might
> be required.
>

This is a good idea, but I think that it might be problematic because,
if I understand correctly, there would have to be a different method in
the manager for each 'math operators aggregate' that you need to use.
This is good if there is an aggregate that you are going to reuse
constantly, but if you just want to do different math operations
depending on the program flow you'll end up with a lot of methods in
the manager that are only used once or twice in the code.
Maybe this could be a very good option to manage esoteric GROUP BY
uses (pretty much roll your own aggregate function).
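The expression-object idea quoted above (Aggregate objects built from F() and arithmetic) can be sketched with a few composable classes that render to SQL. The class names and rendering below are assumptions for illustration, not the actual proposal's code:

```python
# Minimal expression tree with operator overloading, so that
# Sum('field_1') * F('field2') + 5 renders as composable SQL.
class Expr:
    def __add__(self, other):
        return Op(self, "+", other)

    def __mul__(self, other):
        return Op(self, "*", other)

class F(Expr):
    def __init__(self, name):
        self.name = name

    def as_sql(self):
        return self.name

class Sum(Expr):
    def __init__(self, name):
        self.name = name

    def as_sql(self):
        return "SUM(%s)" % self.name

class Op(Expr):
    def __init__(self, lhs, op, rhs):
        self.lhs, self.op, self.rhs = lhs, op, rhs

    def as_sql(self):
        rhs = self.rhs.as_sql() if isinstance(self.rhs, Expr) else str(self.rhs)
        return "(%s %s %s)" % (self.lhs.as_sql(), self.op, rhs)

expr = Sum("field_1") * F("field2") + 5
print(expr.as_sql())  # ((SUM(field_1) * field2) + 5)
```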

Nicolas E. Lara G.

unread,
Mar 20, 2008, 1:48:30 PM3/20/08
to Django developers


On Mar 20, 4:42 am, "Justin Fagnani" <justin.fagn...@gmail.com> wrote:
> On Wed, Mar 19, 2008 at 11:24 PM, Russell Keith-Magee <
>
> freakboy3...@gmail.com> wrote:
> > I don't completely agree that values() is the right place.
>
> I'm not 100% convinced values() is the right place either, but on the other
> hand these are still values that are being returned, just calculated values.
> For some simple cases, such as selecting a single aggregate value with no
> joins, I think the usage is pretty obvious-looking.
>
> I've seen people propose something like User.objects.average('age') before,
> which is simple but flawed. IMO, User.objects.values(Avg('age')) is almost
> as simple and intuitive without the problems of the .average(), etc.
>
> Of course, this maybe be because I know SQL. I fully expect that
> values('age') would return many items and values(Avg('age')) returns one
> item. Would some one who's never used SQL be confused? Do many Django users
> really have no SQL experience at all, and if so, would this be hard to
> explain? I don't know.

Probably most Django users do have SQL experience, but the idea is to
abstract from it. If you need to think in SQL while working on objects,
then the abstraction is not really helping.
I agree that the return value should always be the same, but then, how
to access it?
hours = u.work_log.values(Sum('hours'))
hours[0]['hours_sum'] ??
or
for x in hours: x['hours_sum']

it seems odd.
This is where I believe your proposal is very strong. From the
beginning I have been trying to integrate the different proposals
somehow. As I said before, I believe that the object syntax is a very
good replacement for the function=tuple syntax, and I see no reason why
not to keep both: filter-like syntax as a standard, and object syntax
for un-aliased aggregation and arithmetic.

Winsley von Spee

unread,
Mar 23, 2008, 3:23:12 PM3/23/08
to django-d...@googlegroups.com
Hi,

perhaps I've just read over it, but how do you plan to handle
non-numeric types like datetime? In MySQL you have to use something like
"FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(today)))", but how should that be
handled in the ORM, especially as it's backend-dependent.

Regards

David Cramer

unread,
Mar 23, 2008, 6:19:33 PM3/23/08
to Django developers
So we're not going to support GROUP BY at all, even though it is
extremely simple? I don't understand why everyone says my (and
Curse's, and many other people's) use of SQL is so uncommon. I'd
guarantee almost every developer I know who writes any kind of
scalable code (and I've seen what they've written) uses GROUP BY in
SQL for much more than just SUM, or COUNT, or any of those operations.

There really is no logical reason to not allow logic in Django. Even
if it's in the extra() clause it's better than not having it. I
*refuse* to write raw SQL for something that took 30 minutes to patch
in, and I will continue to fork Django until the developers see fit to
include logical changes. While many of these are going to be present
in qsrf, and qsrf will make everything easier, it doesn't add much of
the *needed* functionality for complex projects (again, not everyone
is building simplistic blog software with Django).

SELECT my_table.* FROM my_table JOIN my_other_table
WHERE my_other_table.my_table_id = my_table.id
GROUP BY my_table.id HAVING COUNT(1) > 1;

How do you ever hope to achieve something like this in the ORM without
patches like mine? This is a very useful query, and we used it heavily
for tagging applications, which seem to be fairly common nowadays.
Realistically, how do you efficiently want to query for objects that
have 2 tags without joining on the table N(tag) times? It seems like a
common enough use-case to add in this functionality. You have to
understand that DISTINCT doesn't solve every grouping situation that
is needed.

My main argument is there's no reason to *not* include the
functionality except that a few select people say "we dont need it".
It's not causing performance overhead. It's not a lot of work. There
are patches already made (and if they're not on trac I have them
locally). I really don't want to have to switch to Storm, or Alchemy,
because honestly, I love Django's approach. The more I'm building
complex projects though, the more headaches I'm having, and this is
mainly due to the ORM being so damned limited, and databases being so
complex.

On Mar 18, 6:36 pm, "Russell Keith-Magee" <freakboy3...@gmail.com>
wrote:
> On Wed, Mar 19, 2008 at 9:26 AM, Justin Fagnani
>

David Cramer

unread,
Mar 23, 2008, 6:32:29 PM3/23/08
to Django developers
Sorry -- I missed page 2. So GROUP BY and similar things will be
supported through an Aggregates base class?

On Mar 18, 6:36 pm, "Russell Keith-Magee" <freakboy3...@gmail.com>
wrote:
> On Wed, Mar 19, 2008 at 9:26 AM, Justin Fagnani
>

Justin Fagnani

unread,
Mar 23, 2008, 8:02:53 PM3/23/08
to django-d...@googlegroups.com
There's quite a few things that are backend dependent. String concatenation and dates are the big ones. Functions need a way to to get the proper SQL from the backend, or delegate their whole behavior to a backend specified class.

With backend-defined functions, your example would look something like Date(Avg(Timestamp('date_col'))). (or possibly hide the explicit type conversions to allow Avg('date_col'))
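The backend-delegation idea can be sketched as a function that asks the active backend how to render an average over a date column. The function and backend names below are assumptions for illustration:

```python
# Each backend supplies its own SQL for averaging a datetime column.
def avg_datetime_sql(column, backend):
    if backend == "mysql":
        # MySQL cannot AVG datetimes directly; go via Unix timestamps
        return "FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(%s)))" % column
    # fallback for backends where a plain AVG is meaningful
    return "AVG(%s)" % column

print(avg_datetime_sql("today", "mysql"))
# FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(today)))
```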

String concatenation is more difficult because supporting + requires a type-inference system so that 'a' + 'b' emits CONCAT('a','b') on MySQL.

This is where open-ended expressions could get very complicated. Type inference isn't so bad at first look, but I'm not completely familiar with each database's coercion rules. It's possible that type inference has backend dependencies too, and we would want to hide that from the programmer.

In a perfect world a valid expression should always work independent of the backend (so the coercion should be conservative and support least-common-denominator behavior), but in order for that to be true, certain expressions which would work with one backend but not another should fail even when running on the backend that supports them.

This all could become very tricky very quickly, and it might be better to pick a very limited subset of functionality to support at first: say, expressions only work with numbers, or numbers and dates. Something so that the first version is simple and just works, but that doesn't get in the way of supporting everything else in the future. For the time being, with grouping and extra, the date averaging could be done with SQL.

-Justin

Justin Fagnani

unread,
Mar 23, 2008, 11:37:07 PM3/23/08
to django-d...@googlegroups.com
Hey David,

Right now sql.Query doesn't output the HAVING clause, but if it did I think you could do this with a custom filter with an add_to_query() method that calls setup_joins and appends to query.having.

Also, with annotate() but without group_by(), as proposed, your query would look like:
>>> MyModel.objects.annotate(count=Count('my_other_model')).filter(count__gt=1)

I'm not really sure if it's bad that the SQL in this case contains COUNT(my_other_model.id) rather than COUNT(1).

Aggregation support in filter() might help make the intent more clear. With the function=tuple syntax I think your query would look something like:

>>> MyModel.objects.filter(my_other_model__count__gt=1)

With expressions, I can think of two options:
>>> MyModel.objects.filter(Count('my_other_model') > 1)
which would work by overloading __gt__ or
>>> MyModel.objects.filter(GT(Count('my_other_model'), 1))

but I'm not sure I like either of those.

I think the case against a group_by() method is that if you look at what you're really trying to do here, you don't need an explicit group_by since you can tell what the grouping should be from the filter. Maybe try some more examples to see if group_by() is actually necessary. It could be that for all or most of your cases, implicit grouping works fine.

Sorry -- I missed page 2. So GROUP BY and similar things will be
supported through an Aggregates base class?

I'm not sure which Aggregates class you're referring to, the base class for aggregate expressions, or the AggregateQuerySet?

With the current version of expressions, each expression has an aggregate() method which traverses the tree and returns True if any children return True. If so it triggers auto-grouping based on all non-aggregating fields and expressions. The Aggregate base class returns True by default, but the logic is not based on isinstance(). This leaves the door open for stuff like an N-ary Min function that's only an aggregate if it has only one argument and that arg is a field name.
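The tree walk Justin describes can be sketched with a few small classes: a node "is an aggregate" if it or any child is, and that is what would trigger auto-grouping. The class names below are illustrative, and the logic deliberately avoids isinstance() checks, as the message notes:

```python
# A node reports itself as an aggregate if any child does; an Aggregate
# leaf (here, Sum) returns True by default.
class Node:
    def __init__(self, *children):
        self.children = children

    def aggregate(self):
        return any(c.aggregate() for c in self.children)

class Field(Node):
    def aggregate(self):
        return False

class Sum(Node):
    def aggregate(self):
        return True

print(Add := None)  # placeholder removed; see expressions below

expr = Node(Field(), Sum(Field()))   # e.g. field + SUM(other_field)
print(expr.aggregate())              # True: triggers auto-grouping
print(Node(Field(), Field()).aggregate())  # False: no grouping needed
```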

Cheers,
  Justin

flo...@gmail.com

unread,
Mar 24, 2008, 8:22:37 PM3/24/08
to Django developers
A quick ping: I've written a small patch against qs-rf which might
facilitate the development of aggregates, and solve David Cramer's
problem of having to fork Django.

The idea is to allow for a custom QuerySet subclass to be used by
default everywhere, so if you subclass
django.db.models.query.BaseQuerySet and provide group_by
functionality, then you can point Django at your new subclass and
Django will use it everywhere.

Patch is here: http://code.djangoproject.com/ticket/6875

-Eric Florenzano

Russell Keith-Magee

unread,
Mar 24, 2008, 8:27:22 PM3/24/08
to django-d...@googlegroups.com

I haven't given it any specific thought, but it stands to reason that
the eccentricities of any given backend should be abstracted away.
Part of this design process is to identify the crusty edge cases that
we need to handle. Thanks for the pointer.

Yours
Russ Magee %-)

Russell Keith-Magee

unread,
Mar 24, 2008, 9:12:46 PM3/24/08
to django-d...@googlegroups.com
On Mon, Mar 24, 2008 at 7:19 AM, David Cramer <dcr...@gmail.com> wrote:
>
> So we're not going to support group by at all, even though is
> extremely simple? I don't understand why everyone says mine (and
> Curse's and many other peoples) use of SQL is so uncommon. I'd
> guarantee almost every developer and I know who writes any kind of
> scalable code (and I've seen what they've written) uses GROUP BY in
> SQL much more than just for SUM, or COUNT, or any of those operations.

And yet, the example you give is a count on related objects.

> SELECT my_table.* FROM my_table JOIN my_other_table WHERE my_other
> table.my_table_id = my_table.id GROUP BY my_table.id HAVING COUNT(1) >
> 1;
>
> How do you ever hope to achieve something like this in the ORM without
> patches like mine?

Lets see if I can make myself clearer by using your example.

You have provided some sample SQL. However, that wasn't the start
point. Once upon a time, you had a problem you needed to solve: "I
need to find all the MyTable objects that have more than one related
OtherTable objects". Notice that this question doesn't use the phrase
"GROUP BY".

Django's ORM didn't support aggregates, so you hand crafted some SQL,
and since aggregates were involved, you needed to use a GROUP BY
clause. This is because GROUP BY is the way SQL handles aggregation.

Now, we're talking about adding aggregates to Django. So, we need to
find a way to express your original question as Django syntax. The
solution to this problem isn't "add GROUP BY to Django Syntax" - it's
to find an elegant way to represent the underlying problem as an
object-based syntax. Based on the discussions so far, something like
this would seem appropriate:

MyTable.objects.aggregate(othertable__count='related_count').filter(related_count__gt=1)

Notice that this query doesn't require the use of the phrase GROUP BY
in the ORM syntax. When this hits the database, GROUP BY will
certainly be required. However, the terms for the GROUP BY can be
entirely implied from the rest of the query. There is no need to
expose GROUP BY in the ORM syntax.
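The SQL that this example would imply can be assembled by hand, to show that the GROUP BY terms fall out of the non-aggregate columns in the SELECT. The table and column names below are assumed for illustration:

```python
# The GROUP BY list is simply the non-aggregate SELECT columns; the
# user never writes it.
fields = ["mytable.id", "mytable.name"]  # assumed MyTable columns
select = ", ".join(fields + ["COUNT(othertable.id) AS related_count"])
group_by = ", ".join(fields)             # implied from the query
sql = ("SELECT %s FROM mytable "
       "JOIN othertable ON othertable.mytable_id = mytable.id "
       "GROUP BY %s HAVING COUNT(othertable.id) > 1" % (select, group_by))
print(sql)
```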

Your original question wasn't "How do I perform a GROUP BY over a
MyTable" - it was a real-world question that needed an object-based
answer. The goal of the Django ORM is to solve your _actual_ problem,
not your SQL problem.

> My main argument is there's no reason to *not* include the
> functionality except that a few select people say "we dont need it".

There is a very good reason, which I have told you several times. The
Django ORM does not provide an alternate syntax for SQL. It provides a
way of interrogating an object model that happens to be backed by a
database. As a result, the external APIs are not SQL specific - they
encompass broader object-based use-cases. "Just adding GROUP BY"
doesn't achieve any of the Django design goals.

Now, if you can present a use case that isn't satisfied by the
discussions we are currently having - that is, a usage of GROUP BY
that steps outside the bounds of what could be implied from other
parts of an ORM query - let us know. We want to be able to cover as
many potential queries as possible. However, even then, I doubt that
the solution will be "add GROUP BY" - there will be another class of
object-based queries that we haven't considered, for which we will
need a new _object based_ syntax.

Yours,
Russ Magee %-)

David Cramer

unread,
Mar 25, 2008, 1:48:48 AM3/25/08
to Django developers
Distinct can cover most use cases for this, but in general:
MyModel.objects.filter(some_kind_of_manytomany__blah__blah__blah=X).select_related(some_kind_of_manytomany__blah).group_by(some_kind_of_manytomany__blah)

If you tag on anything for an additional select that distinct can't
handle, you need group by.

Btw, your example doesn't work; HAVING happens after the query
finishes executing, not within the WHERE clause. That specific use
case *has* to execute exactly as shown.
On Mar 24, 6:12 pm, "Russell Keith-Magee" <freakboy3...@gmail.com>
wrote:

David Cramer

unread,
Mar 25, 2008, 1:59:57 AM3/25/08
to Django developers
Another thing, which was solved in my group_by patch, is count()'ing
those rows. I haven't looked at any of the aggregate code but did you
solve that?

Here's the query in Curse's ORM:

MyModel.objects.filter(Q(tags__tag='test') |
Q(tags__tag='test2')).extra(having=['count(1) =
2']).group_by('id').count()

On Mar 24, 6:12 pm, "Russell Keith-Magee" <freakboy3...@gmail.com>
wrote:

Russell Keith-Magee

unread,
Mar 26, 2008, 9:51:28 PM3/26/08
to django-d...@googlegroups.com
On Tue, Mar 25, 2008 at 2:48 PM, David Cramer <dcr...@gmail.com> wrote:
>
> Distinct can cover most use cases for this, but in general:
> MyModel.objects.filter(some_kind_of_manytomany__blah__blah__blah=X).select_related(some_kind_of_manytomany__blah).group_by(some_kind_of_manytomany__blah)

And now we finally reach your actual use case - the fact that you need
to issue queries that produce non-distinct results for each object
instance in the result set. Any given instance of MyModel in your
database will return N instances in the result set, where N is the
number of m2m related objects associated with the instance. To me,
this seems like a variant on the annotation use case. Might I suggest
the following:

MyModel.objects.annotate(some_kind_of_manytomany__blah__extra='extra_data')

I don't think __extra is the best keyword to use here, but for the
moment it will do as a placeholder. The idea is to define an annotated
output set - but rather than annotating with a computed aggregate,
you're annotating with data from related models. Since the extra_data
annotations aren't (necessarily) unique for an instance, there will be
N instances of each MyModel instance in the result set, with each
instance in the result set annotated with a unique joined value from
the m2m relation. The GROUP BY can be implied as the list of fields in
MyModel, plus any __extra annotations.
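The result shape being described can be simulated in plain Python: each parent row repeats once per related value it is annotated with, so a parent with N m2m values yields N annotated rows. All names and data below are made up:

```python
# Each parent row is duplicated once per related value, carrying that
# value as an annotation (the '__extra' idea in the message above).
parents = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
related = {1: ["x", "y", "z"], 2: ["x"]}  # m2m values keyed by parent id

annotated = [dict(parent, extra_data=value)
             for parent in parents
             for value in related.get(parent["id"], [])]
for row in annotated:
    print(row)
# parent 1 appears three times, parent 2 once
```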

The naming of __extra notwithstanding, how does this sound?

> If you tag on anything for an additional select that distinct can't
> handle, you need group by.

Like I keep saying - no, you don't need GROUP BY. GROUP BY is a means
to an end, not the only way of expressing the concept. What is needed
is a mechanism in the ORM that can represent the underlying use case
that necessitates the use of GROUP BY in the SQL query that is
generated.

> Btw, your example doesn't work, having happens after the query
> finishes executing, not within the where clause. That specific use
> case *has* to execute exaclty as shown.

Which bit doesn't work? As far as I can make out, there isn't anything
ambiguous about the query I gave you - the ORM is given a bunch of
filter statements, and can make a decision about whether they are
WHERE or HAVING statements based on the GROUP BY that is implied by
the original object query.
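The decision described here can be sketched as a small routing function: once the aggregate aliases are known, each filter condition goes to WHERE or HAVING depending on whether it references an alias. Names are illustrative:

```python
# Route filter lookups: conditions on aggregate aliases become HAVING,
# everything else stays in WHERE.
def split_filters(filters, aggregate_aliases):
    where, having = {}, {}
    for lookup, value in filters.items():
        field = lookup.split("__")[0]
        (having if field in aggregate_aliases else where)[lookup] = value
    return where, having

w, h = split_filters({"name__startswith": "a", "related_count__gt": 1},
                     {"related_count"})
print(w)  # {'name__startswith': 'a'}
print(h)  # {'related_count__gt': 1}
```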

Yours,
Russ Magee %-)

Russell Keith-Magee

unread,
Mar 26, 2008, 10:08:18 PM3/26/08
to django-d...@googlegroups.com
On Thu, Mar 20, 2008 at 5:42 PM, Justin Fagnani
<justin....@gmail.com> wrote:
>
> On Wed, Mar 19, 2008 at 11:24 PM, Russell Keith-Magee
> <freakb...@gmail.com> wrote:
> >
> > I don't completely agree that values() is the right place.
>
> I'm not 100% convinced values() is the right place either, but on the other
> hand these are still values that are being retuned, just calculated values.
> For some simple cases, such as selecting a single aggregate value with no
> joins, I think the usage is pretty obvious-looking.

Except that values() doesn't currently return 'just values' - it
returns a list of values, notionally tied to object instances.
Top-level aggregates won't ever be a list, and instance-based
(annotated) aggregates will always be associated with an instance.

> I hadn't really considered this, but when would you want to calculate a
> value and not include it? If it's for query purposes I'd think that a good
> way to handle it would be to add expression support directly to filter() so
> that you could say queryset.filter(field1__lte=Avg('field1')) and not have
> the aggregate in the results. Or do something similar with a Q-like object.

The simplest case I can think of is that you may want to filter on an
aggregate, but not include it as part of the result set because you
don't want to dirty the attribute list of the instances being
returned.

> > > > - Tell me how much effort has been expended by User U?
> > > >>> u.work_log.values(Sum('hours'))
> >
> > This highlights my earlier point: values() returns a ValuesQuerySet -
> > a list of dictionaries. What gets returned in this example?
> >
> It's still a ValuesQuerySet, but because there's no group by clause it's
> only one item:
> [{'hours_sum': 7.5}]

Seriously - this sounds like a good idea to you? A list that will only
ever contain a single dictionary? I think I can safely put on my core
developer hat and say "This idea is a non-starter".

> It's possible to set result_type=SINGLE where there's an aggregate but no
> group by, but I think it'd be better for values() to always return the same
> type.

Well, sure. Then, we could put a result_type=SINGLE into filter(), and
we could get rid of get().

Or, we could use a different clause, and not have to worry about
remembering the result_type argument, importing the SINGLE constant,
or debugging the different types of output that could be returned. :-)

> > > > - Give me a list of milestones with the amount of effort outstanding
> > > > on each milestone
> > > >>> Milestone.objects.values('id', hours_left=Sum('task__estimate) -
> > > Sum('task__work_log__hours'))
> >
> > This actually starts to point out a flaw in your proposed syntax - or
> > at least, an API weakness that we won't be able to avoid...
> > ...my Magic 8-ball predicts many mailing list
> > discussions wondering why they are getting argument ordering
> > exceptions getting thrown from what looks like a perfectly legal
> > aggregate query.
>
> I don't know if I agree with that concern so much, but your 8-ball is surely
> better tuned than mine for this type of thing. The SyntaxError in that case
> is pretty clear, and users can be kindly reminded that they are still
> writing Python.

The problem is that a non-trivial number of Django newcomers don't
think of Django as a Python library - they think of it as a language
unto itself. Case in point - the number of questions asked on
Django-users that can be answered by pointing to the Python docs.
Setting up APIs that can raise errors based upon non-adherence to
Python doctrine is the sort of thing that sets of alarm bells for me.

Yours,
Russ Magee %-)
