Aggregates: anything planned for 1.0?

9 views
Skip to first unread message

Brian Beck

unread,
Apr 12, 2007, 3:27:52 PM4/12/07
to Django developers
I didn't see anything about aggregate support in
VersionOneFeatures[1], are people still hung up about the syntax, or
is anyone hacking on this? It's been mentioned a couple times since
PyCon, but what's the status?

[1] http://code.djangoproject.com/wiki/VersionOneFeatures

Honza Král

unread,
Apr 12, 2007, 3:59:33 PM4/12/07
to django-d...@googlegroups.com
the queryset refactoring must (well, should) happen before aggregation
support. After the refactoring, it should be relatively easy to put in
the aggregations.

see:
http://code.djangoproject.com/ticket/3566
http://groups.google.com/group/django-developers/browse_frm/thread/691da1d6e3b0661d/3e7aaadef9006866


I am currently finishing one project, it should be over in a couple of
days. If I don't get a new one I may finally have some time to do some
work on it. But I cannot promise anything, so if you have the time, go
for it.. ;)

Honza


--
Honza Král
E-Mail: Honza...@gmail.com
ICQ#: 107471613
Phone: +420 606 678585

Russell Keith-Magee

unread,
Apr 12, 2007, 8:24:04 PM4/12/07
to django-d...@googlegroups.com
On 4/13/07, Brian Beck <exo...@gmail.com> wrote:
>
> I didn't see anything about aggregate support in
> VersionOneFeatures[1], are people still hung up about the syntax, or
> is anyone hacking on this? It's been mentioned a couple times since
> PyCon, but what's the status?

Many proposals have been made on this feature, but nothing has been
decided upon. For the record, here's my last attempt at a proposal:

http://groups.google.com/group/django-developers/msg/2f91fa217bc465bb

I'm _very_ keen to get aggregates into Django - it was one of the
original reasons I started contributing to the project, but
magic-removal, the test frameworks, fixtures, and the 0.96 push took
priority.

Implementing this feature is also waiting on a QuerySet rewrite. I
believe Malcolm said he was working on this, but I don't know how far
he has got.

Yours,
Russ Magee %-)

Russell Keith-Magee

unread,
Apr 12, 2007, 8:25:01 PM4/12/07
to django-d...@googlegroups.com
On 4/13/07, Honza Král <honza...@gmail.com> wrote:
> the queryset refactoring must (well, should) happen before aggregation
> support. After the refactoring, it should be relatively easy to put in
> the aggregations.
>
> see:
> http://code.djangoproject.com/ticket/3566

I didn't notice this ticket go past (we _really_ need to fix the RSS
feed for new tickets...). I'll take a look at it and work up some
comments.

Yours,
Russ Magee %-)

Russell Keith-Magee

unread,
Apr 12, 2007, 11:11:00 PM4/12/07
to django-d...@googlegroups.com
On 4/13/07, Honza Král <honza...@gmail.com> wrote:
> the queryset refactoring must (well, should) happen before aggregation
> support. After the refactoring, it should be relatively easy to put in
> the aggregations.
>
> see:
> http://code.djangoproject.com/ticket/3566

Ok; I've had a chance to look at this now. Although there are some
syntax differences, I think your idea and mine [1] are acutally pretty
close. A bit of both, and we might just have something.

Comments:

GROUP BY = values()
~~~~~~~~~~~~~~~~~~

Isn't the group_by clause duplicating what is already provided by the
values() filter? i.e., saying 'return me a list of dictionaries of
object data that isn't a full object'? When the aggregates clause is
in use, a GROUP BY over the selected fields is implied. If you don't
provide a values clause, then you want to GROUP BY all the fields in
the model

So, your example would become:

>>> queryset = Model.objects.all()
>>> queryset.values('name', 'city').aggregates(sum=('pay',
'some_other_field'), avg=('pay', 'age'), count=True)

Output format
~~~~~~~~~~~~~
As my proposal established, I'm a fan of the 'augmenting the returned
data' approach:

[
{ 'name' : 'John Doe',
'city' : 'Prague',
'sum' :' {
'pay' : 50000,
'some_other_field' : 10000
}
'average' : {
'pay' : 10000,
'age' : 30
},
'count' : 5,
},
...
]

To me, the clause is returning a list of data groups (i.e., the
results of the values() clause); putting the grouping data in a
group_by dictionary seems overkill. This approach does introduce some
new magic-words that are not allowed as field names (or are at least
dangerous if you do). Personally, I don't see that as a major problem.

Weaknesses in the function=tuple argument approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

While the function=tuple of field names approach does work, I feel it
is lacking:
- You can't use aliases: SUM(field) as "total_pay"
- Filtering on aggregated values is difficult (impossible?)
- It doesn't handle joins in the same way as filter()

The approach I proposed was to mirror the filter syntax:
pay__sum='total_pay'
other_field__average='some_alias'

This approach mirrors the filter mechanism for attribute naming and
joins, allows for aliases, which provides a mechanism to filter on
aliases.

The only downside I can see is that the simple case requires the user
to provide an alias - i.e., pay__sum='pay_sum'. I'll admit that this
isn't particularly elegant.

However, it should be possible to combine both approaches:

>>> Model.objects.aggregate(sum=('pay'), other_field__average='mean_value')
[
{ 'name' : 'John Doe',
'city' : 'Prague',
'sum' :' {
'pay' : 50000,
}
'mean_value' :30,
},
...
]

This means the sum of pay isn't aliased, so it can't be filtered, but
you can filter on 'mean_value'. Argument clashes aren't a problem
either - 'sum' by itself is unambiguous as an argument; any
filter-like argument will need to have at least a '__' in it.

Ambiguities in COUNT
~~~~~~~~~~~~~~~~~~~~

aggregate(count=True) is already covered with count(); I'm not sure
I'm wild about the duplication. There is also the problem that it
precludes counting other things: e.g., counting related objects -
return the number of books that each author has written.

Author.objects.aggregate(books_count='number_of_books_written')
or
Author.objects.aggregate(count=('books'))

On top of this, you can get the gross count from:

len(Author.objects.aggregate(...))

Aggregate of results vs aggregates of related objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you run an aggregate over a field from the table you are querying,
the aggregate is a table level property, not a row level property.
Compare the expected results from:

- Get me the date of the earliest and latest published article: Return
a single value for the min, and a single value for the max.
- Annotate each author with the date of their earliest and latest
published article: Return a min and max value for each author.

The second case is relatively easy:

>>> Author.objects.aggregate(article__pub_date__min='earliest',
article__pub_date__max='latest'))
[{
'name':'John Doe',
'earliest': 2007-01-04
'latest': 2007-02-14
}
...
]

However, the first case:

>>> Article.objects.aggregate(pub_date__min='earliest', pub_date__max='latest'))
{[
'title': 'first article'
'earliest': 2007-01-04
'latest': 2007-02-14
],
[
'title': 'second article'
'earliest': 2007-01-04
'latest': 2007-02-14
],
}

while legal, isn't the most elegant way at getting at a unique min/max
for an entire table. This was the reason for my suggestion for two
functions: aggregate and annotate. Aggregate returns a dictionary of
just aggregated values:

>>> Article.objects.aggregate(min=('pub_date'), max=('pub_date'))
{
'min' : '2007-01-04',
'max' : '2007-02-14'
}

Whereas annotate() returns a list of annotated dictionary data.

================================

I'd better stop now - this is getting longer than I originally
anticipated, and if it gets much longer nobody will read it :-) As
always, feedback welcome.

Yours,
Russ Magee %-)

[1] http://groups.google.com/group/django-developers/msg/2f91fa217bc465bb

Honza Král

unread,
Apr 13, 2007, 6:16:02 PM4/13/07
to django-d...@googlegroups.com
On 4/13/07, Russell Keith-Magee <freakb...@gmail.com> wrote:
>
> On 4/13/07, Honza Král <honza...@gmail.com> wrote:
> > the queryset refactoring must (well, should) happen before aggregation
> > support. After the refactoring, it should be relatively easy to put in
> > the aggregations.
> >
> > see:
> > http://code.djangoproject.com/ticket/3566
>
> Ok; I've had a chance to look at this now. Although there are some
> syntax differences, I think your idea and mine [1] are acutally pretty
> close. A bit of both, and we might just have something.

great, thanks for your comments. I attached my answers in the text

>
> Comments:
>
> GROUP BY = values()
> ~~~~~~~~~~~~~~~~~~
>
> Isn't the group_by clause duplicating what is already provided by the
> values() filter? i.e., saying 'return me a list of dictionaries of
> object data that isn't a full object'?

from a technical view - YES, from user's view: NO
that should be two different things, only the same format of returned
data doesn't seem to be enough to merge these two functions

When the aggregates clause is
> in use, a GROUP BY over the selected fields is implied. If you don't
> provide a values clause, then you want to GROUP BY all the fields in
> the model
>
> So, your example would become:
>
> >>> queryset = Model.objects.all()
> >>> queryset.values('name', 'city').aggregates(sum=('pay',
> 'some_other_field'), avg=('pay', 'age'), count=True)

if you want to go with values() (which I am personally against), why not do:

qset.values('field', 'field2', 'field3__sum', 'f4__count', ...) < = >
SELECT SUM(field3), count(f4) GROUP BY field, field2

I still don't like it, but seems better, and would solve issues with
trying to refer to the aggregated value --
order_by('-f4__count','field'), see below

>
> Output format
> ~~~~~~~~~~~~~
> As my proposal established, I'm a fan of the 'augmenting the returned
> data' approach:
>
> [
> { 'name' : 'John Doe',
> 'city' : 'Prague',
> 'sum' :' {
> 'pay' : 50000,
> 'some_other_field' : 10000
> }
> 'average' : {
> 'pay' : 10000,
> 'age' : 30
> },
> 'count' : 5,
> },
> ...
> ]

what if I have a field called 'average' ?

>
> To me, the clause is returning a list of data groups (i.e., the
> results of the values() clause); putting the grouping data in a
> group_by dictionary seems overkill. This approach does introduce some
> new magic-words that are not allowed as field names (or are at least
> dangerous if you do). Personally, I don't see that as a major problem.

I do, I just don't like constraints on naming, it just feels wrong,
when there is a simple way around this.
it would also allow for easier generic processing of the results (for
example when building a general statistics app, we will know without
inspecting the model, which fields were in the GROUP BY etc. --
explicit is better than implicit ;)

>
> Weaknesses in the function=tuple argument approach
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> While the function=tuple of field names approach does work, I feel it
> is lacking:
> - You can't use aliases: SUM(field) as "total_pay"
> - Filtering on aggregated values is difficult (impossible?)

true, this is a problem for anything I am presenting

you could use something like order_by('sum(field3)'), but that seems weird...
I outlined one solution in the values() part, but I am not happy with
that either...

> - It doesn't handle joins in the same way as filter()

true, but it should, that's why I want to keep the arguments as simple
string containing only the field name (including optional
foreign__or__other__key__lookups), so same logic could be used for
filters and these

>
> The approach I proposed was to mirror the filter syntax:
> pay__sum='total_pay'
> other_field__average='some_alias'

if you go with __sum etc., you don't need aliases, you can use this
name (field__sum) as a reference to the field:

qset = qset.values('field', 'field2', 'field3__sum', 'f4__count', ...)

qset.filter( field__contains='AA' ) # results in a WHERE clause
qset.filter( field3__sum__lte=23 ) # results in a HAVING clause
qset.order_by( '-field3__sum', 'field' )


>
> This approach mirrors the filter mechanism for attribute naming and
> joins, allows for aliases, which provides a mechanism to filter on
> aliases.
>
> The only downside I can see is that the simple case requires the user
> to provide an alias - i.e., pay__sum='pay_sum'. I'll admit that this
> isn't particularly elegant.
>
> However, it should be possible to combine both approaches:
>
> >>> Model.objects.aggregate(sum=('pay'), other_field__average='mean_value')
> [
> { 'name' : 'John Doe',
> 'city' : 'Prague',
> 'sum' :' {
> 'pay' : 50000,
> }
> 'mean_value' :30,
> },
> ...
> ]

this is inconsistent, there should imho be just one way for this...

>
> This means the sum of pay isn't aliased, so it can't be filtered, but
> you can filter on 'mean_value'. Argument clashes aren't a problem
> either - 'sum' by itself is unambiguous as an argument; any
> filter-like argument will need to have at least a '__' in it.
>
> Ambiguities in COUNT
> ~~~~~~~~~~~~~~~~~~~~
>
> aggregate(count=True) is already covered with count(); I'm not sure
> I'm wild about the duplication.

well, but what if you need number of articles per author? existing
.count() wouldn't cut it (in one select) and once you are doing
aggregation anyway, its simple to throw in COUNT(*)...
so if you
a) change .count() to behave like .aggregate(count=True), it would break code
b) don't allow count in aggregation, you will have more queries for no
good reason.
I see .count() as a very useful shortcut for
.aggregate(count=True)['count']

There is also the problem that it
> precludes counting other things: e.g., counting related objects -
> return the number of books that each author has written.
>
> Author.objects.aggregate(books_count='number_of_books_written')
> or
> Author.objects.aggregate(count=('books'))
>
> On top of this, you can get the gross count from:
>
> len(Author.objects.aggregate(...))

I see, we obviously don't understand each other:
I don't want count=True to add count of groups produced by group by, I
want it to add information on the group's size:
SELECT autor_id, COUNT(*), AVG(rating) FROM articles GROUP BY article_id;
which would return me list of authors with number of published
articles and their average rating

>
> Aggregate of results vs aggregates of related objects
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> If you run an aggregate over a field from the table you are querying,
> the aggregate is a table level property, not a row level property.
> Compare the expected results from:
>
> - Get me the date of the earliest and latest published article: Return
> a single value for the min, and a single value for the max.
> - Annotate each author with the date of their earliest and latest
> published article: Return a min and max value for each author.
>
> The second case is relatively easy:
>
> >>> Author.objects.aggregate(article__pub_date__min='earliest',
> article__pub_date__max='latest'))
> [{
> 'name':'John Doe',
> 'earliest': 2007-01-04
> 'latest': 2007-02-14
> }
> ...
> ]

why do you assume you are grouping by name, where is it specified? the
code of the queryset doesn't say anything about a name. I am confused
by this.

>
> However, the first case:
>
> >>> Article.objects.aggregate(pub_date__min='earliest', pub_date__max='latest'))
> {[
> 'title': 'first article'
> 'earliest': 2007-01-04
> 'latest': 2007-02-14
> ],
> [
> 'title': 'second article'
> 'earliest': 2007-01-04
> 'latest': 2007-02-14
> ],
> }

I don't understand this at all: you only asked for aggregates, you
didn't specify group_by so the SELECT should look something like:

SELECT MIN(pub_date) as earliest, MAX(pub_date) as latest FROM articles;
there is no way this would return more that one row

and why would it magically include 'title' in the result ? I am confused again.

>
> while legal, isn't the most elegant way at getting at a unique min/max
> for an entire table. This was the reason for my suggestion for two
> functions: aggregate and annotate. Aggregate returns a dictionary of
> just aggregated values:
>
> >>> Article.objects.aggregate(min=('pub_date'), max=('pub_date'))
> {
> 'min' : '2007-01-04',
> 'max' : '2007-02-14'
> }
>
> Whereas annotate() returns a list of annotated dictionary data.

but those two are the same thing, the aggregate just has an empty
group by clause, so it groups the entire table and is missing the
'group_by' field in the result (or any other fields except the
aggregates you asked for)

>
> ================================
>
> I'd better stop now - this is getting longer than I originally
> anticipated, and if it gets much longer nobody will read it :-) As
> always, feedback welcome.
>
> Yours,
> Russ Magee %-)
>
> [1] http://groups.google.com/group/django-developers/msg/2f91fa217bc465bb
>
> >
>

--

Honza Král

unread,
Apr 13, 2007, 6:20:06 PM4/13/07
to django-d...@googlegroups.com
sorry, forgot to include the group_by columns in the select clause:

> qset.values('field', 'field2', 'field3__sum', 'f4__count', ...) < = >

> SELECT field, field2, SUM(field3), count(f4) FROM some_table GROUP BY field, field2;

Russell Keith-Magee

unread,
Apr 16, 2007, 1:03:30 AM4/16/07
to django-d...@googlegroups.com
Hi Honza,

Over the weekend, I tried responding to our last message inline, but
my response kept getting long and confused. So, I've summarized my
thoughts. I have three problems with your proposal:

- Your approach is very group/SQL-centric, rather than being
object/goal-centric. I would expect that the two most common groups
are 'all columns in the table' and 'no columns in the table'. Using
your syntax, the first case is very cumbersome (you have to manually
name every column you want to see in the output), and the second case
yields verbose output (you have to go 2 levels into a dictionary to
get a simple column average).

- I accept that avoiding name clashes is a desirable (and arguably
essential) goal. However, I feel that your output syntax is longwinded
for the simple case (summary of a single table, no groups), and not
particularly expressive for the complex case (summary of joined
tables). I don't buy the 'generic summary statistic app' argument;
mostly because I don't see the use case for such an application, but
also because summary statistics for joined tables will require the use
of __ syntax in the result dictionaries, which isn't as immediately
expressive as aliased names.

- While your syntax is very good for generating groups and (aggregates
of those groups) on a single table, I don't see it extending well into
joins - especially filtering on joins. Using __sum__lte as you suggest
is a problem for filters, because there is the potential for a name
clash between a model field called sum and the desire to use a sum
aggregate.

However, an example is worth a thousand words. Here are some of what I
would consider common use cases that I don't see how your syntax
addresses. Keep in mind that in each case when I ask for an author,
the group is the full list of attributes; I would like to get the full
set of object attributes list of objects along with the aggregates for
each object in a single query rather than requiring one query to get
the objects and one to get the aggregates.

R>>> is the syntax for queries using my proposed syntax. H>>> is your
syntax, as I understand it. In most cases, I can't see how your syntax
reaches the use case. Please correct me if I have misunderstood, or if
there is a better way to use your syntax.

1) Get a list of authors, with the number of articles written by each author:
H>>> Author.objects.aggregate(group_by=('first_name','last_name',...
'phone_number','address','email'), count=True)

R>>> Author.objects.annotate(article__count='number_of_articles')

2) Get a list of authors that have written at least 3 articles:
H>>> ???

R>>> Author.objects.annotate(
article__count='number_of_articles').filter(
number_of_articles__gte=3)

3) Get a list of authors with a first name of 'Fred' that have written
at least 3 articles:
H>>> ???

R>>> Author.objects.filter(name='Fred').annotate(
article__count='number_of_articles').filter(
number_of_articles__gte=3)

4) Get a list of authors that have written at least three articles
with a rating of 4 or higher:
H>>> ???

R>>> Author.objects.annotate(
article__count='number_of_good_articles',
article__rating__average='average_rating').filter(
number_of_good_articles__gte=3,
average_rating__gte=4)

5) Get a list of authors, with the number of >4 rated articles they
have written:
H>>> ???

R>>> Author.objects.annotate(
article__count='number_of_good_articles').filter(
article__rating__gte=4)

6) Get a list of authors with a first name of 'Fred' that have written
at least 3 articles:
H>>> ???

R>>> Author.objects.filter(name='Fred').annotate(
article__count='number_of_articles').filter(
number_of_articles__gte=3)

7) Get the average article rating:
H>>> Article.objects.aggregate(average=('rating'))

R>>> Article.objects.aggregate(rating__average='avg_rating')

8) What is the average rating for authors called 'George'? What about 'Fred'?
H>>> Author.objects.aggregate(group_by=('first_name'),
average=('article__rating'))

R>>> Author.objects.values('first_name').annotate(article__rating__average='avg_rating')

Out of these, I feel your syntax probably has the advantage in case
(7), but getting the output from your results requires two levels of
dictionary lookup, whereas only 1 is required in my case.

In addition to seeing you fill in these use cases with your syntax, I
would be interested in any use cases where you feel that your syntax
has the advantage. My use cases are driven by the way in which I
currently use (or would like to use) aggregates; If your use cases are
significantly different, I'd be interested in hearing them.

In particular, one of your objections to GROUP BY==values()
equivalence is that while they are technically equivalent, they are
not conceptually equivalent to the end user. Can you provide a use
case where this is true (example 8 is my counterexample)?

Yours,
Russ Magee %-)

Honza Král

unread,
Apr 16, 2007, 7:12:08 AM4/16/07
to django-d...@googlegroups.com
Hi Russ,
great comments, thanks.

On 4/16/07, Russell Keith-Magee <freakb...@gmail.com> wrote:
>
> Hi Honza,
>
> Over the weekend, I tried responding to our last message inline, but
> my response kept getting long and confused. So, I've summarized my
> thoughts. I have three problems with your proposal:
>
> - Your approach is very group/SQL-centric, rather than being
> object/goal-centric.

yes, I confess - I come from the DB world and SQL is my language of
choice ;) I love the language and the way you can express your
thoughts in it.

besides, Django's queryset is very SQL-centric so far

I would expect that the two most common groups
> are 'all columns in the table' and 'no columns in the table'.

'all columns in the table' ?? you mean like GROUP BY first_name,
last_name, address, email, ...
even if you would rewrite it as GROUP BY id and join the result with
the original table, it would be unnecessary overhead

you would either have to have smart optimizations that would recognize
this case and group by the foreign key on the data table instead, or
you would end up with a query that would most likely kill your DB.

Since Django's query building is very simple (the only thing it does
magically for you is joining the tables, escaping and returning
objects instead of fields), it would require a massive change to
introduce this sort of optimizations and the overhead without it is
simply too big to ignore for clear syntax.

Using
> your syntax, the first case is very cumbersome (you have to manually
> name every column you want to see in the output),

yes, but I cannot think of a use case where that would be the problem
when grouping or asking for overall statistics. Note please that if
you ask for a foreign key field, you will get the entire object.

I also cannot come up with a way that would allow you not to specify
every field you want on output and would still perform well

and the second case
> yields verbose output (you have to go 2 levels into a dictionary to
> get a simple column average).

yes, I know and am not happy with that either. But if not two levels, than how?

> - I accept that avoiding name clashes is a desirable (and arguably
> essential) goal. However, I feel that your output syntax is longwinded
> for the simple case (summary of a single table, no groups), and not
> particularly expressive for the complex case (summary of joined
> tables). I don't buy the 'generic summary statistic app' argument;
> mostly because I don't see the use case for such an application,

I am building one right now:
I have a system that keeps track of students in a school as they come
and go. The customer now asked for a module that would enable them to
compute statistics at the end of the year:
how many students came from this background, how many teacher wrote
the most reports on the students, what was the average height etc.

With this in Django, I could just build a helper view, that would
construct the queryset and a generic template that could display the
results and let them build the queries mostly on their own (including
some checks, of course).

I think that most bigger applications will need something like this
sooner or later. Reporting and (even this primitive) BI is very
lucrative business and customers want it.

but
> also because summary statistics for joined tables will require the use
> of __ syntax in the result dictionaries, which isn't as immediately
> expressive as aliased names.

true

> - While your syntax is very good for generating groups and (aggregates
> of those groups) on a single table, I don't see it extending well into
> joins - especially filtering on joins.

if you use the field 'names' as you would in a resulting template (as
I used in the examples here), you could do that without any name
clashes. but again, the syntax will become more verbose.

Using __sum__lte as you suggest
> is a problem for filters, because there is the potential for a name
> clash between a model field called sum and the desire to use a sum
> aggregate.
>
> However, an example is worth a thousand words. Here are some of what I
> would consider common use cases that I don't see how your syntax
> addresses. Keep in mind that in each case when I ask for an author,
> the group is the full list of attributes;

see the beginning of my response for why this is bad.

I would like to get the full
> set of object attributes list of objects along with the aggregates for
> each object in a single query rather than requiring one query to get
> the objects and one to get the aggregates.
>
> R>>> is the syntax for queries using my proposed syntax. H>>> is your
> syntax, as I understand it. In most cases, I can't see how your syntax
> reaches the use case. Please correct me if I have misunderstood, or if
> there is a better way to use your syntax.
>
> 1) Get a list of authors, with the number of articles written by each author:
> H>>> Author.objects.aggregate(group_by=('first_name','last_name',...
> 'phone_number','address','email'), count=True)

well, this would be

H>>> Article.objects.aggregate( group_by=('author',), count=True )
provided that when grouping on foreign key, the query will return the
entire object, which makes sense

>
> R>>> Author.objects.annotate(article__count='number_of_articles')
>
> 2) Get a list of authors that have written at least 3 articles:

you got me here - as I said earlier I have not yet come up with a way
to refer to the aggregated fields. I will post my examples here using
the same syntax you would use to retrieve the data in template (using
__ instead of . ):

H>>> Article.objects.aggregate( group_by=('author',), count=True
).filter( count__gte=3 )

>
> R>>> Author.objects.annotate(
> article__count='number_of_articles').filter(
> number_of_articles__gte=3)
>
> 3) Get a list of authors with a first name of 'Fred' that have written
> at least 3 articles:

H>>> Article.objects.filter( author__name='Fred'
).aggregate( group_by=('author',), count=True
).filter( count__gte=3 )

>
> R>>> Author.objects.filter(name='Fred').annotate(
> article__count='number_of_articles').filter(
> number_of_articles__gte=3)
>
> 4) Get a list of authors that have written at least three articles
> with a rating of 4 or higher:

H>>> Article.objects.aggregate(
group_by=('author',), count=True, avg=('rating',)
).filter(
group_by__author__name='Fred',
count__gte=3,
avg__rating__gte=4
)

>
> R>>> Author.objects.annotate(
> article__count='number_of_good_articles',
> article__rating__average='average_rating').filter(
> number_of_good_articles__gte=3,
> average_rating__gte=4)
>
> 5) Get a list of authors, with the number of >4 rated articles they
> have written:

H>>> Article.objects.filter( rating__gte=4 ).aggregate(
group_by=('author',), count=True )

>
> R>>> Author.objects.annotate(
> article__count='number_of_good_articles').filter(
> article__rating__gte=4)
>
> 6) Get a list of authors with a first name of 'Fred' that have written
> at least 3 articles:

isn't this same as 3) or am I missing something?

H>>> Article.objects.filter( author__name='Fred'
).aggregate( group_by=('author',), count=True
).filter( count__gte=3 )

>
> R>>> Author.objects.filter(name='Fred').annotate(
> article__count='number_of_articles').filter(
> number_of_articles__gte=3)
>
> 7) Get the average article rating:
> H>>> Article.objects.aggregate(average=('rating'))
>
> R>>> Article.objects.aggregate(rating__average='avg_rating')
>
> 8) What is the average rating for authors called 'George'? What about 'Fred'?
> H>>> Author.objects.aggregate(group_by=('first_name'),
> average=('article__rating'))
>
> R>>> Author.objects.values('first_name').annotate(article__rating__average='avg_rating')
>
> Out of these, I feel your syntax probably has the advantage in case
> (7), but getting the output from your results requires two levels of
> dictionary lookup, whereas only 1 is required in my case.

true, I would like to access the data more easily, but I am not
willing to compromise the naming or force users to manually alias
their lookups for this.

> In addition to seeing you fill in these use cases with your syntax, I
> would be interested in any use cases where you feel that your syntax
> has the advantage. My use cases are driven by the way in which I
> currently use (or would like to use) aggregates; If your use cases are
> significantly different, I'd be interested in hearing them.
>
> In particular, one of your objections to GROUP BY==values()
> equivalence is that while they are technically equivalent, they are
> not conceptually equivalent to the end user. Can you provide a use
> case where this is true (example 8 is my counterexample)?

in my opinion, if I would specify values(), I haven't said anything
about aggregating the result. If the ORM does that behind my back, I
don't like it. I just asked for individual values, I didn't say
anything about aggregation.

it just doesn't seem nice, especially if values() was along for some
time now and people got used to how it works


btw. refactoring the queryset code is necessary for this to work, so
perhaps we should focus our efforts there first. Maybe something will
emerge from that that would help us decide.

>
> Yours,
> Russ Magee %-)

Russell Keith-Magee

unread,
Apr 16, 2007, 9:55:48 AM4/16/07
to django-d...@googlegroups.com
On 4/16/07, Honza Král <honza...@gmail.com> wrote:

> besides, Django's queryset is very SQL-centric so far

To a large extent, yes, but no in some very important ways.
Article.objects.all() isn't the least bit SQL, although the mapping to
SQL is obvious. Preserving the object flavor of the ORM is an
important design consideration for the ORM. More than one proposed
extension to the ORM system has been knocked back for being little
more than a light wrapper around SQL.

> you would either have to have smart optimizations that would recognize
> this case and group by the foreign key on the data table instead, or
> you would end up with a query that would most likely kill your DB.

One way around this is to use MAX on all the required non-pk fields
from the object list being retrieved:
SELECT object.id, MAX(object.field1), SUM(related.value) FROM object
INNER JOIN related ON object.related = related.id GROUP BY object.id

Since each group for object.id has only 1 member, the MAX returns the
matching field value, without needing to be explicitly included in the
group.

> > tables). I don't buy the 'generic summary statistic app' argument;
> > mostly because I don't see the use case for such an application,
>
> I am building one right now:

...


> I think that most bigger applications will need something like this
> sooner or later. Reporting and (even this primitive) BI is very
> lucrative business and customers want it.

Ok; I misunderstood your intent. My mind was off building some sort of
'summary of summary statistics'. A 'generic view for summary stats' or
some such beast makes sense.

> > Out of these, I feel your syntax probably has the advantage in case
> > (7), but getting the output from your results requires two levels of
> > dictionary lookup, whereas only 1 is required in my case.
>
> true, I would like to access the data more easily, but I am not
> willing to compromise the naming or force users to manually alias
> their lookups for this.

This is why I thought a merging the two syntaxes might have its place.
I can see significant value in allowing aliases - the ability to
filter comes almost by accident, and it allows the developer to use
meaningful names, rather than ('average_age', rather than
'article__author__age__average'). However, the simple case ("just give
me the average, and put it somewhere simple') is extremely useful, and
is cumbersome if you must specify an alias.

> btw. refactoring the queryset code is necessary for this to work, so
> perhaps we should focus our efforts there first. Maybe something will
> emerge from that that would help us decide.

Granted. At one point, Malcolm was pushing this refactor, but I don't
know if he actually started work on it. Malcolm - you out there?

Yours,
Russ Magee %-)

Malcolm Tredinnick

unread,
Apr 16, 2007, 8:54:31 PM4/16/07
to django-d...@googlegroups.com
On Mon, 2007-04-16 at 21:55 +0800, Russell Keith-Magee wrote:
[...]

> > btw. refactoring the queryset code is necessary for this to work, so
> > perhaps we should focus our efforts there first. Maybe something will
> > emerge from that that would help us decide.
>
> Granted. At one point, Malcolm was pushing this refactor, but I don't
> know if he actually started work on it. Malcolm - you out there?

I'm working on it actively. It's taken a back seat until I've finished
the unicode work, though -- since that is necessary to fix some real
bugs we have right now -- which should be end of this coming weekend, I
would hope.

Regards,
Malcolm

Honza Král

unread,
Apr 16, 2007, 9:03:02 PM4/16/07
to django-d...@googlegroups.com

great to hear that.. if you need any help, I have some free time now
and am willing to help. I gave the refactoring and aggregations a lot
of thought and I know my way around SQL.

> Regards,
> Malcolm

Malcolm Tredinnick

unread,
Apr 16, 2007, 9:06:08 PM4/16/07
to django-d...@googlegroups.com
Hey Honza,

On Tue, 2007-04-17 at 03:03 +0200, Honza Král wrote:
> On 4/17/07, Malcolm Tredinnick <mal...@pointy-stick.com> wrote:
> >
> > On Mon, 2007-04-16 at 21:55 +0800, Russell Keith-Magee wrote:
> > [...]
> > > > btw. refactoring the queryset code is necessary for this to work, so
> > > > perhaps we should focus our efforts there first. Maybe something will
> > > > emerge from that that would help us decide.
> > >
> > > Granted. At one point, Malcolm was pushing this refactor, but I don't
> > > know if he actually started work on it. Malcolm - you out there?
> >
> > I'm working on it actively. It's taken a back seat until I've finished
> > the unicode work, though -- since that is necessary to fix some real
> > bugs we have right now -- which should be end of this coming weekend, I
> > would hope.
>
> great to hear that.. if you need any help, I have some free time now
> and am willing to help. I gave the refactoring and aggregations a lot
> of thought and I know my way around SQL.

Good to hear. :-)

I'll get the basic stuff out there as soon as it holds a little water.
It's very similar to the prototype version you wrote earlier (I'd
started a little before you, but it was surprising how similar they
were), just with a few less classes. So you'll find the code very
similar.

I'm not going to touch the aggregation stuff, since there are enough
people thinking about that already.

Regards,
Malcolm

Reply all
Reply to author
Forward
0 new messages