Google Groups Home
Help | Sign in
Message from discussion Aggregates: anything planned for 1.0?
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
Russell Keith-Magee  
View profile
 More options Apr 12 2007, 11:11 pm
From: "Russell Keith-Magee" <freakboy3...@gmail.com>
Date: Fri, 13 Apr 2007 11:11:00 +0800
Subject: Re: Aggregates: anything planned for 1.0?
On 4/13/07, Honza Král <honza.k...@gmail.com> wrote:

> the queryset refactoring must (well, should) happen before aggregation
> support. After the refactoring, it should be relatively easy to put in
> the aggregations.

> see:
> http://code.djangoproject.com/ticket/3566

Ok; I've had a chance to look at this now. Although there are some
syntax differences, I think your idea and mine [1] are acutally pretty
close. A bit of both, and we might just have something.

Comments:

GROUP BY = values()
~~~~~~~~~~~~~~~~~~

Isn't the group_by clause duplicating what is already provided by the
values() filter? i.e., saying 'return me a list of dictionaries of
object data that isn't a full object'? When the aggregates clause is
in use, a GROUP BY over the selected fields is implied. If you don't
provide a values clause, then you want to GROUP BY all the fields in
the model

So, your example would become:

>>> queryset = Model.objects.all()
>>> queryset.values('name', 'city').aggregates(sum=('pay',

'some_other_field'), avg=('pay', 'age'), count=True)

Output format
~~~~~~~~~~~~~
As my proposal established, I'm a fan of the 'augmenting the returned
data' approach:

[
 { 'name' : 'John Doe',
   'city' : 'Prague',
   'sum' :' {
      'pay' : 50000,
      'some_other_field' : 10000
   }
   'average' : {
      'pay' : 10000,
      'age' : 30
   },
   'count' : 5,
 },
 ...
]

To me, the clause is returning a list of data groups (i.e., the
results of the values() clause); putting the grouping data in a
group_by dictionary seems overkill. This approach does introduce some
new magic-words that are not allowed as field names (or are at least
dangerous if you do). Personally, I don't see that as a major problem.

Weaknesses in the function=tuple argument approach
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

While the function=tuple of field names approach does work, I feel it
is lacking:
- You can't use aliases: SUM(field) as "total_pay"
- Filtering on aggregated values is difficult (impossible?)
- It doesn't handle joins in the same way as filter()

The approach I proposed was to mirror the filter syntax:
pay__sum='total_pay'
other_field__average='some_alias'

This approach mirrors the filter mechanism for attribute naming and
joins, allows for aliases, which provides a mechanism to filter on
aliases.

The only downside I can see is that the simple case requires the user
to provide an alias - i.e., pay__sum='pay_sum'. I'll admit that this
isn't particularly elegant.

However, it should be possible to combine both approaches:

>>> Model.objects.aggregate(sum=('pay'), other_field__average='mean_value')

[
 { 'name' : 'John Doe',
   'city' : 'Prague',
   'sum' :' {
      'pay' : 50000,
   }
   'mean_value' :30,
 },
 ...
]

This means the sum of pay isn't aliased, so it can't be filtered, but
you can filter on 'mean_value'. Argument clashes aren't a problem
either - 'sum' by itself is unambiguous as an argument; any
filter-like argument will need to have at least a '__' in it.

Ambiguities in COUNT
~~~~~~~~~~~~~~~~~~~~

aggregate(count=True) is already covered with count(); I'm not sure
I'm wild about the duplication. There is also the problem that it
precludes counting other things: e.g., counting related objects -
return the number of books that each author has written.

Author.objects.aggregate(books_count='number_of_books_written')
or
Author.objects.aggregate(count=('books'))

On top of this, you can get the gross count from:

len(Author.objects.aggregate(...))

Aggregate of results vs aggregates of related objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you run an aggregate over a field from the table you are querying,
the aggregate is a table level property, not a row level property.
Compare the expected results from:

- Get me the date of the earliest and latest published article: Return
a single value for the min, and a single value for the max.
- Annotate each author with the date of their earliest and latest
published article: Return a min and max value for each author.

The second case is relatively easy:

>>> Author.objects.aggregate(article__pub_date__min='earliest',

article__pub_date__max='latest'))
[{
    'name':'John Doe',
    'earliest': 2007-01-04
    'latest': 2007-02-14
}

...
]

However, the first case:

>>> Article.objects.aggregate(pub_date__min='earliest', pub_date__max='latest'))

{[
 'title': 'first article'
 'earliest': 2007-01-04
 'latest': 2007-02-14
],
[
 'title': 'second article'
 'earliest': 2007-01-04
 'latest': 2007-02-14
],

}

while legal, isn't the most elegant way at getting at a unique min/max
for an entire table. This was the reason for my suggestion for two
functions: aggregate and annotate. Aggregate returns a dictionary of
just aggregated values:

>>> Article.objects.aggregate(min=('pub_date'), max=('pub_date'))

{
   'min' : '2007-01-04',
   'max' : '2007-02-14'

}

Whereas annotate() returns a list of annotated dictionary data.

================================

I'd better stop now - this is getting longer than I originally
anticipated, and if it gets much longer nobody will read it :-) As
always, feedback welcome.

Yours,
Russ Magee %-)

[1] http://groups.google.com/group/django-developers/msg/2f91fa217bc465bb


    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.

Create a group - Google Groups - Google Home - Terms of Service - Privacy Policy
©2008 Google