Google Groups Home
Help | Sign in
Message from discussion Aggregates: anything planned for 1.0?
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
Honza Král  
View profile
 More options Apr 13 2007, 6:16 pm
From: "Honza Král" <honza.k...@gmail.com>
Date: Sat, 14 Apr 2007 00:16:02 +0200
Local: Fri, Apr 13 2007 6:16 pm
Subject: Re: Aggregates: anything planned for 1.0?
On 4/13/07, Russell Keith-Magee <freakboy3...@gmail.com> wrote:

> On 4/13/07, Honza Král <honza.k...@gmail.com> wrote:
> > the queryset refactoring must (well, should) happen before aggregation
> > support. After the refactoring, it should be relatively easy to put in
> > the aggregations.

> > see:
> > http://code.djangoproject.com/ticket/3566

> Ok; I've had a chance to look at this now. Although there are some
> syntax differences, I think your idea and mine [1] are acutally pretty
> close. A bit of both, and we might just have something.

great, thanks for your comments. I attached my answers in the text

> Comments:

> GROUP BY = values()
> ~~~~~~~~~~~~~~~~~~

> Isn't the group_by clause duplicating what is already provided by the
> values() filter? i.e., saying 'return me a list of dictionaries of
> object data that isn't a full object'?

from a technical view - YES, from user's view: NO
that should be two different things, only the same format of returned
data doesn't seem to be enough to merge these two functions

When the aggregates clause is

> in use, a GROUP BY over the selected fields is implied. If you don't
> provide a values clause, then you want to GROUP BY all the fields in
> the model

> So, your example would become:

> >>> queryset = Model.objects.all()
> >>> queryset.values('name', 'city').aggregates(sum=('pay',
> 'some_other_field'), avg=('pay', 'age'), count=True)

if you want to go with values() (which I am personally against), why not do:

qset.values('field', 'field2', 'field3__sum', 'f4__count', ...) < = >
SELECT SUM(field3), count(f4) GROUP BY field, field2

I still don't like it, but seems better, and would solve issues with
trying to refer to the aggregated value --
order_by('-f4__count','field'), see below

> Output format
> ~~~~~~~~~~~~~
> As my proposal established, I'm a fan of the 'augmenting the returned
> data' approach:

> [
>  { 'name' : 'John Doe',
>    'city' : 'Prague',
>    'sum' :' {
>       'pay' : 50000,
>       'some_other_field' : 10000
>    }
>    'average' : {
>       'pay' : 10000,
>       'age' : 30
>    },
>    'count' : 5,
>  },
>  ...
> ]

what if I have a field called 'average' ?

> To me, the clause is returning a list of data groups (i.e., the
> results of the values() clause); putting the grouping data in a
> group_by dictionary seems overkill. This approach does introduce some
> new magic-words that are not allowed as field names (or are at least
> dangerous if you do). Personally, I don't see that as a major problem.

I do, I just don't like constraints on naming, it just feels wrong,
when there is a simple way around this.
it would also allow for easier generic processing of the results (for
example when building a general statistics app, we will know without
inspecting the model, which fields were in the GROUP BY etc. --
explicit is better than implicit ;)

> Weaknesses in the function=tuple argument approach
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> While the function=tuple of field names approach does work, I feel it
> is lacking:
> - You can't use aliases: SUM(field) as "total_pay"
> - Filtering on aggregated values is difficult (impossible?)

true, this is a problem for anything I am presenting

you could use something like order_by('sum(field3)'), but that seems weird...
I outlined one solution in the values() part, but I am not happy with
that either...

> - It doesn't handle joins in the same way as filter()

true, but it should, that's why I want to keep the arguments as simple
string containing only the field name (including optional
foreign__or__other__key__lookups), so same logic could be used for
filters and these

> The approach I proposed was to mirror the filter syntax:
> pay__sum='total_pay'
> other_field__average='some_alias'

if you go with __sum etc., you don't need aliases, you can use this
name (field__sum) as a reference to the field:

qset = qset.values('field', 'field2', 'field3__sum', 'f4__count', ...)

qset.filter( field__contains='AA' ) # results in a WHERE clause
qset.filter( field3__sum__lte=23 ) # results in a HAVING clause
qset.order_by( '-field3__sum', 'field' )

> This approach mirrors the filter mechanism for attribute naming and
> joins, allows for aliases, which provides a mechanism to filter on
> aliases.

> The only downside I can see is that the simple case requires the user
> to provide an alias - i.e., pay__sum='pay_sum'. I'll admit that this
> isn't particularly elegant.

> However, it should be possible to combine both approaches:

> >>> Model.objects.aggregate(sum=('pay'), other_field__average='mean_value')
> [
>  { 'name' : 'John Doe',
>    'city' : 'Prague',
>    'sum' :' {
>       'pay' : 50000,
>    }
>    'mean_value' :30,
>  },
>  ...
> ]

this is inconsistent, there should imho be just one way for this...

> This means the sum of pay isn't aliased, so it can't be filtered, but
> you can filter on 'mean_value'. Argument clashes aren't a problem
> either - 'sum' by itself is unambiguous as an argument; any
> filter-like argument will need to have at least a '__' in it.

> Ambiguities in COUNT
> ~~~~~~~~~~~~~~~~~~~~

> aggregate(count=True) is already covered with count(); I'm not sure
> I'm wild about the duplication.

well, but what if you need number of articles per author? existing
.count() wouldn't cut it (in one select) and once you are doing
aggregation anyway, its simple to throw in COUNT(*)...
so if you
a) change .count() to behave like .aggregate(count=True), it would break code
b) don't allow count in aggregation, you will have more queries for no
good reason.
I see .count() as a very useful shortcut for
.aggregate(count=True)['count']

There is also the problem that it

> precludes counting other things: e.g., counting related objects -
> return the number of books that each author has written.

> Author.objects.aggregate(books_count='number_of_books_written')
> or
> Author.objects.aggregate(count=('books'))

> On top of this, you can get the gross count from:

> len(Author.objects.aggregate(...))

I see, we obviously don't understand each other:
I don't want count=True to add count of groups produced by group by, I
want it to add information on the group's size:
SELECT autor_id, COUNT(*), AVG(rating) FROM articles GROUP BY article_id;
which would return me list of authors with number of published
articles and their average rating

> Aggregate of results vs aggregates of related objects
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> If you run an aggregate over a field from the table you are querying,
> the aggregate is a table level property, not a row level property.
> Compare the expected results from:

> - Get me the date of the earliest and latest published article: Return
> a single value for the min, and a single value for the max.
> - Annotate each author with the date of their earliest and latest
> published article: Return a min and max value for each author.

> The second case is relatively easy:

> >>> Author.objects.aggregate(article__pub_date__min='earliest',
> article__pub_date__max='latest'))
> [{
>     'name':'John Doe',
>     'earliest': 2007-01-04
>     'latest': 2007-02-14
> }
> ...
> ]

why do you assume you are grouping by name, where is it specified? the
code of the queryset doesn't say anything about a name. I am confused
by this.

> However, the first case:

> >>> Article.objects.aggregate(pub_date__min='earliest', pub_date__max='latest'))
> {[
>  'title': 'first article'
>  'earliest': 2007-01-04
>  'latest': 2007-02-14
> ],
> [
>  'title': 'second article'
>  'earliest': 2007-01-04
>  'latest': 2007-02-14
> ],
> }

I don't understand this at all: you only asked for aggregates, you
didn't specify group_by so the SELECT should look something like:

SELECT MIN(pub_date) as earliest, MAX(pub_date) as latest FROM articles;
there is no way this would return more that one row

and why would it magically include 'title' in the result ? I am confused again.

> while legal, isn't the most elegant way at getting at a unique min/max
> for an entire table. This was the reason for my suggestion for two
> functions: aggregate and annotate. Aggregate returns a dictionary of
> just aggregated values:

> >>> Article.objects.aggregate(min=('pub_date'), max=('pub_date'))
> {
>    'min' : '2007-01-04',
>    'max' : '2007-02-14'
> }

> Whereas annotate() returns a list of annotated dictionary data.

but those two are the same thing, the aggregate just has an empty
group by clause, so it groups the entire table and is missing the
'group_by' field in the result (or any other fields except the
aggregates you asked for)

> ================================

> I'd better stop now - this is getting longer than I originally
> anticipated, and if it gets much longer nobody will read it :-) As
> always, feedback welcome.

> Yours,
> Russ Magee %-)

> [1] http://groups.google.com/group/django-developers/msg/2f91fa217bc465bb

--
Honza Král
E-Mail: Honza.K...@gmail.com
ICQ#:   107471613
Phone:  +420 606 678585

    Reply to author    Forward  
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.

Create a group - Google Groups - Google Home - Terms of Service - Privacy Policy
©2008 Google