this is very exciting! I've one suggestion/question though.
On Tue, 2008-04-22 at 13:24 -0700, Nicolas E. Lara G. wrote:
> with the possibility of using non-keyword arguments for un-aliased
> aggregation, in which case the 'alias' would be automatically created
> from the query made:
>  >>> Students.objects.all().aggregate('height__avg')
> {'height_avg' : 1.43}
> 
Was the name-munging here done intentionally? I think, that it's a bad
choice to leave one of the underlines out. I think that it's too easy to
shoot oneself in the foot with such magic (Think denormalized table
schemas).
Apart from that: Great stuff, can't wait for it :-)
	Matthias
First, congratulations on your GSOC selection :)
I would to know if you will be building some way to use PG custom
aggregates. I have so far one custom aggregate and I wonder how it
could work with this scheme.
Specifically, will there be some way to create new aggregate
predicates just as the builtin sum, avg, etc ?
Regards
Rajeev J Sebastian
You may wish to go back and read the earlier thread(s) about this on
this group where this has been discussed before.
One goal in the design here has been to avoid leaking lots of SQL into
the API: "having" and "group by" crop up as a natural function in the
use of aggregates, but they aren't necessarily needed in isolation. The
idea is not to rewrite SQL in Python. Rather, we want to provide certain
pieces of functionality -- in this case, aggregations support -- that is
mapped to SQL. So the SQL will use "having" and "group by" but they
don't to be exposed at the Python level.
Regards,
Malcolm
-- 
Save the whales. Collect the whole set. 
http://www.pointy-stick.com/blog/
In short: I belıeve Malcom's reply says ıt all except for the fact
that the 'havıng' clause would be expressed wıth a fılter modıfıer.
The ORM would take care of puttıng the requested fılter ın a 'havıng'
clause or ın a 'where' clause...
On the ıssue of returnıng model objects ınstead of values
(dıctıonarıes): I belıeve the problem wıth that ıs ınconsıstency,
sınce when usıng values to restrıct the elements that are grouped
together you cannot retrıeve an object. I am -0 on retrıevıng the
objects because ıt becomes complıated for the users, but wouldn't
opose much because values ıs specıfıed whenever a valuesQuerySet would
be returned.
-- 
Nicolas Lara
Linux user #380134
Public key id: 0x152e7713 at http://keyserver.noreply.org/
# Anti-Spam e-mail addresses:
python -c "print '@'.join(['nicolaslara', '.'.join([x for x in
reversed( 'com|gmail'.split('|') )])])"
python -c "print '@'.join(['nicolas', '.'.join([x for x in reversed(
've|usb|labf|ac'.split('|') )])])"
>  On the ıssue of returnıng model objects ınstead of values
>  (dıctıonarıes): I belıeve the problem wıth that ıs ınconsıstency,
>  sınce when usıng values to restrıct the elements that are grouped
>  together you cannot retrıeve an object. I am -0 on retrıevıng the
>  objects because ıt becomes complıated for the users, but wouldn't
>  opose much because values ıs specıfıed whenever a valuesQuerySet would
>  be returned.
I'm not sure why you see this as being complicated for users:
>>> authors = Author.objects.annotate(book__sum='books_published')
>>> for a in authors:
...    print a.name,'has published',a.books_published,'books.'
There is no problem making each a in authors an actual Author object;
the only difference is that each author is annotated. Sure, this means
that the author intances have attributes that aren't part of the
original definition of Author, but this is perfectly consistent with
Python - you can associate arbitrary attributes with any object
instances.
If a values() clause is involved in a query, then it makes sense that
what gets returned should be a value list rather than an object - but
that's perfectly consistent with the behaviour of values() currently.
However, in the simple case -  where you are grouping by full object
instances - I don't see why we shouldn't return object instances.
Yours,
Russ Magee %-)
it would work like:
for m in Model.objects.values( 'name', 'place' ).annotate(
event__count='number_or_events' )
  # the basic data would be available right away
  print m['number_or_events'], m['place'], m['name']
  # or some other type of access, it doesn't have to look like a dictionary
  # haven't really thought about this
  for o m.objects: # this would pull all models that have the
appropriate name and place
    print o.name, o.place, o.type
does this make sense?
the target is to be able to pull the aggregates and provide easy
access to all the objects in the group, by wrapping this into a class
of some sort, we would be able to work with the results more
effectively than when using dictionaries.
It could also maybe mimic the behavior of the model (with rest of the
fields set to None ??) itself (just playing with ideas, please
somebody tell me if this makes any sense to anybody).
The motivation behind all this is that it seems a waste to just return
the data and drop all the metadata in the process.
On Fri, Apr 25, 2008 at 7:05 AM, Russell Keith-Magee
<freakb...@gmail.com> wrote:
> On Thu, Apr 24, 2008 at 10:14 PM, Nicolas Lara <nicol...@gmail.com> wrote:
>
>  >  On the ıssue of returnıng model objects ınstead of values
>  >  (dıctıonarıes): I belıeve the problem wıth that ıs ınconsıstency,
>  >  sınce when usıng values to restrıct the elements that are grouped
>  >  together you cannot retrıeve an object. I am -0 on retrıevıng the
>  >  objects because ıt becomes complıated for the users, but wouldn't
>  >  opose much because values ıs specıfıed whenever a valuesQuerySet would
>  >  be returned.
>
>  I'm not sure why you see this as being complicated for users:
>
>  >>> authors = Author.objects.annotate(book__sum='books_published')
>  >>> for a in authors:
>  ...    print a.name,'has published',a.books_published,'books.'
>
>  There is no problem making each a in authors an actual Author object;
>  the only difference is that each author is annotated. Sure, this means
>  that the author intances have attributes that aren't part of the
>  original definition of Author, but this is perfectly consistent with
>  Python - you can associate arbitrary attributes with any object
>  instances.
plus we currently do this in .extra(select={ ... } )
I am +1 on returning actual objects - the use case would be that I
want to print the table of users with number of comments they made and
i really want to be able to access their profile, their absolute url
and any other attribute or method of the model involved
>
>  If a values() clause is involved in a query, then it makes sense that
>  what gets returned should be a value list rather than an object - but
>  that's perfectly consistent with the behaviour of values() currently.
>  However, in the simple case -  where you are grouping by full object
>  instances - I don't see why we shouldn't return object instances.
>
>  Yours,
>  Russ Magee %-)
>
>
>
>  >
>
-- 
Honza Král
E-Mail: Honza...@gmail.com
ICQ#: 107471613
Phone: +420 606 678585
Hi Simon
I can see what you are saying here, but I'm not sure I agree. The
underlying idea for this syntax is to follow the example of the
'double underscore then operator' syntax used by filters. Yes, the
'assignment' in an aggregate is in the opposite order to what it would
be in normal code - but then, filter(height__lt=3) doesn't really
match perfectly with normal code either.
If you're looking for a way to reconcile the syntax with expectation -
one suggestion is that the syntax aggregate(height__avg='foobar') can
be read as "the average height has the alias foobar". Not ideal, I
know, but it is a start.
>  Have you considered syntax that looks like this instead? :
>  >>> Students.objects.all().aggregate({'average_height': 'height__avg})
>  > {'average_height' : 1.43}
My issue with this syntax is that every aggregate requires 4 extra
punctuation characters (two parentheses, two quotes), and all you
really get in return is to reverse the order so that you get
result=operation type syntax... except that you can't use the =
operator. It also doesn't have any particular analogs with existing
ORM syntax.
Yours,
Russ Magee %-)
I would say it has a pretty good analog:
>>> Student.objects.all().extra(select={'average_height': 'avg(height)'})
Of course that query won't work, since you can't just toss an
aggregate function into a select list and expect to get anything but
an error message.  But compare it to this one, which does work:
>>> Student.objects.all().extra(select={'floor_height': 'floor(height)'})
I can see that there might be a bit of cognitive dissonance; however,
it doesn't bug me personally as much as the syntax your are proposing.
Excessive parentheses and quotes like that increase the possibilities
for mistyping mistakes, and don't really serve to improve clarity in
the underlying expression.
That said, the cognitive dissonance is a reasonable enough objection
to warrant some more exploration of the syntax options.
So - here's a slightly modified proposal. Last time this topic came up
on django-dev, Justin Fagnani made an interesting suggestion which I
think will answer your objections, and leaves open some interesting
possibilities for userspace extension. Rather than the
column__aggregate=alias syntax:
>>> Student.objects.all().annotate(Avg('height'),
tallest=Max('height'), avg_friend_age=Avg('friend__age')
The raw idea - the aggregates themselves are tied up into functions
(Avg, Max, etc) that are constructed as an object. The argument to the
aggregate is the query column, using the standard double-underscore
notation if necessary to cross tables. In QS-RF, this syntax is being
used for cross-table order_by() clauses, so there is an analog there.
If the aggregate function is provided as an anonymous argument (e.g.,
Avg('height')), then the aggregate is asked to provide an appropriate
alias - for example, Avg() might use 'avg__%s' % field_name. If the
aggregate function is provided as a kwarg, the kwarg name is used as
the alias.
Userspace extension becomes a lot easier - If a particular database
has an interesting aggregate that isn't supported as part of the core
set, you can write your own aggregate and use it as you see fit.
It also leaves open the door to do nifty things like
annotate(foo=Avg('a') * Sum('b')) - i.e., alegebraic expressions
between aggregates. Obviously, this sort of API will require a whole
lot of work to get right, but the option is there.
Does this proposal sound any better to you?
Yours,
Russ Magee %-)
Its mostly there as a convenience for the really simple cases. There
some duplication in typing annotate(avg_height=Avg('height'), so the
shortcut is a convenience. IMHO, this is the sort of thing could be
easily explained in the documentation for each aggregate function.
However, it's hardly a mandatory feature - if public opinion is
against this idea, I won't lose any sleep about it.
Yours,
Russ Magee %-)
Regarding this, I believe an option would be to use the reverse syntax:
aggregate(alias='field__function')
because it mimics what we normally do when assigning in code.
or simply  aggregate('field__function') and have the standard alias be
the same as the requested aggregate (it would be 'field__function')
I still prefer this approach over the lookup objects. But this is
probably because i've been thinking of aggregates with the filter-like
syntax for a while and I became fond of it. Anyway I do think the
lookup object syntax is a very good approach, so here are some
considerations:
Extending seems easy, but is it? (in practice). The way I see to
extend this is to create custom classes (like Max, Avg, etc). this
classes would have a __init__ function that receives 2 parameters:
self, of course, and  the field name. The newly defined class should,
then, call something like get_column_name('field') to be able to
translate itself into SQL. Pleas don't furiously reply there might be
other ways to do it, I know there are, this is just one of the ways I
could come up from the top of my head to ilustrate what I mean... A
proposition on this follows...
Though the objective syntax mimics Q objects, when using Q objects you
have one class, Q, that is used for all the queries. Having multiple
classes seems confusing. I would propose to have a single class (A?)
to do the queries. So you could do something like:
aggregate(height=A('max', 'height'), av_friend_age=A('avg', 'friend__age'))
This way A could receive the name of the function (or the actuall
function object) and the field to which it will be applied. A good
thing that could come from this is that extending (adding a new
function, or set of functions) would be really easy because, since the
heavy lifting would be done by A which would already be part of the
framework, the new function only needs to define how it transkates to
SQL given the column to which would be applied. Having the first
parameter be class/classname instead of function/function name is also
an option. This would allow us to add members to the class so that A
can act differentely. An example of this would be using a class that
has the 'alias_prefix' attribute to define what comes before the field
name in the default alias (A alias function could also be available to
receive the fieldname and return the alias).
class Custom:
         alias_prefix = 'my_aggregate'
         def as_sql(self, column):
                  return 'CUSTOM(%s)' % column
class WierdMax(Max):
         def as_sql(self, column):
                   return super(Max, self).as_sql(column) - 10
so
aggregate(A('custom', 'height'), wierd=A('wierdmax', 'friend__age'))
would result in something like:
{'my_aggregate__hieght': 324, 'wierdmax__friend_age': 17}
I don't like the automatic lowercasing... Its very too much magic, but
I found the camel case quiet ugly...
Any thoughs?
Having multiple classes seems confusing.
I would propose to have a single class (A?) to do the queries. So you could do something like:
aggregate(height=A('max', 'height'), av_friend_age=A('avg', 'friend__age'))
I didn't understand your point here. When is the data incomplete?
I dont see the problem in averaging an integerField (probably in
averaging a CharField, but IMHO this kind of invalid values should
make the whole query fail).
The problem with averaging an IntegerField is that the result is unlikely to be an integer, yet parts of the model might assume that it is. At the very least AggregateModels in this situation are unable to be validated, saved or deleted, and it's possible that other methods don't work, so either they shouldn't be models at all, or we need to disable lots of things and issue big, big warnings.
I aggre. An advantage (which I think might also be a double edgesword) of the string function names is the possibility to fallback to
a standard convertion to SQL when no function is defined.
With this we
wouldn't need to wrap every aggregate method in a python class that
translates roughly the same way and we could take advantage of the
direct usage of the DB defined aggregate functions.
Also with the A
syntax you would do only one import instead of one for each function.
> In addition, some cases have special requirements and dealing with them in aI can't think of a requirement that can't be dealt with in a method
> class is easy.
>
also... got an example?
In this case: should we actually return the actual objects? If a
values modifier is involved the expected behavior is to receive a list
of dicts. IMO, we should not allow aggregates to override this
behavior. Also, having some value for the fields wouldn't make sense
since we cannot ensure there is a unique value for a given field for
that group (unless it is specified in values, in which case it would
automatically appear).
I like the idea of having a QuerySet for retrieving the grouped
objects. This can be done simply by having something like: {'objects':
<QuerySet object>}
>
> > I dont see the problem in averaging an integerField (probably in
> > averaging a CharField, but IMHO this kind of invalid values should
> > make the whole query fail).
> >
> >
>
>
>
> The problem with averaging an IntegerField is that the result is unlikely to
> be an integer, yet parts of the model might assume that it is. At the very
> least AggregateModels in this situation are unable to be validated, saved or
> deleted, and it's possible that other methods don't work, so either they
> shouldn't be models at all, or we need to disable lots of things and issue
> big, big warnings.
>
or they can be a list of Models only when no 'values' is specified.
> >
> > I aggre. An advantage (which I think might also be a double edge
> > sword) of the string function names is the possibility to fallback to
> > a standard convertion to SQL when no function is defined.
>
> I'm not sure how wise this is. Considering the variation of available
> functions, and their different behaviors,   just passing the function name
> straight into SQL could easily encourage non-portable queries. I know some
> users will want DB specific functions, but like extras, it should probably
> be clearly advanced functionality.
>
I am not sure either. I see the problems it might bring.
> Ideally expressions.py should be import * friendly
>
> >
> > > In addition, some cases have special requirements and dealing with them
> in a
> > > class is easy.
> > >
> >
> > I can't think of a requirement that can't be dealt with in a method
> > also... got an example?
> >
>
>
> Count() is an easy example. First, it takes an extra kwarg, 'distinct', and
> changes the output of as_sql() if set. Then it also has to do a special case
> in relabel_aliases() and whatever method passes column names to
> query.setup_joins() for COUNT(*)
>
This could all be handled in a count method. Still I see that using
classes poses and advantage.
>
> Cheers,
>  Justin
>
>
>
>  >
>
Regards,
-- 
Nicolas Lara
The idea is to implement exactely that logic into filter.
>  In fact, I'm still not sure (without explicit SQL methodology support,
>  which I'm still confused as to why no one wants to implement) you will
>  quite achieve this. I'm also not sure how count fits into field names.
>  Does any database layer actually differ on count(field_name) vs
>  count(1) vs count(*) when you are doing aggregation?
They do. count(*) and count(1) will count every element,
count(colum_name) will count those that are not null for the given
column name.
-- 
btw, that means you showed only 2 of 3 possible sqls... am i right?
-- 
Best regards, Yuri V. Baburov, ICQ# 99934676, Skype: yuri.baburov,
MSN: bu...@live.com
I agree
> I wonder if aggregate functions on the quesyset's model should even be
> allowed in annotate(). I can't think of a case that makes sense. That
> restriction would certainly get rid of some ambiguity.
This is a solution. but to some extent hurts those users that
"actually wanted to do that query".
The cases where that query would be useful are very few I believe. And
they can always be expressed some other way. An example of such a case
would be:
Again, we have clients and purchases, let's say that every client can
get certain number of (costumer satisfaction) points for their
purchases. And the amount of points-per-purchase can vary depending on
the client (women get more points than men, children more than women,
long time custumers get more and so on). You might want to do
something like:
Buyer.objects.all().annotate( points=Sum('points_per_purchase'),
amount=Sum('purchases__quantity')).filter(points__lt=F('amount'))
This is just an example of such a case. I am not saying in anyway that
this is how it should be done (there are many ways to overcome this)
> In the same vane, maybe implicitly grouping by all fields when no values()
> arguments are present is a bad idea too.
What would you expect then to happen when no grouping is specified?
-- 
I think the 2 options now to tackle this problem are:
   (1) annotate(A1, A2) would behave as SQL (when A1 is on the local
model and A2 is on a join)
         and annotate(A1).annotate(A2) would result in doing the 2
queries and returning something as the second example of my previous
post
   (2) Simply restrict this case. If somebody needs such a case they
can just do 2 querysets or fall back to sql.
I am more inclined towards (2) since (1) seems hard to grasp for the
users, error prone and would add innecesary complexity to the orm's
code that would need to be maintained in the future. Also the use case
is unusual enough so it can fall in the 20% roll your own sql side.
-- 
[The entire original message is not included]
I disagree that (1) is unexpected - I see it as entirely consistent,
just not meaningful. My reading of the annotate() clause is that you
are returning object instances, annotated with aggregate values
derived from the objects related to the instances. In this case, you
are asking for the average over a group that by definition only
contains 1 item. The sum case is a little unusual, but I don't see it
as irreconcilable with the core design.
Although avg, sum etc, are entirely legal in an annotate clause, it
doesn't make much sense to use them in the way you have here (i.e.,
it's legal, but doesn't really represent anything meaningful). They
_do_ make sense if you are using values() to produce some internal
grouping in the returned object set, or if you are doing averages over
a related table.
I see (2) as what .aggregate() should be returnning - aggregate values
over the entire result set.
>  With (2) we would have to make 2 queries while the user expects only
>  one two happen. Also for users that are used to sql, for very wicked
>  reasons, some user might be interested in executing a query that
>  actually sums for every relation.
Yes - but I think it _should_ be two queries at the user level. One
query is to get a list of objects. The second is to find the maximum
value over all those objects. This is two separate questions,
operating in two seperate domains. I see no reason that they shouldn't
be two separate queries.
Yours
Russ Magee %-)