Hello,
I have been selected to implement Aggregate support in the Django ORM
as a project for Google Summer of Code 2008 =) My mentor for this
project is Russell Keith-Magee.
I am very happy with both being selected (of course :p) and, in
general, with all the projects that were selected for Django on this
summer of code. I believe they are all very good additions to the
framework.
Though it has been discussed in this mailing list before I will post a
more concrete description of my project:
General Considerations
----------------------------------
When doing aggregation using the ORM we should continue to be
abstracted from SQL and think about easily expressing the solutions to
the problems we need to solve. The idea is not to provide a wrapper
around any SQL functionalities but to solve data aggregation problems
in a higher abstraction level (with the ORM usign the underlaying SQL
functionalities).
The proposed syntax includes two QuerySet modifiers. The first one
(annotate) would be used to extend the QuerySet with aggregated
information, this is, adding each object a piece of aggregated
information. The second one (aggregate) would be used to reduce the
aggregation over a QuerySet to a set of values for the whole model.
An example of a problem solved with aggregate would be:
"Find out the average height of the kids in a school"
An example of a problem solved with annotate would be:
"Find the average hight of the kids in every classroom"
(this are just examples, more complex things follow)
Proposed Syntax
-----------------------
The syntax for specifying the aggregation will be filter-like syntax
with the possibility of using non-keyword arguments for un-aliased
aggregation, in which case the 'alias' would be automatically created
from the query made:
>>> Students.objects.all().aggregate('height__avg')
{'height_avg' : 1.43}
For specifying the aggregates over a related model the syntax will be
the same as in filter:
Grouping
--------------
Grouping should be done internally. There should be no syntax for
directely asking the database to use GROUP BY. How to group should be
derived from the object representation of what needs to be retrieved.
When necessary to specify the fields to be included in the agregation
(i. e. grouped) it could be done with values().
If we want to specify:
"The minimum number of items sold for the same bulk price"
we could do:
Filtering
------------
When querying for aggregates wwe might want to filter the results
according to information retrieved by the aggregation itself. We
should be able to simply use the filter modifier refering to the alias
of the aggregated field:
Extra
-------
That defines the basic aggregation functionality. There are certain
extra things that can be added but, as I see it, shouldn't be
addressed until the basic functionality is up and running.
One of those things is custom aggregate functions. A first idea to
approach them would be something like this:
>>> my_own_aggregate = Aggregate( sum('field_1') * 'field2',
'default_name')
>>>
queryset.aggregate(purchases__price__custom=my_own_aggregate('alias'))
Still, as I said, there is much to be done before addressing the
custom aggergate problem.
Another improvement that could be added after the basic functionality
is done is the, previously discussed, F syntax for refering to other
fields in the model. A basic idea of this would be something like:
Again, this is to be addressed after the aggregate's basic
functionality is working.
Building Use Cases
----------------------------
As for now I believe this covers most of the use cases of database
aggregation. If anybody can think of a use case which is not covered
by this syntax please post it (preferably with a plain-english
description of what the expected behaviour is). Any suggestion is more
than welcome!
I will keep you all posted in any change or advance on this project.
this is very exciting! I've one suggestion/question though.
On Tue, 2008-04-22 at 13:24 -0700, Nicolas E. Lara G. wrote:
> with the possibility of using non-keyword arguments for un-aliased > aggregation, in which case the 'alias' would be automatically created > from the query made: > >>> Students.objects.all().aggregate('height__avg') > {'height_avg' : 1.43}
Was the name-munging here done intentionally? I think, that it's a bad choice to leave one of the underlines out. I think that it's too easy to shoot oneself in the foot with such magic (Think denormalized table schemas).
Apart from that: Great stuff, can't wait for it :-)
I would to know if you will be building some way to use PG custom aggregates. I have so far one custom aggregate and I wonder how it could work with this scheme.
Specifically, will there be some way to create new aggregate predicates just as the builtin sum, avg, etc ?
Regards Rajeev J Sebastian
On Wed, Apr 23, 2008 at 1:54 AM, Nicolas E. Lara G.
> Hello, > I have been selected to implement Aggregate support in the Django ORM > as a project for Google Summer of Code 2008 =) My mentor for this > project is Russell Keith-Magee. > I am very happy with both being selected (of course :p) and, in > general, with all the projects that were selected for Django on this > summer of code. I believe they are all very good additions to the > framework.
> Though it has been discussed in this mailing list before I will post a > more concrete description of my project:
> General Considerations > ---------------------------------- > When doing aggregation using the ORM we should continue to be > abstracted from SQL and think about easily expressing the solutions to > the problems we need to solve. The idea is not to provide a wrapper > around any SQL functionalities but to solve data aggregation problems > in a higher abstraction level (with the ORM usign the underlaying SQL > functionalities).
> The proposed syntax includes two QuerySet modifiers. The first one > (annotate) would be used to extend the QuerySet with aggregated > information, this is, adding each object a piece of aggregated > information. The second one (aggregate) would be used to reduce the > aggregation over a QuerySet to a set of values for the whole model.
> An example of a problem solved with aggregate would be: > "Find out the average height of the kids in a school" > An example of a problem solved with annotate would be: > "Find the average hight of the kids in every classroom"
> (this are just examples, more complex things follow)
> Proposed Syntax > ----------------------- > The syntax for specifying the aggregation will be filter-like syntax
> So a QuerySet might be: > >>> Students.objects.all().aggregate(height__avg='agerage_hight') > {'average_height' : 1.43}
> with the possibility of using non-keyword arguments for un-aliased > aggregation, in which case the 'alias' would be automatically created > from the query made: > >>> Students.objects.all().aggregate('height__avg') > {'height_avg' : 1.43}
> For specifying the aggregates over a related model the syntax will be > the same as in filter: > >>> Buyer.objects.all().aggregate(purchases__quantity__sum='total_sales') > {'total_sales': 259L}
> The same would stand for annotate... but it would yield a different > result: > >>>Buyer.objects.all().annotate(purchases__quantity__sum='total_sales') > [{'age': 35, 'total_sales': 93L, 'id': 2, 'name': u'John'}, {'age': > 21, 'total_sales': 166L, 'id': 1, 'name': u'Peter'}]
> Grouping > -------------- > Grouping should be done internally. There should be no syntax for > directely asking the database to use GROUP BY. How to group should be > derived from the object representation of what needs to be retrieved.
> When necessary to specify the fields to be included in the agregation > (i. e. grouped) it could be done with values(). > If we want to specify: > "The minimum number of items sold for the same bulk price" > we could do: > >>> Purchase.objects.all().values('payment').annotate(quantity__min='num_adquis itions') > [{'num_adquisitions': 56, 'payment': 356841.0}, > {'num_adquisitions': 3, 'payment': 13521.0}, > {'num_adquisitions': 85, 'payment': 6185463.0}]
> which results in grouping by price.
> Filtering > ------------ > When querying for aggregates wwe might want to filter the results > according to information retrieved by the aggregation itself. We > should be able to simply use the filter modifier refering to the alias > of the aggregated field: > >>> Purchase.objects.all().values('payment').annotate(quantity__min='num_adquis itions').filter(num_adquisitions__lt=5) > [ {'num_adquisitions': 3, 'payment': 13521.0}]
> This would result in a 'having' sql clause.
> Extra > ------- > That defines the basic aggregation functionality. There are certain > extra things that can be added but, as I see it, shouldn't be > addressed until the basic functionality is up and running. > One of those things is custom aggregate functions. A first idea to > approach them would be something like this: > >>> my_own_aggregate = Aggregate( sum('field_1') * 'field2', > 'default_name')
> queryset.aggregate(purchases__price__custom=my_own_aggregate('alias')) > Still, as I said, there is much to be done before addressing the > custom aggergate problem.
> Another improvement that could be added after the basic functionality > is done is the, previously discussed, F syntax for refering to other > fields in the model. A basic idea of this would be something like: > >>> Purchase.objects.all().values('payment').annotate(quantity__min='num_adquis itions', quantity__avg='average_items').filter(num_adquisitions__lt=F('average_items '))
> Again, this is to be addressed after the aggregate's basic > functionality is working.
> Building Use Cases > ---------------------------- > As for now I believe this covers most of the use cases of database > aggregation. If anybody can think of a use case which is not covered > by this syntax please post it (preferably with a plain-english > description of what the expected behaviour is). Any suggestion is more > than welcome!
> I will keep you all posted in any change or advance on this project.
> First, congratulations on your GSOC selection :)
> I would to know if you will be building some way to use PG custom
> aggregates. I have so far one custom aggregate and I wonder how it
> could work with this scheme.
> Specifically, will there be some way to create new aggregate
> predicates just as the builtin sum, avg, etc ?
> Regards
> Rajeev J Sebastian
> On Wed, Apr 23, 2008 at 1:54 AM, Nicolas E. Lara G.
> > Hello,
> > I have been selected to implement Aggregate support in the Django ORM
> > as a project for Google Summer of Code 2008 =) My mentor for this
> > project is Russell Keith-Magee.
> > I am very happy with both being selected (of course :p) and, in
> > general, with all the projects that were selected for Django on this
> > summer of code. I believe they are all very good additions to the
> > framework.
> > Though it has been discussed in this mailing list before I will post a
> > more concrete description of my project:
> > General Considerations
> > ----------------------------------
> > When doing aggregation using the ORM we should continue to be
> > abstracted from SQL and think about easily expressing the solutions to
> > the problems we need to solve. The idea is not to provide a wrapper
> > around any SQL functionalities but to solve data aggregation problems
> > in a higher abstraction level (with the ORM usign the underlaying SQL
> > functionalities).
> > The proposed syntax includes two QuerySet modifiers. The first one
> > (annotate) would be used to extend the QuerySet with aggregated
> > information, this is, adding each object a piece of aggregated
> > information. The second one (aggregate) would be used to reduce the
> > aggregation over a QuerySet to a set of values for the whole model.
> > An example of a problem solved with aggregate would be:
> > "Find out the average height of the kids in a school"
> > An example of a problem solved with annotate would be:
> > "Find the average hight of the kids in every classroom"
> > (this are just examples, more complex things follow)
> > Proposed Syntax
> > -----------------------
> > The syntax for specifying the aggregation will be filter-like syntax
> > So a QuerySet might be:
> > >>> Students.objects.all().aggregate(height__avg='agerage_hight')
> > {'average_height' : 1.43}
> > with the possibility of using non-keyword arguments for un-aliased
> > aggregation, in which case the 'alias' would be automatically created
> > from the query made:
> > >>> Students.objects.all().aggregate('height__avg')
> > {'height_avg' : 1.43}
> > For specifying the aggregates over a related model the syntax will be
> > the same as in filter:
> > >>> Buyer.objects.all().aggregate(purchases__quantity__sum='total_sales')
> > {'total_sales': 259L}
> > The same would stand for annotate... but it would yield a different
> > result:
> > >>>Buyer.objects.all().annotate(purchases__quantity__sum='total_sales')
> > [{'age': 35, 'total_sales': 93L, 'id': 2, 'name': u'John'}, {'age':
> > 21, 'total_sales': 166L, 'id': 1, 'name': u'Peter'}]
> > Grouping
> > --------------
> > Grouping should be done internally. There should be no syntax for
> > directely asking the database to use GROUP BY. How to group should be
> > derived from the object representation of what needs to be retrieved.
> > When necessary to specify the fields to be included in the agregation
> > (i. e. grouped) it could be done with values().
> > If we want to specify:
> > "The minimum number of items sold for the same bulk price"
> > we could do:
> > >>> Purchase.objects.all().values('payment').annotate(quantity__min='num_adquis itions')
> > [{'num_adquisitions': 56, 'payment': 356841.0},
> > {'num_adquisitions': 3, 'payment': 13521.0},
> > {'num_adquisitions': 85, 'payment': 6185463.0}]
> > which results in grouping by price.
> > Filtering
> > ------------
> > When querying for aggregates wwe might want to filter the results
> > according to information retrieved by the aggregation itself. We
> > should be able to simply use the filter modifier refering to the alias
> > of the aggregated field:
> > >>> Purchase.objects.all().values('payment').annotate(quantity__min='num_adquis itions').filter(num_adquisitions__lt=5)
> > [ {'num_adquisitions': 3, 'payment': 13521.0}]
> > This would result in a 'having' sql clause.
> > Extra
> > -------
> > That defines the basic aggregation functionality. There are certain
> > extra things that can be added but, as I see it, shouldn't be
> > addressed until the basic functionality is up and running.
> > One of those things is custom aggregate functions. A first idea to
> > approach them would be something like this:
> > >>> my_own_aggregate = Aggregate( sum('field_1') * 'field2',
> > 'default_name')
> > queryset.aggregate(purchases__price__custom=my_own_aggregate('alias'))
> > Still, as I said, there is much to be done before addressing the
> > custom aggergate problem.
> > Another improvement that could be added after the basic functionality
> > is done is the, previously discussed, F syntax for refering to other
> > fields in the model. A basic idea of this would be something like:
> > >>> Purchase.objects.all().values('payment').annotate(quantity__min='num_adquis itions', quantity__avg='average_items').filter(num_adquisitions__lt=F('average_items '))
> > Again, this is to be addressed after the aggregate's basic
> > functionality is working.
> > Building Use Cases
> > ----------------------------
> > As for now I believe this covers most of the use cases of database
> > aggregation. If anybody can think of a use case which is not covered
> > by this syntax please post it (preferably with a plain-english
> > description of what the expected behaviour is). Any suggestion is more
> > than welcome!
> > I will keep you all posted in any change or advance on this project.
> this is very exciting! I've one suggestion/question though.
> On Tue, 2008-04-22 at 13:24 -0700, Nicolas E. Lara G. wrote:
> > with the possibility of using non-keyword arguments for un-aliased
> > aggregation, in which case the 'alias' would be automatically created
> > from the query made:
> > >>> Students.objects.all().aggregate('height__avg')
> > {'height_avg' : 1.43}
> Was the name-munging here done intentionally? I think, that it's a bad
> choice to leave one of the underlines out. I think that it's too easy to
> shoot oneself in the foot with such magic (Think denormalized table
> schemas).
non intentional at all.. justa typo... sorry for that..
> Apart from that: Great stuff, can't wait for it :-)
On Apr 22, 4:37 pm, "Rajeev J Sebastian" <rajeev.sebast...@gmail.com>
wrote:
> Hi Nicolas,
> First, congratulations on your GSOC selection :)
Thanks =)
> I would to know if you will be building some way to use PG custom
> aggregates. I have so far one custom aggregate and I wonder how it
> could work with this scheme.
To be honest I haven't thought about it. As I said, I'd like to get
the basic aggregation (semi)stable before digging into other stuff
simply because if the basic api changes then everything built on top
of it has to change also.
I do think it is a good idea as long as we don't have to provide PG-
only functionality. With what i've done so far everything is done with
the names provided by the DBMS so if you define a custom aggregate in
postgres, lets say, AGG; you should be able to use it as
some_field__agg='alias'...
> Specifically, will there be some way to create new aggregate
> predicates just as the builtin sum, avg, etc ?
When the I get the basic functionality on wheels I'll check of a way
to do this in a cross-database manner.
> > Hello,
> > I have been selected to implement Aggregate support in the Django ORM
> > as a project for Google Summer of Code 2008 =) My mentor for this
> > project is Russell Keith-Magee.
> > I am very happy with both being selected (of course :p) and, in
> > general, with all the projects that were selected for Django on this
> > summer of code. I believe they are all very good additions to the
> > framework.
> > Though it has been discussed in this mailing list before I will post a
> > more concrete description of my project:
> > General Considerations
> > ----------------------------------
> > When doing aggregation using the ORM we should continue to be
> > abstracted from SQL and think about easily expressing the solutions to
> > the problems we need to solve. The idea is not to provide a wrapper
> > around any SQL functionalities but to solve data aggregation problems
> > in a higher abstraction level (with the ORM usign the underlaying SQL
> > functionalities).
> > The proposed syntax includes two QuerySet modifiers. The first one
> > (annotate) would be used to extend the QuerySet with aggregated
> > information, this is, adding each object a piece of aggregated
> > information. The second one (aggregate) would be used to reduce the
> > aggregation over a QuerySet to a set of values for the whole model.
> > An example of a problem solved with aggregate would be:
> > "Find out the average height of the kids in a school"
> > An example of a problem solved with annotate would be:
> > "Find the average hight of the kids in every classroom"
> > (this are just examples, more complex things follow)
> > Proposed Syntax
> > -----------------------
> > The syntax for specifying the aggregation will be filter-like syntax
> > So a QuerySet might be:
> > >>> Students.objects.all().aggregate(height__avg='agerage_hight')
> > {'average_height' : 1.43}
> > with the possibility of using non-keyword arguments for un-aliased
> > aggregation, in which case the 'alias' would be automatically created
> > from the query made:
> > >>> Students.objects.all().aggregate('height__avg')
> > {'height_avg' : 1.43}
> > For specifying the aggregates over a related model the syntax will be
> > the same as in filter:
> > >>> Buyer.objects.all().aggregate(purchases__quantity__sum='total_sales')
> > {'total_sales': 259L}
> > The same would stand for annotate... but it would yield a different
> > result:
> > >>>Buyer.objects.all().annotate(purchases__quantity__sum='total_sales')
> > [{'age': 35, 'total_sales': 93L, 'id': 2, 'name': u'John'}, {'age':
> > 21, 'total_sales': 166L, 'id': 1, 'name': u'Peter'}]
> > Grouping
> > --------------
> > Grouping should be done internally. There should be no syntax for
> > directely asking the database to use GROUP BY. How to group should be
> > derived from the object representation of what needs to be retrieved.
> > When necessary to specify the fields to be included in the agregation
> > (i. e. grouped) it could be done with values().
> > If we want to specify:
> > "The minimum number of items sold for the same bulk price"
> > we could do:
> > >>> Purchase.objects.all().values('payment').annotate(quantity__min='num_adquis itions')
> > [{'num_adquisitions': 56, 'payment': 356841.0},
> > {'num_adquisitions': 3, 'payment': 13521.0},
> > {'num_adquisitions': 85, 'payment': 6185463.0}]
> > which results in grouping by price.
> > Filtering
> > ------------
> > When querying for aggregates wwe might want to filter the results
> > according to information retrieved by the aggregation itself. We
> > should be able to simply use the filter modifier refering to the alias
> > of the aggregated field:
> > >>> Purchase.objects.all().values('payment').annotate(quantity__min='num_adquis itions').filter(num_adquisitions__lt=5)
> > [ {'num_adquisitions': 3, 'payment': 13521.0}]
> > This would result in a 'having' sql clause.
> > Extra
> > -------
> > That defines the basic aggregation functionality. There are certain
> > extra things that can be added but, as I see it, shouldn't be
> > addressed until the basic functionality is up and running.
> > One of those things is custom aggregate functions. A first idea to
> > approach them would be something like this:
> > >>> my_own_aggregate = Aggregate( sum('field_1') * 'field2',
> > 'default_name')
> > queryset.aggregate(purchases__price__custom=my_own_aggregate('alias'))
> > Still, as I said, there is much to be done before addressing the
> > custom aggergate problem.
> > Another improvement that could be added after the basic functionality
> > is done is the, previously discussed, F syntax for refering to other
> > fields in the model. A basic idea of this would be something like:
> > >>> Purchase.objects.all().values('payment').annotate(quantity__min='num_adquis itions', quantity__avg='average_items').filter(num_adquisitions__lt=F('average_items '))
> > Again, this is to be addressed after the aggregate's basic
> > functionality is working.
> > Building Use Cases
> > ----------------------------
> > As for now I believe this covers most of the use cases of database
> > aggregation. If anybody can think of a use case which is not covered
> > by this syntax please post it (preferably with a plain-english
> > description of what the expected behaviour is). Any suggestion is more
> > than welcome!
> > I will keep you all posted in any change or advance on this project.
On Wed, 2008-04-23 at 07:50 -0700, Eratothene wrote: > It looks that API lacks support for SQL HAVING clause, this one should > be distinct from filter, GROUP BY without HAVING is often useless.
> Consider adding it to api (something like > Model.objects.aggregate(height__sum).having(sex='M'))
> Also do not forget to add HAVING clause to .extra() method.
You may wish to go back and read the earlier thread(s) about this on this group where this has been discussed before.
One goal in the design here has been to avoid leaking lots of SQL into the API: "having" and "group by" crop up as a natural function in the use of aggregates, but they aren't necessarily needed in isolation. The idea is not to rewrite SQL in Python. Rather, we want to provide certain pieces of functionality -- in this case, aggregations support -- that is mapped to SQL. So the SQL will use "having" and "group by" but they don't to be exposed at the Python level.
Sorry ıf thıs goes on the wrong thread. I am wrıttıng at the moment from turkey, where google groups ıs blocked.
In short: I belıeve Malcom's reply says ıt all except for the fact that the 'havıng' clause would be expressed wıth a fılter modıfıer. The ORM would take care of puttıng the requested fılter ın a 'havıng' clause or ın a 'where' clause...
On the ıssue of returnıng model objects ınstead of values (dıctıonarıes): I belıeve the problem wıth that ıs ınconsıstency, sınce when usıng values to restrıct the elements that are grouped together you cannot retrıeve an object. I am -0 on retrıevıng the objects because ıt becomes complıated for the users, but wouldn't opose much because values ıs specıfıed whenever a valuesQuerySet would be returned.
On Wed, Apr 23, 2008 at 5:54 PM, Malcolm Tredinnick
> On Wed, 2008-04-23 at 07:50 -0700, Eratothene wrote: > > It looks that API lacks support for SQL HAVING clause, this one should > > be distinct from filter, GROUP BY without HAVING is often useless.
> > Consider adding it to api (something like > > Model.objects.aggregate(height__sum).having(sex='M'))
> > Also do not forget to add HAVING clause to .extra() method.
> You may wish to go back and read the earlier thread(s) about this on > this group where this has been discussed before.
> One goal in the design here has been to avoid leaking lots of SQL into > the API: "having" and "group by" crop up as a natural function in the > use of aggregates, but they aren't necessarily needed in isolation. The > idea is not to rewrite SQL in Python. Rather, we want to provide certain > pieces of functionality -- in this case, aggregations support -- that is > mapped to SQL. So the SQL will use "having" and "group by" but they > don't to be exposed at the Python level.
-- Nicolas Lara Linux user #380134 Public key id: 0x152e7713 at http://keyserver.noreply.org/ # Anti-Spam e-mail addresses: python -c "print '@'.join(['nicolaslara', '.'.join([x for x in reversed( 'com|gmail'.split('|') )])])" python -c "print '@'.join(['nicolas', '.'.join([x for x in reversed( 've|usb|labf|ac'.split('|') )])])"
On Thu, Apr 24, 2008 at 10:14 PM, Nicolas Lara <nicolasl...@gmail.com> wrote: > On the ıssue of returnıng model objects ınstead of values > (dıctıonarıes): I belıeve the problem wıth that ıs ınconsıstency, > sınce when usıng values to restrıct the elements that are grouped > together you cannot retrıeve an object. I am -0 on retrıevıng the > objects because ıt becomes complıated for the users, but wouldn't > opose much because values ıs specıfıed whenever a valuesQuerySet would > be returned.
I'm not sure why you see this as being complicated for users:
>>> authors = Author.objects.annotate(book__sum='books_published') >>> for a in authors:
There is no problem making each a in authors an actual Author object; the only difference is that each author is annotated. Sure, this means that the author intances have attributes that aren't part of the original definition of Author, but this is perfectly consistent with Python - you can associate arbitrary attributes with any object instances.
If a values() clause is involved in a query, then it makes sense that what gets returned should be a value list rather than an object - but that's perfectly consistent with the behaviour of values() currently. However, in the simple case - where you are grouping by full object instances - I don't see why we shouldn't return object instances.
I would love to see something like AggregateModel returned from .values().annotate()
it would work like:
for m in Model.objects.values( 'name', 'place' ).annotate( event__count='number_or_events' )
# the basic data would be available right away print m['number_or_events'], m['place'], m['name'] # or some other type of access, it doesn't have to look like a dictionary # haven't really thought about this
for o m.objects: # this would pull all models that have the appropriate name and place print o.name, o.place, o.type
does this make sense?
the target is to be able to pull the aggregates and provide easy access to all the objects in the group, by wrapping this into a class of some sort, we would be able to work with the results more effectively than when using dictionaries.
It could also maybe mimic the behavior of the model (with rest of the fields set to None ??) itself (just playing with ideas, please somebody tell me if this makes any sense to anybody).
The motivation behind all this is that it seems a waste to just return the data and drop all the metadata in the process.
On Fri, Apr 25, 2008 at 7:05 AM, Russell Keith-Magee
<freakboy3...@gmail.com> wrote: > On Thu, Apr 24, 2008 at 10:14 PM, Nicolas Lara <nicolasl...@gmail.com> wrote:
> > On the ıssue of returnıng model objects ınstead of values > > (dıctıonarıes): I belıeve the problem wıth that ıs ınconsıstency, > > sınce when usıng values to restrıct the elements that are grouped > > together you cannot retrıeve an object. I am -0 on retrıevıng the > > objects because ıt becomes complıated for the users, but wouldn't > > opose much because values ıs specıfıed whenever a valuesQuerySet would > > be returned.
> I'm not sure why you see this as being complicated for users:
> >>> authors = Author.objects.annotate(book__sum='books_published') > >>> for a in authors: > ... print a.name,'has published',a.books_published,'books.'
> There is no problem making each a in authors an actual Author object; > the only difference is that each author is annotated. Sure, this means > that the author intances have attributes that aren't part of the > original definition of Author, but this is perfectly consistent with > Python - you can associate arbitrary attributes with any object > instances.
plus we currently do this in .extra(select={ ... } )
I am +1 on returning actual objects - the use case would be that I want to print the table of users with number of comments they made and i really want to be able to access their profile, their absolute url and any other attribute or method of the model involved
> If a values() clause is involved in a query, then it makes sense that > what gets returned should be a value list rather than an object - but > that's perfectly consistent with the behaviour of values() currently. > However, in the simple case - where you are grouping by full object > instances - I don't see why we shouldn't return object instances.
On Apr 22, 9:24 pm, "Nicolas E. Lara G." <nicolasl...@gmail.com>
wrote:
> So a QuerySet might be:
> >>> Students.objects.all().aggregate(height__avg='average_height')
> {'average_height' : 1.43}
This looks really great, but one thing that initially confused me was
the ordering of the arguments - it seems strange to ask for
(height_avg = 'average_height') and then get back {'average_height':
1.43} - the name in the output is in a different place from the name
in the input.
Have you considered syntax that looks like this instead? :
That way the API feels like you are passing in an example of the data
structure you want, and Django is using your example and filling in
the gaps for you.
On Fri, Apr 25, 2008 at 9:35 PM, Simon Willison <si...@simonwillison.net> wrote:
> On Apr 22, 9:24 pm, "Nicolas E. Lara G." <nicolasl...@gmail.com> > wrote:
> > So a QuerySet might be: > > >>> Students.objects.all().aggregate(height__avg='average_height') > > {'average_height' : 1.43}
> This looks really great, but one thing that initially confused me was > the ordering of the arguments - it seems strange to ask for > (height_avg = 'average_height') and then get back {'average_height': > 1.43} - the name in the output is in a different place from the name > in the input.
Hi Simon
I can see what you are saying here, but I'm not sure I agree. The underlying idea for this syntax is to follow the example of the 'double underscore then operator' syntax used by filters. Yes, the 'assignment' in an aggregate is in the opposite order to what it would be in normal code - but then, filter(height__lt=3) doesn't really match perfectly with normal code either.
If you're looking for a way to reconcile the syntax with expectation - one suggestion is that the syntax aggregate(height__avg='foobar') can be read as "the average height has the alias foobar". Not ideal, I know, but it is a start.
> Have you considered syntax that looks like this instead? : > >>> Students.objects.all().aggregate({'average_height': 'height__avg}) > > {'average_height' : 1.43}
My issue with this syntax is that every aggregate requires 4 extra punctuation characters (two parentheses, two quotes), and all you really get in return is to reverse the order so that you get result=operation type syntax... except that you can't use the = operator. It also doesn't have any particular analogs with existing ORM syntax.
On Fri, Apr 25, 2008 at 11:40 PM, Russell Keith-Magee
<freakboy3...@gmail.com> wrote: > > Have you considered syntax that looks like this instead? : > > >>> Students.objects.all().aggregate({'average_height': 'height__avg}) > > > {'average_height' : 1.43}
> My issue with this syntax is that every aggregate requires 4 extra > punctuation characters (two parentheses, two quotes), and all you > really get in return is to reverse the order so that you get > result=operation type syntax... except that you can't use the = > operator. It also doesn't have any particular analogs with existing > ORM syntax.
Of course that query won't work, since you can't just toss an aggregate function into a select list and expect to get anything but an error message. But compare it to this one, which does work:
> On Fri, Apr 25, 2008 at 11:40 PM, Russell Keith-Magee
> <freakboy3...@gmail.com> wrote:
> > > Have you considered syntax that looks like this instead? :
> > > >>> Students.objects.all().aggregate({'average_height': 'height__avg})
> > > > {'average_height' : 1.43}
> > My issue with this syntax is that every aggregate requires 4 extra
> > punctuation characters (two parentheses, two quotes), and all you
> > really get in return is to reverse the order so that you get
> > result=operation type syntax... except that you can't use the =
> > operator. It also doesn't have any particular analogs with existing
> > ORM syntax.
That was my thinking exactly. I think the cognitive dissonance
involved in height__avg='average_height' is much greater than that in
({'average_height': 'height__avg}). Sure, it's new ORM syntax but it's
also a new ORM feature. Overloading the existing syntax to do
something that's actually quite different feels really strange to me.
name__contains='blah' is field-name__comparison-operator = input-to-
comparison-operator
height__avg='average_height' is field-name__aggregate-function = alias
under which the result should be known
I think the syntax looking similar here is misleading rather than
helpful.
On Sat, Apr 26, 2008 at 6:32 PM, Simon Willison <si...@simonwillison.net> wrote:
> On Apr 26, 7:34 am, "Ian Kelly" <ian.g.ke...@gmail.com> wrote: > > On Fri, Apr 25, 2008 at 11:40 PM, Russell Keith-Magee
> > <freakboy3...@gmail.com> wrote: > > > > Have you considered syntax that looks like this instead? : > > > > >>> Students.objects.all().aggregate({'average_height': 'height__avg}) > > > > > {'average_height' : 1.43}
> > > My issue with this syntax is that every aggregate requires 4 extra > > > punctuation characters (two parentheses, two quotes), and all you > > > really get in return is to reverse the order so that you get > > > result=operation type syntax... except that you can't use the = > > > operator. It also doesn't have any particular analogs with existing > > > ORM syntax.
> That was my thinking exactly. I think the cognitive dissonance > involved in height__avg='average_height' is much greater than that in > ({'average_height': 'height__avg}). Sure, it's new ORM syntax but it's > also a new ORM feature. Overloading the existing syntax to do > something that's actually quite different feels really strange to me.
I can see that there might be a bit of cognitive dissonance; however, it doesn't bug me personally as much as the syntax your are proposing. Excessive parentheses and quotes like that increase the possibilities for mistyping mistakes, and don't really serve to improve clarity in the underlying expression.
That said, the cognitive dissonance is a reasonable enough objection to warrant some more exploration of the syntax options.
So - here's a slightly modified proposal. Last time this topic came up on django-dev, Justin Fagnani made an interesting suggestion which I think will answer your objections, and leaves open some interesting possibilities for userspace extension. Rather than the column__aggregate=alias syntax:
The raw idea - the aggregates themselves are tied up into functions (Avg, Max, etc) that are constructed as an object. The argument to the aggregate is the query column, using the standard double-underscore notation if necessary to cross tables. In QS-RF, this syntax is being used for cross-table order_by() clauses, so there is an analog there.
If the aggregate function is provided as an anonymous argument (e.g., Avg('height')), then the aggregate is asked to provide an appropriate alias - for example, Avg() might use 'avg__%s' % field_name. If the aggregate function is provided as a kwarg, the kwarg name is used as the alias.
Userspace extension becomes a lot easier - If a particular database has an interesting aggregate that isn't supported as part of the core set, you can write your own aggregate and use it as you see fit.
It also leaves open the door to do nifty things like annotate(foo=Avg('a') * Sum('b')) - i.e., alegebraic expressions between aggregates. Obviously, this sort of API will require a whole lot of work to get right, but the option is there.
On Apr 26, 12:33 pm, "Russell Keith-Magee" <freakboy3...@gmail.com>
wrote:
> So - here's a slightly modified proposal. Last time this topic came up
> on django-dev, Justin Fagnani made an interesting suggestion which I
> think will answer your objections, and leaves open some interesting
> possibilities for userspace extension. Rather than the
> column__aggregate=alias syntax:
> >>> Student.objects.all().annotate(
> Avg('height'), tallest=Max('height'), avg_friend_age=Avg('friend__age')
> )
> The raw idea - the aggregates themselves are tied up into functions
> (Avg, Max, etc) that are constructed as an object. The argument to the
> aggregate is the query column, using the standard double-underscore
> notation if necessary to cross tables. In QS-RF, this syntax is being
> used for cross-table order_by() clauses, so there is an analog there.
I like this a lot - makes much more sense to me than the other syntax,
and I think I prefer it to my proposed dictionary based syntax as
well.
> If the aggregate function is provided as an anonymous argument (e.g.,
> Avg('height')), then the aggregate is asked to provide an appropriate
> alias - for example, Avg() might use 'avg__%s' % field_name. If the
> aggregate function is provided as a kwarg, the kwarg name is used as
> the alias.
Is it necessary to provide an anonymous argument syntax at all? Seeing
just annotate(Avg('height')) left me confused as to what the alias
would be until I read your further explanation. Is there any harm in
requiring the user of the API to always explicitly specify their alias
using keyword argument syntax? Unless there's a really good reason to
provide default aliases I think requiring explicit aliases would be
perfectly reasonable ("explicit is better than implicit").
> > If the aggregate function is provided as an anonymous argument (e.g., > > Avg('height')), then the aggregate is asked to provide an appropriate > > alias - for example, Avg() might use 'avg__%s' % field_name. If the > > aggregate function is provided as a kwarg, the kwarg name is used as > > the alias.
> Is it necessary to provide an anonymous argument syntax at all? Seeing > just annotate(Avg('height')) left me confused as to what the alias > would be until I read your further explanation. Is there any harm in > requiring the user of the API to always explicitly specify their alias > using keyword argument syntax? Unless there's a really good reason to > provide default aliases I think requiring explicit aliases would be > perfectly reasonable ("explicit is better than implicit").
Its mostly there as a convenience for the really simple cases. There some duplication in typing annotate(avg_height=Avg('height'), so the shortcut is a convenience. IMHO, this is the sort of thing could be easily explained in the documentation for each aggregate function.
However, it's hardly a mandatory feature - if public opinion is against this idea, I won't lose any sleep about it.
> > > If the aggregate function is provided as an anonymous argument (e.g.,
> > > Avg('height')), then the aggregate is asked to provide an appropriate
> > > alias - for example, Avg() might use 'avg__%s' % field_name. If the
> > > aggregate function is provided as a kwarg, the kwarg name is used as
> > > the alias.
> > Is it necessary to provide an anonymous argument syntax at all? Seeing
> > just annotate(Avg('height')) left me confused as to what the alias
> > would be until I read your further explanation. Is there any harm in
> > requiring the user of the API to always explicitly specify their alias
> > using keyword argument syntax? Unless there's a really good reason to
> > provide default aliases I think requiring explicit aliases would be
> > perfectly reasonable ("explicit is better than implicit").
> Its mostly there as a convenience for the really simple cases. There
> some duplication in typing annotate(avg_height=Avg('height'), so the
> shortcut is a convenience. IMHO, this is the sort of thing could be
> easily explained in the documentation for each aggregate function.
> However, it's hardly a mandatory feature - if public opinion is
> against this idea, I won't lose any sleep about it.
On Apr 27, 1:59 am, "Russell Keith-Magee" <freakboy3...@gmail.com>
wrote:
> Its mostly there as a convenience for the really simple cases. There
> some duplication in typing annotate(avg_height=Avg('height'), so the
> shortcut is a convenience. IMHO, this is the sort of thing could be
> easily explained in the documentation for each aggregate function.
Seems reasonable. I think I'd like to see the explicit syntax
introduced first in the docs just so people aren't confused by the
very first example, but avoiding the duplication in
avg_height=Avg('height') is a worthy goal.
> That was my thinking exactly. I think the cognitive dissonance > involved in height__avg='average_height' is much greater than that in > ({'average_height': 'height__avg}). Sure, it's new ORM syntax but it's > also a new ORM feature. Overloading the existing syntax to do > something that's actually quite different feels really strange to me.
Regarding this, I believe an option would be to use the reverse syntax:
aggregate(alias='field__function') because it mimics what we normally do when assigning in code. or simply aggregate('field__function') and have the standard alias be the same as the requested aggregate (it would be 'field__function') I still prefer this approach over the lookup objects. But this is probably because i've been thinking of aggregates with the filter-like syntax for a while and I became fond of it. Anyway I do think the lookup object syntax is a very good approach, so here are some considerations:
Extending seems easy, but is it? (in practice). The way I see to extend this is to create custom classes (like Max, Avg, etc). this classes would have a __init__ function that receives 2 parameters: self, of course, and the field name. The newly defined class should, then, call something like get_column_name('field') to be able to translate itself into SQL. Pleas don't furiously reply there might be other ways to do it, I know there are, this is just one of the ways I could come up from the top of my head to ilustrate what I mean... A proposition on this follows...
Though the objective syntax mimics Q objects, when using Q objects you have one class, Q, that is used for all the queries. Having multiple classes seems confusing. I would propose to have a single class (A?) to do the queries. So you could do something like:
This way A could receive the name of the function (or the actuall function object) and the field to which it will be applied. A good thing that could come from this is that extending (adding a new function, or set of functions) would be really easy because, since the heavy lifting would be done by A which would already be part of the framework, the new function only needs to define how it transkates to SQL given the column to which would be applied. Having the first parameter be class/classname instead of function/function name is also an option. This would allow us to add members to the class so that A can act differentely. An example of this would be using a class that has the 'alias_prefix' attribute to define what comes before the field name in the default alias (A alias function could also be available to receive the fieldname and return the alias).
I don't like the automatic lowercasing... Its very too much magic, but I found the camel case quiet ugly...
Any thoughs?
-- Nicolas Lara Linux user #380134 Public key id: 0x152e7713 at http://keyserver.noreply.org/ # Anti-Spam e-mail addresses: python -c "print '@'.join(['nicolaslara', '.'.join([x for x in reversed( 'com|gmail'.split('|') )])])" python -c "print '@'.join(['nicolas', '.'.join([x for x in reversed( 've|usb|labf|ac'.split('|') )])])"
Good to see this has come up again. Congrats on getting the GSoC selection Nicolas. The previous thread[1] has a lot of good discussion in it if anyone here wasn't following that.
Obviously, I prefer the expression object API. I've actually been using it a bit in an order processing app I'm working on. I've been keeping the patches up to date with qs-rf and I just merged it with the new trunk. If anyone wants to check them out, where should I put them, as an attachment on #3566? (It works a bit differently than what's described here, folding all the behavior into values(), which might not be ideal)
I really like Honza's idea of an AggregateModel, at least for cases where there's a 1-1 correspondence between results and actual instances, so that the model will still behave as expected. To keep from cluttering the model's attributes, aggregate values could be put into another object or dict:
>>>myproduct.aggregates['avg_price']
I like the idea less when the result would be a representative of a group. There could be unexpected results from calling methods on an instance because not all the data is there, or it has invalid values (averaging an integer field, etc). In these cases, I don't think it's a bad idea to require the use of values() and/or aggregate().
Also, there will probably be cases where we'd want to iterate over the members of the groups, so maybe instead of a list of dicts, aggregate() returns list of objects, so that a query like:
>>> people = Person.objects.values('age').aggregate(Avg('income'))
will return a list of objects that you can use like a dict:
>>>people[0]['age']
and get a queryset from:
>>>people[0].objects()
On Sun, Apr 27, 2008 at 11:26 AM, Nicolas Lara <nicolasl...@gmail.com> wrote:
> Having multiple classes seems confusing.
I'm not sure why multiple classes would be confusing, since they do represent different behaviors. If it has to do with dealing with many classes, then it doesn't seem different than the many function names that need to be passed to the A class.
> I would propose to have a single class (A?) to do the queries. So you > could do something like:
> I really like Honza's idea of an AggregateModel, at least for cases where
> there's a 1-1 correspondence between results and actual instances, so that
> the model will still behave as expected. To keep from cluttering the model's
> attributes, aggregate values could be put into another object or dict:
> >>>myproduct.aggregates['avg_price']
> I like the idea less when the result would be a representative of a group.
> There could be unexpected results from calling methods on an instance
> because not all the data is there, or it has invalid values (averaging an
> integer field, etc). In these cases, I don't think it's a bad idea to
> require the use of values() and/or aggregate().
I didn't understand your point here. When is the data incomplete? and
I dont see the problem in averaging an integerField (probably in
averaging a CharField, but IMHO this kind of invalid values should
make the whole query fail).
> Also, there will probably be cases where we'd want to iterate over the
> members of the groups, so maybe instead of a list of dicts, aggregate()
> returns list of objects, so that a query like:
> >>> people = Person.objects.values('age').aggregate(Avg('income'))
> and get a queryset from:
> >>>people[0].objects()
I like this idea quite a lot. It is basically a shortcut for a common
use case that is generating a QuerySet for the objects that have been
aggregated.Still, doing this queryset manually is also very simple.
> On Sun, Apr 27, 2008 at 11:26 AM, Nicolas Lara <nicolasl...@gmail.com>
> wrote:
> > Having multiple classes seems confusing.
> I'm not sure why multiple classes would be confusing, since they do
> represent different behaviors. If it has to do with dealing with many
> classes, then it doesn't seem different than the many function names that
> need to be passed to the A class.
> > I would propose to have a single class (A?) to do the queries. So you
> > could do something like:
I aggre. An advantage (which I think might also be a double edge
sword) of the string function names is the possibility to fallback to
a standard convertion to SQL when no function is defined. With this we
wouldn't need to wrap every aggregate method in a python class that
translates roughly the same way and we could take advantage of the
direct usage of the DB defined aggregate functions. Also with the A
syntax you would do only one import instead of one for each function.
Still the point of readability is a very important one and I believe
it might come before other features.
> In addition, some cases have special requirements and dealing with them in a
> class is easy.
I can't think of a requirement that can't be dealt with in a method
also... got an example?
On Mon, Apr 28, 2008 at 2:10 PM, Nicolas E. Lara G. <nicolasl...@gmail.com> wrote:
> I didn't understand your point here. When is the data incomplete?
An example would be grouping people by age and averaging income, what should the values of the name field be? None might make sense at first, as Honza suggests, but what happens to methods that depend on name? The whole point of returning models is to have access to the methods and other class metadata, but if data is missing or not meaningful, then those methods might not be useful.
I dont see the problem in averaging an integerField (probably in
> averaging a CharField, but IMHO this kind of invalid values should > make the whole query fail).
The problem with averaging an IntegerField is that the result is unlikely to be an integer, yet parts of the model might assume that it is. At the very least AggregateModels in this situation are unable to be validated, saved or deleted, and it's possible that other methods don't work, so either they shouldn't be models at all, or we need to disable lots of things and issue big, big warnings.
> I aggre. An advantage (which I think might also be a double edge > sword) of the string function names is the possibility to fallback to > a standard convertion to SQL when no function is defined.
I'm not sure how wise this is. Considering the variation of available functions, and their different behaviors, just passing the function name straight into SQL could easily encourage non-portable queries. I know some users will want DB specific functions, but like extras, it should probably be clearly advanced functionality.
With this we
> wouldn't need to wrap every aggregate method in a python class that > translates roughly the same way and we could take advantage of the > direct usage of the DB defined aggregate functions.
Defining most functions is quite easy. Here's Max:
class Max(Aggregate): _func = 'MAX'
I'm sure we can get all the standard functions in there.
> Also with the A > syntax you would do only one import instead of one for each function.
Ideally expressions.py should be import * friendly
> In addition, some cases have special requirements and dealing with them in > a > > class is easy.
> I can't think of a requirement that can't be dealt with in a method > also... got an example?
Count() is an easy example. First, it takes an extra kwarg, 'distinct', and changes the output of as_sql() if set. Then it also has to do a special case in relabel_aliases() and whatever method passes column names to query.setup_joins() for COUNT(*)
> On Mon, Apr 28, 2008 at 2:10 PM, Nicolas E. Lara G. <nicolasl...@gmail.com> > wrote: > > I didn't understand your point here. When is the data incomplete?
> An example would be grouping people by age and averaging income, what should > the values of the name field be? None might make sense at first, as Honza > suggests, but what happens to methods that depend on name? The whole point > of returning models is to have access to the methods and other class > metadata, but if data is missing or not meaningful, then those methods might > not be useful.
In this case: should we actually return the actual objects? If a values modifier is involved the expected behavior is to receive a list of dicts. IMO, we should not allow aggregates to override this behavior. Also, having some value for the fields wouldn't make sense since we cannot ensure there is a unique value for a given field for that group (unless it is specified in values, in which case it would automatically appear). I like the idea of having a QuerySet for retrieving the grouped objects. This can be done simply by having something like: {'objects': <QuerySet object>}
> > I dont see the problem in averaging an integerField (probably in > > averaging a CharField, but IMHO this kind of invalid values should > > make the whole query fail).
> The problem with averaging an IntegerField is that the result is unlikely to > be an integer, yet parts of the model might assume that it is. At the very > least AggregateModels in this situation are unable to be validated, saved or > deleted, and it's possible that other methods don't work, so either they > shouldn't be models at all, or we need to disable lots of things and issue > big, big warnings.
or they can be a list of Models only when no 'values' is specified.
> > I aggre. An advantage (which I think might also be a double edge > > sword) of the string function names is the possibility to fallback to > > a standard convertion to SQL when no function is defined.
> I'm not sure how wise this is. Considering the variation of available > functions, and their different behaviors, just passing the function name > straight into SQL could easily encourage non-portable queries. I know some > users will want DB specific functions, but like extras, it should probably > be clearly advanced functionality.
I am not sure either. I see the problems it might bring.
> Ideally expressions.py should be import * friendly
> > > In addition, some cases have special requirements and dealing with them > in a > > > class is easy.
> > I can't think of a requirement that can't be dealt with in a method > > also... got an example?
> Count() is an easy example. First, it takes an extra kwarg, 'distinct', and > changes the output of as_sql() if set. Then it also has to do a special case > in relabel_aliases() and whatever method passes column names to > query.setup_joins() for COUNT(*)
This could all be handled in a count method. Still I see that using classes poses and advantage.