Improving aggregate support (#14030)

522 views
Skip to first unread message

Josh Smeaton

unread,
Dec 20, 2013, 8:15:12 PM12/20/13
to django-d...@googlegroups.com
I'm interested in tackling ticket#14030, ideally before the Jan 20th cutoff, but I'm looking for some guidance. There have been a lot of references to Anssi Kääriäinen attempting an ORM refactor before trying to fix aggregates, but I'm not sure if this is still a goal or not.

The work done by nateb at https://github.com/django/django/pull/46 was very similar to the structure I believed would be the right way forward - refactoring Aggregates to inherit from ExpressionNode. Unfortunately, the patch submitted has been closed because it no longer applies cleanly. There was also discussion regarding type coercion and a lack of real world tests that should be considered.

What I'd like to do is attempt to rewrite the above PR so that it applies cleanly. The reviews that had already happened on that patch would then mostly apply to the rewrite, eliminating the need for a start to finish review of the concept in general. Is this something that I should go through with, or are we better off waiting for the ORM refactor then reevaluating the way forward?

Regards,

Josh

Anssi Kääriäinen

unread,
Dec 22, 2013, 6:04:36 AM12/22/13
to django-d...@googlegroups.com
On Saturday, December 21, 2013 3:15:12 AM UTC+2, Josh Smeaton wrote:
I'm interested in tackling ticket#14030, ideally before the Jan 20th cutoff, but I'm looking for some guidance. There have been a lot of references to Anssi Kääriäinen attempting an ORM refactor before trying to fix aggregates, but I'm not sure if this is still a goal or not.

Yes, more refactoring is still the plan. A lot has already been done. The main part of the work has been bug fixing, code cleanup and join generation changes.

The next part is introducing custom lookups for 1.7 (see https://github.com/django/django/pull/2019), and later on in 1.8 allow usage of custom lookups in every part of the ORM. This means one can do queries such as .order_by(' jsonfield__user__firstname'), or more relevant to this ticket, .annotate(avg_year=Avg('datefield__year')).

To make the above possible Django's ORM needs to switch to using unified query expression API. For example aggregates, custom lookups and Expression nodes will all respond to the same API. The API deals mainly with how to produce SQL for the expressions, how to convert db values to Python values and things like that. As a side effect Avg(F('somefield') + F('anotherfield')) should just work.
 
The work done by nateb at https://github.com/django/django/pull/46 was very similar to the structure I believed would be the right way forward - refactoring Aggregates to inherit from ExpressionNode. Unfortunately, the patch submitted has been closed because it no longer applies cleanly. There was also discussion regarding type coercion and a lack of real world tests that should be considered.

What I'd like to do is attempt to rewrite the above PR so that it applies cleanly. The reviews that had already happened on that patch would then mostly apply to the rewrite, eliminating the need for a start to finish review of the concept in general. Is this something that I should go through with, or are we better off waiting for the ORM refactor then reevaluating the way forward?

Making Aggregates inherit from ExpressionNode shouldn't make the unified query expression API significantly easier or harder to achieve. If the code complexity stays manageable, I don't see a reason to not merge the changes. The change will be needed for Avg('somefield')  + Avg('otherfield') support even when unified query expression API was implemented.

One of the main problems for this ticket has been the code complexity. Both Aggregates and Expressions are somewhat complicated in the ORM. The combination of the two results in even worse code complexity. On the other hand this feature is clearly needed by many users, so maybe some code complexity can be tolerated.

 - Anssi

Josh Smeaton

unread,
Dec 22, 2013, 6:54:41 PM12/22/13
to django-d...@googlegroups.com
Thanks for the response. I've noticed the JoinPromoter recently, and I have been looking into the custom lookups, but wasn't sure what the purpose was until now.

I'll see what I can do with this patch, and try to avoid as much complexity as I can. Then it just comes down to review on whether it makes it in or not (and my ability of course).

Regards,

Josh

Josh Smeaton

unread,
Dec 23, 2013, 7:41:19 PM12/23/13
to django-d...@googlegroups.com
So, after hacking away a little bit yesterday, I came to the same conclusion that Anssi did back when the first PR was sent. Namely, that the ExpressionNode <-> SQLEvaluator structure seems overly complex, and makes it difficult to create custom "ExpressionNodes".

To build a custom Expression, one needs to:

        - Subclass ExpressionNode
        - Build in SQL into the backend (sometimes)
        - Modify SQLEvaluator (or create a new one)
        - Modify sql/query to learn about the different types of ExpressionNodes

I think it'd be better if SQLEvaluator was dropped altogether, and ExpressionNode was given the evaluator responsibilities. In this world, creating a custom Expression would look like:

        - Subclass ExpressionNode
        - The subclass contains SQL for each supported backend (and has access to the 'connection' instance)
        - sql/query asks ExpressionNode to evaluate itself

Provided that ExpressionNode (or one of its subclasses) had the appropriate methods, it would allow library authors to create their own custom functions. #11305 (Conditional Aggregates) could then be implemented quite easily, without touching the rest of the stack.

So this is what I'm going to attempt first. I'll "merge" the functions of SQLEvaluator and ExpressionNode, and see how far along that gets me. If that turns out to simplify the handling of F(), then it should be equally useful when refactoring Aggregate as an ExpressionNode. As a POC, I'll also implement a version of Conditional Aggregates to confirm the API. Now, this is definitely more complex than porting nateb's original changes, but it's probably going to be a more simplified implementation. As such, my hopes for inclusion in 1.7 are probably far-fetched.

Will report back when I've got something a little more concrete to talk about.

Regards,

- Josh

Josh Smeaton

unread,
Dec 25, 2013, 10:52:10 AM12/25/13
to django-d...@googlegroups.com
So I've gone ahead and tried to remove SQLEvaluator by pushing its functions down into ExpressionNode. For the most part, it's working well. I rebased on akaariai/lookups_3 to make sure that any changes to lookups would be reflected in this change as they are *somewhat* related. The diff between lookups and my patch is below:


There is still a lot missing (I haven't touched GIS yet), and some of the tests are failing. I'm running tests against SQLITE and Postgres at the moment, but I'll add in Oracle and MySQL once I've got something closer to finished.

I'm having some trouble with the test that is currently failing, and if someone could point out where I'm going wrong that'd be awesome. Specifically, ExpressionTests.test_filter() is failing on updates spanning relationships (line 179). The problem is that relabeled_clone isn't being called or used at the appropriate time. I did some debugging, and I *think* that Lookup.process_rhs() is being called before Lookup.relabeled_clone(), which results in the ExpressionNode using the incorrect alias. I assume this is a problem with how I've refactored ExpressionNode since the same tests pass when I checkout lookups_3. ExpressionNode.relabeled_clone is called, but after the SQL has already been generated. Should I be providing a method or attribute for Lookup to inspect maybe?

I'd appreciate some feedback on the patch as it stands too if possible. I think this is the right direction to head towards, as it simplifies the existing F() evaluation, and should allow a straightforward implementation of Aggregates on top. I'll be away until the 2nd of January, but will pick this up again then.

Cheers

- Josh

Anssi Kääriäinen

unread,
Dec 28, 2013, 12:15:44 PM12/28/13
to django-d...@googlegroups.com
On Wednesday, December 25, 2013 5:52:10 PM UTC+2, Josh Smeaton wrote:
So I've gone ahead and tried to remove SQLEvaluator by pushing its functions down into ExpressionNode. For the most part, it's working well. I rebased on akaariai/lookups_3 to make sure that any changes to lookups would be reflected in this change as they are *somewhat* related. The diff between lookups and my patch is below:


There is still a lot missing (I haven't touched GIS yet), and some of the tests are failing. I'm running tests against SQLITE and Postgres at the moment, but I'll add in Oracle and MySQL once I've got something closer to finished.

I'm having some trouble with the test that is currently failing, and if someone could point out where I'm going wrong that'd be awesome. Specifically, ExpressionTests.test_filter() is failing on updates spanning relationships (line 179). The problem is that relabeled_clone isn't being called or used at the appropriate time. I did some debugging, and I *think* that Lookup.process_rhs() is being called before Lookup.relabeled_clone(), which results in the ExpressionNode using the incorrect alias. I assume this is a problem with how I've refactored ExpressionNode since the same tests pass when I checkout lookups_3. ExpressionNode.relabeled_clone is called, but after the SQL has already been generated. Should I be providing a method or attribute for Lookup to inspect maybe?


To me it seems the problem is in the relabeled_clone method. The children are relabeled_clone()'d, but the return value (the relabeled clone) is just discarded.

In addition avoid defining custom __deepcopy__. Instead you could copy.copy() self in start, and then alter self's mutable values manually.

I'd appreciate some feedback on the patch as it stands too if possible. I think this is the right direction to head towards, as it simplifies the existing F() evaluation, and should allow a straightforward implementation of Aggregates on top. I'll be away until the 2nd of January, but will pick this up again then.

I have always found the F() -> SQLEvaluator way of producing SQL limiting and confusing, so +1 from me for getting rid of that.

The idea of F() evaluated by SQLEvaluator (or models.aggregates evaluated by models.sql.aggregates) is that this deduplicates the user facing API from the implementation. This is needed if you have custom Query class (for example for nonrelational engine).

The implementation (that is, SQLEvaluator) is used in two stages. First, when adding the Expression to the query and then when compiling the actual query string. If you need to customize how Expressions are added to the query, then it means you have a custom Query class. That again means that you can wrap the F()/Aggregate object in any way you like, as you control the Query object. For the second part the new query expression API offers you a direct way to alter query string compilation for different backends (see add_implementation() in https://github.com/django/django/pull/2019/files#diff-986a7aa3149cc32d19b751d7ab166082R222).

So, I think we should be safe as far as feature parity goes. If the new way of writing expressions is actually better for non-sql backend support isn't clear to me. I just haven't worked with non-SQL backends enough to say.

The big problem is that if we drop the SQLEvaluator and similarly the models/aggregates.py <-> models/sql/aggregates.py ways of writing Expressions and Aggregates, then there will be a lot of broken 3rd party aggregates and expressions. Even if the APIs are private I am not willing to just drop the old way. So, some backwards compatibility is required.

 - Anssi

Josh Smeaton

unread,
Jan 3, 2014, 11:34:13 AM1/3/14
to django-d...@googlegroups.com
I now have all tests passing on Postgres and SQLite (See here). I still haven't touched GIS, but I believe it should be as simple as changing the query to directly call prepare on the expression, and removing references to SQLEvaluator.

The big problem is that if we drop the SQLEvaluator and similarly the models/aggregates.py <-> models/sql/aggregates.py ways of writing Expressions and Aggregates, then there will be a lot of broken 3rd party aggregates and expressions. Even if the APIs are private I am not willing to just drop the old way. So, some backwards compatibility is required.

What level of backwards compatibility would be required? I like the idea of using the @add_implementation decorator to extend Expressions and Aggregates, but would that be enough? I've not used django-nonrel or any other non-official backend, so I'm really not sure what would be required or where a line should be drawn. The SQLEvaluator API (mostly) has moved down to ExpressionNode, so the callers only change by calling .prepare on the value rather than wrapping it in an Evaluator.

I'll make a start on refactoring Aggregate to inherit from ExpressionNode regardless. Hopefully that will make clear some uncertainties.

Cheers,

Josh

Josh Smeaton

unread,
Jan 6, 2014, 11:41:01 PM1/6/14
to django-d...@googlegroups.com
An update. I've had some success with refactoring Aggregates as ExpressionNodes. The aggregates test suite is now passing, but I still have some failures with aggregates_regress test suite. The refactoring should allow behaviour like below possible:

Model.objects.annotate(total_time=Sum( F('timer1') + F('timer2') / 2 ))

An issue that the first patch ran into was deducing the python return types since there was no field type to "attach" to. I've introduced some basic behaviour that checks that all nested expressions are the same type, and then grabs the first column type. If the types are different, an error is thrown. To get around this, I introduced a field_type parameter to aggregates:

Model.objects.annotate(total_time=Sum( F('timer1') + F('timer2') / 2, field_type=FloatField() ))

Feedback appreciated and welcome.

 My plan is:

- Make sure all existing tests pass
- Create a bunch of new tests that stress "complex annotations/aggregations"
- Figure out how to allow F() in annotations. Annotation currently accepts arbitrary expressions, but does nothing with them.
- Clean up anything that looks too messy

I think I will be done with the above list inside the next 7 days. I would welcome help (or pointers) on testing against GIS once I'm done, as that should be all that's left to tackle.

- Josh

Anssi Kääriäinen

unread,
Jan 8, 2014, 8:05:23 AM1/8/14
to django-d...@googlegroups.com
On Friday, January 3, 2014 6:34:13 PM UTC+2, Josh Smeaton wrote:
I now have all tests passing on Postgres and SQLite (See here). I still haven't touched GIS, but I believe it should be as simple as changing the query to directly call prepare on the expression, and removing references to SQLEvaluator.

The big problem is that if we drop the SQLEvaluator and similarly the models/aggregates.py <-> models/sql/aggregates.py ways of writing Expressions and Aggregates, then there will be a lot of broken 3rd party aggregates and expressions. Even if the APIs are private I am not willing to just drop the old way. So, some backwards compatibility is required.

What level of backwards compatibility would be required? I like the idea of using the @add_implementation decorator to extend Expressions and Aggregates, but would that be enough? I've not used django-nonrel or any other non-official backend, so I'm really not sure what would be required or where a line should be drawn. The SQLEvaluator API (mostly) has moved down to ExpressionNode, so the callers only change by calling .prepare on the value rather than wrapping it in an Evaluator.

The biggest problem are 3rd party Aggregates. To write a custom aggregate one needed two classes, one for user facing API, one for implementing that API for SQL queries.

Taking first example I found from Google, from http://voices.canonical.com/michael.hudson/2012/09/02/using-postgres-array_agg-from-django/:

class SQLArrayAgg(SQLAggregate):
    sql_function = 'array_agg'

class ArrayAgg(Aggregate):
    name = 'ArrayAgg'
    def add_to_query(self, query, alias, col, source, is_summary):
        klass = SQLArrayAgg
        aggregate = klass(
            col, source=source, is_summary=is_summary, **self.extra)
        aggregate.field = DecimalField() # vomit
        query.aggregates[alias] = aggregate

Then doing:
qs = SomeModel.objects.annotate(ArrayAgg('id'))
should produce results using the ArrayAgg class. This is using private API, but there is a lot of code relying on this. We should support this if at all possible.

Another case you should check is how GIS works *without* rewriting anything in django.contrib.gis. If that works, then that shows fairly good backwards compatibility.

Lets try to find the common use cases and try to make sure that they work without any modification during deprecation period, and they are as easy as possible to rewrite to use the new way.

If this turns out to be too painful to support backwards compatibility we have to consider either dropping compatibility or maybe supporting both NewAggregate and old style Aggregate simultaneously.

I am not sure how much custom ExpressionNodes there are, at least such nodes that rely on having SQLEvaluator available. If anybody reading this knows of any please mention them here.

I'll try to check your work in detail as soon as possible. Unfortunately I am very busy at the moment, so I can't guarantee anything.

 - Anssi

Josh Smeaton

unread,
Jan 8, 2014, 8:33:08 AM1/8/14
to django-d...@googlegroups.com
I'll preface by saying it's late for me, and I'm about to sign off, so any mistakes below I will correct tomorrow after some sleep.
 
To write a custom aggregate one needed two classes, one for user facing API, one for implementing that API for SQL queries.

I've taken note of this already, and that is the reason I've kept the sql.Aggregate class where it is, though it's not used internally anymore. There are two options that I can see right now:

- The non-compatible way involves moving the sql_function attribute to the Aggregate from the SQLAggregate, and everything should work fine
- The backward-compatible way might involve re-adding the "add_to_query" function, and using the SQLAggregate to monkeypatch the Aggregate by copying over the sql_function attribute. I would have to experiment and test this though. I assume this would be the more appropriate option. We can include a DeprecationWarning to phase it out.
 
Another case you should check is how GIS works *without* rewriting anything in django.contrib.gis

Good idea, that's how I'll approach the aggregate stuff. Not sure that I can leave it completely alone due to Expressions though, but we'll see (and hope). 

I'll try to check your work in detail as soon as possible. Unfortunately I am very busy at the moment, so I can't guarantee anything.

Totally understandable, I appreciate your comments regardless. It's just unfortunate there aren't more people deeply familiar with the ORM - it's bloody complicated :P

I'm now to the point where all tests in aggregate, aggregate_regress, expressions, expressions_regress, and lookup pass. I've got a small problem with .count() regressions failing, but am working through that now. Tomorrow I should be able to start playing with new functionality and writing new tests. I'll also take a look into running tests on GIS and fixing any errors that rear their heads there.

Cheers,

- Josh

Josh Smeaton

unread,
Jan 11, 2014, 3:06:14 AM1/11/14
to django-d...@googlegroups.com
A milestone has been reached. The diff is here: https://github.com/jarshwah/django/compare/akaariai:lookups_3...aggregates-14030?expand=1
  • Existing annotations and aggregations working as they were
  • Complex annotations and complex aggregations now work
  • Complex annotations can now be referenced by an aggregation
  • Cleaned up the logic relating to aggregating over an annotation

Things I still need to tackle:
  • GIS
  • Custom backend implementation of aggregates and annotations
  • Annotations that aren't aggregates. I think this should be extremely easy, but I might not be aware of some underlying difficulties.
Things that should be done in the future, as I don't think there will be time for this just now:
  • Change order_by (and other functions) to accept Expressions.
I would think that, as it stands, this code is ready for review with the above caveats. If there are any backend authors reading (mssql, django-nonrel), I would encourage you to look at the implementation here, and let me know if there's anything glaringly obvious (to you) that is missing or broken.

Regards,

- Josh

Michael Manfre

unread,
Jan 11, 2014, 6:35:58 PM1/11/14
to django-d...@googlegroups.com
With minor tweaks to django-mssql's SQLCompiler, I am able to pass the aggregation and aggregation_regress tests. If you create the pull request, comments can be attached to specific lines of code.

I haven't had time to do a full review of your changes, but my initial questions and comments are:

- Why were the aggregate specific fields, properties, etc. added to ExpressionNode, instead of being on django.db.models.aggregates?
- Why does an ExpressionNode containing any Aggregates need to pollute the entire tree with that knowledge?
- Instead of using is_ordinal and is_computed (confusing name, why not is_float), you could eliminate those and define output_type for each Aggregate that does something special.
- What does is_summary mean?

Regards,
Michael Manfre


--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/518c79cd-050c-49f7-8420-1e53bfa4b78e%40googlegroups.com.

For more options, visit https://groups.google.com/groups/opt_out.

Josh Smeaton

unread,
Jan 11, 2014, 8:47:45 PM1/11/14
to django-d...@googlegroups.com
Hi Michael,

Thanks for taking a look, much appreciated. I haven't created a PR yet because I've rebased on top of Ansii's Lookups refactor which hasn't hit Master yet (and is changing slightly based on feedback of his PR). To answer your questions:

Why were the aggregate specific fields, properties, etc. added to ExpressionNode, instead of being on django.db.models.aggregates?

Originally they were, but then Sum('field') + Sum('anotherfield') didn't work as that expression produces an ExpressionNode with AggregateNode children. That ExpressionNode (+ (Sum, Sum)) has to behave like an Aggregate. The previous SQLEvaluator had specific methods that related to F() expressions and DateModifierNodes for exactly this reason, which have now been pushed down to the ExpressionNode.

- Why does an ExpressionNode containing any Aggregates need to pollute the entire tree with that knowledge?

As mentioned above. The ExpressionNode becomes an Aggregate to the outside world and should be treated like one. The leaves of the Expression tree form the actual aggregate functions, but the root of the tree forms the entire Aggregate expression. Using the example above:

.annotate(label=Sum('field') + Sum('anotherfield')) ->

ExpressionNode(+):
    : Sum(field)
    : Sum(anotherfield)

The framework needs to know several things about an Aggregate, like its output type and the type of columns that are wrapped. When there are multiple aggregate functions in a single expression, the root needs to check its children and make a decision on what to present.

Instead of using is_ordinal and is_computed (confusing name, why not is_float), you could eliminate those and define output_type for each Aggregate that does something special.

These are remnants of existing aggregates, and can probably be replaced by setting the field_type property directly. I will look into doing that.

What does is_summary mean?

It means that .aggregate() has been invoked which evaluates immediately. Internally the aggregate is prepared differently though as it has to check if it refers to an aggregate annotation. I don't like how the query sets the is_summary property directly though, and might grow .prepare() to also accept an is_summary parameter.

Cheers,

- Josh

Anssi Kääriäinen

unread,
Jan 12, 2014, 3:11:11 AM1/12/14
to django-d...@googlegroups.com
On Saturday, January 11, 2014 10:06:14 AM UTC+2, Josh Smeaton wrote:
  • Custom backend implementation of aggregates and annotations
Any idea how to do this?

The new way in custom lookups is trying to first call as_vendorname(qn, connection), if that doesn't exist then call the standard implementation, that is as_sql(). For example on sqlite Django will first try to call as_sqlite(), if that doesn't exist then as_sql() will be called.

For aggregates I guess monkey-patch addition of as_vendorname() will work. It is a bit ugly to write the custom implementations. And as the function template for an aggregate is defined as a variable in the aggregate itself changing that isn't going to be that easy. You can't just change the template and call super - if the same query is executed on different backend then things will not work. I guess what is needed is some way of non-permanent change of the template. Possibilities seem to be addition of template kwarg to as_sql() so you can just do:
    def as_mysql():
        super(qn, connection, template='MYSQLCOUNT')

Alternatively there could be some way to clone the aggregate into backend specific class if needed.
  • Annotations that aren't aggregates. I think this should be extremely easy, but I might not be aware of some underlying difficulties.
I don't think it is that easy. Django assumes that if any annotation exist, then the query is an aggregation query. But of course that doesn't hold if the above is done.
 
Things that should be done in the future, as I don't think there will be time for this just now:
  • Change order_by (and other functions) to accept Expressions.
Not just Expressions but anything implementing query expression API. That is, it would be very useful to be able to do qs.order_by('birthdate__year'). And if this is done it will be actually pretty easy to inject custom SQL into the queries. It will be relatively easy to write a RefSQL class which allows one to do: .order_by(RefSQL("case when {{author__birthdate__year}} > %s then 1 else 0 end", [1981])). Joins generated by Django, and you can use any custom lookups you have. No need for .extra() any more!

 - Anssi

Josh Smeaton

unread,
Jan 12, 2014, 3:52:03 AM1/12/14
to django-d...@googlegroups.com
Any idea how to do this?

The new way in custom lookups is trying to first call as_vendorname(qn, connection), if that doesn't exist then call the standard implementation, that is as_sql(). For example on sqlite Django will first try to call as_sqlite(), if that doesn't exist then as_sql() will be called.

I've been following your PR and noted the change (and made some comments too) away from @add_implementation. I'm having a bit of an internal debate about how I've structured expressions with respect to as_sql though. Each expression also has a get_sql() method that as_sql() calls internally on leaf nodes. I'm thinking it'd be best for the extension point to be on get_sql(), but it'd be inconsistent with the Lookups API. The problem with overriding as_sql() would be that the extender would also need to know to recurse the entire tree which is not a good idea at all. Would the inconsistency be acceptable? When I spend some time actually trying out swappable implementations, this may all become clearer to me.
  • Annotations that aren't aggregates. I think this should be extremely easy, but I might not be aware of some underlying difficulties.
I don't think it is that easy. Django assumes that if any annotation exist, then the query is an aggregation query. But of course that doesn't hold if the above is done.

You're right. I was hoping that I could just append nodes into the query.select list, but that was a dead end. 

I've been playing around with non-aggregate annotations today, and I think it's almost fine to just put them directly into the query.aggregates dictionary (and rename it to annotations). It works in trivial cases only, by checking the is_aggregate property. F() expressions over annotations are forced into the HAVING part. Aggregations over annotations fail, because the annotation is promoted to the select list. I think both of these problems I can overcome fairly easy, but that doesn't mean there won't be more issues.

An alternative would be to special case non-aggregate annotations in a similar way that aggregates are treated in sql/query now. I don't like this idea at all, because too much needs to change. I'm trying to change as little as possible in sql/query and sql/compiler.

order_by(RefSQL("case when {{author__birthdate__year}} > %s then 1 else 0 end", [1981])). Joins generated by Django, and you can use any custom lookups you have. No need for .extra() any more!

That's quite interesting. I've already introduced the concept of a ValueNode, which is an expression that outputs whatever value you give it. Already one could write .annotate(ValueNode("case when ...")) as a string, and it'd work, but that misses the {{author__birthdate__year}} stuff in your example. Once I've got my main points implemented, I was going to experiment with something like:

.annotate(Case(condition=Q(author__birthdate__year__gt=1891), true=1, false=0))

The actual use case I had in mind was in reference to #11305 (Conditional Aggregates):

.annotate(Count(Case(condition=Q(author__birthdate__year__gt=1891), true=1, false=0))

But that's just speculation at the moment. More work needs to be done to see what that'd look like in annotations, but then the path is clear for implementing it with order_by and friends.

Cheers,

- Josh

Anssi Kääriäinen

unread,
Jan 12, 2014, 5:28:33 AM1/12/14
to django-d...@googlegroups.com
On Sunday, January 12, 2014 10:52:03 AM UTC+2, Josh Smeaton wrote:
Any idea how to do this?

The new way in custom lookups is trying to first call as_vendorname(qn, connection), if that doesn't exist then call the standard implementation, that is as_sql(). For example on sqlite Django will first try to call as_sqlite(), if that doesn't exist then as_sql() will be called.

I've been following your PR and noted the change (and made some comments too) away from @add_implementation. I'm having a bit of an internal debate about how I've structured expressions with respect to as_sql though. Each expression also has a get_sql() method that as_sql() calls internally on leaf nodes. I'm thinking it'd be best for the extension point to be on get_sql(), but it'd be inconsistent with the Lookups API. The problem with overriding as_sql() would be that the extender would also need to know to recurse the entire tree which is not a good idea at all. Would the inconsistency be acceptable? When I spend some time actually trying out swappable implementations, this may all become clearer to me.

Maybe both Aggregate and Expression could implement the query expression API (that is, those things Col class has in the lookups patch). This way F('author__age') + F('author__books_sold') would translate to:
  Expression(+):
    Col(author, age)
    Col(author, books_sold)

If you write F('author__age') + F(Sum('author__books__pages'))) this would translate to:
  Expression(+):
    Col(author, age)
    Sum(Col(books, pages))

The SQL would be generated by:
Expression.as_sql(qn, connection):
    params = []
    sql = []
    for node in self.children:
        node_sql, node_params = qn.compile(node)
        sql.append(node_sql)
        params.extend(node_params)
    return ' + '.join(sql), params

Note that it doesn't matter at all if the children is an aggregate, a col reference of another Expression.

You would also gain support for F('author__birthdate__year'), that is your extract lookups are available. An extract is after all just another class implementing the query expression API. Full support needs a bit of refactoring to the Query class as expressions are currently only resolved in build_filter(). A generic get_ref(lookup_path) method is needed.

Internally this would mean that Expression just generates the connector between nodes, but doesn't produce any SQL for the nodes itself. It seems Expression is trying to do too much currently. It produces connector between nodes when it is internal node and is also generates SQL for the leaf nodes, too.

This is just an idea. To me it seems this is cleaner, as now Expression node is just connecting two parts implementing the sql expression API. OTOH this is really large refactoring, so maybe it is better to go with the current patch and then check if things can be improved further later on.

 - Anssi

Josh Smeaton

unread,
Jan 13, 2014, 5:24:40 PM1/13/14
to django-d...@googlegroups.com
I've rebased off your current lookups pull request to drag in all the changes made recently, and I pulled against upstream/master which has made looking at the diff between our branches extremely painful. Once your PR is applied to master, I'll be able to submit a proper PR though (after some squashing).

Onto the detail.

Maybe both Aggregate and Expression could implement the query expression API (that is, those things Col class has in the lookups patch). This way F('author__age') + F('author__books_sold') would translate to:
  Expression(+):
    Col(author, age)
    Col(author, books_sold)

Absolutely that should be the goal. I've already written Expressions to support the query expression API (I had to, since lookups expected aggregates to support it), but I still can't use the Transforms because, as you said, build_filter() does all the work currently.

Internally this would mean that Expression just generates the connector between nodes, but doesn't produce any SQL for the nodes itself. It seems Expression is trying to do too much currently. It produces connector between nodes when it is internal node and is also generates SQL for the leaf nodes, too.

You were right, it was doing way too much. I've refactored Expressions significantly so that it does much much less. All the leaf nodes have changed their get_sql() methods to as_sql() meaning that they now support as_vendorname(), since the compiler is what generates the SQL. I still need to test as_vendorname to confirm I haven't missed anything, but it should work with little modification now. I'll write those tests tonight.

I started this refactoring with the hope that it'd be able to make the January 20th cutoff, but that's not going to be possible if I aim for getting non-aggregate annotations in. It needs more thought, and potentially a lot of work. However, I feel the work done up until this point gives a lot more power to aggregates that was not possible before. I'd like to propose submitting aggregate improvements now, closing 14030, and opening another ticket for supporting non-aggregate annotations.

Getting this change included relies on a few things though;

- There needs to be time for you (or someone else) to properly review the change. With only 6 or so days to go, I think this is asking too much.
- Docs need to be written
- I still need to sort out GIS
- Support for 3rd party backends (or non-sql backends) needs to be evaluated. The maintainer of django-mssql has taken a cursory look, but review seemed mostly positive (it worked with minimal changes). This does not necessarily mean that non-sql backends will be fine though.
- Tests need to be run on mysql and oracle. I couldn't get the tests running on mysql (errors regarding long index names when creating the DB), or oracle, since I only have oracle cluster at my disposal, which doesn't support creating the test database. I can run up an Oracle XE, but I won't do that just now if 1.8 is going to be the target.
- Cleaning up the commits
- Waiting for Lookups to land on master so a clean PR can be submitted

What are your thoughts?

Whether or not this is picked up for 1.7 or 1.8, I'll continue working on the patch and keeping it up to date with master so it doesn't go stale. That's happened enough times :P

Cheers

- Josh

Josh Smeaton

unread,
Jan 14, 2014, 1:56:29 AM1/14/14
to django-d...@googlegroups.com
Vendor specific SQL is now possible, and uses the monkey patching described above. I worked around the problem of temporarily changing the template, by putting the template and the function into the extra dict on init. Any modifications made to the dict are then made for that instance only. Complicated overrides are possible, giving full control when needed, but calling super after modifying the dicts should be the preferred way of patching.

Relevant code:


And the tests:


One of the relevant tests, showing the most simple way of overriding an aggregate:

        def lower_case_function_super(self, qn, connection):
            self.extra['sql_function'] = self.extra['sql_function'].lower()
            return super(Sum, self).as_sql(qn, connection)
        setattr(Sum, 'as_%s' % connection.vendor, lower_case_function_super)

Cheers,

- Josh

Josh Smeaton

unread,
Jan 14, 2014, 8:15:56 AM1/14/14
to django-d...@googlegroups.com
I've also written a basic hack (there's no other word for it really) that should support the majority of custom aggregates out in the wild. The exact commit is here:


Basically, it checks to see if add_to_query exists, call it with a known alias so that we can then extract the SQLAggregate from query.aggregates, and steals the sql_function and sql_template attributes. That said, to support the "new" way, the user just needs to copy those two attributes onto the models/aggregate, and they now support both the old and new way.

Unfortunately, this doesn't work for GIS. I'll need to port GeoAggregate to work like a standard Aggregate, but the work there is extremely minimal. It would also gain the support I've built in above for custom Geo Aggregates. The "hard" work for GIS is changing the models/query and models/sql/query as I've done for core.

Cheers,

- Josh

Josh Smeaton

unread,
Jan 15, 2014, 8:57:33 AM1/15/14
to django-d...@googlegroups.com
Quick update. GIS support is nearly done, I just have two failing tests to take care of (one to do with lookups setting srid, and another that is producing the wrong value). I have a few validation checks to make inside prepare, and then it's on to writing tests that make use of aggregates and expressions together. I'm unsure what kind of expressions I can write with GIS queries though, so if anyone has any suggestions, I'd be glad to implement them.

Examples of tests with regular aggregates would be:

.annotate(Sum('field')+Sum('anotherfield'))
.annotate(Max('field')*2)

I'm not sure how you'd compose GeoAggregates as I've never used them before.

I should mention that I've only run GIS tests against PostGIS. I've noticed that Oracle is special cased in a number of areas, so I'd appreciate any help I could receive in testing against Oracle spatial.


There was a bug in gis/sql/compiler that passed the quote_name function instead of itself to the expression. Whether or not this change is accepted, that bug should be fixed in master as it breaks the query expression API.

Cheers,

- Josh

Anssi Kääriäinen

unread,
Jan 16, 2014, 3:36:57 AM1/16/14
to django-d...@googlegroups.com
I'm sorry I haven't had enough time to review your work. Unfortunately I won't have any more time in the near future either. So, it looks like this has to wait for 1.8. Sorry for that.

If you want to ask specific questions, #django-dev IRC channel is a good place to ask those questions. It is often easier to solve design issues real-time than on a mailing list.

 - Anssi

Josh Smeaton

unread,
Jan 16, 2014, 6:35:21 AM1/16/14
to django-d...@googlegroups.com, django-d...@googlegroups.com
That's fine, I knew I was cutting it close when I started. At least I can start getting to bed at a reasonable time now :)

I was wondering where the dev Irc channel was, so I'll make use of that too. Thanks for the pointers so far. 

- Josh
--
You received this message because you are subscribed to a topic in the Google Groups "Django developers" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/django-developers/8vEAwSwJGMc/unsubscribe.
To unsubscribe from this group and all its topics, send an email to django-develop...@googlegroups.com.

To post to this group, send email to django-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers.

Josh Smeaton

unread,
Jan 19, 2014, 8:00:00 AM1/19/14
to django-d...@googlegroups.com
I've finally sent a PR, so if you're still able, I'd love to see the specific comments you had.


Cheers,

- Josh

On Sunday, 12 January 2014 10:35:58 UTC+11, Michael Manfre wrote:

Michael Manfre

unread,
Jan 19, 2014, 10:34:39 AM1/19/14
to django-d...@googlegroups.com
There are a few other things that I need to try and get in before the 1.7 feature freeze. Since this is not slated for 1.7, I'll take a look a the PR after the freeze and see if the comments I had still apply.

Regards,
Michael Manfre


Josh Smeaton

unread,
Mar 30, 2014, 10:12:25 AM3/30/14
to django-d...@googlegroups.com
Code complete: https://github.com/django/django/pull/2496

- Aggregates have learnt arithmetic and how to interact with F() expressions
- Annotations can now accept non-aggregate expressions

I'm in the process of writing documentation, some of which I've committed and is available within the PR. I plan to have the documentation in a ready for review state next weekend.

I'd really appreciate some reviewing of the code and the approach though, ideally before master diverges too much and makes merging too difficult.

Cheers,

Josh
Reply all
Reply to author
Forward
0 new messages