Complex aggregate and expression composition.

287 views
Skip to first unread message

Nate Bragg

unread,
Mar 20, 2012, 7:27:31 PM3/20/12
to django-d...@googlegroups.com
Hello all,

Since even before I saw Alex Gaynor's presentation "I hate the Django ORM"
(the one with the `Sum(F("end_time") - F("start_time"))` query), the problem
of complex aggregates and expressions has vexed me. So, I figured I would
try to solve it.

Originally, I started this trying to pursue a solution to ticket #14030, but after
I took a couple of lousy shots at it, it dawned on me that the ticket would be
better resolved as a result of solving the more general case. I realized that
aggregates were just a special case of expressions, and that the best solution
was going to take a refactoring of Aggregate into ExpressionNode.

I have uploaded my branch; it can be found here:

https://github.com/NateBragg/django/tree/14030

Since this is a non-trivial change, I was hoping to open the topic for debate
here, and get some feedback before proposing my solution for inclusion.

Some particular points of note:
* I tried to preserve as much interface as possible; I didn't know how much
  was considered to be more public, so generally I tried to add rather than
  subtract. However, I did remove a couple things - if you see something
  missing that shouldn't be, let me know.
* Currently, I'm getting the entire test suite passed on sqlite, oracle, mysql,
  postgres, and postgis. I was unable to test on oracle spatial - any help
  with that would be appreciated.
* When fields are combined, they are coerced to a common type;
  IntegerFields are coerced to FloatFields, which are coerced into
  DecimalFields as needed. Any other kinds of combinations must be of
  the same types. Also, when coerced to a DecimalField, the precision
  is limited by the original DecimalField. If this is not correct, or other
  coercions should be added, I'd like to correct that.
* When joins are required, they tend to be LEFT OUTER; I'd like some
  feedback on this, as I'm not 100% sure its always the best behavior.
* As the aggregates are a little more complicated, on trivial cases there
  is a minor reduction in performance; using djangobench, I measured
  somewhere between a 3% and 8% increase in runtime.
* I don't have enough tests - mostly for a lack of creativity. What kind of
  composed aggregates and expressions would you like to see? I'll add
  tests for them.

Thank you all, and sorry for the above being a short book.

-- Nate Bragg

Kääriäinen Anssi

unread,
Mar 20, 2012, 10:07:09 PM3/20/12
to django-d...@googlegroups.com
I took a quick look at your patch. I don't have more time now, so just some quick comments:
- In general, the approach where aggregates are just expressions sounds and looks
valid.
- I would not worry about the extra time used in djangobench. However, profiling why
there is extra time used is always recommended.
- I am a bit scared of the type coercions. The reason is that this could prove to be
hopelessly complex to get right in every case. However, I do not have concrete
examples where this is in fact a problem. The default should probably not be an exception,
but just returning what the database happens to give you back.

I think the approach you have taken is correct in general. I would encourage to check
if you can somewhat easily incorporate the conditional aggregate support (#11305)
into the ExpressionNode based aggreagates. It does not belong into the same
patch, but is a good sanity check if the approach taken is extensible.


[Following is a bit off-topic]
I wonder if the ExpressionNode itself should be refactored into a public API. This way
you could easily write your own SQL snippets injectable into the query. This could be
custom aggregates, or this could be just NULLS LAST order by clauses.

The reason I bring this up is that in the long run, adding more and more special case
support to the ORM (conditional aggregates, different SQL functions) doesn't seem to be
the right way forward. Once you get expression composition in, you only have 90% of
SQL constructs left...

Spend the time in building support for user writable SQL snippets, so that they can
use just the SQL they want. In my opinion NULLS LAST/FIRST support is a great
example: it is common enough that users need it from time to time, but it is not common
enough to spend the time to support this special case. Why not just:
qs.order_by(SQL('%s NULLS LAST', F('pub_date'))
and you now got support for _any_ order by clause the user wishes to use. Replaces
extra(), but in a cleaner way. The above could support relabel_aliases(). Or you could
write it just as qs.order_by(SQL('pub_date NULLS LAST')) if you don't care for relabel
aliases support.

For the F-expression support in aggregates this would mean you get actually not just
F expression support in aggregates, but any SQL snippet can be injected into the
aggregates, for example Sum(SQL('case when person.age > friend.age then 1 else 0 end'))

- Anssi
________________________________________
From: django-d...@googlegroups.com [django-d...@googlegroups.com] On Behalf Of Nate Bragg [jonatha...@alum.rpi.edu]
Sent: Wednesday, March 21, 2012 01:27
To: django-d...@googlegroups.com
Subject: Complex aggregate and expression composition.

Hello all,

https://github.com/NateBragg/django/tree/14030

-- Nate Bragg

--
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-d...@googlegroups.com.
To unsubscribe from this group, send email to django-develop...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/django-developers?hl=en.

Nate Bragg

unread,
Apr 10, 2012, 12:09:09 PM4/10/12
to django-d...@googlegroups.com
I took a quick look at your patch. I don't have more time now, so just some quick comments:
[...] 
I think the approach you have taken is correct in general. I would encourage to check
if you can somewhat easily incorporate the conditional aggregate support (#11305)
into the ExpressionNode based aggreagates. It does not belong into the same
patch, but is a good sanity check if the approach taken is extensible.

Thanks for taking a look!  Sorry it took me so long to get back to you, I've been quite
busy lately.  I finally had two hours to rub together to try your suggestion; I set up a
branch (  https://github.com/NateBragg/django/tree/11305  ) that implements this
behavior using (more or less) your approach.

The difference can be found here:



I also wanted to thank you on another count, as parts of your 11305 approach
served as a launching point for my work - and even gave me a good solution for
one of the problems I listed in my original email (regarding join promotion), that
I have since incorporated it onto the 14030 branch.

 - I am a bit scared of the type coercions. The reason is that this could prove to be
   hopelessly complex to get right in every case. However, I do not have concrete
   examples where this is in fact a problem. The default should probably not be an exception,
   but just returning what the database happens to give you back.

It could be quite complex, I agree, but it doesn't have to be.  I based it off of the
preexisting concept that Aggregates needed to know about their field types,
and whether or not they were ordinal or computed aggregate types.



So, in general what I need is other folks trying this, finding problems, or taking
issue with my proposal.  It might not be quite as exciting as user.auth, but I
think refactoring like this needs review and feedback.  I'd really love to propose
this for inclusion, so that others could benefit from this.


 
As it's a couple weeks old, in case you missed my original email, I've included it below.

Thanks,

Nate
Reply all
Reply to author
Forward
0 new messages