> since I have a lot more SQL in my projects than I expected when I
> started using Django.
Would it be possible to abstract away that hand crafted SQL from
Django by the simple expedient of hiding it behind a database artifact
such as a view, stored procedure, pipelined function or such?
That way Django doesn't need to know about your SQL and your SQL
doesn't need to be embedded in Django.
The best of both world really :)
> I think that would be difficult, since I regularly switch databases. I
> really like using sqlite for development, for instance, and Postgres
> when I
> can, but I'm forced to use MySQL on a shared host. So far, either with
> extras or just using a cursor, my SQL has been completely portable.
Out of curiosity, does your project contain any database-specific DDL
or DML? Perhaps some create table, index, partition, constraint,
sequence, analytic, statistic, or such?
Or is it fully driven end-to-end by Django's ORM? I.e., all the
database artifacts, as well as the database itself, are generated and
managed by Django from cradle to grave?
If your project already has database-specific artifacts, it is but a
small step to refactor some of the more advanced DML away from Django
and expose it back to Django in an ORM-friendly way such as a view
or cursor or whatnot.
After all, what's the point of an ORM but to shield you from
common SQL trivialities?
On the other hand, if an ORM starts mimicking most of SQL's
peculiarities, perhaps something has gone wrong along the way :)
Could you not hijack the thread, please. Nicola started this to talk
about aggregates, not how one particular project uses custom SQL.
Thank you,
Malcolm
--
No one is listening until you make a mistake.
http://www.pointy-stick.com/blog/
> Could you not hijack the thread, please. Nicola started this to talk
> about aggregates, not how one particular project uses custom SQL.
Fair enough.
On the other hand, one could argue that the subtext of those questions
is something along the lines of:
"Don't bother re-implementing SQL in your favorite language"
Which is pretty much what emerges from your recent "Queryset
Implementation" post:
http://www.pointy-stick.com/blog/2008/03/11/queryset-implementation/
In any case, apologies for disturbing the groupthink :)
Hi Nicholas,
This is an ambitious project, so good luck! Long-time readers will
know that I have an interest in aggregates, so I'll be here to help
out however I can.
Here is some feedback on your specific proposals:
> There should be an aggregate() QuerySet modifier, that _always_
> returns a QuerySet (probably an AggregateQuerySet subclass of
> QuerySet).
>
> suppose you have::
> >>> queryset = Buyer.objects.all()
>
> This modifier receives arguments in two ways:
>
> As function=tuple for the different aggregate functions::
> >>> queryset.aggregate( sum=( 'payment',), avg=('quantity',
> 'payment', 'age') )
>
> which results in a queryset whose values() would be::
> [
> {
> 'name' : 'Peter Smith',
> ... other attributes ...
> 'payment' : {'sum': 20000, 'avg' : 200},
> 'quantity' : {'avg' : 1500}
> }
> {
> 'name' : 'Jane Doe',
> ... other attributes ...
> 'payment' : {'sum': 62000, 'avg' : 321},
> 'quantity' : {'avg' : 12604}
> }
> {
> 'age': {'avg' : 35 }
> }
> ]
I'm not sure I'm entirely comfortable with a few aspects of this.
1) The output format: Using the example data structure you provided,
how do you extract the average age? The best I can see is
"values[2]['age']['avg']" - values[2] is not a very helpful access
mechanism.
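To make that awkwardness concrete, here is what extracting the average age looks like against the sample result structure from the proposal above (data reproduced from the example, not real output):

```python
# Sample output structure from the proposal above (abridged).
values = [
    {'name': 'Peter Smith',
     'payment': {'sum': 20000, 'avg': 200},
     'quantity': {'avg': 1500}},
    {'name': 'Jane Doe',
     'payment': {'sum': 62000, 'avg': 321},
     'quantity': {'avg': 12604}},
    {'age': {'avg': 35}},
]

# The whole-queryset aggregate is only reachable by positional index,
# which is fragile: add a buyer and the index silently shifts.
average_age = values[2]['age']['avg']
print(average_age)  # 35
```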
2) The query mechanism confuses the idea of an aggregate over the
query set itself (e.g., avg age of all buyers) and an aggregate of a
member in the group (e.g., the average quantity purchased by each
individual buyer).
My original proposal had 2 mechanisms: aggregate() to get aggregates
over the entire QS, which returned a dictionary of the resulting
aggregates, and annotate() to augment object instance with aggregate
attributes.
The particular goal of annotate() was to enable you to get a list of
actual objects which have extra attributes, so you could reference
"somebuyer.average_payment".
In your proposal, I suppose annotate() would return the first two
elements of your result set, and aggregate() would return the last
element.
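The distinction between the two output shapes can be sketched in plain Python (the Buyer model and all data here are hypothetical, and the functions only mimic the proposed semantics, not Django's implementation):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Buyer:
    name: str
    payments: list  # payment amounts for this buyer

buyers = [
    Buyer('Peter Smith', [100, 300]),
    Buyer('Jane Doe', [321]),
]

def aggregate(objs):
    """aggregate(): one dictionary over the entire 'queryset'."""
    all_payments = [p for b in objs for p in b.payments]
    return {'payment__avg': mean(all_payments)}

def annotate(objs):
    """annotate(): the same objects, each carrying an extra attribute."""
    for b in objs:
        b.average_payment = mean(b.payments)
    return objs

print(aggregate(buyers))
annotated = annotate(buyers)
print(annotated[0].average_payment)  # 200
```

The point is that the first call collapses the queryset into a single dict, while the second hands back usable model instances with summary data along for the ride.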
3) How does this syntax extend into multiple-table aggregates (i.e.,
when you are querying over Buyer and need to get an aggregate over
something related to the Purchase table) - in particular the output
format?
> and in filter-like syntax::
> >>> queryset.aggregate( payment__sum='total_payment',
> quantity__avg='normal_request', payment__avg='avg_pay',
> age__avg='mean_age')
Obviously, I like the aliasing format, but the same comments apply
about the output format and the inconsistency between aggregates over
instances and over groups.
> When a field name exists in the base model that has the same name as
> the field on the related model we want to aggregate, we could resolve
> the ambiguity by referring to the related field like:
Add this to the list of problems that are avoided if you separate
aggregates on the model from aggregates on the instance. :-)
> There is no way to specify that only the aggregated fields should
> appear in the results because this could be done with the modifiers
> proposed in [#5420].
Agreed.
I would add to this - if you take the 'annotated object' approach to
the output format, you could use 'values()' to pull out the subset of
values that you want for each object.
> In the case where group_by is not specified it is always grouped by
> the ForeignKey (or ManyToMany). If the group_by is specified (this is
> always specified in the function=tuple format) the object is grouped
> by those fields.
You've lost me here. Where does 'group_by' come from?
> The count function should, IMHO, be handled in the filter-like syntax
> so you can specify on which field to count allowing you to count
> excluding null values for foreign tables.
Agreed.
> To specify ordering you would use the function order_by, referring to
> the alias or the field__function. This is::
>
> qs.order_by('-payment__sum')
> qs.order_by('-total_payment')
Agreed. My only comment here is that it would be nice if the
payment__sum format had some analog in the original query syntax. This
is my biggest problem with the unaliased aggregate notation that you
are proposing - it doesn't really share an analog with anything else
in Django, whereas the 'payment__sum' approach and the aliased format
both have immediate and relatively obvious analogs with the syntax
used by filter().
> SQL
I think you're on the right track here, except for:
> As said before, the GROUP BY clause, when not specified, would be
> generated by using the related key.
Again - you seem to have skipped this bit of your proposal.
> Downsides
> ---------
>
> The only downside I see with having the two syntax possibilities is
> inconsistency. The person doing, for example, the templates would need
> to know how the query was executed in order to decide whether to use::
>
> {{ object.total_payment }}
>
> or::
>
> {{ object.payment.sum }}
>
> Of course, this could also be seen as an ''aliasing feature'', but I'm
> not thrilled.
I don't see this as a huge problem. Personally, I could live without
the non-aliased format, but it doesn't bother me enough to want to
reject it, and the syntax makes sense in context.
> Another problem I see is that group_by is too SQLish. You gotta think
> in tables/sql instead of objects. I still don't see a better way of
> allowing the user to specify by which fields to group. Maybe a simple
> name change would make it easier.
Again - where does group_by come from?
> Extras
> ------
>
> 1- An idea I had, but I am not sure yet how it could be well
> implemented
> is to allow users to define their own aggregate function and provide
> it to extend the aggregate api.
Interesting idea, but one that I suspect could be added after the
fact, possibly as part of a larger customization project. For example,
GeoDjango could probably make use of some customized filter tags -
whatever approach is used for aggregates should probably be mirrored
in filters.
> 2- Another extra thing that could be explored is Russell's idea of
> allowing field lookups to compare two fields:
Again, this is something that is useful outside aggregates itself, and
could be implemented independent of any aggregate implementation.
> I put this as an extra since it involves the whole ORM and not just
> the aggregates, though I think it would be a very nice feature to
> have.
>
> 3- Another thing I've been looking into is to allow, for performance
> reasons, the aggregates to be kept in a table as is done in
> django-simpleaggregation [1].
I'll need to look into this before I can make a full comment, but my
immediate reaction is that we should concentrate on simple SQL query
construction, rather than optimization strategies.
Yours,
Russ Magee %-)
But they're not equivalent. One mechanism returns a dictionary of
aggregates. The other returns a list of model instances, each of which
is annotated with aggregates. The only similarity is that 'they are
both about aggregates' - beyond that superficial similarity, they are
quite different. Given that there are two quite different output
modes, I don't think having two modifiers is at all unusual.
Potential name clashes, and needing to introduce a syntax that allows
for multiple ways to reference the same aggregate - this is a feature?
> My point is that it could be done in a
> way that you always need to refer to the related field when calling an
> aggregate on an instance, but this is unnecessary if there are no name
> clashes. It is equally as easy to check the fields on the related
> table to find out which is to be used, and this is a very non-expensive
> operation. Still, I wouldn't oppose always specifying the related
> field name in aggregates over groups.
The throwaway line to counter this is "Explicit is better than implicit".
On a practical note, I haven't sat down and worked on an
implementation, but I suspect that you will find that the logic
required to resolve these shortcut aggregates will be more headache
than it is worth.
> > > In the case where group_by is not specified it is always grouped by
> > > the ForeignKey (or ManyToMany). If the group_by is specified (this is
> > > always specified in the function=tuple format) the object is grouped
> > > by those fields.
> >
> > You've lost me here. Where does 'group_by' come from?
>
> It is true that I didn't say much about the group by feature, so here
> it goes:
> Though group by seems very SQL-ish to me, I understand it is a feature
> that many people would like to have since it allows more flexibility
> (and nobody likes to have to go and do it in raw sql). The group_by
> attribute could be added (in the function=tuple way) to specify the
> fields to go in the GROUP BY clause. In the case that this attribute
> is present the group by clause would be populated with those fields
> instead of the foreign key/id.
Remember, as with all Django ORM work, we're only trying to get to
80%; complex problems will always require complex solutions,
for which the right answer is to roll your own SQL.
With that in mind: I'm not convinced of the need for group_by(). Can
you provide a use case where it would actually be useful? It seems to
me that both of the use cases driven by the output formats (aggregate
of entire QS, and annotated aggregates of QS members) pretty much
imply the group_by that is required. If you start playing with
group_by, you're going to get QS output which isn't a clean or easily
interpreted object list.
Yours,
Russ Magee %-)
For me the only real reasons to keep the function=tuple syntax are
specifying group by (which I believe would be done with something
similar in your syntax, but let's not speculate) and not tying the
user to selecting aliases when they are not needed. Aliasing is great
functionality, but if my aliases always have the same name as the
function I am calling, I want them generated for me.
I believe your syntax does beat the one I'm proposing regarding the
power of user-defined functions, but as general syntax it still
seems a bit inconsistent with the existing API syntax. Also, it would
require user-defined functions to be specified directly in the
queryset, and this is bad for re-use.
> It may also be possible to use the aggregation functions directly in
> filters:
>
> >>> cheap_books = queryset.filter(price__lt=Avg('price'))
>
> I think that's a pretty clean way to specify that query. It should be
> roughly equivalent to the following (using the F(), which I think is an
> absolutely great idea, and annotate() from Russell):
>
> >>> cheap_books = queryset.annotate(price_avg=Avg('price')).filter(price__lte=F('price_avg'))

This I don't like that much. Though it would definitely make the
calls simpler, the use of aggregate functions everywhere might come
back to bite you. Also, users will probably not realize that it
represents a more expensive operation than a simple "price__lt=3".
I believe that your syntax would be a great replacement for the
function=tuple syntax (and I believe that just lowercasing the
function names would make it look less odd to me).
I'll just make a few quick comments on your proto-implementation.
You've pretty much found a logical way to fit this in, but you need to
be more imaginative in your test cases to drive your understanding a bit
further.
On Sun, 2008-03-16 at 22:22 -0700, Justin Fagnani wrote:
[...]
> # Aggregate objects:
> class Sum(object):
>     def __init__(self, col=None, alias=None):
>         self.col = col
>         self._alias = alias
>
>     def as_sql(self, quote_func=None):
>         return "SUM(%s)" % self.col
>
>     def alias(self):
>         return self._alias or (self.col + "_sum")
I'll just consider this one class. The comments apply to both.
You can't just deal with "col" as a string here, since there will also
be a table alias attached to make sure the column is selected from the
correct table. Just using "col" could very easily be ambiguous. For
example, what if I was summing the "id" values? There are many tables
with an "id" column that could be involved in the query.
As a general observation, there are multiple uses of the word alias
floating around here: table aliases and the alias used for the output
column. You might want to be very explicit and call the output alias
something clear like output_alias, since "alias" is used throughout the
Query class to refer to table aliases.
Have a look at how add_filter() works out both the alias and the column
name to use when it's constructing the where-clause entry (look at the
"entry" tuple in add_filter(), for example). You'll need to replace the
name that is passed in with a reference to the (right alias,
db_column_name) pair. Looking at the existing Count() class might give
you some starting points -- particularly how it deals with the two
parts of its "col" attribute, although that is not intended to be a
fully fledged count implementation, since it only counts the entire result
set. However, it shouldn't be too far away from what is needed and the
API should be mostly what is needed.
Any class like this has to implement relabel_aliases(). There are a
number of cases where the aliases might need to change after the
structure has been created, so you need that method. It's passed a
dictionary that maps the old alias name to the new name for any aliases
that are being changed. Again, the existing Count class will give you an
example of what happens there.
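Pulling those comments together, a revised Sum might look something like the sketch below. This is illustrative only: the (table_alias, column) pair, the output_alias naming, and the relabel_aliases() contract follow the discussion above, not the actual Django internals.

```python
class Sum:
    """Illustrative aggregate that stores a (table alias, column) pair
    rather than a bare column name, so that e.g. 'id' on two joined
    tables stays unambiguous, and that supports alias remapping."""

    def __init__(self, col, output_alias=None):
        # col is a (table_alias, column_name) pair, e.g. ('T1', 'id').
        self.col = col
        # Called output_alias to avoid confusion with table aliases,
        # which the Query class also calls "alias".
        self._output_alias = output_alias

    def as_sql(self, quote_func=None):
        quote = quote_func or (lambda name: name)
        table_alias, column = self.col
        return "SUM(%s.%s)" % (quote(table_alias), quote(column))

    def output_alias(self):
        return self._output_alias or ("%s_sum" % self.col[1])

    def relabel_aliases(self, change_map):
        """Remap the table alias when the query structure changes."""
        table_alias, column = self.col
        self.col = (change_map.get(table_alias, table_alias), column)

# Usage sketch:
s = Sum(('T1', 'id'), output_alias='id_total')
s.relabel_aliases({'T1': 'T3'})
print(s.as_sql())  # SUM(T3.id)
```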
[...]
>
> Aggregation/grouping across relationships doesn't work. I tried to
> figure that out, but totally hit the wall in the qs code.
This should Just Work(tm) once you start handling the mapping from input
attribute names to (alias, column) pairs correctly. It's not just five
minutes work to understand, but that's because it's pretty much the meat
of the whole query construction process.
Basically, the string you pass in the "name" parameter to setup_joins()
is something that looks like foo__bar__baz (in your current example, it
would be something like "price"). It's the normal Django notation for
referring to a field, either on the current model or on a related model
via the double-underscore connections. The setup_joins() method works
out which tables need to be joined so that you can query the right
columns to get to your information. Remember that the database table and
column names are not the same as the field names you pass in. So you
need to look at the "field" object you get back from setup_joins() to
work out the real column name.
The best way to get a feeling for what's going on there is to print out
what you get back from setup_joins() and poke around at those
structures. It's somewhat in flux, since I'm doing some heavy
optimisation work on that code at the moment, but what you learn now
will still be relevant in any changed version in a day or two. The most
likely change there is that the nested list of joins that is returned
from setup_joins() will be a flat list and there'll be a parameter that
says where the final table join started (each successive model that is
added to the filter could add one or two tables to the join list and we
need to know where the last model started sometimes). Anyway, I'm still
working on that, since the code now behaves correctly, so I can now
modify it to make a few things go faster.
Your change to the iterator() method in ValuesQuerySet looks ugly. The
attribute is called field_names, so it should contain the field names,
not either the field names or some structure that has to be poked at to
get the field names. My point is that modifications to handle these
non-string inputs should go in _setup_query(), rather than iterator().
Note there's a bug in _setup_query() and iterator() that I'll eventually
fix. I only mention it here in case you're wondering how it works at the
moment. If you pass in multi-model values() fields and also have
extra(select=...), the wheels fall off. Also, if you only select some of
the extra(select=...) fields in values(), the result is wrong. That's
easy enough to fix, though. Just a matter of whacking on _setup_query()
some more.
Also, ValuesQuerySets are going to grow a bit more error checking. Right
now,
Entry.objects.filter(....).values(...) |
Entry.objects.filter(...)
doesn't raise an error. It should, since the two halves are different
sorts of objects. It should also raise an error if you combine two
values() queries with different fields in them, most likely (or perhaps
merge the fields). That's still on the TODO list, so don't do that yet,
but eventually you'll need to test what happens when you try to combine
two querysets via "&" or "|". But I haven't implemented specific
versions of __and__ and __or__ on ValuesQuerySets yet.
Okay.. that's a lot to think about. If you really want to understand the
alias stuff and why it's relevant (or get a feeling for what the result
should look like), print out some of these queries
(my_queryset.query.as_sql()) to see what is being produced. If you have
a naked column -- that is, one without a table alias of some kind
attached -- it will lead to bugs down the track because it will be
possible to introduce that column name in more than one table in the
query and that isn't valid SQL.
Regards,
Malcolm
--
Why be difficult when, with a little bit of effort, you could be
impossible.
http://www.pointy-stick.com/blog/
So...
Django:
Buyer.objects.all().aggregate(age__avg='mean_age')
SQL:
select avg( age ) as mean_age
from buyer
Django:
Purchase.objects.all().values('identifier').aggregate(
    quantity__sum='num_adquisiitions',
    payment__count='num_payments',
    payment__avg='average_earnings')
SQL:
select identifier,
sum( quantity ) as num_adquisiitions,
count( payment ) as num_payments,
avg( payment ) as average_earnings
from purchase
group by identifier
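The grouping semantics of that second query can be mimicked in plain Python, which may help pin down what the ORM is expected to produce (the purchase rows are made-up sample data; the output alias names are taken from the example above):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical purchase rows: (identifier, quantity, payment)
purchases = [
    ('A', 2, 100),
    ('A', 3, 200),
    ('B', 1, 50),
]

# Equivalent of: SELECT identifier, SUM(quantity), COUNT(payment),
#                AVG(payment) FROM purchase GROUP BY identifier
rows = sorted(purchases, key=itemgetter(0))
result = []
for identifier, group in groupby(rows, key=itemgetter(0)):
    group = list(group)
    payments = [p for _, _, p in group]
    result.append({
        'identifier': identifier,
        'num_adquisiitions': sum(q for _, q, _ in group),
        'num_payments': len(payments),
        'average_earnings': sum(payments) / len(payments),
    })

print(result)
```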
"Magic" is an emotive and terribly ill-defined term. It has no place in
a technical discussion.
> 3) It overloads __ for yet another purpose
>
>
> __ is already used to traverse relations and to specify query
> operators, and now it'll be used for functions too. What happens when
> these uses collide?
Your argument goes right off the rails here. We already use double
underscores to separate fields and functions -- the lookup types. The
name clash potential is actually a non-issue. The last item in a
sequence is the lookup type (and aggregates just become another sort of
lookup type). If the last item isn't a valid lookup type, we append
__exact to it for filtering purposes. So if you have a field called
"gte" in your model, you can still filter on "foo__gte__exact=6", etc.
The same disambiguation would apply in this case.
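The disambiguation rule described here can be sketched as a tiny parser; the lookup and aggregate sets below are abridged and the function is purely illustrative, not Django's actual resolution code:

```python
LOOKUP_TYPES = {'exact', 'gte', 'lte', 'contains'}  # abridged
AGGREGATES = {'sum', 'avg', 'count', 'min', 'max'}

def parse_lookup(term):
    """Split a filter term into (field path, lookup type). If the last
    piece isn't a recognised lookup or aggregate, treat it as a field
    and default to 'exact' -- the rule that lets a model field named
    'gte' still be filtered on via foo__gte__exact."""
    parts = term.split('__')
    if len(parts) > 1 and parts[-1] in LOOKUP_TYPES | AGGREGATES:
        return parts[:-1], parts[-1]
    return parts, 'exact'

print(parse_lookup('foo__gte'))         # (['foo'], 'gte')
print(parse_lookup('foo__gte__exact'))  # (['foo', 'gte'], 'exact')
print(parse_lookup('sum__sum'))         # (['sum'], 'sum')
```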
Malcolm
--
Plan to be spontaneous - tomorrow.
http://www.pointy-stick.com/blog/
If you go back to the original discussions (almost 2 and a half years
ago), you'll see why this was rejected. In short, the argument goes
something like this:
No - you don't want to do group_by. You want to achieve some logical
goal, which SQL implements using GROUP BY syntax.
Django isn't in the business of redefining SQL syntax - if we tried to
do this, we would fail. However, Django _is_ in the business of
identifying common object-based use-cases, and building an elegant
syntax for representing those use-cases.
One common logical goal is to evaluate aggregates - Calculating
averages, sums, etc of collections of objects. You obviously have
another use case. If you can quantify that use case in a general way,
and express it as an operation on an object (or group of objects), you
may have the basis for a proposal for an extension to the Django ORM.
Of course, there will always be a subset of valid and useful SQL
queries that can't be distilled into a broader use-case; in those
cases, there will always be a need to drop to raw SQL. I believe that
the QS-refactor will make writing raw SQL easier, but that isn't the
same as 'just exposing group_by()'.
Yours,
Russ Magee %-)
You may not have _all_ the core developers attention, but _some_ of
them are listening - I'm definitely reading everything, and responding
when I get a chance; Malcolm is evidently reading too.
This doesn't mean that aggregates are a low priority. They are
certainly a high priority for me. I'm sure Adrian and Jacob have
noticed the discussion, and they will have also noticed that Malcolm
and I are involved. They probably won't say anything until Malcolm or
I ask them for an official judgement, or until they see something that
Malcolm or I are supporting that offends their sensibilities.
I haven't been as responsive as I was hoping to be on this thread -
I'm in the process of doing a project handover at work, and the
weekend was hectic for me personally; with luck, I should be able to
give you some more in-depth feedback on your proposal in the next day
or so.
> 1) My first, and main, objection to the syntax is that it's simply
> unnecessary
>
> 2) It's magic. Again, even though the filter syntax is a little magical,
> but it was necessary.
>
> 3) It overloads __ for yet another purpose
I disagree. The usage of the aggregate syntax is no different to the
filter syntax in rendition or interpretation, except that the
'functions' are aggregates rather than comparators. '__' is used for
join specification, or function specification in the terminal case.
The potential for name space clashes is almost nil, since the last
clause in the query term must be the aggregate function. The sum over
a field called sum is sum__sum; the first sum is the field name, the
second is the sum function specifier.
> 4) It adds more logic to QuerySet/Query and may be hard to extend.
This is actually one of the aspects of your proposal that I
particularly like. In particular, it provides an entry point for
complex aggregate cases. The example that springs to mind is
aggregates with complex group_by clauses that work over multiple table
joins.
We may not want to provide out-of-the-box support for these complex
cases in the form of a native operator or syntax, but it would be
helpful to be able to provide a clean mechanism for end users to
define 'custom aggregates' in a similar fashion to custom Q objects.
It also affords all sorts of aggregate math opportunities, as you have
previously indicated.
Yours,
Russ Magee %-)
Keep in mind that an important part of API usability is consistency
with existing APIs. Django queries have a very distinct flavour.
Wherever possible, extensions and improvements to Django queries
should reflect that flavour. You obviously are not as enamoured with
that flavour as the core devs (which includes myself), but we're not
going to abandon it or contradict it without a very good reason.
> Could you provide some examples? Once I got alias remapping working a huge
> amount of things started Just Working(tm), like Malcolm said would happen.
> Grouping over multiple tables is one of those. I don't think that's unique
> to my proposal though, sql.Query pretty much just does it.
Ok; here's the problem space that got me interested in Django
aggregates in the first place. It's complex enough to require table
joins and some potentially interesting groupings.
A Project consists of Milestones.
A Milestone consists of a list of tasks.
A Task is assigned to a User, and contains an initial estimate (in
hours) of the effort required, and a boolean flag that indicates when
the task is complete.
Effort (measured in hours) can be logged against a given task by any given User.
Now, given that multi-table setup, here are a series of questions to
answer, roughly increasing in complexity:
- Tell me how much effort has been expended by User U?
- Give me a list of tasks, with the total effort that has been spent
on each task
- Give me list of tasks that have exceeded their initial effort estimate
- Give me a list of milestones with the amount of effort outstanding
on each milestone
- Give me the total effort that has been expended on project P
- How much effort is outstanding on project P
- Calculate the average discrepancy (in absolute and in relative
terms) between task estimate and expended effort for completed tasks.
- Calculate the average discrepancy between task estimate and expended
effort on a per-milestone basis
This isn't an exhaustive list, but it should give you an idea of the
sort of things that might be required; you may need to group by
Milestone, Task, or User, in conjunction with queries that span the
same space.
It may not be possible to express some of these problems with a simple
syntax. While it would be nice to have a syntax that could hit all
these queries, I don't necessarily consider them to be strict
acceptance criteria. However, a syntax that allows for easy user
extension would allow me to write a helper aggregate class that
_could_ hit the extreme cases.
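To pin down the expected semantics, a couple of those questions worked through in plain Python over made-up data (users, tasks, and hours are all hypothetical; this is just a reference for what the ORM queries should return):

```python
from collections import defaultdict

# Hypothetical work-log rows: (user, task, hours)
work_log = [
    ('alice', 'task1', 3), ('alice', 'task2', 5),
    ('bob',   'task1', 2),
]
# Hypothetical initial estimates, in hours: task -> estimate
estimates = {'task1': 4, 'task2': 4}

# "Tell me how much effort has been expended by User U?"
def effort_by_user(user):
    return sum(h for u, _, h in work_log if u == user)

# "Give me the list of tasks that have exceeded their initial estimate"
def tasks_over_estimate():
    spent = defaultdict(int)
    for _, task, hours in work_log:
        spent[task] += hours
    return sorted(t for t, e in estimates.items() if spent[t] > e)

print(effort_by_user('alice'))  # 8
print(tasks_over_estimate())    # ['task1', 'task2']
```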
> One thing to think about is that expressions, being more open-ended, may be
> easier to abuse. They'll certainly require very good error checking,
> otherwise there will be times that the queryset is constructed without
> complaint, but the query fails when executed. This might not be any
> different from the situation with Q objects though.
We can't prevent dedicated people from making mistakes. However, we
can make sure that the APIs we provide have a single obvious
interpretation. Malcolm wrote a good blog post recently referencing
Rusty Russell's guidelines for API design - it's well worth a read.
Yours,
Russ Magee %-)
This is me noticing :)
As a point of order, assuming that no responses == no attention is a
bit of a stretch. I read every single word on django-developers, but
only reply where I'm going to help the discussion. In this case Russ
has thought about this issue far more deeply than me; the chance that
I'll disagree with his judgement is nearly nil.
Jacob
[For the record, of course, I agree with everything Russ has said so far :)]
Could you please clarify what you mean by this - I can see this as
being either very good, or awful.
The implication of your examples is that annotate/aggregate won't work
unless you provide a values() object. If this is what you are
intending, I see this as a significant weakness. I can't see any
reason that values() should be required.
My concept for aggregates has always been that the GROUP BY clause
would include all non-aggregates fields in the SELECT. This means that
in the default case:
>>> Model.objects.annotate(field1__sum='total')
What you get back is a list of model instances that have an extra
annotated attribute (in this case, total). You have the flexibility to
use the objects as you would any other model instance, except that you
have the summary data coming along for the ride. This sample query
would return _all_ objects, annotated, since it isn't otherwise
filtered or restricted, and the GROUP BY clause would (by necessity)
include all the fields on the model.
This does open an interesting intersection with the way values()
operates. When you use a values() clause, you are restricting a query
to a subset of fields:
>>> Model.objects.values('field2','field3').annotate(field1__sum='total')
This restricts the fields returned by the SELECT, and it would make
sense to reflect this in the GROUP BY clause. The returned values in
this case aren't annotated model objects, since there won't
(necessarily) be 1-1 correspondence between the elements in the
returned data and the underlying model data, so the dictionary output
is appropriate. This strikes me as an elegant coverage of the 80% case
for wanting an explicit GROUP BY clause.
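Again as a plain-Python sketch (illustrative data, not Django code): restricting the grouping key to the values() subset naturally yields dictionaries rather than annotated instances.

```python
from collections import defaultdict

# Rows as (field2, field3, field1) tuples; the grouping key is exactly
# the subset of fields named in values().
rows = [("a", 1, 10), ("a", 1, 5), ("b", 2, 3)]

sums = defaultdict(int)
for field2, field3, field1 in rows:
    sums[(field2, field3)] += field1

# Dictionary output: no 1-1 correspondence with model instances.
result = [{"field2": k[0], "field3": k[1], "total": v}
          for k, v in sorted(sums.items())]
print(result)
# [{'field2': 'a', 'field3': 1, 'total': 15}, {'field2': 'b', 'field3': 2, 'total': 3}]
```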
> I will open a ticket with the proposed syntax and the respective
> patches.
On this note - ticket #3566 has been used for a long time to track the
development of aggregates. There is no need to open a new ticket -
many interested eyeballs are focussed on the original ticket, so that
is the best place to keep discussion.
Yours,
Russ Magee %-)
I don't completely agree that values() is the right place. Firstly, it
ignores the difference between annotate() and aggregate() that I
mentioned earlier. Secondly, it removes the ability to calculate
aggregates, but not include them in the return set. As it currently
stands, values() can be used to reduce the return set to those columns
in which you have an interest; if you overload values() to include
aggregates as well, you won't be able to exclude aggregates from the
result columns.
> > - Tell me how much effort has been expended by User U?
> >>> u.work_log.values(Sum('hours'))
This highlights my earlier point: values() returns a ValuesQuerySet -
a list of dictionaries. What gets returned in this example?
> > - Give me a list of milestones with the amount of effort outstanding
> > on each milestone
> >>> Milestone.objects.values('id', hours_left=Sum('task__estimate') -
> Sum('task__work_log__hours'))
This actually starts to point out a flaw in your proposed syntax - or
at least, an API weakness that we won't be able to avoid. kwargs need
to come after args when you call a Python function, so any aliased
aggregates will need to come later in the argument list. While this
won't break anything, my Magic 8-ball predicts many mailing list
discussions wondering why argument-ordering exceptions are being
thrown from what looks like a perfectly legal aggregate query.
Using the underscore approach, _all_ aggregate specifiers are
kwargs, so you can define them in whatever order makes sense.
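Python itself enforces the ordering trap being described, so it can be shown without any Django machinery (the query text below is purely illustrative):

```python
# A positional argument after a keyword argument is a SyntaxError at
# compile time - exactly the failure mode predicted above.
src = "Milestone.objects.values('id', hours_left=Sum('hours'), 'name')"
try:
    compile(src, "<example>", "eval")
    ordering_ok = True
except SyntaxError as err:
    ordering_ok = False
    print("SyntaxError:", err.msg)
```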
> I hope I interpreted those right.
To a cursory examination, yes.
Some overall comments:
1) For all the spruiking of a "cleaner, more Pythonic format", it looks
very brackety and messy to me. Filter clauses are much easier to read.
However, I acknowledge that this is somewhat a matter of personal
taste (although the taste of the core devs will, in general, carry a
little more weight).
2) The ability to do math on aggregate clauses is quite cool. However,
the types of math that are likely to be required will be fairly
consistent on a per-model basis. I'm wondering if the right mechanism
by which to achieve this is to provide a way to define new filter and
aggregate clauses on the QuerySet.
For example, you can define a custom manager that defines a foobar()
method to provide Model.objects.foobar() as an API entry point.
Managers provide a way to override the default QuerySet; if we extend
the filter and aggregate syntax so that '__foobar' in a filter looks
for filter_foobar() as a method on the queryset, or aggregate_foobar()
if used in an aggregate; these then expand out into whatever SQL might
be required.
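A minimal, purely illustrative sketch of that dispatch idea (class names, method names, and the custom lookup are all hypothetical, not Django API):

```python
class SketchQuerySet:
    """Toy stand-in for a QuerySet that resolves '__foobar' suffixes."""

    def filter(self, **kwargs):
        fragments = []
        for key, value in kwargs.items():
            # Split 'field__lookup' on the last '__'.
            field, _, lookup = key.rpartition("__")
            handler = getattr(self, "filter_" + lookup, None)
            if handler is not None:
                # Custom lookup: expand into whatever SQL is required.
                fragments.append(handler(field, value))
            else:
                fragments.append("%s = %r" % (key, value))
        return fragments


class CustomQuerySet(SketchQuerySet):
    def filter_near(self, field, value):  # hypothetical custom lookup
        return "distance(%s, %r) < 10" % (field, value)


print(CustomQuerySet().filter(location__near=(0, 0)))
# ['distance(location, (0, 0)) < 10']
```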
3) We still haven't hit a use case where the aggregate has unusual
grouping requirements. My original thought is that end users could be
writing custom Aggregate clauses that ultimately turn into:
MyAggregate('fieldname', group_by='field2', other='some argument')
However, given that I can't come up with a use case, this is starting
to feel like a big YAGNI to me.
Yours,
Russ %-)
Perhaps I've just read over it, but how do you plan to handle
non-numeric types like datetime? In MySQL you have to use something
like "FROM_UNIXTIME(AVG(UNIX_TIMESTAMP(today)))", but how should that
be handled in the ORM, especially as it's backend-dependent?
Regards
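One backend-neutral way to think about the datetime-averaging question above, sketched in plain Python rather than per-backend SQL (this illustrates the semantics only, not a proposal for the ORM internals): average the offsets from some base datetime, which mirrors the UNIX_TIMESTAMP round-trip without depending on any one backend's functions.

```python
import datetime

stamps = [
    datetime.datetime(2008, 1, 1, 12, 0),
    datetime.datetime(2008, 1, 3, 12, 0),
]

# Average the timedeltas from the earliest value, then add the mean
# offset back - the pure-Python analogue of AVG over timestamps.
base = min(stamps)
avg = base + sum((dt - base for dt in stamps), datetime.timedelta()) / len(stamps)
print(avg)  # 2008-01-02 12:00:00
```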
Sorry -- I missed page 2. So GROUP BY and similar things will be
supported through an Aggregates base class?
I haven't given it any specific thought, but it stands to reason that
the eccentricities of any given backend should be abstracted away.
Part of this design process is to identify the crusty edge cases that
we need to handle. Thanks for the pointer.
Yours
Russ Magee %-)
And yet, the example you give is a count on related objects.
> SELECT my_table.* FROM my_table JOIN my_other_table WHERE
> my_other_table.my_table_id = my_table.id GROUP BY my_table.id HAVING COUNT(1) >
> 1;
>
> How do you ever hope to achieve something like this in the ORM without
> patches like mine?
Lets see if I can make myself clearer by using your example.
You have provided some sample SQL. However, that wasn't the start
point. Once upon a time, you had a problem you needed to solve: "I
need to find all the MyTable objects that have more than one related
OtherTable objects". Notice that this question doesn't use the phrase
"GROUP BY".
Django's ORM didn't support aggregates, so you hand crafted some SQL,
and since aggregates were involved, you needed to use a GROUP BY
clause. This is because GROUP BY is the way SQL handles aggregation.
Now, we're talking about adding aggregates to Django. So, we need to
find a way to express your original question as Django syntax. The
solution to this problem isn't "add GROUP BY to Django Syntax" - it's
to find an elegant way to represent the underlying problem as an
object-based syntax. Based on the discussions so far, something like
this would seem appropriate:
MyTable.objects.aggregate(othertable__count='related_count').filter(related_count__gt=1)
Notice that this query doesn't require the use of the phrase GROUP BY
in the ORM syntax. When this hits the database, GROUP BY will
certainly be required. However, the terms for the GROUP BY can be
entirely implied from the rest of the query. There is no need to
expose GROUP BY into the ORM syntax.
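To illustrate that the GROUP BY terms can be implied rather than spelled out, here is a crude sketch (the aggregate test is deliberately simplistic, and the column list is hypothetical): every non-aggregate column in the SELECT becomes a grouping term.

```python
# Columns the ORM would select for the MyTable query above.
select = [
    "my_table.id",
    "my_table.name",
    "COUNT(my_other_table.id) AS related_count",
]

# Naive check: anything starting with an aggregate function is excluded;
# everything else is implied into the GROUP BY.
AGGREGATES = ("COUNT(", "SUM(", "AVG(", "MIN(", "MAX(")
group_by = [col for col in select if not col.startswith(AGGREGATES)]
print(group_by)  # ['my_table.id', 'my_table.name']
```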
Your original question wasn't "How do I perform a GROUP BY over a
MyTable" - it was a real-world question that needed an object-based
answer. The goal of the Django ORM is to solve your _actual_ problem,
not your SQL problem.
> My main argument is there's no reason to *not* include the
> functionality except that a few select people say "we dont need it".
There is a very good reason, which I have told you several times. The
Django ORM does not provide an alternate syntax for SQL. It provides a
way of interrogating an object model that happens to be backed by a
database. As a result, the external APIs are not SQL specific - they
encompass broader object-based use-cases. "Just adding GROUP BY"
doesn't achieve any of the Django design goals.
Now, if you can present a use case that isn't satisfied by the
discussions we are currently having - that is, a usage of GROUP BY
that steps outside the bounds of what could be implied from other
parts of an ORM query - let us know. We want to be able to cover as
many potential queries as possible. However, even then, I doubt that
the solution will be "add GROUP BY" - there will be another class of
object-based queries that we haven't considered, for which we will
need a new _object based_ syntax.
Yours,
Russ Magee %-)
And now we finally reach your actual use case - the fact that you need
to issue queries that produce non-distinct results for each object
instance in the result set. Any given instance of MyModel in your
database will return N instances in the result set, where N is the
number of m2m related objects associated with the instance. To me,
this seems like a variant on the annotation use case. Might I suggest
the following:
MyModel.objects.annotate(some_kind_of_manytomany__blah__extra='extra_data')
I don't think __extra is the best keyword to use here, but for the
moment it will do as a placeholder. The idea is to define an annotated
output set - but rather than annotating with a computed aggregate,
you're annotating with data from related models. Since the extra_data
annotations aren't (necessarily) unique for an instance, there will be
N instances of each MyModel instance in the result set, with each
instance in the result set annotated with a unique joined value from
the m2m relation. The GROUP BY can be implied as the list of fields in
MyModel, plus any __extra annotations.
The naming of __extra notwithstanding, how does this sound?
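In plain-Python terms (purely illustrative, with hypothetical data), the N-rows-per-instance result set described above would look like this:

```python
# Two model instances, each with some number of m2m-related values.
instances = [{"pk": 1, "name": "a"}, {"pk": 2, "name": "b"}]
related = {1: ["x", "y"], 2: ["z"]}  # m2m values keyed by pk

# Each instance is repeated once per related value, annotated with it;
# the implied GROUP BY is the model fields plus the __extra column.
result = [dict(obj, extra_data=val)
          for obj in instances
          for val in related[obj["pk"]]]
print(len(result))  # 3 rows for 2 instances
```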
> If you tag on anything for an additional select that distinct can't
> handle, you need group by.
Like I keep saying - no, you don't need GROUP BY. GROUP BY is a means
to an end, not the only way of expressing the concept. What is needed
is a mechanism in the ORM that can represent the underlying use case
that necessitates the use of GROUP BY in the SQL query that is
generated.
> Btw, your example doesn't work, having happens after the query
> finishes executing, not within the where clause. That specific use
> case *has* to execute exactly as shown.
Which bit doesn't work? As far as I can make out, there isn't anything
ambiguous about the query I gave you - the ORM is given a bunch of
filter statements, and can make a decision about whether they are
WHERE or HAVING statements based on the GROUP BY that is implied by
the original object query.
Yours,
Russ Magee %-)
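The WHERE-versus-HAVING decision described in the preceding message can be sketched in a few lines of plain Python (the names are illustrative, not Django internals): any filter that references an aggregate alias goes to HAVING, everything else to WHERE.

```python
# Aliases defined by the aggregate() part of the query.
aggregate_aliases = {"related_count"}

def classify(filters):
    """Split filter kwargs into WHERE and HAVING groups."""
    where, having = [], []
    for key, value in filters.items():
        field = key.split("__")[0]
        (having if field in aggregate_aliases else where).append((key, value))
    return where, having

print(classify({"name__startswith": "a", "related_count__gt": 1}))
# ([('name__startswith', 'a')], [('related_count__gt', 1)])
```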
Except that values() doesn't currently return 'just values' - it
returns a list of values, notionally tied to object instances.
Top-level aggregates won't ever be a list, and instance-based
(annotated) aggregates will always be associated with an instance.
> I hadn't really considered this, but when would you want to calculate a
> value and not include it? If it's for query purposes I'd think that a good
> way to handle it would be to add expression support directly to filter() so
> that you could say queryset.filter(field1__lte=Avg('field1')) and not have
> the aggregate in the results. Or do something similar with a Q-like object.
The simplest case I can think of is that you may want to filter on an
aggregate, but not include it as part of the result set because you
don't want to dirty the attribute list of the instances being
returned.
> > > > - Tell me how much effort has been expended by User U?
> > > >>> u.work_log.values(Sum('hours'))
> >
> > This highlights my earlier point: values() returns a ValuesQuerySet -
> > a list of dictionaries. What gets returned in this example?
> >
> It's still a ValuesQuerySet, but because there's no group by clause it's
> only one item:
> [{'hours_sum': 7.5}]
Seriously - this sounds like a good idea to you? A list that will only
ever contain a single dictionary? I think I can safely put on my core
developer hat and say "This idea is a non-starter".
> It's possible to set result_type=SINGLE where there's an aggregate but no
> group by, but I think it'd be better for values() to always return the same
> type.
Well, sure. Then, we could put a result_type=SINGLE into filter(), and
we could get rid of get().
Or, we could use a different clause, and not have to worry about
remembering the result_type argument, importing the SINGLE constant,
or debugging the different types of output that could be returned. :-)
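A trivial sketch of the distinction being argued here: with no grouping key there is exactly one row, so a plain dict (rather than a one-item list) is the natural return type for a separate aggregate() clause.

```python
# Hypothetical work-log hours for one user; no grouping key is involved.
hours = [2.5, 3.0, 2.0]

# aggregate() would return the single summary row directly as a dict.
result = {"hours_sum": sum(hours)}
print(result)  # {'hours_sum': 7.5}
```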
> > > > - Give me a list of milestones with the amount of effort outstanding
> > > > on each milestone
> > > >>> Milestone.objects.values('id', hours_left=Sum('task__estimate') -
> > > Sum('task__work_log__hours'))
> >
> > This actually starts to point out a flaw in your proposed syntax - or
> > at least, an API weakness that we won't be able to avoid...
> > ...my Magic 8-ball predicts many mailing list
> > discussions wondering why they are getting argument ordering
> > exceptions getting thrown from what looks like a perfectly legal
> > aggregate query.
>
> I don't know if I agree with that concern so much, but your 8-ball is surely
> better tuned than mine for this type of thing. The SyntaxError in that case
> is pretty clear, and users can be kindly reminded that they are still
> writing Python.
The problem is that a non-trivial number of Django newcomers don't
think of Django as a Python library - they think of it as a language
unto itself. Case in point - the number of questions asked on
Django-users that can be answered by pointing to the Python docs.
Setting up APIs that can raise errors based upon non-adherence to
Python doctrine is the sort of thing that sets off alarm bells for me.
Yours,
Russ Magee %-)