I didn't see anything about aggregate support in VersionOneFeatures[1], are people still hung up about the syntax, or is anyone hacking on this? It's been mentioned a couple times since PyCon, but what's the status?
the queryset refactoring must (well, should) happen before aggregation support. After the refactoring, it should be relatively easy to put in the aggregations.
I am currently finishing one project, it should be over in a couple of days. If I don't get a new one I may finally have some time to do some work on it. But I cannot promise anything, so if you have the time, go for it.. ;)
> I didn't see anything about aggregate support in > VersionOneFeatures[1], are people still hung up about the syntax, or > is anyone hacking on this? It's been mentioned a couple times since > PyCon, but what's the status?
> I didn't see anything about aggregate support in > VersionOneFeatures[1], are people still hung up about the syntax, or > is anyone hacking on this? It's been mentioned a couple times since > PyCon, but what's the status?
Many proposals have been made on this feature, but nothing has been decided upon. For the record, here's my last attempt at a proposal:
I'm _very_ keen to get aggregates into Django - it was one of the original reasons I started contributing to the project, but magic-removal, the test frameworks, fixtures, and the 0.96 push took priority.
Implementing this feature is also waiting on a QuerySet rewrite. I believe Malcolm said he was working on this, but I don't know how far he has got.
On 4/13/07, Honza Král <honza.k...@gmail.com> wrote:
> the queryset refactoring must (well, should) happen before aggregation > support. After the refactoring, it should be relatively easy to put in > the aggregations.
On 4/13/07, Honza Král <honza.k...@gmail.com> wrote:
> the queryset refactoring must (well, should) happen before aggregation > support. After the refactoring, it should be relatively easy to put in > the aggregations.
Ok; I've had a chance to look at this now. Although there are some syntax differences, I think your idea and mine [1] are acutally pretty close. A bit of both, and we might just have something.
Comments:
GROUP BY = values() ~~~~~~~~~~~~~~~~~~
Isn't the group_by clause duplicating what is already provided by the values() filter? i.e., saying 'return me a list of dictionaries of object data that isn't a full object'? When the aggregates clause is in use, a GROUP BY over the selected fields is implied. If you don't provide a values clause, then you want to GROUP BY all the fields in the model
To me, the clause is returning a list of data groups (i.e., the results of the values() clause); putting the grouping data in a group_by dictionary seems overkill. This approach does introduce some new magic-words that are not allowed as field names (or are at least dangerous if you do). Personally, I don't see that as a major problem.
Weaknesses in the function=tuple argument approach ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
While the function=tuple of field names approach does work, I feel it is lacking: - You can't use aliases: SUM(field) as "total_pay" - Filtering on aggregated values is difficult (impossible?) - It doesn't handle joins in the same way as filter()
The approach I proposed was to mirror the filter syntax: pay__sum='total_pay' other_field__average='some_alias'
This approach mirrors the filter mechanism for attribute naming and joins, allows for aliases, which provides a mechanism to filter on aliases.
The only downside I can see is that the simple case requires the user to provide an alias - i.e., pay__sum='pay_sum'. I'll admit that this isn't particularly elegant.
However, it should be possible to combine both approaches:
This means the sum of pay isn't aliased, so it can't be filtered, but you can filter on 'mean_value'. Argument clashes aren't a problem either - 'sum' by itself is unambiguous as an argument; any filter-like argument will need to have at least a '__' in it.
Ambiguities in COUNT ~~~~~~~~~~~~~~~~~~~~
aggregate(count=True) is already covered with count(); I'm not sure I'm wild about the duplication. There is also the problem that it precludes counting other things: e.g., counting related objects - return the number of books that each author has written.
Author.objects.aggregate(books_count='number_of_books_written') or Author.objects.aggregate(count=('books'))
On top of this, you can get the gross count from:
len(Author.objects.aggregate(...))
Aggregate of results vs aggregates of related objects ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you run an aggregate over a field from the table you are querying, the aggregate is a table level property, not a row level property. Compare the expected results from:
- Get me the date of the earliest and latest published article: Return a single value for the min, and a single value for the max. - Annotate each author with the date of their earliest and latest published article: Return a min and max value for each author.
while legal, isn't the most elegant way at getting at a unique min/max for an entire table. This was the reason for my suggestion for two functions: aggregate and annotate. Aggregate returns a dictionary of just aggregated values:
Whereas annotate() returns a list of annotated dictionary data.
================================
I'd better stop now - this is getting longer than I originally anticipated, and if it gets much longer nobody will read it :-) As always, feedback welcome.
On 4/13/07, Russell Keith-Magee <freakboy3...@gmail.com> wrote:
> On 4/13/07, Honza Král <honza.k...@gmail.com> wrote: > > the queryset refactoring must (well, should) happen before aggregation > > support. After the refactoring, it should be relatively easy to put in > > the aggregations.
> Ok; I've had a chance to look at this now. Although there are some > syntax differences, I think your idea and mine [1] are acutally pretty > close. A bit of both, and we might just have something.
great, thanks for your comments. I attached my answers in the text
> Comments:
> GROUP BY = values() > ~~~~~~~~~~~~~~~~~~
> Isn't the group_by clause duplicating what is already provided by the > values() filter? i.e., saying 'return me a list of dictionaries of > object data that isn't a full object'?
from a technical view - YES, from user's view: NO that should be two different things, only the same format of returned data doesn't seem to be enough to merge these two functions
When the aggregates clause is
> in use, a GROUP BY over the selected fields is implied. If you don't > provide a values clause, then you want to GROUP BY all the fields in > the model
if you want to go with values() (which I am personally against), why not do:
qset.values('field', 'field2', 'field3__sum', 'f4__count', ...) < = > SELECT SUM(field3), count(f4) GROUP BY field, field2
I still don't like it, but seems better, and would solve issues with trying to refer to the aggregated value -- order_by('-f4__count','field'), see below
> To me, the clause is returning a list of data groups (i.e., the > results of the values() clause); putting the grouping data in a > group_by dictionary seems overkill. This approach does introduce some > new magic-words that are not allowed as field names (or are at least > dangerous if you do). Personally, I don't see that as a major problem.
I do, I just don't like constraints on naming, it just feels wrong, when there is a simple way around this. it would also allow for easier generic processing of the results (for example when building a general statistics app, we will know without inspecting the model, which fields were in the GROUP BY etc. -- explicit is better than implicit ;)
> Weaknesses in the function=tuple argument approach > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> While the function=tuple of field names approach does work, I feel it > is lacking: > - You can't use aliases: SUM(field) as "total_pay" > - Filtering on aggregated values is difficult (impossible?)
true, this is a problem for anything I am presenting
you could use something like order_by('sum(field3)'), but that seems weird... I outlined one solution in the values() part, but I am not happy with that either...
> - It doesn't handle joins in the same way as filter()
true, but it should, that's why I want to keep the arguments as simple string containing only the field name (including optional foreign__or__other__key__lookups), so same logic could be used for filters and these
> The approach I proposed was to mirror the filter syntax: > pay__sum='total_pay' > other_field__average='some_alias'
if you go with __sum etc., you don't need aliases, you can use this name (field__sum) as a reference to the field:
qset.filter( field__contains='AA' ) # results in a WHERE clause qset.filter( field3__sum__lte=23 ) # results in a HAVING clause qset.order_by( '-field3__sum', 'field' )
> This approach mirrors the filter mechanism for attribute naming and > joins, allows for aliases, which provides a mechanism to filter on > aliases.
> The only downside I can see is that the simple case requires the user > to provide an alias - i.e., pay__sum='pay_sum'. I'll admit that this > isn't particularly elegant.
> However, it should be possible to combine both approaches:
this is inconsistent, there should imho be just one way for this...
> This means the sum of pay isn't aliased, so it can't be filtered, but > you can filter on 'mean_value'. Argument clashes aren't a problem > either - 'sum' by itself is unambiguous as an argument; any > filter-like argument will need to have at least a '__' in it.
> Ambiguities in COUNT > ~~~~~~~~~~~~~~~~~~~~
> aggregate(count=True) is already covered with count(); I'm not sure > I'm wild about the duplication.
well, but what if you need number of articles per author? existing .count() wouldn't cut it (in one select) and once you are doing aggregation anyway, its simple to throw in COUNT(*)... so if you a) change .count() to behave like .aggregate(count=True), it would break code b) don't allow count in aggregation, you will have more queries for no good reason. I see .count() as a very useful shortcut for .aggregate(count=True)['count']
There is also the problem that it
> precludes counting other things: e.g., counting related objects - > return the number of books that each author has written.
> Author.objects.aggregate(books_count='number_of_books_written') > or > Author.objects.aggregate(count=('books'))
> On top of this, you can get the gross count from:
> len(Author.objects.aggregate(...))
I see, we obviously don't understand each other: I don't want count=True to add count of groups produced by group by, I want it to add information on the group's size: SELECT autor_id, COUNT(*), AVG(rating) FROM articles GROUP BY article_id; which would return me list of authors with number of published articles and their average rating
> Aggregate of results vs aggregates of related objects > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> If you run an aggregate over a field from the table you are querying, > the aggregate is a table level property, not a row level property. > Compare the expected results from:
> - Get me the date of the earliest and latest published article: Return > a single value for the min, and a single value for the max. > - Annotate each author with the date of their earliest and latest > published article: Return a min and max value for each author.
I don't understand this at all: you only asked for aggregates, you didn't specify group_by so the SELECT should look something like:
SELECT MIN(pub_date) as earliest, MAX(pub_date) as latest FROM articles; there is no way this would return more that one row
and why would it magically include 'title' in the result ? I am confused again.
> while legal, isn't the most elegant way at getting at a unique min/max > for an entire table. This was the reason for my suggestion for two > functions: aggregate and annotate. Aggregate returns a dictionary of > just aggregated values:
> Whereas annotate() returns a list of annotated dictionary data.
but those two are the same thing, the aggregate just has an empty group by clause, so it groups the entire table and is missing the 'group_by' field in the result (or any other fields except the aggregates you asked for)
> ================================
> I'd better stop now - this is getting longer than I originally > anticipated, and if it gets much longer nobody will read it :-) As > always, feedback welcome.
Over the weekend, I tried responding to our last message inline, but my response kept getting long and confused. So, I've summarized my thoughts. I have three problems with your proposal:
- Your approach is very group/SQL-centric, rather than being object/goal-centric. I would expect that the two most common groups are 'all columns in the table' and 'no columns in the table'. Using your syntax, the first case is very cumbersome (you have to manually name every column you want to see in the output), and the second case yields verbose output (you have to go 2 levels into a dictionary to get a simple column average).
- I accept that avoiding name clashes is a desirable (and arguably essential) goal. However, I feel that your output syntax is longwinded for the simple case (summary of a single table, no groups), and not particularly expressive for the complex case (summary of joined tables). I don't buy the 'generic summary statistic app' argument; mostly because I don't see the use case for such an application, but also because summary statistics for joined tables will require the use of __ syntax in the result dictionaries, which isn't as immediately expressive as aliased names.
- While your syntax is very good for generating groups and (aggregates of those groups) on a single table, I don't see it extending well into joins - especially filtering on joins. Using __sum__lte as you suggest is a problem for filters, because there is the potential for a name clash between a model field called sum and the desire to use a sum aggregate.
However, an example is worth a thousand words. Here are some of what I would consider common use cases that I don't see how your syntax addresses. Keep in mind that in each case when I ask for an author, the group is the full list of attributes; I would like to get the full set of object attributes list of objects along with the aggregates for each object in a single query rather than requiring one query to get the objects and one to get the aggregates.
R>>> is the syntax for queries using my proposed syntax. H>>> is your syntax, as I understand it. In most cases, I can't see how your syntax reaches the use case. Please correct me if I have misunderstood, or if there is a better way to use your syntax.
1) Get a list of authors, with the number of articles written by each author: H>>> Author.objects.aggregate(group_by=('first_name','last_name',... 'phone_number','address','email'), count=True)
8) What is the average rating for authors called 'George'? What about 'Fred'? H>>> Author.objects.aggregate(group_by=('first_name'), average=('article__rating'))
Out of these, I feel your syntax probably has the advantage in case (7), but getting the output from your results requires two levels of dictionary lookup, whereas only 1 is required in my case.
In addition to seeing you fill in these use cases with your syntax, I would be interested in any use cases where you feel that your syntax has the advantage. My use cases are driven by the way in which I currently use (or would like to use) aggregates; If your use cases are significantly different, I'd be interested in hearing them.
In particular, one of your objections to GROUP BY==values() equivalence is that while they are technically equivalent, they are not conceptually equivalent to the end user. Can you provide a use case where this is true (example 8 is my counterexample)?
On 4/16/07, Russell Keith-Magee <freakboy3...@gmail.com> wrote:
> Hi Honza,
> Over the weekend, I tried responding to our last message inline, but > my response kept getting long and confused. So, I've summarized my > thoughts. I have three problems with your proposal:
> - Your approach is very group/SQL-centric, rather than being > object/goal-centric.
yes, I confess - I come from the DB world and SQL is my language of choice ;) I love the language and the way you can express your thoughts in it.
besides, Django's queryset is very SQL-centric so far
I would expect that the two most common groups
> are 'all columns in the table' and 'no columns in the table'.
'all columns in the table' ?? you mean like GROUP BY first_name, last_name, address, email, ... even if you would rewrite it as GROUP BY id and join the result with the original table, it would be unnecessary overhead
you would either have to have smart optimizations that would recognize this case and group by the foreign key on the data table instead, or you would end up with a query that would most likely kill your DB.
Since Django's query building is very simple (the only thing it does magically for you is joining the tables, escaping and returning objects instead of fields), it would require a massive change to introduce this sort of optimizations and the overhead without it is simply too big to ignore for clear syntax.
Using
> your syntax, the first case is very cumbersome (you have to manually > name every column you want to see in the output),
yes, but I cannot think of a use case where that would be the problem when grouping or asking for overall statistics. Note please that if you ask for a foreign key field, you will get the entire object.
I also cannot come up with a way that would allow you not to specify every field you want on output and would still perform well
and the second case
> yields verbose output (you have to go 2 levels into a dictionary to > get a simple column average).
yes, I know and am not happy with that either. But if not two levels, than how?
> - I accept that avoiding name clashes is a desirable (and arguably > essential) goal. However, I feel that your output syntax is longwinded > for the simple case (summary of a single table, no groups), and not > particularly expressive for the complex case (summary of joined > tables). I don't buy the 'generic summary statistic app' argument; > mostly because I don't see the use case for such an application,
I am building one right now: I have a system that keeps track of students in a school as they come and go. The customer now asked for a module that would enable them to compute statistics at the end of the year: how many students came from this background, how many teacher wrote the most reports on the students, what was the average height etc.
With this in Django, I could just build a helper view, that would construct the queryset and a generic template that could display the results and let them build the queries mostly on their own (including some checks, of course).
I think that most bigger applications will need something like this sooner or later. Reporting and (even this primitive) BI is very lucrative business and customers want it.
but
> also because summary statistics for joined tables will require the use > of __ syntax in the result dictionaries, which isn't as immediately > expressive as aliased names.
true
> - While your syntax is very good for generating groups and (aggregates > of those groups) on a single table, I don't see it extending well into > joins - especially filtering on joins.
if you use the field 'names' as you would in a resulting template (as I used in the examples here), you could do that without any name clashes. but again, the syntax will become more verbose.
Using __sum__lte as you suggest
> is a problem for filters, because there is the potential for a name > clash between a model field called sum and the desire to use a sum > aggregate.
> However, an example is worth a thousand words. Here are some of what I > would consider common use cases that I don't see how your syntax > addresses. Keep in mind that in each case when I ask for an author, > the group is the full list of attributes;
see the beginning of my response for why this is bad.
I would like to get the full
> set of object attributes list of objects along with the aggregates for > each object in a single query rather than requiring one query to get > the objects and one to get the aggregates.
> R>>> is the syntax for queries using my proposed syntax. H>>> is your > syntax, as I understand it. In most cases, I can't see how your syntax > reaches the use case. Please correct me if I have misunderstood, or if > there is a better way to use your syntax.
> 1) Get a list of authors, with the number of articles written by each author: > H>>> Author.objects.aggregate(group_by=('first_name','last_name',... > 'phone_number','address','email'), count=True)
well, this would be
H>>> Article.objects.aggregate( group_by=('author',), count=True ) provided that when grouping on foreign key, the query will return the entire object, which makes sense
> 2) Get a list of authors that have written at least 3 articles:
you got me here - as I said earlier I have not yet come up with a way to refer to the aggregated fields. I will post my examples here using the same syntax you would use to retrieve the data in template (using __ instead of . ):
> 8) What is the average rating for authors called 'George'? What about 'Fred'? > H>>> Author.objects.aggregate(group_by=('first_name'), > average=('article__rating'))
> Out of these, I feel your syntax probably has the advantage in case > (7), but getting the output from your results requires two levels of > dictionary lookup, whereas only 1 is required in my case.
true, I would like to access the data more easily, but I am not willing to compromise the naming or force users to manually alias their lookups for this.
> In addition to seeing you fill in these use cases with your syntax, I > would be interested in any use cases where you feel that your syntax > has the advantage. My use cases are driven by the way in which I > currently use (or would like to use) aggregates; If your use cases are > significantly different, I'd be interested in hearing them.
> In particular, one of your objections to GROUP BY==values() > equivalence is that while they are technically equivalent, they are > not conceptually equivalent to the end user. Can you provide a use > case where this is true (example 8 is my counterexample)?
in my opinion, if I would specify values(), I haven't said anything about aggregating the result. If the ORM does that behind my back, I don't like it. I just asked for individual values, I didn't say anything about aggregation.
it just doesn't seem nice, especially if values() was along for some time now and people got used to how it works
btw. refactoring the queryset code is necessary for this to work, so perhaps we should focus our efforts there first. Maybe something will emerge from that that would help us decide.
On 4/16/07, Honza Král <honza.k...@gmail.com> wrote:
> besides, Django's queryset is very SQL-centric so far
To a large extent, yes, but no in some very important ways. Article.objects.all() isn't the least bit SQL, although the mapping to SQL is obvious. Preserving the object flavor of the ORM is an important design consideration for the ORM. More than one proposed extension to the ORM system has been knocked back for being little more than a light wrapper around SQL.
> you would either have to have smart optimizations that would recognize > this case and group by the foreign key on the data table instead, or > you would end up with a query that would most likely kill your DB.
One way around this is to use MAX on all the required non-pk fields from the object list being retrieved: SELECT object.id, MAX(object.field1), SUM(related.value) FROM object INNER JOIN related ON object.related = related.id GROUP BY object.id
Since each group for object.id has only 1 member, the MAX returns the matching field value, without needing to be explicitly included in the group.
> > tables). I don't buy the 'generic summary statistic app' argument; > > mostly because I don't see the use case for such an application,
> I am building one right now: ... > I think that most bigger applications will need something like this > sooner or later. Reporting and (even this primitive) BI is very > lucrative business and customers want it.
Ok; I misunderstood your intent. My mind was off building some sort of 'summary of summary statistics'. A 'generic view for summary stats' or some such beast makes sense.
> > Out of these, I feel your syntax probably has the advantage in case > > (7), but getting the output from your results requires two levels of > > dictionary lookup, whereas only 1 is required in my case.
> true, I would like to access the data more easily, but I am not > willing to compromise the naming or force users to manually alias > their lookups for this.
This is why I thought a merging the two syntaxes might have its place. I can see significant value in allowing aliases - the ability to filter comes almost by accident, and it allows the developer to use meaningful names, rather than ('average_age', rather than 'article__author__age__average'). However, the simple case ("just give me the average, and put it somewhere simple') is extremely useful, and is cumbersome if you must specify an alias.
> btw. refactoring the queryset code is necessary for this to work, so > perhaps we should focus our efforts there first. Maybe something will > emerge from that that would help us decide.
Granted. At one point, Malcolm was pushing this refactor, but I don't know if he actually started work on it. Malcolm - you out there?
On Mon, 2007-04-16 at 21:55 +0800, Russell Keith-Magee wrote:
[...]
> > btw. refactoring the queryset code is necessary for this to work, so > > perhaps we should focus our efforts there first. Maybe something will > > emerge from that that would help us decide.
> Granted. At one point, Malcolm was pushing this refactor, but I don't > know if he actually started work on it. Malcolm - you out there?
I'm working on it actively. It's taken a back seat until I've finished the unicode work, though -- since that is necessary to fix some real bugs we have right now -- which should be end of this coming weekend, I would hope.
On 4/17/07, Malcolm Tredinnick <malc...@pointy-stick.com> wrote:
> On Mon, 2007-04-16 at 21:55 +0800, Russell Keith-Magee wrote: > [...] > > > btw. refactoring the queryset code is necessary for this to work, so > > > perhaps we should focus our efforts there first. Maybe something will > > > emerge from that that would help us decide.
> > Granted. At one point, Malcolm was pushing this refactor, but I don't > > know if he actually started work on it. Malcolm - you out there?
> I'm working on it actively. It's taken a back seat until I've finished > the unicode work, though -- since that is necessary to fix some real > bugs we have right now -- which should be end of this coming weekend, I > would hope.
great to hear that.. if you need any help, I have some free time now and am willing to help. I gave the refactoring and aggregations a lot of thought and I know my way around SQL.
On Tue, 2007-04-17 at 03:03 +0200, Honza Král wrote: > On 4/17/07, Malcolm Tredinnick <malc...@pointy-stick.com> wrote:
> > On Mon, 2007-04-16 at 21:55 +0800, Russell Keith-Magee wrote: > > [...] > > > > btw. refactoring the queryset code is necessary for this to work, so > > > > perhaps we should focus our efforts there first. Maybe something will > > > > emerge from that that would help us decide.
> > > Granted. At one point, Malcolm was pushing this refactor, but I don't > > > know if he actually started work on it. Malcolm - you out there?
> > I'm working on it actively. It's taken a back seat until I've finished > > the unicode work, though -- since that is necessary to fix some real > > bugs we have right now -- which should be end of this coming weekend, I > > would hope.
> great to hear that.. if you need any help, I have some free time now > and am willing to help. I gave the refactoring and aggregations a lot > of thought and I know my way around SQL.
Good to hear. :-)
I'll get the basic stuff out there as soon as it holds a little water. It's very similar to the prototype version you wrote earlier (I'd started a little before you, but it was surprising how similar they were), just with a few less classes. So you'll find the code very similar.
I'm not going to touch the aggregation stuff, since there are enough people thinking about that already.