As a hint to the community - one of the best suggestions we can have
at this stage of the project is a set of interesting problems to
solve. This would give us a set of concrete 'problems' that we can use
to evaluate potential solutions. It may not be possible to find a
solution for every problem, but we won't know that until we have a
list of real problems to work with.
So, if you have an interesting use case for aggregates (i.e.,
aggregate queries that you would like to make, but currently need to
use raw SQL or an .extra() hack), let us know. If you can phrase that
query using the Author/Publisher/Book/Store model space that the new
aggregation model test defines, all the better.
Here are a few suggestions to get the ball rolling (with some reasoning
on why they are interesting):
* Does Author X have any friends?
(aggregate on a related object)
* Give me a list of all Books with more than 2 authors
(Query over the count of objects in a m2m relationship)
* Give me a list of all Authors that have no friends
(Query that must count correctly when rows don't exist in the m2m table)
* Give me a list of all publishers that have published more than 10 books
(Query over the reverse direction of a FK)
* Give me a list of all publishers that have published more than 10
books this year
(Query over the reverse direction of a FK, with a filter on the related object)
* Give me a list of all Books that were written by X and one other author.
(Query that checks a count of m2m objects, plus membership of the m2m set)
* Give me a list of all Authors that have published a book with at
least one other person
(Filters over a count generated on a related object)
* Give me the average price of all Books that were written by X and
one other author.
(Aggregate over objects discovered using membership of the m2m set)
Yours,
Russ Magee %-)
-Alen
(PS: if there is a need to give some real examples with regard to this,
I could put some up.)
That's really what we're looking for. I would be highly surprised if
aggregates weren't used for reporting purposes - the standing question
is "what types of reports do people find interesting?"
The goal is not necessarily to enable _every_ aggregate query to be
expressed using the Django ORM, but we do want the most common queries
to be expressed easily. If your examples are extraordinarily complex,
they may be outside the scope of what we're looking to achieve, but we
won't know what is possible until we know what people want to do.
Yours,
Russ Magee %-)
--
Is that the kind of aggregate you needed?
This could be expressed as:
c = Contract.objects.annotate(
        total_amount=Sum('financial_transaction__amount')
    ).filter(
        financial_transaction__pub_date__gt=from_date,
        financial_transaction__pub_date__lt=to_date,
        financial_transaction__type='M',
    ).order_by('id', 'properties__short_title')
Please correct me if this is not what you meant...
--
This is something we're going to want to normalize, and as Alex noted,
floats would make a lot more sense.
> The issue of the datastructure that represents the results is only one
> part of the problem. Due to the inner representation of numbers and
> other factors of the different backends, the numeric results might be
> different, for example, postgres might return 34.7321987 while sqlite
> returns something like 34.732211 (rounding the results differently).
Keep in mind that there are already a few places in the tests where we
have to deal with inexact floating point references. The easy solution
is to modify the doctest to only print n decimal places, where n is
enough places to check the test case but won't introduce any floating
point ambiguity (in your case, check for 34.73 or 34.732).
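For example, a minimal sketch of that approach (assuming the aggregate
result comes back as a dictionary, and that the key is named
'price__avg' - both still implementation details at this point):

>>> result = Book.objects.aggregate(Avg('price'))
>>> print '%.2f' % result['price__avg']
34.73

The printed string is then identical on every backend, no matter how
the trailing digits are rounded.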
Yours,
Russ Magee %-)
I don't mean to be rude, David, but the difference between the two
functions has been discussed _at length_, both in the past, and in
recent memory.
Short version:
Annotate returns a query set of objects, but with each returned object
enhanced with aggregated attribute(s) (e.g., a list of people, each of
which is annotated with the sum of their individual purchases)
Aggregate returns the summary statistics over a query set (e.g., the
average age of people matching a query).
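To make the shapes concrete (a sketch only - the Person model, the
purchase relation, and the exact result key are hypothetical):

>>> # annotate(): still a queryset of objects, each with an extra attribute
>>> people = Person.objects.annotate(total_spent=Sum('purchase__amount'))
>>> people[0].total_spent
142.50
>>> # aggregate(): no objects at all, just a dictionary of summary values
>>> Person.objects.filter(is_customer=True).aggregate(Avg('age'))
{'age__avg': 42.1}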
Yours,
Russ Magee %-)
Pardon?
Annotate and aggregate are two very distinct features. They can be
easily distinguished by their return types.
annotate() - returns a query set which contains object instances
aggregate() - returns a dictionary containing aggregated values only
How exactly do you propose collapsing these two functions?
Yours,
Russ Magee %-)
Also, I'm glad to see you are following the updates closely :)
--
Why would replacing distinct() ever make any sense?
--
Collin Grady
--
But forcing people to group by every field they select in order to get distinct
is absolutely idiotic, so why even suggest it?
.distinct() maps to DISTINCT, and even if in some way it's identical, there's no
harm in leaving it for the ease of use.
--
Collin Grady
"If you don't want your dog to have bad breath, do what I do: Pour a little
Lavoris in the toilet."
-- Comedian Jay Leno
> DISTINCT and GROUP BY are both annotating the results. Taking many
> rows, and turning them into a single row based on a key. DISTINCT would
> be the same as GROUP BY every, field, in, the, select.
Mathematically true, but I think having two separate concepts makes the code
easier to read. And they also map to two different SQL syntaxes, which might
have different optimization consequences.
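For illustration, the equivalence being discussed (sketch SQL only; the
table and column names are made up):

SELECT DISTINCT first_name, last_name FROM author;

-- returns the same rows as grouping by every selected field:

SELECT first_name, last_name FROM author
GROUP BY first_name, last_name;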
- Sander
>> * Does Author X have any friends?
>> (aggregate on a related object)
>>
>
> Author.objects.filter(pk=1).aggregate(Count('friends__id'))
I would have thought that Count('friends') would be automatically
expanded to a count over the primary key. i.e., the example you
provide is explicit, aggregate(Count('friends')) would be an
equivalent implicit form.
>> * Give me a list of all Books with more than 2 authors
>> (Query over the count of objects in a m2m relationship)
>
>>>>Book.objects.all().annotate(num_authors=Count('authors__name')).filter(num_authors__gt=2)
Looks good to me, with the exception of a similar comment about implicit
expansion of authors__name.
>> * Give me a list of all Authors that have no friends
>> (Query that counts when rows that don't exist in the m2m table)
>>
>
>>>> Author.objects.all().annotate(num_friends=Count('friends__id', distinct=True), allow_nulls=True).filter(num_friends=0)
* I've commented to you previously about the allow_nulls=True option.
* Is a distinct option required as part of the Count operator? I would
have thought the existing distinct() queryset modifier would be
sufficient.
* Again, there is the potential for implicit expansion of Count('friends')
>> * Give me a list of all publishers that have published more than 10 books
>> (Query over the reverse direction of a FK)
>>
>
>>>> Publisher.objects.all().annotate(num_books=Count('book__id')).filter(num_books__gt=10)
With the exception of the implicit book__id expansion, looks good to me.
>> * Give me a list of all publishers that have published more than 10
>> books this year
>> (Query over the reverse direction of a FK, with a filter on the related object)
>>
>
>>>> Publisher.objects.all().annotate(num_books=Count('book__id')).filter(num_books__gt=10, book__price__lt=50)
The filter you implement here is a little different to the question as
stated, but it covers the use case I was driving at.
> So far so good. Though filter() is not yet implemented on aggregates,
> the previous examples would work as soon as filter() starts working on
> aggregates. The generated SQL and what is planned for filter() seem to
> match the problems in question very well.
>
> Now the hard ones:
>
>> * Give me a list of all Books that were written by X and one other author.
>> (Query that checks a count of m2m objects, plus membership of the m2m set)
>>
>
> We have two possibilities here that would make sense.
> The first intuitive way to express this would be:
>
>>>> Book.objects.all().filter(authors__name__contains='Norvig').annotate(num_authors=Count('authors__id')).filter(num_authors__gt=1)
>
> This has problems because, given the nature of the SQL that is being
> generated, when the filtering occurs (it is transformed into a WHERE
> clause) we are restricting the rows to only those that have the author
> X. This way, the count won't reflect the actual number of authors for
> each book, but the number of authors currently selected (just 1).
> To work around this, the way of doing the query is this:
>
>>>> Book.objects.all().annotate(num_authors=Count('authors__id')).filter(authors__name__contains='Norvig', num_authors__gt=1)
>
> which would do the filtering after the aggregate so the count is
> executed on all the books.
>
> Should we consider this something that should be solved with
> documentation or should I implement the logic to restructure the query
> if an aggregate is executed on a QuerySet that has been filtered?
With the exception of terminal queryset clauses (like count() or
values()), I can't think of any other case where the result of a
queryset depends on the order in which its clauses are applied. If at
all possible, I would think that early filters should be restructured.
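To make the problem concrete, here is roughly what the naive
translation would generate (illustrative SQL only, with made-up join
table names - not the actual ORM output):

SELECT book.id, COUNT(author.id) AS num_authors
FROM book
JOIN book_authors ON book_authors.book_id = book.id
JOIN author ON author.id = book_authors.author_id
WHERE author.name LIKE '%Norvig%'
GROUP BY book.id
HAVING COUNT(author.id) > 1;

-- The WHERE clause removes every non-matching author row before the
-- count happens, so num_authors can never exceed 1 and the HAVING
-- clause matches nothing.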
>> * Give me a list of all Authors that have published a book with at
>> least one other person
>> (Filters over a count generated on a related object)
>>
...
> aggregation syntax. Do you have any suggestion on how to express this
> (with either existing or non-existent syntax)?
Not specifically. I was deliberately trying to be concerned with the
questions, rather than the answers. As I said previously, it's not
essential that we represent every possible query - complex queries
will always require manual SQL - but it is important that we have some
idea of the limitations of the syntax.
> An option for this would probably be the previously suggested F-
> syntax. So we could do:
>>>> Author.objects.all().annotate(num_colleagues=Count('book__authors__id'), num_books=Count('book__id', distinct=True)).filter(num_colleagues__gt=F('num_books'))
I don't think this is correct. By my reading of this example, an
author that writes 5 books by themself will have 5 colleagues.
However, I think you are correct that this example may require F()
style syntax. If this puts this example outside your scope for v1 of
aggregation, that's fine with me.
On the subject of F() - I don't know if you've noticed the thread on
django-developers, but there is now a patch in the system (#7210) that
implements the F() syntax.
>> * Give me the average price of all Books that were written by X and
>> one other author.
>> (Aggregate over objects discovered using membership of the m2m set)
>>
>
> Also for this problem, how would you express it in aggregation syntax?
>
> An option that looks very good from the syntax point of view would be:
>>>> Book.objects.all().annotate(num_authors=Count('authors__id')).filter(num_authors__gt=1).aggregate(Avg('price'))
>
> but for this to work we would need to introduce the notion of
> subqueries which, AFAIK, have only been dealt with in extra().
Why a subquery? Isn't this a WHERE clause over an aggregate that isn't
in the return set?
> I believe it would be quite clean to introduce subqueries in common
> cases like this. Especially when dealing with aggregate since
> aggregate is a terminal clause.
If we need subqueries, I'd rather be explicit about it: e.g.,
>>> Book.objects.filter(author__in=Author.objects.filter(...))
Yours,
Russ Magee %-)
I agree, I haven't looked at the code of the F() patch yet, but I've
read the discussions. I am still a bit far from the point where I
would need to integrate F with aggregates so it might even get to
trunk before I need to work with it.
> b) Is there
> any plan to make selecting on the primary key the default, so one can
> do Count('authors') in place of Count('authors__id')?
>
Yes, that's the idea. But one would still be able to do
Count('authors__name') for counting non-null names, or go explicit with
Count('authors__id').
--
Thanks for taking the time to test the code and for providing a very
useful example/test case =)
>
> class Entries(models.Model):
> EntryID = models.AutoField(primary_key=True, db_column='Entry ID')
> Entry = models.CharField(unique=True, max_length=50)
> Exclude = models.BooleanField()
>
> class Clues(models.Model):
> ID = models.AutoField(primary_key=True)
> EntryID = models.ForeignKey(Entries, verbose_name='Entry', db_column =
> 'Entry ID')
> PuzzleID = models.ForeignKey(Puzzles, verbose_name='Puzzle', db_column =
> 'Puzzle ID')
> Clue = models.CharField(max_length=150, core=True)
> ....and a few other fields that have no bearing on this query...
>
> What I'm looking to generate is a table of most common Entries, and for each
> Entry to have a count of Appearances, a count of Distinct Clues for the
> Entry, and a 'Freshness Factor' which is #distinct clues / #appearances. I
> want to order this list by #Appearances descending and limit it to 50
> elements. The raw (MySQL) SQL I currently use for this is:
>
> SELECT Entry AS `Entry`, Count(`Entry ID`) AS `Appearances`,
> Count(Distinct(Clue)) AS `Distinct Clues`,
> Format(Count(Distinct(Clue))/count(`Entry ID`),2) AS `Freshness Factor` FROM
> Entries NATURAL JOIN Clues GROUP BY `Entry ID` ORDER BY `Appearances` DESC
> LIMIT 50
Not all the things you need to do here are implemented yet, but most
(if not all) of them are planned.
> Trying to translate that to annotate/aggregate I first think I need to start
> with the Clues model (even though I think of this more as an Entries query)
> because I can't see how to count distinct Clues if I start with Entries,
> since I don't know how to go back from Entries to Clues...did I get that
> right or am I confused?
>
For me it seems more intuitive to do it on Entries, but it is possible
to do it both ways.
> Anyway, starting with Clues I've gotten this far:
>
> Clues.objects.values('EntryID__Entry').annotate(Appearances=Count('Entry
> ID'), Distinct_Clues=Count('Clue', distinct=True))
>
The query you came up with can also be expressed as:
>>> Entries.objects.values('Entry').annotate(Appearances=Count('Entry ID'), Distinct_Clues=Count('clues__Clue', distinct=True))
which is equivalent.
> (Note I think there's a little bug there -- the Clues field is EntryID,
> while the database column name is 'Entry ID' -- I had to put the space in on
> the Count() call but I think I shouldn't have to do that? But that's really
> beside the point of my main questions.)
This is true. I've added it as a bug in the project's trac and it
should soon be fixed. Thanks.
> which gives me a ValuesQuerySet that returns a dictionary containing
> EntryID__Entry, Appearances, and Distinct_Clues for each Entry. What I
> don't know how to do is:
>
> 1 - add in the 'Freshness Factor'. Sure, I can calculate it myself from
> Distinct_Clues/Appearances but I am curious if there is some way to push the
> work to the DB like I did in my original raw sql? (Which would also allow
> for ordering on this derived value, if I knew the answer to #2.)
>
Currently the way of defining new Aggregate objects is not documented.
This is mainly because I haven't worked much on the documentation
yet. Also, the way this is done now might change in the future in
favour of something more user-friendly. Now, if you wanted to define
your own Aggregate object you could do something like this:
class FreshnessFactor(Aggregate):
    def __init__(self, lookup):
        super(FreshnessFactor, self).__init__(lookup)
        # Hard-coded SQL fragment substituted for the usual aggregate function
        self.sql_template = """Count(Distinct("Clue"))/count("Entry ID")"""
and then do something like:
Entries.objects.values('Entry').annotate(
    Appearances=Count('clues__EntryID'),
    Distinct_Clues=Count('clues__Clue', distinct=True),
    Freshness_Factor=FreshnessFactor('')
)
Note that this is a dirty hack and I haven't tested it. (It will
probably not work because of the bug with the spaces you pointed out...
It also doesn't take into account the column aliases.) This is just the
way to work it out with the current implementation. But as soon as the
basic part is done I will start working on a cleaner implementation
for arithmetic and for defining your own aggregate objects.
> 2 - order by Appearances. How do I order by an aggregate result? I can't
> find any mention of this in the doc or in a brief scan of the previous
> discussions. This one I really do want the database to do vs. my code due
> to the number of distinct entries in the DB (>100,000). Have I just missed
> how to do this somewhere?
>
This is not implemented yet either, but will be soon. The idea is to
simply issue an order_by('Appearances') call and get the ordering.
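Once that lands, the whole report could presumably be written as
something like this (a sketch, assuming annotation names become valid
order_by() arguments as planned, with '-' for descending and slicing
for LIMIT):

>>> Entries.objects.values('Entry').annotate(
...     Appearances=Count('clues__EntryID'),
...     Distinct_Clues=Count('clues__Clue', distinct=True)
... ).order_by('-Appearances')[:50]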
> All in all it's taking me a bit of effort to train my brain to think in
> terms of the aggregate/annotate vs. raw sql, but it seems pretty cool.
Thanks =)
>
> Thanks,
> Karen
It will.
>>> * Give me a list of all Books with more than 2 authors
>>> (Query over the count of objects in a m2m relationship)
>>
>>>>>Book.objects.all().annotate(num_authors=Count('authors__name')).filter(num_authors__gt=2)
>
> Looks to me, with the exception of a similar comment about implicit
> expansion of authors__name.
>
Same as before... implicit is coming :p
But I don't think the idea is to always use the implicit form. In this
case the problem adapts so that any non-nullable field would work, but
this query is currently counting "authors that have names". I know that
is not exactly the stated problem, but it illustrates the possibility
of using this kind of count. "Authors that have names" might sound
silly in real life, but maybe if "Price" wrote a book he wouldn't be
counted by our query. (Just kidding; there are many real-life uses for
this =) )
>>> * Give me a list of all Authors that have no friends
>>> (Query that counts when rows that don't exist in the m2m table)
>>>
>>
>>>>> Author.objects.all().annotate(num_friends=Count('friends__id', distinct=True), allow_nulls=True).filter(num_friends=0)
>
> * I've commented to you previously about the allow_nulls=True option.
Yes, allow_nulls is going away, just a copy/paste from the tests.
> * Is a distinct option required as part of the Count operator? I would
> have thought the existing distinct() queryset modifier would be
> sufficient.
distinct is optional. The problem with using the distinct() queryset
modifier is that it would affect the entire query. What would happen
if you wanted to solve something like this:
"Give me a list of kids with the total number of toys among all of
their friends and the number of different toys their friends have"
It could be solved like this:
>>> Kids.objects.annotate(Count('friends__toys'), Count('friends__toys', distinct=True))
while distinct() would act on the select or, even if we adapted it, it
would act inside all of the Count objects.
> * Again, there is the potential for implicit expansion of Count('friends')
same here
>
>>> * Give me a list of all publishers that have published more than 10 books
>>> (Query over the reverse direction of a FK)
>>>
>>
>>>>> Publisher.objects.all().annotate(num_books=Count('book__id')).filter(num_books__gt=10)
>
> With the exception of the implicit book__id expansion, looks good to me.
>
>>> * Give me a list of all publishers that have published more than 10
>>> books this year
>>> (Query over the reverse direction of a FK, with a filter on the related object)
>>>
>>
>>>>> Publisher.objects.all().annotate(num_books=Count('book__id')).filter(num_books__gt=10, book__price__lt=50)
>
> The filter you implement here is a little different to the question as
> stated, but it covers the use case I was driving at.
>
Yes, I adapted it to the models. I could add a date field to the
models... anyway, it's isomorphic.
I'll get back to you on this later since I would like to give it a
deeper analysis.
>>> * Give me a list of all Authors that have published a book with at
>>> least one other person
>>> (Filters over a count generated on a related object)
>>>
> ...
>> aggregation syntax. Do you have any suggestion on how to express this
>> (with either existing or non-existant syntax)?
>
> Not specifically. I was deliberately trying to be concerned with the
> questions, rather than the answers. As I said previously, it's not
> essential that we represent every possible query - complex queries
> will always require manual SQL - but it is important that we have some
> idea of the limitations of the syntax.
>
>> An option for this would probably be the previously suggested F-
>> syntax. So we could do:
>>>>> Author.objects.all().annotate(num_colleagues=Count('book__authors__id'), num_books=Count('book__id', distinct=True)).filter(num_colleagues__gt=F('num_books'))
>
> I don't think this is correct. By my reading of this example, an
> author that writes 5 books by themself will have 5 colleagues.
>
yes, this is why I am filtering out those that have fewer than or the
same number of colleagues as books. Maybe "colleagues" was a bad choice
of word.
> However, I think you are correct that this example may require F()
> style syntax. If this puts this example outside your scope for v1 of
> aggregation, thats fine with me.
>
> On the subject of F() - I don't know if you've noticed the thread on
> django-developers, but there is now a patch in the system (#7210) that
> implements the F() syntax.
>
Yes. I don't know when it will be ready, but I would need to
integrate it into aggregation anyway. I believe the most reasonable
approach (for the sanity of both sebastian_noack and me) is to wait
until it is stable and then try to merge it in. But probably everything
depends on the timing.
>>> * Give me the average price of all Books that were written by X and
>>> one other author.
>>> (Aggregate over objects discovered using membership of the m2m set)
>>>
>>
>> Also for this problem, how would you express it in aggregation syntax?
>>
>> An option that looks very good from the syntax point of view would be:
>>>>> Book.objects.all().annotate(num_authors=Count('authors__id')).filter(num_authors__gt=1).aggregate(Avg('price'))
>>
>> but for this to work we would need to introduce the notion of
>> subqueries which, AFAIK, have only been dealt with in extra().
>
> Why a subquery? Isn't this a WHERE clause over an aggregate that isn't
> in the return set?
>
>> I believe it would be quite clean to introduce subqueries in common
>> cases like this. Especially when dealing with aggregate since
>> aggregate is a terminal clause.
>
> If we need subqueries, I'd rather be explicit about it: e.g.,
>
>>>> Book.objects.filter(author__in=Author.objects.filter(...))
>
(strong statements to come... please correct me if I'm wrong)
There are two possibilities in which I think subqueries might be
useful (regarding aggregation).
One is something like this:
"Give me the average number of authors per book"
the expected result is something like:
{average: 2.3}
so we are calculating an average over a count. Since aggregate
functions cannot be nested, the way to do AVG(COUNT(some_field)) is to
do a subquery that annotates each object with the count, and then
aggregate over that set.
the other is something like:
"Give me the average price of the books that have more than 1 author"
(similar to the one above)
In this one we are not aggregating on the annotated fields. But, since
there is grouping, we cannot aggregate over the whole set. So we use a
subquery.
the way to express this in our syntax would be (intuitively, and
regardless of the underlying SQL):
annotate(...).aggregate(...)
This could result in a subquery, or could be restricted in favour of
some more explicit syntax as you suggest.
A possibility to make this explicit (though to me it seems ugly) would be:
Book.objects.aggregate(
    Avg(Book.objects.annotate(num_authors=Count('authors')))
)
... though maybe it would also need a way of specifying on which field
to aggregate.
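(For contrast, the implicit annotate().aggregate() form of the first
problem would read something like the following sketch; whether
annotation names can be referenced from aggregate(), and the exact
result key, are both still open questions:)

>>> Book.objects.annotate(num_authors=Count('authors')).aggregate(Avg('num_authors'))
{'num_authors__avg': 2.3}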
I still don't see how to solve the last problem without subqueries.
Cheers,
Nicolas, who thinks this mail got way out of hand (and off topic)
after re-reading =)
Providing a way to count different fields is a good idea, and the
approach you've described looks good to me. I just wanted to make sure
that the simplest case (count all instances) is trivial (i.e., doesn't
require explicit naming of the PK field).
>> * Is a distinct option required as part of the Count operator? I would
>> have thought the existing distinct() queryset modifier would be
>> sufficient.
>
> distinct is optional. The problem with using the distinct() queryset
> modifier is that it would affect the entire query.
...
>>>> Kids.objects.annotate(Count('friends__toys'), Count('friends__toys', distinct=True))
> while distinct would act on the select, or, even if we adapt it, it
> would act inside all the Count objects.
Ah - I hadn't considered COUNT(DISTINCT *). This seems
reasonable to me. It also acts as a validation for choosing the
function notation for aggregates - this wouldn't have been possible if
we were using the __count= syntax.
>>> An option for this would probably be the previously suggested F-
>>> syntax. So we could do:
>>>>>> Author.objects.all().annotate(num_colleagues=Count('book__authors__id'), num_books=Count('book__id', distinct=True)).filter(num_colleagues__gt=F('num_books'))
>>
>> I don't think this is correct. By my reading of this example, an
>> author that writes 5 books by themself will have 5 colleagues.
>>
>
> yes, this is why I am filtering out those that have fewer than or the
> same number of colleagues as books. Maybe "colleagues" was a bad
> choice of word.
Ah - now I see what you are doing. Ignore me. I don't know nothing
about nothing :-)
>>> I believe it would be quite clean to introduce subqueries in common
>>> cases like this. Especially when dealing with aggregate since
>>> aggregate is a terminal clause.
>>
>> If we need subqueries, I'd rather be explicit about it: e.g.,
>>
>>>>> Book.objects.filter(author__in=Author.objects.filter(...))
>>
> (strong statements to come... please correct me if I'm wrong)
>
> There are two possibilities in which I think subqueries might be
> useful (regarding aggregation).
Don't get me wrong - I can see that subqueries can be useful. I'm just
hesitant to try and induce subqueries from a flat query syntax. If
subqueries are required, my gut reaction is that we will have less
headaches if we make the definition language able to express
subqueries.
That said - if there are very clear and obvious rules for when a
subquery will get formed, this might be ok. More on this below.
> the way to express this in our syntax would be (intuitively, and
> regardless of the underlying SQL):
> annotate(...).aggregate(...)
>
> This could result in a subquery or could be restricted in favour of
> some more explicit syntax as you suggest.
>
> A possibility to make this explicit (though to me it seems ugly) would be:
> Book.objects.aggregate(
>     Avg(Book.objects.annotate(num_authors=Count('authors')))
> )
> ... though maybe it would also need a way of specifying on which field
> to aggregate.
I agree that this approach isn't pretty.
I think I have a slightly better idea of what you are proposing. If
I'm understanding you correctly, you're working on the idea that:
SELECT AVG(pages) FROM Book
is essentially the same as
SELECT AVG(pages) from (SELECT pages FROM Book)
The second form is inefficient - in practice, you would remove the
subquery as an optimization - but it is a generic construction that
lends itself to more complex inner queries, including those that have
annotate() clauses:
SELECT AVG(num_authors) from (SELECT id, name, COUNT(author.id) AS
num_authors FROM Book INNER JOIN Author ...
Given that aggregate() is a terminal clause, users aren't ever going
to be affected by the existence of the subquery, so I suppose that
this isn't so bad.
I still have a gut feeling that you should be able to compose this
query without a subselect. However, I would need to sit down and look
at the problem for a while to confirm that.
After all that - I wouldn't get too hung up on these last cases. I
posed them as intentionally difficult queries. While it would be nice
to be able to handle them, getting the simple cases working (especially
joins) is a much bigger priority. There will always be queries that
can't be expressed using the ORM - the goal is to make the 80% of
common queries trivial. The remaining 20% can always be serviced using
raw SQL.
Yours,
Russ Magee %-)
This is fixed in the latest revision. Now it is possible to do:
>>> explicit = list(Author.objects.annotate(Count('book__id')))
>>> implicit = list(Author.objects.annotate(Count('book')))
>>> explicit == implicit
True
>>> * Is a distinct option required as part of the Count operator? I would
>>> have thought the existing distinct() queryset modifier would be
>>> sufficient.
>>
>> distinct is optional. The problem with using the distinct() queryset
>> modifier is that it would affect the entire query.
> ...
>>>>> Kids.objects.annotate(Count('friends__toys'), Count('friends__toys', distinct=True))
>
>> while distinct would act on the select, or, even if we adapt it, it
>> would act inside all the Count objects.
>
> Ah - I hadn't considered COUNT(DISTINCT *). This seems
> reasonable to me. It also acts as a validation for choosing the
> function notation for aggregates - this wouldn't have been possible if
> we were using the __count= syntax.
>
True. And for custom aggregate functions the option of having multiple
parameters seems very useful.
>>>> I believe it would be quite clean to introduce subqueries in common
>>>> cases like this. Especially when dealing with aggregate since
>>>> aggregate is a terminal clause.
>>>
>>> If we need subqueries, I'd rather be explicit about it: e.g.,
>>>
>>>>>> Book.objects.filter(author__in=Author.objects.filter(...))
>>>
>> (strong statements to come... please correct me if im wrong)
>>
>> There are two possibilities in which I think subqueries might be
>> useful (regarding aggregation).
>
> Don't get me wrong - I can see that subqueries can be useful. I'm just
> hesitant to try and induce subqueries from a flat query syntax. If
> subqueries are required, my gut reaction is that we will have less
> headaches if we make the definition language able to express
> subqueries.
>
I also think that being explicit is more django-ish/pythonic but, as
you said....
> That said - if there are very clear and obvious rules for when a
> subquery will get formed, this might be ok. More on this below.
>
...there might be some exceptions.
I think this is also far ahead, so I'll keep a note on it not to forget
it, but won't deal with it until the first version is ready, tested and
documented.
>> the way to express this in our syntax would be (intuitively, and
>> regardless of the underlying SQL):
>> annotate(...).aggregate(...)
>>
>> This could result in a subquery or could be restricted in favour of
>> some more explicit syntax as you suggest.
>>
>> A possibility to make this explicit (though to me it seems ugly) would be:
>> Book.objects.aggregate(
>>     Avg(Book.objects.annotate(num_authors=Count('authors')))
>> )
>> ... though maybe it would also need a way of specifying on which field
>> to aggregate.
>
> I agree that this approach isn't pretty.
>
> I think I have a slightly better idea of what you are proposing. If
> I'm understanding you correctly, you're working on the idea that:
>
> SELECT AVG(pages) FROM Book
>
> is essentially the same as
>
> SELECT AVG(pages) from (SELECT pages FROM Book)
>
> The second form is inefficient - in practice, you would remove the
> subquery as an optimization - but it is a generic construction that
> lends itself to more complex inner queries, including those that have
> annotate() clauses:
>
> SELECT AVG(num_authors) from (SELECT id, name, COUNT(author.id) AS
> num_authors FROM Book INNER JOIN Author ...
>
> Given that aggregate() is a terminal clause, users aren't ever going
> to be affected by the existence of the subquery, so I suppose that
> this isn't so bad.
>
I agree. I'll probably bring the subject up again in a few weeks to
discuss a good syntax for subqueries and the possibility of doing it
implicitly in annotate().aggregate() (which should be carefully
documented!)
> I still have a gut feeling that you should be able to compose this
> query without a subselect. However, I would need to sit down and look
> at the problem for a while to confirm that.
>
> After all that - I wouldn't get too hung up on these last cases. I
> posed them as intentionally difficult queries. While it would be nice
> to be able to handle them, getting the simple cases working (especially
> joins) is a much bigger priority. There will always be queries that
> can't be expressed using the ORM - the goal is to make the 80% of
> common queries trivial. The remaining 20% can always be serviced using
> raw SQL.
>
Good.
There is still the issue of the difference between
>>> Book.objects.all().filter(authors__name__contains='Norvig').annotate(num_authors=Count('authors__id')).filter(num_authors__gt=1)
and
>>> Book.objects.all().annotate(num_authors=Count('authors__id')).filter(authors__name__contains='Norvig', num_authors__gt=1)
But I will not speculate and, instead, start implementing filter() for
aggregates to further understand the problem and find possible
solutions.
Best Regards, =)
--
Nicolas