On aggregates


Nicolas E. Lara G.

May 8, 2008, 5:38:27 PM
to Django developers
For those that are interested the aggregation project is hosted in
http://code.google.com/p/django-aggregation/

The code is basically an up to date copy of trunk patched with the
aggregation support and the relevant tests and doc.

Currently there is support for both aggregate and annotate at the
simplest level. Some tests can be found under tests/modeltests/
aggregation/ in the current state of the project. They can probably
serve as documentation in the early stages of the project (i.e. now :p )


Any comments or suggestions are more than welcome! :-)


Nicolas

alex....@gmail.com

May 8, 2008, 5:51:02 PM
to Django developers
I haven't had a ton of time to look over everything (just read the
tests so far), but why did you choose to return things as Decimals
instead of floats?

On May 8, 4:38 pm, "Nicolas E. Lara G." <nicolasl...@gmail.com> wrote:
> For those that are interested the aggregation project is hosted in http://code.google.com/p/django-aggregation/

alex....@gmail.com

May 8, 2008, 5:56:29 PM
to Django developers
And a second question: does filtering on an annotated field work? For
example:
Tag.objects.all().annotate(num_articles=Count('articles')).filter(num_articles__gt=2)
which would return all tags that have more than 2 articles.


Russell Keith-Magee

May 9, 2008, 12:46:02 AM
to django-d...@googlegroups.com
On Fri, May 9, 2008 at 5:38 AM, Nicolas E. Lara G.
<nicol...@gmail.com> wrote:
>
> Currently there is support for both aggregate and annotate on the
> simplest level. Some tests can be found under tests/modeltests/
> aggregation/ in the current state of the project. They can probably
> serve as documentation in the early stages of the project (i.e. now :p )
>
> Any comments or suggestions are more than welcome! :-)

As a hint to the community - one of the best suggestions we can have
at this stage of the project is a set of interesting problems to
solve. This would give us a set of concrete 'problems' that we can use
to evaluate potential solutions. It may not be possible to find a
solution for every problem, but we won't know that until we have a
list of real problems to work with.

So, if you have an interesting use case for aggregates (i.e.,
aggregate queries that you would like to make, but currently need to
use raw SQL or an .extra() hack), let us know. If you can phrase that
query using the Author/Publisher/Book/Store model space that the new
aggregation model test defines, all the better.

Here's a few suggestions to get the ball rolling (with some reasoning
on why they are interesting):

* Does Author X have any friends?
(aggregate on a related object)

* Give me a list of all Books with more than 2 authors
(Query over the count of objects in a m2m relationship)

* Give me a list of all Authors that have no friends
(Query that counts rows that don't exist in the m2m table)

* Give me a list of all publishers that have published more than 10 books
(Query over the reverse direction of a FK)

* Give me a list of all publishers that have published more than 10
books this year
(Query over the reverse direction of a FK, with a filter on the related object)

* Give me a list of all Books that were written by X and one other author.
(Query that checks a count of m2m objects, plus membership of the m2m set)

* Give me a list of all Authors that have published a book with at
least one other person
(Filters over a count generated on a related object)

* Give me the average price of all Books that were written by X and
one other author.
(Aggregate over objects discovered using membership of the m2m set)

Yours,
Russ Magee %-)

Alen Ribic

May 9, 2008, 3:58:19 AM
to django-d...@googlegroups.com
Reporting is quite common in my area of work at the moment, and most of
the time some form of aggregation is required to build meaningful
reports. (For such cases I have substantially more custom/native
queries than standard ones.)

-Alen
(PS: if there is a need to give some real examples with regards to this,
I could put some up.)

Russell Keith-Magee

May 9, 2008, 4:12:20 AM
to django-d...@googlegroups.com
On Fri, May 9, 2008 at 3:58 PM, Alen Ribic <alen....@gmail.com> wrote:
>
> Reporting is quite common in my area of work at the moment, and most of
> the time some form of aggregation is required to build meaningful
> reports. (For such cases I have substantially more custom/native
> queries than standard ones.)
>
> -Alen
> (PS: if there is a need to give some real examples with regards to this,
> I could put some up.)

That's really what we're looking for. I would be highly surprised if
aggregates weren't used for reporting purposes - the standing question
is "what types of reports do people find interesting?"

The goal is not necessarily to enable _every_ aggregate query to be
expressed using the Django ORM, but we do want the most common queries
to be expressed easily. If your examples are extraordinarily complex,
they may be outside the scope of what we're looking to achieve, but we
won't know what is possible until we know what people want to do.

Yours,
Russ Magee %-)

Alen Ribic

May 9, 2008, 6:07:37 AM
to Django developers
Here is one of my current real-world, common examples:

Monthly maintenance costs - Report

# note: from_date/to_date should really be passed as query parameters,
# not interpolated into the SQL string
custom_sql_query = """
SELECT c.id, p.short_title, SUM(f.amount)
FROM manager_contract c, manager_financialtransaction f,
     manager_property p
WHERE f.pub_date >= '%s' AND f.pub_date <= '%s'
  AND f.transaction_type = 'M'
  AND f.contract_id = c.id
  AND c.property_to_be_letted_id = p.id
GROUP BY c.id, p.short_title
ORDER BY c.id, p.short_title
""" % (from_date, to_date)


1. The sum of maintenance costs per Contract is printed on this report.
(Contract ref number + Property short title are printed with the total
maintenance amounts for the given date range.)
2. The user also has a date range filter, which defaults to [1st of the
current month to the last day of the month].

-Alen Ribic



Nicolas Lara

May 10, 2008, 2:37:20 AM
to django-d...@googlegroups.com
Hello,
The choice of using Decimal was not mine but that of the modules that
connect to the backends (postgres in this case), and this is one of
the problems I am facing right now. Different backends return the
results in different formats (Postgres: Decimal, SQLite: float, etc.),
so I am looking into a way of normalizing the results to the same format.
The issue of the data structure that represents the results is only one
part of the problem. Due to the inner representation of numbers and
other factors in the different backends, the numeric results might
differ; for example, postgres might return 34.7321987 while sqlite
returns something like 34.732211 (rounding the results differently).
I believe we should not tamper with the results, since we should trust
our backends' calculations, but then the tests would pass or fail
depending on the backend used (and this might even change between
versions of a backend, but I haven't tried that).
I will look into this problem in depth and then write back when I come
up with some solutions.
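A minimal sketch of the kind of normalization being discussed, assuming the project were to settle on float as the common return type (the helper name is hypothetical, not from the patch):

```python
from decimal import Decimal

def normalize_aggregate(value):
    """Coerce backend-specific numeric types (e.g. the Decimals the
    postgres driver returns) to plain floats, leaving None untouched."""
    if value is None:
        return None
    if isinstance(value, Decimal):
        return float(value)
    return value

# A Postgres-style Decimal and a SQLite-style float now compare equal:
same = normalize_aggregate(Decimal("34.73")) == normalize_aggregate(34.73)
```

With such a shim applied at the backend boundary, test expectations would no longer depend on which driver produced the number.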

--
Nicolas Lara
Linux user #380134
Public key id: 0x152e7713 at http://keyserver.noreply.org/
# Anti-Spam e-mail addresses:
python -c "print '@'.join(['nicolaslara', '.'.join([x for x in
reversed( 'com|gmail'.split('|') )])])"
python -c "print '@'.join(['nicolas', '.'.join([x for x in reversed(
've|usb|labf|ac'.split('|') )])])"

Nicolas Lara

May 10, 2008, 2:40:26 AM
to django-d...@googlegroups.com
It is not possible yet, but it will be after the first stage (basic
stuff) is working correctly.
A great help would be, as Russell suggests, providing some complex
cases of aggregates (the actual plain-English problems that people want
to solve with aggregates) so we can see if the proposed functionality
adapts well to our needs.

--

Nicolas Lara

May 10, 2008, 3:22:25 AM
to django-d...@googlegroups.com
If I understand correctly, you are trying to do something like:
"Give me the sum of the financial transactions of type 'M' in a given
date range for each contract, ordered by id and short title."

Is that the kind of aggregate you needed?

This could be expressed as:

c = Contract.objects.all().annotate(total_amount=Sum('financial_transaction__amount')).filter(financial_transaction__pub_date__gt=from_date,
financial_transaction__pub_date__lt=to_date,
financial_transaction__type='M').order_by('id',
'properties__short_title')

Please correct me if this is not what you meant...

--

Russell Keith-Magee

May 10, 2008, 9:48:00 AM
to django-d...@googlegroups.com
On Sat, May 10, 2008 at 2:37 PM, Nicolas Lara <nicol...@gmail.com> wrote:
>
> Hello,
> The choice of using Decimal was not mine but that of the modules that
> connect to the backends (postgres in this case), and this is one of
> the problems I am facing right now. Different backends return the
> results in different formats (Postgres: Decimal, SQLite: float... etc)
> I am looking into a way of normalizing the results to the same format.

This is something we're going to want to normalize, and as Alex noted,
floats would make a lot more sense.

> The issue of the datastructure that represents the results is only one
> part of the problem. Due to the inner representation of numbers and
> other factors of the different backends, the numeric results might be
> different, for example, postgres might return 34.7321987 while sqlite
> returns something like 34.732211 (rounding differentely the results).

Keep in mind that there are already a few places in the tests where we
have to deal with inexact floating point references. The easy solution
is to modify the doctest to only print n decimal places, where n is
enough places to check the test case but won't contain any floating
point ambiguity (in your case, check for 34.73 or 34.732).
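For instance, with the two hypothetical backend results mentioned earlier in the thread, two (or three) decimal places are enough to make the printed output backend-independent:

```python
# Two backends returning slightly different floats for the same
# aggregate (example values from the discussion above):
pg_value = 34.7321987      # e.g. what postgres returns
sqlite_value = 34.732211   # e.g. what sqlite returns

# Printing a limited number of decimal places removes the ambiguity:
two_pg = "%.2f" % pg_value          # "34.73"
two_sqlite = "%.2f" % sqlite_value  # "34.73"
three_pg = "%.3f" % pg_value        # "34.732"
three_sqlite = "%.3f" % sqlite_value  # "34.732"
```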

Yours,
Russ Magee %-)

David Cramer

May 12, 2008, 10:50:46 AM
to Django developers
What is the difference between annotate and aggregate? They seem like
they'd do the same thing, except annotate sounds like it should be
doing GROUP BY. If that's the case, then this goes against the very
reasoning for which a group_by or something similar should not be
used. The logic in the implied filters should be enough to determine
whether you need to use a GROUP BY or a WHERE, or honestly, even a
DISTINCT (as I discussed in IRC the other day, DISTINCT is just another
approach to grouping, and should be dropped if we won't support GROUP
BY directly).


koenb

May 12, 2008, 3:05:17 PM
to Django developers
If I understand it correctly, the aggregate and annotate ideas are
result driven, not driven by the SQL that might be involved (which
should indeed be entirely hidden). Aggregate normally gives one result;
annotate gives you a bunch of objects, but with extra information PER
object (normally from an aggregate on a foreign table).

David Cramer

May 12, 2008, 5:11:51 PM
to Django developers
My assumption was that annotate was the equivalent of GROUP BY, which
is what annotate sounds like :)

If that's the case, I believe annotate should also replace distinct(),
as it is very similar in its use-cases.

Russell Keith-Magee

May 12, 2008, 7:37:56 PM
to django-d...@googlegroups.com
On Mon, May 12, 2008 at 10:50 PM, David Cramer <dcr...@gmail.com> wrote:
>
> What is the difference between annotate and aggregate? They seem like
> they'd do the same thing, except annotate sounds like it should be
> doing GROUP BY,

I don't mean to be rude, David, but the difference between the two
functions has been discussed _at length_, both in the past, and in
recent memory.

Short version:

Annotate returns a query set of objects, but with each returned object
enhanced with aggregated attribute(s) (e.g., a list of people, each of
which is annotated with the sum of their individual purchases)

Aggregate returns the summary statistics over a query set (e.g., the
average age of people matching a query).
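The distinction can be sketched in plain Python (hypothetical data, not actual Django code):

```python
# annotate(): each object in the result set comes back with an extra,
# aggregated attribute; aggregate(): one summary over the whole set.
people = [
    {"name": "Ann", "purchases": [10, 20]},
    {"name": "Bob", "purchases": [5]},
]

# annotate-like: a "queryset" of people, each with a purchase total
annotated = [dict(p, purchases__sum=sum(p["purchases"])) for p in people]

# aggregate-like: a single dictionary of summary statistics
aggregated = {"purchases__sum": sum(sum(p["purchases"]) for p in people)}
```

Here `annotated` still contains one entry per person (each carrying its own total), while `aggregated` collapses the whole set into one dictionary.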

Yours,
Russ Magee %-)

alex....@gmail.com

May 12, 2008, 10:26:51 PM
to Django developers
Hey, just saw the update, and my suggestion would be: instead of
using round(), figure out how many places of agreement there are
between the backends and use the ... syntax instead.

For example if:
pgsql: 37.4697
sqlite: 37.4701
mysql: 37.47

just print out
37.4...
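The "..." Alex refers to is doctest's ELLIPSIS option; a small, self-contained illustration (the literal values are just the examples from this thread):

```python
import doctest

# "37.4..." matches whatever trailing digits a particular backend
# happens to produce after the shared prefix.
SNIPPET = """
>>> print(37.4697)   # e.g. postgres
37.4...
>>> print(37.4701)   # e.g. sqlite
37.4...
"""

parser = doctest.DocTestParser()
test = parser.get_doctest(SNIPPET, {}, "backend_rounding", None, 0)
runner = doctest.DocTestRunner(optionflags=doctest.ELLIPSIS)
failed, attempted = runner.run(test)
# failed == 0: both printed values match the "37.4..." expectation
```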


David Cramer

May 13, 2008, 12:01:12 AM
to Django developers
I realize how the aggregates work, but if annotate is just for
aggregates, then remove it as a standalone method. If it's not, then
it should solve all the problems with one blow.


Russell Keith-Magee

May 13, 2008, 12:27:42 AM
to django-d...@googlegroups.com
On Tue, May 13, 2008 at 12:01 PM, David Cramer <dcr...@gmail.com> wrote:
>
> I realize how the aggregates work, but if annotate is just for
> aggregates, then remove it as a standalone method. If it's not, then
> it should solve all the problems with one blow.

Pardon?

Annotate and aggregate are two very distinct features. They can be
easily distinguished by their return types.

annotate() - returns a query set which contains object instances
aggregate() - returns a dictionary containing aggregated values only

How exactly do you propose collapsing these two functions?

Yours,
Russ Magee %-)

Nicolas Lara

May 13, 2008, 11:29:54 AM
to django-d...@googlegroups.com
Thanks for the suggestion, Alex.
I will do that.

Also, I'm glad to see you are following the updates closely :)

--

David Cramer

May 13, 2008, 12:55:21 PM
to Django developers
That's much clearer than what I had been reading.

If that is the case, then annotate would replace GROUP BY, and should
also be able to replace distinct().


Collin Grady

May 13, 2008, 1:29:19 PM
to django-d...@googlegroups.com
David Cramer said the following:

> If that is the case, then annotate would replace GROUP BY, and should
> also be able to replace distinct().

Why would replacing distinct() ever make any sense?

--
Collin Grady

David Cramer

May 13, 2008, 1:36:23 PM
to django-d...@googlegroups.com
DISTINCT and GROUP BY are both annotating the results: taking many rows and turning them into a single row based on a key. DISTINCT would be the same as GROUP BY every, field, in, the, select.
--
David Cramer
Director of Technology
iBegin
http://www.ibegin.com/
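The equivalence David describes can be checked with an in-memory SQLite table (hypothetical data; this sketches the SQL semantics only, not any proposed ORM API):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (name TEXT, kind TEXT);
    INSERT INTO tag VALUES ('django', 'tech');
    INSERT INTO tag VALUES ('django', 'tech');
    INSERT INTO tag VALUES ('sql', 'tech');
""")

# DISTINCT over all selected columns...
distinct_rows = conn.execute(
    "SELECT DISTINCT name, kind FROM tag ORDER BY name").fetchall()

# ...returns the same rows as GROUP BY every selected column.
grouped_rows = conn.execute(
    "SELECT name, kind FROM tag GROUP BY name, kind ORDER BY name").fetchall()

assert distinct_rows == grouped_rows
```

Both queries yield `[('django', 'tech'), ('sql', 'tech')]`, which is the mathematical identity being debated; the thread's point is that the two spellings still serve different readers.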

Nicolas Lara

May 13, 2008, 3:00:39 PM
to django-d...@googlegroups.com
Hi David,
even if annotate() actually _could_ replace distinct()... wouldn't you
like to keep the syntactic sugar around? Not to mention the backward
compatibility... Also, for users looking for the simple distinct
behaviour, it is very complex and unnecessary to have to think of it in
terms of grouping and rows. The idea is to provide simple abstractions
that resemble the way we think about the problems.

--

Collin Grady

May 13, 2008, 3:15:08 PM
to django-d...@googlegroups.com
David Cramer said the following:
> DISTINCT and GROUP BY are both annotating the results. Taking many rows,
> and turning them into a single row based on a key. DISTINCT would be the
> same as GROUP BY every, field, in, the, select.

But forcing people to group by every field they select in order to get
distinct results is absolutely idiotic, so why even suggest it?

.distinct() maps to DISTINCT, and even if in some way it's identical,
there's no harm in leaving it for ease of use.

--
Collin Grady

"If you don't want your dog to have bad breath, do what I do: Pour a little
Lavoris in the toilet."
-- Comedian Jay Leno

Sander Steffann

May 13, 2008, 4:03:21 PM
to django-d...@googlegroups.com
Hi,

> DISTINCT and GROUP BY are both annotating the results. Taking many
> rows, and turning them into a single row based on a key. DISTINCT would
> be the same as GROUP BY every, field, in, the, select.

Mathematically true, but I think having two separate concepts makes the
code easier to read. And they also map to two different SQL syntaxes,
which might have different optimisation consequences.

- Sander

David Cramer

May 13, 2008, 4:53:03 PM
to django-d...@googlegroups.com
I'm not suggesting replacing the SQL usage, just that the two are identical in abstraction; and if the API truly doesn't want an SQL feel, then it shouldn't have one.

Nicolas E. Lara G.

May 24, 2008, 9:18:38 AM
to Django developers
Hi Russell,
I have been adding your test cases to the test suite and ran into
trouble with some of them, so I would like to know a bit more about
what you were expecting from the queries:

> * Does Author X have any friends?
> (aggregate on a related object)
>

Author.objects.filter(pk=1).aggregate(Count('friends__id'))

> * Give me a list of all Books with more than 2 authors
> (Query over the count of objects in a m2m relationship)
>

>>> Book.objects.all().annotate(num_authors=Count('authors__name')).filter(num_authors__gt=2)

> * Give me a list of all Authors that have no friends
> (Query that counts when rows that don't exist in the m2m table)
>

>>> Author.objects.all().annotate(num_friends=Count('friends__id', distinct=True), allow_nulls=True).filter(num_friends=0)

> * Give me a list of all publishers that have published more than 10 books
> (Query over the reverse direction of a FK)
>

>>> Publisher.objects.all().annotate(num_books=Count('book__id')).filter(num_books__gt=10)

> * Give me a list of all publishers that have published more than 10
> books this year
> (Query over the reverse direction of a FK, with a filter on the related object)
>

>>> Publisher.objects.all().annotate(num_books=Count('book__id')).filter(num_books__gt=10, book__price__lt=50)

So far so good. Though filter() is not yet implemented on aggregates,
the previous examples would work as soon as filter() starts working on
aggregates. The generated SQL and what is planned for filter() seem to
match the problems in question very well.

Now the hard ones:

> * Give me a list of all Books that were written by X and one other author.
> (Query that checks a count of m2m objects, plus membership of the m2m set)
>

We have two possibilities here that would make sense.
The first intuitive way to express this would be:

>>> Book.objects.all().filter(authors__name__contains='Norvig').annotate(num_authors=Count('authors__id')).filter(num_authors__gt=1)

This has problems because, given the nature of the SQL that is being
generated, when the filtering occurs (it is transformed into a WHERE
clause) we are restricting the rows to only those that have author
X. This way, counting them won't give the actual number of authors
for each book but the number of authors currently selected (just 1).
To work around this, the way of doing the query is this:

>>> Book.objects.all().annotate(num_authors=Count('authors__id')).filter(authors__name__contains='Norvig', num_authors__gt=1)

which would do the filtering after the aggregate, so the count is
executed on all the books.

Should we consider this something that should be solved with
documentation or should I implement the logic to restructure the query
if an aggregate is executed on a QuerySet that has been filtered?

> * Give me a list of all Authors that have published a book with at
> least one other person
> (Filters over a count generated on a related object)
>

In this one I cheated by doing this:
>>> [a for a in Author.objects.all().annotate(num_colleagues=Count('book__authors__id'), num_books=Count('book__id', distinct=True)) if a.num_colleagues - a.num_books > 0]
but I am sure the intention was to get it done using just the
aggregation syntax. Do you have any suggestion on how to express this
(with either existing or non-existent syntax)?

An option for this would probably be the previously suggested F()
syntax. So we could do:
>>> Author.objects.all().annotate(num_colleagues=Count('book__authors__id'), num_books=Count('book__id', distinct=True)).filter(num_colleagues__gt=F('num_books'))

> * Give me the average price of all Books that were written by X and
> one other author.
> (Aggregate over objects discovered using membership of the m2m set)
>

Also for this problem, how would you express it in aggregation syntax?

An option that looks very good from the syntax point of view would be:
>>> Book.objects.all().annotate(num_authors=Count('authors__id')).filter(num_authors__gt=1).aggregate(Avg('price'))

but for this to work we would need to introduce the notion of
subqueries, which, AFAIK, has only been dealt with in extra().

I believe it would be quite clean to introduce subqueries in common
cases like this, especially when dealing with aggregate(), since
aggregate is a terminal clause.



Cheers,

Nicolas Lara

alex....@gmail.com

May 24, 2008, 5:00:48 PM
to Django developers
All of this looks pretty good. The two things I have to say are: a) I
think the F() way of dealing with that, in place of the list
comprehension, is the right way to go; there has been a lot of good
work on that patch, and I think it works well, conceptually, with
this. b) Is there any plan to make selecting on the primary key the
default, so one can do Count('authors') in place of Count('authors__id')?

Alex


Karen Tracey

May 25, 2008, 12:38:41 AM
to django-d...@googlegroups.com
On Fri, May 9, 2008 at 12:46 AM, Russell Keith-Magee <freakb...@gmail.com> wrote:
> As a hint to the community - one of the best suggestions we can have
> at this stage of the project is a set of interesting problems to
> solve. This would give us a set of concrete 'problems' that we can use
> to evaluate potential solutions. It may not be possible to find a
> solution for every problem, but we won't know that until we have a
> list of real problems to work with.
>
> So, if you have an interesting use case for aggregates (i.e.,
> aggregate queries that you would like to make, but currently need to
> use raw SQL or an .extra() hack), let us know. If you can phrase that
> query using the Author/Publisher/Book/Store model space that the new
> aggregation model test defines, all the better.

I don't have an example phrased in your Author/Publisher/Book/Store model, but I do have some actual cases where I've been using raw SQL, specifically some stats pages for my crossword database.  I've been meaning to see if I could re-frame them using this aggregate/annotate support, and finally found some time to look at that today.  (But maybe not enough, since I didn't get very far.)  I started with one query that involves two models (names are all non-Django-ish because the DB existed before Django):

class Entries(models.Model):
    EntryID = models.AutoField(primary_key=True, db_column='Entry ID')
    Entry = models.CharField(unique=True, max_length=50)
    Exclude = models.BooleanField()

class Clues(models.Model):
    ID = models.AutoField(primary_key=True)
    EntryID = models.ForeignKey(Entries, verbose_name='Entry', db_column = 'Entry ID')
    PuzzleID = models.ForeignKey(Puzzles, verbose_name='Puzzle', db_column = 'Puzzle ID')
    Clue = models.CharField(max_length=150, core=True)
    ....and a few other fields that have no bearing on this query...

What I'm looking to generate is a table of most common Entries, and for each Entry to have a count of Appearances, a count of Distinct Clues for the Entry, and a 'Freshness Factor' which is #distinct clues / #appearances.  I want to order this list by #Appearances descending and limit it to 50 elements.  The raw (MySQL) SQL I currently use for this is:

SELECT Entry AS `Entry`,
       Count(`Entry ID`) AS `Appearances`,
       Count(Distinct(Clue)) AS `Distinct Clues`,
       Format(Count(Distinct(Clue))/Count(`Entry ID`),2) AS `Freshness Factor`
FROM Entries NATURAL JOIN Clues
GROUP BY `Entry ID`
ORDER BY `Appearances` DESC LIMIT 50
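For reference, the same query shape runs as a miniature against an in-memory SQLite table (hypothetical data; FORMAT() is MySQL-only, so ROUND() stands in for it here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clues (entry TEXT, clue TEXT);
    INSERT INTO clues VALUES ('AREA', 'Region');
    INSERT INTO clues VALUES ('AREA', 'Region');
    INSERT INTO clues VALUES ('AREA', 'Zone');
    INSERT INTO clues VALUES ('ERA', 'Epoch');
""")

rows = conn.execute("""
    SELECT entry,
           COUNT(*)             AS appearances,
           COUNT(DISTINCT clue) AS distinct_clues,
           ROUND(COUNT(DISTINCT clue) * 1.0 / COUNT(*), 2) AS freshness
    FROM clues
    GROUP BY entry
    ORDER BY appearances DESC
    LIMIT 50
""").fetchall()

# 'AREA' appeared 3 times with 2 distinct clues, so its freshness
# factor is roughly 2/3.
assert rows[0][:3] == ('AREA', 3, 2)
assert abs(rows[0][3] - 0.67) < 1e-9
```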

Trying to translate that to annotate/aggregate, I first think I need to start with the Clues model (even though I think of this more as an Entries query) because I can't see how to count distinct Clues if I start with Entries, since I don't know how to go back from Entries to Clues... did I get that right or am I confused?

Anyway, starting with Clues I've gotten this far:

Clues.objects.values('EntryID__Entry').annotate(Appearances=Count('Entry ID'), Distinct_Clues=Count('Clue', distinct=True))

(Note I think there's a little bug there -- the Clues field is EntryID, while the database column name is 'Entry ID' -- I had to put the space in on the Count() call but I think I shouldn't have to do that?  But that's really beside the point of my main questions.)

which gives me a ValuesQuerySet that returns a dictionary containing EntryID__Entry, Appearances, and Distinct_Clues for each Entry.  What I don't know how to do is:

1 - add in the 'Freshness Factor'.  Sure, I can calculate it myself from Distinct_Clues/Appearances but I am curious if there is some way to push the work to the DB like I did in my original raw sql?  (Which would also allow for ordering on this derived value, if I knew the answer to #2.)

2 - order by Appearances.  How do I order by an aggregate result?  I can't find any mention of this in the doc or in a brief scan of the previous discussions.  This one I really do want the database to do vs. my code due to the number of distinct entries in the DB (>100,000).  Have I just missed how to do this somewhere?

All in all it's taking me a bit of effort to train my brain to think in terms of the aggregate/annotate vs. raw sql, but it seems pretty cool.

Thanks,
Karen

Russell Keith-Magee

May 26, 2008, 8:10:08 AM
to django-d...@googlegroups.com
On Sat, May 24, 2008 at 9:18 PM, Nicolas E. Lara G.
<nicol...@gmail.com> wrote:
>
> Hi Russell,
> I have been adding your test cases to the test suit and ran into
> troubles with some of them so I would like to know a bit more on what
> you were expecting from the queries:

>> * Does Author X have any friends?
>> (aggregate on a related object)
>>
>
> Author.objects.filter(pk=1).aggregate(Count('friends__id'))

I would have thought that Count('friends') would be automatically
expanded to a count over the primary key. i.e., the example you
provide is explicit, aggregate(Count('friends')) would be an
equivalent implicit form.

>> * Give me a list of all Books with more than 2 authors
>> (Query over the count of objects in a m2m relationship)
>
>>>>Book.objects.all().annotate(num_authors=Count('authors__name')).filter(num_authors__gt=2)

Looks good to me, with the exception of a similar comment about the
implicit expansion of authors__name.

>> * Give me a list of all Authors that have no friends
>> (Query that counts when rows that don't exist in the m2m table)
>>
>
>>>> Author.objects.all().annotate(num_friends=Count('friends__id', distinct=True), allow_nulls=True).filter(num_friends=0)

* I've commented to you previously about the allow_nulls=True option.

* Is a distinct option required as part of the Count operator? I would
have thought the existing distinct() queryset modifier would be
sufficient.

* Again, there is the potential for implicit expansion of Count('friends')

>> * Give me a list of all publishers that have published more than 10 books
>> (Query over the reverse direction of a FK)
>>
>
>>>> Publisher.objects.all().annotate(num_books=Count('book__id')).filter(num_books__gt=10)

With the exception of the implicit book__id expansion, looks good to me.

>> * Give me a list of all publishers that have published more than 10
>> books this year
>> (Query over the reverse direction of a FK, with a filter on the related object)
>>
>
>>>> Publisher.objects.all().annotate(num_books=Count('book__id')).filter(num_books__gt=10, book__price__lt=50)

The filter you implement here is a little different to the question as
stated, but it covers the use case I was driving at.

> So far so good. Though filter() is not yet implemented on aggregates,
> the previous examples would work as soon as filter() starts working on
> aggregates. The generated SQL and what is planned for filter() seem to
> match the problems in question very well.
>
> Now the hard ones:
>
>> * Give me a list of all Books that were written by X and one other author.
>> (Query that checks a count of m2m objects, plus membership of the m2m set)
>>
>
> We have two possibilities here that would make sense.
> The first intuitive way to express this would be:
>
>>>> Book.objects.all().filter(authors__name__contains='Norvig').annotate(num_authors=Count('authors__id')).filter(num_authors__gt=1)
>
> This has problems because, given the nature of the SQL that is being
> generated, when the filtering occurs (it is transformed into a WHERE
> clause) we are restricting the rows to only those that have author
> X. This way, counting them won't give the actual number of authors
> for each book but the number of authors currently selected (just 1).
> To work around this, the way of doing the query is this:
>
>>>> Book.objects.all().annotate(num_authors=Count('authors__id')).filter(authors__name__contains='Norvig', num_authors__gt=1)
>
> which would do the filtering after the aggregate so the count is
> executed on all the books.
>
> Should we consider this something that should be solved with
> documentation or should I implement the logic to restructure the query
> if an aggregate is executed on a QuerySet that has been filtered?

With the exception of terminal queryset clauses (like count() or
values()), I can't think of any other case where the result of a
queryset depends on the order in which operations are applied. If at
all possible, I would think that early filters should be restructured.

>> * Give me a list of all Authors that have published a book with at
>> least one other person
>> (Filters over a count generated on a related object)
>>

...


> aggregation syntax. Do you have any suggestion on how to express this
> (with either existing or non-existant syntax)?

Not specifically. I was deliberately trying to be concerned with the
questions, rather than the answers. As I said previously, it's not
essential that we represent every possible query - complex queries
will always require manual SQL - but it is important that we have some
idea of the limitations of the syntax.

> An option for this would probably be the previously suggested F-
> syntax. So we could do:
>>>> Author.objects.all().annotate(num_colleagues=Count('book__authors__id'), num_books=Count('book__id', distinct=True)).filter(num_colleagues__gt=F('num_books'))

I don't think this is correct. By my reading of this example, an
author that writes 5 books by themself will have 5 colleagues.

However, I think you are correct that this example may require F()
style syntax. If this puts this example outside your scope for v1 of
aggregation, thats fine with me.

On the subject of F() - I don't know if you've noticed the thread on
django-developers, but there is now a patch in the system (#7210) that
implements the F() syntax.

>> * Give me the average price of all Books that were written by X and
>> one other author.
>> (Aggregate over objects discovered using membership of the m2m set)
>>
>
> Also for this problem, how would you express it in aggregation syntax?
>
> An option that looks very good from the syntax point of view would be:
>>>> Book.objects.all().annotate(num_authors=Count('authors__id')).filter(num_authors__gt=1).aggregate(Avg('price'))
>
> but for this to work we would need to introduce the notion of
> subqueries which, AFAIK, have only been dealt with in extra().

Why a subquery? Isn't this a WHERE clause over an aggregate that isn't
in the return set?

> I believe it would be quite clean to introduce subqueries in common
> cases like this. Especially when dealing with aggregate since
> aggregate is a terminal clause.

If we need subqueries, I'd rather be explicit about it: e.g.,

>>> Book.objects.filter(author__in=Author.objects.filter(...))

Yours,
Russ Magee %-)

Nicolas Lara

May 26, 2008, 11:10:14 AM
to django-d...@googlegroups.com
On Sat, May 24, 2008 at 5:00 PM, alex....@gmail.com
<alex....@gmail.com> wrote:
>
> All of this looks pretty good. The 2 things I have to say are, a) I
> think the F() way of dealing with that, in place of list comprehension
> is the right way to go, there has been a lot of good work on that
> patch, and I think it works well, conceptually, with this.

I agree, I haven't looked at the code of the F() patch yet, but I've
read the discussions. I am still a bit far from the point where I
would need to integrate F with aggregates so it might even get to
trunk before I need to work with it.

> b) Is there
> any plan to make selecting on the primary key the default, so one can
> do Count('authors') in place of Count('authors__id')?
>

Yes, that's the idea. But one would still be able to do
Count('authors__name') for counting non-null names, or go explicit with
Count('authors__id').
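The NULL-skipping behaviour behind Count('authors__name') is standard SQL, and can be checked outside Django with an in-memory SQLite table (a hedged sketch only; the schema and data are invented for illustration, Django is not involved):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO author (name) VALUES (?)",
                 [("Norvig",), ("Russell",), (None,)])  # one nameless author

# COUNT(*) counts every row; COUNT(name) skips NULLs, which is what
# makes Count('authors__name') differ from the implicit Count('authors').
rows = conn.execute("SELECT COUNT(*), COUNT(name) FROM author").fetchone()
print(rows)  # (3, 2)
```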

--

Nicolas Lara

May 26, 2008, 12:25:47 PM
to django-d...@googlegroups.com
Hi Karen,

Thanks for taking the time to test the code and for providing a very
useful example/test case =)

>
> class Entries(models.Model):
> EntryID = models.AutoField(primary_key=True, db_column='Entry ID')
> Entry = models.CharField(unique=True, max_length=50)
> Exclude = models.BooleanField()
>
> class Clues(models.Model):
> ID = models.AutoField(primary_key=True)
> EntryID = models.ForeignKey(Entries, verbose_name='Entry', db_column =
> 'Entry ID')
> PuzzleID = models.ForeignKey(Puzzles, verbose_name='Puzzle', db_column =
> 'Puzzle ID')
> Clue = models.CharField(max_length=150, core=True)
> ....and a few other fields that have no bearing on this query...
>
> What I'm looking to generate is a table of most common Entries, and for each
> Entry to have a count of Appearances, a count of Distinct Clues for the
> Entry, and a 'Freshness Factor' which is #distinct clues / #appearances. I
> want to order this list by #Appearances descending and limit it to 50
> elements. The raw (MySQL) SQL I currently use for this is:
>
> SELECT Entry AS `Entry`, Count(`Entry ID`) AS `Appearances`,
> Count(Distinct(Clue)) AS `Distinct Clues`,
> Format(Count(Distinct(Clue))/count(`Entry ID`),2) AS `Freshness Factor` FROM
> Entries NATURAL JOIN Clues GROUP BY `Entry ID` ORDER BY `Appearances` DESC
> LIMIT 50

Not all the things you need to do here are implemented, but most (if
not all) of them are planned.

> Trying to translate that to annotate/aggregate I first think I need to start
> with the Clues model (even though I think of this more as an Entries query)
> because I can't see how to count distinct Clues if I start with Entries,
> since I don't know how to go back from Entries to Clues...did I get that
> right or am I confused?
>

For me it seems more intuitive to do it on Entries, but it is possible
to do it both ways.

> Anyway, starting with Clues I've gotten this far:
>
> Clues.objects.values('EntryID__Entry').annotate(Appearances=Count('Entry
> ID'), Distinct_Clues=Count('Clue', distinct=True))
>

The query you came up with can also be expressed as:

>>> Entries.objects.values('Entry').annotate(Appearances=Count('Entry ID'), Distinct_Clues=Count('clues__Clue', distinct=True))

which is equivalent.

> (Note I think there's a little bug there -- the Clues field is EntryID,
> while the database column name is 'Entry ID' -- I had to put the space in on
> the Count() call but I think I shouldn't have to do that? But that's really
> beside the point of my main questions.)

This is true. I've added it as a bug in the project's trac and it
should soon be fixed. Thanks.

> which gives me a ValuesQuerySet that returns a dictionary containing
> EntryID__Entry, Appearances, and Distinct_Clues for each Entry. What I
> don't know how to do is:
>
> 1 - add in the 'Freshness Factor'. Sure, I can calculate it myself from
> Distinct_Clues/Appearances but I am curious if there is some way to push the
> work to the DB like I did in my original raw sql? (Which would also allow
> for ordering on this derived value, if I knew the answer to #2.)
>

Currently the way of defining new Aggregate objects is not documented.
This is mainly because I haven't worked much on the documentation
yet. Also, the way this is done now might change in the future in
favour of something more user-friendly. For now, if you wanted to define
your own Aggregate object you could do something like this:

class FreshnessFactor(Aggregate):
    def __init__(self, lookup):
        super(FreshnessFactor, self).__init__(lookup)
        self.sql_template = """Count(Distinct("Clue"))/count("Entry ID")"""

and then do something like:

Entries.objects.values('Entry').annotate(Appearances=Count('clues__EntryID'),
Distinct_Clues=Count('clues__Clue', distinct=True),
Freshness_Factor=FreshnessFactor(''))

Note that this is a dirty hack and I haven't tested it. (It will
probably not work because of the bug with the spaces you pointed out...
It also doesn't take into account the column aliases.) This is just the
way to work it out with the current implementation. But as soon as the
basic part is over I will start working on a cleaner implementation
for arithmetic and for defining your own aggregate objects.
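For what it's worth, the SQL that such a FreshnessFactor is aiming for can be exercised directly in SQLite (a sketch only; the table, column names, and data are made up, and Django is not involved):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clue (entry TEXT, clue TEXT)")
conn.executemany("INSERT INTO clue VALUES (?, ?)",
                 [("ERA", "Epoch"), ("ERA", "Epoch"), ("ERA", "Age"),
                  ("OREO", "Cookie")])

# Appearances, distinct clues, and freshness = distinct clues / appearances,
# ordered by appearances descending, as in Karen's original raw query.
rows = conn.execute("""
    SELECT entry,
           COUNT(*) AS appearances,
           COUNT(DISTINCT clue) AS distinct_clues,
           1.0 * COUNT(DISTINCT clue) / COUNT(*) AS freshness
    FROM clue
    GROUP BY entry
    ORDER BY appearances DESC
    LIMIT 50
""").fetchall()
print(rows)
```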


> 2 - order by Appearances. How do I order by an aggregate result? I can't
> find any mention of this in the doc or in a brief scan of the previous
> discussions. This one I really do want the database to do vs. my code due
> to the number of distinct entries in the DB (>100,000). Have I just missed
> how to do this somewhere?
>

Also, it is not implemented but will be soon. The idea is to simply
issue an order_by('Appearances') call and get the ordering.

> All in all it's taking me a bit of effort to train my brain to think in
> terms of the aggregate/annotate vs. raw sql, but it seems pretty cool.

Thanks =)

>
> Thanks,
> Karen

Nicolas Lara

May 26, 2008, 1:54:38 PM
to django-d...@googlegroups.com
>>
>> Author.objects.filter(pk=1).aggregate(Count('friends__id'))
>
> I would have thought that Count('friends') would be automatically
> expanded to a count over the primary key. i.e., the example you
> provide is explicit, aggregate(Count('friends')) would be an
> equivalent implicit form.
>

It will.

>>> * Give me a list of all Books with more than 2 authors
>>> (Query over the count of objects in a m2m relationship)
>>
>>>>>Book.objects.all().annotate(num_authors=Count('authors__name')).filter(num_authors__gt=2)
>
> Looks to me, with the exception of a similar comment about implicit
> expansion of authors__name.
>

Same as before... implicit is coming :p

But I don't think the idea is to always use the implicit form. In this case
the problem adapts so that any non-nullable field would work, but this
query is currently counting "authors that have names". I know it is
not exactly the problem, but it illustrates the possibility of using this
kind of count. "Authors that have names" might sound silly in real
life, but maybe if "Price" wrote a book he wouldn't be counted by our
query. (Just kidding, there are many real-life uses for this =) )

>>> * Give me a list of all Authors that have no friends
>>> (Query that counts when rows that don't exist in the m2m table)
>>>
>>
>>>>> Author.objects.all().annotate(num_friends=Count('friends__id', distinct=True), allow_nulls=True).filter(num_friends=0)
>
> * I've commented to you previously about the allow_nulls=True option.

Yes, allow_nulls is going away, just a copy/paste from the tests.

> * Is a distinct option required as part of the Count operator? I would
> have thought the existing distinct() queryset modifier would be
> sufficient.

distinct is optional. The problem with using the distinct() queryset
modifier is that it would affect the entire query. What would happen
if you wanted to solve something like this:

"Give me a list of kids with the number of pets among all of their
friends and the number of different toys their friends have"
It could be solved like this:
>>> Kids.objects.annotate(Count('friends__toys'), Count('friends__toys', distinct=True))
while distinct() would act on the select or, even if we adapted it, it
would act inside all the Count objects.
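The point is easy to see in raw SQL: both counts sit in the same SELECT, so the DISTINCT must be scoped per aggregate rather than per query. A SQLite sketch with invented toy data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (kid TEXT, toy TEXT)")
conn.executemany("INSERT INTO toy VALUES (?, ?)",
                 [("Ana", "ball"), ("Ana", "ball"), ("Ana", "kite")])

# A query-wide DISTINCT could not express this: one aggregate counts
# duplicates, the other does not, within a single SELECT.
row = conn.execute(
    "SELECT kid, COUNT(toy), COUNT(DISTINCT toy) FROM toy GROUP BY kid"
).fetchone()
print(row)  # ('Ana', 3, 2)
```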

> * Again, there is the potential for implicit expansion of Count('friends')

same here

>
>>> * Give me a list of all publishers that have published more than 10 books
>>> (Query over the reverse direction of a FK)
>>>
>>
>>>>> Publisher.objects.all().annotate(num_books=Count('book__id')).filter(num_books__gt=10)
>
> With the exception of the implicit book__id expansion, looks good to me.
>
>>> * Give me a list of all publishers that have published more than 10
>>> books this year
>>> (Query over the reverse direction of a FK, with a filter on the related object)
>>>
>>
>>>>> Publisher.objects.all().annotate(num_books=Count('book__id')).filter(num_books__gt=10, book__price__lt=50)
>
> The filter you implement here is a little different to the question as
> stated, but it covers the use case I was driving at.
>

Yes, I adapted it to the models. I could add a date field to the
models... anyway, it's isomorphic.

I'll get back to you on this later since I would like to give it a
deeper analysis.

>>> * Give me a list of all Authors that have published a book with at
>>> least one other person
>>> (Filters over a count generated on a related object)
>>>
> ...
>> aggregation syntax. Do you have any suggestion on how to express this
>> (with either existing or non-existant syntax)?
>
> Not specifically. I was deliberately trying to be concerned with the
> questions, rather than the answers. As I said previously, it's not
> essential that we represent every possible query - complex queries
> will always require manual SQL - but it is important that we have some
> idea of the limitations of the syntax.
>
>> An option for this would probably be the previously suggested F-
>> syntax. So we could do:
>>>>> Author.objects.all().annotate(num_coleagues=Count('book__authors__id'), num_books=Count('book__id', distinct=True)).filter(num_coleagues__gt=F(num_books))
>
> I don't think this is correct. By my reading of this example, an
> author that writes 5 books by themself will have 5 colleagues.
>

Yes, this is why I am filtering out those that have fewer or the same
number of colleagues as books. Maybe "colleagues" was a bad choice of
word.

> However, I think you are correct that this example may require F()
> style syntax. If this puts this example outside your scope for v1 of
> aggregation, thats fine with me.
>
> On the subject of F() - I don't know if you've noticed the thread on
> django-developers, but there is now a patch in the system (#7210) that
> implements the F() syntax.
>

Yes, I don't know when it will be ready, but anyway I would need to
integrate it into aggregation. I believe the most reasonable approach
(for the sanity of both sebastian_noack and me) is to wait until it is
stable and then try to merge it in. But probably everything depends on
the timing.

>>> * Give me the average price of all Books that were written by X and
>>> one other author.
>>> (Aggregate over objects discovered using membership of the m2m set)
>>>
>>
>> Also for this problem, how would you express it in aggregation syntax?
>>
>> An option that looks very good from the syntax point of view would be:
>>>>> Book.objects.all().annotate(num_authors=Count('authors__id')).filter(num_authors__gt=1).aggregate(Avg('price'))
>>
>> but for this to work we would need to introduce the notion of
>> subqueries which, AFAIK, have only been dealt with in extra().
>
> Why a subquery? Isn't this a WHERE clause over an aggregate that isn't
> in the return set?
>
>> I believe it would be quite clean to introduce subqueries in common
>> cases like this. Especially when dealing with aggregate since
>> aggregate is a terminal clause.
>
> If we need subqueries, I'd rather be explicit about it: e.g.,
>
>>>> Book.objects.filter(author__in=Author.objects.filter(...))
>

(strong statements to come... please correct me if I'm wrong)

There are two possibilities in which I think subqueries might be
useful (regarding aggregation).

One is something like this:
"Give me the average number of authors per book"
the expected result is something like:
{average: 2.3}
So we are calculating an average over a count. Since aggregate
functions cannot be nested, the way to do AVG(COUNT(some_field)) is to
do a subquery that annotates each object with a calculation, and then
over that set we aggregate.

the other is something like:
"Give me the average price of the books that have more than 1 author"
(similar to the one above)
In this one we are not aggregating on the annotated fields. But, since
there is grouping, we cannot aggregate over the whole set. So we use a
subquery.
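The first case maps directly onto a nested SELECT; a SQLite sketch of the shape being described (the join table and data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book_authors (book_id INTEGER, author_id INTEGER)")
conn.executemany("INSERT INTO book_authors VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3),   # book 1 has three authors
                  (2, 1)])                  # book 2 has one

# AVG(COUNT(...)) cannot be nested directly, so the inner query
# annotates each book with its author count and the outer query
# aggregates over that derived set.
avg = conn.execute("""
    SELECT AVG(num_authors) FROM
        (SELECT COUNT(author_id) AS num_authors
         FROM book_authors
         GROUP BY book_id)
""").fetchone()[0]
print(avg)  # 2.0
```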

The way to express this in our syntax would be (intuitively, and
regardless of the underlying SQL):
annotate(...).aggregate(...)

This could result in a subquery or could be restricted in favour of
some more explicit syntax as you suggest.

A possibility to make this explicit (though to me it seems ugly) would be:
Book.objects.aggregate(Avg(Book.objects.annotate(num_authors=Count('authors')))
)... maybe it would also need a way of specifying which field to
aggregate on.

I still don't see how to solve the last problem without subqueries.

Cheers,

Nicolas, who thinks this mail got way out of hand (and off topic)
after re-reading =)

Karen Tracey

May 26, 2008, 2:57:13 PM
to django-d...@googlegroups.com
On Mon, May 26, 2008 at 12:25 PM, Nicolas Lara <nicol...@gmail.com> wrote:

Hi Karen,

Thanks for taking the time to test the code and for providing a, very
useful, example/test_case =)

You're welcome, glad if my playing around is useful.

 [snipped some]

> Trying to translate that to annotate/aggregate I first think I need to start
> with the Clues model (even though I think of this more as an Entries query)
> because I can't see how to count distinct Clues if I start with Entries,
> since I don't know how to go back from Entries to Clues...did I get that
> right or am I confused?
>

For me it seems more intuitive to do it on Entries, but it is possible
to do it both ways.

> Anyway, starting with Clues I've gotten this far:
>
> Clues.objects.values('EntryID__Entry').annotate(Appearances=Count('Entry
> ID'), Distinct_Clues=Count('Clue', distinct=True))
>

The query you came up with can also be expressed as:

>>> Entries.objects.values('Entry').annotate(Appearances=Count('Entry ID'), Distinct_Clues=Count('clues__Clue', distinct=True))

which is equivalent.

Oh, right.  I just forgot how to traverse the relation backwards.  I also find it more intuitive to start with Entries, now that I'm reminded of how to get back to Clues from it.

> (Note I think there's a little bug there -- the Clues field is EntryID,
> while the database column name is 'Entry ID' -- I had to put the space in on
> the Count() call but I think I shouldn't have to do that?  But that's really
> beside the point of my main questions.)

This is true. I've added it as a bug in the project's trac and it
should soon be fixed. Thanks.

Great.
 

This actually does work now if I spell out Count(Distinct(`Clues`.`Clue`))/count(`Entries`.`Entry ID`) in the sql.  Not sure what it was counting without the full spec, but the numbers weren't coming out right.  (I presume when you get this implemented for real I'll be able to use the Django model/field names vs. actual table/column names.)  Glad to hear stuff like this is planned.

> 2 - order by Appearances.  How do I order by an aggregate result?  I can't
> find any mention of this in the doc or in a brief scan of the previous
> discussions.  This one I really do want the database to do vs. my code due
> to the number of distinct entries in the DB (>100,000).  Have I just missed
> how to do this somewhere?
>

Also it is not implemented but will be soon. The idea is to simply
issue a order_by('Appearences') call and get the ordering.

Great.  I did try that but found it didn't work and wasn't sure based on the other stuff I had read if it was ever intended to.  

I'll try to follow along as you make improvements and try more stuff.  For now I have one more question.  The next piece of this particular stats query is to limit the results based on entry length.  So my current raw sql sometimes (based on what the user has chosen) includes a WHERE LENGTH(Entry)=3 (or some other number).  Currently I think to do that with the ORM I need to use an extra(where=['length(Entry)=3']).  Have I overlooked a better way to do this currently and if not is this something the aggregate code might provide better support for in the future?

Thanks!

Karen

Russell Keith-Magee

May 27, 2008, 8:32:41 AM
to django-d...@googlegroups.com
On Tue, May 27, 2008 at 1:54 AM, Nicolas Lara <nicol...@gmail.com> wrote:
>
> But I dont think the idea is to always use the implicit. In this case
> the problem adapts so that any non-nullable field would work, but this
> query is currently counting "authors that have names". I know it is
> not exactely the problem, but ilustrates the possibility of using this
> kind of counts. "authors that have names" might sound silly in really
> life, buit maybe if "Price" wrote a book he wouldn't be counted by our
> query. (just kidding, there are many real life uses for this =) )

Providing a way to count different fields is a good idea, and the
approach you've described looks good to me. I just wanted to make sure
that the simplest case (count all instances) is trivial (i.e., doesn't
require explicit naming of the PK field).

>> * Is a distinct option required as part of the Count operator? I would
>> have thought the existing distinct() queryset modifier would be
>> sufficient.
>
> distinct is optional. The problem with using the distinct() queryset
> modifier is that it would affect the entire query.

...


>>>> Kids.objects.annotate(Count('friends__toys'), Count('friens__toys', distinct=True))

> while distinct would act on the select, or, even if we adapt it, it
> would act inside all the Count objects.

Ah - I hadn't considered COUNT(DISTINCT *). This seems reasonable to
me. It also acts as a validation for choosing the
function notation for aggregates - this wouldn't have been possible if
we were using the __count= syntax.

>>> An option for this would probably be the previously suggested F-
>>> syntax. So we could do:
>>>>>> Author.objects.all().annotate(num_coleagues=Count('book__authors__id'), num_books=Count('book__id', distinct=True)).filter(num_coleagues__gt=F(num_books))
>>
>> I don't think this is correct. By my reading of this example, an
>> author that writes 5 books by themself will have 5 colleagues.
>>
>
> yes, this is why I am filtering out those that have less or the same
> number of coleagues than books. Maybe "coleagues" was a bad choice of
> word.

Ah - now I see what you are doing. Ignore me. I don't know nothing
about nothing :-)

>>> I believe it would be quite clean to introduce subqueries in common
>>> cases like this. Especially when dealing with aggregate since
>>> aggregate is a terminal clause.
>>
>> If we need subqueries, I'd rather be explicit about it: e.g.,
>>
>>>>> Book.objects.filter(author__in=Author.objects.filter(...))
>>
> (strong statements to come... please correct me if im wrong)
>
> There are two possibilities in which I think subqueries might be
> useful (regarding aggregation).

Don't get me wrong - I can see that subqueries can be useful. I'm just
hesitant to try and induce subqueries from a flat query syntax. If
subqueries are required, my gut reaction is that we will have less
headaches if we make the definition language able to express
subqueries.

That said - if there are very clear and obvious rules for when a
subquery will get formed, this might be ok. More on this below.

> the way to express this in our syntax would be (intuitively and
> regardless of the underlaying sql):
> annotate(...).aggergate(...)
>
> This could result in a subquery or could be restricted in favour of
> some more explicit syntax as you suggest.
>
> I possibility to make this explicit (though to me it seems ugly) would be:
> Book.objects.aggregate(Avg(Book.objects.annotate(num_authors=Count('authors')))
> )... maybe it would need also a way of specifying on which field to
> aggregate.

I agree that this approach isn't pretty.

I think I have a slightly better idea of what you are proposing. If
I'm understanding you correctly, you're working on the idea that:

SELECT AVG(pages) FROM Book

is essentially the same as

SELECT AVG(pages) from (SELECT pages FROM Book)

The second form is inefficient - in practice, you would remove the
subquery as an optimization - but it is a generic construction that
lends itself to more complex inner queries, including those that have
annotate() clauses:

SELECT AVG(num_authors) from (SELECT id, name, COUNT(author.id) AS
num_authors FROM Book INNER JOIN Author ...

Given that aggregate() is a terminal clause, users aren't ever going
to be affected by the existence of the subquery, so I suppose that
this isn't so bad.

I still have a gut feeling that you should be able to compose this
query without a subselect. However, I would need to sit down and look
at the problem for a while to confirm that.
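For this particular AVG-over-COUNT shape, the gut feeling seems right: since every row belongs to exactly one group, the average group size is just total rows divided by the number of distinct groups, so the subselect can be flattened. A SQLite check under that assumption (schema and data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book_authors (book_id INTEGER, author_id INTEGER)")
conn.executemany("INSERT INTO book_authors VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3), (2, 1)])

# With a subselect: average the per-book author counts.
with_sub = conn.execute("""
    SELECT AVG(n) FROM
        (SELECT COUNT(*) AS n FROM book_authors GROUP BY book_id)
""").fetchone()[0]

# Without a subselect: average group size is total rows over
# distinct groups.
flat = conn.execute("""
    SELECT 1.0 * COUNT(*) / COUNT(DISTINCT book_id) FROM book_authors
""").fetchone()[0]
print(with_sub, flat)  # 2.0 2.0
```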

After all that - I wouldn't get too hung up on these last cases. I
posed them as intentionally difficult queries. While it would be nice
to be able to handle them, getting the simple cases working (especially
joins) is a much bigger priority. There will always be queries that
can't be expressed using the ORM - the goal is to make the 80% of
common queries trivial. The remaining 20% can always be serviced using
raw SQL.

Yours,
Russ Magee %-)

Giovanni Dr. Chiozza

May 27, 2008, 9:06:47 AM
to django-d...@googlegroups.com
DO NOT SEND ME ANY MESSAGGE PLEASE !!



Nicolas Lara

May 28, 2008, 11:59:10 AM
to django-d...@googlegroups.com
>
> Providing a way to count different fields is a good idea, and the
> approach you've described looks good to me. I just wanted to make sure
> that the simplest case (count all instances) is trivial (i.e., doesn't
> require explicit naming of the PK field).
>

This is fixed in the latest revision. Now it is possible to do:
>>> explicit = list(Author.objects.annotate(Count('book__id')))
>>> implicit = list(Author.objects.annotate(Count('book')))
>>> explicit == implicit
True


>>> * Is a distinct option required as part of the Count operator? I would
>>> have thought the existing distinct() queryset modifier would be
>>> sufficient.
>>
>> distinct is optional. The problem with using the distinct() queryset
>> modifier is that it would affect the entire query.
> ...
>>>>> Kids.objects.annotate(Count('friends__toys'), Count('friens__toys', distinct=True))
>
>> while distinct would act on the select, or, even if we adapt it, it
>> would act inside all the Count objects.
>
> Ah - I hadn't considered COUNT(DISTINCT *). This seems like a
> reasonable to me. It also acts as a validation for choosing the
> function notation for aggregates - this wouldn't have been possible if
> we were using the __count= syntax.
>

True. And for custom aggregate functions the option of having multiple
parameters seems very useful.

>>>> I believe it would be quite clean to introduce subqueries in common
>>>> cases like this. Especially when dealing with aggregate since
>>>> aggregate is a terminal clause.
>>>
>>> If we need subqueries, I'd rather be explicit about it: e.g.,
>>>
>>>>>> Book.objects.filter(author__in=Author.objects.filter(...))
>>>
>> (strong statements to come... please correct me if im wrong)
>>
>> There are two possibilities in which I think subqueries might be
>> useful (regarding aggregation).
>
> Don't get me wrong - I can see that subqueries can be useful. I'm just
> hesitant to try and induce subqueries from a flat query syntax. If
> subqueries are required, my gut reaction is that we will have less
> headaches if we make the definition language able to express
> subqueries.
>

I also think that being explicit is more django-ish/pythonic but, as
you said...

> That said - if there are very clear and obvious rules for when a
> subquery will get formed, this might be ok. More on this below.
>

...there might be some exceptions.

I think this is also far ahead, so I'll keep a note on it not to forget
it, but won't deal with it until the first version is ready, tested, and
documented.

>> the way to express this in our syntax would be (intuitively and
>> regardless of the underlaying sql):
>> annotate(...).aggergate(...)
>>
>> This could result in a subquery or could be restricted in favour of
>> some more explicit syntax as you suggest.
>>
>> I possibility to make this explicit (though to me it seems ugly) would be:
>> Book.objects.aggregate(Avg(Book.objects.annotate(num_authors=Count('authors')))
>> )... maybe it would need also a way of specifying on which field to
>> aggregate.
>
> I agree that this approach isn't pretty.
>
> I think I have a slightly better idea of what you are proposing. If
> I'm understanding you correctly, you're working on the idea that:
>
> SELECT AVG(pages) FROM Book
>
> is essentially the same as
>
> SELECT AVG(pages) from (SELECT pages FROM Book)
>
> The second form is an inefficient - in practice, you would remove the
> subquery as an optimization - but it is a generic construction that
> lends itself to more complex inner queries, including those that have
> annotate() clauses:
>
> SELECT AVG(num_authors) from (SELECT id, name, COUNT(author.id) AS
> num_authors FROM Book INNER JOIN Author ...
>
> Given that aggregate() is a terminal clause, users aren't ever going
> to be affected by the existence of the subquery, so I suppose that
> this isn't so bad.
>

I agree. I'll probably bring the subject up again in a few weeks to
discuss a good syntax for subqueries and the possibility of doing it
implicitly in annotate().aggregate() (which should be carefully
documented!)

> I still have a gut feeling that you should be able to compose this
> query without a subselect. However, I would need to sit down and look
> at the problem for a while to confirm that.
>
> After all that - I wouldn't get too hung up on these last cases. I
> posed them as intentionally difficult queries. While it would be nice
> to be able to handle the, getting the simple cases working (especially
> joins) is a much bigger priority. There will always be queries that
> can't be expressed using the ORM - the goal is to make the 80% of
> common queries trivial. The remaining 20% can always be serviced using
> raw SQL.
>

Good.
There is still the issue of the difference between


>>> Book.objects.all().filter(authors__name__contains='Norvig').annotate(num_authors=Count('authors__id')).filter(num_authors__gt=1)

and


>>> Book.objects.all().annotate(num_authors=Count('authors__id')).filter(authors__name__contains='Norvig', num_authors__gt=1)

But I will not speculate and, instead, start implementing filter() for
aggregates to further understand the problem and find possible
solutions.
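The difference between the two orderings can be sketched in plain SQL: a filter applied before the annotation restricts the rows the count sees, while a filter applied afterwards only restricts which groups survive. A SQLite illustration (the models reduced to invented tables; Django is not involved):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE book_authors (book_id INTEGER, author_id INTEGER);
    INSERT INTO author VALUES (1, 'Norvig'), (2, 'Russell');
    INSERT INTO book VALUES (1, 'AIMA');
    INSERT INTO book_authors VALUES (1, 1), (1, 2);
""")

# Filter first: the WHERE prunes joined rows, so the count only sees
# authors matching 'Norvig' -> the book gets a count of 1.
before = conn.execute("""
    SELECT b.title, COUNT(a.id) FROM book b
    JOIN book_authors ba ON ba.book_id = b.id
    JOIN author a ON a.id = ba.author_id
    WHERE a.name LIKE '%Norvig%'
    GROUP BY b.id
""").fetchall()

# Filter afterwards: count all authors, then keep only books that have
# at least one matching author -> the book keeps its full count of 2.
after = conn.execute("""
    SELECT b.title, COUNT(a.id) FROM book b
    JOIN book_authors ba ON ba.book_id = b.id
    JOIN author a ON a.id = ba.author_id
    GROUP BY b.id
    HAVING SUM(a.name LIKE '%Norvig%') > 0
""").fetchall()
print(before, after)  # [('AIMA', 1)] [('AIMA', 2)]
```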


Best Regards, =)

--
Nicolas
