Annotation failure (Combining multiple aggregations)

312 views
Skip to first unread message

Paulo Maciel

unread,
Dec 4, 2015, 10:47:43 AM12/4/15
to Django developers (Contributions to Django itself)
Combining multiple aggregations with annotate() will yield the wrong results, as multiple tables are cross joined. Due to the use of LEFT OUTER JOIN, duplicate records will be generated if some of the joined tables contain more records than the others

The Count aggregate has a distinct parameter that may help:
q = Book.objects.annotate(Count('authors', distinct=True), Count('chapters', distinct=True))
Why not "distinct=True" to use in Sum?
q = Book.objects.annotate(Sum('val_a', distinct=True), Sum('val_b', distinct=True))

Michael

unread,
Dec 7, 2015, 8:16:20 PM12/7/15
to Django developers (Contributions to Django itself)
I have the same problem.

I tried to use `extra` but then I can't filter so it's not the best.

Anybody has more info or a workaround?

Thanks

Anssi Kääriäinen

unread,
Dec 8, 2015, 2:19:52 AM12/8/15
to Django developers (Contributions to Django itself)
On Friday, December 4, 2015 at 5:47:43 PM UTC+2, Paulo Maciel wrote:
Combining multiple aggregations with annotate() will yield the wrong results, as multiple tables are cross joined. Due to the use of LEFT OUTER JOIN, duplicate records will be generated if some of the joined tables contain more records than the others

This has nothing to do with cross joins, usage of LEFT JOINS or having more records in one table than the others. The reason is that we should use subqueries where we currently use JOINs.

As a complex example, (one) correct SQL query for Book.objects.annotate(chapter_cnt=Count('chapter'), author_cnt=Count('author')) is:

  SELECT subq1.chapter_cnt, subq2.author_cnt, book_id
   FROM book
   LEFT JOIN (SELECT count(*) as chapter_cnt FROM chapter GROUP BY book_id) as subq1 ON subq1.book_id = book.book_id
   LEFT JOIN (SELECT count(*) as author_cnt FROM author GROUP BY book_id) as subq2 ON subq2.book_id = book.book_id
   GROUP BY book.book_id

This is still an easy case for the ORM. To add a bit more complexity, consider Book.objects.filter(Q(chapter__pages__lt=100)|Q(author__age__gt=50)).annotate(chapter_cnt=Count('chapter'), author_cnt=Count('author')). I *think* we need a query like:

  SELECT subq1.chapter_cnt, subq2.author_cnt, book_id
   FROM book
   LEFT JOIN (SELECT count(*) as chapter_cnt FROM chapter WHERE chapter.pages < 100 GROUP BY book_id) as subq1 ON subq1.book_id = book.book_id
   LEFT JOIN (SELECT count(*) as author_cnt FROM author WHERE author.age > 50 GROUP BY book_id) as subq2 ON subq2.book_id = book.book_id
  WHERE subq1.book_id IS NOT NULL OR subq2.book_id IS NOT NULL
   GROUP BY book.book_id

There are even more complex cases. For example you might need a subquery inside the generated subqueries in some cases.

I have some ideas of how to fix this (though in cases like this it is hard to know if the ideas actually work before trying). I believe fixing this takes at least a couple of months of development time. At least my development budget seems to be around one day a month currently.

An alternate approach is to error out when the ORM detects a complex enough query, and add explicit ways for users to add subqueries, subselects and stuff like that to ORM queries. This might make more sense than trying to generate complex queries automatically, as there are multiple ways to write the above query, and the different forms perform vastly differently depending on the data in the tables and the used database backend. Still, we can have it both ways - automatically generate a correct query, but also offer a way to explicitly add subqueries to ORM queries.

 - Anssi

Anssi Kääriäinen

unread,
Dec 8, 2015, 2:26:52 AM12/8/15
to Django developers (Contributions to Django itself)


On Tuesday, December 8, 2015 at 3:16:20 AM UTC+2, Michael wrote:
I have the same problem.

I tried to use `extra` but then I can't filter so it's not the best.

Anybody has more info or a workaround?

Come to think of it, we already have a workaround:
   Book.objects.annotate(author_cnt=RawSQL("(select count(*) from author where author.book_id = book.book_id)").
The fact that you can do subselects with models.RawSQL is something that a lot of users seem to miss. There are a couple of reasons why one shouldn't use RawSQL too much, like the way it hard codes column names and table aliases.

 - Anssi

Josh Smeaton

unread,
Dec 8, 2015, 7:52:07 AM12/8/15
to Django developers (Contributions to Django itself)
Yeah I'm wary that a lot of the problems with extra() will be replicated by users switching to RawSQL() instead. It won't handle realiasing or group by correctly.

Michael

unread,
Dec 8, 2015, 10:28:08 AM12/8/15
to Django developers (Contributions to Django itself)
On Tuesday, December 8, 2015 at 1:26:52 AM UTC-6, Anssi Kääriäinen wrote:

Come to think of it, we already have a workaround:
   Book.objects.annotate(author_cnt=RawSQL("(select count(*) from author where author.book_id = book.book_id)").

Yes, you can do a subquery to get the Sum or Count but then you cannot "filter": `qs.filter(author_cnt__gt=0)` won't work.

I think it would require the correct "GROUP BY" and "HAVING" clause to be able to filter, correct?


Anssi Kääriäinen

unread,
Dec 9, 2015, 2:00:12 AM12/9/15
to Django developers (Contributions to Django itself)
On Tuesday, December 8, 2015 at 2:52:07 PM UTC+2, Josh Smeaton wrote:
Yeah I'm wary that a lot of the problems with extra() will be replicated by users switching to RawSQL() instead. It won't handle realiasing or group by correctly.

The group by case we can fix by creating a separate RawAggregate class, or by adding aggregate flag to RawSQL. The hardcoding of column and table names, and the realisasing problem can be solved with refsql style annotations:
    qs = Author.objects.annotate( casewhen=RefSQL("case when {{name}} = %s then height else weight end", ('Anssi',),
  output_field=models.IntegerField()) ).order_by('name')
(from https://github.com/akaariai/django-refsql/blob/master/testproject/testproject/tests.py).

The idea is quite simply that instead of using hardcoded table.col references in the code, you use normal Django lookups, and refsql converts those to Col expressions.

But, whenever possible we should provide real expressions.

  - Anssi

Anssi Kääriäinen

unread,
Dec 9, 2015, 2:02:13 AM12/9/15
to Django developers (Contributions to Django itself)

This one actually does work with filter(). As the count is inside a subselect, it can be used directly in the WHERE clause without GROUP BY or HAVING. Using a direct count without a subselect wouldn't work.

 - Anssi

Michael

unread,
Dec 9, 2015, 10:02:42 AM12/9/15
to Django developers (Contributions to Django itself)
On Wednesday, December 9, 2015 at 1:02:13 AM UTC-6, Anssi Kääriäinen wrote:
This one actually does work with filter(). As the count is inside a subselect, it can be used directly in the WHERE clause without GROUP BY or HAVING. Using a direct count without a subselect wouldn't work.

Ha, actually, it works since django 1.8.6: "Allowed filtering over a RawSQL annotation (#25506)."
(https://docs.djangoproject.com/en/1.8/releases/1.8.6/#bugfixes)

I was using a previous version django, that is why it was not working for me.

Thanks.

Anssi Kääriäinen

unread,
Dec 16, 2015, 2:57:32 AM12/16/15
to Django developers (Contributions to Django itself)
On Tuesday, December 8, 2015 at 9:19:52 AM UTC+2, Anssi Kääriäinen wrote:
On Friday, December 4, 2015 at 5:47:43 PM UTC+2, Paulo Maciel wrote:
Combining multiple aggregations with annotate() will yield the wrong results, as multiple tables are cross joined. Due to the use of LEFT OUTER JOIN, duplicate records will be generated if some of the joined tables contain more records than the others

This has nothing to do with cross joins, usage of LEFT JOINS or having more records in one table than the others. The reason is that we should use subqueries where we currently use JOINs.

As a complex example, (one) correct SQL query for Book.objects.annotate(chapter_cnt=Count('chapter'), author_cnt=Count('author')) is:

  SELECT subq1.chapter_cnt, subq2.author_cnt, book_id
   FROM book
   LEFT JOIN (SELECT count(*) as chapter_cnt FROM chapter GROUP BY book_id) as subq1 ON subq1.book_id = book.book_id
   LEFT JOIN (SELECT count(*) as author_cnt FROM author GROUP BY book_id) as subq2 ON subq2.book_id = book.book_id
   GROUP BY book.book_id

This is still an easy case for the ORM. To add a bit more complexity, consider Book.objects.filter(Q(chapter__pages__lt=100)|Q(author__age__gt=50)).annotate(chapter_cnt=Count('chapter'), author_cnt=Count('author')). I *think* we need a query like:

  SELECT subq1.chapter_cnt, subq2.author_cnt, book_id
   FROM book
   LEFT JOIN (SELECT count(*) as chapter_cnt FROM chapter WHERE chapter.pages < 100 GROUP BY book_id) as subq1 ON subq1.book_id = book.book_id
   LEFT JOIN (SELECT count(*) as author_cnt FROM author WHERE author.age > 50 GROUP BY book_id) as subq2 ON subq2.book_id = book.book_id
  WHERE subq1.book_id IS NOT NULL OR subq2.book_id IS NOT NULL
   GROUP BY book.book_id

There are even more complex cases. For example you might need a subquery inside the generated subqueries in some cases.

I have some ideas of how to fix this (though in cases like this it is hard to know if the ideas actually work before trying). I believe fixing this takes at least a couple of months of development time. At least my development budget seems to be around one day a month currently.

An alternate approach is to error out when the ORM detects a complex enough query, and add explicit ways for users to add subqueries, subselects and stuff like that to ORM queries. This might make more sense than trying to generate complex queries automatically, as there are multiple ways to write the above query, and the different forms perform vastly differently depending on the data in the tables and the used database backend. Still, we can have it both ways - automatically generate a correct query, but also offer a way to explicitly add subqueries to ORM queries.

I started writing some notes about the changes needed to handle automatically any .filter().annotate() combo. The notes are of very bad quality, but I'll publish them here even if that is the case. The point is that there is too much work, and the end result would be something where users don't understand why they get the queries they are getting. Still, the code would be complex enough to be extremely fragile.

I strongly believe the approach must be that we offer users explicit ways to modify the query (ability to join subqueries will get us a long way here), and that Django simply errors out when it sees query constructs it can't handle. We can later on maybe automate *some* of the error cases, but doing it for all cases seems extremely hard.

The notes assume we implement filter() for multivalued joins by doing subqueries instead of using joins. This is in practice needed for the approach to work at all.

When we do qs = Organization.objects.filter(subunits__size__gte=20)
  - add a new subquery condition to the query's where. Note that we added a subquery
    condition for "subunits".
  - query at this point is:
      select * from ... where pk in (select parent_unit_id from organization where size >= 20)

Next, when we do qs = qs.annotate(sum=Sum('subunits__size'))
  - We find the existing where condition (with a subquery) for "subunits"
  - Remove the subquery condition from the where condition, decrease the refcount for leading joins to that subquery by 1
  - Add the sum to the subquery
  - Add the subquery as join to the query (*now* we can do the filter as a join. The
    group by in the subquery enforces at maximum a single row per row in the outer query.).
    The join type should be INNER, except in cases where the subquery condition was ORed
    in the outer query.
  - Add subq.sum to the outer query's annotations.
  - Add "subq.select[0] IS NOT NULL" to the outer query's where (in place of the removed subquery
    contidion)
  - Mark that subunits has now a joined subquery
  - Note that we must use the same GROUP BY in the joined subquery than we do for the outer query.
  - So, for example Organization.values('type').filter(subunits__size__gte=20).annotate(avg_size=Avg('subunits__size'))
       -> Meaning: for each type in Organization, calculcate the average size of subunits, but only for subunits which are
          of size 20 or bigger.
       -> Query:
          SELECT organization.type, subq.avg_size
            FROM organization
           INNER JOIN (SELECT U0.type, AVG(U1.size) as avg_age
                         FROM organization U0
                        INNER JOIN organization U1
                        WHERE U1.size >= 20
                        GROUP BY U0.name) subq ON organization.type = subq.type;

Even if we were to do the above, we still wouldn't handle all legal Django queries correctly.

 - Anssi
Reply all
Reply to author
Forward
0 new messages