Subquery join support

1,832 views
Skip to first unread message

Alexandr Tatarinov

unread,
Apr 6, 2020, 9:34:55 AM4/6/20
to Django developers (Contributions to Django itself)
Hello folks,

Following the discussion https://groups.google.com/forum/#!topic/django-developers/b370mxfKCHg, I would like to suggest adding the ability to join QuerySet with a subquery clause to Django.
The benefits are pretty much described and discussed in the linked topic, the only one I would like to add (my use case) is the ability to select multiple columns from a Subquery without the need to repeat the subquery clause, which hurts performance.
As Anssi Kääriäinen mentioned, generic subquery supports seems to be a better idea than CTEs, and simple implementation as a starting point will allow moving gradually and add RETURNING support later on (for nested creates and updates).

I have created a working prototype and successfully using it in one of my projects in production. After reading Daniel Miller's https://github.com/dimagi/django-cte, I've rewritten the API inspired by his ideas, plus tried to keep it as Django-style as possible.
Here is the usage example.

taxes = JoinedSubquery(Tax.objects.all())
products
= Product.objects.all()

# .join is probably better be placed in QuerySet class, so it will be products.join(taxes, ...)
products_with_taxes
= taxes.join(
    products
,
    join_type
=LOUTER,  # Optional argument, LOUTER by default
    country
=taxes.F('product_country'),  # Also Q objects are supported
    type
=taxes.F('product_type')
).annotate(
    tax_amount
=taxes.F('amount')  # .join() and .annotate() calls needn't to be chained
)


I am not sure about named joins discussed in the CTE topic
qs.attach(q1=some_qs).filter(a=F('q1__b'))

It seems to be more implicit and harder to implement, requiring changes to existing functionality and thus introducing a big patch which I would like to avoid.

It is not possible to follow relations in join.F expressions, but it is easily achievable by firstly fetching needed columns by annotate(), so I don't see the need to implement it.

Source code is available here https://gist.github.com/tatarinov1997/126c49c4b1bb44ae6c57afbc5f43f58d , tested with Django 2.2. Feel free to copy and play around.

So, please let me know what do you think about it. I am willing to continue working on it, and any help is appreciated.

Thanks,
Alexandr.

Matthew Pava

unread,
Apr 6, 2020, 10:53:31 AM4/6/20
to django-d...@googlegroups.com

It is my opinion that the Subquery object that already exists in Django should act as a CTE.

There is a ticket for this feature already:

https://code.djangoproject.com/ticket/28919

 

I have noticed that several developers have proposed several different patches for CTEs, but I do not know why they have never made it in. Perhaps someone with more knowledge on that matter can help.

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/8c41e40f-876c-46c0-9e24-09e8c72592ce%40googlegroups.com.

Alexandr Tatarinov

unread,
Apr 6, 2020, 4:26:42 PM4/6/20
to Django developers (Contributions to Django itself)

schinckel

unread,
Apr 7, 2020, 1:39:47 AM4/7/20
to Django developers (Contributions to Django itself)
Great work on this, Alexandr. I've been thinking a lot about doing joins in Django to subqueries. As you point out, when you are doing several subquery annotations of the same subquery, just with different columns, that can really hurt performance. Currently, I've been recommending doing a JSON object as the subquery column.

I really like the idea of an explicit join to a subquery, as it can perform significantly better - in the case where most of the rows share the same subquery, the database can perform the subquery once per distinct row joined, rather than once for each row.

I think, instead of using a `JoinedSubquery`, we should just make it that a `Subquery`, containing unresolved OuterRef instances, can also be resolved by doing a `QuerySet.join(Subquery(...))`.

We could also have a lateral argument that allowed a LATERAL JOIN to a subquery, but perhaps that's later down the track.

(Actually, now I think about it, a QuerySet should probably just act as a Subquery when it's used as one, but perhaps that's too magic).

Matt.

Alexandr Tatarinov

unread,
Apr 7, 2020, 5:40:35 AM4/7/20
to Django developers (Contributions to Django itself)
Thanks, folks, that's a very valuable insight.
I really love the idea of resolving OuterRef via .join(), it may help overcome problems I had with reusing Subquery.
However, I believe Subquery exists apart from QuerySet for a reason - to separate responsibilities, so should we mix responsibilities now with Subquery / JoinedSubquery?

If we reuse Subquery for QuerySet.join(), we need a way to refer to the columns from that Subquery, and either current .F(...) method will be added to Subquery, or we need to use named joins ( QuerySet.join(taxes=Subquery(...)) ).
Named joins are problematic, because current implementation of Query can .resolve_ref() (setup_joins / names_to_path) only to related table OR filtered_relation. And handling yet another case (OR joined_subquery) will make the branching even more complex.
We could refactor somehow to reduce conditionals usage, but it is not something I believe I am capable of.

Yes, it is probably possible to modify existing code, but I am trying to avoid adding new paths and conditions here and there - because over time such changes will lead to hard to understand codebase no one wants to touch. I believe that new behavior is better added via extending, not modifications (open-closed principle). There is already some conditional handling in ORM (i.e. filtered_relation, which can be found in many places across whole ORM code, yet it is just one feature (don't get me wrong, it's a valuable contribution, I am just worried of a general direction), and this way of introducing changes will complicate ORM more and more.

Instead, the suggested solution relies on existing code, but does not modify it and is relatively small.

I will definitely try to create something with OuterRef, but I am afraid we will end up with: Subquery which actually contains code of two separate classes, and several conditional branches (i.e. if subquery.is_joined) in the QuerySet/Query code.
I will come back with new information as soon as I have time (probably by the weekend).

Have a nice day,
Alexandr.

Alexandr Tatarinov

unread,
Apr 12, 2020, 12:35:55 PM4/12/20
to Django developers (Contributions to Django itself)
Hello everyone.
So far, I've managed to implement joins via Subquery and OuterRef with some caveats.

To begin with, the API looks like this

Product.objects.create(name='Pizza', country='USA', type='FOOD')
Tax.objects.create(name='USA Food Tax', amount=10, product_country='USA', product_type='FOOD')
       
taxes
= Tax.objects.filter(
    product_type
=OuterRef('type'),
    product_country
=OuterRef('country')
)
products
= Product.objects.annotate(taxes=Subquery(taxes, join=True, join_type=LOUTER))
product
= products.annotate(tax=F('taxes__amount'), tax_name=F('taxes__name'))

The generated SQL is
SELECT "core_product"."id", "core_product"."name", "core_product"."country", "core_product"."type", "taxes"."__col3" AS "tax", "taxes"."__col4" AS "tax_name"
FROM
"core_product"
LEFT OUTER JOIN
(
    SELECT
       
"core_tax"."id", "core_tax"."product_country", "core_tax"."product_type", "core_tax"."name", "core_tax"."amount",
       
"core_tax"."product_country" AS "__col1", "core_tax"."product_type" AS "__col2", "core_tax"."amount" AS "__col3", "core_tax"."name" AS "__col4"
    FROM
"core_tax"
) "taxes" ON ("taxes"."__col1" = ("core_product"."country") AND "taxes"."__col2" = ("core_product"."type"))


We use .annotate for joins to avoid adding new methods, also .annotate has some useful checks to avoid name collisions.
To distinguish plain subquery from the joined subquery, a join flag is added. To specify join type, there is an optional join_type param to Subquery.

Then, one can refer to joined subquery columns by the given alias, but only via F() expression.
This means it is not possible to write .values('taxes__name') directly, but can be achieved via .annotate(tax_name=F('taxes__name')) .values('tax_name').
Other examples are .order_by(F('taxes__name')) and .values_list(F('taxes__name')), they work without .annotate() but require F() wrapper. 

The reasons why the string fields and F expressions are treated separately and differently in various QuerySet methods are still unclear for me.
The F expression works because the implementation relies on Query.resolve_ref to be called to short-circuit columns from the joined subqueries.
Maybe, we will be able to fix this later or leave as-is with the documentation in place to always use .annotate.

The implementation is ~170 lines long, but I think some corner cases are yet to be discovered.
For now, it would be great to hear your feedback on the API and the feature itself, so we can move on with the ticket and discuss the implementation.

Thanks, Alexandr.



On Monday, 6 April 2020 16:34:55 UTC+3, Alexandr Tatarinov wrote:

Alexandr Tatarinov

unread,
Apr 15, 2020, 3:52:03 AM4/15/20
to Django developers (Contributions to Django itself)
This has not received a lot of feedback so far, so I think some code can help. I am not sure what is the next steps; as I've understood, we need some kind of consensus to open a ticket.

Please, don't hesitate to express your opinion on the feature and the API, as well as the implementation to move this forward.

Cheers,
Alexandr.

Andre Terra

unread,
Apr 15, 2020, 10:15:58 PM4/15/20
to Django developers (Contributions to Django itself)
Thanks, Alexandr, for your effort in addressing what I personally consider a much needed feature in the ORM. I gave your PR a try and noticed that evaluating the same Product.objects.annotate line below twice in a row returns two different SQL queries, the second of which results in an error.

Product.objects.annotate(taxes=Subquery(taxes, join=True, join_type=LOUTER))

I'm using different sample models in my project since I didn't have your models handy to test, but here's a gist with the terminal session and traceback: https://gist.github.com/airstrike/e6ac4feb96136b71dda5e3fcfa32418e You should be able to recreate the error easily with your own models by just running the line above twice. Seems like the Subquery() instantiation is having some unintended side-effects...

Now, more broadly, it seems like we currently have various initial attempts at expanding the ability to handle subqueries in the ORM but with a few different purposes – I count at least 5:
Rather than letting each of these efforts operate (and for the most part, die) independently, I believe Django would benefit from a concerted approach for a few reasons:
  1. leveraging each other's expertise
  2. leveraging code between those different efforts and minimizing complexity / churn
  3. creating a consistent API
  4. actually getting buy-in from the core developers to have this merged at some point
I'm not close enough to Django or sufficiently well-versed in the ORM to know exactly what the next steps are, but it feels to me that we need a DEP that starts by reviewing all of the past efforts, existing needs and outlines a consistent API along with the equivalent SQL for clarity.

So I invite others to provide feedback on whether they believe a DEP is the right path forward – though I'll note that I don't have the time or, most importantly, the expertise to be its Author


Best,
Andy
Reply all
Reply to author
Forward
0 new messages