Hey all,
I'd like to suggest adding Common Table Expression (CTE) query generation as a feature to
Django.
I've been working on a project that required manipulation of many records at once, and as with many ORMs found that this wasn't an ideal use-case in Django. As the rest of our code base and related projects are in Django, there was a strong preference to find a way to do it and keep to the same model-is-the-truth design.
I first did this by writing some hackish functions using raw querysets and generating my own CTE based queries, but it lacked ideal flexibility and maintainability. So I've now written some modifications into my Django to do this in a more Django-esque way and think that this functionality would be beneficial within the project itself, but am unsure exactly where to start the conversation about that.
Why generate CTE based queries from querysets?
By allowing querysets to be attached to each other, and setting
appropriate WHERE clauses, arbitrary and nested SQL queries can be
generated. Where the results of the queries are only necessary for
the execution of following queries this saves a very substantial
amount of time and database work. Once these features exist, other
functionality can also transparently use these to generate more
efficient queries (such as large IN clauses).
This allows several powerful use cases I think Django would benefit from:
Large 'IN' clauses, can be implemented as CTEs reducing expensive lookups to a single CTE INNER JOIN. For sets of thousands to match from tables of millions of records this can be a very substantial gain.
Composite 'IN' conditions, where multiple fields must match and you're matching against a large set of condition rows. In my usage this was "where the md5/sha hashes match one of the million md5/sha tuples in my match set". This is simply a CTE JOIN with two clauses in the WHERE.
Nested data creation, where the parent doesn't yet exist. Django doesn't currently do this as the primary keys are needed, and this makes normalised data structures unappealing. Using INSERTs as CTEs that supply those keys to following statements means that entire nested data structures of new information can be recreated in the database at once, efficiently and atomically.
Non-uniform UPDATEs, such that a modified set of objects can all be updated with different data at the same time by utilising a CTE values statement JOINed to the UPDATE statement. As there's currently no way to do this kind of bulk update the alternative is to update each instance individually, and this doesn't scale well.
These could also be used with aggregations and other calculated fields to create complex queries that aren't possible at the moment.
What my PoC looks like
With another mildly hackish PoC that creates a VALUEs set from a dict/namedtuple which can be used to provide large input data, my present modified version syntax looks a bit like this (not perfect queries):
Returning the newly created records.class Hashes(models.Model):
md5 = models.UUIDField(verbose_name="MD5 hash (base16)", db_index=True)
sha2 = models.CharField(max_length=44, null=True, verbose_name="SHA256 hash (base64)")
# Mock QuerySet of values q_mo = Hashes.as_literal(input_hashes).values("md5", "sha2")
# A big IN query
q_in = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5"))
# Matched existing values with composite 'IN' (where md5 and sha2 match, or md5 matches and existing record lacks sha2) q_ex = Hashes.objects.attach(q_mo).filter(md5=q_mo.ref("md5")).filter(Q(sha160=q_mo.ref("sha160")) | Q(sha160=None)) # Create new records that don't exist q_cr = Hashes.objects.attach(q_mo, q_ex).filter(md5=q_mo.ref("md5")).exclude(md5=q_ex.ref("md5")).values("md5", "sha2").as_insert()
That is:WITH cte_1_0 (md5, sha2) AS ( VALUES ('00002d30243bfe9d06673765c432c2bd'::uuid, 'fsA8okuCuq9KybxqcAzNdjlIyAx1QJjTPdf1ZFK/hDI='::varchar(44)),
('0000f20a46e4e60338697948a0917423', '6bVZgpYZtit1E32BlANWXoKnFFFDNierDSIi0SraND4=')), cte_1 AS ( SELECT "hashes"."id", "hashes"."md5", "hashes"."sha2" FROM "hashes" , "cte_1_0" WHERE ("hashes"."md5" = (cte_1_0.md5) AND ("hashes"."sha2" = (cte_1_0.sha2) OR "hashes"."sha2" IS NULL) )) SELECT "hashes"."md5" FROM "hashes" , "cte_1_0" , "cte_1" WHERE ("hashes"."md5" = (cte_1_0.md5) AND NOT ("hashes"."md5" = (cte_1.md5)))
As these can be attached to querysets that are attached to querysets, that are... etc, many statements can be chained allowing substantial changes to be performed without needing Django to have to receive, process, and resend at every step.
I've had a read through the enhancement proposal docs etc, and I'm willing to do what's needed to make this functionality solid, and put forth a proposal to add it. But am first seeking feedback on it, and whether this is a feature that will be considered.
supported is added in MySQL 8 [0]
--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-developers+unsubscribe@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/CANvPqgDWUPuUA6rw8uawa7_AsUkXFhuegFM1cJ-gjVUCxvBUwg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-developers+unsubscribe@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/9af2d3dc-81a0-4d76-a3a0-cdef850158ee%40googlegroups.com.
You received this message because you are subscribed to a topic in the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/django-developers/b370mxfKCHg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to django-developers+unsubscribe@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/CAMwjO1EqNmJXZk2E6HMFw2LErto-%3DC7f8OHvXnzEX-PLzofu1w%40mail.gmail.com.
Regarding Anssi's comments about SubQuery, we do now have that in core as of 1.11 [0]. It does look like an .attach() approach might actually have been a nicer version of this, but on the other hand it's currently implementable solely with the Expressions API. It seems like the OuterRef is very similar to your queryset.ref(). An even nicer approach using attach could be to say qs.attach(q1=some_qs).filter(a=F('q1__b'))?
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
q1 = Book.objects.values('author_id').annotate(avg_price=Avg('price'))
q2 = Author.objects.attach('book_prices', q1, id=F('book_prices__author_id'))
def attach(name, queryset, **params):
# Would look something like this.
...
I will always suggest that we use the Subquery API to make CTEs. To make them recursive, just add a keyword argument (recursive=True) and/or use a union.
It’s been a while since I looked at CTEs, so I might be missing something.
I would hate to see us create an entirely separate API for CTEs.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/c559fd08-447f-4953-9940-b2090f2148e5n%40googlegroups.com.