I believe that the Django ORM is generating seriously inefficient SQL due
to it not using `WHERE EXISTS` but instead using a `DISTINCT` with a `LEFT
JOIN`. by comparison, SQLAlchemy will use `WHERE EXISTS`.
I have two models, `Exam` and `Series`. `Series` objects have a foreign
key to an `Exam` object. Both of the models contain a field
`description_user`. I am trying to search for all `Exam`s that have a
search term in `description_user` or have a child `Series` with that term
in its `description_user`. I want to do this for a number of search terms
(requiring all of them). I also want to de-duplicate the results (ie not
get the same Exam multiple times).
This is roughly what the filter looks like:
{{{
a = (Q(**{'series__description_user__icontains': 'bar'}) |
Q(**{'description_user__icontains': 'bar'}))
b = (Q(**{'series__description_user__icontains': 'foo'}) |
Q(**{'description_user__icontains': 'foo'}))
c = (Q(**{'series__description_user__icontains': 'baz'}) |
Q(**{'description_user__icontains': 'baz'}))
Exam.objects.filter(a & b & c).distinct()
}}}
with corresponding SQL:
{{{
SELECT DISTINCT
"exam_storage_exam"."id",
"exam_storage_exam"."description_user"
FROM "exam_storage_exam"
LEFT OUTER JOIN "exam_storage_series"
ON (
"exam_storage_exam"."id" =
"exam_storage_series"."exam_id"
AND (
"exam_storage_series"."removed" IS NULL) )
WHERE (
"exam_storage_exam"."removed" IS NULL
AND (
"exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
OR
"exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\')
AND (
"exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
OR
"exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\')
AND (
"exam_storage_series"."description_user" LIKE %s ESCAPE \'\\\'
OR
"exam_storage_exam"."description_user" LIKE %s ESCAPE \'\\\'))
}}}
The issue is that as the number of search terms grows, the size of the
intermediate data set before the DISTINCT operation grows as well.
Ideally the SQL would look like:
{{{
SELECT *
FROM exam
WHERE (EXISTS (SELECT 1
FROM exam_storage_series
WHERE exam.id = series.exam_id AND (
series.description_user LIKE '%foo%'
)) or exam.description_user LIKE '%foo%') AND
(EXISTS (SELECT 1
FROM exam_storage_series
WHERE exam.id = series.exam_id AND (
series.description_user LIKE '%bar%'
)) or exam.description_user LIKE '%bar%') AND
(EXISTS (SELECT 1
FROM exam_storage_series
WHERE exam.id = series.exam_id AND (
series.description_user LIKE '%baz%'
)) or exam.description_user LIKE '%baz%')
}}}
Currently the performance of Django query is terrible. This style
searching comes up for example in how [https://github.com/tomchristie
/django-rest-
framework/blob/43c45cc9391ec2bed9481a8b309990dec35b6ac8/rest_framework/filters.py#L132-L180
DRF generates search queries].
--
Ticket URL: <https://code.djangoproject.com/ticket/25789>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* cc: akaariai (added)
* needs_better_patch: => 0
* type: Uncategorized => Cleanup/optimization
* needs_tests: => 0
* needs_docs: => 0
Comment:
Anssi, is this feasible? The closest existing ticket I could find is
#16603 but not sure if it should be considered a duplicate.
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:1>
Comment (by akaariai):
Yes, this is possible, and something Django's ORM should do.
Unfortunately this is hard to implement correctly. The big problem is
aggregation, when doing .filter().annotate(), where both operations target
the same relation, the aggregation must use results of the join generated
by the filter. But if the filter doesn't generate a join, then we have a
problem.
It would be Djangoic if the exists query would be generated automatically.
For the above mentioned reason this is hard. Maybe it would be easier if
we had something like .filter(models.Exists(somerel__col=val)), where the
Exists() class would inform Django that an exists subquery should be
generated.
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:2>
* stage: Unreviewed => Accepted
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:3>
* cc: mjtamlyn (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:4>
* keywords: => QuerySet.extra
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:5>
Comment (by schinckel):
It's not totally relevant (at least not yet), but I have a version of this
that works with .annotate():
https://github.com/django/django/pull/6478
There's also a django-developers thread that hopefully may get some
discussion.
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:6>
Comment (by James Howe):
Similarly, I have a use-case where this query structure has the only
acceptable performance (in PostgreSQL):
{{{#!sql
... FROM model_1
WHERE NOT EXISTS (
SELECT TRUE FROM model_2
WHERE model_1.pkey = model_2.fkey
AND (model_2.property IS NULL OR model_2.property >= 42)
)
}}}
Currently using `extra()` to do it.
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:7>
* cc: James Howe (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:8>
Comment (by Simon Charette):
James, since this ticket was filled support for `Exists` expression was
added so you should be able to avoid using `extra`
{{{#!python
queryset.annotate(
foo_exists=Exists(subqueryset.filter(foo_id=OuterRef('pk'))
).filter(
foo_exists=False,
)
}}}
Once #25367 lands it'll be possible to pass `~Exists` directly to filter.
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:9>
Comment (by Simon Charette):
Another ''nice to have'' that could be added here would be allow the usage
of transforms on related fields so we could have lookups of the form
`__exists` or `__count`
{{{#!python
Author.objects.filter(books__exists=Q(title__icontains="ring"))
Author.objects.filter(books__count__gt=4)
}}}
instead of
{{{#!python
Author.objects.filter(Exist(Book.objects.filter(author=OuterRef("pk"),
title__icontains="ring")))
Author.objects.filter(
GreaterThan(
Book.objects.filter(
author_id=OutRef("pk"),
).values(
"author_id"
).values(Count("id")), 4
)
)
}}}
I think it would go a long way in making it easier for users to work
around this issue and #10060.
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:10>
* cc: David Wobrock (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:11>
Comment (by Simon Charette):
FWIW I had a very limited try at this in
[https://github.com/django/django/compare/main...charettes:django:exists-
m2m-lookup this branch]. Mostly some test and debugging shims but I
figured I'd provide them if someone is interesting in looking this issue.
I'd add that this work has a bit of overlap with #28296 as it basically
needs to turn a series of transforms and lookups against a related field
into a subquery without causing JOINs on the outer query which the default
behaviour of using the `__` syntax.
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:12>
* cc: John Speno (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/25789#comment:13>