[Django] #33015: QuerySet.extra Use Case: use db indexes when filtering on large lists of tuples

14 views
Skip to first unread message

Django

unread,
Aug 11, 2021, 2:40:12 PM8/11/21
to django-...@googlegroups.com
#33015: QuerySet.extra Use Case: use db indexes when filtering on large lists of
tuples
-------------------------------------+-------------------------------------
Reporter: kris- | Owner: nobody
swann |
Type: | Status: new
Uncategorized |
Component: Database | Version: 2.2
layer (models, ORM) |
Severity: Normal | Keywords: QuerySet.extra
Triage Stage: | Has patch: 0
Unreviewed |
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
-------------------------------------+-------------------------------------
Hello,

While reading through the documentation for .extra(..) it looked like use
cases should be logged as a ticket here -- so here is that.

In an effort to speed up some slow queries that we've been encountering
I've been looking into taking advantage of db indexes.

Basically we have run into the same issue as this SO question:
https://stackoverflow.com/questions/23991090/django-filter-multiple-
columns-with-a-list-of-tuples

We are using Postgres as our backing db.

We have a model that looks like this

{{{
class ExampleModel(models.Model):
val1 = models.TextField()
val2 = models.TextField()
# etc...

class Meta:
indexes = [
models.Index(fields=['val1', 'val2'])
]
}}}


For demonstration purposes, we'll fill it with sample data

{{{
sample_data = [
ExampleModel(val1=str(v1), val2=str(v2))
for v1 in range(0,1000)
for v2 in range(0,1000)
]
ExampleModel.objects.bulk_create(sample_data)
}}}

Using a or-chained Q object is significantly slower

Note: In practice this can be a list of any arbitrary combinations, but
for ease of generation, we'll use an easy to generate list of tuples
{{{
items_to_fetch = [
(i, j)
for i in range(0,100)
for j in range(0,100)
]
}}}


Using an or-chained Q object:
{{{
import time
query = Q()
for v1, v2 in items_to_fetch:
query |= Q(val1=v1, val2=v2)
start = time.perf_counter()
ExampleModel.objects.filter(query)
end = time.perf_counter()
print(f"{end - start} seconds")

>>> OUTPUT:
>>> 43.73997112699999 seconds
}}}

VS.

Using .extra(...)
{{{
import time
start = time.perf_counter()
ExampleModel.objects.extra(
where=["(val1, val2) in %s"],
params=[tuple(items_to_fetch)]
)
end = time.perf_counter()
print(f"{end - start} seconds")

>>> OUTPUT:
>>> 0.0004218950000449695 seconds
}}}

If there are any alternatives worth trying out, that would be really
helpful info. I wasn't able to find anything in the docs (although I might
have just been looking in the wrong places too).

--
Ticket URL: <https://code.djangoproject.com/ticket/33015>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,
Aug 11, 2021, 2:48:44 PM8/11/21
to django-...@googlegroups.com
#33015: QuerySet.extra Use Case: filtering on large lists of tuples
-------------------------------------+-------------------------------------
Reporter: kris-swann | Owner: nobody
Type: Uncategorized | Status: new
Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: QuerySet.extra | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

--
Ticket URL: <https://code.djangoproject.com/ticket/33015#comment:1>

Django

unread,
Aug 13, 2021, 12:55:49 PM8/13/21
to django-...@googlegroups.com
#33015: QuerySet.extra Use Case: filtering on large lists of tuples
-------------------------------------+-------------------------------------

Reporter: kris-swann | Owner: nobody
Type: Uncategorized | Status: new
Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: QuerySet.extra | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Simon Charette):

I think you might be able to achieve what you're after with the following
(off the main branch)

{{{#!python
from django.db.models import lookups
from django.db.models.expressions import Func

ExampleModel.objects.filter(
lookups.In(
Func('val1', 'val2', template='(%(expressions)s)'),
tuple(items_to_fetch),
)
)
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/33015#comment:2>

Django

unread,
Aug 14, 2021, 7:59:52 AM8/14/21
to django-...@googlegroups.com
#33015: QuerySet.extra Use Case: filtering on large lists of tuples
-------------------------------------+-------------------------------------

Reporter: kris-swann | Owner: nobody
Type: Uncategorized | Status: new
Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: QuerySet.extra | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Keryn Knight):

* cc: Keryn Knight (added)


Comment:

Having been bitten by this same thing multiple times over the years, I'm
interested.

The performance profile of this is reproducible for me on the `2.2.9` and
`3.2.6` tags:
{{{
In [5]: %%timeit
...: query = Q()
...: for v1, v2 in items_to_fetch:
...: query |= Q(val1=v1, val2=v2)
21.9 s ± 1.01 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: %timeit -r3 -n1 x = ExampleModel.objects.filter(query)
48.2 s ± 1.54 s per loop (mean ± std. dev. of 3 runs, 1 loop each)
}}}
but doesn't appear to be a ''big'' issue on `main` since #32940; my
suspicion is that both `WhereNode` and `Q` were hitting the pathological
case for `data in self.children` being O(n), in case that rings a bell
Simon; below is main as of `8208381ba6`; the problem was present at
`501a8db465` and basically gone at `ff661dbd50`:
{{{
In [5]: %%timeit
...: query = Q()
...: for v1, v2 in items_to_fetch:
...: query |= Q(val1=v1, val2=v2)
209 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: %timeit x = ExampleModel.objects.filter(query)
644 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
}}}
Though it's still faster to avoid combining Q objects and kwargs:
{{{
In [14]: %timeit Q(*(Q(('val1', v1), ('val2', v2)) for v1, v2 in
items_to_fetch), _connector=Q.OR)
15.1 ms ± 761 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
}}}
That's true for `2.2.9` too, but regrettably has less impact on the query
compilation itself:
{{{
In [4]: %timeit Q(*(Q(('val1', v1), ('val2', v2)) for v1, v2 in
items_to_fetch), _connector=Q.OR)
22.9 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: %timeit -r3 -n1 x = ExampleModel.objects.filter(query)
20.6 s ± 874 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
}}}

I don't know that the simple Q based version will ever be as fast as
`extra` (as used above) or the `ExpressionTuple` (also above), but it
begins to look ''almost'' acceptable, certainly compared to the previous.

--
Ticket URL: <https://code.djangoproject.com/ticket/33015#comment:3>

Django

unread,
Aug 17, 2021, 4:14:47 AM8/17/21
to django-...@googlegroups.com
#33015: QuerySet.extra Use Case: filtering on large lists of tuples
-------------------------------------+-------------------------------------
Reporter: kris-swann | Owner: nobody
Type: | Status: new
Cleanup/optimization |

Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: QuerySet.extra, | Triage Stage: Accepted
Documentation |
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Carlton Gibson):

* keywords: QuerySet.extra => QuerySet.extra, Documentation
* type: Uncategorized => Cleanup/optimization
* stage: Unreviewed => Accepted


--
Ticket URL: <https://code.djangoproject.com/ticket/33015#comment:4>

Django

unread,
Aug 17, 2021, 7:57:54 AM8/17/21
to django-...@googlegroups.com
#33015: QuerySet.extra Use Case: filtering on large lists of tuples
-------------------------------------+-------------------------------------

Reporter: kris-swann | Owner: nobody
Type: | Status: new
Cleanup/optimization |
Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: QuerySet.extra, | Triage Stage: Accepted
Documentation |
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Simon Charette):

* cc: Simon Charette (added)


Comment:

Thanks for the benchmarking Keryn! That's definitely `data in
self.children` biting us again.

With the current state of things in terms on `Q` performance and the
possibility of passing lookups to filter in 4.0 (#27021) I wonder what's
left to do here. What were you thinking of documenting Carlton?

--
Ticket URL: <https://code.djangoproject.com/ticket/33015#comment:5>

Django

unread,
Aug 17, 2021, 11:24:19 AM8/17/21
to django-...@googlegroups.com
#33015: QuerySet.extra Use Case: filtering on large lists of tuples
-------------------------------------+-------------------------------------

Reporter: kris-swann | Owner: nobody
Type: | Status: new
Cleanup/optimization |
Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: QuerySet.extra, | Triage Stage: Accepted
Documentation |
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Carlton Gibson):

> What were you thinking of documenting Carlton?

It struck that there's some good comments in this thread, that might go
well **<somewhere, need to search where>**.

Opening comment:

> I wasn't able to find anything in the docs (although I might have just
been looking in the wrong places too).

e.g. Can we help folks avoid `extra` better? (But I also thought Keryn's
example was perhaps something we could put somewhere 🤔

But:

> I wonder what's left to do here

Yes, I wondered that too, but didn't want to just close it without
thinking about the potential docs change.

Happy to close if we decide that it's not actionable.

--
Ticket URL: <https://code.djangoproject.com/ticket/33015#comment:6>

Django

unread,
Aug 18, 2021, 9:56:20 AM8/18/21
to django-...@googlegroups.com
#33015: QuerySet.extra Use Case: filtering on large lists of tuples
-------------------------------------+-------------------------------------

Reporter: kris-swann | Owner: nobody
Type: | Status: new
Cleanup/optimization |
Component: Database layer | Version: 2.2
(models, ORM) |
Severity: Normal | Resolution:
Keywords: QuerySet.extra, | Triage Stage: Accepted
Documentation |
Has patch: 0 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Keryn Knight):

So, I'm not against documenting the most efficient way to generate a big
old Q (it's ''usually'' an `OR` IME); I suggested it as an aside in
#32946. Mariusz would prefer not to promote using `_connector` (see
comment 1 in that ticket), and though the reasons aren't given, I think I
kind of get them, and it boils down to: ideally, we shouldn't need to...

In an ideal world, throwing Q objects around even at a large scale
''should'' be fast enough to be usable and not need ... tricks. But even
with attempts to improve other bits around that (#32948 ... doesn't make a
huge amount of difference here, as most time is in `add`, avoiding the
cost of using kwargs because they're being artificially inflated for `Q`
etc), there's an unfortunate inflexion point where it becomes bigger than
most request/response time budgets before you even ''run'' the query. The
query given above is a perfect example of that, where the query may take
~2ms (on sqlite locally for me) but the evaluation into a tuple might take
many times more than that (`100ms` for `30, 30` `items_to_fetch` otherwise
sqlite doesn't like the expression length being more than 1000) ... and
for a change of pace that's ''not'' being spent in `Model.__init__` or any
of the usual suspects (cloning methods). It's entirely in the `Query`
building (if you copy the `.query` once it's been built and paste it into
a new QuerySet, the whole thing is `34ms`). As mentioned, I ''have'' seen
this personally in production (luckily in an out of band celery task,
where time mattered less) and that's how I learnt to mitigate it. I don't
''like it'' and I don't ''like'' that it's something others might have to
learn, but even cutting 50% off the costs of `build_filter` and handling Q
objects leaves you probably searching for additional optimisations to make
it feasible in a normal request/response cycle (at the given `100, 100`
example)

I don't have a good answer for it, really. Documenting that "Yeah it's
just not good at huge condition lists" doesn't seem great, but nor does
leaving users without an escape hatch for ''if'' they encounter such
things. If Simon's suggested alternative via the new functionality is well
documented, this use case could be used as a good example of that new
hotness (though, caveat, I tried running it on `main` and got
`AttributeError: 'F' object has no attribute '_output_field_or_none'` ...
as I know nothing about that functionality (it's a neat surprise), I dunno
if that's because it was an untested ''sketch'' or if it should work but
doesn't).

--
Ticket URL: <https://code.djangoproject.com/ticket/33015#comment:7>

Reply all
Reply to author
Forward
0 new messages