[Django] #35399: Reduce the "Case-When" sequence for a bulk_update when the values for a certain field are the same.

Django

unread,

Apr 23, 2024, 11:39:49 AMApr 23

to django-...@googlegroups.com

#35399: Reduce the "Case-When" sequence for a bulk_update when the values for a
certain field are the same.
-------------------------------------+-------------------------------------
Reporter: Willem | Owner: nobody
Van Onsem |
Type: New | Status: new
feature |
Component: Database | Version: 5.0
layer (models, ORM) | Keywords: db, bulk_update,
Severity: Normal | case, when
Triage Stage: | Has patch: 1
Unreviewed |
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
-------------------------------------+-------------------------------------
Django's `.bulk_update(..)` seems to work with sequences of `Case-When`
*per* field and *per* record to update. In other words, if we have three
items we want to update with `pk` being 1, 4 and 5, and we have a field
`yn` and the records for pk `1` and `5` have value `y`, whereas the one
for `pk=4` it is `n`, then this will make a query that looks like:

{{{
UPDATE my_table
SET yn=CASE WHEN id=1 THEN 'y' WHEN id=4 THEN 'n' WHEN id=5 THEN 'y' END
WHERE id IN (1,4,5)
}}}

It thus builds a long sequence of `Case`-`When`s, which is not efficient.

There are for most databases solutions with a temporary table that is then
upserted into the main table. The Django ORM could probably use existent
tools for this where we first make a temporary model, create it with the
migration model, insert in bulk, then upsert with a query, and finally
remove the temporary table. But a problem with this is, we might not have
privileges to create an extra table in that database.

But a low-hanging optimization is to first *group* the values together
that we can group together. This can be done with a `defaultdict` that
maps the value to a list of primary keys for that value. In case the value
is not hashable, we can fallback on the original case-when logic. But for
values like strings, this would reduce the query to:

{{{
UPDATE my_table
SET yn=CASE WHEN id IN (1, 5) THEN 'y' WHEN id=4 THEN 'n' END
WHERE id IN (1,4,5)
}}}

That being said, I think bulk_updates should still be done in a more
efficient manner, perhaps moving it to the engines, since a lot of
databases allow the creation of temporary tables to make upserts a lot
more efficient.
--
Ticket URL: <https://code.djangoproject.com/ticket/35399>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,

Apr 23, 2024, 11:40:07 AMApr 23

to django-...@googlegroups.com

#35399: Reduce the "Case-When" sequence for a bulk_update when the values for a
certain field are the same.
-------------------------------------+-------------------------------------

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Changes (by Willem Van Onsem):

* Attachment "group_the_case-whens_into_sub-chunks_.patch" added.

Django

unread,

Apr 23, 2024, 12:22:17 PMApr 23

to django-...@googlegroups.com

#35399: Reduce the "Case-When" sequence for a bulk_update when the values for a
certain field are the same.
-------------------------------------+-------------------------------------
Reporter: Willem Van Onsem | Owner: nobody

Type: New feature | Status: closed

Component: Database layer | Version: 5.0
(models, ORM) |

Severity: Normal | Resolution: duplicate

Changes (by Simon Charette):

* resolution: => duplicate
* status: new => closed

Comment:

This was discussed in #35124 recently and at lead to
[https://forum.djangoproject.com/t/could-bulk-update-aggregate-the-pks-in-
when-clauses/27093/2 this discussion] about the complexity of and cost of
hashing certain values quickly. Unless you can provide benchmarks that
this is actually faster I think time is better invested towards #29771 and
#31202.
--
Ticket URL: <https://code.djangoproject.com/ticket/35399#comment:1>

Django

unread,

Apr 24, 2024, 9:33:47 AMApr 24

to django-...@googlegroups.com

#35399: Reduce the "Case-When" sequence for a bulk_update when the values for a
certain field are the same.
-------------------------------------+-------------------------------------
Reporter: Willem Van Onsem | Owner: nobody
Type: New feature | Status: closed
Component: Database layer | Version: 5.0
(models, ORM) |
Severity: Normal | Resolution: duplicate
Keywords: db, bulk_update, | Triage Stage:
case, when | Unreviewed
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Willem Van Onsem):

I created a small testcase that has a speedup of 53.2% ±4.073% (p=95%):

{{{
def benching(n=100, m=10):
def benchmark():
EventStatistics.objects.bulk_update(es,
fields=('scans_in',))
def benchmark2():
EventStatistics.objects.bulk_update2(es,
fields=('scans_in',))

es = list(EventStatistics.objects.order_by('?')[:n])
for e in es:
e.scans_in = random.randint(0, m)
without = timeit.timeit(benchmark, number=15)

es = list(EventStatistics.objects.order_by('?')[:n])
for e in es:
e.scans_in = random.randint(0, 10)
_with = timeit.timeit(benchmark2, number=15)

speedup = round(100*_with/without) # percentage of the new one
compared to the old
print(m, n, without, _with, f'{speedup}%')
return speedup
}}}

Where I added the `bulk_update2` method to the `QuerySet`. The code is the
proposed change, whereas `bulk_update` is the original one.

Here `scans_in` is an `IntegerField`, `m` is the number of different
values (well randomly determined, so an upper bound), `n` is the number of
records to update. I used random records to avoid that these are all
located near each other, and furthermore used different ones for the two,
to avoid that the indexes for example are cached, or some other speedup.
Every test ran 15 times with the `timeit` module with the following
results (first two columns are `m` and `n`, then the timings in seconds,
and finally the time of the speedup, compared to the original):

{{{
10 100 0.23681159999978263 0.1789207000110764 76%
10 500 1.103456400000141 0.5554370999889215 50%
10 1000 2.4681710000004387 1.3000370000081602 53%
10 5000 13.48228929999459 6.651690699989558 49%
10 10000 33.66046780000033 15.352994000000763 46%
20 100 0.25399490000563674 0.17354469999554567 68%
20 500 1.041674799998873 0.5452321000047959 52%
20 1000 2.2131319000036456 1.3127163000026485 59%
20 5000 26.524631100008264 10.656895499996608 40%
20 10000 36.67004110000562 14.241265099990414 39%
50 100 0.21403420000569895 0.13851690001320094 65%
50 500 1.2294649999967078 0.6158274999907007 50%
50 1000 2.1560433999984525 1.1467072000086773 53%
50 5000 14.895891699998174 7.622537400005967 51%
50 10000 34.32338119999622 15.66809479999938 46%
100 100 0.24263419999624602 0.15937580000900198 66%
100 500 1.225836999990861 0.634526800000458 52%
100 1000 2.245729800008121 1.1381603000045288 51%
100 5000 13.832920399989234 6.757217299993499 49%
100 10000 39.29845810000552 19.19906909999554 49%
}}}

or as a table with the "speedup" (smaller is a **stronger** speedup):

{{{
--- --- --- ---- ---- -----
x 100 500 1000 5000 10000
10 76% 50% 53% 49% 46%
20 68% 52% 59% 40% 39%
50 65% 50% 53% 51% 46%
100 66% 52% 51% 49% 49%
--- --- --- ---- ---- -----
}}}

I also ran these in the opposite direction, to make sure it was not that
the first queries somehow let the database respond faster (because the
PostgreSQL database probably contains code pages, indexes, etc. that could
be in the cache). Then we achieve the following results:

{{{
10 100 0.16778360000171233 0.26211979999789037 156%
10 500 0.575189399998635 1.1075479999999516 193%
10 1000 1.0994390999985626 2.76130669999111 251%
10 5000 6.549138200003654 14.862665100008599 227%
10 10000 14.595775299996603 41.87046229999396 287%
20 100 0.19783109999843873 0.26739639999868814 135%
20 500 0.8581540000013774 1.7171591000078479 200%
20 1000 1.173938799998723 2.096230099996319 179%
20 5000 7.9105590999970445 15.384056599999894 194%
20 10000 16.410469500013278 36.77300239999022 224%
50 100 0.22504739998839796 0.22487839999666903 100%
50 500 0.6390990999934729 1.1038762999960454 173%
50 1000 1.2254422000114573 2.0602585000015097 168%
50 5000 6.814096200003405 13.404201700002886 197%
50 10000 18.954565500011086 41.927377500003786 221%
100 100 0.2481432000058703 0.22796000000380445 92%
100 500 0.7834087000082945 1.3490471999975853 172%
100 1000 1.5713398999942 2.1639006000041263 138%
100 5000 7.152072100012447 13.975013000002946 195%
100 10000 16.15594609999971 36.78454979999515 228%
}}}

or as a table with the parameters (greater is a **stronger** speedup) we
get 186.5% ±21.048% (p=95%):

{{{
--- ---- ---- ---- ---- -----
x 100 500 1000 5000 10000
10 156% 193% 251% 227% 287%
20 135% 200% 179% 194% 224%
50 100% 173% 168% 197% 221%
100 92% 172% 138% 195% 228%
--- ---- ---- ---- ---- -----
}}}

Finally I also moved the generation or random numbers into the testcase,
this because if the values are once updated, this might reduce the query
time, since it does not require writing to a disk, so:

{{{
def benching(n=100, m=10):
def benchmark():
es = list(EventStatistics.objects.order_by('?')[:n])
for e in es:
e.scans_in = random.randint(0, m)
EventStatistics.objects.bulk_update(es,
fields=('scans_in',))
def benchmark2():
es = list(EventStatistics.objects.order_by('?')[:n])
for e in es:
e.scans_in = random.randint(0, 10)
EventStatistics.objects.bulk_update2(es,
fields=('scans_in',))

without = timeit.timeit(benchmark, number=15)
_with = timeit.timeit(benchmark2, number=15)

speedup = round(100*_with/without) # percentage of the new one
compared to the old
print(m, n, without, _with, f'{speedup}%')
return speedup
}}}

This will "flatten out" the speedup, since the part where we assign values
also counts as time, and this is the same for both versions. With the
original `bulk_update` first, we get the following results (76.55%
±10.293%, p=95%):

{{{
10 100 2.384173699989333 2.2892083000042476 96%
10 500 3.9282099999982165 2.949110200002906 75%
10 1000 5.0259445999981835 5.3350902999955 106%
10 5000 20.10881380000501 11.846710900004837 59%
10 10000 50.26291349998792 24.37554049999744 48%
20 100 1.7966537000029348 2.1353922999987844 119%
20 500 2.8477611999987857 3.5514096999977482 125%
20 1000 4.702102600000217 3.8196782000013627 81%
20 5000 19.417894899990642 12.532883399995626 65%
20 10000 45.69321430000127 20.617325500003062 45%
50 100 1.9369860000006156 1.8516613000101643 96%
50 500 3.1590656000043964 2.6837016999925254 85%
50 1000 4.705449500004761 2.9762405999936163 63%
50 5000 17.380039900002885 10.148739099997329 58%
50 10000 40.880203600012464 20.52081229999021 50%
100 100 1.7478849000035552 1.6249813000031281 93%
100 500 3.1730380000080913 2.2462558999977773 71%
100 1000 4.052220900004613 3.3273870999983046 82%
100 5000 17.978366199997254 11.050199799996335 61%
100 10000 38.403917099989485 20.502074199990602 53%
--- ---- ---- ---- ---- -----
x 100 500 1000 5000 10000
10 96% 75% 106% 59% 48%
20 119% 125% 81% 65% 45%
50 96% 85% 63% 58% 50%
100 93% 71% 82% 61% 53%
--- ---- ---- ---- ---- -----
}}}

and with the "improved" version first (142.9% ±19.563%, p=95%):

{{{
10 100 2.5029061999957776 1.902948799994192 76%
10 500 2.3854384000005666 3.409526500006905 143%
10 1000 3.062690600010683 4.10313469999528 134%
10 5000 10.875457200003439 17.695201499998802 163%
10 10000 21.629492600011872 48.996730700004264 227%
20 100 2.12547889999405 2.262495199989644 106%
20 500 2.699372699993546 3.4100137000059476 126%
20 1000 3.840364799994859 4.572843499990995 119%
20 5000 10.895366700002342 19.433951699989848 178%
20 10000 22.678348500005086 42.50577279999561 187%
50 100 3.069858699993347 2.5085584000044037 82%
50 500 2.7575027999992017 3.3038902999978745 120%
50 1000 3.572029700007988 4.635487000006833 130%
50 5000 10.223477299994556 20.650777300004847 202%
50 10000 30.925161999999546 56.15645150000637 182%
100 100 3.82010420000006 2.949311400006991 77%
100 500 3.6471351999934996 6.216737800001283 170%
100 1000 6.9234703999973135 9.566761500012944 138%
100 5000 26.763996299996506 25.526094300003024 95%
100 10000 25.562641399999848 51.87259610000183 203%
--- ---- ---- ---- ---- -----
x 100 500 1000 5000 10000
10 76% 143% 134% 163% 227%
20 106% 126% 119% 178% 187%
50 82% 120% 130% 202% 182%
100 77% 170% 138% 95% 203%
--- ---- ---- ---- ---- -----
}}}

Some extra ideas to optimize is to put the `Case` with the largest number
of pks first, since that is then the first one that will be taken,
avoiding to have to walk over all cases by the database, which is
essentially linear search.
--
Ticket URL: <https://code.djangoproject.com/ticket/35399#comment:2>

Django

unread,

Apr 24, 2024, 9:34:06 AMApr 24

to django-...@googlegroups.com

#35399: Reduce the "Case-When" sequence for a bulk_update when the values for a
certain field are the same.
-------------------------------------+-------------------------------------
Reporter: Willem Van Onsem | Owner: nobody

Type: New feature | Status: new

Component: Database layer | Version: 5.0
(models, ORM) |
Severity: Normal | Resolution:

Changes (by Willem Van Onsem):

* resolution: duplicate =>
* status: closed => new

--
Ticket URL: <https://code.djangoproject.com/ticket/35399#comment:3>

Django

unread,

Apr 24, 2024, 1:11:01 PMApr 24

to django-...@googlegroups.com

#35399: Reduce the "Case-When" sequence for a bulk_update when the values for a
certain field are the same.
-------------------------------------+-------------------------------------
Reporter: Willem Van Onsem | Owner: nobody
Type: New feature | Status: new
Component: Database layer | Version: 5.0
(models, ORM) |
Severity: Normal | Resolution:
Keywords: db, bulk_update, | Triage Stage:
case, when | Unreviewed
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Simon Charette):

Thank you for taking the time to perform the benchmarks, I think you
missed the point brought up about hashing being expensive on non-literal
values though.

In other words `int.__hash__` is pretty fast but `Expression.__hash__`
isn't and you need to perform an implicit one to gather values in
`defaultdict(list)`. Since `bulk_update` support both expressions and
literal assignment the benchmark must be run against both to be
representative.

Try running them with assignments of `F("scans_in") +
Value(random.randint(0, m))` instead and you should see a slowdown.

The need for _hashability_ also poses another problem. What should be done
with values that are not hashable such as `JSONField` assignment of
`dict`s?

{{{#!python
objects = [
Object(pk=1, data={"foo": "bar"}),
Object(pk=2, data={"foo": "bar"}),
]
Object.objects.bulk_update(es, fields=["data"])
}}}

When all of these edge cases are accounted for (we'll need a route for
non-hashable values mixed with hashable values) I still believe that time
would be better spend not using `CASE` / `WHEN` at all and favor investing
efforts towards #29771 and #31202.
--
Ticket URL: <https://code.djangoproject.com/ticket/35399#comment:4>

Django

unread,

Apr 24, 2024, 2:08:15 PMApr 24

to django-...@googlegroups.com

#35399: Reduce the "Case-When" sequence for a bulk_update when the values for a
certain field are the same.
-------------------------------------+-------------------------------------
Reporter: Willem Van Onsem | Owner: nobody

Type: | Status: closed
Cleanup/optimization |

Component: Database layer | Version: 5.0
(models, ORM) |

Severity: Normal | Resolution: duplicate

Changes (by Natalia Bidart):

* resolution: => duplicate
* status: new => closed

* type: New feature => Cleanup/optimization

Comment:

Hello Willem, while the benchmarks are useful, considering Simon's
responses and Django's usual procedures for accepting changes, I believe
it would be best to continue the conversation in the forum topic that has
already been created for this issue. This way, more members of the
community will receive notifications from the forum, unlike ticket
reports:

https://forum.djangoproject.com/t/could-bulk-update-aggregate-the-pks-in-
when-clauses/27093
--
Ticket URL: <https://code.djangoproject.com/ticket/35399#comment:5>

Django

unread,

Apr 24, 2024, 2:09:14 PMApr 24

to django-...@googlegroups.com

#35399: Reduce the "Case-When" sequence for a bulk_update when the values for a
certain field are the same.
-------------------------------------+-------------------------------------
Reporter: Willem Van Onsem | Owner: nobody
Type: | Status: closed
Cleanup/optimization |
Component: Database layer | Version: 5.0
(models, ORM) |
Severity: Normal | Resolution: duplicate
Keywords: db, bulk_update, | Triage Stage:
case, when | Unreviewed
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Willem Van Onsem):

Hi, that is already in the changeset: in case the item is not hashable, it
will raise a type error: we catch that, and then fallback on the original
way of doing it, without much delay. So it is more a boosting mechanism
that when fails, is "disabled" when running through it.

The hash of an expression indeed is take more time than the one of an
`int`, but the idea is that this often will still pay off, since the
cascade of Case`-`When` essentially forces the database into linear
search, while it *might* be possible that a certain database finds a way,
if that is not the case, it will first look at the first `When`, if that
fails, the next one, etc. so for ~1000 `When`s, it will on average take
500 items per row.

I agree that a temporary table is probably more ideal.
--
Ticket URL: <https://code.djangoproject.com/ticket/35399#comment:6>

Django

unread,

Apr 24, 2024, 3:10:54 PMApr 24

to django-...@googlegroups.com

#35399: Reduce the "Case-When" sequence for a bulk_update when the values for a
certain field are the same.
-------------------------------------+-------------------------------------
Reporter: Willem Van Onsem | Owner: nobody
Type: | Status: closed
Cleanup/optimization |
Component: Database layer | Version: 5.0
(models, ORM) |
Severity: Normal | Resolution: duplicate
Keywords: db, bulk_update, | Triage Stage:
case, when | Unreviewed
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Simon Charette):

> it will raise a type error: we catch that, and then fallback on the
original way of doing it, without much delay.

Creating and catching exceptions is not cheap either so I'm not sure we
can qualify it as ''without much delay''. I don't see any benchmark
backing it is what I mean.

> the idea is that this often will still pay off, since the cascade of
Case - When essentially forces the database into linear search

I see what you mean with the database level linear scan but my concerns is
that the extra amount time we spend ''massaging'' the data on the Python
side pre-emptively in hope of preventing the database level slowdown from
happening is going to cause more harm than good.

You benchmarked against are the ''happy path'' where a there's a single
field to update, the values are all literals, the values are all hashable.
I think that a proper solution should ensure that the ''Python layer tax''
of organizing the data doesn't outweigh the expected database level
benefits.

Having separate benchmarks for query / SQL generation in both variants of
`bulk_update` vs database level execution of the SQL on supported backends
would help in taking more advised decision here. All of that can take
place on the forum though.
--
Ticket URL: <https://code.djangoproject.com/ticket/35399#comment:7>

Django

unread,

Apr 24, 2024, 3:40:35 PMApr 24

to django-...@googlegroups.com

#35399: Reduce the "Case-When" sequence for a bulk_update when the values for a
certain field are the same.
-------------------------------------+-------------------------------------
Reporter: Willem Van Onsem | Owner: nobody
Type: | Status: closed
Cleanup/optimization |
Component: Database layer | Version: 5.0
(models, ORM) |
Severity: Normal | Resolution: duplicate
Keywords: db, bulk_update, | Triage Stage:
case, when | Unreviewed
Has patch: 1 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Willem Van Onsem):

Well typically hashing should run linear with the data, so `Value(1)`
should indeed take more time than `1`, but not dramatically more (by the
way, it already hashes `Value(1)`, since it first checks the `if not
hasattr(attr, "resolve_expression")` first, and wraps it into a `Value`,
so that is where the current benmarks originate from. If we would have
done `=Value(random.randint(0, 10))` for this benchmark, it would make no
difference. From the moment it encounters a hash error, it sets the
dictionary to `None`, and thus no longer hashes, and saves these cycles,
it thus will stop looking for hashes if one of the items can not be
hashed.

But probably the main reason why it would be very strange that the hashing
would increase time significantly is that in order to generate an SQL
counterpart of some expression (like F('a') + Value(1)`, it will *also*
take linear). So essentially *building* the SQL query with *all* `Case` /
`When` items, will always take approximately the same effort as hashing
all these items, since both run linear in the "size" of the SQL
expressions (or at least, that is a reasonable assumption), so we will
lose that effort anyway. It will thus probably at most ~double the amount
of effort to generate the query, if there are (close) to no duplicate
expressions available, and my experience is that building the query
itself, often is not really the bottleneck: once to generate the hashes,
and once to generate the query. If there are duplicate expressions, it
saves also on generate parts of the SQL query, which again will probably
not have much impact in a positive or negative way, since that is almost
never the bottleneck.
--
Ticket URL: <https://code.djangoproject.com/ticket/35399#comment:8>

Reply all

Reply to author

Forward