[Django] #31202: Bulk update suffers from poor performance with large numbers of models and columns


Django

unread,
Jan 23, 2020, 9:20:21 AM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom | Owner: nobody
Forbes |
Type: | Status: new
Cleanup/optimization |
Component: Database | Version: master
layer (models, ORM) |
Severity: Normal | Keywords:
Triage Stage: | Has patch: 0
Unreviewed |
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
-------------------------------------+-------------------------------------
A user has reported seeing extremely slow update times when using
bulk_update. With the django-bulk-update package, which does not use the
expressions API and constructs raw SQL directly, an update with 100,000
objects and 10 fields takes 24 seconds. With the built-in bulk_update it
takes 2 minutes and 24 seconds, six times as slow.

The user has provided a reproduction case here: https://github.com/mikicz
/bulk-update-tests/blob/master/apps/something/models.py and
https://github.com/mikicz/bulk-update-
tests/blob/master/apps/something/test_bulk_update.py

From an initial look at the profiling that has been provided
(https://github.com/aykut/django-bulk-
update/files/4060369/inbuilt_pyinstrument.txt) it seems a lot of overhead
comes from building the SQL query rather than executing it - I think if we
can improve the performance it could speed up other workloads.

See: https://github.com/aykut/django-bulk-
update/issues/75#issuecomment-576886385

--
Ticket URL: <https://code.djangoproject.com/ticket/31202>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,
Jan 23, 2020, 9:21:14 AM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: master
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Tom Forbes):

* owner: nobody => Tom Forbes
* status: new => assigned


Old description:

> A user has reported seeing extremely slow update times when using
> bulk_update. With the django-bulk-update package, which does not use the
> expressions API and constructs raw SQL directly, an update with 100,000
> objects and 10 fields takes 24 seconds. With the built-in bulk_update
> it takes 2 minutes and 24 seconds - 6x as slow.
>
> The user has provided a reproduction case here: https://github.com/mikicz
> /bulk-update-tests/blob/master/apps/something/models.py and
> https://github.com/mikicz/bulk-update-
> tests/blob/master/apps/something/test_bulk_update.py
>
> From an initial look at the profiling that has been provided
> (https://github.com/aykut/django-bulk-
> update/files/4060369/inbuilt_pyinstrument.txt) it seems a lot of overhead
> comes from building the SQL query rather than executing it - I think if
> we can improve the performance it could speed up other workloads.
>
> See: https://github.com/aykut/django-bulk-
> update/issues/75#issuecomment-576886385

New description:

A user has reported seeing extremely slow update times when using
bulk_update. With the django-bulk-update package, which does not use the
expressions API and constructs raw SQL directly (https://github.com/aykut
/django-bulk-update/blob/master/django_bulk_update/helper.py#L202), an
update with 100,000 objects and 10 fields takes 24 seconds. With the
built-in bulk_update it takes 2 minutes and 24 seconds, six times as slow.

The user has provided a reproduction case here: https://github.com/mikicz
/bulk-update-tests/blob/master/apps/something/models.py and
https://github.com/mikicz/bulk-update-
tests/blob/master/apps/something/test_bulk_update.py

From an initial look at the profiling that has been provided
(https://github.com/aykut/django-bulk-
update/files/4060369/inbuilt_pyinstrument.txt) it seems a lot of overhead
comes from building the SQL query rather than executing it - I think if we
can improve the performance it could speed up other workloads.

See: https://github.com/aykut/django-bulk-
update/issues/75#issuecomment-576886385

--

--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:1>

Django

unread,
Jan 23, 2020, 9:32:38 AM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: master
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage:
| Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Mikuláš Poul):

* cc: Mikuláš Poul (added)


--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:2>

Django

unread,
Jan 23, 2020, 10:57:18 AM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: master
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Simon Charette):

* stage: Unreviewed => Accepted


Comment:

From looking at the profiles, the performance gap is likely due to the
fact that Django uses `Case`, `When`, and `Value` expressions that need to
be resolved [https://github.com/aykut/django-bulk-
update/blob/399e64d820133a79f910682c1cf247938c5c4784/django_bulk_update/helper.py
instead of the raw SQL the third-party package constructs directly].

I suspect using `RawSQL` with `CASE`/`WHEN` SQL templates instead of
`Case`/`When` expressions would significantly speed up the current
implementation.
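
To illustrate where the time goes, here is a minimal sketch of the kind of
CASE/WHEN SQL that `bulk_update` effectively emits for a single column. The
helper name and layout are illustrative only, not Django's internal API;
Django builds the equivalent through `Case`/`When`/`Value` expression
objects, each of which must be resolved (and copied) before compilation:

```python
def build_case_update_sql(table, pk_column, column, pk_values):
    # One WHEN branch per updated row; Django constructs the equivalent
    # via Case/When/Value expression objects, whose resolution is where
    # the profiling shows the time going.
    whens = " ".join("WHEN %s = %%s THEN %%s" % pk_column for _ in pk_values)
    placeholders = ", ".join(["%s"] * len(pk_values))
    return (
        f"UPDATE {table} SET {column} = CASE {whens} END "
        f"WHERE {pk_column} IN ({placeholders})"
    )

sql = build_case_update_sql("mytable", "id", "mytext", [1, 2])
```

Note how the statement grows linearly with both the row count and (once
repeated per column) the field count, matching the linear scaling reported
later in this thread.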

--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:3>

Django

unread,
Jan 23, 2020, 3:23:22 PM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: master
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Mikuláš Poul):

I was running some more experiments earlier and have just now got around
to analysing them: the slowdown doesn't seem to depend on the type of
field, at least for the couple of types I have in my project (see
https://github.com/mikicz/bulk-update-
tests/blob/master/single_field_comparison.pdf). The performance is linear
in both the number of objects updated and the number of fields updated
(see https://github.com/mikicz/bulk-update-
tests/blob/master/two_fields_comparison.pdf and https://github.com/mikicz
/bulk-update-tests/blob/master/all_fields_comparison.pdf).
The code that ran the experiments is in the `experiment` app.

--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:4>

Django

unread,
Feb 11, 2020, 5:14:45 PM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: master
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Tom Forbes):

I've been taking a look at this today. It's been a while since I've poked
around inside the expressions API, but I can see a couple of issues:
1. We generate a _lot_ of allocations during the whole expression
resolution process; 40% of the time is spent allocating, according to
vmprof.
2. `resolve_expression` for the Case statement is very copy-happy. Each
resolution copies all the inner nodes, which then copy themselves.
3. We apparently call `resolve_expression` *twice* for each bulk update,
needlessly. On the first invocation we call [it from this
method](https://github.com/django/django/blob/3259983f569151232d8e3b0c3d0de3a858c2b265/django/db/models/sql/subqueries.py#L105-L109)
which replaces the Q objects with WhereNodes. We then resolve *again*
during execution [from this with
statement](https://github.com/django/django/blob/master/django/db/models/query.py#L769-L770).

I'll keep investigating, but after naively disabling the first resolution
locally everything worked as expected, which implies that for the happy
path it might be wasted work.

--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:5>

Django

unread,
Apr 10, 2020, 10:51:40 AM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: master
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by Florian Demmer):

* cc: Florian Demmer (added)


--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:6>

Django

unread,
Mar 17, 2021, 3:54:01 PM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: dev
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Hannes Ljungberg):

One suggestion is to use the `UPDATE ... FROM` syntax along with `VALUES`
on backends which support it. For example:

{{{
UPDATE mytable
SET
mytext = myvalues.mytext,
myint = myvalues.myint
FROM (
VALUES
(1, 'textA', 99),
(2, 'textB', 88)
) AS myvalues (pk, mytext, myint)
WHERE mytable.pk = myvalues.pk
}}}

I know PostgreSQL supports this syntax but I haven't looked into the
others.
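
A rough sketch of how such a statement could be assembled with parameter
placeholders follows; the helper below is hypothetical, not part of
Django, and the actual values would be passed separately as query
parameters:

```python
def build_update_from_values_sql(table, pk_column, columns, num_rows):
    # One (pk, col1, col2, ...) placeholder tuple per updated row.
    row = "(" + ", ".join(["%s"] * (len(columns) + 1)) + ")"
    values = ",\n    ".join([row] * num_rows)
    assignments = ", ".join(f"{c} = myvalues.{c}" for c in columns)
    return (
        f"UPDATE {table} SET {assignments} FROM (VALUES\n    {values}\n"
        f") AS myvalues ({pk_column}, {', '.join(columns)}) "
        f"WHERE {table}.{pk_column} = myvalues.{pk_column}"
    )

sql = build_update_from_values_sql("mytable", "pk", ["mytext", "myint"], 2)
```

Unlike the CASE/WHEN approach, the statement here grows by one short
placeholder tuple per row rather than one WHEN branch per row per column.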

--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:7>

Django

unread,
Mar 28, 2021, 6:48:14 PM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: dev
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by Chris Jerdonek):

Here's another approach I thought would be worth mentioning. PostgreSQL
[https://www.postgresql.org/docs/13/sql-copy.html supports] `COPY FROM`
for copying data from a file (or stream) into a table. For
`bulk_update()`, you could use `COPY FROM` to load the data into a
temporary table and then run the update from that. I wonder whether that
would be faster than even `django-bulk-update`'s approach, in part because
very little SQL would need to be generated.
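
As a sketch of what that strategy would entail (assuming PostgreSQL; the
helper below is hypothetical and only builds the statement sequence, since
actually executing the COPY step needs a live connection, e.g. via
psycopg2's `copy_expert`):

```python
def copy_update_statements(table, pk_column, columns):
    # Sketch of the COPY FROM strategy: create a temp table mirroring the
    # target, COPY the new values into it, then update the target table
    # from the temp table in a single joined UPDATE.
    cols = ", ".join([pk_column] + columns)
    temp = f"temp_{table}"
    assignments = ", ".join(f"{c} = {temp}.{c}" for c in columns)
    return [
        f"CREATE TEMP TABLE {temp} (LIKE {table} INCLUDING DEFAULTS) "
        f"ON COMMIT DROP",
        f"COPY {temp} ({cols}) FROM STDIN",
        f"UPDATE {table} SET {assignments} FROM {temp} "
        f"WHERE {table}.{pk_column} = {temp}.{pk_column}",
    ]

stmts = copy_update_statements("mytable", "pk", ["mytext", "myint"])
```

Only three short statements are sent regardless of row count; the bulk of
the data travels through the COPY stream rather than through SQL text.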

--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:8>

Django

unread,
Jan 27, 2022, 6:34:05 AM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: dev
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by jerch):

I have started to implement an UPDATE FROM VALUES variant in
https://github.com/netzkolchose/django-
computedfields/blob/master/computedfields/fast_update.py, as I needed
faster updates in that lib.

First, the speedup is huge: it performs 10 to 25 times better than the
current bulk_update implementation.

But of course, there are several issues with this:
- only supported by newer DBMS versions (SQLite 3.33+, MariaDB 10.3+,
MySQL 8; Oracle unclear)
- not official ANSI SQL, thus the backends need their very own SQL
templates
- MariaDB and MySQL 8 need different templates, which is not covered by
Django's backend distinction
- not easy to fit into the ORM's SQL compiler templates, thus I went with
string formatting for now
- F expressions cannot be used in VALUES at all

For Postgres, imho COPY FROM would be even faster, but I did not pursue it
due to the complexity of maintaining a totally different code path just
for Postgres.

--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:9>

Django

unread,
Feb 22, 2022, 5:49:12 AM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: dev
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by jerch):

Some early tests with COPY FROM indicate that it outperforms UPDATE FROM
VALUES by far (at least 4 times faster). But the code strays even further
from Django paradigms in almost every aspect. This is probably nothing for
Django itself, but maybe for a third-party lib that people can use if they
are aware of the basic restrictions.

What I find bothersome with COPY FROM:
- needs dealing with temporary tables, with columns partially bootstrapped
from the target table
- might need its own cast/escape rules for more complex field types (I
have not yet looked at psycopg3's copy cursor mechanics)
- might not work for all field types/payloads (this most likely depends a
lot on the transport format/delimiters used)
- the impact of an index on the pk field of the temp table is uncertain
(whether the cost of index creation outweighs the index gain on the final
update)

Well, this needs a lot more investigation before it will be usable with a
nice interface...

--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:10>

Django

unread,
Mar 5, 2022, 6:27:27 AM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: dev
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by jerch):

Some update on COPY FROM:

I did a first playground implementation just to see the effects on
performance, see
https://gist.github.com/jerch/fd0fae0107ce7b153b7540111b2e89ab. The chart
over there shows the mean runtime of 100 runs of x updated records in a
100k-row table, updating just one integer field per record. The perf tests
were done with a plain Postgres 14 Docker image, with no tweaking of any
settings.

The implementation `copy_update` is not yet optimized for performance or
neatly integrated; it exists just to get an idea of where the ballpark for
COPY FROM would end up. `bulk_update` is Django's current implementation,
`django-bulk-update` is from https://github.com/aykut/django-bulk-update,
and `fast_update` is my early implementation of direct UPDATE FROM VALUES
from above.

Some observations from that:
- `bulk_update` shows much worse runtime behaviour than all the others
(thus its measurement was stopped early)
- `django-bulk-update` can keep up a bit longer, but then shows the same
poor runtime growth (probably due to the same SQL logic being used?). This
gets much worse for updates >5k (not shown).
- `copy_update` has much higher setup costs (1 to 256 updated records)
- between 4096 and 8192 updates `copy_update` starts to pay off compared
to `fast_update`; at 32k updates it is almost twice as fast
- not shown in the charts: for higher update counts it keeps gaining
ground (almost 4 times faster for 1M updated records)
- there is something going on between 256-512 updates, as almost all
implementations show a steep jump there (a Postgres b-tree depth change?
did not investigate that further...)

Some early insights from that:
- As already stated in an earlier comment, Django's `bulk_update` is
currently pretty wasteful, as it drops far behind even `django-bulk-
update`, which uses the same SQL update strategy.
- `fast_update` is the winner for small to medium update counts, up to
~10k.
- `copy_update` starts to shine for update counts >10k.

--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:11>

Django

unread,
Apr 8, 2022, 8:57:14 AM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: dev
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------

Comment (by jerch):

I've made some progress in https://github.com/netzkolchose/django-fast-
update, which contains `fast_update` for Postgres, SQLite, MariaDB and
MySQL 8, as well as `copy_update` for Postgres. The package is currently
alpha (though quite complete in CI tests); I would be glad to get some
feedback and some real-world tests.

Some notes about the implementations: I have given up for now on closer
integration of `fast_update` with ORM internals; this needs someone with
deeper knowledge and time to actually do it. `copy_update` is more a PoC
with additional constraints (e.g. the values' transport representation
relies a lot on correct `__str__` output); imho this could be realized
more straightforwardly with psycopg3's superior COPY support, once Django
moves there.

--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:12>

Django

unread,
Apr 25, 2022, 10:05:40 AM
to django-...@googlegroups.com
#31202: Bulk update suffers from poor performance with large numbers of models and
columns
-------------------------------------+-------------------------------------
Reporter: Tom Forbes | Owner: Tom
Type: | Forbes
Cleanup/optimization | Status: assigned
Component: Database layer | Version: dev
(models, ORM) |
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
-------------------------------------+-------------------------------------
Changes (by John Speno):

* cc: John Speno (added)


--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:13>
