The user has provided a reproduction case here:
https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/models.py
and
https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/test_bulk_update.py
From an initial look at the profiling that has been provided
(https://github.com/aykut/django-bulk-update/files/4060369/inbuilt_pyinstrument.txt)
it seems a lot of the overhead comes from building the SQL query rather
than executing it - I think if we can improve the performance it could
speed up other workloads.
See: https://github.com/aykut/django-bulk-update/issues/75#issuecomment-576886385
--
Ticket URL: <https://code.djangoproject.com/ticket/31202>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* owner: nobody => Tom Forbes
* status: new => assigned
Old description:
> A user has reported seeing extremely slow update times when using
> bulk_update. With the django-bulk-update package, which does not use the
> expressions API and constructs raw SQL directly, an update with 100,000
> objects and 10 fields takes 24 seconds. With the built-in bulk_update it
> takes 2 minutes and 24 seconds - 6x as slow.
>
> The user has provided a reproduction case here:
> https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/models.py
> and
> https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/test_bulk_update.py
>
> From an initial look at the profiling that has been provided
> (https://github.com/aykut/django-bulk-update/files/4060369/inbuilt_pyinstrument.txt)
> it seems a lot of the overhead comes from building the SQL query rather
> than executing it - I think if we can improve the performance it could
> speed up other workloads.
>
> See: https://github.com/aykut/django-bulk-update/issues/75#issuecomment-576886385
New description:
A user has reported seeing extremely slow update times when using
bulk_update. With the django-bulk-update package, which does not use the
expressions API and constructs raw SQL directly
(https://github.com/aykut/django-bulk-update/blob/master/django_bulk_update/helper.py#L202),
an update with 100,000 objects and 10 fields takes 24 seconds. With the
built-in bulk_update it takes 2 minutes and 24 seconds - 6x as slow.
The user has provided a reproduction case here:
https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/models.py
and
https://github.com/mikicz/bulk-update-tests/blob/master/apps/something/test_bulk_update.py
From an initial look at the profiling that has been provided
(https://github.com/aykut/django-bulk-update/files/4060369/inbuilt_pyinstrument.txt)
it seems a lot of the overhead comes from building the SQL query rather
than executing it - I think if we can improve the performance it could
speed up other workloads.
See: https://github.com/aykut/django-bulk-update/issues/75#issuecomment-576886385
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:1>
* cc: Mikuláš Poul (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:2>
* stage: Unreviewed => Accepted
Comment:
From looking at the profiles, the performance boost is likely due to the
fact that Django uses `Case`, `When`, and `Value` expressions that need to
be resolved, instead of raw SQL like the third-party package does
(https://github.com/aykut/django-bulk-update/blob/399e64d820133a79f910682c1cf247938c5c4784/django_bulk_update/helper.py).
I suspect using `RawSQL` with `CASE`/`WHEN` SQL templates instead of
`Case`/`When` expressions would significantly speed up the current
implementation.
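To make the comparison concrete, here is a minimal, hypothetical sketch (plain Python string assembly, not Django's actual API, and the function names are made up) of the kind of raw `CASE`/`WHEN` SQL template the third-party package builds per field. The point is that assembling the statement with parameter placeholders sidesteps the per-node expression resolution entirely:

```python
def build_case_sql(pk_column, field, pk_to_value):
    """Build `"field" = (CASE "pk" WHEN %s THEN %s ... END)` plus its params."""
    whens = []
    params = []
    for pk, value in pk_to_value.items():
        whens.append("WHEN %s THEN %s")
        params.extend([pk, value])
    sql = '"{field}" = (CASE "{pk}" {whens} ELSE "{field}" END)'.format(
        field=field, pk=pk_column, whens=" ".join(whens)
    )
    return sql, params


def build_update_sql(table, pk_column, field, pk_to_value):
    """One UPDATE touching all pks, with a CASE for the updated field."""
    set_sql, params = build_case_sql(pk_column, field, pk_to_value)
    pks = list(pk_to_value)
    sql = 'UPDATE "{t}" SET {s} WHERE "{pk}" IN ({ph})'.format(
        t=table, s=set_sql, pk=pk_column, ph=", ".join(["%s"] * len(pks))
    )
    return sql, params + pks


sql, params = build_update_sql("mytable", "id", "myint", {1: 10, 2: 20})
# params == [1, 10, 2, 20, 1, 2]
```

Resolving a `Case`/`When`/`Value` expression tree produces the same SQL in the end, but only after every inner node has been resolved and compiled individually.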
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:3>
Comment (by Mikuláš Poul):
I was running some more experiments earlier, and just now got to analysing
them: it doesn't seem like it depends on the type of field, at least for
the couple of types I have in my project specifically (see
https://github.com/mikicz/bulk-update-tests/blob/master/single_field_comparison.pdf).
The performance is linear in both the number of objects to be updated and
the number of fields to be updated (see
https://github.com/mikicz/bulk-update-tests/blob/master/two_fields_comparison.pdf
and
https://github.com/mikicz/bulk-update-tests/blob/master/all_fields_comparison.pdf).
The code that was running the experiment is in the `experiment` app.
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:4>
Comment (by Tom Forbes):
I've been taking a look at this today. It's been a while since I've poked
around inside the expressions API, but I can see a couple of issues:
1. We generate a _lot_ of allocations during the whole expression
resolution process - 40% of the time is spent allocating, according to
vmprof.
2. `resolve_expression` for the Case statement is very copy-happy. Each
resolution copies all the inner nodes, which then copy themselves.
3. We apparently needlessly call `resolve_expression` *twice* for each
bulk update. On the first invocation we call it from [this
method](https://github.com/django/django/blob/3259983f569151232d8e3b0c3d0de3a858c2b265/django/db/models/sql/subqueries.py#L105-L109),
which replaces the Q objects with WhereNodes. We then resolve *again*
during execution from [this with
statement](https://github.com/django/django/blob/master/django/db/models/query.py#L769-L770).
I'll keep investigating, but after naively disabling the first resolution
locally everything worked as expected, which implies that for the happy
path the first one might be wasted work.
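The copy amplification in points 2 and 3 can be illustrated with a toy model (these are deliberately not Django's classes, just a sketch of the resolution pattern):

```python
import copy

COPIES = 0

class Node:
    """Toy expression node: resolving copies the node and all its children."""
    def __init__(self, children=()):
        self.children = list(children)

    def resolve_expression(self):
        global COPIES
        COPIES += 1
        resolved = copy.copy(self)  # every resolution allocates a fresh copy
        resolved.children = [c.resolve_expression() for c in self.children]
        return resolved

# A Case-like root with 1000 When-like branches, each holding a condition
# node and a value node: 1 + 1000 * 3 nodes in total.
case = Node([Node([Node(), Node()]) for _ in range(1000)])

case.resolve_expression()
first = COPIES            # 3001 copies for one resolution of the tree
case.resolve_expression()
second = COPIES - first   # the second resolution costs another 3001
```

One resolution already allocates a copy per node; a second, apparently redundant resolution pass doubles that, which lines up with the allocation-heavy profile.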
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:5>
* cc: Florian Demmer (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:6>
Comment (by Hannes Ljungberg):
One suggestion is to use the `UPDATE .. FROM` syntax along with `VALUES` on
backends which support it. For example:
{{{
UPDATE mytable
SET
mytext = myvalues.mytext,
myint = myvalues.myint
FROM (
VALUES
(1, 'textA', 99),
    (2, 'textB', 88)
) AS myvalues (pk, mytext, myint)
WHERE mytable.pk = myvalues.pk
}}}
I know PostgreSQL supports this syntax but I haven't looked into the
others.
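As a sketch of how such a statement could be parameterized (hypothetical helper name, PostgreSQL `%s` placeholder style; real use would also need explicit casts on the first `VALUES` row so the column types can be inferred, omitted here for brevity):

```python
def build_update_from_values(table, pk_column, columns, rows):
    """rows: list of tuples (pk, col1, col2, ...) matching `columns`."""
    assert rows and all(len(r) == len(columns) + 1 for r in rows)
    # SET "col" = myvalues."col" for every updated column
    set_sql = ", ".join('"{c}" = myvalues."{c}"'.format(c=c) for c in columns)
    # one placeholder group per row: (%s, %s, %s)
    row_ph = "({})".format(", ".join(["%s"] * (len(columns) + 1)))
    values_sql = ", ".join([row_ph] * len(rows))
    alias_cols = ", ".join(
        '"{}"'.format(c) for c in (pk_column,) + tuple(columns)
    )
    sql = (
        'UPDATE "{t}" SET {set} FROM (VALUES {vals}) '
        'AS myvalues ({alias}) WHERE "{t}"."{pk}" = myvalues."{pk}"'
    ).format(t=table, set=set_sql, vals=values_sql,
             alias=alias_cols, pk=pk_column)
    params = [p for row in rows for p in row]
    return sql, params


sql, params = build_update_from_values(
    "mytable", "pk", ["mytext", "myint"],
    [(1, "textA", 99), (2, "textB", 88)],
)
```

Passing the values as parameters rather than interpolating them keeps quoting and escaping in the driver, while the statement text stays small and cacheable.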
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:7>
Comment (by Chris Jerdonek):
Here's another approach I thought would be worth mentioning. PostgreSQL
supports `COPY FROM` (https://www.postgresql.org/docs/13/sql-copy.html)
for copying data from a file (or stream) to a table. For `bulk_update()`,
then, you could use `COPY FROM` to copy the data to a temp table, and then
do the update from that. I wonder if that would be even faster than
`django-bulk-update`'s approach, in part because little SQL would be
needed.
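A rough sketch of the data-preparation half of that idea (hypothetical helper, not Django API; the `COPY`/`UPDATE` round-trips are shown only as comments so the snippet stays self-contained):

```python
import csv
import io

def build_copy_payload(rows):
    """Serialize rows to a CSV stream suitable for COPY ... FROM STDIN (FORMAT csv)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow(row)
    buf.seek(0)  # rewind so the driver can read from the start
    return buf

# With psycopg2 the flow would roughly be:
#   CREATE TEMP TABLE tmp (pk int, mytext text, myint int) ON COMMIT DROP;
#   cursor.copy_expert("COPY tmp FROM STDIN (FORMAT csv)", payload)
#   UPDATE mytable SET mytext = tmp.mytext, myint = tmp.myint
#   FROM tmp WHERE mytable.pk = tmp.pk;

payload = build_copy_payload([(1, "textA", 99), (2, "textB", 88)])
```

The heavy lifting is then a binary-ish stream plus two short fixed statements, rather than an UPDATE whose text grows with the number of rows.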
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:8>
Comment (by jerch):
I have started to implement an UPDATE FROM VALUES variant in
https://github.com/netzkolchose/django-computedfields/blob/master/computedfields/fast_update.py,
as I needed faster updates in the lib.
First, the speedup is huge - it performs 10 to 25 times better than the
current bulk_update implementation.
But of course there are several issues with this:
- only supported by newer DBMS versions (SQLite 3.33+, MariaDB 10.3+,
MySQL 8; Oracle unclear)
- not official ANSI SQL, thus the backends need their very own SQL
templates
- MariaDB and MySQL 8 have different templates, which is not covered by
Django's backend distinction
- not easy to fit into the ORM's SQL compiler templates, thus I went with
string formatting for now
- F expressions cannot be used in VALUES at all
For Postgres, imho, COPY FROM would be even faster, but I did not do it
due to the complexity needed for totally different code paths for just
Postgres.
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:9>
Comment (by jerch):
Some early tests with COPY FROM indicate that it even outperforms UPDATE
FROM VALUES by far (at least 4 times faster). But the code is even more
degenerate and violates Django paradigms in almost every aspect. This is
probably nothing for Django itself, but maybe a third-party lib that
people can use if they are aware of the basic restrictions.
What I find bothersome with COPY FROM:
- needs dealing with temporary tables, with columns partially bootstrapped
from the target tables
- might need its own cast/escape rules for more complex field types (have
not yet looked at psycopg3's copy cursor mechanics)
- might not work for all field types / payloads (this most likely depends
a lot on the transport format/delimiters used)
- the impact of an index on the pk field of the temp table is uncertain
(whether the index creation cost outweighs the index gain on the final
update)
Yeah well, this needs a lot more investigation before it will be usable
with a nice interface...
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:10>
Comment (by jerch):
Some update on COPY FROM:
Did a first playground implementation just to see the effects on
performance, see
https://gist.github.com/jerch/fd0fae0107ce7b153b7540111b2e89ab. The chart
over there shows the mean runtime of 100 runs of x updated records in a
100k-row table, updating just one integer field per record. The perf tests
were done with the plain Postgres 14 docker image, with no tweaking of any
settings.
The implementation `copy_update` is not yet optimized for perf or neatly
integrated; it is just to get an idea where the ballpark for COPY FROM
would end up. `bulk_update` is Django's current implementation,
`django-bulk-update` is from here:
https://github.com/aykut/django-bulk-update, and `fast_update` is my early
impl of direct UPDATE FROM VALUES from above.
Some observations from that:
- `bulk_update` shows much worse runtime behavior than all the others
(thus accounting is stopped early)
- `django-bulk-update` can keep up a bit longer, but then shows the same
poor runtime growth (probably due to the same SQL logic being used?). This
gets really bad for updates >5k (not shown).
- `copy_update` has much higher setup costs (1 to 256 updated records)
- between 4096 and 8192 updates `copy_update` starts to pay off compared
to `fast_update`; at 32k updates it is almost twice as fast
- not shown in the charts: for higher update counts it keeps gaining
ground (almost being 4 times faster for 1M updated records)
- there is something going on between 256-512 updates, as almost all
implementations show a steep jump there (Postgres b-tree depth change? did
not investigate that further...)
Some early insights from that:
- As already stated in an earlier comment, Django's `bulk_update` is
currently pretty wasteful, as it drops far behind even `django-bulk-
update`, which uses the same SQL update strategy.
- `fast_update` is the winner for small to medium update counts, up to
~10k.
- `copy_update` starts to shine for update counts >10k.
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:11>
Comment (by jerch):
Made some progress in https://github.com/netzkolchose/django-fast-update,
which contains `fast_update` for Postgres, SQLite, MariaDB and MySQL 8, as
well as `copy_update` for Postgres. The package is currently alpha (though
quite complete in CI tests); I would be glad to get some feedback and some
real-world tests.
Some notes about the implementations - I kinda gave up on closer
integration of `fast_update` with ORM internals; this needs someone with
deeper knowledge and time to actually do it. `copy_update` is more a PoC
with additional constraints (e.g. the values' transport repr relies a lot
on correct `__str__` output); imho this could be realized more
straightforwardly with psycopg3's superior COPY support, once Django
moves there.
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:12>
* cc: John Speno (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:13>
Comment (by Taylor H):
Hey everyone, just wanted to mention that I created the following library
for very fast model updates and inserts in Postgres. We have been using it
in production for over 2 years.
https://github.com/cedar-team/django-bulk-load
More details about the library:
https://decode.cedar.com/fast-django-model-inserts-with-postgres/
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:14>
* cc: Akash Kumar Sen (added)
--
Ticket URL: <https://code.djangoproject.com/ticket/31202#comment:15>