general interest in faster bulk_update implementation

244 views
Skip to first unread message

Jörg Breitbart

unread,
Apr 24, 2022, 4:19:16 PM4/24/22
to django-d...@googlegroups.com
Hi there,

I have recently tried to tackle the known performance issues of
bulk_update. For more context plz see the original thread:
https://code.djangoproject.com/ticket/31202

In another issue we kinda ended up at the same idea again for a
different reason (https://code.djangoproject.com/ticket/33647).

My early attempt for a faster implementation:
https://github.com/netzkolchose/django-fast-update

This impl has several drawbacks/restrictions which I'd like to discuss,
before wasting more time to get things more aligned with the ORM, esp
these aspects:
- missing f-expression support
- only support for recent db engines

@f-expressions
This is not directly possible from a constant table (thats how postgres
calls tables created from VALUES(...)). At this point idk if
f-expressions is an actively used thing with bulk_update at all, or if
support for those would just penalize the update throughput for no good
reason. My impl above currently skips such a workaround. Would like to
hear some more opinions about that.

@db engine support
Are there any strict db support promises made by django releases? I see
that some django versions have notes about certain db engines and
versions, but I cant find a "support grid" or an explicit doc for that.
If django does not officially restrict db versions, then the impl above
would need some sort of a fallback for unsupported db versions.

Plz also tell me, if this all sounds too wild for getting considered
into django (tbh the db version dependencies bug me alot), then it
prolly is better kept in a 3rd party package.

Regards,
jerch

Florian Apolloner

unread,
Apr 26, 2022, 2:59:00 AM4/26/22
to Django developers (Contributions to Django itself)
Hi Jörg,

Regarding your question about database version support: https://docs.djangoproject.com/en/4.0/ref/databases/ lists the supported versions. Granted not as support grid and you have to scroll to every database, but it is usually in the first paragraph.

If f-expressions are currently supported your implementation could fall back to a slower implementation if it encounters them?

Cheers,
Florian

Jörg Breitbart

unread,
Apr 26, 2022, 10:05:56 AM4/26/22
to django-d...@googlegroups.com
@Florian

Thx for your response.

Looking through the release notes and the listed databases I find these
version requirements:
- PostgreSQL 10+
- MariaDB 10.2+
- MySQL 5.7+
- Oracle 19c+
- SQLite 3.9.0+

Compared to the UPDATE FROM VALUES pattern requirements:
- MariaDB 10.3.3+
- MySQL 8.0.19+
- Oracle currently no impl at all
- SQLite 3.33+

thus only postgres would work out of the box. Question then is, whether
to raise version requirements for django. Imho a good indicator for that
might be the age of a db release, its EOL state, and whether it is still
part of LTS distros:
- MariaDB 10.2: EOL 04/2022
- MariaDB 10.3: released in 04/2018
(Ubuntu 20.04 LTS is on 10.3 line, 18.04 LTS on 10.1)
--> prolly safe to raise to 10.3 line?

- MySQL 5.7: EOL 10/2023
- MySQL 8.0: released in 04/2018
(Ubuntu 20.04 LTS contains 8.0 line, 18.04 LTS on 5.7)
--> 5.7 is still within lifetime for 1.5ys
--> Cut old ropes early here?

- SQLite: 3.22 on ubuntu 18.04, 3.31 on ubuntu 20.04
--> imho 3.33+ cannot be requested here, as upgrading sqlite3
packages is much more of a hassle for peeps

- others non supported (incl. oracle):
Should there be a default fallback within django? Or should db vendors
be bugged for implementing an abstract interface for UPDATE FROM VALUES?

Especially the last 2 points (sqlite and general db vendor compat) are
tricky. Here I think a general fallback within django ORM might be the
only way to not let db version issues surface its way to the user. Not
sure yet, how practical/maintainable that would be in the end, as it
would have to provide 2 internal code paths for the same bulk_update API
endpoint:
- fast one, if UPDATE FROM VALUES pattern is supported
- fallback for backends not supporting the fast update pattern

Thinking the problem from a db vendor perspective, it could look like
this in a db package (just brainstorming atm):
- a flag indicating support for UPDATE FROM VALUES pattern
- the flag result might be active code, if the db driver has
to test support on server side (thats the currently case for mysql)
- to provide an easy upgrade path for db vendors, the flag might be
missing on the db backend at first (hasattr is your friend)
- if supported: implementation of an abstract ORM interface

While writing this down it kinda became clear to me, that for easy
transition of the db backends, a fallback impl in the ORM always would
be needed. Furthermore with that flag scheme in the db backends a strict
version match is not needed at all, as the db backend could always say
"nope, cannot do that" and the fallback would kick in. This fallback
could be the current bulk_update impl.

The downside of such an approach is clearly the needed code complexity,
furthermore the ORM would leave its mostly(?) ISO/ANSI grounds and have
to delegate the real sql creation to vendor specific implementation in
the backends.


@f-expressions
Yes the current .bulk_update implementation inherits the expression
support from .update (kinda passes things along). I am currently not
sure, if that can be mimicked 100% around an optimized implementation by
pre-/post executing updates for those fields, as col/row ordering might
have weird side effects. I'd first need to do some tests how the current
implementation deals with field refs while the ref'ed field itself gets
updated before/after the ref usage. (Not even sure if thats guaranteed
to always do the same thing across db engines)
Pre-/posthandling of f-expressions will slow down the code, as those
most likely have to go into the SET clause of a second update statement.
There might be several faster UNION tricks possible here, but I have not
tested those.

Cheers,
jerch

Mariusz Felisiak

unread,
Apr 26, 2022, 2:56:50 PM4/26/22
to Django developers (Contributions to Django itself)
Support for MariaDB 10.2 is already dropped in Django 4.1. We will drop support for MySQL 5.7 in Django 4.2 or 5.0 (probably 4.2).


Best,
Mariusz

Jörg Breitbart

unread,
Apr 27, 2022, 5:15:22 AM4/27/22
to django-d...@googlegroups.com
@Mariusz

Thx for the update on the db support roadmap, helps alot to get things
into perspective.


@expression support
After a few more tests it turns out, that expression support prolly can
be saved exactly as it is by a combination of the current CASE chaining
with the VALUES table. At least sqlite and postgres support the
following format (mysql variants not tested yet):

UPDATE table SET f1 = CASE pk=1 THEN f2 ELSE foo.column2 END FROM
(VALUES (1, NULL), (2, 'Hello')) AS foo where table.pk = foo.column1;

This would be the result from something like
bulk_update([
Model(pk=1, f1=F('f2')),
Model(pk=..., f1=<literal value>),
...], ['f1'])

This again introduces the bad runtime of the CASE chains into the
statement, but only for objects using expression fields, so the user has
limited control over the runtime needs (dont use expressions if you want
very fast value pumping...)

Advantage of this - it should do exactly the same updates as the
bulk_update implementation (no weird ordering side effects), thus it
gets alot easier to prove, that the fast and fallback code path exhibit
the same update behavior.


Cheers,
jerch

Jörg Breitbart

unread,
Apr 28, 2022, 7:22:33 AM4/28/22
to django-d...@googlegroups.com
Some findings/ideas on possible expression field support:

The SQL standard requires, that within a single UPDATE command the
column ordering does not matter. This effectively means, that the RHS of
the SET clause operates on old record values, while LHS gets the new
values, which dont manifest before the whole command returned.*

This allows a more aggressive transformation of updating fields with
expression values - those field updates could be pulled upfront and
executed separately, before applying changes from literal values in the
VALUES table. But since the expression values might be randomly
scattered across all to-be-updated fields, slicing into
expression/literal value fields at individual objects would lead to tons
of VALUES tables with different dimensions + extra field ref updates
(again creating high query load).

Suggestion:
A more straight forward way excludes objects with expression values from
the VALUES table handling, and updates them upfront with the fallback
path. This should still be an equivalent transformation in set theory,
given that expressions cannot directly ref a different row. Is that
always the case? (Otherwise we have a bigger problem with row order
dependency during an update...)
Compared to the individual object/field splitting, this way has several
advantages:
- easier to comprehend (at least for me)
- avoids building CASE chains in the VALUES table update command (keeps
the fast path fast)
- much easier to test (as branching happens in two main blocks, and not
deep at single object-update level creating a deep control-flow
branching hell)
- automatically deals with the MySQL update issue the same way as the
current bulk_update implementation


While I think that this should produce the same update results as the
current bulk_update implementation, I may have overlooked crucial
details, that need further to be addressed. Esp. around multi-table
inheritance the correct field update ordering is not yet clear to me -
Is this always done as second update after the local fields? How does
the ascent into parent fields work here? Is this python-mro ordered?


Cheers,
jerch


[*] MySQL/MariaDB have a non-standard UPDATE implementation in this
regard, as can be tested in this fiddle:
https://dbfiddle.uk/?rdbms=mariadb_10.3&fiddle=5e596bc7d256ad85a5fd4718acd46496
(works for all other db engines...)
Should this be noted in the .update docs?

Jörg Breitbart

unread,
Apr 29, 2022, 3:35:41 AM4/29/22
to django-d...@googlegroups.com
Have found workarounds for older db engines, which makes the more
demanding version requirements from above obsolete. Db support with
these workaround would be:

- SQLite 3.15+ (should work with Python 3.7+ installer, Ubuntu 18 LTS)
- MySQL 5.7+ (older versions should work too, not tested)

The workarounds construct the literal values tables from multiple
SELECTs + UNION ALL, which is perfwise slightly worse for sqlite (~40%
slower), but on par for mysql (well, mysql runs much earlier into stack
issues than with TVC, but this can be configured by the user).

Downside - this creates 2 more code paths for 2 db engine versions, that
would need to be tested with the test battery. The nuisance can be
removed by a later release, once db version support is dropped for other
reasons.

I also found possible fast update pattern for:
- oracle 19c (prolly older as well, UNION ALL + correlated update)
- oracle 21c (UNION ALL + join update)
- SQL Server 2014+ (FROM VALUES pattern)

but this needs anyone else to test and integrate, since I have no
development environments for those. So whether they can gain significant
performance remains uncertain until actually adopted.

Jörg Breitbart

unread,
Apr 30, 2022, 3:17:01 PM4/30/22
to django-d...@googlegroups.com
Released the second version of fast_update
(https://pypi.org/project/django-fast-update/), based on some findings
above, e.g. it now should work with all recent db engine versions
supported by django (despite oracle).

Would be happy to get some tests/feedback, before moving things closer
to django itself.

Cheers,
Jerch


Am 29.04.22 um 09:34 schrieb Jörg Breitbart:

jobe...@gmail.com

unread,
Oct 18, 2022, 10:05:48 AM10/18/22
to Django developers (Contributions to Django itself)
Jerch,

I love that you're improving the `bulk_update` performance with your package. I am definitely looking to adopt it. I can start working on it pretty quickly, so if you need testing input (Django 3.2, Postgres) I can offer feedback from what I find.

Can you tell me more about this statement:
Note copy_update will probably never leave the alpha/PoC-state, as psycopg3 brings great COPY support, which does a more secure value conversion and has a very fast C-version.

Where can I learn more about that COPY statement, and how/where that statement might be integrated with the Django ORM?

Jörg Breitbart

unread,
Oct 18, 2022, 11:07:51 AM10/18/22
to django-d...@googlegroups.com
> pretty quickly, so if you need testing input (Django 3.2, Postgres) I
> can offer feedback from what I find.

Yes testing would be awesome, esp. for edge cases (test coverage for
default cases is pretty complete for `fast_update` I think).

> Can you tell me more about this statement:
> > *Note* copy_update will probably never leave the alpha/PoC-state, as
> psycopg3 brings great COPY support, which does a more secure value
> conversion and has a very fast C-version.

Well I created the `copy_update` alternative for postgres just to see
the advantage of COPY FROM over UPDATE FROM VALUES. The impl uses
psycopg2's COPY interface, which got heavily revamped in psycopg3,
including proper value adapters written C. V2 does not have this yet,
therefore I had to create the value encoders in python, which are less
strict about values and still ~3 times slower than the C adapters in
psycopg3.
The message above is meant as a warning, that I dont plan to put too
much effort into polishing this soon outdated implementation.

> Where can I learn more about that COPY statement, and how/where that
> statement might be integrated with the Django ORM?

Plz check the postgres docs
(https://www.postgresql.org/docs/current/sql-copy.html), it covers all
important low level details. Furthermore check psycopg3 docs (and also
psycopg2 docs, if you want to get your hands on the `copy__update` impl).
I dont think that driving `bulk_update` by COPY FROM for postgres is a
good idea, there are quite some semantic differences, also it is slower
for tiny changesets than UPDATE FROM VALUES, thought it starts to shine
for changesets >1000 (up to ~4 times faster for a 1M changeset compared
to `fast_update` in my tests). Maybe it can be added to the postgres
subpackage, if there is demand for it.

Feel free to create issues or to comment on open ones. Important pending
issues are:
- proper duplicate check
(https://github.com/netzkolchose/django-fast-update/issues/13)
- good story whether to integrate support for f-expressions back or to
keep them out (currently unsupported, as the steps to get this working
are very cumbersome)

Cheers,
Jerch

jobe...@gmail.com

unread,
Oct 18, 2022, 12:07:07 PM10/18/22
to Django developers (Contributions to Django itself)
Will there *not* be a Django ORM implementation of psycopg3 COPY FROM when that lands? And, I guess I'll need to figure out when that lands/would land.

Jörg Breitbart

unread,
Oct 18, 2022, 12:34:31 PM10/18/22
to django-d...@googlegroups.com
> Will there *not* be a Django ORM implementation of psycopg3 COPY FROM
> when that lands? And, I guess I'll need to figure out when that
> lands/would land.
Not automagically, since the ORM does not use COPY internally. The main
ORM parts are pretty generic, thus mostly based on ANSI compliant SQL
statements, which is good to keep it db vendor independent.

COPY is a postgres invention to move plain constant table data as fast
as possible skipping most higher DBMS functionality (in/out or between
tables, many postgres backup tools use it).
Reply all
Reply to author
Forward
0 new messages