general interest in faster bulk_update implementation

157 views
Skip to first unread message

Jörg Breitbart

unread,
Apr 24, 2022, 4:19:16 PMApr 24
to django-d...@googlegroups.com
Hi there,

I have recently tried to tackle the known performance issues of
bulk_update. For more context plz see the original thread:
https://code.djangoproject.com/ticket/31202

In another issue we kinda ended up at the same idea again for a
different reason (https://code.djangoproject.com/ticket/33647).

My early attempt for a faster implementation:
https://github.com/netzkolchose/django-fast-update

This impl has several drawbacks/restrictions which I'd like to discuss,
before wasting more time to get things more aligned with the ORM, esp
these aspects:
- missing f-expression support
- only support for recent db engines

@f-expressions
This is not directly possible from a constant table (thats how postgres
calls tables created from VALUES(...)). At this point idk if
f-expressions is an actively used thing with bulk_update at all, or if
support for those would just penalize the update throughput for no good
reason. My impl above currently skips such a workaround. Would like to
hear some more opinions about that.

@db engine support
Are there any strict db support promises made by django releases? I see
that some django versions have notes about certain db engines and
versions, but I cant find a "support grid" or an explicit doc for that.
If django does not officially restrict db versions, then the impl above
would need some sort of a fallback for unsupported db versions.

Plz also tell me, if this all sounds too wild for getting considered
into django (tbh the db version dependencies bug me alot), then it
prolly is better kept in a 3rd party package.

Regards,
jerch

Florian Apolloner

unread,
Apr 26, 2022, 2:59:00 AMApr 26
to Django developers (Contributions to Django itself)
Hi Jörg,

Regarding your question about database version support: https://docs.djangoproject.com/en/4.0/ref/databases/ lists the supported versions. Granted not as support grid and you have to scroll to every database, but it is usually in the first paragraph.

If f-expressions are currently supported your implementation could fall back to a slower implementation if it encounters them?

Cheers,
Florian

Jörg Breitbart

unread,
Apr 26, 2022, 10:05:56 AMApr 26
to django-d...@googlegroups.com
@Florian

Thx for your response.

Looking through the release notes and the listed databases I find these
version requirements:
- PostgreSQL 10+
- MariaDB 10.2+
- MySQL 5.7+
- Oracle 19c+
- SQLite 3.9.0+

Compared to the UPDATE FROM VALUES pattern requirements:
- MariaDB 10.3.3+
- MySQL 8.0.19+
- Oracle currently no impl at all
- SQLite 3.33+

thus only postgres would work out of the box. Question then is, whether
to raise version requirements for django. Imho a good indicator for that
might be the age of a db release, its EOL state, and whether it is still
part of LTS distros:
- MariaDB 10.2: EOL 04/2022
- MariaDB 10.3: released in 04/2018
(Ubuntu 20.04 LTS is on 10.3 line, 18.04 LTS on 10.1)
--> prolly safe to raise to 10.3 line?

- MySQL 5.7: EOL 10/2023
- MySQL 8.0: released in 04/2018
(Ubuntu 20.04 LTS contains 8.0 line, 18.04 LTS on 5.7)
--> 5.7 is still within lifetime for 1.5ys
--> Cut old ropes early here?

- SQLite: 3.22 on ubuntu 18.04, 3.31 on ubuntu 20.04
--> imho 3.33+ cannot be requested here, as upgrading sqlite3
packages is much more of a hassle for peeps

- others non supported (incl. oracle):
Should there be a default fallback within django? Or should db vendors
be bugged for implementing an abstract interface for UPDATE FROM VALUES?

Especially the last 2 points (sqlite and general db vendor compat) are
tricky. Here I think a general fallback within django ORM might be the
only way to not let db version issues surface its way to the user. Not
sure yet, how practical/maintainable that would be in the end, as it
would have to provide 2 internal code paths for the same bulk_update API
endpoint:
- fast one, if UPDATE FROM VALUES pattern is supported
- fallback for backends not supporting the fast update pattern

Thinking the problem from a db vendor perspective, it could look like
this in a db package (just brainstorming atm):
- a flag indicating support for UPDATE FROM VALUES pattern
- the flag result might be active code, if the db driver has
to test support on server side (thats the currently case for mysql)
- to provide an easy upgrade path for db vendors, the flag might be
missing on the db backend at first (hasattr is your friend)
- if supported: implementation of an abstract ORM interface

While writing this down it kinda became clear to me, that for easy
transition of the db backends, a fallback impl in the ORM always would
be needed. Furthermore with that flag scheme in the db backends a strict
version match is not needed at all, as the db backend could always say
"nope, cannot do that" and the fallback would kick in. This fallback
could be the current bulk_update impl.

The downside of such an approach is clearly the needed code complexity,
furthermore the ORM would leave its mostly(?) ISO/ANSI grounds and have
to delegate the real sql creation to vendor specific implementation in
the backends.


@f-expressions
Yes the current .bulk_update implementation inherits the expression
support from .update (kinda passes things along). I am currently not
sure, if that can be mimicked 100% around an optimized implementation by
pre-/post executing updates for those fields, as col/row ordering might
have weird side effects. I'd first need to do some tests how the current
implementation deals with field refs while the ref'ed field itself gets
updated before/after the ref usage. (Not even sure if thats guaranteed
to always do the same thing across db engines)
Pre-/posthandling of f-expressions will slow down the code, as those
most likely have to go into the SET clause of a second update statement.
There might be several faster UNION tricks possible here, but I have not
tested those.

Cheers,
jerch

Mariusz Felisiak

unread,
Apr 26, 2022, 2:56:50 PMApr 26
to Django developers (Contributions to Django itself)
Support for MariaDB 10.2 is already dropped in Django 4.1. We will drop support for MySQL 5.7 in Django 4.2 or 5.0 (probably 4.2).


Best,
Mariusz

Jörg Breitbart

unread,
Apr 27, 2022, 5:15:22 AMApr 27
to django-d...@googlegroups.com
@Mariusz

Thx for the update on the db support roadmap, helps alot to get things
into perspective.


@expression support
After a few more tests it turns out, that expression support prolly can
be saved exactly as it is by a combination of the current CASE chaining
with the VALUES table. At least sqlite and postgres support the
following format (mysql variants not tested yet):

UPDATE table SET f1 = CASE pk=1 THEN f2 ELSE foo.column2 END FROM
(VALUES (1, NULL), (2, 'Hello')) AS foo where table.pk = foo.column1;

This would be the result from something like
bulk_update([
Model(pk=1, f1=F('f2')),
Model(pk=..., f1=<literal value>),
...], ['f1'])

This again introduces the bad runtime of the CASE chains into the
statement, but only for objects using expression fields, so the user has
limited control over the runtime needs (dont use expressions if you want
very fast value pumping...)

Advantage of this - it should do exactly the same updates as the
bulk_update implementation (no weird ordering side effects), thus it
gets alot easier to prove, that the fast and fallback code path exhibit
the same update behavior.


Cheers,
jerch

Jörg Breitbart

unread,
Apr 28, 2022, 7:22:33 AMApr 28
to django-d...@googlegroups.com
Some findings/ideas on possible expression field support:

The SQL standard requires, that within a single UPDATE command the
column ordering does not matter. This effectively means, that the RHS of
the SET clause operates on old record values, while LHS gets the new
values, which dont manifest before the whole command returned.*

This allows a more aggressive transformation of updating fields with
expression values - those field updates could be pulled upfront and
executed separately, before applying changes from literal values in the
VALUES table. But since the expression values might be randomly
scattered across all to-be-updated fields, slicing into
expression/literal value fields at individual objects would lead to tons
of VALUES tables with different dimensions + extra field ref updates
(again creating high query load).

Suggestion:
A more straight forward way excludes objects with expression values from
the VALUES table handling, and updates them upfront with the fallback
path. This should still be an equivalent transformation in set theory,
given that expressions cannot directly ref a different row. Is that
always the case? (Otherwise we have a bigger problem with row order
dependency during an update...)
Compared to the individual object/field splitting, this way has several
advantages:
- easier to comprehend (at least for me)
- avoids building CASE chains in the VALUES table update command (keeps
the fast path fast)
- much easier to test (as branching happens in two main blocks, and not
deep at single object-update level creating a deep control-flow
branching hell)
- automatically deals with the MySQL update issue the same way as the
current bulk_update implementation


While I think that this should produce the same update results as the
current bulk_update implementation, I may have overlooked crucial
details, that need further to be addressed. Esp. around multi-table
inheritance the correct field update ordering is not yet clear to me -
Is this always done as second update after the local fields? How does
the ascent into parent fields work here? Is this python-mro ordered?


Cheers,
jerch


[*] MySQL/MariaDB have a non-standard UPDATE implementation in this
regard, as can be tested in this fiddle:
https://dbfiddle.uk/?rdbms=mariadb_10.3&fiddle=5e596bc7d256ad85a5fd4718acd46496
(works for all other db engines...)
Should this be noted in the .update docs?

Jörg Breitbart

unread,
Apr 29, 2022, 3:35:41 AMApr 29
to django-d...@googlegroups.com
Have found workarounds for older db engines, which makes the more
demanding version requirements from above obsolete. Db support with
these workaround would be:

- SQLite 3.15+ (should work with Python 3.7+ installer, Ubuntu 18 LTS)
- MySQL 5.7+ (older versions should work too, not tested)

The workarounds construct the literal values tables from multiple
SELECTs + UNION ALL, which is perfwise slightly worse for sqlite (~40%
slower), but on par for mysql (well, mysql runs much earlier into stack
issues than with TVC, but this can be configured by the user).

Downside - this creates 2 more code paths for 2 db engine versions, that
would need to be tested with the test battery. The nuisance can be
removed by a later release, once db version support is dropped for other
reasons.

I also found possible fast update pattern for:
- oracle 19c (prolly older as well, UNION ALL + correlated update)
- oracle 21c (UNION ALL + join update)
- SQL Server 2014+ (FROM VALUES pattern)

but this needs anyone else to test and integrate, since I have no
development environments for those. So whether they can gain significant
performance remains uncertain until actually adopted.

Jörg Breitbart

unread,
Apr 30, 2022, 3:17:01 PMApr 30
to django-d...@googlegroups.com
Released the second version of fast_update
(https://pypi.org/project/django-fast-update/), based on some findings
above, e.g. it now should work with all recent db engine versions
supported by django (despite oracle).

Would be happy to get some tests/feedback, before moving things closer
to django itself.

Cheers,
Jerch


Am 29.04.22 um 09:34 schrieb Jörg Breitbart:
Reply all
Reply to author
Forward
0 new messages