I've been working on a bulk_update method for a project to allow saving many instances of a Model in a single call/query, an analog to the existing bulk_create method.
There seems to be some interest in this as evidenced by a library
django-bulk-update (which isn't working for me in tests with python3 and SQLite) and a recent
ticket.
The ticket was closed as wontfix with an invitation for discussion, although the author doesn't seem to have pursued it. However, Shai Berger did comment positively on the ticket and proposed a different API, similar to the one in django-bulk-update and my own prototype:
Book.objects.update_many(books, 'price')
My own attempt was on Django 1.7 using a subclass of ExpressionNode passed as a value into UpdateQuery.add_update_values which allows Django to generate most of the UPDATE SQL, while the expression subclass was only responsible for generating an SQL CASE expression. This quickly generalized into a Case expression class which could be passed as a value into QuerySet.update.
But now that Query Expressions have been merged, I've done a quick port and the Case expression now works with updates, annotations and aggregates, and is used to power my bulk_update method.
A simple performance test shows that this works quite well. I tested the following methods of updating many database rows, using an in memory SQLite database :
# loop_save
for o in objects:
o.save(update_fields=['field'])
# bulk_update
MyModel.objects.bulk_update(objects, update_fields=['field'])
# many_updates
q = Q(condition__lt=test_value)
MyModel.objects.filter(q).update(field=value1)
MyModel.objects.filter(~q).update(field=value2)
# case_update
q = Q(condition__lt=test_value)
MyModel.objects.update(field=Case([(q, value1)], default=value2,
output_field=MyModel._meta.get_field('field')))
With 10000 objects in the database, and updating all of them I got these run times using clock():
loop_save: 3.022845626653664
bulk_update: 0.1785595393377175
many_updates: 0.009768530320993563
case_update: 0.009343744353718986
We can see that bulk_update outperforms looping by an order of magnitude, despite working by generating a CASE expression with a WHEN clause for every object conditioned on its pk.
Using Case in an update has no significant performance improvement, over running multiple queries. It may arguably allow for more readable code when using many conditions since the semantics of case are that conditions are evaluated in order, like a Python if ... elif ... else. Compare this:
MyModel.objects.filter(Q(condition__lt=test_value1)).update(field=value1)
MyModel.objects.filter(Q(condition__gte=test_value1, condition__lt=test_value2)).update(field=value2)
MyModel.objects.filter(Q(condition__gte=test_value2, condition__lt=test_value3)).update(field=value3)
MyModel.objects.filter(Q(condition__gte=test_value3, condition__lt=test_value4)).update(field=value4)
MyModel.objects.filter(Q(condition__gte=test_value4)).update(field=value5)
as opposed to:
q = Q(condition__lt=test_value)
MyModel.objects.update(field=Case([(Q(condition__lt=test_value1), value1),
(Q(condition__lt=test_value2), value2),
(Q(condition__lt=test_value3), value3),
(Q(condition__lt=test_value4), value4)], default=value5, output_field=MyModel._meta.get_field('field')))
With only 10 objects in the database, and updating all of them the times are:
loop_save: 0.004131221428270882
bulk_update: 0.0006249582329893796
many_updates: 0.0006562061446388481
case_update: 0.00036259952923938313
So the bulk_update performance gains are still there, and update using a Case expression if faster because the SQL generation time dominates in this example, and only one query needs to be generated and executed.
Case expressions can also be used in other ways (although these examples use a subclass SimpleCase, which is described in a little more detail below, for simpler syntax):
# annotation
MyModel.objects.annotate(status_text=SimpleCase('status', [('S', 'Started'), ('R', 'Running'), ('F', 'Finished')],
default='Unknown', output_field=CharField()))
# aggregation
MyModel.objects.aggregate(started=Sum(SimpleCase('status', [('S', 1)], default=0, output_field=IntegerField())),
running=Sum(SimpleCase('status', [('R', 1)], default=0, output_field=IntegerField())),
finished=Sum(SimpleCase('status', [('F', 1)], default=0, output_field=IntegerField())))
In these examples SimpleCase is a subclass of Case that generates the SQL simple CASE expression which is an equality test against a field.
So is this something that would be a worthwhile addition to core?
It could work now as a separate module, but the update example requires setting output_field. My original Django 1.7 code abused the prepare_database_save method that SQLUpdateCompiler called on an ExpressionNode to get the field that the expression would be assigned to automatically. After the merge of Query Expression SQLUpdateCompiler calls either resolve_expression or prepare_database_save, so I can no longer use this. A change in SQLUpdateCompiler would be required with a new API for settings output_field automatically on ExpressionNodes used as update values.
If this were considered for core, I would appreciate input on the API. I currently have:
Case(list_of_case_tuples, default, output_field) - A general CASE expression called a searched case in the SQL spec. list_of_case_tuples is an iterable of tuples of the form (Q_object, value), and default is the value for the ELSE clause. This generates SQL like:
CASE WHEN n > 0 THEN 'positive' WHEN n < 0 THEN 'negative' ELSE 'zero' END
SimpleCase(fieldname, list_of_simple_case_tuples, default, output_field) - A simple case expression in the SQL spec. fieldname is a field identifier like in an F expression, list_of_simple_case_tuples is an iterable of tuples of the form (condition_value, result_value). This generates SQL like:
CASE n WHEN 1 THEN 'one' WHEN 2 THEN 'two' ELSE 'other' END
Model.objects.bulk_update(list_of_instances, update_fields) - An analog to the bulk_create method that save changes in many model instances. list of instances is an iterable of model instances, update_fields is the same as the argument to Model.save of the same name.
Of course I will put up a branch on github if there is interest in this proposal. Thanks for your time and sorry for the wall of text.
- Michael