A case for CASE expressions and bulk_update

866 views
Skip to first unread message

michal.mo...@gmail.com

unread,
Nov 17, 2014, 7:54:26 AM11/17/14
to django-d...@googlegroups.com
I've been working on a bulk_update method for a project to allow saving many instances of a Model in a single call/query, an analog to the existing bulk_create method.
There seems to be some interest in this as evidenced by a library django-bulk-update (which isn't working for me in tests with python3 and SQLite) and a recent ticket.

The ticket was closed as wontfix with an invitation for discussion, although the author doesn't seem to have pursued it. However, Shai Berger did comment positively on the ticket and proposed a different API, similar to the one in django-bulk-update and my own prototype:

Book.objects.update_many(books, 'price')

My own attempt was on Django 1.7 using a subclass of ExpressionNode passed as a value into UpdateQuery.add_update_values which allows Django to generate most of the UPDATE SQL, while the expression subclass was only responsible for generating an SQL CASE expression. This quickly generalized into a Case expression class which could be passed as a value into QuerySet.update.

But now that Query Expressions have been merged, I've done a quick port and the Case expression now works with updates, annotations and aggregates, and is used to power my bulk_update method.

A simple performance test shows that this works quite well. I tested the following methods of updating many database rows, using an in memory SQLite database :

# loop_save
for o in objects:
    o.save(update_fields=['field'])

# bulk_update
MyModel.objects.bulk_update(objects, update_fields=['field'])

# many_updates
q = Q(condition__lt=test_value)
MyModel.objects.filter(q).update(field=value1)
MyModel.objects.filter(~q).update(field=value2)

# case_update
q = Q(condition__lt=test_value)
MyModel.objects.update(field=Case([(q, value1)], default=value2,
                                  output_field=MyModel._meta.get_field('field')))

With 10000 objects in the database, and updating all of them I got these run times using clock():

loop_save:    3.022845626653664
bulk_update:  0.1785595393377175
many_updates: 0.009768530320993563
case_update:  0.009343744353718986

We can see that bulk_update outperforms looping by an order of magnitude, despite  working by generating a CASE expression with a WHEN clause for every object conditioned on its pk.
Using Case in an update has no significant performance improvement, over running multiple queries. It may arguably allow for more readable code when using many conditions since the semantics of case are that conditions are evaluated in order, like a Python if ... elif ... else. Compare this:

MyModel.objects.filter(Q(condition__lt=test_value1)).update(field=value1)
MyModel.objects.filter(Q(condition__gte=test_value1, condition__lt=test_value2)).update(field=value2)
MyModel.objects.filter(Q(condition__gte=test_value2, condition__lt=test_value3)).update(field=value3)
MyModel.objects.filter(Q(condition__gte=test_value3, condition__lt=test_value4)).update(field=value4)
MyModel.objects.filter(Q(condition__gte=test_value4)).update(field=value5)

as opposed to:

q = Q(condition__lt=test_value)
MyModel.objects.update(field=Case([(Q(condition__lt=test_value1), value1),
                                   (Q(condition__lt=test_value2), value2),
                                   (Q(condition__lt=test_value3), value3),
                                   (Q(condition__lt=test_value4), value4)], default=value5,
                                  output_field=MyModel._meta.get_field('field')))

With only 10 objects in the database, and updating all of them the times are:

loop_save:    0.004131221428270882
bulk_update:  0.0006249582329893796
many_updates: 0.0006562061446388481
case_update:  0.00036259952923938313

So the bulk_update performance gains are still there, and update using a Case expression if faster because the SQL generation time dominates in this example, and only one query needs to be generated and executed.

Case expressions can also be used in other ways (although these examples use a subclass SimpleCase, which is described in a little more detail below, for simpler syntax):

# annotation
MyModel.objects.annotate(status_text=SimpleCase('status', [('S', 'Started'), ('R', 'Running'), ('F', 'Finished')],
                                                default='Unknown', output_field=CharField()))

# aggregation
MyModel.objects.aggregate(started=Sum(SimpleCase('status', [('S', 1)], default=0, output_field=IntegerField())),
                          running=Sum(SimpleCase('status', [('R', 1)], default=0, output_field=IntegerField())),
                          finished=Sum(SimpleCase('status', [('F', 1)], default=0, output_field=IntegerField())))

In these examples SimpleCase is a subclass of Case that generates the SQL simple CASE expression which is an equality test against a field.

So is this something that would be a worthwhile addition to core?
It could work now as a separate module, but the update example requires setting output_field. My original Django 1.7 code abused the prepare_database_save method that SQLUpdateCompiler called on an ExpressionNode to get the field that the expression would be assigned to automatically. After the merge of Query Expression SQLUpdateCompiler calls either resolve_expression or prepare_database_save, so I can no longer use this. A change in SQLUpdateCompiler would be required with a new API for settings output_field automatically on ExpressionNodes used as update values.

If this were considered for core, I would appreciate input on the API. I currently have:

Case(list_of_case_tuples, default, output_field) - A general CASE expression called a searched case in the SQL spec. list_of_case_tuples is an iterable of tuples of the form (Q_object, value), and default is the value for the ELSE clause. This generates SQL like:

CASE WHEN n > 0 THEN 'positive' WHEN n < 0 THEN 'negative' ELSE 'zero' END

SimpleCase(fieldname, list_of_simple_case_tuples, default, output_field) - A simple case expression in the SQL spec. fieldname is a field identifier like in an F expression, list_of_simple_case_tuples is an iterable of tuples of the form (condition_value, result_value). This generates SQL like:

CASE n WHEN 1 THEN 'one' WHEN 2 THEN 'two' ELSE 'other' END

Model.objects.bulk_update(list_of_instances, update_fields) - An analog to the bulk_create method that save changes in many model instances. list of instances is an iterable of model instances, update_fields is the same as the argument to Model.save of the same name.

Of course I will put up a branch on github if there is interest in this proposal. Thanks for your time and sorry for the wall of text.

- Michael

Josh Smeaton

unread,
Nov 17, 2014, 6:02:08 PM11/17/14
to django-d...@googlegroups.com
Hi Michael,

Great to see new expressions being used already! Speaking specifically about the Case structure, I think it looks nice. I was planning on writing a few functions (https://code.djangoproject.com/ticket/23753) which included Case/When of some description.

My idea of a Case API was:

Model.objects.annotate(field=If( Q(), TrueExpression(), FalseExpression(), output_field=Field ))

Where TrueExpression and FalseExpression could be any expression, like Value(1) or another If(). I think I prefer your list of 2-tuples with a default though, as it looks cleaner, and should be easier for people to write.

You could maybe avoid asking the user to set an output_field by doing it yourself in the `resolve_expression`. There's also probably an issue where we should validate that all the parts of the case expression will resolve to the same type. For an update, it's easier - since you know the type that should be returned. Using it in an annotate clause gives no such information. I'm not sure that we can distinguish between an update or annotate call to be able to do the extra work on behalf of the user though.

Case expressions, in my opinion, are wanted in core. It solves the conditional aggregates problem, conditional select, and now you've shown conditional updates. You also get around introducing another queryset method (bulk_update) which I think would be appreciated.

Nice work!

Anssi Kääriäinen

unread,
Nov 18, 2014, 1:43:35 AM11/18/14
to django-d...@googlegroups.com
I'd like to go for bulk_merge(objects, force_update, force_insert,
update_fields) method. The basic implementation would call obj.save() in
loop. Optimized implementations could then be added later on.

Bulk update would then be an optimized implementation for force_update
or update_fields cases.

I'm +1 for adding some form of Case statement to core. I'm not sure what
the API should look like.

- Anssi

Michał Modzelewski

unread,
Nov 18, 2014, 4:34:10 PM11/18/14
to django-d...@googlegroups.com
I've put my code up on github at https://github.com/michalmo/django-case-expressions. I've decided not to create a Django branch yet as I'd like to decide on some API issues first.

Josh noted that I could avoid asking the user to set an output_field by doing it automatically in resolve_expression, but I'm not sure how to go about this. In the bulk_update case I can get the field from the model. When a Case is used in an update and SQLUpdateCompiler.as_sql calls resolve_expression I could look for self in query.values, but what if a single instance has been used more than once? I think it would be best if resolve_expression took an extra field parameter which could be used by SQLUpdateCompiler to pass the update field in to ExpressionNodes, and by ExpressionNodes to pass their output_field down to child nodes. Otherwise an extra method for passing in the output_field. Unless, of course, there's something I've overlooked.

bulk_merge could also just split objects into groups with and without a set pk, and call bulk_create and bulk_update with those groups. I think bulk_save would be more inline with existing method names, though, or bulk_update_or_create.

As far as API goes SQLAlchemy uses case(whens, value, else_) where whens is a list of tuples, value is the predicate expression and turns the case into a simple case expressions if set, else_ is the value for the ELSE clause. PeeWee uses case(predicate, expression_tuples, default) where predicate is the predicate expression for a simple case expression, and creates a searched case expression if it is set to None.

I favoured two expression classes so each could be responsible for validating arguments, eq. Case requires conditions to be Q objects.

I also tried to use a dictionary mapping conditions to values, which seemed more readable, but that requires using OrderedDict if the user cared about the order of WHEN clauses in the generated SQL. Since constructing an OrderedDict requires passing in a list of tuples anyway I decided against using a dictionary. I assume this is what motivated the API in SQLAlchemy and PeeWee.

- Michael

Josh Smeaton

unread,
Dec 15, 2014, 11:54:05 PM12/15/14
to django-d...@googlegroups.com
Hi Michal,

I'm about ready to implement Case/When expressions, do you mind if I use your implementation? Also, do you think it's necessary or desirable to have a SimpleCase and a Case?

Josh

Michał Modzelewski

unread,
Dec 16, 2014, 2:23:29 AM12/16/14
to django-d...@googlegroups.com
Sure. I was planning to work on it myself, but wanted to discuss API, passing in output_fields, and casting results first.
I started with 2 classes because I only needed SimpleCase first to implement the bulk_update. Later I added Case and it seemed clearer to me than the SQLAlchemy and PeeWee approach of using an optional predicate argument.

- Michael

Josh Smeaton

unread,
Dec 16, 2014, 8:01:23 PM12/16/14
to django-d...@googlegroups.com
On Tuesday, 16 December 2014 18:23:29 UTC+11, Michał Modzelewski wrote:
Sure. I was planning to work on it myself, but wanted to discuss API, passing in output_fields, and casting results first.
I started with 2 classes because I only needed SimpleCase first to implement the bulk_update. Later I added Case and it seemed clearer to me than the SQLAlchemy and PeeWee approach of using an optional predicate argument.

- Michael

If you're still planning on working on it, then I'm happy to let you do so. The 1.8 alpha freeze is coming up shortly (Jan 12th), and I'm not sure whether new constructs will be allowed in after that point or not. Can someone else clarify? It would be great to have something to merge around this time though, or at least before the Beta.

My work on other functions is being tracked at https://code.djangoproject.com/ticket/23753 and developed here https://github.com/jarshwah/django/tree/ticket_23753. I'll try to have these finished and merged soon (hopefully within a week, pending reviews), which should give a decent base to fully complete the Case stuff.

Happy to discuss API and help you out where I can. I'm always in IRC as jarshwah if you'd like more immediate discussion too.

Cheers

Tim Graham

unread,
Dec 16, 2014, 8:28:53 PM12/16/14
to django-d...@googlegroups.com
Unlike in previous release cycles, we are going to try a complete feature freeze at alpha. The details are outlined here:

https://code.djangoproject.com/wiki/Version1.8Roadmap

Josh Smeaton

unread,
Dec 16, 2014, 8:48:31 PM12/16/14
to django-d...@googlegroups.com
Thanks Tim, exactly what I was after. I linked to the wiki but obviously didn't read it all :/

Michal, do you think you'll be able to pull Case expressions together by the first or early second week of January to allow for review and changes?

Cheers

Michał Modzelewski

unread,
Dec 21, 2014, 3:58:19 PM12/21/14
to django-d...@googlegroups.com
I think I can. I'll try to catch you on IRC to discuss API.

- Michael

Michał Modzelewski

unread,
Jan 2, 2015, 7:14:03 AM1/2/15
to django-d...@googlegroups.com
Reply all
Reply to author
Forward
0 new messages