Proposal to re-open #27017 (updating only dirty fields in save())

198 views
Skip to first unread message

Daniel Tao

unread,
Jan 28, 2019, 9:00:21 AM1/28/19
to Django developers (Contributions to Django itself)
Hi!

This is my first post on this list. I recently left a comment on #27017 requesting to re-open the topic of only saving dirty fields in save(). Tim Graham helpfully directed to #4102 and advised that I make the proposal on the dev mailing list, so that's what I'm doing :)

I've gone through the history of #4102 and taken notes on the challenges that arose when this was first attempted 12 years ago (!). The TL;DR is that, while there were indeed quite a few complications, I don't see anything that came up that should be considered a flat-out showstopper. Rather, what happened was that at about the 69th comment, back in 2012, the conversation shifted as Matt Long made this observation:

what started as a simple opt-in feature request of adding a field white-list to the Model's save function morphed into a complicated dirty flag approach that obviously has many edge cases and performance implications given that this ticket has been open for 5 years now

From here it seems things progressed towards the current solution of supporting the update_fields argument, and that's where things ended. I would like to point out that Matt did not advocate for completely abandoning all efforts to support dirty field tracking; to the contrary, in the same comment he said this (emphasis mine):

Clearly some people feel differently and favor the dirty flag approach for a more hands-off approach. As such, I propose adding support for both methods

With that in mind, I believe it's worth re-opening this discussion. For a fairly lengthy justification, see my aforementioned comment on #27017. I'll copy the effective TL;DR of the proposal here for convenience:

In my opinion Django could make most code bases inherently more resilient against latent race conditions by implementing some form of dirty field tracking and effectively providing the functionality of update_fields automatically. I would like to propose a new setting, something like SAVE_UPDATE_DIRTY_FIELDS_ONLY, to change the ORM's default behavior so that calls to Model.save() only update the fields that have been set on the model instance. Naturally for backwards compatibility this setting would be False by default.

As for the concerns that were raised when this was first attempted, I will now attempt to summarize what I found along with, in most cases, a bit of editorializing from me.

Performance

The performance angle was first explored in a comment that said it "doesn't look good" and provided some benchmarks showing a performance hit from 0.17s to 2.64s for setting an attribute using the timeit package. I didn't see anyone point out that the timeit method defaults to executing code a million times; so the throughput of the operation went from about 6 million to closer to 400 thousand times per second. (The percentage change is indeed significant, but this doesn't smell like a potential bottleneck.)

It was noted in a couple places that it seems potentially shortsighted to focus so much on the performance of getting and setting attributes without taking into account the potential performance benefits of executing smaller UPDATE statements that write fewer columns. As far as I can tell, no one in the thread on #4102 actively investigated the latter.

Based on the unlikelihood of attribute setting representing a performance bottleneck, and the lack of data for the performance impact of executing smaller updates, I would consider the performance discussion largely a distraction. Though I do think it's worth measuring the latter.

Compatibility

It was observed that the two approaches considered on the ticket (overriding __setattr__ or defining custom property setters) would not work with obj.__dict__.update, which is apparently an optimization you can find prescribed on some blogs out int he wild. This supports the premise that this behavior should be opt in, so devs who are using incompatible techniques can stay away (unless/until they've removed those optimizations from their code).

I had already suggested putting this functionality behind a setting, so I think we're good here.

Correctness

There were some bugs related to multi-table inheritance. It appears these were both followed by patches with proposed fixes.

Another bug was that fields updated in pre_save such as DateTimeField(auto_now=True) were not being updated. There was a patch to fix this as well.

The point was raised that this approach poses a problem for mutable field types (e.g., ArrayField). Presumably one approach for addressing that would be the one employed by django-save-the-change here.

One gets the sense from following the thread that, surely, there would be more bugs discovered were someone to resurrect this work. However, again, I don't see any showstoppers here. We're talking about an advanced framework with a large set of features, and so naturally it's going to be difficult to get this right while taking all supported use cases into account.

Conclusion

As I said up top, I think it's worth re-opening #27017, or creating a new ticket for the behavior it was requesting (updating only dirty fields automatically in save()) where future contributors might discuss possible approaches and, with any luck, submit pull requests. I believe it would be both possible and helpful, and I also suspect that the process would go a bit more smoothly today, as the project has a huge suite of tests to catch possible edge cases and provide greater confidence about the correctness of any patches submitted.

P.S. I'm a little worried that someone will now direct me to another long thread I should read filled with even more reasons this idea caused problems in the past. If so, no worries! I'll do it. I've already come this far :)

Patryk Zawadzki

unread,
Jan 29, 2019, 4:43:26 AM1/29/19
to Django developers (Contributions to Django itself)
I feel conflicted here as I've spent numerous hours under different contracts fixing bugs caused by missing update_fields and leading to data loss or inconsistent database states ;)

On a more serious note, please make this happen.

With concurrent updates calling save() without update_fields is asking for trouble, one of our customers has save() methods specifically patched to raise exceptions under these circumstances because of how easy it is to mess things up and how hard it is to track them down.

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/f13f860e-8dc9-4a5d-80c5-f1201e741a9d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
Patryk Zawadzki
I solve problems.

Ivan Anishchuk

unread,
Jan 29, 2019, 11:15:46 AM1/29/19
to django-d...@googlegroups.com
While I'm not sure I have any particular suggestion for this, the problem is important enough for a few of the projects I'm maintaining. I have implemented more than a few custom save() methods, post_init/pre_save/post_save signals and complicated ways of setting update_fields and acting on their values and I would really like if there was a simple standard way of tracking dirty fields and saving only them in a reliable way. I'll be grateful if someone finally makes it happen.

Ivan.

Josh Smeaton

unread,
Jan 30, 2019, 8:51:10 AM1/30/19
to Django developers (Contributions to Django itself)
That's a +1 from me. I've certainly hit the bugs you've mentioned before, and I can't think of a good reason not to do dirty field tracking, especially if it were to be opt in.

Bikeshedding a little bit, rather than having a global setting, I'd rather see an option on Options/Meta so that opt in is per-model. On a related note, I'd like to see some real world testing of any solution with packages that implement polymorphic models.

Adam Johnson

unread,
Mar 14, 2019, 5:52:54 AM3/14/19
to django-d...@googlegroups.com
I believe there's the opportunity for a subtle bug in applications if Django moves from its current default behaviour of saving all fields to saving only dirty fields. Values in a model instance can become "torn" between two versions by two concurrent processes even though neither wanted to do this.

Take this model:

class Photo(models.Model):
    likes = models.IntegerField()
    multiplier = models.IntegerField()
    total = models.IntegerField()

Let's say total should always equal likes * multiplier .

Process 1 (which could be a view, management command, celery task, etc.) increases a Photo's likes:

photo = Photo.objects.get(id=id)
photo.likes += 1
photo.total = photo.likes * photo.multiplier
photo.save()

Process 2 updates a Photo's multiplier:

photo = Photo.objects.get(id=id)
photo.multiplier = new_multiplier
photo.total = photo.likes * photo.multiplier
photo.save()

To avoid problems if these two processes run at once on the same Photo instance, transactions or row-level locking with select_for_update() should be used, so they don't run actually in parallel. These aren't the default in Django though (I wish ATOMIC_REQUESTS was on by default).

Assuming the application is not using either a transaction or select_for_update(), current Django does still guarantee that either the complete state after Process 1 or 2 is in the database, by always writing back all the fields. That is, no Photo instance can have its total set to anything but likes * multiplier , since Django's save() sends all fields in a single UPDATE statement, which is always atomic. Sure there's a race condition and the results of either process 1 or 2 can be lost, but still there's a guarantee on the relationship between the values, which downstream logic might rely on.

Moving to dirty field tracking would break this guarantee. This would probably cause untold low-level breakage in applications where atomicity has not been considered. Therefore it seems like it would be a breaking change which is hard to communicate to users, a complicated situation.

--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.

For more options, visit https://groups.google.com/d/optout.


--
Adam

Daniel Tao

unread,
Mar 14, 2019, 8:39:05 AM3/14/19
to Django developers (Contributions to Django itself)
I agree with this:

> Therefore it seems like it would be a breaking change which is hard to communicate to users, a complicated situation.

This is why I'm recommending that this behavior be opt-in via a Django setting such as SAVE_UPDATE_DIRTY_FIELDS_ONLY. Application developers would then need to make a determination for themselves whether or not to enable it.

Adam is right to point out an edge case where this behavior could lead to subtle bugs. Not surprisingly, it's related to concurrency (which is where subtle bugs tend to live)! In response, I would point out that the whole point of this proposal is to address a different edge case, also related to concurrency, which I personally believe is more common, though it's admittedly hard to know without a lot of empirical data. (Actually it's also present in the scenario Adam describes, and he addresses this: "Sure there's a race condition and the results of either process 1 or 2 can be lost.")

From the perspective of an application developer, if this functionality were available to me, I'd want to look at the code and ask myself this question: which is more work?
  • Are there lots of places where the code is setting a small number of attributes and calling save() without update_fields, where concurrent requests (or background jobs) could override each other's writes and user data could be lost? Should I leave SAVE_UPDATE_DIRTY_FIELDS_ONLY disabled, enumerate these code paths and add update_fields to all of them?
  • Are there lots of places where the code is setting model attributes whose values are derived from a combination of other attributes, not all of which are set? Should I enable SAVE_UPDATE_DIRTY_FIELDS_ONLY, enumerate these code paths and wrap them in select_for_update to ensure the data doesn't get into an inconsistent state?

Tim Graham

unread,
Mar 14, 2019, 9:49:59 AM3/14/19
to Django developers (Contributions to Django itself)
I'm not sure if a global setting is a good idea. Code in third-party apps would need to work with both configurations?

charettes

unread,
Mar 14, 2019, 11:20:10 AM3/14/19
to Django developers (Contributions to Django itself)
+1 to what Adam said. Turning this feature on by default would be certainly cause a lot of bugs in application
expecting a "coherent state" to be persisted on .save().

That would effectively make most of the Form.clean(), Model.clean() and DRF serializer logic in charge of
validating a field value based on another field value useless as another process could race to update the
row to another state unless you've defined database level constraints as well.

> ... performance benefits of executing smaller UPDATE statements that write fewer columns.

FWIW the current .save() logic ignores deferred fields so it's currently possible to work around the
rare cases where updating a large column causes a slow UPDATE query.

> I'm not sure if a global setting is a good idea. Code in third-party apps would need to work with both configurations?

Agreed, it feels like a Meta or .save() option would be more appropriate, we already have a select_on_save
option for example which is even more specific than this use case.

At this point I'm not convinced that it's something Django should implement internally as it is possible to implement
an abstract model in charge of tracking "dirty" fields and overriding `.save` to make `update_fields` default to it
as demonstrated by the few existing third-party solutions.

Simon

Tobias McNulty

unread,
Mar 14, 2019, 3:14:06 PM3/14/19
to django-developers
I want to qualify this by saying that personally I've always thought of (and likely will continue to think of) model objects as a snapshot of what's in the database, and use the more explicit queryset.update() when that snapshot is NOT what I want to save back to the database. So I may be somewhat biased against this idea to start with. :)

That said, as Simon mentioned, don't we already effectively support this?

>>> from django.db import connection
>>> from food.models import Food
>>> f = Food.objects.only('pk').first()
>>> f.name = 'foo'
>>> f.save()
>>> connection.queries
[{'sql': 'SELECT "food_food"."id" FROM "food_food" ORDER BY "food_food"."food_code" ASC LIMIT 1', 'time': '0.003'}, {'sql': 'UPDATE "food_food" SET "name" = \'foo\' WHERE "food_food"."id" = 5542', 'time': '0.010'}]

(For the record, this model has many more fields.)

Of course, if you need other fields to calculate whatever needs to be saved back to the DB, you should add them to .only(). But in any case it protects you from silently losing data by forgetting to include them in update_fields (at the risk of generating extra queries if you forget to include them in .only(), but at least you can see those and optimize them away).

Consider me relatively ambivalent about the idea...perhaps +0 on a Meta- or save()-level implementation and -0 on a settings-based approach (given the red flags already raised).

Cheers,

Tobias McNulty
Chief Executive Officer

tob...@caktusgroup.com
www.caktusgroup.com



--
You received this message because you are subscribed to the Google Groups "Django developers (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/django-developers.

Tobias Wiese

unread,
Mar 16, 2019, 2:05:18 PM3/16/19
to django-d...@googlegroups.com
On 3/14/19 10:52 AM, Adam Johnson wrote:
> Assuming the application is not using either a transaction or
> select_for_update(), current Django does still guarantee that either the
> complete state after Process 1 or 2 is in the database, by always writing
> back all the fields. That is, no Photo instance can have its total set to
> anything but likes * multiplier , since Django's save() sends all fields in
> a single UPDATE statement, which is always atomic. Sure there's a race
> condition and the results of either process 1 or 2 can be lost, but still
> there's a guarantee on the relationship between the values, which
> downstream logic might rely on.
>
> Moving to dirty field tracking would break this guarantee. This would
> probably cause untold low-level breakage in applications where atomicity
> has not been considered. Therefore it seems like it would be a breaking
> change which is hard to communicate to users, a complicated situation.

This case already has the bug, that data gets lost. (Beside that one
shouldn't have fields that rely on other fields that much anyway). This
change would only move the bug somewere else, were it is also easier to
detect (the existance of the bug, not exactly where it happens)

Beside that I don't see a vaild use case where something like this would
actually happen.

--
Tobias Wiese

signature.asc
Reply all
Reply to author
Forward
0 new messages