[Django] #16649: Models.save() refactoring: check updated rows to determine action

Django

unread,

Aug 17, 2011, 6:09:25 PM8/17/11

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action
----------------------------+----------------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Component: Database layer (models, ORM)
Milestone: | Severity: Normal
Version: 1.3 | Triage Stage: Unreviewed
Keywords: | Easy pickings: 0
Has patch: 1 |
UI/UX: 0 |
----------------------------+----------------------------------------------
The attached patch implements a new logic for models.save(). Instead of
the old
{{{
SELECT
if found:
UPDATE
else:
INSERT

The patch uses
UPDATE
if rows modified:
DONE
else:
INSERT
}}}

The logic is naturally more complex when tacking in account force_update,
force_insert and all the corner cases.

As a teaser: load 1000 objects from db, update field, save. Almost 50%
faster now. (Remember to run in one transaction).

The gain is that when the save() results in an update, there is no need
for an extra select. In my somewhat limited tests, with PostgreSQL and
10000 saves of a simple model in single transaction the result was almost
50% speed up. The speedup is similar when using SQLite3.

For the insert case, when there is no PK value for the model, there is no
effect. The result is an insert as always.

For the insert case where there is a PK value defined, and there is no
matching pk value in the DB, the idea results in a speed loss. The reason
is that when doing the UPDATE which doesn't update anything, the data must
be transferred to the db. The speed loss is very small for models which
have not too many / wide fields, and according to my quick testing, it is
somewhere from 10 to 20% with very wide (100000 chars) models. This is
localhost setup, so if there is a slow (as in throughput) network
involved, I would imagine this would result in pretty much exactly 50%
speed loss. There is patch attached to this ticket mitigating this ticket.
In the end of this post there is an explanation how to mitigate this
problem.

There is one big catch: It is currently very well documented that Django
does a select and then based on that an update or an insert. I can't see
changing that could break applications (famous last words) so this is just
a documentation issue. The other downside is that this makes save_base
more complicated. Especially the patch no2, we are sometimes selecting and
sometimes updating as first action, and it is not easy to guess the path
taken.

The speedtests I did are basically different ways of saving models with pk
set and not set. I will post some benchmarks if this is not thrown out on
backwards compatibility / complexity claims. I don't have any benchmarks
worth posting at the moment.

All tests passed using SQLite3, running PostgreSQL tests currently. No new
tests added. Needs documentation, but I might not be the best person to do
that. MultiDB testing would be welcome. I hope to get some feedback.
Negative or positive...

----
To work around the unnecessary data transfer problem, I used the
model._state.adding variable. The idea is simple. If we are adding (the
model did not come from the db) and there is a pk value, we don't expect
there to be any data matching for our primary key. So we use the old
select check in this case to avoid the potential network traffic. Based on
the select, set force_insert or force_update to get the desired action.
This mimics exactly what was done before, expect for error messages in
obscure cases. Those can be fixed, too.

When the _state.adding is False (the model did come from the db) we expect
the save() to result in an update. So we use the above described method.

There is still one situation where that guess can fail. Multidb. Save the
model first in master db, then in slave dbs. The pk is set and
adding=False, but we are still adding in reality. For this there is a
check for self._state.db <> using. I would except that in these situations
people would use force_insert=True, but I am no multidb person so not sure
about that.

I did a third small optimization. The idea is again simple. When saving a
inheriting model, the parents are saved first (there is no change here).
If the parent was inserted, we know this model must also be inserted. It
is not possible for a model to exist without it's parent. This way we
don't need to do any checking, we can force_insert. For update the same
doesn't work: it is possible that the parent exists already, and we are
"extending" it at the moment, saving a new model. This optimization can
result also in almost 50% speedup in the best case, although the cases are
more rare. I don't know any downsides to this optimization.

There are three patches. The first is the base idea, the second is the
_state.adding trick and the third is the parent inheritance thingy. These
must be applied in order.

Sorry for the long post. I was going to do just the update and check rows
returned idea. But why not do just one small thing more... :)

--
Ticket URL: <https://code.djangoproject.com/ticket/16649>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,

Aug 20, 2011, 5:15:38 PM8/20/11

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------

Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Component: Database layer

* needs_better_patch: => 0
* needs_docs: => 0
* needs_tests: => 0
* stage: Unreviewed => Design decision needed

Comment:

The idea makes sense and I think it's worth studying.

However, Trac isn't the best place to reach an audience for comments, I
suggest writing to django-developers if you haven't done so yet.

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:1>

Django

unread,

Sep 20, 2011, 9:55:41 AM9/20/11

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------

Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Component: Database layer

Comment (by jacob):

akaariai, can you back up a few steps and explain *why* you think this is
a good idea? As far as I can tell, the speed benefits are just a trade-
off: you trade create performance for update performance. Either call we
make is going to adversely affect some users, so that's exactly why the
`force_update`/`_insert` arguments exist. As it stands, I'm fairly well
against the idea: I can't see the benefit in making a fundamental change
like this, and I *can* see a lot of downside to changing the behavior of
`save()`. More information might change my mind, though.

Thanks!

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:2>

Django

unread,

Sep 20, 2011, 10:44:03 AM9/20/11

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------

Reporter: akaariai | Owner: nobody
Type: | Status: new
Cleanup/optimization | Component: Database layer

Comment (by akaariai):

In summary, update case now requires just one query while previously it
required 2. Insert case still requires 2 queries. Unfortunately the update
-> insert way can be a little bit more expensive than doing select ->
insert. This is due to the need to transfer the data to the DB two times.
There are still reasons why I think this is an optimization:

- I think we will be gaining more from the update case speedup than we
are going to lose in the insert case. Needs benchmarks.
- In the common case of an autofield pk we are even more likely to gain.
Autofield gets its value on first save and should not change after that.
If the field is set to a value it is very likely we are going to do an
update. If it is not set, we are going to do an insert directly.
- When loading fixtures, or when saving the same model to another DB we
are likely going to do an insert. In this case we can use a heuristic: if
we are adding the object to the DB (instance._state.adding=True) or if we
are saving to a different DB (instance._state.db <> using) we use the old
method of SELECT -> insert / update.

All three above combined, I don't think there are many cases where losses
are expected. The only case I can think of is reusing an object in a loop
while changing its primary key, something like "obj = Obj(); for i in
range(0, 100): obj.pk=i; obj.save()" - but I don't think that is a case we
should optimize for. In fact, I think that should either result in error
or update of the PK value, but that is another ticket's problem.

Of course, the best optimization would be to use database specific methods
(MERGE or INSERT ON DUPLICATE KEY UPDATE). This would allow for single
query save in every case. Unfortunately there is no such method for
PostgreSQL, at least not any method I know of. It might make sense to
investigate this route first.

Sorry for making the original post so long, there are a lot of ideas mixed
in one ticket. I will try to be a better trac-citizen in the future :)

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:3>

Django

unread,

Sep 5, 2012, 3:03:38 PM9/5/12

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------

Reporter: akaariai | Owner: nobody
Type: | Status: new

Comment (by akaariai):

A bit more information about what the approach suggested is about. Assume
you load a model from DB, change some fields and save it. We currently do:
{{{
SELECT - load from DB
[change fields]; save()
In save:
SELECT - assert that the PK exists in the DB
if yes:
UPDATE
else:
INSERT
}}}

The second select above is usually redundant. If we know the object is
loaded from DB (self._state.adding == False), and we are saving it to the
same DB it was loaded from (self._state.db == the db we are saving to),
then it is very likely the object exists in the DB. So, instead we should
do:
{{{
In save:
if same db, not adding:
exists = UPDATE
else:
exists = SELECT
if exists:
UPDATE
if not exists:
INSERT
}}}
So, in the case the model is already in the DB, we do just a single update
instead of select + update. If it so happens it doesn't exist (which
should be very unlikely), we do update with zero rows modified and still
know to do an insert.

This does lead to behaviour change on MySQL when there is a concurrent
insert to the same table. Currently the SELECT sees nothing, and the
insert will lead to integrity error. After this, the update will block
until the inserting transaction is committed, and then update the matching
row. However, on PostgreSQL the update sees nothing, so there is no
behaviour change there. For MySQL users this will be better than what we
have now. For PostgreSQL users there is no change from behaviour
standpoint.

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:4>

Django

unread,

Nov 29, 2012, 2:10:20 PM11/29/12

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------

Reporter: akaariai | Owner: nobody
Type: | Status: new

Comment (by akaariai):

Sorry, the explanations above are a little confusing. Still another try:
{{{
s = SomeModel.objects.get(pk=someval)
s.somecol = someval
s.save()
}}}
Here save is implemented as SELECT - if not found INSERT - else UPDATE.
The SELECT here is redundant, we have the information that SELECT was just
done in model._state. So, we can do directly an UPDATE. If it happens so
that nothing is updated (likely because of concurrent delete) then we will
still do the INSERT and things work as expected.

I have done a
[https://github.com/akaariai/django/compare/model_save_refactor full
refactor] of model.save_base(), this includes 4 parts:
- What this ticket deals with - trying directly to UPDATE if the model
was fetched from the same DB we are saving to. Also assorted cleanup to
make this possible.
- Cleanup to proxy parents handling (the logic is somewhat ugly
currently)
- A bug fix for #17341 (do not commit transaction after every parent
model save)
- Splitting save_base into parts so that the saving logic is easier to
follow

Above, the bug fix is of course needed, and the proxy parents handling
cleanup is IMO also needed.

I can't see any downsides to trying UPDATE directly as done in the patch.
This should result in clear performance improvement in most cases. There
is a problem that we have documented very explicitly the sequence of SQL
commands .save() does but I don't think that documentation should be
considered part of the public API. So, docs update is needed.

The refactoring of .save_base() into parts is a stylistic question. I
surprisingly prefer the refactored way. Some examples of things that are
hard to see from the current writing of the code:
- Which signals are sent for which models
- The actual hard work is done in the if not meta.proxy: branch.
However, it is impossible that meta.proxy is True here. And it isn't
exactly easy to see this.
- The origin parameter is a bit weird - it is None except for the
outermost table save.

Still, I'd like to get a confirmation that others prefer the new writing,
too.

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:5>

Django

unread,

Dec 30, 2012, 2:51:10 PM12/30/12

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------

Reporter: akaariai | Owner: nobody
Type: | Status: new

Cleanup/optimization | Version: 1.3
Component: Database layer | Resolution:

(models, ORM) | Triage Stage: Ready for
Severity: Normal | checkin

Keywords: | Needs documentation: 0
Has patch: 1 | Patch needs improvement: 0
Needs tests: 0 | UI/UX: 0
Easy pickings: 0 |
-------------------------------------+-------------------------------------

Changes (by akaariai):

* stage: Design decision needed => Ready for checkin

Comment:

Squash-rebase of the above branch available from:
https://github.com/akaariai/django/compare/django:master...model_save_refactor_squash

I am moving this out of DDN - I have written to django-developers and I
haven't seen any -1 (or +1 for that matter) when I suggested this change.
I take this as nobody opposing the change. I think the only reason to
oppose this feature is that something might break. As far as I know
nothing should break, but the only way to find out for sure is moving this
in and getting real world usage of the new save() way.

A final review (if only looking for typos & stale comments) is welcome.
Another thing I'd still like to get reviewed is the structuring of the new
code - do the method names make sense, are the splitpoints sensible?

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:6>

Django

unread,

Jan 8, 2013, 12:51:49 PM1/8/13

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------

Reporter: akaariai | Owner: nobody
Type: | Status: new

Comment (by akaariai):

I updated the branch with some last minute polish. Will commit soon.

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:7>

Django

unread,

Feb 16, 2013, 7:19:39 PM2/16/13

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------

Reporter: akaariai | Owner: nobody
Type: | Status: new

Comment (by timo):

I hope you won't find my grammar nitpicks too exacting. :-)

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:8>

Django

unread,

Feb 22, 2013, 1:55:45 PM2/22/13

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------

Reporter: akaariai | Owner: nobody
Type: | Status: new

Cleanup/optimization | Version: 1.3
Component: Database layer | Resolution:

(models, ORM) | Triage Stage: Accepted
Severity: Normal | Needs documentation: 0
Keywords: | Patch needs improvement: 1

Has patch: 1 | UI/UX: 0

Needs tests: 0 |
Easy pickings: 0 |
-------------------------------------+-------------------------------------
Changes (by akaariai):

* needs_better_patch: 0 => 1
* stage: Ready for checkin => Accepted

Comment:

Thanks for the comments! The Finnish dialect of English has special
grammar...

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:9>

Django

unread,

Mar 13, 2013, 5:58:54 PM3/13/13

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------

Reporter: akaariai | Owner: nobody
Type: | Status: new

Cleanup/optimization | Version: 1.3
Component: Database layer | Resolution:

(models, ORM) | Triage Stage: Ready for
Severity: Normal | checkin
Keywords: | Needs documentation: 0
Has patch: 1 | Patch needs improvement: 0

Needs tests: 0 | UI/UX: 0

Easy pickings: 0 |
-------------------------------------+-------------------------------------
Changes (by akaariai):

* needs_better_patch: 1 => 0
* stage: Accepted => Ready for checkin

Comment:

Patch updated,
https://github.com/akaariai/django/compare/django:master...model_save_refactor_squash.
I removed the "if from same DB" check. Now the patch just always tries
UPDATE first if the model is from the same DB. It will be easy enough to
change the behaviour later on if it seems that is needed.

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:10>

Django

unread,

Mar 14, 2013, 5:04:56 AM3/14/13

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody
Type: | Status: closed
Cleanup/optimization | Version: 1.3
Component: Database layer | Resolution: fixed

Changes (by Anssi Kääriäinen <akaariai@…>):

* status: new => closed
* resolution: => fixed

Comment:

In [changeset:"6b4834952dcce0db5cbc1534635c00ff8573a6d8"]:
{{{
#!CommitTicketReference repository=""
revision="6b4834952dcce0db5cbc1534635c00ff8573a6d8"
Fixed #16649 -- Refactored save_base logic

Model.save() will use UPDATE - if not updated - INSERT instead of
SELECT - if found UPDATE else INSERT. This should save a query when
updating, but will cost a little when inserting model with PK set.

Also fixed #17341 -- made sure .save() commits transactions only after
the whole model has been saved. This wasn't the case in model
inheritance situations.

The save_base implementation was refactored into multiple methods.
A typical chain for inherited save is:
save_base()
_save_parents(self)
for each parent:
_save_parents(parent)
_save_table(parent)
_save_table(self)
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:11>

Django

unread,

Apr 15, 2013, 9:49:57 AM4/15/13

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody

Comment (by akaariai):

While using update_fields on one of my projects I noticed a possible
reason to revert the way update count is used in this ticket's committed
patch. On PostgreSQL, if there is a trigger for the updated table, and
this trigger returns NULL in some conditions, then the UPDATE statement is
seen as affecting zero rows. This will result in Django seeing zero rows
updated, and this again will result in error when using update_fields as
the update didn't seem to affect any rows.

For plain .save() in 1.6 the result will be IntegrityError - the update
doesn't seem to affect any rows, but there is a row matching the update
condition in the DB, so Django will try to UPDATE, see no rows matched,
and then try to INSERT -> boom.

The end result is that code that worked in 1.5 will not work in 1.6 if
there are ON UPDATE triggers which return NULL. I am not sure about how
common this condition will be. It will be easy to change the detection in
save_base() from UPDATE - if not updated - INSERT to SELECT - if found
UPDATE - else INSERT. But I would like to wait and see if this problem is
common enough to worry about...

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:12>

Django

unread,

Aug 16, 2013, 5:57:44 AM8/16/13

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody

Changes (by intgr):

* cc: marti@… (added)

Comment:

We hit the problem mentioned in [comment:12 comment 12] while trying out
Django 1.6 beta. We are using the PostgreSQL built-in
[http://www.postgresql.org/docs/current/static/functions-trigger.html
suppress_redundant_updates_trigger] trigger on some tables to save I/O
when saving unchanged rows. With Django 1.6 this now causes
IntegrityErrors.

No big deal for us, we can drop the trigger when we migrate to 1.6, but I
can imagine there are situations where trigger logic is a necessary part
of the application. Maybe there should be a workaround for those poor
users, like a per-model option to fall back to the old save behavior?

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:13>

Django

unread,

Aug 16, 2013, 6:03:16 AM8/16/13

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody

Comment (by akaariai):

Maybe something like select_on_save or somesuch _meta option. I'd like to
keep the default False as trying direct update is the correct thing to do
most of the time. I will check how hard this is to do.

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:14>

Django

unread,

Aug 16, 2013, 8:00:03 AM8/16/13

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody

Comment (by akaariai):

I've added a crude patch for "select_on_save" option. Docs missing, maybe
more tests needed, too. See
https://github.com/akaariai/django/commit/1ca398b0352b11868ec236f92cf17b6ce82ff88c

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:15>

Django

unread,

Aug 28, 2013, 4:06:57 AM8/28/13

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody

Comment (by akaariai):

I added a new ticket for the incompatibility issue, see #20988.

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:16>

Django

unread,

Aug 30, 2013, 2:44:27 AM8/30/13

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody

Comment (by Anssi Kääriäinen <akaariai@…>):

In [changeset:"e973ee6a9879969b8ae05bb7ff681172cc5386a5"]:
{{{
#!CommitTicketReference repository=""
revision="e973ee6a9879969b8ae05bb7ff681172cc5386a5"
Fixed #20988 -- Added model meta option select_on_save

The option can be used to force pre 1.6 style SELECT on save behaviour.
This is needed in case the database returns zero updated rows even if
there is a matching row in the DB. One such case is PostgreSQL update
trigger that returns NULL.

Reviewed by Tim Graham.

Refs #16649
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:17>

Django

unread,

Aug 30, 2013, 2:49:16 AM8/30/13

to django-...@googlegroups.com

#16649: Models.save() refactoring: check updated rows to determine action

-------------------------------------+-------------------------------------
Reporter: akaariai | Owner: nobody

Comment (by Anssi Kääriäinen <akaariai@…>):

In [changeset:"76e38a21777243fec58c640d617bb71a251c5ba1"]:
{{{
#!CommitTicketReference repository=""
revision="76e38a21777243fec58c640d617bb71a251c5ba1"
[1.6.x] Fixed #20988 -- Added model meta option select_on_save

The option can be used to force pre 1.6 style SELECT on save behaviour.
This is needed in case the database returns zero updated rows even if
there is a matching row in the DB. One such case is PostgreSQL update
trigger that returns NULL.

Reviewed by Tim Graham.

Refs #16649

Backport of e973ee6a9879969b8ae05bb7ff681172cc5386a5 from master

Conflicts:
django/db/models/options.py
tests/basic/tests.py
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:18>

Reply all

Reply to author

Forward