--
Ticket URL: <https://code.djangoproject.com/ticket/16649>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* needs_better_patch: => 0
* needs_docs: => 0
* needs_tests: => 0
* stage: Unreviewed => Design decision needed
Comment:
The idea makes sense and I think it's worth studying.
However, Trac isn't the best place to reach an audience for comments, I
suggest writing to django-developers if you haven't done so yet.
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:1>
Comment (by jacob):
akaariai, can you back up a few steps and explain *why* you think this is
a good idea? As far as I can tell, the speed benefits are just a trade-
off: you trade create performance for update performance. Either call we
make is going to adversely affect some users, so that's exactly why the
`force_update`/`_insert` arguments exist. As it stands, I'm fairly well
against the idea: I can't see the benefit in making a fundamental change
like this, and I *can* see a lot of downside to changing the behavior of
`save()`. More information might change my mind, though.
Thanks!
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:2>
Comment (by akaariai):
In summary, update case now requires just one query while previously it
required 2. Insert case still requires 2 queries. Unfortunately the update
-> insert way can be a little bit more expensive than doing select ->
insert. This is due to the need to transfer the data to the DB two times.
There are still reasons why I think this is an optimization:
- I think we will be gaining more from the update case speedup than we
are going to lose in the insert case. Needs benchmarks.
- In the common case of an autofield pk we are even more likely to gain.
Autofield gets its value on first save and should not change after that.
If the field is set to a value it is very likely we are going to do an
update. If it is not set, we are going to do an insert directly.
- When loading fixtures, or when saving the same model to another DB we
are likely going to do an insert. In this case we can use a heuristic: if
we are adding the object to the DB (instance._state.adding=True) or if we
are saving to a different DB (instance._state.db <> using) we use the old
method of SELECT -> insert / update.
All three above combined, I don't think there are many cases where losses
are expected. The only case I can think of is reusing an object in a loop
while changing its primary key, something like "obj = Obj(); for i in
range(0, 100): obj.pk=i; obj.save()" - but I don't think that is a case we
should optimize for. In fact, I think that should either result in error
or update of the PK value, but that is another ticket's problem.
Of course, the best optimization would be to use database specific methods
(MERGE or INSERT ON DUPLICATE KEY UPDATE). This would allow for single
query save in every case. Unfortunately there is no such method for
PostgreSQL, at least not any method I know of. It might make sense to
investigate this route first.
Sorry for making the original post so long, there are a lot of ideas mixed
in one ticket. I will try to be a better trac-citizen in the future :)
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:3>
Comment (by akaariai):
A bit more information about what the approach suggested is about. Assume
you load a model from DB, change some fields and save it. We currently do:
{{{
SELECT - load from DB
[change fields]; save()
In save:
SELECT - assert that the PK exists in the DB
if yes:
UPDATE
else:
INSERT
}}}
The second select above is usually redundant. If we know the object is
loaded from DB (self._state.adding == False), and we are saving it to the
same DB it was loaded from (self._state.db == the db we are saving to),
then it is very likely the object exists in the DB. So, instead we should
do:
{{{
In save:
if same db, not adding:
exists = UPDATE
else:
exists = SELECT
if exists:
UPDATE
if not exists:
INSERT
}}}
So, in the case the model is already in the DB, we do just a single update
instead of select + update. If it so happens it doesn't exist (which
should be very unlikely), we do update with zero rows modified and still
know to do an insert.
This does lead to behaviour change on MySQL when there is a concurrent
insert to the same table. Currently the SELECT sees nothing, and the
insert will lead to integrity error. After this, the update will block
until the inserting transaction is committed, and then update the matching
row. However, on PostgreSQL the update sees nothing, so there is no
behaviour change there. For MySQL users this will be better than what we
have now. For PostgreSQL users there is no change from behaviour
standpoint.
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:4>
Comment (by akaariai):
Sorry, the explanations above are a little confusing. Still another try:
{{{
s = SomeModel.objects.get(pk=someval)
s.somecol = someval
s.save()
}}}
Here save is implemented as SELECT - if not found INSERT - else UPDATE.
The SELECT here is redundant, we have the information that SELECT was just
done in model._state. So, we can do directly an UPDATE. If it happens so
that nothing is updated (likely because of concurrent delete) then we will
still do the INSERT and things work as expected.
I have done a
[https://github.com/akaariai/django/compare/model_save_refactor full
refactor] of model.save_base(), this includes 4 parts:
- What this ticket deals with - trying directly to UPDATE if the model
was fetched from the same DB we are saving to. Also assorted cleanup to
make this possible.
- Cleanup to proxy parents handling (the logic is somewhat ugly
currently)
- A bug fix for #17341 (do not commit transaction after every parent
model save)
- Splitting save_base into parts so that the saving logic is easier to
follow
Above, the bug fix is of course needed, and the proxy parents handling
cleanup is IMO also needed.
I can't see any downsides to trying UPDATE directly as done in the patch.
This should result in clear performance improvement in most cases. There
is a problem that we have documented very explicitly the sequence of SQL
commands .save() does but I don't think that documentation should be
considered part of the public API. So, docs update is needed.
The refactoring of .save_base() into parts is a stylistic question. I
surprisingly prefer the refactored way. Some examples of things that are
hard to see from the current writing of the code:
- Which signals are sent for which models
- The actual hard work is done in the if not meta.proxy: branch.
However, it is impossible that meta.proxy is True here. And it isn't
exactly easy to see this.
- The origin parameter is a bit weird - it is None except for the
outermost table save.
Still, I'd like to get a confirmation that others prefer the new writing,
too.
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:5>
* stage: Design decision needed => Ready for checkin
Comment:
Squash-rebase of the above branch available from:
https://github.com/akaariai/django/compare/django:master...model_save_refactor_squash
I am moving this out of DDN - I have written to django-developers and I
haven't seen any -1 (or +1 for that matter) when I suggested this change.
I take this as nobody opposing the change. I think the only reason to
oppose this feature is that something might break. As far as I know
nothing should break, but the only way to find out for sure is moving this
in and getting real world usage of the new save() way.
A final review (if only looking for typos & stale comments) is welcome.
Another thing I'd still like to get reviewed is the structuring of the new
code - do the method names make sense, are the splitpoints sensible?
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:6>
Comment (by akaariai):
I updated the branch with some last minute polish. Will commit soon.
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:7>
Comment (by timo):
I hope you won't find my grammar nitpicks too exacting. :-)
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:8>
* needs_better_patch: 0 => 1
* stage: Ready for checkin => Accepted
Comment:
Thanks for the comments! The Finnish dialect of English has special
grammar...
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:9>
* needs_better_patch: 1 => 0
* stage: Accepted => Ready for checkin
Comment:
Patch updated,
https://github.com/akaariai/django/compare/django:master...model_save_refactor_squash.
I removed the "if from same DB" check. Now the patch just always tries
UPDATE first if the model is from the same DB. It will be easy enough to
change the behaviour later on if it seems that is needed.
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:10>
* status: new => closed
* resolution: => fixed
Comment:
In [changeset:"6b4834952dcce0db5cbc1534635c00ff8573a6d8"]:
{{{
#!CommitTicketReference repository=""
revision="6b4834952dcce0db5cbc1534635c00ff8573a6d8"
Fixed #16649 -- Refactored save_base logic
Model.save() will use UPDATE - if not updated - INSERT instead of
SELECT - if found UPDATE else INSERT. This should save a query when
updating, but will cost a little when inserting model with PK set.
Also fixed #17341 -- made sure .save() commits transactions only after
the whole model has been saved. This wasn't the case in model
inheritance situations.
The save_base implementation was refactored into multiple methods.
A typical chain for inherited save is:
save_base()
_save_parents(self)
for each parent:
_save_parents(parent)
_save_table(parent)
_save_table(self)
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:11>
Comment (by akaariai):
While using update_fields on one of my projects I noticed a possible
reason to revert the way update count is used in this ticket's committed
patch. On PostgreSQL, if there is a trigger for the updated table, and
this trigger returns NULL in some conditions, then the UPDATE statement is
seen as affecting zero rows. This will result in Django seeing zero rows
updated, and this again will result in error when using update_fields as
the update didn't seem to affect any rows.
For plain .save() in 1.6 the result will be IntegrityError - the update
doesn't seem to affect any rows, but there is a row matching the update
condition in the DB, so Django will try to UPDATE, see no rows matched,
and then try to INSERT -> boom.
The end result is that code that worked in 1.5 will not work in 1.6 if
there are ON UPDATE triggers which return NULL. I am not sure about how
common this condition will be. It will be easy to change the detection in
save_base() from UPDATE - if not updated - INSERT to SELECT - if found
UPDATE - else INSERT. But I would like to wait and see if this problem is
common enough to worry about...
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:12>
* cc: marti@… (added)
Comment:
We hit the problem mentioned in [comment:12 comment 12] while trying out
Django 1.6 beta. We are using the PostgreSQL built-in
[http://www.postgresql.org/docs/current/static/functions-trigger.html
suppress_redundant_updates_trigger] trigger on some tables to save I/O
when saving unchanged rows. With Django 1.6 this now causes
IntegrityErrors.
No big deal for us, we can drop the trigger when we migrate to 1.6, but I
can imagine there are situations where trigger logic is a necessary part
of the application. Maybe there should be a workaround for those poor
users, like a per-model option to fall back to the old save behavior?
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:13>
Comment (by akaariai):
Maybe something like select_on_save or somesuch _meta option. I'd like to
keep the default False as trying direct update is the correct thing to do
most of the time. I will check how hard this is to do.
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:14>
Comment (by akaariai):
I've added a crude patch for "select_on_save" option. Docs missing, maybe
more tests needed, too. See
https://github.com/akaariai/django/commit/1ca398b0352b11868ec236f92cf17b6ce82ff88c
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:15>
Comment (by akaariai):
I added a new ticket for the incompatibility issue, see #20988.
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:16>
Comment (by Anssi Kääriäinen <akaariai@…>):
In [changeset:"e973ee6a9879969b8ae05bb7ff681172cc5386a5"]:
{{{
#!CommitTicketReference repository=""
revision="e973ee6a9879969b8ae05bb7ff681172cc5386a5"
Fixed #20988 -- Added model meta option select_on_save
The option can be used to force pre 1.6 style SELECT on save behaviour.
This is needed in case the database returns zero updated rows even if
there is a matching row in the DB. One such case is PostgreSQL update
trigger that returns NULL.
Reviewed by Tim Graham.
Refs #16649
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:17>
Comment (by Anssi Kääriäinen <akaariai@…>):
In [changeset:"76e38a21777243fec58c640d617bb71a251c5ba1"]:
{{{
#!CommitTicketReference repository=""
revision="76e38a21777243fec58c640d617bb71a251c5ba1"
[1.6.x] Fixed #20988 -- Added model meta option select_on_save
The option can be used to force pre 1.6 style SELECT on save behaviour.
This is needed in case the database returns zero updated rows even if
there is a matching row in the DB. One such case is PostgreSQL update
trigger that returns NULL.
Reviewed by Tim Graham.
Refs #16649
Backport of e973ee6a9879969b8ae05bb7ff681172cc5386a5 from master
Conflicts:
django/db/models/options.py
tests/basic/tests.py
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/16649#comment:18>