I think we could use UPSERT for our .save() logic. Django's save()
method is defined as:
- Insert a row into the database if there isn't already a matching
row for the saved model's primary key
- Otherwise update the existing row
This is currently implemented as "try to update, if nothing was updated,
then insert". Naturally, that approach has some race conditions.
There are a couple of problems with the suggested INSERT ... ON
DUPLICATE KEY UPDATE feature for this use case:
1. We need to use WITHIN primary_key_idx_name, but we don't
necessarily know the primary key index name of the table. It would be
extremely useful to have support for WITHIN PRIMARY KEY syntax.
2. We need to know if the row was updated or if it was created
(post_save signal needs this information). Looking at the spec, it seems
this is possible to do by issuing a query:
INSERT ... ON DUPLICATE KEY UPDATE RETURNING primary_key_col;
and then, if nothing is returned, we know it was an update. However, I
see this approach as problematic: if PostgreSQL ever wants to allow
returning the updated values on conflict, then using RETURNING
primary_key_col; wouldn't work anymore. It seems somewhat likely that
somebody will want to add support for RETURNING for the ON CONFLICT
UPDATE case later on.
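Concretely, the inference would look something like this (a sketch under my reading of the spec; the query text uses the proposed syntax, needs a server with the patch applied, and the index name author_pkey is an assumption, which is exactly problem 1 above):

```python
# Hypothetical query using the proposed syntax; "author_pkey" is an
# assumed index name. Not runnable against stock PostgreSQL.
UPSERT_SQL = (
    "INSERT INTO author (id, name) VALUES (%s, %s) "
    "ON DUPLICATE KEY UPDATE WITHIN author_pkey "
    "SET name = CONFLICTING(name) "
    "RETURNING id"
)

def was_created(returned_rows):
    """Interpret the RETURNING result of the query above.

    As I read the spec, RETURNING projects only rows that were actually
    inserted, so an empty result means the conflict/update path was
    taken. This is the fragile part: if RETURNING is later extended to
    cover the ON CONFLICT UPDATE case too, the inference silently breaks.
    """
    return len(returned_rows) > 0
```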
Using the proposed feature for the create_or_update() method isn't that
easy - the problem is that the user is free to specify any filtering
for the model to be upserted. We would need to check whether the
filtering matches some unique index exactly, and if it does, then do an
INSERT ON DUPLICATE KEY UPDATE WITHIN uq_index_name - but again, we
don't know the index name.
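The check we would need can be sketched as follows (illustrative only; the unique-index metadata would have to come from introspection, and the mapping shape here is an assumption):

```python
def matching_unique_index(filter_columns, unique_indexes):
    """Find a unique index whose column set exactly matches the filter.

    unique_indexes maps index name -> tuple of column names (this
    metadata would have to come from database introspection). Returns
    the index name, or None when the filter doesn't line up with any
    unique index - in which case the upsert path can't be used at all.
    """
    wanted = frozenset(filter_columns)
    for name, columns in unique_indexes.items():
        if frozenset(columns) == wanted:
            return name
    return None
```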
Another possible use case is some sort of "bulk merge" operation. The
operation would be defined as:
- It takes in a list of model instances
- For those models which have primary key set:
- If the database has a row with same pk, that row is updated
- Otherwise a new row is inserted
- Those models that do not have a primary key set are bulk inserted
into the database.
A good use case is for example updating employee table from an external
resource, or just loading test data from a file (AKA fixture loading).
If we could use WITHIN PRIMARY KEY, and had better knowledge of which
rows were inserted and which were updated, then the proposed feature
would seem to match the bulk merge's primary-key-set case perfectly.
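The dispatch step of such a bulk merge can be sketched as (illustrative; rows are modeled as (pk, fields) pairs with pk set to None when unset, mirroring Django's convention for unsaved instances):

```python
def partition_for_bulk_merge(rows):
    """Split rows into (upserts, plain_inserts) for a bulk merge.

    rows: list of (pk, fields) pairs, pk is None when unset. Rows with
    a pk would go through the proposed INSERT ... ON DUPLICATE KEY
    UPDATE WITHIN PRIMARY KEY path; rows without one are plain bulk
    inserts.
    """
    upserts = [(pk, fields) for pk, fields in rows if pk is not None]
    plain_inserts = [fields for pk, fields in rows if pk is None]
    return upserts, plain_inserts
```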
Even though MySQL has the ON DUPLICATE KEY UPDATE feature, we haven't
used it yet. From a quick check, it seems we can't use it, because
there is no way to restrict the conflict checking to only the primary
key index of the table.
Some non-Django review comments:
- I am not sure what exactly the "ON CONFLICT UPDATE also optionally
accepts a WITHIN clause..." section of the docs means.
- I am not sure what exactly is supposed to happen if you do:
INSERT INTO author(name, age) VALUES ('Anssi', 33) ON CONFLICT UPDATE
SET name = CONFLICTING(name), age = CONFLICTING(age);
when author has separate unique indexes on name and age, and the table
already contains two rows with values ('Tom', 33) and ('Anssi', 30).
If I am reading the section mentioned in the first item correctly,
then one of the conflicting indexes is chosen as the source of the
conflict, and that row's values are then updated. Is this effectively
random behavior a good API?
- Wild suggestion: Maybe it would be better to default to the PRIMARY
KEY index of the table. If no PK index exists, the user must specify
which unique index to use. Maybe it shouldn't be possible to specify
more than one unique index?
- I assume there is a good reason to use CONFLICTING(id) instead of
CONFLICTING.id in the syntax?
- I didn't see tests for expression or partial indexes in the patches.
Are partial unique indexes supported?
- Anssi