RFC: "UPSERT" in PostgreSQL

454 views
Skip to first unread message

Peter Geoghegan

unread,
Sep 27, 2014, 8:44:03 PM9/27/14
to django-d...@googlegroups.com, Jacob Kaplan-Moss
Hello,

I am a PostgreSQL major contributor, currently undertaking development
of a feature sometimes called "UPSERT" for PostgreSQL.

Jacob Kaplan-Moss is an acquaintance and co-worker, and I know that
it's certainly something that interests him personally - presumably he
has some ideas on how an UPSERT-like feature could be used by Django.
Beyond that, I don't have a good sense of what the level of interest
is in the Django community. I would like to reach out and get feedback
on the syntax that I've proposed on the pgsql-hackers development
list. Certainly, I've put a lot of work into this, and expect to put a
lot more in, so obviously it's important that we not make any missteps
with user-visible semantics. I've used Django a bit in the past, and
my general impression is that the Django development community has
made sound technical decisions, so I'm happy to take your concerns as
representative of the concerns of ORM authors and web frameworks
generally.

This is a non-standard syntax that is somewhat comparable to MySQL's
INSERT ... ON DUPLICATE KEY UPDATE, while being more flexible and
safe. The SQL standard's MERGE statement is kind of weird, and has a
lot of baggage from trying to serve multiple use-cases, so I avoided
trying to implement that, preferring to focus on something that has
simple atomic insert-or-update semantics - what people seem to
actually want.

I would like to get a sense of:

* Would you consider the syntax that I've proposed a good one?

* If it was available, would you use it in future versions of Django?
Do you think the plugins ecosystem could take advantage of it sooner?
Is the demand there?

* What could be better about the feature?

* How are you using the MySQL feature (INSERT ... ON DUPLICATE KEY
UPDATE), if at all? Do you think it would be useful to have some
degree of compatibility with it? Why or why not?

I've built html user-facing documentation, which is publicly
available. This page is probably of most interest -- there is
extensive discussion of the new "ON CONFLICT UPDATE/IGNORE" clause:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/sql-insert.html

These may be of some interest (search for "on conflict" to see the new
references to the feature):

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/transaction-iso.html
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/ddl-inherit.html
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/sql-createrule.html
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/trigger-definition.html
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/sql-createtrigger.html
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/index-unique-checks.html
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/sql-createview.html
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/on-conflict-docs/postgres-fdw.html

If you feel like actually trying out my patch, you can download the
latest version of the patchset from:

http://www.postgresql.org/message-id/CAM3SWZRvkCKc=1Y6_Wn8mk97_Vi8+j-aX-RY-=msrJVU...@mail.gmail.com

You'll need to build PostgreSQL from the git master branch (which
includes dependencies on things like Flex and Bison), with the patch
set applied. There are guides to doing this on the internet, including
this one: http://www.postgresql.org/docs/devel/static/sourcerepo.html

Thanks!
--
Peter Geoghegan

Alex Gaynor

unread,
Sep 27, 2014, 8:49:03 PM9/27/14
to django-d...@googlegroups.com, Jacob Kaplan-Moss
Hi Peter,

Thanks for reaching out about this!

From the docs, the syntax looks pretty reasonable, it would be nice to see what some examples look like though.
 
* If it was available, would you use it in future versions of Django?
Do you think the plugins ecosystem could take advantage of it sooner?
Is the demand there?

Yes, I think so: https://docs.djangoproject.com/en/dev/ref/models/querysets/#update-or-create exists, but right now it's manually implemented using savepoints (IIRC) -- I think strictly speaking it isn't safe. 

I think even under MySQL it doesn't use the native support for this feature, having it in PostgreSQL as well would be good motivation.
 
--
Peter Geoghegan

--
You received this message because you are subscribed to the Google Groups "Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers.
To view this discussion on the web visit https://groups.google.com/d/msgid/django-developers/CAM3SWZSAAK-nSpBuFDpz5KDGPJxNRAE7OGe2sf30Zu0UYCkwNA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Cheers,
Alex

--
"I disapprove of what you say, but I will defend to the death your right to say it." -- Evelyn Beatrice Hall (summarizing Voltaire)
"The people's good is the highest law." -- Cicero
GPG Key fingerprint: 125F 5C67 DFE9 4084

Shai Berger

unread,
Sep 27, 2014, 9:22:51 PM9/27/14
to django-d...@googlegroups.com
Hi Peter,

Thanks for asking!

On Sunday 28 September 2014 02:01:59 Peter Geoghegan wrote:
>
> * Would you consider the syntax that I've proposed a good one?
>

Looks pretty reasonable.

> * If it was available, would you use it in future versions of Django?
> Do you think the plugins ecosystem could take advantage of it sooner?
> Is the demand there?
>

I agree with Alex here. Probably will get used in Django. However, Django
tries to be compatible with all currently supported versions of PG; so I will
be a little surprised if it gets used before PG 9.4's EOL.

Plugins (or, as we call them, apps) in the ecosystem will be at liberty to use
it much sooner, of course, but for them there are strong incentives to stay
with the cross-database API. The next release, Django 1.8, will probably
include a library to support PG-only development, so such incentives will be a
little weaker there.

> * What could be better about the feature?
>

Upon reading the docs, I was a little surprised to see that in terms of
triggers etc, the operation is always considered an INSERT. I would expect it
to be considered an insert for BEFORE INSERT or INSTEAD OF INSERT triggers,
but if conflict resolution turns it into an UPDATE, I'd expect to see it
handled as an UPDATE from that point on (definitely INSTEAD OF UPDATE and AFTER
UPDATE triggers, maybe even BEFORE UPDATE). That is the semantics we have now
(this is a general remark, not particulary Django-oriented; Django does not
use triggers on PG as far as I know, and only uses them elsewhere to implement
serial keys).

Could you explain the reasoning behind the described behavior?

Thanks,
Shai.

Peter Geoghegan

unread,
Sep 27, 2014, 9:28:11 PM9/27/14
to django-d...@googlegroups.com, Jacob Kaplan-Moss
Hi Alex,

On Sat, Sep 27, 2014 at 5:48 PM, Alex Gaynor <alex....@gmail.com> wrote:
> Thanks for reaching out about this!

No problem. I'm not sure if you recall, but we actually met before, in
San Francisco during Waza 2013.

> From the docs, the syntax looks pretty reasonable, it would be nice to see
> what some examples look like though.

There are a couple of examples at the end of the sql-insert.html
INSERT reference page. There could probably stand to be more. There
isn't an example featuring the use of a WHERE clause, though.

The WHERE clause is a bit odd because It is only evaluated on a
conclusively visible, locked row version (and if the predicate isn't
satisfied, the row is still locked, unlike a regular UPDATE). So,
potentially, you could have the predicate satisfied, where the
command's MVCC snapshot would not see a row version that you'd expect
to be satisfied by it. It might just so happen that a later row
version - the latest one - did happen to be satisfied by the predicate
at the same time, because some other session updated the row in a way
that made that the case. This is kind of esoteric, but is a new thing.
Although there have always been similar odd things for regular
UPDATEs. It's also kind of a new thing that you can update something
that is committed, and exists, even though *no* version is visible to
the command's MVCC snapshot. You have to be using READ COMMITTED for
this to happen, though, and if you are, the next command will have a
new snapshot anyway, and so it's probably a pretty academic concern
for Django people in practice.

UPSERT is more or less about reconciling two different ideas of
reality: The snapshot isolation idea, and the physical, objective
reality of whether or not an index tuple exists in a unique index at
this time. Spookiness like this is inevitable, unless you throw an
error instead, which I strongly consider to be a cop-out. If you
wanted that, you'd have used a higher isolation level.

> Yes, I think so:
> https://docs.djangoproject.com/en/dev/ref/models/querysets/#update-or-create
> exists, but right now it's manually implemented using savepoints (IIRC) -- I
> think strictly speaking it isn't safe.

I recall talking to a client about a race condition within
get_or_create() back in my consulting days.

The rule of thumb with the subtransaction looping pattern is that you
have to have a unique index defined to determine when a duplicate is
inserted according to your particular idea of a duplicate (obviously),
and you must be prepared to retry. So typically you update, then you
insert. Loop until one of those two works. If you just try both once,
there are still race conditions.

What I think would be particularly interesting to hear is your level
of interest in the optional "WITHIN unique_index_name" clause. The
idea here is that you can name a unique index directly, and have that
explicitly be the place where the violation is expected, if anywhere
(there ought to be an expectation of having such a violation be in one
place, or else surprising things can happen).

Now, I'm sure the first thing you'll say, as others have is: "Can't it
be the name of some columns instead, so that Postgres figures out
which unique index I mean"? It turns out that that's quite ticklish in
certain edge cases (e.g. partial unique indexes with BEFORE triggers).
We might come up with a better way that's fully general, but I'm not
holding my breath.

> I think even under MySQL it doesn't use the native support for this feature,
> having it in PostgreSQL as well would be good motivation.

Cool.

--
Peter Geoghegan

Peter Geoghegan

unread,
Sep 27, 2014, 9:33:17 PM9/27/14
to django-d...@googlegroups.com
On Sat, Sep 27, 2014 at 6:22 PM, Shai Berger <sh...@platonix.com> wrote:
> Upon reading the docs, I was a little surprised to see that in terms of
> triggers etc, the operation is always considered an INSERT. I would expect it
> to be considered an insert for BEFORE INSERT or INSTEAD OF INSERT triggers,
> but if conflict resolution turns it into an UPDATE, I'd expect to see it
> handled as an UPDATE from that point on (definitely INSTEAD OF UPDATE and AFTER
> UPDATE triggers, maybe even BEFORE UPDATE). That is the semantics we have now
> (this is a general remark, not particulary Django-oriented; Django does not
> use triggers on PG as far as I know, and only uses them elsewhere to implement
> serial keys).
>
> Could you explain the reasoning behind the described behavior?

per-row BEFORE triggers from insert will fire even when we go to
update (but not AFTER triggers from INSERT). We need to know what
we're actually inserting before inserting/deciding to update. So
BEFORE triggers must fire, because in principle (and, indeed,
typically), they do modify the tuple being inserted. At the same time,
that's user-defined code often entirely outside our ability to
nullify.

The statement-level trigger stuff (i.e. the idea that ON CONFLICT
UPDATE never fires an UPDATE statement level trigger) is for
consistency with user-defined rules, where we're really compelled to
have INSERT...ON CONFLICT UPDATE just be seen as an UPDATE. But I
admit that that isn't a very satisfactory state of affairs for people
who are using per-statement triggers for auditing and things like
that, and I wouldn't be surprised if that was revised.
--
Peter Geoghegan

Peter Geoghegan

unread,
Sep 27, 2014, 9:41:22 PM9/27/14
to django-d...@googlegroups.com
On Sat, Sep 27, 2014 at 6:33 PM, Peter Geoghegan <p...@heroku.com> wrote:
> The statement-level trigger stuff (i.e. the idea that ON CONFLICT
> UPDATE never fires an UPDATE statement level trigger) is for
> consistency with user-defined rules, where we're really compelled to
> have INSERT...ON CONFLICT UPDATE just be seen as an UPDATE


Excuse me. I mean where we're really compelled to have INSERT...ON
CONFLICT UPDATE just be seen as an **INSERT**.

--
Peter Geoghegan

Simon Riggs

unread,
Sep 28, 2014, 2:33:48 AM9/28/14
to django-developers
On 28 September 2014 02:22, Shai Berger <sh...@platonix.com> wrote:

> Upon reading the docs, I was a little surprised to see that in terms of
> triggers etc, the operation is always considered an INSERT. I would expect it
> to be considered an insert for BEFORE INSERT or INSTEAD OF INSERT triggers,
> but if conflict resolution turns it into an UPDATE, I'd expect to see it
> handled as an UPDATE from that point on (definitely INSTEAD OF UPDATE and AFTER
> UPDATE triggers, maybe even BEFORE UPDATE). That is the semantics we have now
> (this is a general remark, not particulary Django-oriented; Django does not
> use triggers on PG as far as I know, and only uses them elsewhere to implement
> serial keys).

Hmm, good thinking.

So it should act like this?

BEFORE INSERT triggers fire
if CONFLICT then
{
BEFORE UPDATE triggers fire
perform UPDATE
AFTER UPDATE triggers fire
}
else
{
perform INSERT
AFTER INSERT triggers fire
}

INSTEAD OF triggers would fire on views only, so would be shown in the
above instead of the before triggers

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Simon Riggs

unread,
Sep 28, 2014, 2:38:24 AM9/28/14
to django-developers
On 28 September 2014 00:01, Peter Geoghegan <p...@heroku.com> wrote:

> I am a PostgreSQL major contributor, currently undertaking development
> of a feature sometimes called "UPSERT" for PostgreSQL.

Thanks for posting this Peter.

The feature is under review in the PostgreSQL community and we would
like to solicit open feedback about how such a feature might look and
what aspects make it more/less likely to be adopted by Django.

Detailed feedback on ORM related issues is welcome.

Marc Tamlyn

unread,
Sep 28, 2014, 4:11:44 AM9/28/14
to django-d...@googlegroups.com
There will be some postgresql specific features in Django 1.8 (at least array fields, probably also hstore, ranges and jsonb). There are also a couple of new features which have already landed which leverage postgresql specific features when present and fall back otherwise (uuid, interval).

I am in the process of working on some possible changes for how indexes are handled in Django[1] which would make the ability to refer to an index directly possible.

From a brief look at the code for create_or_update it seems reasonable we could inspect the connection and change the code path depending on whether it supports a more reliable approach. As we can also inspect the version (not just which database) I don't see any problem with the postgresql_psycopg2 backend only reporting that it supports_atomic_create_or_update if the version is sufficiently recent.

So in my opinion, yes we would support it in a future version of Django, potentially not that far in the future. If we were to do so, we would also introduce support for the mysql version as well most likely.



--
You received this message because you are subscribed to the Google Groups "Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email to django-develop...@googlegroups.com.
To post to this group, send email to django-d...@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers.

Petite Abeille

unread,
Sep 28, 2014, 4:51:39 AM9/28/14
to django-d...@googlegroups.com

On Sep 28, 2014, at 1:01 AM, Peter Geoghegan <p...@heroku.com> wrote:

> The SQL standard's MERGE statement is kind of weird...

For diversity’s sake, and for the record, I, for one, would definitively rather have the standard MERGE statement instead of yet another ‘weird’ UPSERT concoction peculiar to Postgres.

Just saying :)

Peter Geoghegan

unread,
Sep 28, 2014, 5:21:52 AM9/28/14
to django-d...@googlegroups.com
On Sun, Sep 28, 2014 at 1:51 AM, Petite Abeille
<petite....@gmail.com> wrote:
> For diversity's sake, and for the record, I, for one, would definitively rather have the standard MERGE statement instead of yet another 'weird' UPSERT concoction peculiar to Postgres.

A few people have said that, but then when you look at SQL MERGE in
detail it becomes apparent that it is really intended to serve the
bulk loading use-case. I took the time to delineate the differences
between SQL MERGE and what I call UPSERT in detail [1].

Both Oracle and SQL server have SQL MERGE implementations that promise
nothing about atomicity or concurrency, and are known to have race
conditions when used to implement even the simplest upsert-like
operation. Certainly, the SQL standard has nothing to say about MERGE
and concurrency. Teradata, SAP HANA and MySQL have UPSERT-ish features
that make some guarantees, and they're both non-standard.

[1] http://www.postgresql.org/message-id/flat/CAM3SWZRP0c3g6+aJ=YYDGYAcTZg0xA8-1...@mail.gmail.com#CAM3SWZRP0c3g6+aJ=YYDGYAcTZg0xA8-1...@mail.gmail.com

--
Peter Geoghegan

Petite Abeille

unread,
Sep 28, 2014, 7:44:58 AM9/28/14
to django-d...@googlegroups.com

On Sep 28, 2014, at 11:21 AM, Peter Geoghegan <p...@heroku.com> wrote:

> On Sun, Sep 28, 2014 at 1:51 AM, Petite Abeille
> <petite....@gmail.com> wrote:
>> For diversity's sake, and for the record, I, for one, would definitively rather have the standard MERGE statement instead of yet another 'weird' UPSERT concoction peculiar to Postgres.
>
> A few people have said that, but then when you look at SQL MERGE in
> detail it becomes apparent that it is really intended to serve the
> bulk loading use-case.

A bad case of confirmation bias :D

> I took the time to delineate the differences
> between SQL MERGE and what I call UPSERT in detail [1].

Again, your house, your choice. But it seems a bit self-indulgent to concoct your very own take on MERGE, with baroque syntax, peculiar semantic, and all, just because some abstract aspects of the MERGE specification is not to you liking… rather self-defeating altogether.

Anyway, just my 2¢.




Aymeric Augustin

unread,
Sep 28, 2014, 9:32:20 AM9/28/14
to django-d...@googlegroups.com
On 28 sept. 2014, at 13:44, Petite Abeille <petite....@gmail.com> wrote:

> Again, your house, your choice. But it seems a bit self-indulgent to concoct your very own take on MERGE, with baroque syntax, peculiar semantic, and all, just because some abstract aspects of the MERGE specification is not to you liking... rather self-defeating altogether.

Hello Petite Abeille,

I don’t know how much research you have done in this area, but I hope you’re at least as knowledgeable as Peter if you’re going to talk to him like that.

You’re certainly aware that there have been years of discussions on this very question in the PostgreSQL community. It wasn’t easy to come to a decision.

You’re allowed to disagree with that decision, however:

1) The django-developers mailing list isn’t the place to rehash the debate,
2) The disparaging vocabulary you’re using isn’t acceptable in this forum,
3) Personal attacks based on prejudice will not be tolerated.

Thanks.

--
Aymeric.

Shai Berger

unread,
Sep 28, 2014, 11:20:08 AM9/28/14
to django-d...@googlegroups.com, Simon Riggs
On Sunday 28 September 2014 09:33:06 Simon Riggs wrote:
> On 28 September 2014 02:22, Shai Berger <sh...@platonix.com> wrote:
> > Upon reading the docs, I was a little surprised to see that in terms of
> > triggers etc, the operation is always considered an INSERT. I would
> > expect it to be considered an insert for BEFORE INSERT or INSTEAD OF
> > INSERT triggers, but if conflict resolution turns it into an UPDATE, I'd
> > expect to see it handled as an UPDATE from that point on (definitely
> > INSTEAD OF UPDATE and AFTER UPDATE triggers, maybe even BEFORE UPDATE).
> > That is the semantics we have now (this is a general remark, not
> > particulary Django-oriented; Django does not use triggers on PG as far
> > as I know, and only uses them elsewhere to implement serial keys).
>
> So it should act like this?
>
> BEFORE INSERT triggers fire
> if CONFLICT then
> {
> BEFORE UPDATE triggers fire
> perform UPDATE
> AFTER UPDATE triggers fire
> }
> else
> {
> perform INSERT
> AFTER INSERT triggers fire
> }
>

Yes, that's pretty much what I was expecting.

Petite Abeille

unread,
Sep 28, 2014, 12:49:17 PM9/28/14
to django-d...@googlegroups.com

On Sep 28, 2014, at 3:32 PM, Aymeric Augustin <aymeric....@polytechnique.org> wrote:

> You're allowed to disagree with that decision, however:

You are right. Apologies about that. Moving along.

Simon Riggs

unread,
Sep 28, 2014, 1:57:44 PM9/28/14
to django-developers
On 28 September 2014 14:32, Aymeric Augustin
<aymeric....@polytechnique.org> wrote:
> On 28 sept. 2014, at 13:44, Petite Abeille <petite....@gmail.com> wrote:
>
>> Again, your house, your choice. But it seems a bit self-indulgent to concoct your very own take on MERGE, with baroque syntax, peculiar semantic, and all, just because some abstract aspects of the MERGE specification is not to you liking... rather self-defeating altogether.
>
> Hello Petite Abeille,
>
> I don't know how much research you have done in this area, but I hope you're at least as knowledgeable as Peter if you're going to talk to him like that.

The reason to post here was to receive opinions from the Django
community, so those comments are welcome.

> You're certainly aware that there have been years of discussions on this very question in the PostgreSQL community. It wasn't easy to come to a decision.

And the Django community's opinions might differ, for valid reasons.

> You're allowed to disagree with that decision, however:
>
> 1) The django-developers mailing list isn't the place to rehash the debate,
> 2) The disparaging vocabulary you're using isn't acceptable in this forum,
> 3) Personal attacks based on prejudice will not be tolerated.

Well, for 1) any comments are most welcome. I didn't see any evidence
of the other two.

All opinions welcome.

Aymeric Augustin

unread,
Sep 28, 2014, 2:31:11 PM9/28/14
to django-d...@googlegroups.com
On 28 sept. 2014, at 19:57, Simon Riggs <si...@2ndQuadrant.com> wrote:

>> 1) The django-developers mailing list isn't the place to rehash the debate,
>> 2) The disparaging vocabulary you're using isn't acceptable in this forum,
>> 3) Personal attacks based on prejudice will not be tolerated.
>
> Well, for 1) any comments are most welcome.

provided they relate to Django and are based on facts rather than rhetoric :)

As explained above, Django could take advantage of ON DUPLICATE to
implement create_or_update with good database level guarantees.

It isn’t clear what Django could do with MERGE. Maybe a suggestion will
come later in the thread.

At this point, as far as Django is concerned, Peter’s proposal aligns well
with what our users ask for.

> I didn't see any evidence of the other two.

Perhaps comparing the email I reacted to with the following proposal will help:

“Obviously the PostgreSQL project can make its own choices. But I’m surprised
that you’re proposing non-standard syntax and semantics, considering
PostgreSQL's tradition to comply with SQL standards. Even if you disagree with
some aspects of the specification of MERGE, wouldn’t it be valuable to
implement it properly? Since I'm not the most knowledgeable person on this
topic, where can I learn more about the reasons for this choice?”

(I’ve had to add one idea to base the argumentation on at least one fact.
Otherwise the ideas are the same in the same order.)

--
Aymeric.




Peter Geoghegan

unread,
Sep 28, 2014, 3:16:01 PM9/28/14
to django-d...@googlegroups.com
On Sun, Sep 28, 2014 at 4:44 AM, Petite Abeille
<petite....@gmail.com> wrote:
> Again, your house, your choice. But it seems a bit self-indulgent to concoct your very own take on MERGE, with baroque syntax, peculiar semantic, and all, just because some abstract aspects of the MERGE specification is not to you liking... rather self-defeating altogether.

Self-indulgent? I thought that I did it for the reasons that I
described in enormous detail in the mail I linked to.

Here is how you make Oracle do an upsert reliably:

https://stackoverflow.com/questions/237327/oracle-how-to-upsert-update-or-insert-into-a-table/22777749#22777749

Would you be happier with that?

--
Peter Geoghegan

Simon Riggs

unread,
Sep 28, 2014, 3:38:57 PM9/28/14
to django-developers
On 28 September 2014 19:30, Aymeric Augustin
<aymeric....@polytechnique.org> wrote:

> "Obviously the PostgreSQL project can make its own choices. But I'm surprised
> that you're proposing non-standard syntax and semantics, considering
> PostgreSQL's tradition to comply with SQL standards. Even if you disagree with
> some aspects of the specification of MERGE, wouldn't it be valuable to
> implement it properly? Since I'm not the most knowledgeable person on this
> topic, where can I learn more about the reasons for this choice?"

A fair point.

Petite Abeille

unread,
Sep 28, 2014, 3:45:02 PM9/28/14
to django-d...@googlegroups.com

On Sep 28, 2014, at 9:15 PM, Peter Geoghegan <p...@heroku.com> wrote:

> Would you be happier with that?

I would be happier with a clean MERGE statement implementation in Postgres, yes. And yes, I have followed Postgres' agonizing debate over the years about the subject. But I personally think it has been framed in the wrong light since the very beginning and that Postgres has convince itself that it somehow cannot support MERGE. Therefore it will not.

To paraphrase Ian Grigg [1], albeit in a different context: we fixed what we could, not what we should, because it was easier.

Anyway, these where my 2¢. Remember, you asked for feedback :)

[1] http://www.mail-archive.com/cryptography%40metzdowd.com/msg01276.html

Christophe Pettus

unread,
Sep 28, 2014, 3:59:18 PM9/28/14
to django-d...@googlegroups.com

On Sep 28, 2014, at 12:44 PM, Petite Abeille <petite....@gmail.com> wrote:

> Postgres has convince itself that it somehow cannot support MERGE. Therefore it will not.


There's no question that PostgreSQL could support SQL MERGE. But SQL MERGE is not what people are asking for when they ask for UPSERT. PostgreSQL could implement UPSERT and *call it* MERGE (with somewhat different syntax, most likely)... but how that would be better than implementing UPSERT and calling it something that doesn't conflict with existing specification language escapes me.

In short: "Clean MERGE," where I assume "clean" means "with reasonable behavior in the presence of concurrent activity" and MERGE means "the MERGE statement defined in the standard" is a contradiction in terms, and expecting PostgreSQL to square that circle isn't a reasonable request.

--
-- Christophe Pettus
x...@thebuild.com

Peter Geoghegan

unread,
Sep 28, 2014, 4:04:50 PM9/28/14
to django-d...@googlegroups.com
On Sun, Sep 28, 2014 at 12:44 PM, Petite Abeille
<petite....@gmail.com> wrote:
> I would be happier with a clean MERGE statement implementation in Postgres, yes. And yes, I have followed Postgres' agonizing debate over the years about the subject. But I personally think it has been framed in the wrong light since the very beginning and that Postgres has convince itself that it somehow cannot support MERGE. Therefore it will not.

I don't know where you've been getting your information from.

I think Postgres can and should support MERGE too. MERGE isn't harder
-- it's easier, because there is no need to prevent the kind of race
conditions that you get described in the stack overflow thread I
linked to. Oracle's docs describe MERGE as:

"This statement is a convenient way to combine multiple operations. It
lets you avoid multiple INSERT, UPDATE, and DELETE DML statements."

And that's all it is! It's useful for bulk loading/data warehousing,
mostly. I don't quite understand why many people think that MERGE is
useful for implementing UPSERT. It isn't documented to do anything
special in the face of concurrency. I think that there is an
incredible amount of misinformation about this topic floating around.
The various vendors that have a MERGE feature should have clearly
indicated that MERGE isn't useful for implementing an UPSERT, but they
didn't, and so the problem persists.

--
Peter Geoghegan

Anssi Kääriäinen

unread,
Sep 29, 2014, 3:47:59 AM9/29/14
to django-d...@googlegroups.com, Jacob Kaplan-Moss, Peter Geoghegan
I think we could use UPSERT for our .save() logic. Django's save()
method is defined as:
- Insert a row in to the database if there isn't already a matching
row for the saved model's primary key
- Otherwise update the row

This is currently implemented as "try to update, if nothing was updated,
then insert". Naturally, that approach has some race conditions.

There are a couple of problems with the suggested INSERT ... ON
DUPLICATE KEY UPDATE feature for this use case:
1. We need to use WITHIN primary_key_idx_name, but we don't
necessarily know the primary key index name of the table. It would be
extremely useful to have support for WITHIN PRIMARY KEY syntax.
2. We need to know if the row was updated or if it was created
(post_save signal needs this information). Looking at the spec, it seems
this is possible to do by issuing a query:
INSERT ... ON DUPLICATE KEY UPDATE RETURNING primary_key_col;
and then, if nothing is returned, we know it was an update. However, I
see this way as problematic. if PostgreSQL ever wants to allow returning
the updated values on conflict, then using RETURNING primary_key_col;
wouldn't work any more. It seems somewhat likely that somebody will want
to add support for RETURNING for the ON CONFLICT UPDATE case later on.

Using the proposed feature for create_or_update() method isn't that easy
- the problem is that the user is free to specify any filtering for the
model to be upserted. We would need to check if the filtering matches
some unique index exactly, and if it does, then do an INSERT ON
DUPLICATE KEY UPDATE WITHING uq_index_name, but we don't know the index
name.

Still possible use case is some sort of "bulk merge" operation. The
operation would be defined as:
- It takes in a list of model instances
- For those models which have primary key set:
- If the database has a row with same pk, that row is updated
- Otherwise a new row is inserted
- Those models that do not have primary key set are bulk inserted to
the database.

A good use case is for example updating employee table from an external
resource, or just loading test data from a file (AKA fixture loading).

If we could use WITHIN PRIMARY KEY, and have better knowledge of which
rows were inserted and which updated then it seems the proposed feature
would match the bulk merge's primary key set use case perfectly.

Even if MySQL has the ON DUPLICATE KEY UPDATE feature we haven't yet
used it. Quickly checking, it seems we can't use it, because we can't
define to use only the primary key index of the table for conflict
checking.

Some non-Django review comments:
- I am not sure what exactly the "ON CONFLICT UPDATE also optionally
accepts a WITHIN clause..." section of the docs means.
- I am not sure what exactly is supposed to happen if you do:
INSERT INTO author(name, age) VALUES ('Anssi', 33) ON CONFLICT UPDATE
SET name = CONFLICTING(name), age = CONFLICTING(age);
and author has separate unique indexes on name and age, and finally
two rows with values ('Tom', 33), and ('Anssi', 30) exists. If I am
reading the section mentioned in the first item correctly, then one of
the conflicting indexes is chosen as the source of the conflict, and
that row's value is then updated. Is this effectively random behavior a
good API?
- Wild suggestion: Maybe it would be better to default to the PRIMARY
KEY index of the table. If no PK index exists, the user must specify
which unique index to use. Maybe there shouldn't be a possibility to
specify more than one unique index?
- I assume there is a good reason to use CONFLICTING(id) instead of
CONFLICTING.id in the syntax?
- I didn't see tests for expression or partial indexes in the patches.
Are partial unique indexes supported?

- Anssi

shmengie

unread,
Sep 29, 2014, 8:46:14 AM9/29/14
to django-d...@googlegroups.com, ja...@heroku.com
To upsert or to merge.  Sparked a lot more emotion than I would have anticipated.

Both ideas, have similiar functionality, they solve marginally different problems.

Both have the objective of reducing client server traffic involved with pk.

Merge -- bulk loading foreign data -no guaranty

upsert -- insert or update row w/confidence.

Peter, please add the upsert as intended.  After you finishing this objective, if you would kindly consider working on merge, you'd please everyone ;)

Upsert adds a feature, I think more of us can identify and would enable interactive performance gains.

Merge would save some fewer ppl a lot of development effort, for when its time to perform merge, more development effort is required.  I've worked around (the lack of) both features, so I understand the passions that arise.  The features are similar, I would request YOUR additional effort to add Merge after Upsert.

Kudos either way!

-Joe

Carl Meyer

unread,
Sep 29, 2014, 1:11:45 PM9/29/14
to django-d...@googlegroups.com
Of course, it would be better for Django to reliably know index names.
Even though we don't currently, I'm hopeful that
https://github.com/django/deps/pull/6 may get us there by the time we
would be considering these changes anyway, which would address both your
#1 above, and the issues with create_or_update().

Carl

Petite Abeille

unread,
Sep 29, 2014, 2:14:18 PM9/29/14
to django-d...@googlegroups.com

On Sep 29, 2014, at 2:46 PM, shmengie <1st...@gmail.com> wrote:

> Merge -- bulk loading foreign data -no guaranty
>
> upsert -- insert or update row w/confidence.

Sounds like a difference without a distinction.

Let spin it a different way: the MERGE syntax, and broad semantic, is fine as it is, for all use cases, in all its ISO glory. There is no compelling reasons to invent another one. The only point of contention is what guaranty, if any, the MERGE statement provides, which is formally none. The UPSERT statement as suggested identify a narrow use case where it can provide some sort of guaranties about atomicity and race condition [1]. Beautiful. Let this be an implementation detail of MERGE as understood by Postgres: under the right circumstances, Postgres’ MERGE implementation guaranties such and such. This is a compelling implementation advantage. But not one that warrant a different syntax and semantic altogether. My 2¢.

[1] http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/

Anssi Kääriäinen

unread,
Sep 30, 2014, 3:28:30 AM9/30/14
to django-d...@googlegroups.com
On Mon, 2014-09-29 at 11:11 -0600, Carl Meyer wrote:
> Of course, it would be better for Django to reliably know index names.
> Even though we don't currently, I'm hopeful that
> https://github.com/django/deps/pull/6 may get us there by the time we
> would be considering these changes anyway, which would address both your
> #1 above, and the issues with create_or_update().

It will be a long time before we can be sure we know index names for all
models. Currently Django supports usage of models with hand-edited or
legacy database schemas. So, we have no knowledge of primary key index
names for current Django projects. I am not sure how we could get into a
point where we know index names of models without forcing users to
always specify index names.

Some alternative approaches:
- We could introspect models for index names.
- We could have "upsert_on_save" model ._meta flag, if set, then we
assume the primary key index has the default name

The main point is that having WITHIN PRIMARY KEY syntax would make usage
of this feature a lot easier for us.

- Anssi



Peter Geoghegan

unread,
Sep 30, 2014, 3:37:37 AM9/30/14
to django-d...@googlegroups.com
On Tue, Sep 30, 2014 at 12:37 AM, Anssi Kääriäinen
<anssi.ka...@thl.fi> wrote:
> The main point is that having WITHIN PRIMARY KEY syntax would make usage
> of this feature a lot easier for us.

I was thinking about doing that anyway.

--
Peter Geoghegan

Anssi Kääriäinen

unread,
Sep 30, 2014, 3:40:43 AM9/30/14
to django-d...@googlegroups.com, Jacob Kaplan-Moss, Peter Geoghegan
On Mon, 2014-09-29 at 10:56 +0300, I wrote:
> - Wild suggestion: Maybe it would be better to default to the PRIMARY
> KEY index of the table. If no PK index exists, the user must specify
> which unique index to use. Maybe there shouldn't be a possibility to
> specify more than one unique index?

The more I think about this, the more certain I am that defaulting to
use all unique indexes of the table isn't a good default for ON CONFLICT
UPDATE case. Both PostgreSQL's proposed documentation and MySQL's
existing documentation warn users that it is a bad idea to use this
feature when multiple unique indexes exists for the table. So, why make
that behavior default?

OTOH for ignore case defaulting to all unique indexes can be a good
default.

- Anssi

Peter Geoghegan

unread,
Sep 30, 2014, 3:43:44 AM9/30/14
to Anssi Kääriäinen, django-d...@googlegroups.com, Jacob Kaplan-Moss
On Tue, Sep 30, 2014 at 12:49 AM, Anssi Kääriäinen
<anssi.ka...@thl.fi> wrote:
> Both PostgreSQL's proposed documentation and MySQL's
> existing documentation warn users that it is a bad idea to use this
> feature when multiple unique indexes exists for the table. So, why make
> that behavior default?

The MySQL documentation is mostly due to INSERT...ON DUPLICATE UPDATE
completely breaking their statement-based replication. I accept that
there are other hazards, but it's difficult to have a fully general
syntax that indicates user intent WRT the unique index to merge on.

--
Peter Geoghegan
Reply all
Reply to author
Forward
0 new messages