non-concurrent QuerySet.get_or_create() on Postgresql

142 views
Skip to first unread message

Дилян Палаузов

unread,
Oct 1, 2017, 2:41:09 PM10/1/17
to django-developers
Hello,

currently get_or_create(params) is implemented (imprecisely) this way:

try:
self.get(params)
except DoesNotExist:
self._create_object_from_params(params)

This creates concurrency problem, as the object might get created by
another thread, after get(params) threw an exception and before
_create_object_from_params has started.

Luckily Postgresql lets combine the above statements into a single
structured query. This should be a performance gain as for the DB is
faster to execute one than a two queries.

Consider this:

CREATE TABLE t (
id SERIAL PRIMARY KEY,
name VARCHAR(10) NOT NULL UNIQUE,
comment VARCHAR(10));

And here comes the magic:

WITH
to_be_inserted AS (SELECT 'nameD' as "name", 'comment13' as "comment"),
just_inserted AS (
INSERT INTO t (name, comment) SELECT * FROM to_be_inserted
WHERE NOT EXISTS(
SELECT * FROM t WHERE t.name='nameD')
RETURNING *)
SELECT *, FALSE as "created" FROM t WHERE t.name='nameD' UNION ALL
SELECT *, TRUE AS "created" FROM just_inserted;

where "to_be_inserted contains" the values for the new object ('default'
parameter of get_or_create) and 'nameB' is the criterion passed to get().

Regards
Дилян

Ran Benita

unread,
Oct 12, 2017, 1:04:54 PM10/12/17
to django-d...@googlegroups.com
Have you drilled down to `self._create_object_from_params(params)`? It does handle this case, as follows:

try:
    with transaction.atomic(using=self.db):
        params = {k: v() if callable(v) else v for k, v in params.items()}
        obj = self.create(**params)
    return obj, True
except IntegrityError:
    exc_info = sys.exc_info()
    try:
        return self.get(**lookup), False
    except self.model.DoesNotExist:
        pass
    raise exc_info[0](exc_info[1]).with_traceback(exc_info[2])

That said, I do think that doing it in one single atomic query has advantages (for one, the above is still not atomic, though the failure case in the end is much less likely). But a complicated postgres-specific query is perhaps not worth it. If you can think of a generic method, that works with READ COMMITTED, that would be interesting. Some conditions I would personally impose are:

1. Should not increment the primary key sequence in the `get` case. This is not critical for correctness, I just like to avoid these holes :)
2. The `get` case should remain fast. In the current code, `get_or_create` is the same as `get` when the row already exists. In many (most?) scenarios this is the 99% case.

Дилян Палаузов

unread,
Oct 16, 2017, 12:46:48 PM10/16/17
to django-d...@googlegroups.com, Ran Benita
Hello

The problem was, that get_or_create() was called within a transaction and at the end of the transaction the INSERT caused IntegrityError, as in the meantime another transaction has finished, that inserted the same row but with a different id. I have wrongly concluded, that repeating get_or_create() would help. By coincidence, after inserting a second get_or_create() I have never seen the problem again.

> That said, I do think that doing it in one single atomic query has advantages (for one, the above is still not atomic, though the failure case in the end is much less likely). But a complicated postgres-specific query is perhaps not worth it. If you can think of a generic method, that works with READ COMMITTED, that would be interesting. Some conditions I would personally impose are:
>
> 1. Should not increment the primary key sequence in the `get` case. This is not critical for correctness, I just like to avoid these holes :)
What I proposed does not increment the sequence.

> 2. The `get` case should remain fast. In the current code, `get_or_create` is the same as `get` when the row already exists. In many (most?) scenarios this is the 99% case.

I think my proposal is also quite fast, assuming that WHERE NOT EXISTS(SELECT * FROM -already fetched tabled-) is fast, and UNION ALL with with two tables, having a total of one row, is also fast.

Regards
Dilian

Ran Benita

unread,
Oct 16, 2017, 1:28:16 PM10/16/17
to Дилян Палаузов, django-d...@googlegroups.com
The code tries to handle a scenario like the following (of course, the statements can be relatively ordered in different ways). Can you describe how the scenario which fails for you looks like? I am assuming you are using READ COMMITTED, and that the lookup fields are unique together.

THREAD1                                       THREAD2
                                              SELECT => not found
SELECT => not found                          
                                              BEGIN
                                              INSERT => success
                                              COMMIT
BEGIN
INSERT => integrity error
ROLLBACK
(enter except IntegrityError)
SELECT => ??? (should be success)
Reply all
Reply to author
Forward
0 new messages