Problems with concurrent DB access and get_or_create()


Travis Terry

Dec 4, 2007, 1:08:51 PM
to django-d...@googlegroups.com
I've run into a problem with get_or_create() with respect to concurrent
access of the DB, and I have looked at the list archives for advice. I
found some discussions a while back regarding others' problems, but no
acceptable solution was ever implemented. I have another proposed
solution that I thought I should throw out there to see if anyone likes
it better. First, let me restate the problem:

Two threads/processes/servers (let's call them P1 and P2) need to
concurrently create a unique object:
1. P1 calls get_or_create(), which tries to get the item (it doesn't exist)
2. P2 calls get_or_create(), which tries to get the same item (it
doesn't exist)
3. P1's get_or_create() tries to create the item (this works and returns
the item)
4. P2's get_or_create() tries to create the item. One of two things
happens:
   a. a second item is created with the same parameters, if this
      doesn't violate a UNIQUE constraint
   b. the second create fails (because of a UNIQUE constraint) and
      raises an exception

In the case of 4a, a future get() or get_or_create() call will assert
because multiple values have been returned. In the case of 4b, the
caller will need to catch the exception and (since the exception
probably means there was a concurrent create) most likely try to get the
object again.
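(To make case 4b concrete, here is roughly what every call site has to
do today to survive the race; this is just a sketch -- Tag and add_tag
are made-up names, and IntegrityError stands in for whatever
duplicate-key error the backend raises.)

from django.db import IntegrityError
from myapp.models import Tag  # hypothetical model with a unique name field

def add_tag(name):
    try:
        tag, created = Tag.objects.get_or_create(name=name)
    except IntegrityError:
        # Case 4b: a concurrent create() won the race, so the row
        # now exists and a plain get() will find it.
        tag, created = Tag.objects.get(name=name), False
    return tag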

Previous proposals to address this issue involved adding either a thread
lock or a DB table lock around the get_or_create() call. Both of these
are unacceptable. The thread lock does nothing to prevent the problem
when using multiple front-end servers, and the DB lock is just plain bad
for performance.

It seems reasonable to require that the model be designed with
unique_together=(...) on the fields that are used in the
get_or_create() call. This will allow the DB to prevent duplicates from
being created. Thus the only code change needed to make get_or_create()
always return the correct object is to call get() again in the event of
an exception from create().
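(For illustration, a model declared this way -- the Subscription model
and its fields are made up for the example:)

from django.db import models

class Subscription(models.Model):
    email = models.EmailField()
    list_name = models.CharField(max_length=100)

    class Meta:
        # The DB will now reject a duplicate (email, list_name) pair,
        # so a concurrent duplicate create() raises an error instead
        # of silently inserting a second row.
        unique_together = (('email', 'list_name'),)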

Pseudo-code
-----------
def get_or_create(self, **kwargs):
    try:
        obj = self.get(**kwargs)
    except self.model.DoesNotExist:
        try:
            obj = self.create(**kwargs)
        except IntegrityError:
            # A concurrent create() beat us to it; the row now
            # exists, so this second get() will succeed.
            obj = self.get(**kwargs)
    return obj

This solution is based on the following assumptions:

1. We always want get_or_create() to return the object we're looking for.
2. MOST of the time the object will exist, so calling get() first gives
the best performance.
3. Occasionally the object will not exist and may be created
concurrently by multiple threads/processes/servers. In this case the
second get() is no more expensive than the get() the caller would have
to make anyway when handling the exception.

This solution has no performance penalty in the "normal" case and takes
full advantage of the DB's data integrity enforcement.

If this solution is favorable, I'll create a ticket with the patch and
tests.

Travis

James Bennett

Dec 4, 2007, 1:27:44 PM
to django-d...@googlegroups.com
On 12/4/07, Travis Terry <te...@topdog-tech.com> wrote:
> Previous proposals to address this issue involved adding either a thread
> lock or a DB table lock around the get_or_create() call. Both of these
> are unacceptable. The thread lock does nothing to prevent the problem
> when using multiple front-end servers, and the DB lock is just plain bad
> for performance.

Ultimately, the database is the only location from which you can solve
this problem, because only the database can reliably know when these
situations are occurring. Solutions implemented outside the DB are
essentially doomed from the start.

Similarly, an application which experiences such high levels of
concurrent writes that this sort of situation is likely to occur is
not common enough to warrant rewriting an otherwise useful shortcut.

> It seems reasonable to require that the model be designed with
> unique_together=(...) on the fields that are used in the get_or_create() call.

No, it's completely unreasonable. Other DB API methods -- filter(),
get(), values(), etc. -- don't require the model to have been designed
in a certain way, and get_or_create() should not break with them by
doing so. Furthermore, you will *still* receive an exception from the
DB on an attempt to create two similar objects at once.

> This will allow the DB to prevent duplicates from being created.

This is one, and only one, way of allowing the DB to detect the problem situation.

> Thus
> the only code change needed to make get_or_create() always return the
> correct object is to call get() again in the event of an exception from
> create().

Which will work until a third process comes along and tries to
get_or_create() at the same time, at which point you'll be back
proposing that we catch again and get() again.

Which will work until a fourth process comes along and tries to
get_or_create() at the same time, at which point you'll be back
proposing that we catch again and get() again.

Which will work until a fifth process comes along and tries to
get_or_create() at the same time, at which point you'll be back
proposing that we catch again and get() again.

And at some point we'll just overflow the call stack.


--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."

Leo Soto M.

Dec 4, 2007, 2:09:57 PM
to django-d...@googlegroups.com
On Dec 4, 2007 3:08 PM, Travis Terry <te...@topdog-tech.com> wrote:
>
> I've run into a problem with get_or_create() with respect to concurrent
> access of the DB, and I have looked at the list archives for advice. I
> found some discussions a while back regarding others' problems, but no
> acceptable solution was ever implemented.

Isn't the serializable transaction isolation level[1] what you are looking for?

[1] http://www.postgresql.org/docs/7.4/interactive/transaction-iso.html#XACT-SERIALIZABLE
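(For concreteness, switching a transaction to that level from Django
could look like this raw-SQL sketch; in PostgreSQL the statement must
come before any other query in the transaction.)

from django.db import connection

cursor = connection.cursor()
# Standard SQL; PostgreSQL requires this to be executed before any
# other statement in the current transaction.
cursor.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")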
--
Leo Soto M.
http://blog.leosoto.com

Travis Terry

Dec 4, 2007, 2:11:08 PM
to django-d...@googlegroups.com


James Bennett wrote:
> Ultimately, the database is the only location from which you can solve
> this problem, because only the database can reliably know when these
> situations are occurring. Solutions implemented outside the DB are
> essentially doomed from the start.
>
> Similarly, an application which experiences such high levels of
> concurrent writes that this sort of situation is likely to occur is
> not common enough to warrant rewriting an otherwise useful shortcut.

I agree that for most instances (especially on smaller sites with a
single server), a concurrent write is not going to happen. However,
since it *can* happen, it *should* (must?) be caught to prevent errors
going back to the user, and the solution I proposed only adds overhead
in the case when the event happens. It does not "rewrite an otherwise
useful shortcut" or in fact even change anything in the "normal" case.
We use 50 front-end servers to serve a very high-volume site, and we
will have concurrency issues.

The reason I proposed this solution is that EVERY place I use
get_or_create(), I will have to handle a potential exception on the
concurrent write and get the object again anyway. If it's something I
have to do on every call to get_or_create(), and it's likely that
anyone else with a sufficiently busy site will have to do something
similar, the core get_or_create() is the best place to handle it.

>> It seems reasonable to require that the model be designed with
>> unique_together=(...) on the fields that are used in the
>> get_or_create() call.
>
> No, it's completely unreasonable. Other DB API methods -- filter(),
> get(), values(), etc. -- don't require the model to have been designed
> in a certain way, and get_or_create() should not break with them by
> doing so. Furthermore, you will *still* receive an exception from the
> DB on an attempt to create two similar objects at once.

Well, I'm not saying that you "couldn't" use get_or_create() without
the unique_together attribute, just that if you want it to behave
sanely in the case of a concurrent create you "should." If you
"intend" to have duplicates on those fields (by not using
unique_together), you CAN'T use the get() or get_or_create() syntax
since they ASSERT when more than one item is returned.

>> This will allow the DB to prevent duplicates from being created.
>
> This is one, and only one, way of allowing the DB to detect the
> problem situation.

My point.

>> Thus the only code change needed to make get_or_create() always
>> return the correct object is to call get() again in the event of an
>> exception from create().
>
> Which will work until a third process comes along and tries to
> get_or_create() at the same time, at which point you'll be back
> proposing that we catch again and get() again.
>
> Which will work until a fourth process comes along and tries to
> get_or_create() at the same time, at which point you'll be back
> proposing that we catch again and get() again.
>
> Which will work until a fifth process comes along and tries to
> get_or_create() at the same time, at which point you'll be back
> proposing that we catch again and get() again.
>
> And at some point we'll just overflow the call stack.

Absolutely *not* true. No matter how many processes there are, if the
create() fails, that means someone else already created the object
since your thread tried to get() it. Thus a second get() with the same
parameters (which is how get_or_create() works) is sufficient to get
the object. Yes, many threads (in my case possibly thousands) might
have to do a second get() when there is a mad rush on the view that
first creates the object, but there is no possibility of any one of
them overflowing their individual stacks.

Travis

Travis Terry

Dec 4, 2007, 2:21:09 PM
to django-d...@googlegroups.com
Unfortunately, no.  I understand transaction isolation levels.  By the way, I'm using MySQL InnoDB, which has similar isolation level settings to PostgreSQL.  Regardless of the isolation level, one of the two concurrent creates will fail.  They are atomic and add data that will violate a UNIQUE constraint.  My proposal is to fold a common handler into get_or_create() so that it will always return the object that you tried to get/create, in keeping with the DRY principle.  Otherwise, if it lets the exception escape, then everywhere I call get_or_create() will have to implement a very non-DRY piece of identical code to handle the situation.

Travis

Leo Soto M.

Dec 4, 2007, 2:41:12 PM
to django-d...@googlegroups.com
On Dec 4, 2007 4:21 PM, Travis Terry <te...@topdog-tech.com> wrote:
>
> Unfortunately, no. I understand transaction isolation levels. By the way,
> I'm using MySQL InnoDB, which has similar isolation level settings to
> PostgreSQL. Regardless of the isolation level, one of the two
> concurrent creates will fail. They are atomic and add data that will
> violate a UNIQUE constraint. My proposal is to fold a common handler
> into get_or_create() so that it will always return the object that you tried
> to get/create, in keeping with the DRY principle. Otherwise, if it lets the
> exception escape, then everywhere I call get_or_create() will have to
> implement a very non-DRY piece of identical code to handle the situation.

No. You can use a simple external wrapper. Something along the lines of:

def get_or_create_object(klass, *args, **kwargs):
    # klass is the model class; retry until get_or_create() succeeds.
    while True:
        try:
            return klass.objects.get_or_create(*args, **kwargs)
        except TheExceptionThatShouldNotEscape:  # placeholder name
            pass

Note that I tend to agree that Django should provide some way to deal
with these situations. My point is that if the core developers don't
agree with us, we aren't forced to be non-DRY.

Jordan Levy

Dec 4, 2007, 2:45:44 PM
to django-d...@googlegroups.com
On Dec 4, 2007 11:21 AM, Travis Terry <te...@topdog-tech.com> wrote:
> Otherwise, if it lets the exception escape, then everywhere I call
> get_or_create() will have to implement a very non-DRY piece of
> identical code to handle the situation.
>
> Travis

Couldn't you implement your own DRY solution?  Write a decorator for the get_or_create() method that catches exceptions thrown during the call to create() and calls get() again.

Jordan

Marty Alchin

Dec 4, 2007, 2:42:09 PM
to django-d...@googlegroups.com
On Dec 4, 2007 2:21 PM, Travis Terry <te...@topdog-tech.com> wrote:
> exception escape, then everywhere I call get_or_create() will have to
> implement a very non-DRY piece of identical code to handle the situation.

I won't get into the issue of whether this should or shouldn't be in
core, but you most certainly don't need any non-DRY code anywhere you
use get_or_create(). Remember, it's a manager method, and you're
already able to override the default manager. Sure, it's still more
overhead in your code than if Django did it for you, but there's no
need to be melodramatic.

from django.db import models, IntegrityError

class ConcurrentManager(models.Manager):
    def get_or_create(self, **kwargs):
        try:
            return super(ConcurrentManager, self).get_or_create(**kwargs)
        except IntegrityError:
            # Lost the race to a concurrent create(); the row exists now.
            lookup = dict(kwargs)
            lookup.pop('defaults', None)  # get() doesn't accept 'defaults'
            return self.get(**lookup), False

class MyModel(models.Model):
    # fields go here
    objects = ConcurrentManager()

You can set up the manager code once, then simply import it and slap
it on whichever models you expect to have concurrency problems. Then
all the rest of your code behaves as if it was included in trunk. And
if you're already using custom managers, just add the above method to
them and enjoy the ride.

That is, unless you're using get_or_create() on a QuerySet instead of
the model's manager, at which point you might have to wait for the
queryset-refactor, since that will make custom QuerySet classes easier
to work with.

Hope this helps!

-Gul

Ivan Sagalaev

Dec 4, 2007, 3:47:49 PM
to django-d...@googlegroups.com
Travis Terry wrote:
> 4. P2's get_or_create() tries to create the item. One of two things
> happens:
> a. a second item is created with the same parameters if this doesn't
> violate a UNIQUE constraint
> b. the second create fails (because of a UNIQUE constraint) and
> raises an exception

Actually there is a common solution to this problem that doesn't create
duplicates and doesn't fail on the second transaction. And as James has
correctly noted, it works at the database level. The solution is a form
of SELECT called SELECT FOR UPDATE. When one transaction selects
something explicitly for update, any other transaction trying to do the
same thing will wait until the first one ends. I.e. it works like a lock
for threads, but since it works in the database it works for multiple
Python processes that otherwise don't know anything about each other.
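(A minimal sketch of the pattern through a raw cursor; the table and
column names are made up, and the statement must run inside a
transaction:)

from django.db import connection

def claim_counter(name):
    cursor = connection.cursor()
    # A second transaction issuing the same SELECT ... FOR UPDATE on
    # this row blocks here until the first one commits or rolls back.
    cursor.execute(
        "SELECT id, value FROM myapp_counter WHERE name = %s FOR UPDATE",
        [name])
    return cursor.fetchone()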

The good part is that SELECT FOR UPDATE is implemented in MySQL,
PostgreSQL and Oracle. I recall Malcolm once said that Adrian
expressed a desire to have this in Django, and it might happen after
the queryset refactoring. Malcolm, Adrian, please confirm whether this
is correct or I'm just hallucinating :-)

Travis Terry

Dec 4, 2007, 4:07:33 PM
to django-d...@googlegroups.com
Ivan Sagalaev wrote:
>
> Actually there is a common solution to this problem that doesn't create
> duplicates and doesn't fail on the second transaction. And as James has
> correctly noted, it works at the database level. The solution is a form
> of SELECT called SELECT FOR UPDATE. When one transaction selects
> something explicitly for update, any other transaction trying to do the
> same thing will wait until the first one ends. I.e. it works like a lock
> for threads, but since it works in the database it works for multiple
> Python processes that otherwise don't know anything about each other.
>
> The good part is that SELECT FOR UPDATE is implemented in MySQL,
> PostgreSQL and Oracle. I recall Malcolm once said that Adrian
> expressed a desire to have this in Django, and it might happen after
> the queryset refactoring. Malcolm, Adrian, please confirm whether this
> is correct or I'm just hallucinating :-)
>
>
It's my understanding that SELECT ... FOR UPDATE only locks the rows
that it reads (so you can be sure they can be updated or referenced
later in the same transaction). However, in the case of get_or_create(),
there is no existing row, so there's nothing to lock and thus the
problem still exists. If it has a fallback to locking the whole table
when no row is found (which I don't see in the docs), then it would have
worse performance in the create case, since I'm only interested in
preventing duplicates, not serializing the processes.

Travis

Ivan Sagalaev

Dec 4, 2007, 4:31:10 PM
to django-d...@googlegroups.com
Travis Terry wrote:
> It's my understanding that SELECT ... FOR UPDATE only locks the rows
> that it reads (so you can be sure they can be updated or referenced
> later in the same transaction). However, in the case of get_or_create(),
> there is no existing row

Oh... Indeed. Then this won't help here, agreed.
