Two threads/processes/servers (let's call them P1 and P2) need to
concurrently create a unique object:
1. P1 calls get_or_create(), which tries to get the item (it doesn't exist)
2. P2 calls get_or_create(), which tries to get the same item (it
doesn't exist)
3. P1's get_or_create() tries to create the item (this works and returns
the item)
4. P2's get_or_create() tries to create the item. One of two things
happens:
a. a second item is created with the same parameters if this doesn't
violate a UNIQUE constraint
b. the second create fails (because of a UNIQUE constraint) and
raises an exception
In the case of 4a, a future get() or get_or_create() call will fail
because multiple matching objects are returned. In the case of 4b, the
caller will need to catch the exception and (since the exception
probably means there was a concurrent create) most likely try to get the
object again.
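For concreteness, here is a minimal sketch of that interleaving. MyModel
and its "name" field are hypothetical, and the barrier is only there to
force the unlucky timing described above:

# Both threads pass the initial get() before either one calls create(),
# so the second create() either duplicates the row (case 4a) or raises
# an IntegrityError (case 4b).
import threading
from myapp.models import MyModel  # hypothetical app and model

barrier = threading.Barrier(2)

def racer():
    try:
        MyModel.objects.get(name="example")      # steps 1 and 2: not found
    except MyModel.DoesNotExist:
        barrier.wait()                           # line both threads up
        MyModel.objects.create(name="example")   # step 3 succeeds; step 4 is 4a or 4b

threads = [threading.Thread(target=racer) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()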
Previous proposals to address this issue involved adding either a thread
lock or a DB table lock around the get_or_create() call. Both of these
are unacceptable. The thread lock does nothing to prevent the problem
when using multiple front-end servers, and the DB lock is just plain bad
for performance.
It seems reasonable to require that the model be designed with
unique_together=(...) on the fields that are used in the get_or_create()
call.
This will allow the DB to prevent duplicates from being created. Thus
the only code change needed to make get_or_create() always return the
correct object is to call get() again in the event of an exception from
create().
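As an illustration (the model and field names here are made up), the
lookup fields would be declared unique at the database level like this:

from django.db import models

class Vote(models.Model):
    # Hypothetical example: at most one Vote per (user, poll) pair.
    user = models.ForeignKey('auth.User')
    poll = models.ForeignKey('Poll')
    choice = models.IntegerField()

    class Meta:
        # The same fields passed to get_or_create(user=..., poll=...) are
        # covered by a UNIQUE constraint, so the DB rejects a duplicate
        # row created by a concurrent caller.
        unique_together = (('user', 'poll'),)

A call like Vote.objects.get_or_create(user=u, poll=p, defaults={'choice': 1})
then cannot silently produce two rows.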
Pseudo-code
-------------
def get_or_create(**kwargs):
    try:
        obj = get(**kwargs)
    except DoesNotExist:
        try:
            obj = create(**kwargs)
        except IntegrityError:
            # A concurrent caller won the race to create(); the row exists now.
            obj = get(**kwargs)
    return obj
This solution is based on the following assumptions:
1. We always want get_or_create() to return the object we're looking for.
2. MOST of the time the object will exist, so calling get() first gives the
best performance.
3. Occasionally the object will not exist and may be created
concurrently by multiple threads/processes/servers. In this case the
second get() is no more expensive than the get() the caller would have
to make anyway when handling the exception.
This solution has no performance penalty in the "normal" case and takes
full advantage of the DB's data integrity enforcement.
If this solution is favorable, I'll create a ticket with the patch and
tests.
Travis
Ultimately, the database is the only location from which you can solve
this problem, because only the database can reliably know when these
situations are occurring. Solutions implemented outside the DB are
essentially doomed from the start.
Similarly, an application which experiences high levels of concurrent
writes, such that this sort of situation is likely to occur, is really
not a common enough case to warrant rewriting an otherwise useful
shortcut.
> It seems reasonable to require that the model be designed with
> unique_together=(...) on the fields that are used in the get_or_create()
> call.
No, it's completely unreasonable. Other DB API methods -- filter(),
get(), values(), etc. -- don't require the model to have been designed
in a certain way, and get_or_create() should not break with them by
doing so. Furthermore, you will *still* receive an exception from the
DB on an attempt to create two similar objects at once.
> This will allow the DB to prevent duplicates from being created.
This is one way -- and only one way -- of allowing the DB to detect the problem situation.
> Thus
> the only code change needed to make get_or_create() always return the
> correct object is to call get() again in the event of an exception from
> create().
Which will work until a third process comes along and tries to
get_or_create() at the same time, at which point you'll be back
proposing that we catch again and get() again.
Which will work until a fourth process comes along and tries to
get_or_create() at the same time, at which point you'll be back
proposing that we catch again and get() again.
Which will work until a fifth process comes along and tries to
get_or_create() at the same time, at which point you'll be back
proposing that we catch again and get() again.
And at some point we'll just overflow the call stack.
--
"Bureaucrat Conrad, you are technically correct -- the best kind of correct."
Isn't the serializable transaction isolation level[1] what you are looking for?
[1] http://www.postgresql.org/docs/7.4/interactive/transaction-iso.html#XACT-SERIALIZABLE
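Just to make the idea concrete, a rough sketch of doing that from Django on
PostgreSQL might look like the following (not a drop-in fix: under
SERIALIZABLE the losing transaction gets a serialization error and still
has to be retried by the application):

from django.db import connection

def get_or_create_serializable(model, **kwargs):
    # Raise the isolation level for the current transaction, then do the
    # usual get-or-create inside it.
    cursor = connection.cursor()
    cursor.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
    return model.objects.get_or_create(**kwargs)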
--
Leo Soto M.
http://blog.leosoto.com
> Otherwise, if it lets the exception escape, then everywhere I call
> get_or_create() will have to implement a very non-DRY piece of identical
> code to handle the situation.

No. You can use a simple external wrapper. Something along the lines of:
def get_or_create_object(klass, *args, **kwargs):
    # Retry instead of letting the concurrency exception escape to every
    # call site (klass is assumed to be a model class here).
    while True:
        try:
            return klass.objects.get_or_create(*args, **kwargs)
        except TheExceptionThatShouldNotEscape:  # placeholder name
            pass
Note that I tend to agree that Django should provide some way to deal
with these situations. My point is that if the core developers don't
agree with us, we aren't forced to be non-DRY.
Otherwise, if it lets the exception escape, then everywhere I call get_or_create() will have to implement a very non-DRY piece of identical code to handle the situation.
Travis
I won't get into the issue of whether this should or shouldn't be in
core, but you most certainly don't need any non-DRY code anywhere you
use get_or_create(). Remember, it's a manager method, and you're
already able to override the default manager. Sure, it's still more
overhead in your code than if Django did it for you, but there's no
need to be melodramatic.
from django.db import IntegrityError, models

class ConcurrentManager(models.Manager):
    def get_or_create(self, **kwargs):
        try:
            return super(ConcurrentManager, self).get_or_create(**kwargs)
        except IntegrityError:
            # A concurrent create() won the race; return the existing row
            # and keep the (object, created) return signature.
            kwargs.pop('defaults', None)
            return self.get(**kwargs), False

class MyModel(models.Model):
    # fields go here
    objects = ConcurrentManager()
You can set up the manager code once, then simply import it and slap
it on whichever models you expect to have concurrency problems. Then
all the rest of your code behaves as if it was included in trunk. And
if you're already using custom managers, just add the above method to
them and enjoy the ride.
That is, unless you're using get_or_create() on a QuerySet instead of
the model's manager, at which point you might have to wait for the
queryset-refactor, since that will make custom QuerySet classes easier
to work with.
Hope this helps!
-Gul
Actually there is a common solution to this problem that doesn't create
duplicates and doesn't fail on the second transaction. And as James has
correctly noted, it works at the database level. The solution is a form
of SELECT called SELECT FOR UPDATE. When one transaction selects
something explicitly for update, any other transaction trying to do the
same thing will wait until the first one ends. I.e. it works like a lock
for threads, but since it lives in the database it works for multiple
Python processes that otherwise don't know anything about each other.
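A rough illustration with raw SQL through Django's cursor (the table and
column names are made up, and this isn't an existing ORM feature; note
also that FOR UPDATE can only lock rows that already exist):

from django.db import connection

def lock_row(name):
    # While one transaction holds this lock, an identical
    # SELECT ... FOR UPDATE in another transaction blocks until the
    # first transaction commits or rolls back.
    cursor = connection.cursor()
    cursor.execute(
        "SELECT id FROM myapp_mymodel WHERE name = %s FOR UPDATE",
        [name],
    )
    return cursor.fetchone()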
The good part is that SELECT FOR UPDATE is implemented in MySQL,
PostgreSQL and Oracle. I recall Malcolm once said that Adrian expressed
a desire to have this in Django, and it might happen after the queryset
refactoring. Malcolm, Adrian, please confirm whether this is correct or
whether I'm just hallucinating :-)
Travis
Oh... Indeed. Then this won't help here, agreed.