I have the following problem.
I have a function that does a get_or_create on model X, passing the
fields Y and Z as parameters. The same function runs on different
threads, so several get_or_create calls on model X with fields Y/Z can
happen at the same time. It can also happen that the values of Y/Z are
the same in the different threads, and here comes the problem:
get_or_create complains that it returned more than one row for X with
that Y/Z.
Now, Y/Z are not unique in my model and I do not override save, but I
think I would have the same problem even if Y/Z were unique.
Looking at the DB, I see more than one row with the same values... how
can this happen? Here is get_or_create...
def get_or_create(self, **kwargs):
    """
    Looks up an object with the given kwargs, creating one if necessary.
    Returns a tuple of (object, created), where created is a boolean
    specifying whether an object was created.
    """
    assert len(kwargs), \
        'get_or_create() must be passed at least one keyword argument'
    defaults = kwargs.pop('defaults', {})
    try:
        return self.get(**kwargs), False
    except self.model.DoesNotExist:
        params = dict([(k, v) for k, v in kwargs.items() if '__' not in k])
        params.update(defaults)
        obj = self.model(**params)
        obj.save()
        return obj, True
Well... looking at the function I can say it happens in this way (and
I verified it by debugging too):
time 1: T1 (thread 1) calls get_or_create; the self.get() raises DoesNotExist
time 2: T2 does the same and hits the same exception for the same reason
time 3: T1 goes on in the except branch, creates the object and returns it
time 4: T2 does the same (creating another one!)
time 5: any T (T1, T2 or T3) that calls get_or_create again with the
same X-Y/Z... the self.get() goes crazy.
If Y/Z were unique, I think the whole thing would simply fail at time
4... a little bit different, but still a problem.
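Just to make the scenario concrete, here is a minimal sketch of it (the
app/model name, the fields y/z and the values are all hypothetical, and a
configured Django project is assumed; it is illustrative, not a reliable
test case):

import threading

from myapp.models import X  # hypothetical model; y and z are NOT unique


def worker():
    # Both threads can hit DoesNotExist inside get_or_create before either
    # has saved, so both fall through to the create branch.
    X.objects.get_or_create(y='foo', z='bar')

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A later lookup with the same y/z can now match two rows, so the
# self.get() inside get_or_create blows up.
X.objects.get_or_create(y='foo', z='bar')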
I could solve the whole thing this way, but I hope there is a better
solution that I'm missing:
from threading import Lock

lock = Lock()

def get_or_create(self, **kwargs):
    """
    Looks up an object with the given kwargs, creating one if necessary.
    Returns a tuple of (object, created), where created is a boolean
    specifying whether an object was created.
    """
    assert len(kwargs), \
        'get_or_create() must be passed at least one keyword argument'
    defaults = kwargs.pop('defaults', {})
    try:
        return self.get(**kwargs), False
    except self.model.DoesNotExist:
        lock.acquire()
        try:
            try:
                # Look again: another thread may have created the row
                # while we were waiting for the lock.
                res = self.get(**kwargs), False
            except self.model.DoesNotExist:
                params = dict([(k, v) for k, v in kwargs.items() if '__' not in k])
                params.update(defaults)
                obj = self.model(**params)
                obj.save()
                res = obj, True
        finally:
            # Release the lock whether or not an exception was raised.
            lock.release()
        return res
Both threads conceptually have their own "picture" of what the
database looked like when the transaction was started. That isolation
exists until the transaction is ultimately committed.
At the risk of over-explaining and over-simplifying: if the record did
not exist at the time each thread started its transaction, then it
doesn't matter that T1 hit the save operation first; T2 will not see
it.
If you're not using transactions then ignore my explanation, but it
might help to know which database engine you are using.
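To illustrate the point, here is a rough sketch using psycopg2 directly,
assuming PostgreSQL's default read-committed isolation; the connection
string, table and column names are all hypothetical:

import psycopg2

# Two independent connections, i.e. two independent transactions.
conn1 = psycopg2.connect("dbname=test")
conn2 = psycopg2.connect("dbname=test")
cur1 = conn1.cursor()
cur2 = conn2.cursor()

# Both transactions look for the row and see nothing.
cur1.execute("SELECT count(*) FROM x WHERE y = %s AND z = %s", ('foo', 'bar'))
cur2.execute("SELECT count(*) FROM x WHERE y = %s AND z = %s", ('foo', 'bar'))

# conn1 inserts, but until it commits, conn2 still cannot see the row,
# so conn2 happily inserts the "duplicate" as well (no unique constraint).
cur1.execute("INSERT INTO x (y, z) VALUES (%s, %s)", ('foo', 'bar'))
cur2.execute("INSERT INTO x (y, z) VALUES (%s, %s)", ('foo', 'bar'))

conn1.commit()
conn2.commit()  # two identical rows now exist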
Django is 0% threadsafe (as in nada, null or zilch).
It is not supposed to be run that way, but if you must, keep locking
around every operation.
i.
Mike,
There are two issues here: thread safety and concurrent operation, and
they are very different issues (though there is overlap).
Django DOES support concurrent operation (separate processes on the
same or multiple servers).
Django DOES NOT support threaded operation (and from what I've
gathered in past discussions on this list, is not likely to).
This is why Apache must be configured to use the prefork model instead
of the worker model.
In practice this doesn't tend to pose a problem for web deployments.
Both FCGI and Apache are designed so that they can work with non
thread-safe applications.
Where you may run into some difficulty is if you want to make a
multi-threaded backend application.
If you'd like to discuss this issue further, please bring it up on django-users.
> I think more research should be done into this sort of operation
Database locking has been discussed previously on this list and in a
number of related tickets in Trac. This is currently the recommended
approach for ensuring data integrity in a Django app. It is, however,
true that there are still some race conditions that require special
attention (get_or_create is one of them).
I agree that there should probably be more information about how to
handle massively parallel web applications, but that's not a
Django-specific concern. Many web developers don't understand how to
handle these issues, and I haven't found a good resource to point
people toward (suggestions welcome).
I hope this helps to clarify things a bit.
- Ben
Can you find the discussions on Google Groups and post references to
them?
> This is why Apache must be configured to use the prefork model instead
> of the worker model.
In which case there should also be a warning that Django cannot be
used with Apache/mod_python on Windows, as the Apache winnt MPM is also
multithreaded. Also, why are there instructions posted for running a
FASTCGI process in multithreaded mode if that wouldn't be safe either?
I have pointed out the Apache inconsistency before. At the same time,
there seem to be various people who have no problem running Django on
the winnt and worker MPMs. That the FASTCGI documentation shows a
threaded example must also mean that is okay as well.
The most recent response I got was that any threading problems were
related to database backends and were fixed a long time ago and that
besides those issues, there weren't any specific things known of that
would be a problem in a multithreaded web server. There were also some
multithreading issues in mod_python <3.2.7 as well which may have been
making people think there were problems where there weren't.
http://groups.google.com/group/django-developers/browse_frm/thread/bfaad3e93611b2e6/d8b9f845fe31c0e1
http://groups.google.com/group/django-developers/browse_frm/thread/c72a6f0a56321ac7/381a580be1ef0751
... plus other posts I can't find right now.
Thus any issues with multithreading are perhaps more to do with how
people implement an application on top of Django. It would be nice
though to get some sort of official statement from the Django
developers on this one way or the other and document on the Django web
site what the issues are and what parts of Django if any do have
multithreading issues.
Given that you have made this statement that 'prefork' must be used,
do you make it as one of the developers?
> In practice this doesn't tend to pose a problem for web deployments.
> Both FCGI and Apache are designed so that they can work with non
> thread-safe applications.
Although Apache/mod_python can be set up with the prefork MPM, it is
not ideal for Python web applications due to the generally large memory
requirements of the web frameworks. It is much preferable that the
worker MPM be used, as it cuts down on the number of Apache child
processes. If you ever want Django to be taken up and offered as an
option by commodity web hosts, then you must be able to support a
multithreaded server, as they cannot afford the memory requirements of
mod_python, mod_wsgi or fastcgi solutions used in a multiprocess/
single-threaded mode.
Can we please somehow settle this issue once and for all? I have tried
to get discussions going on this issue in the past but have got
minimal feedback. I thought that, to a degree, it had been determined
that multithreaded servers were okay, provided users ensure their own
code is multithread safe, but now again someone is saying that Django
itself is not multithread safe. :-(
Graham
I talked with Jacob about this quite a while ago and he told me that
Django was not originally written to be threadsafe. The only threading
problems I remember hearing about were with the database connections,
and those issues were fixed in #1442 [1]. To my knowledge, there has
never been any review of the code to check for other possible sticky
spots. I used to deploy Django on Windows and never had any threading
problems, but the sites were mostly low traffic, internal, and
probably not good candidates for exposing problems.
In short, Django was not *designed* to be threadsafe, but any obvious
problems that I'm aware of have been fixed. YMMV.
Joseph
that's scary.
but then again, python itself isn't multi-threaded. (all threading is
faked - google "global interpreter lock". lazy s.o.b. python devs) so
all your really hairy "c=c+1" type issues are already nixed.
so not so scary.
derek
Right. What *is* scary is how much people cling to the horrible hack
that preemptive multithreading is.
--
Nicola Larosa - http://www.tekNico.net/
Love is hate
War is peace
No is yes
And we're all free
-- Tracy Chapman, Why?, Tracy Chapman, 1988
you mean to say cooperative multithreading, right?
if so, heck yeah. dear lord in heaven yeah.
> but then again, python itself isn't multi-threaded. (all threading is
> faked - google "global interpreter lock". lazy s.o.b. python devs)
given that a stock CPython interpreter releases the lock in a few
hundred places, primarily around potentially long-running or blocking C
operations, claiming that "all threading is faked" is a bit misleading.
maybe you should do a bit more research before you start calling
people names?
</F>
but it is faked. two python threads can't run concurrently on a two
processor machine. unless, as you pointed out, you're calling a few
hand-picked C operations (mostly i/o). they're like green threads from
java 1.0. "faked" is about as good of an adjective as you get.
still love python tho.
derek
Derek Anderson wrote:
> you mean to say cooperative multithreading, right?
>
> if so, heck yeah. dear lord in heaven yeah.
Erm... what?!? Cooperative MT *is* a hack, and preemptive MT is *not*?!?
That sounds backwards to me. :-)
I'd like you to elaborate, but we're off-topic here, so feel free to take
this to private email, or some more relevant public place. Whatever you
please, but do elaborate! :-)
When running Python embedded within Apache that matters a great deal
less than some seem to make it out to be. This is because all the
threads originate from Apache and all the Apache request handling and
static file handling, being done in C code before Python code is even
invoked, can quite happily run in parallel to each other and to
threads executing within Python code. So, in the context of Apache
there is still ample opportunity to make use of multiple processors
and/or cores and the GIL isn't as big a problem as where someone runs
a single Python only web process. Apache being a multi process web
server as well makes it even less relevant. So although bashing up on
the Python GIL seems to be the flavour of the month, use the right
technology and configurations and for web applications at least it
isn't a big deal. :-)
Graham
> ensure their own code is multithread safe, but now again someone is
> saying that Django itself is not multithread safe. :-(
Well, I was just repeating what I heard from the developers. Not so
long ago a proposal was made that involved a minimal change, but it
would have made the builtin webserver multithreaded (of course you
know about this, as you were involved as well).
This was rejected on the grounds that multithreaded operation might
not work correctly. See this:
http://code.djangoproject.com/ticket/3357
Istvan.
I think given the questions in this thread I should try to clarify the
state of Django and threading:
So as far as I know, Django should be threadsafe. However, I'm also an
idiot, so "as far as I know" isn't very far at all, actually.
See, if I've learned one thing from all the threaded code I've had to
write, it's that threading is always MUCH harder than you think it is.
I've never gotten threaded code right on the first try, and with a
codebase as large as Django I just don't feel comfortable guaranteeing
thread-safety.
I don't know of any specific bugs that make Django unsafe, so there's
a possibility that you won't have any issues. Certainly I've seen a
number of live sites running threaded workers (either Apache or FCGI)
without problems. Me, I'm always going to use the prefork worker.
Jacob
More exact information about the original post:
* The DB is PostgreSQL 8.2, but it could be any other (I mean I must
be cross-DB-platform).
* I'm using Django trunk.
* Actually I described something that is a multi-threaded backend
application... but I think something like this could also happen for a
web user request.
* (To Joe Holloway) I'm NOT using the transaction middleware since
it's all backend.
I know the "c = c + 1" problem for multi-threaded applications
perfectly well, but apart from something low-level in Django that I
don't know about, I don't have any shared objects/variables among
threads, at least nothing that isn't "mutexed".
I'm pretty new to Django, and about the way I'm working now... I've
always loved working in this manner:
1) explicit transactions (begin/commit/rollback)
2) when inside an open transaction and reading DB data that will quite
surely be changed next, locking the rows while reading them (the
Oracle "SELECT FOR UPDATE", for example)
3) doing something like INSERT and, if DUPKEY, then UPDATE (see the
sketch after this list). No useless preemptive select on the data I'm
trying to insert if it's not necessary.
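For point 3, a minimal sketch of what I mean in Django terms; it
assumes y/z carry a unique constraint so the database can actually
report the duplicate key, that IntegrityError is importable from
django.db, and the extra field "count" is purely illustrative:

from django.db import IntegrityError

def insert_or_update(y, z, count):
    try:
        # Try the insert first, with no preemptive select.
        X.objects.create(y=y, z=z, count=count)
    except IntegrityError:
        # Duplicate key: the row already exists, so load and update it.
        obj = X.objects.get(y=y, z=z)
        obj.count = count
        obj.save()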
This is to say that whether there are several processes operating on a
DB or just one multi-threaded process, the DB is the one that does the
"mutex" on the data. It guarantees no one reads "old" data.
Now, about the first point: for some reason I must work in autocommit
mode (and that is not a problem in itself).
About the second one... unfortunately not all DBs support row locking,
so this can't be done (and it makes no sense in autocommit mode anyway).
What I expect from Django when using the DB API is that something like
get_or_create, save, add, remove or whatever is an "atomic" operation.
All these functions usually do a select before an insert/update/delete,
and it's easy to fall into the problem I described at the top. There's
no particular crash on the thread itself (memory error or whatever)...
just a get saying it got more rows... it's a problem of data
consistency.
Just to mention another problem tied to this mechanism: another time I
got a dupkey error in the table behind a ManyToMany relation; the add
function does a select before inserting the data. Two threads were
doing the add on the same object data (I imagine both did the select,
THEN did the insert, and the second one crashed).
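Just as a sketch of a possible workaround, not something I claim is the
right fix (the names "article" and "tags" are illustrative, and it
assumes the join table's unique constraint is what produced the dupkey):

from django.db import IntegrityError

def safe_add(article, tag):
    try:
        article.tags.add(tag)
    except IntegrityError:
        # Another thread/process inserted the same relation between the
        # select and the insert; the row is already there, nothing to do.
        pass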
This is how I see things... like saying the DB API should be the
"mutex" on the DB I was talking about before, i.e. using a "mutex" when
performing those operations that should be considered "atomic" (like
get_or_create), all designed so that performance is still good.
I hope I have been clear about my point of view, and that I have been
useful in some way.
And sorry for my bad English :)
I've modified some files in django.db.models, putting a decorator on
all the functions that are critical (get_or_create, save, create,
delete, add, remove, etc.). The decorator just does the following (a
sketch is shown after this list):
* acquire an RLock
* execute the requested function
* release the RLock (even if an exception is raised, of course)
* give the result or the exception back
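Roughly, the decorator looks like this (just a sketch; the lock name
and placement are illustrative, and the real code touches several
methods in django.db.models):

import threading
from functools import wraps

_db_lock = threading.RLock()

def serialized(func):
    """Serialize calls to the wrapped function across all threads."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        _db_lock.acquire()
        try:
            # The result (or exception) is passed straight through; the
            # lock is released in either case.
            return func(*args, **kwargs)
        finally:
            _db_lock.release()
    return wrapper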
Everything seems to be OK. I am serializing all those operations that
can have problems in a multithreaded scenario. Of course, concurrent
operations on the DB will be slower because of the serialization, but
they will be safe, and I think that's much more important.
Obviously this is just a quick fix and not the best solution.
What do you think about it?
Threads and thread safety have nothing to do with your problem. What
you have is a problem with two different processes (or threads, it
doesn't matter) issuing database commands that get in each other's way.
What happens is something like this (P1, P2 and P3 are different
processes, or threads):
P1: Does a row with the values Y and Z exist? No
P2: Does a row with the values Y and Z exist? No
P1: Add a row with Y and Z
P2: Add a row with Y and Z
P3: Does a row with the values Y and Z exist? AssertionError
This can happen whether you are using transactions or not; it doesn't
really matter (though transactions will make the problem worse).
Solution: do not use get_or_create with fields that are not unique in
the database. Do NOT rely on your application to enforce uniqueness.
This still has the problem that get_or_create can return an error (the
second insert in the sequence above will fail), but at least your data
stays correct.
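As a sketch of what that means at the model level (field names and
lengths are illustrative):

from django.db import models

class X(models.Model):
    y = models.CharField(max_length=100)
    z = models.CharField(max_length=100)

    class Meta:
        # The database, not the application, enforces that (y, z) is
        # unique; the losing insert then fails instead of silently
        # creating a duplicate row.
        unique_together = (('y', 'z'),)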
Nis