I have a function that calls get_or_create on model X, passing fields Y and Z as parameters. The same function runs in several threads, so several get_or_create calls on model X with fields Y/Z can happen at the same time. The values of Y/Z can also be the same across threads, and here comes the problem: get_or_create sometimes complains that it returned more than one row for X-Y/Z.
Now Y/Z are not unique in my model and I have not overridden the save function, but I think I would have the same problem if Y/Z were unique too.
Looking at the DB, I have more than one row with the same values... how can this be? Here is get_or_create:
def get_or_create(self, **kwargs):
    """
    Looks up an object with the given kwargs, creating one if necessary.
    Returns a tuple of (object, created), where created is a boolean
    specifying whether an object was created.
    """
    assert len(kwargs), 'get_or_create() must be passed at least one keyword argument'
    defaults = kwargs.pop('defaults', {})
    try:
        return self.get(**kwargs), False
    except self.model.DoesNotExist:
        params = dict([(k, v) for k, v in kwargs.items() if '__' not in k])
        params.update(defaults)
        obj = self.model(**params)
        obj.save()
        return obj, True
Well... looking at the function, I can say it happens this way (and I confirmed it by debugging):
time 1 : T1 (thread 1) calls get_or_create, does the self.get and gets the DoesNotExist exception
time 2 : T2 does the same and gets the exception too, for the same reason
time 3 : T1 handles the exception, creates the object and returns it
time 4 : T2 does the same (creating another one!)
time 5 : any T (T1 or T2 or T3) that calls get_or_create again with the same X-Y/Z... the self.get goes crazy (it finds more than one row).
If Y/Z were unique, I think the whole thing would just fail at time 4... slightly different, but still a problem.
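The timeline above can be reproduced outside Django. The following is only a sketch under assumptions: a plain Python list stands in for table X, and a threading.Barrier (not part of the original code) forces both threads past the lookup before either one inserts, which is exactly the interleaving of time 1 to time 4. The names rows, barrier and the simplified get_or_create are hypothetical.

```python
import threading

rows = []                        # stand-in for table X, no unique constraint
barrier = threading.Barrier(2)   # assumption: used only to force the bad interleaving


def get_or_create(y, z):
    """Simplified check-then-insert, mirroring the structure of the manager method."""
    matches = [r for r in rows if r == (y, z)]   # time 1 / time 2: the "get"
    if matches:
        return matches[0], False
    barrier.wait()               # both threads have now seen "does not exist"
    row = (y, z)
    rows.append(row)             # time 3 / time 4: each thread "saves" its own row
    return row, True


threads = [threading.Thread(target=get_or_create, args=("y1", "z1")) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# time 5: a later lookup now finds more than one matching row
duplicates = [r for r in rows if r == ("y1", "z1")]
print(len(duplicates))
```

Nothing here depends on threads as such: two separate processes talking to the same database could interleave in the same way.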
I could solve the whole thing this way, but I hope there is a better solution that I'm missing:
from threading import Lock

lock = Lock()

def get_or_create(self, **kwargs):
    """
    Looks up an object with the given kwargs, creating one if necessary.
    Returns a tuple of (object, created), where created is a boolean
    specifying whether an object was created.
    """
    assert len(kwargs), 'get_or_create() must be passed at least one keyword argument'
    defaults = kwargs.pop('defaults', {})
    try:
        return self.get(**kwargs), False
    except self.model.DoesNotExist:
        # Serialize creation, and look again under the lock: another
        # thread may have created the object in the meantime.
        lock.acquire()
        try:
            try:
                res = self.get(**kwargs), False
            except self.model.DoesNotExist:
                params = dict([(k, v) for k, v in kwargs.items() if '__' not in k])
                params.update(defaults)
                obj = self.model(**params)
                obj.save()
                res = obj, True
        finally:
            # Release the lock even if get() or save() raises.
            lock.release()
        return res
Do you have the transaction middleware enabled? Given that you do not have unique constraints on field Y/Z, I believe this would be the expected behavior with transactions enabled.
Both threads conceptually have their own "picture" of what the database looked like when the transaction was started. That isolation exists until the transaction is ultimately committed.
At the risk of over-explaining and over-simplifying: if the record did not exist at the time each thread started its transaction, then it doesn't matter that T1 hit the save operation first. T2 will not see it.
If you're not using transactions then ignore my explanation, but it might help to know which database engine you are using.
> time 1 : T1 (thread 1) calls get_or_create, does the self.get and
> gets the DoesNotExist exception
> time 2 : T2 does the same and gets the exception too, for the same reason
> time 3 : T1 handles the exception, creates the object and returns it
> time 4 : T2 does the same (creating another one!)
> time 5 : any T (T1 or T2 or T3) that calls get_or_create again with
> the same X-Y/Z... the self.get goes crazy.
It should be threadsafe. The way web applications and web loads work means that lots of simultaneous connections pretty much turn it into a threaded application, and for that reason I think more research should be done into this sort of operation.
On 9/26/07, Istvan Albert <istvan.alb...@gmail.com> wrote:
> It should be threadsafe - [...] web applications [...] pretty much
> [become] a threaded application
Mike,
There are two issues here: thread safety and concurrent operation. They are very different issues (though there is overlap).
Django DOES support concurrent operation (separate processes on the same or multiple servers).
Django DOES NOT support threaded operation (and from what I've gathered in past discussions on this list, is not likely to).
This is why Apache must be configured to use the prefork model instead of the worker model.
In practice this doesn't tend to pose a problem for web deployments. Both FCGI and Apache are designed so that they can work with non thread-safe applications.
Where you may run into some difficulty is if you want to make a multi-threaded backend application.
If you'd like to discuss this issue further, please bring it up on django-users.
> I think more research should be done into this sort of operation
Database locking has been discussed previously on this list and in a number of related tickets in Trac. This is currently the recommended approach for ensuring data integrity in a Django app. It is, however, true that there are still some race conditions that require special attention (get_or_create is one of them).
I agree that there should probably be more information about how to handle massively parallel web applications, but that's not a Django-specific concern. Many web developers don't understand how to handle these issues, and I haven't found a good resource to point people toward (suggestions welcome).
On Sep 26, 10:27 am, "Benjamin Slavin" <benjamin.sla...@gmail.com> wrote:
> On 9/25/07, Mike Scott <mic...@gmail.com> wrote:
> > It should be threadsafe - [...] web applications [...] pretty much
> > [become] a threaded application
> Mike,
> There are two issues here: thread safety and concurrent operation.
> They are very different issues (though there is overlap).
> Django DOES support concurrent operation (separate processes on the
> same or multiple servers).
> Django DOES NOT support threaded operation (and from what I've
> gathered in past discussions on this list, is not likely to).
Can you find the discussions on Google Groups and post references to them?
> This is why Apache must be configured to use the prefork model instead
> of the worker model.
In which case there should also be a warning that Django cannot be used with Apache/mod_python on Windows, as the Apache winnt MPM is also multithreaded. Also, why are there instructions posted for running a FASTCGI process in multithreaded mode, when that wouldn't be safe either?
I have pointed out the Apache inconsistency before. At the same time, there seem to be various people who have no problem running Django on the winnt and worker MPMs. That the FASTCGI instructions show a threaded example must also mean that is okay as well.
The most recent response I got was that any threading problems were related to database backends and were fixed a long time ago and that besides those issues, there weren't any specific things known of that would be a problem in a multithreaded web server. There were also some multithreading issues in mod_python <3.2.7 as well which may have been making people think there were problems where there weren't.
Thus any issues with multithreading are perhaps more to do with how people implement an application on top of Django. It would be nice though to get some sort of official statement from the Django developers on this one way or the other and document on the Django web site what the issues are and what parts of Django if any do have multithreading issues.
When you state that 'prefork' must be used, are you saying that as one of the developers?
> In practice this doesn't tend to pose a problem for web deployments.
> Both FCGI and Apache are designed so that they can work with non
> thread-safe applications.
Although Apache/mod_python can be set up for the prefork MPM, that is not ideal for Python web applications due to the generally large memory requirements of the web frameworks. It is much preferable to use the worker MPM, as it cuts down on the number of Apache child processes. If you ever want Django to be taken up and offered as an option by commodity web hosters, then you must be able to support a multithreaded server, as they cannot afford the memory requirements of mod_python, mod_wsgi or fastcgi solutions used in a multiprocess/single-threaded mode.
Can we please somehow settle this issue once and for all? I have tried to get discussions going on this issue in the past but have got minimal feedback. I thought that to a degree it had been determined that multithreaded servers were okay, although users should ensure their own code is multithread safe, but now again someone is saying that Django itself is not multithread safe. :-(
On 9/25/07, Graham Dumpleton <Graham.Dumple...@gmail.com> wrote:
> Can we please somehow settle this issue once and for all? I have tried
> to get discussions going on this issue in the past but have got
> minimal feedback. I thought that to a degree it had been determined
> that multithreaded servers were okay, although users should
> ensure their own code is multithread safe, but now again someone is
> saying that Django itself is not multithread safe. :-(
I talked with Jacob about this quite a while ago and he told me that Django was not originally written to be threadsafe. The only threading problems I remember hearing about were with the database connections, and those issues were fixed in #1442 [1]. To my knowledge, there has never been any review of the code to check for other possible sticky spots. I used to deploy Django on Windows and never had any threading problems, but the sites were mostly low traffic, internal, and probably not good candidates for exposing problems.
In short, Django was not *designed* to be threadsafe, but any obvious problems that I'm aware of have been fixed. YMMV.
> In short, Django was not *designed* to be threadsafe, but any obvious
> problems that I'm aware of have been fixed. YMMV.
that's scary.
but then again, python itself isn't multi-threaded. (all threading is faked - google "global interpreter lock". lazy s.o.b. python devs) so all your really hairy "c=c+1" type issues are already nixed.
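A caveat on the "c=c+1" remark, offered as a sketch and assuming CPython: the GIL serializes individual bytecode instructions, not whole statements. Disassembling the assignment shows a separate load and store, so a thread switch can in principle fall between them and an update can be lost. The second half shows the usual remedy, an explicit threading.Lock around the read-modify-write. The names counter, unsafe_increment and safe_increment are made up for the illustration.

```python
import dis
import threading

counter = 0


def unsafe_increment():
    global counter
    counter = counter + 1    # one statement, but several bytecode instructions


# List the instructions: the read and the write of the global are separate steps.
ops = [ins.opname for ins in dis.get_instructions(unsafe_increment)]
load_at = ops.index("LOAD_GLOBAL")
store_at = ops.index("STORE_GLOBAL")
print(ops[load_at:store_at + 1])

lock = threading.Lock()


def safe_increment(times):
    global counter
    for _ in range(times):
        with lock:           # the whole read-modify-write happens under the lock
            counter = counter + 1


counter = 0
workers = [threading.Thread(target=safe_increment, args=(10000,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(counter)
```

With the lock the final count is exact. Without it the result depends on when the interpreter happens to switch threads, which is why the unprotected version is not run here.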
Derek Anderson wrote:
> but then again, python itself isn't multi-threaded. (all threading is
> faked - google "global interpreter lock". lazy s.o.b. python devs) so
> all your really hairy "c=c+1" type issues are already nixed.
> so not so scary.
Right. What *is* scary is how much people cling to the horrible hack that preemptive multithreading is.
Derek Anderson wrote:
> but then again, python itself isn't multi-threaded. (all threading is
> faked - google "global interpreter lock". lazy s.o.b. python devs)
given that a stock CPython interpreter releases the lock in a few hundred places, primarily around potentially long-running or blocking C operations, claiming that "all threading is faked" is a bit misleading. maybe you should do a bit more research before you start calling people names?
well, that was sarcasm with the "lazy" comment...come on, i love python. why else would i be here? :)
but it is faked. two python threads can't run concurrently on a two processor machine. unless, as you pointed out, you're calling a few hand-picked C operations (mostly i/o). they're like green threads from java 1.0. "faked" is about as good of an adjective as you get.
Derek Anderson wrote:
> Nicola Larosa wrote:
>> Right. What *is* scary is how much people cling to the horrible hack
>> that preemptive multithreading is.
> you mean to say cooperative multithreading, right?
> if so, heck yeah. dear lord in heaven yeah.
Erm... what?!? Cooperative MT *is* a hack, and preemptive MT is *not*?!? That sounds backwards to me. :-)
I'd like you to elaborate, but we're off-topic here, so feel free to take this to private email, or some more relevant public place. Whatever you please, but do elaborate! :-)
On Sep 26, 7:01 pm, Derek Anderson <pub...@kered.org> wrote:
> well, that was sarcasm with the "lazy" comment...come on, i love python.
> why else would i be here? :)
> but it is faked. two python threads can't run concurrently on a two
> processor machine. unless, as you pointed out, you're calling a few
> hand-picked C operations (mostly i/o). they're like green threads from
> java 1.0. "faked" is about as good of an adjective as you get.
When running Python embedded within Apache that matters a great deal less than some seem to make it out to be. This is because all the threads originate from Apache and all the Apache request handling and static file handling, being done in C code before Python code is even invoked, can quite happily run in parallel to each other and to threads executing within Python code. So, in the context of Apache there is still ample opportunity to make use of multiple processors and/or cores and the GIL isn't as big a problem as where someone runs a single Python only web process. Apache being a multi process web server as well makes it even less relevant. So although bashing up on the Python GIL seems to be the flavour of the month, use the right technology and configurations and for web applications at least it isn't a big deal. :-)
On Sep 25, 10:58 pm, Graham Dumpleton <Graham.Dumple...@gmail.com> wrote:
> ensure their own code is multithread safe, but now again someone is
> saying that Django itself is not multithread safe. :-(
Well, I was just repeating what I heard from the developers. Not so long ago a proposal was made that involved a minimal change, but it would have made the builtin webserver multithreaded (of course you know about this, as you were involved as well).
This was rejected on the grounds that multithreaded operations might not work correctly. See this:
I think given the questions in this thread I should try to clarify the state of Django and threading:
So as far as I know, Django should be threadsafe. However, I'm also an idiot, so "as far as I know" isn't very far at all, actually.
See, if I've learned one thing from all the threaded code I've had to write, it's that threading is always MUCH harder than you think it is. I've never gotten threaded code right on the first try, and with a codebase as large as Django I just don't feel comfortable guaranteeing thread-safety.
I don't know of any specific bugs that make Django unsafe, so there's a possibility that you won't have any issues. Certainly I've seen a number of live sites running threaded workers (either Apache or FCGI) without problems. Me, I'm always going to use the prefork worker.
More exact information about the original post:
* The DB is PostgreSQL 8.2, but it could be any other (I mean, I must be cross-DB-platform).
* I'm using Django trunk.
* What I described is actually a multi-threaded backend application... but I think something like this could also happen with web user requests.
* (To Joe Holloway) I'm NOT using the transaction middleware, since it's all backend.
I know the "c = c + 1" problem of multi-threaded applications perfectly well, but barring something about Django's low-level internals that I don't know, I don't have any objects/variables shared among threads, at least nothing that isn't "mutexed".
I'm pretty new to Django and the way I'm working now... I've always loved working in this manner:
1) explicit transactions (begin/commit/rollback)
2) when in an open transaction and reading DB data that I'm fairly sure to change next, locking that data while reading it (Oracle's "select for update", for example)
3) I love doing something like INSERT and, if DUPKEY, then UPDATE. No useless preemptive select, if not necessary, on data I'm trying to insert.
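Point 3 can be sketched without Django. This is only an illustration under assumptions: it uses the standard-library sqlite3 module rather than PostgreSQL, and the table x, its columns and the helper insert_or_update are hypothetical. The point is that the unique constraint lets the database arbitrate, so no preemptive SELECT is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption for the sketch: (y, z) is declared unique, so the DB reports the dupkey.
conn.execute("CREATE TABLE x (y TEXT, z TEXT, hits INTEGER, UNIQUE (y, z))")


def insert_or_update(conn, y, z):
    """Try the INSERT first; fall back to UPDATE only on a duplicate key."""
    try:
        with conn:   # commits on success, rolls back if the statement raises
            conn.execute("INSERT INTO x (y, z, hits) VALUES (?, ?, 1)", (y, z))
        return "inserted"
    except sqlite3.IntegrityError:
        with conn:
            conn.execute(
                "UPDATE x SET hits = hits + 1 WHERE y = ? AND z = ?", (y, z)
            )
        return "updated"


first = insert_or_update(conn, "a", "b")
second = insert_or_update(conn, "a", "b")
row = conn.execute("SELECT hits FROM x WHERE y = 'a' AND z = 'b'").fetchone()
print(first, second, row)
```

Other backends report the duplicate key through their own exception class, so the except clause would need adapting per database.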
This is to say that whether several processes operate on a DB or just one multi-threaded process does, the DB is the one that does the "mutex" of data. It guarantees that no one reads "old" data.
Now, about the first point: for some reason I must work in autocommit mode (and that's not a problem in itself). About the second one: unfortunately not all DBs support locking data, so this can't be done (and it makes no sense in autocommit mode anyway).
What I expect from Django when using the DB-API is that something like "get_or_create", "save", "add", "remove" or whatever is an "atomic" operation. All these functions usually do a select before an insert/update/delete, and it's easy to fall into the problem I described at the top. There's no particular crash in the thread itself (memory error or whatever)... just a "get" that says it got more than one row. It's a problem of data consistency.
Just to mention another problem tied to this mechanism: another time I got a dupkey in a table backing a ManyToMany relation, because the "add" function does a select before inserting data. Two threads were doing the "add" on the same object data (I imagine both did the select, THEN both did the insert, and the second one crashed).
This is how I see things: the DB-API should be the DB "mutex" I was speaking about before, meaning it should use a "mutex" when performing those operations that should be considered "atomic" (like get_or_create). All this designed in a way that performance is still good.
I hope I have been clear about my point of view, and that it has been useful in some way.
I've modified some files of django.db.models, putting a decorator function on all those functions that are critical (get_or_create, save, create, delete, add, remove, etc.). The decorator function just does:
* RLock.acquire
* execute the requested function
* RLock.release (even if an exception is raised, of course)
* give the result or exception back
Everything seems to be OK. I am serializing all those operations that can have problems in a multithreaded scenario. Of course concurrent operations on the DB will be slower because of the serialization, but they will be safe, and I think that's much more important.
And obviously this is just a quick fix, not a better solution.
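A minimal sketch of such a decorator, under assumptions: the names serialized and _db_lock are hypothetical, and an in-memory list stands in for the table, so this is not the actual patch to django.db.models. An RLock is used so that one decorated function can call another from the same thread without deadlocking.

```python
import functools
import threading

_db_lock = threading.RLock()   # hypothetical single lock, shared within one process


def serialized(func):
    """Run func while holding the lock; release it even if func raises."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _db_lock:
            return func(*args, **kwargs)
    return wrapper


rows = []   # stand-in for the table


@serialized
def get_or_create(y, z):
    for row in rows:
        if row == (y, z):
            return row, False
    row = (y, z)
    rows.append(row)
    return row, True


@serialized
def add_pair(y, z):
    # Re-enters the same lock from the same thread: fine with RLock,
    # a deadlock with a plain Lock.
    return get_or_create(y, z)


results = []


def worker():
    results.append(add_pair("y1", "z1"))


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

created_count = sum(1 for _, created in results if created)
print(len(rows), created_count)
```

Note the limit of this approach: the lock lives in one Python process, so it does nothing for a second process (or another server) writing to the same database.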
> I have a function that calls get_or_create on model X, passing
> fields Y and Z as parameters. The same function runs in several
> threads, so several get_or_create calls on model X with fields Y/Z
> can happen at the same time. The values of Y/Z can also be the same
> across threads, and here comes the problem: get_or_create sometimes
> complains that it returned more than one row for X-Y/Z.
> Now Y/Z are not unique in my model and I have not overridden the
> save function, but I think I would have the same problem if Y/Z were
> unique too.
> Looking at the DB, I have more than one row with the same values...
> how can this be? Here is get_or_create...
Pitching in, since it seems no-one was giving the answer I would give - though Benjamin Slavin comes close.
Threads and thread safety have nothing to do with your problem. What you have is a problem with two different processes (or threads, doesn't matter) issuing database commands that get in the way of each other.
What happens is something like this (P1, P2 and P3 are three different processes, or threads):
P1: Does a row with the values Y and Z exist? No
P2: Does a row with the values Y and Z exist? No
P1: Add a row with Y and Z
P2: Add a row with Y and Z
P3: Does a row with the values Y and Z exist? AssertionError
This can happen if you are using transactions or not - doesn't really matter (though transactions will make the problem worse).
Solution: Do not use get_or_create with fields that are not unique in the database. Do NOT rely on your application to enforce uniqueness.
This still has the problem that get_or_create can return an error (the second insert in the above sequence will fail) but at least your data stays correct.
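Both points can be sketched with the standard-library sqlite3 module: the unique constraint keeps the data correct, and the caller handles the error from the losing insert by looking the row up again. This is an illustration under assumptions, not Django's implementation. The table x, the helper get_or_create and the before_insert hook are hypothetical, and the hook exists only to simulate the other process winning the race at a deterministic point.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE x (id INTEGER PRIMARY KEY, y TEXT, z TEXT, UNIQUE (y, z))"
)

SELECT_SQL = "SELECT id FROM x WHERE y = ? AND z = ?"
INSERT_SQL = "INSERT INTO x (y, z) VALUES (?, ?)"


def get_or_create(conn, y, z, before_insert=None):
    """Return (id, created). The database, not the application, enforces uniqueness."""
    row = conn.execute(SELECT_SQL, (y, z)).fetchone()
    if row:
        return row[0], False
    if before_insert is not None:
        before_insert()          # test hook: another writer sneaks in right here
    try:
        with conn:               # commit on success, roll back on error
            cur = conn.execute(INSERT_SQL, (y, z))
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Lost the race: the row exists now, so fetch it instead of failing.
        row = conn.execute(SELECT_SQL, (y, z)).fetchone()
        return row[0], False


def other_writer():
    with conn:
        conn.execute(INSERT_SQL, ("y1", "z1"))


obj_id, created = get_or_create(conn, "y1", "z1", before_insert=other_writer)
count = conn.execute("SELECT COUNT(*) FROM x").fetchone()[0]
print(obj_id, created, count)
```

The second insert still fails, as described above, but the failure is caught and turned into a lookup, and the table never holds two rows with the same Y/Z.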