Hey, thanks for the comments guys.
On Saturday, November 10, 2012 11:26:25 AM UTC+11, vitaly wrote:
Hi Mark, although this workaround might work in your app, I would venture to say that it's **not** a legit solution
I agree Vitaly - good points.
On Friday, November 2, 2012 9:23:22 PM UTC+11, Denis Bilenko wrote:
Still would like to have a way to reproduce your problem.
I got some time to come back to this today - I'm now pretty confident that this is what's happening in the case of the Popen lockup:
- process makes socket connections to SQS / graylog (it seems to matter here that these libraries create multiple threads)
- fork / execv in subprocess.Popen
- child writes to stdout/stderr (e.g.:
"WARNING: Mixing fork() and threads detected; memory leaked."
or
"Exception AttributeError: AttributeError("'_DummyThread' object has no attribute '_Thread__block'",) in <module 'threading' from '/usr/lib/python2.7/threading.pyc'> ignored")
- celery redirects these outputs to the logger
- graypy attempts to use a socket connection to send out a log, child process hangs
- parent waits forever on child process
At this point, ps shows:
pid: 23310; parent-pid: 20888; cmd: [celeryd@mark-VirtualBox:MainProcess] -active- (worker --pool=gevent --loglevel=DEBUG)
pid: 23352; parent-pid: 23310; cmd: [celeryd@mark-VirtualBox:MainProcess] -active- (worker --pool=gevent --loglevel=DEBUG)
And if I kill process 23352, then celery frees up, finishes processing the task in question and goes on to process another task.
Further, if I suppress the two error messages mentioned above from being written to stdout/err in the child then no lockups occur.
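To make the redirection step in the sequence above concrete, here is a minimal sketch of how a write to stderr can end up inside a log handler. It is not celery's or graypy's actual code: `StreamToLogger`, `RecordingHandler` and `demo` are hypothetical names, and the handler only stores messages where a real network handler would do socket I/O.

```python
import logging
import sys


class StreamToLogger(object):
    """Hypothetical file-like object that forwards writes to a logger."""

    def __init__(self, logger, level=logging.WARNING):
        self.logger = logger
        self.level = level

    def write(self, data):
        data = data.strip()
        if data:
            self.logger.log(self.level, data)

    def flush(self):
        # Nothing is buffered, so there is nothing to flush.
        pass


class RecordingHandler(logging.Handler):
    """Stand-in for a network log handler; it only stores messages."""

    def __init__(self):
        logging.Handler.__init__(self)
        self.messages = []

    def emit(self, record):
        # A real network handler would use a socket here. In the scenario
        # described above, this is the step that appears to hang in the
        # forked child.
        self.messages.append(record.getMessage())


def demo():
    logger = logging.getLogger("redirect-demo")
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = RecordingHandler()
    logger.addHandler(handler)

    original_stderr = sys.stderr
    sys.stderr = StreamToLogger(logger)
    try:
        # Anything written to stderr now goes through the log handler.
        sys.stderr.write(
            "WARNING: Mixing fork() and threads detected; memory leaked.\n"
        )
    finally:
        sys.stderr = original_stderr
        logger.removeHandler(handler)
    return handler.messages
```

Calling `demo()` returns the list of messages the handler received, which shows that a plain stderr write was routed into the logging machinery.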
So by this point I'm almost certain that the underlying issue I'm seeing is Issue 154. However, I don't think the fix suggested in the comments for that issue is really appropriate for me. I know I commented earlier that I had still seen a lockup with that fix applied, but it's such a hacky fix (killing all the threads that celery is actually using) that I didn't really want to investigate further what was causing the lockup in that case. So I'll be working around this issue by turning off celery's stdout/stderr redirection.
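For reference, a sketch of what that workaround could look like in a Celery 3.x style configuration module. The file name is an assumption; `CELERY_REDIRECT_STDOUTS` is the setting that controls whether the worker redirects stdout/stderr to the current logger (it is enabled by default).

```python
# celeryconfig.py (hypothetical module name; use whatever config module
# the worker already loads)

# Stop the worker from redirecting stdout/stderr to the logger, so that
# writes in a forked child do not go through the log handlers.
CELERY_REDIRECT_STDOUTS = False
```

With this set, messages written to stdout/stderr go to the worker's real output streams, so they will no longer show up in graylog.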
FWIW, I've attached a tarball that can be used to reproduce the problem. It requires you to run a celery worker with:
celery worker --pool=gevent --loglevel=DEBUG
You will also need an Amazon SQS login, with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set as environment variables.
You also need to run a graylog server on localhost.
Running this on Linux 12.04.1, I saw the Popen lockup pretty frequently (when it happened, it was always within the first 5-10 task invocations, and often on the first).
Thanks for the help!
Mark