Account Options

  1. Sign in
The old Google Groups will be going away soon, but your browser is incompatible with the new version.
Google Groups Home
« Groups Home
gevent + celery + subprocess + graylog = hang
There are currently too many topics in this group that display first. To make this topic appear first, remove this option from another topic.
There was an error processing your request. Please try again.
flag
  12 messages - Collapse all  -  Translate all to Translated (View all originals)
The group you are posting to is a Usenet group. Messages posted to this group will make your email address visible to anyone on the Internet.
Your reply message has not been sent.
Your post was successful
 
From:
To:
Cc:
Followup To:
Add Cc | Add Followup-to | Edit Subject
Subject:
Validation:
For verification purposes please type the characters you see in the picture below or the numbers you hear by clicking the accessibility icon. Listen and type the numbers you hear
 
Mark Hingston  
View profile  
 More options Oct 29 2012, 3:01 am
From: Mark Hingston <mrkhings...@gmail.com>
Date: Mon, 29 Oct 2012 00:01:45 -0700 (PDT)
Local: Mon, Oct 29 2012 3:01 am
Subject: gevent + celery + subprocess + graylog = hang

Hey guys,

It seems that the combination of libraries that I'm using in my project has
led me to experience a hang in gevent-based code. I'm using celery worker
(http://celeryproject.org/) to handle video encoding tasks. The worker is
using gevent as the concurrency backend (it uses a gevent threadpool) and
it's connecting to Amazon SQS for it's underlying queue. I've also got
graylog enabled. However with this setup I often get a lockup when I call
something like:

p = subprocess.Popen(['ffmpeg', ...], cwd='/my/dir/',
stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
p.wait()

Some important details:
- celery's logging helper creates a logging proxy which they use to replace
stdout and stderr
- graylog adds a second logging handler to the base logger. The graylog
handler calls socket.getfqdn() which calls into gevent's event loop to do
the resolve
- I'm using the gevent version of subprocess
- I'm using the --pool=gevent argument to celery worker which makes it
patch gevent asap
- I'm running on Ubuntu 12.04.1

I was able to get the traces of what's happening in each thread / greenlet
by using an adaptation of the pyrasite-injected code mentioned
here: http://blog.ziade.org/2012/05/25/zmq-and-gevent-debugging-nightmares/.
The trace that I got is attached.

I seem to be able to work around this issue by adding the following code
change to the POSIX _execute_child function in gevent/subprocess.py:

--- a/gevent/subprocess.py
+++ b/gevent/subprocess.py
@@ -597,6 +597,14 @@ class Popen(object):
                            errread, errwrite):
             """Execute program (POSIX version)"""

+            saved_stdin = sys.stdin
+            saved_stdout = sys.stdout
+            saved_stderr = sys.stderr
+
+            sys.stdin = sys.__stdin__
+            sys.stdout = sys.__stdout__
+            sys.stderr = sys.__stderr__
+            
             if isinstance(args, types.StringTypes):
                 args = [args]
             else:
@@ -725,6 +733,10 @@ class Popen(object):
                 else:
                     os.close(errpipe_read)

+                sys.stdin = saved_stdin
+                sys.stdout = saved_stdout
+                sys.stderr = saved_stderr
+
             if data != "":
                 self.wait()
                 child_exception = pickle.loads(data

And I can also work around the issue by doing either of:

   - turning off graypy
   - turning off celery's stdout/stderr redirection

So I'm wondering the patch above is a legit solution to the issue or not. I
know that subprocess is doing some manipulation of stdout, stdin, stderr
files and duping and closing them, etc in the child but I'm not sure if the
fact that celery is swapping out the stdout / stderr files for it's own
proxy should be expected to cause an issue.

I'll probably end up just turning off celery's output redirection but I'd
appreciate any advice that you guys can give on this issue.

Also I apologise - I did try to distill this issue down to the smallest
possible code sample that can cause the problem (minus all the libraries,
etc) however I couldn't really - the problem seems fleeting and disappears
when I make a simpler version.

Thanks!
Mark

  trace.txt
7K Download

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Denis Bilenko  
View profile  
 More options Oct 30 2012, 10:01 am
From: Denis Bilenko <denis.bile...@gmail.com>
Date: Tue, 30 Oct 2012 15:01:30 +0100
Local: Tues, Oct 30 2012 10:01 am
Subject: Re: [gevent] gevent + celery + subprocess + graylog = hang

On Mon, Oct 29, 2012 at 8:01 AM, Mark Hingston <mrkhings...@gmail.com> wrote:
> Hey guys,

> It seems that the combination of libraries that I'm using in my project has
> led me to experience a hang in gevent-based code. I'm using celery worker
> (http://celeryproject.org/) to handle video encoding tasks. The worker is
> using gevent as the concurrency backend (it uses a gevent threadpool) and
> it's connecting to Amazon SQS for it's underlying queue. I've also got
> graylog enabled. However with this setup I often get a lockup when I call
> something like:

> p = subprocess.Popen(['ffmpeg', ...], cwd='/my/dir/', stdin=subprocess.PIPE,
> stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> p.wait()

Could it be that something else in your code messes with SIGCHLD or
waitpid()? That could causes a hang here.

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mark Hingston  
View profile  
 More options Nov 1 2012, 5:29 am
From: Mark Hingston <mrkhings...@gmail.com>
Date: Thu, 1 Nov 2012 02:29:05 -0700 (PDT)
Local: Thurs, Nov 1 2012 5:29 am
Subject: Re: [gevent] gevent + celery + subprocess + graylog = hang

On Wednesday, October 31, 2012 1:01:53 AM UTC+11, Denis Bilenko wrote:

> On Mon, Oct 29, 2012 at 8:01 AM, Mark Hingston <mrkhi...@gmail.com<javascript:>>
> wrote:

> > p = subprocess.Popen(['ffmpeg', ...], cwd='/my/dir/',
> stdin=subprocess.PIPE,
> > stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> > p.wait()

> Could it be that something else in your code messes with SIGCHLD or
> waitpid()? That could causes a hang here.

Sorry - my original description was poor. The lockup actually occurs at the
point when I call Popen. The call to Popen just never returns. It seems to
stay waiting on the socket read operation on the errpipe_read.read() call:

errpipe_read = FileObject(errpipe_read, 'rb')
data = errpipe_read.read()

It seems as though it's never receiving a read event even though the exec
completes correctly and the process that I'm calling runs and then even
terminates.

Also FWIW I don't think that anything is messing with SIGCHLD or waitpid,
not from what I could tell.

I had a bit of luck with applying the fix for issue 154: "Hang when calling
`fork()` after `socket.connect()`" - I added the relevant code to kill()
then join() on the thread pool around where subprocess does a fork and as
far as I could tell this prevented the code from locking up, however I
don't think that killing the threads is really a good solution for me here.
Surprisingly celery still seems to work and be able to process many
subsequent tasks that call subprocess. I've not really looked into how but
I've assumed it must be recreating threads for the pool as it needs them or
something like that.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mark Hingston  
View profile  
 More options Nov 1 2012, 5:36 am
From: Mark Hingston <mrkhings...@gmail.com>
Date: Thu, 1 Nov 2012 02:36:37 -0700 (PDT)
Local: Thurs, Nov 1 2012 5:36 am
Subject: Re: [gevent] gevent + celery + subprocess + graylog = hang

On Thursday, November 1, 2012 8:29:05 PM UTC+11, Mark Hingston wrote:

> I had a bit of luck with applying the fix for issue 154: "Hang when
> calling `fork()` after `socket.connect()`" - I added the relevant code to
> kill() then join() on the thread pool around where subprocess does a fork
> and as far as I could tell this prevented the code from locking up, however
> I don't think that killing the threads is really a good solution for me
> here. Surprisingly celery still seems to work and be able to process many
> subsequent tasks that call subprocess. I've not really looked into how but
> I've assumed it must be recreating threads for the pool as it needs them or
> something like that.

Scratch that - I have now been able to get it to lock up even with this
change. Maybe it just changed timing to make the lockup less frequent. Not
sure.

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Denis Bilenko  
View profile  
 More options Nov 1 2012, 6:30 am
From: Denis Bilenko <denis.bile...@gmail.com>
Date: Thu, 1 Nov 2012 11:30:02 +0100
Local: Thurs, Nov 1 2012 6:30 am
Subject: Re: [gevent] gevent + celery + subprocess + graylog = hang

weird, does wrapping this block with "with gevent.Timeout(5):" help?

Can you make a complete script that reproduces the look up?


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mark Hingston  
View profile  
 More options Nov 2 2012, 1:30 am
From: Mark Hingston <mrkhings...@gmail.com>
Date: Thu, 1 Nov 2012 22:30:09 -0700 (PDT)
Local: Fri, Nov 2 2012 1:30 am
Subject: Re: [gevent] gevent + celery + subprocess + graylog = hang

On Thursday, November 1, 2012 9:30:26 PM UTC+11, Denis Bilenko wrote:
> On Thu, Nov 1, 2012 at 10:29 AM, Mark Hingston <mrkhi...@gmail.com<javascript:>>
> wrote:
> > The call to Popen just never returns. It seems to
> > stay waiting on the socket read operation on the errpipe_read.read()
> call:

> > errpipe_read = FileObject(errpipe_read, 'rb')
> > data = errpipe_read.read()

> weird, does wrapping this block with "with gevent.Timeout(5):" help?

Well, using a gevent timeout here definitely throws an exception when the
timer expires, yes. Which is interesting but I don't think that will be
very helpful to me - I'd like to check if the process started correctly, or
at the very least it's output and return code. If Popen throws an error I
won't have the process object to wait / poll

> Can you make a complete script that reproduces the look up?

Tried all day but couldn't bring this down to a single script that locks
up. Will try further when I return from holidays in a week.

It's worth pointing out that I'm definitely seeing the 'Mixing fork() and
threads detected; memory leaked.' print whenever I call subprocess.Popen -
sorry I didn't think to mention this before. Some of my other tasks are
importing libraries that make socket connections to web services well
before I attempt to call subprocess. When I remove the instantiation of
objects from these libraries I don't seem to get that warning print. That's
why I was looking into the socket related lockup i mentioned earlier.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Denis Bilenko  
View profile  
 More options Nov 2 2012, 6:23 am
From: Denis Bilenko <denis.bile...@gmail.com>
Date: Fri, 2 Nov 2012 11:22:58 +0100
Local: Fri, Nov 2 2012 6:22 am
Subject: Re: [gevent] gevent + celery + subprocess + graylog = hang

On Fri, Nov 2, 2012 at 6:30 AM, Mark Hingston <mrkhings...@gmail.com> wrote:
> It's worth pointing out that I'm definitely seeing the 'Mixing fork() and
> threads detected; memory leaked.' print whenever I call subprocess.Popen -
> sorry I didn't think to mention this before.

Actually, this warning is only relevant if you fork() Python process
and continue using it. In case of Popen, which does execv(), there's
nothing actually leaked that I know of.

I'll remove the warning. Still would like to have a way to reproduce
your problem.


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
vitaly  
View profile  
 More options Nov 9 2012, 7:26 pm
From: vitaly <vitaly.krugl.nume...@gmail.com>
Date: Fri, 9 Nov 2012 16:26:25 -0800 (PST)
Local: Fri, Nov 9 2012 7:26 pm
Subject: Re: gevent + celery + subprocess + graylog = hang

Hi Mark, although this workaround might work in your app, I would venture
to say that it's **not** a legit solution to include in
gevent.subprocess.Popen because a different user of gevent may have
purposefully changed one or more of the sys.stdout/stderr instances to go
somewhere else and would like gevent.subprocess.Popen to honor those
choices (just like the builtin subprocess.Popen would) instead of
overriding them with the default sys.__stdout__, etc.  And there may also
be other posix threads in the user's app writing to sys.stdout/sys.stderr
that would be impacted if gevent.subprocess.Popen temporarily patched
sys.stdin/stdout/stderr.

Best,
Vitaly


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mark Hingston  
View profile  
 More options Nov 20 2012, 9:54 pm
From: Mark Hingston <mrkhings...@gmail.com>
Date: Tue, 20 Nov 2012 18:54:41 -0800 (PST)
Local: Tues, Nov 20 2012 9:54 pm
Subject: Re: gevent + celery + subprocess + graylog = hang

Hey, thanks for the comments guys.

On Saturday, November 10, 2012 11:26:25 AM UTC+11, vitaly wrote:

> Hi Mark, although this workaround might work in your app, I would venture
> to say that it's **not** a legit solution

I agree Vitaly - good points.

On Friday, November 2, 2012 9:23:22 PM UTC+11, Denis Bilenko wrote:

> Still would like to have a way to reproduce your problem.

I got some time to come back to this today - I'm now pretty confident that
this is what's happening in the case of the Popen lockup:

 - process makes socket connections to SQS / graylog (seems that perhaps
it's important here that there are multiple threads created through the use
of these libraries)
 - fork / execv in subprocess.Popen
 - child writes to stdout/stderr, (e.g.:
    "WARNING: Mixing fork() and threads detected; memory leaked."
    or
    "Exception AttributeError: AttributeError("'_DummyThread' object has no
attribute '_Thread__block'",) in <module 'threading' from
'/usr/lib/python2.7/threading.pyc'> ignored")
 - celery redirects these outputs to the logger
 - graypy attempts to use a socket connection to send out a log, child
process hangs
 - parent waits forever on child process

At this point, ps shows:

pid: 23310; parent-pid: 20888; cmd: [celeryd@mark-VirtualBox:MainProcess]
-active- (worker --pool=gevent --loglevel=DEBUG)
pid: 23352; parent-pid: 23310; cmd: [celeryd@mark-VirtualBox:MainProcess]
-active- (worker --pool=gevent --loglevel=DEBUG)

And if I kill process 23352, then celery frees up, finishes processing the
task in question and goes on to process another task.

Further, if I suppress the two error messages mentioned above from being
written to stdout/err in the child then no lockups occur.

So by this point I'm almost certain that the underlying issue I'm seeing is
Issue 154. However the fix suggested in the comments for that issue is not
really appropriate for me I don't think. I know I commented earlier that I
had still seen a lockup with that fix applied, but it's such a hacky fix
(killing all the threads that celery is actually using) that I didn't
really want to investigate further what was causing a lockup in this case.
So I'll be working around this issue by turing off celery's stdout/stderr
redirection.

FWIW, I've attached a tarball that can be used to reproduce the problem. It
requires you to run a celery worker with:

celery worker --pool=gevent --loglevel=DEBUG

You will also need an Amazon SQS login, with AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY set as environment variables.

You also need to run a graylog server on the localhost.

Running this on linux 12.04.1 saw the Popen lockup pretty frequently (when
it happened it was always within the first 5-10 task invocations, and often
on the first).

Thanks for the help!
Mark


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mark Hingston  
View profile  
 More options Nov 20 2012, 10:01 pm
From: Mark Hingston <mrkhings...@gmail.com>
Date: Tue, 20 Nov 2012 19:01:10 -0800 (PST)
Local: Tues, Nov 20 2012 10:01 pm
Subject: Re: gevent + celery + subprocess + graylog = hang

Google groups was barfing when trying to add attachment with tar.gz
extension, trying with .zip

  mytest.zip
1K Download

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Denis Bilenko  
View profile  
 More options Nov 23 2012, 3:23 am
From: Denis Bilenko <denis.bile...@gmail.com>
Date: Fri, 23 Nov 2012 09:22:57 +0100
Local: Fri, Nov 23 2012 3:22 am
Subject: Re: [gevent] Re: gevent + celery + subprocess + graylog = hang

So, execvp/execvpe in subprocess.Popen do not succeed, right? (If they
did, errpipe_read would be closed because it has CLOEXEC flag.)

Could you avoid using thread pool and try again? (set
GEVENT_RESOLVER=ares to make sure dns resolver does not use the thread
pool).

> FWIW, I've attached a tarball that can be used to reproduce the problem. It
> requires you to run a celery worker with:

> celery worker --pool=gevent --loglevel=DEBUG

> You will also need an Amazon SQS login, with AWS_ACCESS_KEY_ID and
> AWS_SECRET_ACCESS_KEY set as environment variables.
> You also need to run a graylog server on the localhost.

Can you make a self-contained tarball that does not depend on external
services and includes all the dependencies and has a single script
that launches everything and reproduces the problem? Otherwise it's
not very useful.

BTW, what is your gevent version? 1.0rc1?


 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
Mark Hingston  
View profile  
 More options Nov 26 2012, 3:13 am
From: Mark Hingston <mrkhings...@gmail.com>
Date: Mon, 26 Nov 2012 00:13:36 -0800 (PST)
Local: Mon, Nov 26 2012 3:13 am
Subject: Re: [gevent] Re: gevent + celery + subprocess + graylog = hang

On Friday, November 23, 2012 7:23:19 PM UTC+11, Denis Bilenko wrote:
> So, execvp/execvpe in subprocess.Popen do not succeed, right? (If they
> did, errpipe_read would be closed because it has CLOEXEC flag.)

Well from what I can tell the process gets stuck before it even tries to
call exec, because my prints directly to sys.__stderr__ that were just
before the exec were never printing and also the child process is always
labelled as a 'celery' process when stuck. If the exec had occurred then I
would have expected the stuck command to be 'touch' (or ffmpeg in my actual
real world code).

> Could you avoid using thread pool and try again? (set
> GEVENT_RESOLVER=ares to make sure dns resolver does not use the thread
> pool).

Thanks for the tip about this - that's an important factor here. We
actually were already changing the resolver with:

gevent.hub.Hub.resolver_class = 'gevent.resolver_ares.Resolver'

but celery patches gevent and uses the resolver before we get a chance to
change it with that line, so we end up using the threaded resolver. We'll
change to use the environment variable to ensure that the right resolver is
used from now on. FWIW, using this env var makes the lockup go away -
without needing any of the other changes I mentioned earlier in my posts.

Previously I hadn't considered that our code should not have been using the
threadpool in the first place. There's no other code actually using a
threadpool for us.

> Can you make a self-contained tarball that does not depend on external
> services and includes all the dependencies and has a single script
> that launches everything and reproduces the problem?

Ok - done and attached to this post. The attached script spawns a child
process that should lock up when it attempts to log in the preexec
function. I wasn't using a preexec function in my celery code so this isn't
the way that the lockup was occurring for me (as I described above, mine
happened because the 'WARNING: Mixing fork ...' print was redirected to
logging which then did a socket call) but in the end it's the same lockup -
on Ubuntu the parent process hangs waiting for the errpipe_read.read() call
to return, and it never returns.

Sorry it took so much time to get there - I actually didn't realise that
Popen.subprocess has a pre-exec function.

BTW, what is your gevent version? 1.0rc1?


Yep, gevent 1.0rc1

Also sorry if this is off topic but in working on trying to get a simple
repro of this issue I tried some code that does something like:

def run():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(("google.com", 80))
    s.close()

pool = ThreadPool(2)
pool.spawn(run)
pool.join()

But this code locks up immediately on ubuntu and on Mac OSX. I couldn't
find much documentation on ThreadPool so I'm wondering if maybe I'm just
using it in a way it was not intended for??

  test.py
< 1K Download

 
You must Sign in before you can post messages.
To post a message you must first join this group.
Please update your nickname on the subscription settings page before posting.
You do not have the permission required to post.
End of messages
« Back to Discussions « Newer topic     Older topic »