Usage of python's subprocess.Popen.communicate()


Caleb Spare

Jan 14, 2014, 9:53:35 PM
to ansibl...@googlegroups.com
Hi, I wasn't sure whether to report this as a bug or not.

The issue was this: I have an rsync task that I use for efficiently copying directory hierarchies to remote machines. The symptom is that it would sometimes hang during the rsync task. ps would reveal that the rsync process was done, but being held onto by the parent process (defunct). If I killed ansible and re-ran the playbook, it would usually work.

After a bunch of troubleshooting, I finally realized what the issue was. I have this in my ssh configuration (among other stuff):

ControlPath ~/.ssh-control/%r@%h:%p
ControlMaster auto

When there isn't already a control connection open, ssh (invoked by rsync) will create a new one. This new process outlives the rsync/ssh that spawned it, and holds onto stderr. Ansible is using Popen.communicate() from python's subprocess module, which not only blocks until the process finishes (that is, until wait() returns) but also blocks until stdout and stderr from the command are closed.
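That difference can be demonstrated with a minimal, self-contained sketch (this is not Ansible's code; it assumes a POSIX sh on the PATH). The grandchild here stands in for the long-lived ssh control-master process:

```python
import subprocess
import time

# The child exits immediately, but forks a grandchild that inherits the
# stdout/stderr pipes and lives on for ~2 seconds -- standing in for the
# long-lived ssh control-master process described above.
cmd = ["sh", "-c", "sleep 2 & echo child-done"]

# wait() only reaps the direct child, so it returns almost immediately.
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
t0 = time.time()
p.wait()
wait_elapsed = time.time() - t0

# communicate() also reads stdout/stderr until EOF, and EOF only arrives
# once the grandchild holding the write ends of the pipes has exited too.
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
t0 = time.time()
out, err = p.communicate()
comm_elapsed = time.time() - t0

print("wait: %.1fs  communicate: %.1fs" % (wait_elapsed, comm_elapsed))
```

On a typical system wait() returns in well under a second, while communicate() takes the full two seconds of the grandchild's lifetime; with a control master whose ControlPersist is set to minutes (or forever), communicate() hangs correspondingly long.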

I think this is bad behavior -- instead, ansible should wait() for the process to finish only. (I mean the wait syscall -- the Popen.wait() documentation lists some caveats about deadlocks, so the changes required probably involve reading the buffers itself or something. I'm not sure of the details because I'm not really a Python user, but in other languages you can certainly call another process and wait for it to finish without waiting for its stdout/stderr to be closed, so I'm sure it's possible with Python as well.)
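For illustration only (this is not Ansible's implementation, and run_until_exit is a made-up name), a hedged sketch of that alternative in Python: drain the pipes with select() while the child runs, but stop once poll() reports the child itself has exited, even if a grandchild still holds the pipe ends open.

```python
import os
import select
import subprocess
import time

def run_until_exit(cmd):
    """Capture the child's output while it runs, but return once the child
    itself has exited, instead of waiting for stdout/stderr to hit EOF."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = b"", b""
    open_fds = [p.stdout, p.stderr]
    while True:
        readable = select.select(open_fds, [], [], 0.1)[0] if open_fds else []
        for f in readable:
            chunk = os.read(f.fileno(), 4096)
            if not chunk:              # EOF: every writer closed this pipe
                open_fds.remove(f)
            elif f is p.stdout:
                out += chunk
            else:
                err += chunk
        # The key difference from communicate(): once the child has exited
        # and this round produced nothing to read, stop -- even though a
        # grandchild (such as an ssh control master) may still hold the pipes.
        if p.poll() is not None and not readable:
            break
        if not open_fds:
            time.sleep(0.05)           # both pipes at EOF; avoid busy-waiting
    return p.returncode, out, err

# The grandchild (sleep) keeps the pipes open for 5s, but we return promptly.
rc, out, err = run_until_exit(["sh", "-c", "sleep 5 & echo hi"])
```

One trade-off worth noting: anything a straggling grandchild writes after the child exits is dropped -- which is exactly the behavior being proposed here.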

I also think ssh's behavior itself is wonky, and I sent the openssh dev ML (openssh-...@mindrot.org) a note as well. (You can read the January archive here and search for my name to see the email).

Thoughts about this?

Thanks!
-Caleb

Michael DeHaan

Jan 14, 2014, 10:09:59 PM
to Caleb Spare, ansibl...@googlegroups.com
First, please let us know what version of Ansible you are using.

I assume you may be referring to this, but it would have been helpful if you referenced line numbers.   (Please do?)

This is from ssh_alt.py actually, on the devel branch:

    # only break out if we've emptied the pipes, or there is nothing to
    # read from and the process has finished.
    if (not rpipes or not rfd) and p.poll() is not None:
        break


It's actually pretty vital that that check stays in there, though I'd welcome experimentation that keeps equivalent behavior.

There can still be output left in the pipes after the process has exited, which we've observed empirically.




--
You received this message because you are subscribed to the Google Groups "Ansible Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ansible-deve...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.



--
Michael DeHaan <mic...@ansibleworks.com>
CTO, AnsibleWorks, Inc.
http://www.ansibleworks.com/

Caleb Spare

Jan 14, 2014, 10:24:38 PM
to Michael DeHaan, ansibl...@googlegroups.com
Apologies; I actually forgot several pieces of information I meant to give.

ansible --version says 'ansible 1.5' (we're using a particular pinned commit that has some post-1.4 fixes).

The code I'm looking at is in the ~/.ansible/tmp/[RANDOM ID THING]/command file that's created when I run an ansible command. It appears to be copied from lib/ansible/module_utils/basic.py:

https://github.com/ansible/ansible/blob/devel/lib/ansible/module_utils/basic.py#L1007

That's the line where it calls Popen.communicate().

I can actually reproduce this completely locally if I add localhost to my $ANSIBLE_HOSTS file (and have passwordless ssh allowed to localhost):

# First kill any open ssh control connection to localhost
$ ps -ef | grep "ss[h]: .*localhost" | awk '{print $2}' | xargs kill 2>/dev/null
$ ansible localhost -a 'ssh localhost ls' # hangs -- eventually I ctrl-c
$ ansible localhost -a 'ssh localhost ls' # works the second time because there's already a control connection running


As I noted in my original email, I have ControlMaster auto in my ~/.ssh/config. If I turn it to off, then the problem does not exist.

Michael DeHaan

Jan 14, 2014, 10:54:31 PM
to Caleb Spare, ansibl...@googlegroups.com
Ok, so that's not asking about the connection.   Invoking ssh localhost is pretty bizarre ansible usage and the module_utils function is a pretty standard communicate call pattern.

I suspect what's happening is your command really isn't exiting, but I'd be curious how you could make it better.

The correct Ansible patterns here would be to do this:

- local_action: shell ls

Or if remote:

- shell: ls

And I know lots of people did "local_action: rsync" fine before the synchronize module existed, and pretty much everyone on Ubuntu is using ControlMaster auto by default.

See the defaults in ansible.cfg.

#ssh_args = -o ControlMaster=auto -o ControlPersist=60s

Caleb Spare

Jan 14, 2014, 11:18:37 PM
to Michael DeHaan, ansibl...@googlegroups.com
On Tue, Jan 14, 2014 at 7:54 PM, Michael DeHaan <mic...@ansibleworks.com> wrote:
Ok, so that's not asking about the connection.   Invoking ssh localhost is pretty bizarre ansible usage and the module_utils function is a pretty standard communicate call pattern.

I think my original email was pretty clear. The problem is when calling rsync, but just calling ssh directly reproduces the issue and skips the middleman. Obviously I'm not calling ssh in my actual deployment code.
 

I suspect what's happening is your command really isn't exiting, but I'd be curious how you could make it better.

My command absolutely is exiting.

# First kill any open ssh control connection to localhost
$ ps -ef | grep "ss[h]: .*localhost" | awk '{print $2}' | xargs kill 2>/dev/null
$ ansible localhost -a 'ssh localhost ls && echo done >> /tmp/foobar'

Now it hangs. You can look in /tmp/foobar and see that it's done. Furthermore, you can ps -ef | grep ssh and see the process:

caleb    15918 15917  0 20:01 pts/1    00:00:00 [ssh] <defunct>

It's defunct because it has exited and the parent process (the ansible program in ~/.ansible/tmp/...) is still holding onto it.

I can prove that it's stuck on the communicate() line waiting for stderr to close:

# First kill any open ssh control connection to localhost
$ ps -ef | grep "ss[h]: .*localhost" | awk '{print $2}' | xargs kill 2>/dev/null
$ ansible localhost -a 'bash -c "exec 2>&- ssh localhost ls"' # does NOT hang

(2>&- is a bashism for closing stderr)


The correct Ansible patterns here would be do this:

- local_action: shell ls

Yes, my real code where I discovered this issue does

- local_action: rsync ...
 

Or if remote:

- shell: ls

And I know lots of people did "local_action: rsync" fine before the synchronize module existed, and pretty much everyone on Ubuntu is using ControlMaster auto by default.

Ubuntu does not use ControlMaster by default afaik. ControlMaster is 'no' by default (see man ssh_config) and /etc/ssh/ssh_config on my Ubuntu 13.10 machine does not set ControlMaster. If you are referring to Ansible's own ssh invocation, see below...


See the defaults in ansible.cfg.


#ssh_args = -o ControlMaster=auto -o ControlPersist=60s

Ansible's ssh settings have nothing to do with this. What matters is the system's ssh configuration, because we're invoking ssh (through rsync) by shelling out.

Caleb Spare

Jan 15, 2014, 5:07:58 AM
to Michael DeHaan, ansibl...@googlegroups.com
By the way, thanks for pointing out the synchronize module. Last I checked (when I wrote my own rsync task which spawned this thread) it wasn't merged. I'll be glad to use that instead of our custom code.

That module is doing essentially the same thing as my rsync task, and I can reproduce the hang with it as well (in the same manner as I've been describing, with ControlMaster auto in my ssh configuration and killing the ssh connection beforehand.)

Michael DeHaan

Jan 15, 2014, 8:01:49 AM
to Caleb Spare, ansibl...@googlegroups.com
"I think my original email was pretty clear."

Alas, I didn't find it so.

So in your case, you're initiating rsync from the remote, and not using local_action.

Can you look into whether synchronize works for you? Assuming you are rsyncing files from one host to another, and one is not localhost, please see this example:

# Synchronization of src on delegate host to dest on the current inventory host
- synchronize: src=some/relative/path dest=/some/absolute/path
  delegate_to: delegate.host

This is documented here:
http://docs.ansible.com/synchronize_module.html

Caleb Spare

Jan 15, 2014, 11:05:31 AM
to Michael DeHaan, ansibl...@googlegroups.com
On Wed, Jan 15, 2014 at 5:01 AM, Michael DeHaan <mic...@ansibleworks.com> wrote:
"I think my original email was pretty clear."

Alas, I didn't find it so.

So in your case, you're initiating rsync from the remote, and not using local_action.

Oh I see the problem. Yes, my repro is running ssh as a remote action. This incidentally does reproduce the behavior, but I see why that's confusing.

Let me try again to explain what's going on.

Here is my rsync task that hangs: https://gist.github.com/cespare/8438518

Here is a minimal playbook that reproduces the issue using ssh directly instead of using rsync: https://gist.github.com/cespare/8438767

We'll use localhost as the remote host as well, but it can be anything. First, make sure localhost appears somewhere in the ANSIBLE_HOSTS file. Next, kill any existing open ssh connections to localhost:

$ ps -ef | grep "ss[h]: .*localhost" | awk '{print $2}' | xargs kill 2>/dev/null

Run the first test:

$ ansible-playbook hang_playbook.yml -e 'hosts=localhost task_to_run=1' 

Kill the connections again, and then run the second test:

$ ansible-playbook hang_playbook.yml -e 'hosts=localhost task_to_run=2' 

The first test hangs, but the second does not. The reason is the same as I've been explaining: Popen.communicate() is waiting for stderr to close.
 

Can you look into whether synchronize works for you, assuming you are rsyncing files from a given host to another, and one is not local host, please see this example:

I think you missed my follow-up email above. I have tried the synchronize module; it's doing the same thing as my own rsync task (running rsync as a local_action), and it also has the hanging issue that I've been describing.

Michael DeHaan

Jan 15, 2014, 11:09:25 AM
to Caleb Spare, ansibl...@googlegroups.com
yeah to be clear I don't think it's illegal to run ssh as a remote action, but the likelihood of SSH going into some kind of prompt mode when you are doing that is quite high  (host key not added yet,  needs password, credentials not forwarded, etc).

So I wanted to be sure to rule that out.



Caleb Spare

Jan 15, 2014, 8:30:58 PM
to ansibl...@googlegroups.com, Caleb Spare
So do we agree that this is an issue? Should I file an issue and/or send a pull request? My fix would involve using subprocess in a different way (not using communicate) so that we wait for the process to finish but not for it to close stdout/stderr.

Michael DeHaan

Jan 15, 2014, 8:53:35 PM
to Caleb Spare, ansibl...@googlegroups.com
I don't agree it's an issue yet, as I haven't had time to reproduce it -- though I'm not saying it's not. But it's not something that's been reported before (in two years).

(It sounds like there's a high chance ssh is going interactive and asking you a question, if I had to bet -- but I could be wrong)

If you can find a way to tweak the run code that solves your problem, I'd be interested in seeing it, so would you like to poke around?

Caleb Spare

Jan 15, 2014, 9:03:16 PM
to Michael DeHaan, ansibl...@googlegroups.com
On Wed, Jan 15, 2014 at 5:53 PM, Michael DeHaan <mic...@ansibleworks.com> wrote:
I don't agree it's an issue yet, as I haven't had time to reproduce it -- though I'm not saying it's not. But it's not something that's been reported before (in two years).

Most folks don't use ControlMaster auto in their ~/.ssh/config.
 

(It sounds like there's a high chance ssh is going interactive and asking you a question, if I had to bet -- but I could be wrong)

I have provided a lot of evidence that this is not the case, and it's the communicate() thing I described.
  • The ssh process (or sh process, if run through a shell) is defunct -- i.e. not running
  • Things done after ssh in the command execute (see the `&& echo done >> /tmp/foobar` example)
  • Under the same conditions, redirecting stderr to /dev/null fixes the issue

If you can find a way to tweak the run code that solves your problem, I'd be interested in seeing it, so would you like to poke around?

Sure, I'll send over a PR when I get a chance to look into it further.

Brian Coca

Jan 15, 2014, 11:21:53 PM
to ansibl...@googlegroups.com
I've been using ControlMaster for a long time -- I was using it with ansible even before it was incorporated as a default option (most people used paramiko because it was faster without it) -- and I've never encountered these kinds of issues except when they were caused by a command or script keeping a tty resource open (stdin/out/err).

Caleb Spare

Jan 15, 2014, 11:25:36 PM
to Brian Coca, ansibl...@googlegroups.com
On Wed, Jan 15, 2014 at 8:21 PM, Brian Coca <bria...@gmail.com> wrote:
I've been using ControlMaster for a long time -- I was using it with ansible even before it was incorporated as a default option (most people used paramiko because it was faster without it) -- and I've never encountered these kinds of issues except when they were caused by a command or script keeping a tty resource open (stdin/out/err).

That's what's happening. ssh is keeping stderr open. 


Johannes Truschnigg

Jan 16, 2014, 3:29:47 AM
to ansibl...@googlegroups.com
On 01/15/2014 05:09 PM, Michael DeHaan wrote:
> yeah to be clear I don't think it's illegal to run ssh as a remote action,
> but the likelihood of SSH going into some kind of prompt mode when you are
> doing that is quite high (host key not added yet, needs password,
> credentials not forwarded, etc).

That use case is one of the reasons why OpenSSH's BatchMode
("-oBatchMode=yes") exists.

--
Best regards,
Johannes Truschnigg
Senior System Administrator
--
mailto:johannes....@geizhals.at (in urgent cases, please write to
in...@geizhals.at)

Geizhals(R) - Preisvergleich Internet Services AG
Obere Donaustrasse 63/2
A-1020 Wien
Tel: +43 1 5811609/87
Fax: +43 1 5811609/55
http://geizhals.at => price comparison for Austria
http://geizhals.de => price comparison for Germany
http://geizhals.eu => EU-wide price comparison
Handelsgericht Wien | FN 197241K | Firmensitz Wien