Wacky SSH timeout bug with Ansible

918 views
Skip to first unread message

Brandon Heller

unread,
Aug 8, 2016, 10:14:01 PM8/8/16
to Ansible Project

tl;dr; Seeing an Ansible bug where commands take lots of time to start, and then complete quickly; seems to affect Ubuntu VMs.


Full description: I’m using Ansible to orchestrate code deployments.  These deployments frequently take _much_ longer than expected, because individual steps that should take only a few seconds to run (like copying a  small file or getting facts) take up to 4 minutes, where the time is in SSH waits.


The Ansible profiler output shows that commands almost always take some exact multiple of 60 seconds, plus a small, reasonable, expected delta.  Here’s an example:


Monday 08 August 2016  11:26:18 -0700 (0:00:00.650)       0:13:51.907 *********

===============================================================================

TASK: appserver : Copy web jar file from local ------------------------ 242.42s

TASK: setup  ---------------------------------------------------------- 242.21s

TASK: localhost_keystore : Create the forward directory --------------- 181.18s

TASK: appserver : Copy web dependencies jar file from local ------------ 66.84s

TASK: localhost_keystore : Copy the keystore file ---------------------- 60.84s

TASK: Wait for app server to accept incoming HTTPS connections ---------- 9.74s


This problem occurs across playbooks, servers, and tasks - pretty much everywhere.  All are Ubuntu 14.04 server instances.  


[I’ve seen a possibly related problem locally on my Mac laptop, when using a script to create a Docker image.  When Ansible would connect right after the docker image had started up, there was a race condition where the connection could occur as the SSH server was starting up, and I would often see a similar added 60 second delay.  This problem went away by delaying the initial connection attempt by 5 seconds to give enough time for the SSH server to fully start up. ]


There’s one 60 second default delay in the source, which is related to these delays:


ansible/constants.py:ANSIBLE_SSH_ARGS = get_config(p, 'ssh_connection', 'ssh_args', 'ANSIBLE_SSH_ARGS', '-o ControlMaster=auto -o ControlPersist=60s’)


The man page for this param:

    ControlPersist

            When used in conjunction with ControlMaster, specifies that the master connection should remain open in the background (waiting for future client connections) after the initial client connection has been closed.  If set to ``no'', then the master connection will not be placed into the background, and will close as soon as the initial client connection is closed.  If set to ``yes'', then the master connection will remain in the background indefinitely (until killed or closed via a mechanism such as the ssh(1) ``-O exit'' option).  If set to a time in seconds, or a time in any of the formats documented in sshd_config(5), then the backgrounded master connection will automatically terminate after it has remained idle (with no client connections) for the specified time.


Looking in the SSH logs for a machine where this was occurring (with 60s value), I see examples like the below where the session is immediately closed, but then later opened, potentially multiple times, even for simple things like setup (getting facts for a playbook).


Aug  8 13:35:19 lohika-linux-2 sudo: pam_unix(sudo:session): session opened for user root by forward(uid=0)

Aug  8 13:35:19 lohika-linux-2 sudo: pam_unix(sudo:session): session closed for user root

Aug  8 13:36:20 lohika-linux-2 sshd[2265]: Received disconnect from 10.128.0.116: 11: disconnected by user

Aug  8 13:36:20 lohika-linux-2 sshd[2216]: pam_unix(sshd:session): session closed for user forward

Aug  8 13:36:21 lohika-linux-2 sshd[2361]: Accepted publickey for forward from 10.128.0.116 port 40208 ssh2: RS


Other times, you just see a ‘session opened’ message but no immediate 'session closed’.

Here’s an example of Ansible’s output with full verbosity when the delay occurred (set to 25s); 3 out of 4 targets saw the delay but I’m showing one fast (lohika-linux-4) example and one slow example (lohika-linux-3) in the hopes that something pops out.


TASK [setup] *******************************************************************

Monday 08 August 2016  15:05:59 -0700 (0:00:00.190)       0:01:46.080 *********

<lohika-linux-4> ESTABLISH SSH CONNECTION FOR USER: forward

<lohika-linux-4> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=25s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt lohika-linux-4 ( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507 )" )

<lohika-linux-3> ESTABLISH SSH CONNECTION FOR USER: forward

<lohika-linux-3> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=25s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt lohika-linux-3 ( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723 )" )

<lohika-linux-4> PUT /tmp/tmpCVuqVg TO /home/forward/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507/setup

<lohika-linux-4> SSH: EXEC scp -C -vvv -o ControlMaster=auto -o ControlPersist=25s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r /tmp/tmpCVuqVg [lohika-linux-4]:/home/forward/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507/setup

<lohika-linux-4> ESTABLISH SSH CONNECTION FOR USER: forward

<lohika-linux-4> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=25s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt lohika-linux-4 LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python /home/forward/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507/setup; rm -rf "/home/forward/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507/" > /dev/null 2>&1

ok: [lohika-linux-4]

<lohika-linux-3> PUT /tmp/tmpEbpIWo TO /home/forward/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723/setup

<lohika-linux-3> SSH: EXEC scp -C -vvv -o ControlMaster=auto -o ControlPersist=25s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r /tmp/tmpEbpIWo [lohika-linux-3]:/home/forward/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723/setup

<lohika-linux-3> ESTABLISH SSH CONNECTION FOR USER: forward

<lohika-linux-3> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=25s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt lohika-linux-3 LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python /home/forward/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723/setup; rm -rf "/home/forward/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723/" > /dev/null 2>&1

ok: [lohika-linux-3]


When changing this parameter via ansible.cfg, I definitely see an effect, in that the delay lines up with the ControlMaster change.  For example, dropping to 25s, I see 26s and 51s delays, and the rest effectively zero.


However when I drop this param to 10s, I find that lots of commands that previously had sub-second delays then take at least the ControlPersist time:


Monday 08 August 2016  14:32:02 -0700 (0:00:00.053)       0:03:50.756 *********

===============================================================================

TASK: Wait for VM to boot ---------------------------------------------- 42.17s

TASK: setup  ----------------------------------------------------------- 31.22s

TASK: stat path=/etc/ec2_version --------------------------------------- 30.92s

TASK: Verify that parser_dev_web can auth to the Forward server -------- 21.50s

TASK: Ensure server is accepting incoming web connections -------------- 21.01s

TASK: Ensure that the server port is accessible ------------------------ 20.94s

TASK: setup  ----------------------------------------------------------- 10.98s

TASK: setup  ----------------------------------------------------------- 10.97s

TASK: Wait for app server to accept incoming HTTPS connections --------- 10.89s

TASK: Check the status of the libvirt-bin service ---------------------- 10.69s

TASK: Wait for 5 seconds before checking libvirt-bin status ------------- 5.03s

TASK: Delay before checking for the boot status of the VM ——————— 5.02s


Or, for 20s:


TASK: setup  ----------------------------------------------------------- 61.18s

TASK: stat path=/etc/ec2_version --------------------------------------- 60.94s

TASK: Check the status of the libvirt-bin service ---------------------- 60.94s

TASK: setup  ----------------------------------------------------------- 41.22s

TASK: Wait for VM to boot ---------------------------------------------- 27.75s

TASK: Wait for app server to accept incoming HTTPS connections --------- 20.94s

TASK: Ensure server is accepting incoming web connections -------------- 20.92s

TASK: Ensure that the server port is accessible ------------------------ 20.71s


So at this point I have a semi-workaround (use 25-30s), which helps the worst case quite a bit, while only occasionally hurting the typical case.


This certainly feels like either a bug in Ansible’s SSH handling for some condition that I’m seeing when running on Ubuntu, or some issue in my SSH client config that triggers errors.  It does seem like the problem occurs more frequently when the underlying VM host for the target VMs is more loaded.  I can’t seem to replicate the problem locally on a Mac.


I’ve confirmed that the issue is not Reverse DNS lookups; the target servers all have UseDNS no set in /etc/ssh/sshd_config, plus -u0 set to further prevent Reverse DNS.


The client version used on Ubuntu is the latest one for 14.04 LTS, which is somewhat old: OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7, OpenSSL 1.0.1f 6 Jan 2014


For most of these tests, I was using Ansible v2.0.0-0.7.rc2; switching to v2.1.1.0-1 produced identical effects.


Anyone have ideas, or seen anything similar?  


Much appreciated!


Brandon Heller

unread,
Aug 10, 2016, 4:39:39 PM8/10/16
to Ansible Project
On a whim, I tried setting pipelining to true, and found that it dramatically reduced the incidence of these issues, but still did not resolve them entirely.  

I still see examples of simple commands that take 60+1 seconds to complete.  

If anyone is seeing the same issues, definitely speak up.  Thanks!
Reply all
Reply to author
Forward
0 new messages