tl;dr; Seeing an Ansible bug where commands take lots of time to start, and then complete quickly; seems to affect Ubuntu VMs.
Full description: I’m using Ansible to orchestrate code deployments. These deployments frequently take _much_ longer than expected, because individual steps that should take only a few seconds to run (like copying a small file or getting facts) take up to 4 minutes, where the time is in SSH waits.
The Ansible profiler output shows that commands almost always take some exact multiple of 60 seconds, plus a small, reasonable, expected delta. Here’s an example:
Monday 08 August 2016 11:26:18 -0700 (0:00:00.650) 0:13:51.907 *********
===============================================================================
TASK: appserver : Copy web jar file from local ------------------------ 242.42s
TASK: setup ---------------------------------------------------------- 242.21s
TASK: localhost_keystore : Create the forward directory --------------- 181.18s
TASK: appserver : Copy web dependencies jar file from local ------------ 66.84s
TASK: localhost_keystore : Copy the keystore file ---------------------- 60.84s
TASK: Wait for app server to accept incoming HTTPS connections ---------- 9.74s
This problem occurs across playbooks, servers, and tasks - pretty much everywhere. All are Ubuntu 14.04 server instances.
[I’ve seen a possibly related problem locally on my Mac laptop, when using a script to create a Docker image. When Ansible would connect right after the docker image had started up, there was a race condition where the connection could occur as the SSH server was starting up, and I would often see a similar added 60 second delay. This problem went away by delaying the initial connection attempt by 5 seconds to give enough time for the SSH server to fully start up. ]
There’s one 60 second default delay in the source, which is related to these delays:
ansible/constants.py:ANSIBLE_SSH_ARGS = get_config(p, 'ssh_connection', 'ssh_args', 'ANSIBLE_SSH_ARGS', '-o ControlMaster=auto -o ControlPersist=60s’)
The man page for this param:
ControlPersist
When used in conjunction with ControlMaster, specifies that the master connection should remain open in the background (waiting for future client connections) after the initial client connection has been closed. If set to ``no'', then the master connection will not be placed into the background, and will close as soon as the initial client connection is closed. If set to ``yes'', then the master connection will remain in the background indefinitely (until killed or closed via a mechanism such as the ssh(1) ``-O exit'' option). If set to a time in seconds, or a time in any of the formats documented in sshd_config(5), then the backgrounded master connection will automatically terminate after it has remained idle (with no client connections) for the specified time.
Looking in the SSH logs for a machine where this was occurring (with 60s value), I see examples like the below where the session is immediately closed, but then later opened, potentially multiple times, even for simple things like setup (getting facts for a playbook).
Aug 8 13:35:19 lohika-linux-2 sudo: pam_unix(sudo:session): session opened for user root by forward(uid=0)
Aug 8 13:35:19 lohika-linux-2 sudo: pam_unix(sudo:session): session closed for user root
Aug 8 13:36:20 lohika-linux-2 sshd[2265]: Received disconnect from 10.128.0.116: 11: disconnected by user
Aug 8 13:36:20 lohika-linux-2 sshd[2216]: pam_unix(sshd:session): session closed for user forward
Aug 8 13:36:21 lohika-linux-2 sshd[2361]: Accepted publickey for forward from 10.128.0.116 port 40208 ssh2: RS
Other times, you just see a ‘session opened’ message but no immediate 'session closed’.
Here’s an example of Ansible’s output with full verbosity when the delay occurred (set to 25s); 3 out of 4 targets saw the delay but I’m showing one fast (lohika-linux-4) example and one slow example (lohika-linux-3) in the hopes that something pops out.
TASK [setup] *******************************************************************
Monday 08 August 2016 15:05:59 -0700 (0:00:00.190) 0:01:46.080 *********
<lohika-linux-4> ESTABLISH SSH CONNECTION FOR USER: forward
<lohika-linux-4> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=25s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt lohika-linux-4 ( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507 )" )
<lohika-linux-3> ESTABLISH SSH CONNECTION FOR USER: forward
<lohika-linux-3> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=25s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt lohika-linux-3 ( umask 22 && mkdir -p "$( echo $HOME/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723 )" && echo "$( echo $HOME/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723 )" )
<lohika-linux-4> PUT /tmp/tmpCVuqVg TO /home/forward/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507/setup
<lohika-linux-4> SSH: EXEC scp -C -vvv -o ControlMaster=auto -o ControlPersist=25s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r /tmp/tmpCVuqVg [lohika-linux-4]:/home/forward/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507/setup
<lohika-linux-4> ESTABLISH SSH CONNECTION FOR USER: forward
<lohika-linux-4> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=25s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt lohika-linux-4 LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python /home/forward/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507/setup; rm -rf "/home/forward/.ansible/tmp/ansible-tmp-1470693959.32-112350956649507/" > /dev/null 2>&1
ok: [lohika-linux-4]
<lohika-linux-3> PUT /tmp/tmpEbpIWo TO /home/forward/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723/setup
<lohika-linux-3> SSH: EXEC scp -C -vvv -o ControlMaster=auto -o ControlPersist=25s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r /tmp/tmpEbpIWo [lohika-linux-3]:/home/forward/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723/setup
<lohika-linux-3> ESTABLISH SSH CONNECTION FOR USER: forward
<lohika-linux-3> SSH: EXEC ssh -C -vvv -o ControlMaster=auto -o ControlPersist=25s -o StrictHostKeyChecking=no -o 'IdentityFile="/home/jenkins/.ssh/id_rsa_fwd_ansible_dev.pem"' -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=forward -o ConnectTimeout=600 -o ControlPath=/home/jenkins/.ansible/cp/ansible-ssh-%h-%p-%r -tt lohika-linux-3 LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 /usr/bin/python /home/forward/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723/setup; rm -rf "/home/forward/.ansible/tmp/ansible-tmp-1470693959.34-233827923418723/" > /dev/null 2>&1
ok: [lohika-linux-3]
When changing this parameter via ansible.cfg, I definitely see an effect, in that the delay lines up with the ControlMaster change. For example, dropping to 25s, I see 26s and 51s delays, and the rest effectively zero.
However when I drop this param to 10s, I find that lots of commands that previously had sub-second delays then take at least the ControlPersist time:
Monday 08 August 2016 14:32:02 -0700 (0:00:00.053) 0:03:50.756 *********
===============================================================================
TASK: Wait for VM to boot ---------------------------------------------- 42.17s
TASK: setup ----------------------------------------------------------- 31.22s
TASK: stat path=/etc/ec2_version --------------------------------------- 30.92s
TASK: Verify that parser_dev_web can auth to the Forward server -------- 21.50s
TASK: Ensure server is accepting incoming web connections -------------- 21.01s
TASK: Ensure that the server port is accessible ------------------------ 20.94s
TASK: setup ----------------------------------------------------------- 10.98s
TASK: setup ----------------------------------------------------------- 10.97s
TASK: Wait for app server to accept incoming HTTPS connections --------- 10.89s
TASK: Check the status of the libvirt-bin service ---------------------- 10.69s
TASK: Wait for 5 seconds before checking libvirt-bin status ------------- 5.03s
TASK: Delay before checking for the boot status of the VM ——————— 5.02s
Or, for 20s:
TASK: setup ----------------------------------------------------------- 61.18s
TASK: stat path=/etc/ec2_version --------------------------------------- 60.94s
TASK: Check the status of the libvirt-bin service ---------------------- 60.94s
TASK: setup ----------------------------------------------------------- 41.22s
TASK: Wait for VM to boot ---------------------------------------------- 27.75s
TASK: Wait for app server to accept incoming HTTPS connections --------- 20.94s
TASK: Ensure server is accepting incoming web connections -------------- 20.92s
TASK: Ensure that the server port is accessible ------------------------ 20.71s
So at this point I have a semi-workaround (use 25-30s), which helps the worst case quite a bit, while only occasionally hurting the typical case.
This certainly feels like either a bug in Ansible’s SSH handling for some condition that I’m seeing when running on Ubuntu, or some issue in my SSH client config that triggers errors. It does seem like the problem occurs more frequently when the underlying VM host for the target VMs is more loaded. I can’t seem to replicate the problem locally on a Mac.
I’ve confirmed that the issue is not Reverse DNS lookups; the target servers all have UseDNS no set in /etc/ssh/sshd_config, plus -u0 set to further prevent Reverse DNS.
The client version used on Ubuntu is the latest one for 14.04 LTS, which is somewhat old: OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7, OpenSSL 1.0.1f 6 Jan 2014
For most of these tests, I was using Ansible v2.0.0-0.7.rc2; switching to v2.1.1.0-1 produced identical effects.
Anyone have ideas, or seen anything similar?
Much appreciated!