mysterious failures with truncated output, after upgrading managed machine to 4.2.0 kernel

77 views
Skip to first unread message

mspr...@us.ibm.com

unread,
Feb 8, 2016, 9:51:06 AM2/8/16
to Ansible Project
I am getting some really hard to understand failures.

I have three managed machines, running Ubuntu 14.04.03 --- which normally has Linux kernel 3.13.  I run Ansible 1.9.4 on my Mac (OS 10.10.5) to manage them.  The managed machines are VMs in a cloud service, on a datacenter subnet, and I use an OpenVPN tunnel between them and my Mac.  I have a playbook, and it runs.  All is well.  BTW: I have no ansible.cfg anywhere (I set the SL_INVENTORY envar to point at the appropriate dynamic inventory program), and none of my SSH config files say anything about TCPKeepAlive or ServerAlive<whatever>.

Then on the managed machines I run (as root) `apt-get install linux-generic-lts-wily`, which upgrades the kernel to 4.2.  After that my Ansible is flaky.  When I run my playbook, usually one or two of the three managed machines fails the gather-facts; the remainder fail in the first few tasks.  Which fails where is variable, so I think this is due to some intermittent problem.

Thinking it might be a connectivity flakiness, I created `ansible.cfg` that specifies a timeout of 100, and set `TCPKeepAlive no` and `ServerAliveCountMax 100` in my SSH client config (leaving ServerAliveInterval unspecified everywhere, which I think disables that function).  These did not solve the problem.  The playbook runs to full failure in under 30 seconds.

The failure message shows a truncated reply from the remote machine and then some SSH debugging that ends with a complaint about a broken pipe.  Here is an example from the gather-facts step (I passed `-vvvv` to `ansible-playbook`):

GATHERING FACTS ***************************************************************
<10.140.155.166> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.162> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.166> REMOTE_MODULE setup
<10.140.155.161> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.162> REMOTE_MODULE setup
<10.140.155.161> REMOTE_MODULE setup
<10.140.155.166> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.166 /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573 && echo $HOME/.ansible/tmp/ansible-tmp-1454784058.99-138766034918573'
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.162 /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274 && echo $HOME/.ansible/tmp/ansible-tmp-1454784058.99-66302778668274'
<10.140.155.161> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.161 /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077 && echo $HOME/.ansible/tmp/ansible-tmp-1454784058.99-221770072514077'
<10.140.155.166> PUT /var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpFwQwuI TO setup
<10.140.155.161> PUT /var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpcY2it_ TO setup
<10.140.155.162> PUT /var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpuXklyy TO setup
<10.140.155.166> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.166 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, key=spxlnvdjayufeyagwywqhtdtptisrnej] password: " -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-spxlnvdjayufeyagwywqhtdtptisrnej; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python setup'"'"''
<10.140.155.161> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.161 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, key=dprrxybimhltrqziynaikzahzsgdjmlc] password: " -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-dprrxybimhltrqziynaikzahzsgdjmlc; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python setup'"'"''
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.162 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, key=bnojgeocfbeoxbgakzuluppvtxaviuyr] password: " -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-bnojgeocfbeoxbgakzuluppvtxaviuyr; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python setup'"'"''
failed: [10.140.155.166] => {"failed": true, "parsed": false}
BECOME-SUCCESS-spxlnvdjayufeyagwywqhtdtptisrnej
{"verbose_override": true, "changed": false, "ansible_facts": {<lots of gather-facts output here, but it is not complete/>OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011
debug1: Reading configuration data /Users/mspreitz/.ssh/config
debug1: /Users/mspreitz/.ssh/config line 8: Applying options for 10.*
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 102: Applying options for *
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 9654
debug3: mux_client_request_session: session request sent
debug1: mux_client_request_session: master session id: 2
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 0
Shared connection to 10.140.155.166 closed.

From the same run, here is a failure at a task for the one machine that succeeded to gather-facts in this case:

TASK: [bootstrap | add any dependent repository keys from url] ****************
<10.140.155.162> ESTABLISH CONNECTION FOR USER: mspreitzer
<10.140.155.162> REMOTE_MODULE apt_key url=http://10.140.132.215:8082/Release.pgp
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.162 /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812 && echo $HOME/.ansible/tmp/ansible-tmp-1454784077.96-58394575390812'
<10.140.155.162> PUT /var/folders/cd/d050bcgj7xd6lkbfhnb9bns80000gp/T/tmpxHKer5 TO apt_key
<10.140.155.162> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/Users/mspreitz/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=mspreitzer -o ConnectTimeout=100 10.140.155.162 /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, key=jsmmpfgpouhdlwmcvgmwvbhidafftccn] password: " -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-jsmmpfgpouhdlwmcvgmwvbhidafftccn; LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python apt_key'"'"''
failed: [10.140.155.162] => (item={'repo': u'deb http://10.140.132.215:8082/ docker main', 'key_url': u'http://10.140.132.215:8082/Release.pgp'}) => {"failed": true, "item": {"key_url": "http://10.140.132.215:8082/Release.pgp", "repo": "deb http://10.140.132.215:8082/ docker main"}, "parsed": false}
BECOME-SUCCESS-jsmmpfgpouhdlwmcvgmwvbhidafftccn
OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011
debug1: Reading configuration data /Users/mspreitz/.ssh/config
debug1: /Users/mspreitz/.ssh/config line 8: Applying options for 10.*
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 102: Applying options for *
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 9658
debug3: mux_client_request_session: session request sent
debug1: mux_client_request_session: master session id: 2
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 0
Shared connection to 10.140.155.162 closed.


In desperation I captured all packets on the VPN network interface on my Mac during that run, then looked at it with Wireshark.  I see one long-lived connection for each of the three managed machines, plus one short-lived connection for 10.140.155.162.

I see no end to the long-lived connections.  Just to make sure it was not due to a missing buffer flush in the packet capture, I did a `ping` of one managed machine after `ansible-playbook` finished; I see the `ping` in the captured packets.  So how can Ansible suffer from truncated replies when the connections stay open?  How can the connections stay open after `ansible-playbook` exits?

The short-lived connection is even stranger.  It comes late in the trace.  The client closes it soon after the completion of the three-packet handshake that opens the connection.  The client sends no TCP payload at all, not even the "client hello" of TLS.  The server sends a "server hello", which prompts the client to respond with a RST.  What the heck is going on with that?

Thanks for any clues,
Mike

Brian Coca

unread,
Feb 8, 2016, 12:13:53 PM2/8/16
to Ansible Project
I'm not sure but it seems it might be related to this issue,
https://github.com/ansible/ansible/issues/13876

it seems that an interaction between openssh and the newer kernels is
closing the ssh connection sooner than it should.




--
Brian Coca
Reply all
Reply to author
Forward
0 new messages