Gathering Facts Hangs after Reboot

163 views
Skip to first unread message

Tony Kinsley

unread,
Nov 1, 2014, 3:45:58 AM11/1/14
to ansible...@googlegroups.com
Today I stumbled upon what I think is an extremely rare bug with the paramiko ssh connection plugin. It seems that something my team added recently to our startup scripts has caused the ssh connection to hang when attempting to connect close to immediately after a reboot. I setup a seperate playbook that would reboot a VM and then run a debug task after the reboot ( plus gathering facts ). The output including all of my debug print statements. I could not use pdb at all with this problem as adding the slightest delay would make the problem go away.

PLAY [vms] ******************************************************************** 

TASK: [power on vsphere] ****************************************************** 
<127.0.0.1> REMOTE_MODULE vsphere login=root password=VALUE_HIDDEN host=192.168.140.200
### Executing Module ###
### LATE NEEDS TMP PATH ###
### NEED TMP PATH ###
### prior to conn.shell.mkdtemp ###
### post conn.shell.mkdtemp cmd = mkdir -p $HOME/.ansible/tmp/ansible-tmp-1414825524.97-171497198370849 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1414825524.97-171497198370849 && echo $HOME/.ansible/tmp/ansible-tmp-1414825524.97-171497198370849 ###
<127.0.0.1> EXEC ['/bin/sh', '-c', 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1414825524.97-171497198370849 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1414825524.97-171497198370849 && echo $HOME/.ansible/tmp/ansible-tmp-1414825524.97-171497198370849']
### post low_level_exec_command result = {'stdout': '/home/akinsley/.ansible/tmp/ansible-tmp-1414825524.97-171497198370849\n', 'stderr': '', 'rc': 0} ###
### tmp path = /home/akinsley/.ansible/tmp/ansible-tmp-1414825524.97-171497198370849/  ###
### remote module path /home/akinsley/.ansible/tmp/ansible-tmp-1414825524.97-171497198370849/vsphere ###
### TRANSFERING STRING ###
<127.0.0.1> PUT /tmp/tmpYLi4mm TO /home/akinsley/.ansible/tmp/ansible-tmp-1414825524.97-171497198370849/vsphere
<127.0.0.1> EXEC ['/bin/sh', '-c', u'LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/env python /home/akinsley/.ansible/tmp/ansible-tmp-1414825524.97-171497198370849/vsphere']
changed: [192.168.140.119 -> 127.0.0.1] => {"changed": true, "msg": "\"'vim.Task:task-113516'\" completed successfully."}

TASK: [power on nova] ********************************************************* 
skipping: [192.168.140.119]

TASK: [wait for ssh] ********************************************************** 
<127.0.0.1> REMOTE_MODULE wait_for port=22 host=192.168.140.119
### Executing Module ###
### LATE NEEDS TMP PATH ###
### NEED TMP PATH ###
### prior to conn.shell.mkdtemp ###
### post conn.shell.mkdtemp cmd = mkdir -p $HOME/.ansible/tmp/ansible-tmp-1414825535.28-82995727032606 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1414825535.28-82995727032606 && echo $HOME/.ansible/tmp/ansible-tmp-1414825535.28-82995727032606 ###
<127.0.0.1> EXEC ['/bin/sh', '-c', 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1414825535.28-82995727032606 && chmod a+rx $HOME/.ansible/tmp/ansible-tmp-1414825535.28-82995727032606 && echo $HOME/.ansible/tmp/ansible-tmp-1414825535.28-82995727032606']
### post low_level_exec_command result = {'stdout': '/home/akinsley/.ansible/tmp/ansible-tmp-1414825535.28-82995727032606\n', 'stderr': '', 'rc': 0} ###
### tmp path = /home/akinsley/.ansible/tmp/ansible-tmp-1414825535.28-82995727032606/  ###
### remote module path /home/akinsley/.ansible/tmp/ansible-tmp-1414825535.28-82995727032606/wait_for ###
### TRANSFERING STRING ###
<127.0.0.1> PUT /tmp/tmptiaR4f TO /home/akinsley/.ansible/tmp/ansible-tmp-1414825535.28-82995727032606/wait_for
<127.0.0.1> EXEC ['/bin/sh', '-c', u'LANG=en_US.UTF-8 LC_CTYPE=en_US.UTF-8 /usr/bin/python /home/akinsley/.ansible/tmp/ansible-tmp-1414825535.28-82995727032606/wait_for']
ok: [192.168.140.119 -> 127.0.0.1] => {"changed": false, "elapsed": 21, "path": null, "port": 22, "search_regex": null, "state": "started"}

PLAY [all] ******************************************************************** 

GATHERING FACTS *************************************************************** 
<192.168.140.119> ESTABLISH CONNECTION FOR USER: root on PORT 22 TO 192.168.140.119
<192.168.140.119> REMOTE_MODULE setup
### Executing Module ###
### LATE NEEDS TMP PATH ###
### NEED TMP PATH ###
### prior to conn.shell.mkdtemp ###
### post conn.shell.mkdtemp cmd = mkdir -p $HOME/.ansible/tmp/ansible-tmp-1414825558.04-215745388264747 && echo $HOME/.ansible/tmp/ansible-tmp-1414825558.04-215745388264747 ###
### Getting channel ###


hangs here indefinitely

So the output would just hang there indefinitely. I traced this back to these two lines in the source code.

            print '\t\t### Getting channel ###'
            chan = self.ssh.get_transport().open_session()
            print '\t\t### Got channel %s ###' % chan
            self.ssh.get_transport().set_keepalive(5)
            print '\t\t### Got transport ###'

I happen to work with the engineer who submitted the original pull request adding that self.ssh.get_transport().set_keepalive(5) and realized what our problem must be. We saw this exact same issue when installing new iptables rules and applying them. In this case we have an upstart script applying our rules on boot. Recently we have been tweaking other startup scripts so I think we must have changed our startup order enough that the iptables rules are getting applied at the perfect time. I don't know I really cannot explain the issue any other way, but the window of time that could occur seems to be so tiny and I could reproduce this issue over and over again.

Regardless the fix is pretty harmless, just set the keepalive before trying to open the session. I am going to submit a pull request and really wanted to explain why the request was needed. 

Thanks,
Tony



Michael DeHaan

unread,
Nov 1, 2014, 10:52:33 AM11/1/14
to ansible...@googlegroups.com
Sounds like you may want to use the pause module to introduce a "sleep" in there?

The wait_for module is frequently not enough to wait for a reboot because SSH isn't ready (though this isn't a hang scenario, but a connection failure), for instance.



--
You received this message because you are subscribed to the Google Groups "Ansible Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ansible-proje...@googlegroups.com.
To post to this group, send email to ansible...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ansible-project/cbb9b499-7264-4750-891e-c1860c012a64%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tony Kinsley

unread,
Nov 1, 2014, 4:23:55 PM11/1/14
to ansible...@googlegroups.com
The pause module would definitely have gotten us around the issue, however I did not think of that. I submitted pull request https://github.com/ansible/ansible/pull/9462 which just sets the keep alive before trying to open the session. That fixed the issue for me.

By the way for anybody trying to reproduce the issue, we are running on ubuntu 10.04 both as the ansible host and the remote host that saw the issue. We have setup a iptables upstart job to load our iptables rules on starting network-interface or starting network-manager or starting networking. The rules set has default policies to drop packets rather than reject them. I believe the issue is if you have an ssh connection then the iptables rules are applied that the ssh connection is now half open. We have seen the same issue with normal ssh. We had to set the ServerAliveInterval=5 option to get around it. Iptables must reset the established related rule or not keep track of established connections if that rule does not exist prior to applying new rule set.
Reply all
Reply to author
Forward
0 new messages