I've been having issues running Ansible against AIX, specifically with the copy/template modules.
Periodically, copy/template plays will hang, either for a long time (hours; if I leave it overnight it might be completed the next day) or indefinitely. After reviewing debug output for a number of these instances, the hang appears to occur in the sh.py code under runner, in the 'checksum' function. Below is an example of the debug output at the point where the copy/template module hangs:
<aix14.mgmt.loc> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/home/ansible/.ansible/cp/ansible-ssh-%h-%p-%r" -o Port=22 -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 aix14.mgmt.loc /bin/sh -c 'sudo -k && sudo -H -S -p "[sudo via ansible, key=sfknwylttinwgjiawaunhugtrjbqdymg] password: " -u root /bin/sh -c '"'"'echo SUDO-SUCCESS-sfknwylttinwgjiawaunhugtrjbqdymg; rc=flag; [ -r "/etc/ntp.conf" ] || rc=2; [ -f "/etc/ntp.conf" ] || rc=1; [ -d "/etc/ntp.conf" ] && rc=3; python -V 2>/dev/null || rc=4; [ x"$rc" != "xflag" ] && echo "${rc} /etc/ntp.conf" && exit 0; (python -c '"'"'"'"'"'"'"'"'import hashlib; print(hashlib.sha1(open("/etc/ntp.conf", "rb").read()).hexdigest())'"'"'"'"'"'"'"'"' 2>/dev/null) || (python -c '"'"'"'"'"'"'"'"'import sha; print(sha.sha(open("/etc/ntp.conf", "rb").read()).hexdigest())'"'"'"'"'"'"'"'"' 2>/dev/null) || (echo "0 /etc/ntp.conf")'"'"''
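For anyone who doesn't want to untangle the quoting, here is a rough Python sketch of what that remote one-liner appears to compute. The function name `checksum_like_remote` is made up for illustration; this is my reading of the shell logic in the debug output, not the actual sh.py code, and it leaves out the `python -V` check (rc=4), the `sha` module fallback, and the final `echo "0 <path>"` fallback.

```python
import hashlib
import os


def checksum_like_remote(path):
    """Mimic the shell checks from the debug output (hypothetical helper)."""
    rc = "flag"
    # Same order as the shell tests, so later checks override earlier ones.
    if not os.access(path, os.R_OK):   # [ -r path ] || rc=2
        rc = "2"
    if not os.path.isfile(path):       # [ -f path ] || rc=1
        rc = "1"
    if os.path.isdir(path):            # [ -d path ] && rc=3
        rc = "3"
    if rc != "flag":
        # The shell version echoes "<rc> <path>" and exits 0 here.
        return "%s %s" % (rc, path)
    # Otherwise it prints the SHA-1 hex digest of the file contents.
    with open(path, "rb") as fh:
        return hashlib.sha1(fh.read()).hexdigest()
```

Running this by hand against /etc/ntp.conf on the AIX host might help tell whether the hang is in the hashing itself or in the SSH/sudo session around it.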
This happens during random copy/template plays, not necessarily for the same file as in the example above. The issue is reproducible, but not consistently; 1 in 5 runs or more may have the issue. It appears that the file actually copies over successfully, and then the session hangs. If I run "who -u" on the AIX host and then "kill <pid>" on the pid of the SSH session, the playbook will continue on. I can confirm this happens using SFTP, and with "scp_if_ssh = True". It also happens with "pipelining = True" configured.
After digging about on the interwebs, I have found a handful of references to issues with the version of python included by IBM as part of the Linux-for-AIX toolbox. The version we're using is from http://www.perzl.org/aix/, which doesn't suffer the same issues (see https://github.com/ansible/ansible-modules-core/issues/80). I tried substituting 'hashlib.sha1' with 'hashlib._md5', and was able to reproduce the same hanging issue. Following some references online from other folks using Ansible to manage AIX, I've symlink'd /bin/md5sum to /bin/csum; this also did not fix our issues. I can also periodically reproduce the issue when running a single ad-hoc ansible command using the copy module.
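Since the hang is intermittent, a small harness that repeats the ad-hoc command with a timeout may help put a number on it. This is only a sketch: `count_hangs` is a hypothetical helper, it assumes Python 3.5+ on the control machine (for `subprocess.run`), and the src/dest paths in the commented example are placeholders rather than what I actually ran.

```python
import subprocess


def count_hangs(cmd, runs=5, timeout=300):
    """Run cmd `runs` times; return how many runs exceeded `timeout` seconds."""
    hangs = 0
    for _ in range(runs):
        try:
            # Output is discarded so nothing blocks on a pipe.
            subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this.
            hangs += 1
    return hangs


# Example invocation (placeholder paths, adjust to taste):
# cmd = ["ansible", "aix14.mgmt.loc", "-m", "copy",
#        "-a", "src=/tmp/ntp.conf dest=/tmp/ntp.conf"]
# print(count_hangs(cmd, runs=20, timeout=120))
```

Note that a timed-out run only kills the local ansible process; the session on the AIX side may still need the "kill <pid>" treatment described above.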
Below is truss output from an AIX box where this issue occurs; this is a truss against the ssh process of the user connected in from Ansible. I'm by no means an expert at debugging truss output; however, it appears that /bin/sh is called, then the process forks off a subprocess, receives a SIGCHLD for it right away, and then hangs with "close(8) (sleeping...)". This is where it will hang for a looooonnnnggg time. The PID that gets forked off (24379542 in the example below) ends up in a '<defunct>' state.
kwrite(4, "\0\00304 / b i n / s h ".., 776) = 776
kfcntl(7, F_DUPFD, 0x00000000) = 9
kfcntl(7, F_DUPFD, 0x00000000) = 10
sigprocmask(0, 0xF02B4970, 0xF02B4978) = 0
kfork() = 24379542
thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0xD052A400, 0x00000000, 0x11960029, 0x00000000) = 0x00000000
Received signal #20, SIGCHLD [caught]
sigprocmask(2, 0xF02B4970, 0x2FF21E80) = 0
_sigaction(20, 0x00000000, 0x2FF21F30) = 0
thread_setmymask_fast(0x00080000, 0x00000000, 0x00000000, 0x11960029, 0x00000003, 0x00000000, 0x00000000) = 0x00000000
kwrite(6, "\0", 1) = 1
ksetcontext_sigreturn(0x2FF21FE0, 0x2FF22FF8, 0x2002D0D0, 0x0000D032, 0x00000003, 0x00000000, 0x00000000)
close(8) (sleeping...)
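To compare several truss captures without reading each one by hand, something like the following could pull out the three things that seem to matter in the trace above: the forked PID, the signals received, and any call left sleeping. `summarize_truss` is a hypothetical helper, and the regexes are only fitted to the output format shown in this report, so treat it as a sketch.

```python
import re


def summarize_truss(text):
    """Extract forked PIDs, received signals and sleeping calls from truss text."""
    # e.g. "kfork() = 24379542"
    forked = [int(pid) for pid in re.findall(r"kfork\(\)\s*=\s*(\d+)", text)]
    # e.g. "Received signal #20, SIGCHLD [caught]"
    signals = re.findall(r"Received signal #\d+, (\w+)", text)
    # e.g. "close(8) (sleeping...)"
    sleeping = re.findall(
        r"^\s*(\w+\([^)]*\))\s+\(sleeping\.\.\.\)", text, flags=re.M
    )
    return {"forked": forked, "signals": signals, "sleeping": sleeping}
```

If every hung session shows the same pattern (a kfork, a SIGCHLD, then a sleeping close), that would at least narrow down where to look.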
In the interest of disclosing all information, I also notice weird behavior with the 'w' command when trying to determine if Ansible has an SSH session open on a host where a playbook is hanging. The 'w' command will hang for a few seconds when it reaches the user that is logged in and running the Ansible playbook. When I run a truss against the 'w' command, I get the output below. The command tries to open the user's pts to get its status, then receives a SIGALRM, which apparently means the system call is taking too long to respond:
kopen("/dev/pts/4", O_RDONLY|O_NONBLOCK) (sleeping...)
kopen("/dev/pts/4", O_RDONLY|O_NONBLOCK) Err#4 EINTR
Received signal #14, SIGALRM [caught]
_sigaction(14, 0x0FFFFFFFFFFFEEB0, 0x0FFFFFFFFFFFEEE0) = 0
ksetcontext_sigreturn(0x0FFFFFFFFFFFF000, 0x0000000000000000, 0x0FFFFFFFFFFFFFE8, 0x800000000000D032, 0x3FFC000000000003, 0x00000000000000E8, 0x0000000000000000, 0x0000000000000000)
statx("/dev/pts/4", 0x0FFFFFFFFFFFF618, 176, 0) = 0
incinterval(0, 0x0FFFFFFFFFFFF4F8, 0x0FFFFFFFFFFFF518) = 0
statx("/dev/pts", 0x0FFFFFFFFFFFF618, 176, 0) = 0
statx("/dev/pts/4", 0x0FFFFFFFFFFFF638, 176, 0) = 0
ansible pts/4 03:15PM 36 0 0 -
kwrite(1, " a n s i b l e p t s".., 62) = 62
kread(3, "\0\0\0\0\0\0\0\0\0\0\0\0".., 4096) = 1136
_sigaction(14, 0x0FFFFFFFFFFFF4F0, 0x0FFFFFFFFFFFF520) = 0
incinterval(0, 0x0FFFFFFFFFFFF4F8, 0x0FFFFFFFFFFFF518) = 0
My environment is as follows:
Ubuntu 12.04
Ansible 1.8.2 (installed from the Ansible PPA)
AIX 7.1 (have reproduced for sure on TL2SP4, and TL1SP0)
python 2.7.5