Hey all,
I am facing a timeout issue while trying to run a job template. This is our current setup:
AWX Version - 22.5.0 (AWX is running on OKD and is deployed using AWX Operator)
OKD Version - 4.11.0-0.okd-2022-12-02-145640 (Update Channel: Stable-4)
OpenSSH version on bastion host:
openssh-server-7.4p1-23.el7_9.x86_64
openssh-7.4p1-23.el7_9.x86_64
openssh-clients-7.4p1-23.el7_9.x86_64
OpenSSH version on remote server:
openssh-8.7p1-30.el9_2.x86_64
openssh-clients-8.7p1-30.el9_2.x86_64
openssh-server-8.7p1-30.el9_2.x86_64
The traffic flow is as follows:
AWX on OKD -> Bastion Host/Jumpbox -> Remote Server
Problem Statement:
When I run a template, the first few tasks succeed. After that, the server becomes unreachable and I see "Timeout Before Authentication" in the SSH logs on the remote server. Here's an example:
--------------------------------------------------------------------------------------------------------------------------------------------
Identity added: /runner/artifacts/25/ssh_key_data (AWX)
Certificate added: /runner/artifacts/25/ssh_key_data-cert.pub (CA:sshca_2020_2 USER:awx VALID:1696849513-1696936093)
SSH password:
[WARNING]: Invalid characters were found in group names but not replaced, use
-vvvv to see details
PLAY [Setting up hosts] ********************************************************
TASK [Gathering Facts] *********************************************************
ok: [SERVER1]
TASK [hosts : create hosts] ****************************************************
ok: [SERVER1]
PLAY [Setting up resolv.conf] **************************************************
TASK [resolv : Configure resolv.conf] ******************************************
ok: [SERVER1]
PLAY [Setting up chronyd/ntp & timezone] ***************************************
TASK [chrony : Ensure that the chrony package is installed] ********************
ok: [SERVER1]
TASK [chrony : Attempting to overlay chrony configurations] ********************
ok: [SERVER1] => (item=chrony.conf)
failed: [SERVER1] (item=chronyd) => {"ansible_loop_var": "item", "item": {"dst": "/etc/sysconfig/chronyd", "mode": 420, "src": "chronyd.sysconfig.j2"}, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\\r\\nConnection closed by UNKNOWN port 65535", "unreachable": true}
fatal: [SERVER1]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_loop_var": "item", "changed": false, "checksum": "6f9d06e122ab7a370d9baa26c923ecc850718b49", "dest": "/etc/chrony.conf", "diff": {"after": {"path": "/etc/chrony.conf"}, "before": {"path": "/etc/chrony.conf"}}, "failed": false, "gid": 0, "group": "root", "invocation": {"module_args": {"_diff_peek": null, "_original_basename": "chrony.conf.j2", "access_time": null, "access_time_format": "%Y%m%d%H%M.%S", "attributes": null, "dest": "/etc/chrony.conf", "follow": true, "force": false, "group": "root", "mode": "420", "modification_time": null, "modification_time_format": "%Y%m%d%H%M.%S", "owner": "root", "path": "/etc/chrony.conf", "recurse": false, "selevel": null, "serole": null, "setype": null, "seuser": null, "src": null, "state": "file", "unsafe_writes": false}}, "item": {"dst": "/etc/chrony.conf", "mode": 420, "src": "chrony.conf.j2"}, "mode": "0420", "owner": "root", "path": "/etc/chrony.conf", "size": 186, "state": "file", "uid": 0}, {"ansible_loop_var": "item", "item": {"dst": "/etc/sysconfig/chronyd", "mode": 420, "src": "chronyd.sysconfig.j2"}, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\\r\\nConnection closed by UNKNOWN port 65535", "unreachable": true}]}
PLAY RECAP *********************************************************************
SERVER1 : ok=4 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
----------------------------------------------------------------------------------------------------------------------------------------------
As you can see in the output above, the first few tasks ran successfully, but the next task failed with the host reported as unreachable. I have tried different playbooks as well, and the same problem persists.
Output of the /var/log/secure:
----------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------
What I have tried so far:
- Added the following Ansible variables:
  - ansible_ssh_args: '-o ControlMaster=auto -o ControlPersist=600s -o ConnectTimeout=600s -o ProxyCommand="ssh -o ConnectTimeout=600s -o StrictHostKeyChecking=no -W %h:%p -l awx BASTION_HOST_NAME"'
  - ansible_ssh_timeout: 120
  - ansible_command_timeout: 120
  - ansible_timeout: 120
- Added AWX_TASK_ENV['ANSIBLE_TIMEOUT'] = '120' to /etc/tower/settings.py
- The playbook runs fine when I run it with the ansible-playbook command directly on the bastion host
- I have tried various combinations of the above variables but still get the same issue. I even set the values as high as 1200!
- I have attached the output of the FAILED template in high verbosity (failed_job_high_verbosity.txt)
- The IPs are whitelisted on all firewalls
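For reference, here is a rough sketch of how the variables listed above look when written out as YAML inventory variables. Where they are stored (host, group, or inventory-level variables in AWX) is an assumption for illustration only, and BASTION_HOST_NAME is a placeholder rather than a real hostname:

```yaml
# Sketch only: the same variables from the list above, written as inventory variables.
# Placement (host vs. group vs. inventory level) is assumed, not confirmed.
# BASTION_HOST_NAME is a placeholder for the jumpbox hostname.
ansible_ssh_args: '-o ControlMaster=auto -o ControlPersist=600s -o ConnectTimeout=600s -o ProxyCommand="ssh -o ConnectTimeout=600s -o StrictHostKeyChecking=no -W %h:%p -l awx BASTION_HOST_NAME"'
ansible_ssh_timeout: 120
ansible_command_timeout: 120
ansible_timeout: 120
```

With this layout, every SSH connection from the AWX execution environment to the remote server is tunnelled through the bastion host via the ProxyCommand, which matches the traffic flow described earlier.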
Any help would be highly appreciated. Please let me know if anything else is needed from my side.
Thanks,
Shrihari