Timeout Before Authentication - SSH

Shrihari M

Oct 10, 2023, 6:04:33 AM
to AWX Project
Hey all,

I am facing a timeout issue while trying to run a job template. This is our current setup:

AWX Version - 22.5.0 (AWX is running on OKD and is deployed using AWX Operator)

OKD Version - 4.11.0-0.okd-2022-12-02-145640 (Update Channel: Stable-4)

OpenSSH version on bastion host:
openssh-server-7.4p1-23.el7_9.x86_64
openssh-7.4p1-23.el7_9.x86_64
openssh-clients-7.4p1-23.el7_9.x86_64

OpenSSH version on remote server:
openssh-8.7p1-30.el9_2.x86_64
openssh-clients-8.7p1-30.el9_2.x86_64
openssh-server-8.7p1-30.el9_2.x86_64

The traffic flow is as follows:
AWX on OKD -> Bastion Host/Jumpbox -> Remote Server

Problem Statement:

When I run a job template, the first few tasks succeed, but then the server becomes unreachable and "Timeout Before Authentication" appears in the SSH logs on the remote server. Here's an example:
--------------------------------------------------------------------------------------------------------------------------------------------
Identity added: /runner/artifacts/25/ssh_key_data (AWX)
Certificate added: /runner/artifacts/25/ssh_key_data-cert.pub (CA:sshca_2020_2 USER:awx VALID:1696849513-1696936093)
SSH password:
[WARNING]: Invalid characters were found in group names but not replaced, use
-vvvv to see details

PLAY [Setting up hosts] ********************************************************

TASK [Gathering Facts] *********************************************************
ok: [SERVER1]

TASK [hosts : create hosts] ****************************************************
ok: [SERVER1]

PLAY [Setting up resolv.conf] **************************************************

TASK [resolv : Configure resolv.conf] ******************************************
ok: [SERVER1]

PLAY [Setting up chronyd/ntp & timezone] ***************************************

TASK [chrony : Ensure that the chrony package is installed] ********************
ok: [SERVER1]

TASK [chrony : Attempting to overlay chrony configurations] ********************
ok: [SERVER1] => (item=chrony.conf)
failed: [SERVER1] (item=chronyd) => {"ansible_loop_var": "item", "item": {"dst": "/etc/sysconfig/chronyd", "mode": 420, "src": "chronyd.sysconfig.j2"}, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\\r\\nConnection closed by UNKNOWN port 65535", "unreachable": true}
fatal: [SERVER1]: UNREACHABLE! => {"changed": false, "msg": "All items completed", "results": [{"ansible_loop_var": "item", "changed": false, "checksum": "6f9d06e122ab7a370d9baa26c923ecc850718b49", "dest": "/etc/chrony.conf", "diff": {"after": {"path": "/etc/chrony.conf"}, "before": {"path": "/etc/chrony.conf"}}, "failed": false, "gid": 0, "group": "root", "invocation": {"module_args": {"_diff_peek": null, "_original_basename": "chrony.conf.j2", "access_time": null, "access_time_format": "%Y%m%d%H%M.%S", "attributes": null, "dest": "/etc/chrony.conf", "follow": true, "force": false, "group": "root", "mode": "420", "modification_time": null, "modification_time_format": "%Y%m%d%H%M.%S", "owner": "root", "path": "/etc/chrony.conf", "recurse": false, "selevel": null, "serole": null, "setype": null, "seuser": null, "src": null, "state": "file", "unsafe_writes": false}}, "item": {"dst": "/etc/chrony.conf", "mode": 420, "src": "chrony.conf.j2"}, "mode": "0420", "owner": "root", "path": "/etc/chrony.conf", "size": 186, "state": "file", "uid": 0}, {"ansible_loop_var": "item", "item": {"dst": "/etc/sysconfig/chronyd", "mode": 420, "src": "chronyd.sysconfig.j2"}, "msg": "Failed to connect to the host via ssh: kex_exchange_identification: Connection closed by remote host\\r\\nConnection closed by UNKNOWN port 65535", "unreachable": true}]}

PLAY RECAP *********************************************************************
SERVER1 : ok=4    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0   

----------------------------------------------------------------------------------------------------------------------------------------------
As the output above shows, the first few tasks ran successfully, but a later task failed with an unreachable error. I have tried different playbooks as well, and the same problem persists.

Output of /var/log/secure:
----------------------------------------------------------------------------------------------------------------------------------------------
[Attached image: Screenshot 2023-10-10 at 3.15.17 PM.png]
----------------------------------------------------------------------------------------------------------------------------------------------

What I have tried so far:
  1. Added the following Ansible variables and settings:
    • ansible_ssh_args: '-o ControlMaster=auto -o ControlPersist=600s -o ConnectTimeout=600s -o ProxyCommand="ssh -o ConnectTimeout=600s -o StrictHostKeyChecking=no -W %h:%p -l awx BASTION_HOST_NAME"'
    • ansible_ssh_timeout: 120
    • ansible_command_timeout: 120
    • ansible_timeout: 120
    • AWX_TASK_ENV['ANSIBLE_TIMEOUT'] = '120' in /etc/tower/settings.py
  2. The playbook runs absolutely fine when I run it with the ansible-playbook command directly on the bastion host
  3. I have tried various combinations of the above variables but still get the same issue. I even set the values as high as 1200!
  4. I have attached the output of the FAILED template in high verbosity (failed_job_high_verbosity.txt)
  5. The IPs are whitelisted on all firewalls
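
The ansible_ssh_args value in item 1 nests one ssh invocation inside another's ProxyCommand, so the quoting is easy to get wrong. Below is a minimal sketch that assembles the same string and shows how it splits into tokens. The helper name build_ssh_args and its parameters are hypothetical and exist only for illustration; they are not part of Ansible or AWX. shlex.split applies POSIX shell-style quoting rules, which is assumed here to approximate how the argument string gets tokenized.

```python
import shlex


def build_ssh_args(bastion_host, bastion_user="awx",
                   connect_timeout="600s", control_persist="600s"):
    """Assemble an ansible_ssh_args string for a single bastion hop.

    Hypothetical helper for illustration only; not an Ansible or AWX API.
    """
    # Inner ssh: opens a stdio tunnel (-W) through the bastion to the target.
    # %h and %p are expanded by the outer ssh to the target host and port.
    proxy_command = (
        f"ssh -o ConnectTimeout={connect_timeout} "
        f"-o StrictHostKeyChecking=no -W %h:%p -l {bastion_user} {bastion_host}"
    )
    # Outer ssh options: connection multiplexing plus the proxy hop.
    return (
        f"-o ControlMaster=auto -o ControlPersist={control_persist} "
        f"-o ConnectTimeout={connect_timeout} "
        f'-o ProxyCommand="{proxy_command}"'
    )


args = build_ssh_args("BASTION_HOST_NAME")
print(args)

# Show each token after POSIX-style splitting; the whole ProxyCommand
# should come out as a single token with the inner quotes removed.
for token in shlex.split(args):
    print(repr(token))
```

With the defaults, the assembled string matches the ansible_ssh_args value listed in item 1, and the ProxyCommand option stays together as one token.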
Any help would be highly appreciated. Please let me know if anything else is needed from my side.

Thanks,
Shrihari
failed_job_high_verbosity.txt