Random timeout errors (Salt request timed out. The master is not responding)


Guillaume carpentier

Sep 11, 2017, 10:11:40 AM
to Salt-users
Hello salt users,

I've been experiencing strange timeouts when contacting minions from our master. Our setup consists of a single master and approximately 1,800 minions over WAN. Everything worked flawlessly until a few weeks ago; we had around 1,500 minions at that time. Minions are contacted on demand and are mainly used for remote execution from a custom web page.

For a working minion, a test.ping command will randomly either work, show the minion as 'not connected', or return the following error: 'Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.'

Adding the `--async` option does not change the behaviour.
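
For reference, this is the workflow the error message points to (the minion id and jid below are just placeholders); in my case the first command already fails with the same timeout:

$> salt 'my-minion' test.ping --async               # prints a jid and exits without waiting for returns
$> salt-run jobs.lookup_jid 20170911090624123456    # later, fetch the result from the master's job cache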

The debug log indicates a SaltReqTimeoutError, appearing just after the IPCClient initialization, or sometimes after the AsyncZeroMQReqChannel init:

debug.log
2017-09-11 09:06:24,571 [salt.utils.event ][DEBUG   ][9361] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
2017-09-11 09:06:24,571 [salt.utils.event ][DEBUG   ][9361] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
2017-09-11 09:06:24,665 [salt.transport.zeromq][DEBUG   ][9361] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'saltmaster_master', 'tcp://127.0.0.1:4506', 'clear')
2017-09-11 09:06:24,666 [salt.transport.ipc][DEBUG   ][9361] Initializing new IPCClient for path: /var/run/salt/master/master_event_pub.ipc
2017-09-11 09:06:27,913 [salt.utils.event ][DEBUG   ][6193] Sending event: tag = salt/auth; data = {'id': 'minion', '_stamp': '2017-09-11T07:06:27.913266', 'result': True, 'pub': '', 'act': 'pend'}
2017-09-11 09:06:29,671 [salt.transport.zeromq][DEBUG   ][9361] SaltReqTimeoutError, retrying. (1/3)
2017-09-11 09:06:34,676 [salt.transport.zeromq][DEBUG   ][9361] SaltReqTimeoutError, retrying. (2/3)
2017-09-11 09:06:39,680 [salt.transport.zeromq][DEBUG   ][9361] SaltReqTimeoutError, retrying. (3/3)
2017-09-11 09:06:44,685 [salt.utils.lazy  ][DEBUG   ][9361] LazyLoaded nested.output

Another debug trace, from the console:
[DEBUG   ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG   ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG   ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'saltmaster_master', 'tcp://127.0.0.1:4506', 'clear')
[DEBUG   ] SaltReqTimeoutError, retrying. (1/3)
[DEBUG   ] SaltReqTimeoutError, retrying. (2/3)
[DEBUG   ] SaltReqTimeoutError, retrying. (3/3)
[ERROR   ] An un-handled exception was caught by salt's global exception handler:
SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
Traceback (most recent call last):
  File "/usr/bin/salt", line 10, in <module>
    salt_main()
  File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 476, in salt_main
    client.run()
  File "/usr/lib/python2.7/dist-packages/salt/cli/salt.py", line 129, in run
    jid = self.local_client.cmd_async(**kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 452, in cmd_async
    **kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 344, in run_job
    raise SaltClientError(general_exception)
SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
Traceback (most recent call last):
  File "/usr/bin/salt", line 10, in <module>
    salt_main()
  File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 476, in salt_main
    client.run()
  File "/usr/lib/python2.7/dist-packages/salt/cli/salt.py", line 129, in run
    jid = self.local_client.cmd_async(**kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 452, in cmd_async
    **kwargs)
  File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 344, in run_job
    raise SaltClientError(general_exception)
salt.exceptions.SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

I first thought the master was under heavy load, but it is not. Still, I changed the following settings on the master, without success:

worker_threads: 20
max_open_files: 65536
zmq_backlog: 1000
pub_hwm: 1000
salt_event_pub_hwm: 32000
event_publisher_pub_hwm: 16000

and even
zmq_filtering: True
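
For completeness, this is how those settings sit in /etc/salt/master, followed by a restart; the comments are only my understanding of what each knob tunes:

# /etc/salt/master (excerpt)
worker_threads: 20              # MWorker processes answering requests on the ReqServer (port 4506)
max_open_files: 65536           # file-descriptor limit for the master process
zmq_backlog: 1000               # ZeroMQ listen backlog on the request server
pub_hwm: 1000                   # high-water mark on the publisher socket
salt_event_pub_hwm: 32000       # high-water mark for the master event bus PUB socket
event_publisher_pub_hwm: 16000  # high-water mark for the event publisher
zmq_filtering: True             # publish jobs only to the targeted minions

$> systemctl restart salt-master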

I also upgraded from Debian jessie to stretch, as some ZMQ packages were more recent in the stretch repositories.

The master runs with 16 GB of RAM and 4 vCPUs. WAN bandwidth is 100 Mbit/s symmetric.

$> salt --versions-report

Salt Version:
           Salt: 2017.7.1
 
Dependency Versions:
           cffi: 1.9.1
       cherrypy: unknown
       dateutil: 2.5.3
      docker-py: Not Installed
          gitdb: 2.0.0
      gitpython: 2.1.1
          ioflo: Not Installed
         Jinja2: 2.9.4
        libgit2: Not Installed
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: 1.3.7
      pycparser: 2.17
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 2.7.13 (default, Jan 19 2017, 14:48:08)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: 2.0.1
        timelib: Not Installed
        Tornado: 4.4.3
            ZMQ: 4.2.1
 
System Versions:
           dist: debian 9.1
         locale: UTF-8
        machine: x86_64
        release: 4.9.0-3-amd64
         system: Linux
        version: debian 9.1


Has anyone experienced this kind of behaviour? The error message suggests the event bus is congested; is there a way to monitor that?
I'm quite stuck right now, so any help would be greatly appreciated.
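
The closest thing I can find to watching the bus is streaming events on the master with the state.event runner, though that only shows traffic going by, not an actual congestion counter:

$> salt-run state.event pretty=True     # tail every event crossing the master event bus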

Best regards,

Guillaume

rahav jv

Feb 26, 2018, 8:01:50 PM
to Salt-users
Hi,

I hit the same error. Were you able to resolve this issue?

Thanks,
Rahav

T.J. Yang

Dec 13, 2019, 8:52:59 AM
to Salt-users
Thanks to Guillaume carpentier for the performance tuning on ZeroMQ.