Hello salt users,
I've been experiencing strange timeouts when contacting minions from our master. Our setup consists of a single master and approximatively 1800 minions over wan. Everything worked flawlessly until a few weeks ago, minions were around 1500 at that time. Minions are contacted 'on demand' and are mainly used for remote execution from a custom web page.
For a working minion, a test.ping command will randomly work, show the minion as 'not connected' or return the following error : 'Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.'
Adding async option does not change the behaviour.
The debug log indicates a SaltReqTimeoutError, appearing just after the IPCClient initialization, or sometimes after the AsyncZeroMQReqChannel init:
debug.log
2017-09-11 09:06:24,571 [salt.utils.event ][DEBUG ][9361] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
2017-09-11 09:06:24,571 [salt.utils.event ][DEBUG ][9361] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
2017-09-11 09:06:24,665 [salt.transport.zeromq][DEBUG ][9361] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'saltmaster_master', 'tcp://
127.0.0.1:4506', 'clear')
2017-09-11 09:06:24,666 [salt.transport.ipc][DEBUG ][9361] Initializing new IPCClient for path: /var/run/salt/master/master_event_pub.ipc
2017-09-11 09:06:27,913 [salt.utils.event ][DEBUG ][6193] Sending event: tag = salt/auth; data = {'id': 'minion', '_stamp': '2017-09-11T07:06:27.913266', 'result': True, 'pub': '', 'act': 'pend'}
2017-09-11 09:06:29,671 [salt.transport.zeromq][DEBUG ][9361] SaltReqTimeoutError, retrying. (1/3)
2017-09-11 09:06:34,676 [salt.transport.zeromq][DEBUG ][9361] SaltReqTimeoutError, retrying. (2/3)
2017-09-11 09:06:39,680 [salt.transport.zeromq][DEBUG ][9361] SaltReqTimeoutError, retrying. (3/3)
2017-09-11 09:06:44,685 [salt.utils.lazy ][DEBUG ][9361] LazyLoaded nested.output
another debug from the console
[DEBUG ] MasterEvent PUB socket URI: /var/run/salt/master/master_event_pub.ipc
[DEBUG ] MasterEvent PULL socket URI: /var/run/salt/master/master_event_pull.ipc
[DEBUG ] Initializing new AsyncZeroMQReqChannel for ('/etc/salt/pki/master', 'saltmaster_master', 'tcp://
127.0.0.1:4506', 'clear')
[DEBUG ] SaltReqTimeoutError, retrying. (1/3)
[DEBUG ] SaltReqTimeoutError, retrying. (2/3)
[DEBUG ] SaltReqTimeoutError, retrying. (3/3)
[ERROR ] An un-handled exception was caught by salt's global exception handler:
SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
Traceback (most recent call last):
File "/usr/bin/salt", line 10, in <module>
salt_main()
File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 476, in salt_main
client.run()
File "/usr/lib/python2.7/dist-packages/salt/cli/salt.py", line 129, in run
jid = self.local_client.cmd_async(**kwargs)
File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 452, in cmd_async
**kwargs)
File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 344, in run_job
raise SaltClientError(general_exception)
SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
Traceback (most recent call last):
File "/usr/bin/salt", line 10, in <module>
salt_main()
File "/usr/lib/python2.7/dist-packages/salt/scripts.py", line 476, in salt_main
client.run()
File "/usr/lib/python2.7/dist-packages/salt/cli/salt.py", line 129, in run
jid = self.local_client.cmd_async(**kwargs)
File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 452, in cmd_async
**kwargs)
File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 344, in run_job
raise SaltClientError(general_exception)
salt.exceptions.SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
I first thought the master was under heavy load, but not at all. Still, I changed the following settings according on the master, without success :
worker_threads: 20
max_open_files: 65536
zmq_backlog: 1000
pub_hwm: 1000
salt_event_pub_hwm: 32000
event_publisher_pub_hwm: 16000
and even
zmq_filtering: True
I also upgraded from Debian jessie to stretch as some zmq packages were more recent on the stretch repository.
The master runs with 16G ram, 4 vcpus. wan bandwith is 100Mbits symmetric
$> salt --versions-report
Salt Version:
Salt: 2017.7.1
Dependency Versions:
cffi: 1.9.1
cherrypy: unknown
dateutil: 2.5.3
docker-py: Not Installed
gitdb: 2.0.0
gitpython: 2.1.1
ioflo: Not Installed
Jinja2: 2.9.4
libgit2: Not Installed
libnacl: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack-pure: Not Installed
msgpack-python: 0.4.8
mysql-python: 1.3.7
pycparser: 2.17
pycrypto: 2.6.1
pycryptodome: Not Installed
pygit2: Not Installed
Python: 2.7.13 (default, Jan 19 2017, 14:48:08)
python-gnupg: Not Installed
PyYAML: 3.12
PyZMQ: 16.0.2
RAET: Not Installed
smmap: 2.0.1
timelib: Not Installed
Tornado: 4.4.3
ZMQ: 4.2.1
System Versions:
dist: debian 9.1
locale: UTF-8
machine: x86_64
release: 4.9.0-3-amd64
system: Linux
version: debian 9.1
Has anyone experienced this kind of behaviour ? The error message is telling the event bus is congested, is there a way to monitor that ?
I'm quite stucked right now so any help would be greatly appreciated.
Best regards,
Guillaume