salt.master.FileserverUpdate died with exit status 0 when working on 1000 VMs


Aprameya NDS

Dec 1, 2022, 7:29:16 PM
to Salt-users
Hi Team,

We are executing scripts by copying them onto the minions and then running them through a custom module.
We have close to 1000 VMs, and when the commands below are executed, the run exits with the following traceback:

    salt_handle.cmd_async(f"{source}", 'cp.get_file', [src, trg])
    salt_handle.cmd_async(f"{source}", 'ping_minion.clean_ping')

The first command copies the file; the second calls the custom module present on the minions to execute the copied script.
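
For context, the full dispatch loop looks roughly like this (a sketch; the host list and the src/trg paths shown here are placeholders for values that come from our inventory):

```
import salt.client

salt_handle = salt.client.LocalClient()

# Placeholders for illustration; the real values come from our inventory.
hosts = ['bb12-5gfi-d-c14-b02.bb11t1.local']   # ~1000 minion ids
src = 'salt://scratchpad/clean.sh'             # script on the master fileserver
trg = '/root/clean.sh'                         # destination path on the minion

for source in hosts:
    # Copy the cleanup script to the minion...
    salt_handle.cmd_async(f"{source}", 'cp.get_file', [src, trg])
    # ...then have our custom module execute it there.
    salt_handle.cmd_async(f"{source}", 'ping_minion.clean_ping')
```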


On the master side I see the following messages:
===================================
Dec  1 16:20:07 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec  1 16:20:07 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#41973: Invalid argument
Dec  1 16:20:07 bb11t1-mgmt named[1689]: client @0x7f9cb402dc80 172.16.0.5#41973 (bb12-5gfi-d-c14-b02.bb11t1.local): error sending response: invalid file
Dec  1 16:20:10 bb11t1-mgmt salt-master[509142]: [DEBUG   ] Process <class 'salt.master.FileserverUpdate'> (1230778) died with exit status 0, restarting...


SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

During handling of the above exception, another exception occurred:

SaltClientError                           Traceback (most recent call last)



salt --versions-report
Salt Version:
          Salt: 3005.1
 
Dependency Versions:
          cffi: 1.14.5
      cherrypy: Not Installed
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.11.3
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.4
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.20
      pycrypto: 3.9.8
  pycryptodome: 3.16.0
        pygit2: Not Installed
        Python: 3.7.7 (default, Feb 16 2021, 12:37:08)
  python-gnupg: Not Installed
        PyYAML: 5.3.1
         PyZMQ: 20.0.0
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.3
 
System Versions:
          dist: centos 8 Core
        locale: UTF-8
       machine: x86_64
       release: 4.18.0-193.el8.x86_64
        system: Linux
       version: CentOS Linux 8 Core


Can anyone please let me know how to proceed further? This is a scale setup that we have here.

Regards
Sai

Twangboy

Dec 2, 2022, 11:46:30 AM
to Salt-users
Salt has a system in place for using custom modules; see the documentation on writing execution modules.
Basically, you put your custom modules in the file_roots on the master. These can be execution modules, states, grains, etc. The directory names are prefixed with an underscore. For example, the default file_roots location on a master is `/srv/salt`, so you would create the following directories for the custom modules you want to sync:

- /srv/salt/_modules
- /srv/salt/_states
- /srv/salt/_grains

Place execution modules in the `_modules` directory, state modules in the `_states` directory, and so forth. Then run a sync:

```
# sync everything
salt '*' saltutil.sync_all

# sync only execution modules
salt '*' saltutil.sync_modules
```

The sync functions are documented in the saltutil execution module.

After they are synced, they should behave like any Salt module.
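
For example, assuming your module file is `_modules/ping_minion.py` exposing a `clean_ping` function (as your snippet suggests), it would then be callable as:

```
salt '*' ping_minion.clean_ping
```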

Aprameya NDS

Dec 2, 2022, 1:22:35 PM
to Salt-users
Hi,
I don't think you read or understood the problem here.

We have tested with 500 VMs and it works. Our custom module itself is working fine.

Please check the traces we get when we run against 1000 VMs:



://scratchpad/clean_BB12-5GFI-C-C10-B04.sh', '/root/clean_BB12-5GFI-C-C10-B04.sh'], '_stamp': '2022-12-02T00:36:42.131010'}
Dec  1 16:36:42 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec  1 16:36:42 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#65406: Invalid argument
Dec  1 16:36:42 bb11t1-mgmt named[1689]: client @0x7f9ce0033070 172.16.0.5#65406 (bb12-5gfi-d-c15-b06.bb11t1.local): error sending response: invalid file
Dec  1 16:36:42 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec  1 16:36:42 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#25046: Invalid argument
Dec  1 16:36:42 bb11t1-mgmt named[1689]: client @0x7f9ce0033070 172.16.0.5#25046 (bb12-5gfi-d-c15-b08.bb11t1.local): error sending response: invalid file
Dec  1 16:36:42 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec  1 16:36:42 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#12796: Invalid argument
Dec  1 16:36:42 bb11t1-mgmt named[1689]: client @0x7f9ce0033070 172.16.0.5#12796 (bb12-5gfi-d-c15-b01.bb11t1.local): error sending response: invalid file
Dec  1 16:36:42 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec  1 16:36:42 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#33238: Invalid argument
Dec  1 16:36:42 bb11t1-mgmt named[1689]: client @0x7f9cc00323e0 172.16.0.5#33238 (bb12-5gfi-d-c15-b02.bb11t1.local): error sending response: invalid file
Dec  1 16:36:42 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec  1 16:36:42 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#63891: Invalid argument
Dec  1 16:36:42 bb11t1-mgmt named[1689]: client @0x7f9ce802c6e0 172.16.0.5#63891 (bb12-5gfi-d-c14-b05


When we run against hosts individually it works fine. If the modules were not placed correctly, it would never work, right?

Regards
Sai

Aprameya NDS

Dec 5, 2022, 3:23:30 PM
to Salt-users
Hi,

Here are some observations:

I tried with 800 hosts and it works perfectly fine, and the time taken seems impressive:

(virtualenv3.7.7) [root@bb11t1-mgmt traffic_tool]# python clean.py


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 947/947 [00:19<00:00, 49.46it/s]

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 947/947 [00:18<00:00, 52.51it/s]

(virtualenv3.7.7) [root@bb11t1-mgmt traffic_tool]#



But when the host count is increased by another ~50 hosts, totaling 995, we start getting problems and there is a traceback dump:



(virtualenv3.7.7) [root@bb11t1-mgmt traffic_tool]# python clean.py

 32%|████████████████████████████████████▏                                                                            | 319/995 [00:45<01:35,  7.04it/s]
Traceback (most recent call last):
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 1901, in pub
    payload = channel.send(payload_kwargs, timeout=timeout)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/utils/asynchronous.py", line 131, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
    return future_cell[0].result()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/channel/client.py", line 292, in send
    ret = yield self._uncrypted_transfer(load, timeout=timeout)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/channel/client.py", line 269, in _uncrypted_transfer
    timeout=timeout,
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/transport/zeromq.py", line 914, in send
    ret = yield self.message_client.send(load, timeout=timeout)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/transport/zeromq.py", line 624, in send
    recv = yield future
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 396, in run_job
    **kwargs
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 1905, in pub

    "Salt request timed out. The master is not responding. You "
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "clean.py", line 59, in <module>
    run()
  File "clean.py", line 14, in run
    salt_handle = salt.client.LocalClient()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 495, in cmd_async
    tgt, fun, arg, tgt_type, ret, jid=jid, kwarg=kwarg, listen=False, **kwargs
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 409, in run_job
    raise SaltClientError(general_exception)
salt.exceptions.SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

If you suspect this is an IPython 7.20.0 bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipyth...@python.org

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

[ERROR   ] Message timed out
(virtualenv3.7.7) [root@bb11t1-mgmt traffic_tool]#



Can you let me know what could be the problem here?

Regards
Sai

Dmitry Golubenko

Dec 5, 2022, 9:05:57 PM
to salt-...@googlegroups.com

> Hi,
>
> Here are some observations:
>
> I tried with 800 hosts and it works perfectly fine and the time taken
> seems to be impressive:

Check
https://docs.saltproject.io/en/latest/topics/tutorials/intro_scale.html

and search this mailing list; some years ago there were discussions related to this topic, and one of the participants was 'Volker' if I remember correctly.




Aprameya NDS

Dec 12, 2022, 10:06:14 PM
to Salt-users
Hi,

I have gone through the scale tuning doc and configured most of the settings mentioned in the master and minion configs, but it is still not helping.

The batching mentioned in the doc does not apply here, as we are using cmd_async.

What we have observed is that once the host count is increased (to 987, the number we currently have), it starts failing right at the start, after 3-4 iterations.

ipdb> c
  1%|▊                                                                                                                                    | 6/987 [00:22<1:00:25,  3.70s/it]

Traceback (most recent call last):
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 1901, in pub
    payload = channel.send(payload_kwargs, timeout=timeout)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/utils/asynchronous.py",


We have a for loop over the hosts and trigger cmd_async one host at a time; it crashes after 5-6 iterations when the host count is increased. With 900 hosts it works perfectly.

Can you please give some explanation as to what the problem could be?
We are blocked and do not know how to proceed from here.

Regards
Sai

Aprameya NDS

Dec 12, 2022, 10:08:10 PM
to Salt-users
Adding the complete traceback:

ipdb> c
  1%|▊                                                                                                                                    | 6/987 [00:22<1:00:25,  3.70s/it]
Traceback (most recent call last):
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 1901, in pub
    payload = channel.send(payload_kwargs, timeout=timeout)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
Traceback (most recent call last):
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 396, in run_job
    **kwargs
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 1905, in pub
    "Salt request timed out. The master is not responding. You "
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "clean.py", line 69, in <module>
    run()
  File "clean.py", line 15, in run

    salt_handle = salt.client.LocalClient()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 495, in cmd_async
    tgt, fun, arg, tgt_type, ret, jid=jid, kwarg=kwarg, listen=False, **kwargs
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 409, in run_job
    raise SaltClientError(general_exception)
salt.exceptions.SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

If you suspect this is an IPython 7.20.0 bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipyth...@python.org

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

[ERROR   ] Message timed out

Nerigal

Dec 13, 2022, 10:46:57 AM
to salt-...@googlegroups.com


We have a for loop over the hosts and trigger cmd_async one host at a time; it crashes after 5-6 iterations when the host count is increased. With 900 hosts it works perfectly.

Could it be a case where you end up in an infinite loop when you have too many minions, because the loop starts over while the last run is not yet complete?
Do you have any kind of validation before starting the loop?


salt.exceptions.SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

I understand you are using cmd_async, but it definitely looks like something is flooding the event bus.
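
One thing I would try first is throttling the publishes so you are not firing ~2000 jobs at the master back to back. A rough, untested sketch (hosts, src and trg stand for the values from your script):

```
import time
import salt.client

salt_handle = salt.client.LocalClient()

hosts = [...]  # your ~1000 minion ids
src, trg = 'salt://scratchpad/clean.sh', '/root/clean.sh'  # your paths

CHUNK = 100   # publishes between pauses; tune for your master
PAUSE = 2.0   # seconds to let the req channel and event bus drain

for i, source in enumerate(hosts, 1):
    salt_handle.cmd_async(source, 'cp.get_file', [src, trg])
    salt_handle.cmd_async(source, 'ping_minion.clean_ping')
    if i % CHUNK == 0:
        time.sleep(PAUSE)   # crude back-pressure on the master
```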


That said, there are a lot of things that can be done to help your Salt master handle more returns from minions at once, but you need to make sure the state or module being executed is clean.

You may have done this already, but I believe it is still worth mentioning.

First, I would think about spreading out the minions' re-authentication at that scale:

recon_default: 1000
recon_max: 59000
recon_randomize: True


Then you should think about putting the cachedir on a RAM drive.
***NOTE***: if you do that, you really need a fast external system to use as the ext_job_cache.

This is a good way to avoid being slowed down by disk I/O. Each job return for every minion is saved in a single file, so over time this directory can grow quite large; the number of files and directories scales with the number of jobs published.

The default cachedir is:

cachedir: /var/cache/salt


But you can change it to a RAM drive; you want to use tmpfs:

mount -t tmpfs -o size=20g tmpfs /mnt/tmp

20 GB is a totally arbitrary value; you should base the size on the actual size of your current cachedir.


The Salt master will love you if you have as many cores as workers. You also need to make sure you have plenty of RAM: enough to cover the RAM drive and to NOT have to use swap at all.

For that you can also set:

sysctl vm.swappiness=10


By default some distributions used as servers set this to 60 (Ubuntu, for example); you want it as low as possible to prevent any unnecessary disk activity.


Then you can also think about increasing the number of worker_threads on the master (the default is 5). Don't increase this value higher than the number of cores minus 1.

This is one of many settings under the large-scale tuning section of the master config that can be analysed in your case.


Aprameya NDS

Dec 14, 2022, 2:04:21 PM
to Salt-users
Hi,
Thanks for the reply.

We will try the RAM drive part you suggested and get back; we have already applied most of the other configs mentioned.

One more observation: we had these two statements in a single for loop:

    salt_handle.cmd_async(f"{source}", 'cp.get_file', [src, trg])
    salt_handle.cmd_async(f"{source}", 'ping_minion.clean_ping')

This causes problems when the host count is 980+.

But when we split the two statements into separate for loops, it works. Can you comment on why it does not work with both statements in the same loop? It is an async call and should not be blocking.

Is it something to do with ZMQ, since the same source is targeted twice for two different commands?
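
For reference, the working split version looks roughly like this (a sketch; salt_handle, hosts, src and trg are as in the snippet above):

```
# Working variant: one full pass per command instead of both calls per host.
for source in hosts:
    salt_handle.cmd_async(f"{source}", 'cp.get_file', [src, trg])

for source in hosts:
    salt_handle.cmd_async(f"{source}", 'ping_minion.clean_ping')
```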

Regards
Sai

Aprameya NDS

Jan 10, 2023, 7:05:59 PM
to Salt-users
Hi Team,

Any update on the query I asked in my previous mail:

"""
But when we split the two statements into separate for loops, it works. Can you comment on why it does not work with both statements in the same loop? It is an async call and should not be blocking.

Is it something to do with ZMQ, since the same source is targeted twice for two different commands?
"""

Regards
Sai