salt.master.FileserverUpdate died with exit status 0 when working on 1000 VMs


Aprameya NDS

Dec 1, 2022, 7:29:16 PM
to Salt-users
Hi Team,

We are executing scripts by copying them onto the minions and then running them through a custom module.
We have close to 1000 VMs, and when the commands below are executed, the run exits with the following traceback:

    salt_handle.cmd_async(f"{source}", 'cp.get_file', [src, trg])
    salt_handle.cmd_async(f"{source}", 'ping_minion.clean_ping')

The first command copies the file; the second calls the custom module present on the minions to execute the copied script.
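
For context, the full dispatch loop looks roughly like this (a sketch; the host list and the src/trg paths shown here are placeholders for values that come from our inventory):

```
import salt.client

salt_handle = salt.client.LocalClient()

# Placeholders for illustration; the real values come from our inventory.
hosts = ['bb12-5gfi-d-c14-b02.bb11t1.local']   # ~1000 minion ids
src = 'salt://scratchpad/clean.sh'             # script on the master fileserver
trg = '/root/clean.sh'                         # destination path on the minion

for source in hosts:
    # Copy the cleanup script to the minion...
    salt_handle.cmd_async(f"{source}", 'cp.get_file', [src, trg])
    # ...then have our custom module execute it there.
    salt_handle.cmd_async(f"{source}", 'ping_minion.clean_ping')
```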


On the master side I see the following messages:
===================================
Dec  1 16:20:07 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec  1 16:20:07 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#41973: Invalid argument
Dec  1 16:20:07 bb11t1-mgmt named[1689]: client @0x7f9cb402dc80 172.16.0.5#41973 (bb12-5gfi-d-c14-b02.bb11t1.local): error sending response: invalid file
Dec  1 16:20:10 bb11t1-mgmt salt-master[509142]: [DEBUG   ] Process <class 'salt.master.FileserverUpdate'> (1230778) died with exit status 0, restarting...


SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

During handling of the above exception, another exception occurred:

SaltClientError                           Traceback (most recent call last)



salt --versions-report
Salt Version:
          Salt: 3005.1
 
Dependency Versions:
          cffi: 1.14.5
      cherrypy: Not Installed
      dateutil: 2.8.1
     docker-py: Not Installed
         gitdb: Not Installed
     gitpython: Not Installed
        Jinja2: 2.11.3
       libgit2: Not Installed
      M2Crypto: Not Installed
          Mako: Not Installed
       msgpack: 1.0.4
  msgpack-pure: Not Installed
  mysql-python: Not Installed
     pycparser: 2.20
      pycrypto: 3.9.8
  pycryptodome: 3.16.0
        pygit2: Not Installed
        Python: 3.7.7 (default, Feb 16 2021, 12:37:08)
  python-gnupg: Not Installed
        PyYAML: 5.3.1
         PyZMQ: 20.0.0
         smmap: Not Installed
       timelib: Not Installed
       Tornado: 4.5.3
           ZMQ: 4.3.3
 
System Versions:
          dist: centos 8 Core
        locale: UTF-8
       machine: x86_64
       release: 4.18.0-193.el8.x86_64
        system: Linux
       version: CentOS Linux 8 Core


Can anyone please let me know how to proceed further? This is a scale setup that we have here.

Regards
Sai

Twangboy

Dec 2, 2022, 11:46:30 AM
to Salt-users
Salt has a system in place for using custom modules; see the documentation on writing execution modules.
Basically, you put your custom modules in the file_roots on the master. These can be execution modules, states, grains, etc. The directory names are prefixed with an underscore. For example, the default file_roots location on a master is `/srv/salt`, so you would create the following directories for the custom modules you want to sync:

- /srv/salt/_modules
- /srv/salt/_states
- /srv/salt/_grains

Place execution modules in the `_modules` directory, state modules in the `_states` directory, and so forth. Then run a sync:

```
# sync everything
salt '*' saltutil.sync_all

# sync only execution modules
salt '*' saltutil.sync_modules
```

The sync functions are documented in the saltutil execution module.

After they are synced, they should behave like any Salt module.
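
For example, assuming your module file is `_modules/ping_minion.py` exposing a `clean_ping` function (as your snippet suggests), it would then be callable as:

```
salt '*' ping_minion.clean_ping
```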

Aprameya NDS

Dec 2, 2022, 1:22:35 PM
to Salt-users
Hi,
I don't think you read or understood the problem here.

We have tested with 500 VMs and it works. Our custom module itself is working fine.

Please check the traces we get when we run against 1000 VMs:



://scratchpad/clean_BB12-5GFI-C-C10-B04.sh', '/root/clean_BB12-5GFI-C-C10-B04.sh'], '_stamp': '2022-12-02T00:36:42.131010'}
Dec  1 16:36:42 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec  1 16:36:42 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#65406: Invalid argument
Dec  1 16:36:42 bb11t1-mgmt named[1689]: client @0x7f9ce0033070 172.16.0.5#65406 (bb12-5gfi-d-c15-b06.bb11t1.local): error sending response: invalid file
Dec  1 16:36:42 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec  1 16:36:42 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#25046: Invalid argument
Dec  1 16:36:42 bb11t1-mgmt named[1689]: client @0x7f9ce0033070 172.16.0.5#25046 (bb12-5gfi-d-c15-b08.bb11t1.local): error sending response: invalid file
Dec  1 16:36:42 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec  1 16:36:42 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#12796: Invalid argument
Dec  1 16:36:42 bb11t1-mgmt named[1689]: client @0x7f9ce0033070 172.16.0.5#12796 (bb12-5gfi-d-c15-b01.bb11t1.local): error sending response: invalid file
Dec  1 16:36:42 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec  1 16:36:42 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#33238: Invalid argument
Dec  1 16:36:42 bb11t1-mgmt named[1689]: client @0x7f9cc00323e0 172.16.0.5#33238 (bb12-5gfi-d-c15-b02.bb11t1.local): error sending response: invalid file
Dec  1 16:36:42 bb11t1-mgmt named[1689]: ../../../../lib/isc/unix/socket.c:2187: unexpected error:
Dec  1 16:36:42 bb11t1-mgmt named[1689]: internal_send: 172.16.0.5#63891: Invalid argument
Dec  1 16:36:42 bb11t1-mgmt named[1689]: client @0x7f9ce802c6e0 172.16.0.5#63891 (bb12-5gfi-d-c14-b05


When we run against hosts individually it works fine. If the modules were not placed correctly, it would never work, right?

Regards
Sai

Aprameya NDS

Dec 5, 2022, 3:23:30 PM
to Salt-users
Hi,

Here are some observations:

I tried with 800 hosts and it works perfectly fine, and the time taken seems impressive:

(virtualenv3.7.7) [root@bb11t1-mgmt traffic_tool]# python clean.py


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 947/947 [00:19<00:00, 49.46it/s]

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 947/947 [00:18<00:00, 52.51it/s]

(virtualenv3.7.7) [root@bb11t1-mgmt traffic_tool]#



But when the host count is increased by another ~50 hosts, totaling 995, we start getting problems and there is a traceback dump:



(virtualenv3.7.7) [root@bb11t1-mgmt traffic_tool]# python clean.py

 32%|████████████████████████████████████▏                                                                            | 319/995 [00:45<01:35,  7.04it/s]
Traceback (most recent call last):
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 1901, in pub
    payload = channel.send(payload_kwargs, timeout=timeout)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/utils/asynchronous.py", line 131, in _target
    result = io_loop.run_sync(lambda: getattr(self.obj, key)(*args, **kwargs))
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/ioloop.py", line 459, in run_sync
    return future_cell[0].result()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/channel/client.py", line 292, in send
    ret = yield self._uncrypted_transfer(load, timeout=timeout)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/channel/client.py", line 269, in _uncrypted_transfer
    timeout=timeout,
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/transport/zeromq.py", line 914, in send
    ret = yield self.message_client.send(load, timeout=timeout)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1064, in run
    yielded = self.gen.throw(*exc_info)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/transport/zeromq.py", line 624, in send
    recv = yield future
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/gen.py", line 1056, in run
    value = future.result()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/ext/tornado/concurrent.py", line 249, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
salt.exceptions.SaltReqTimeoutError: Message timed out


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 396, in run_job
    **kwargs
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 1905, in pub

    "Salt request timed out. The master is not responding. You "
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "clean.py", line 59, in <module>
    run()
  File "clean.py", line 14, in run
    salt_handle = salt.client.LocalClient()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 495, in cmd_async
    tgt, fun, arg, tgt_type, ret, jid=jid, kwarg=kwarg, listen=False, **kwargs
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 409, in run_job
    raise SaltClientError(general_exception)
salt.exceptions.SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

If you suspect this is an IPython 7.20.0 bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipyth...@python.org

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

[ERROR   ] Message timed out
(virtualenv3.7.7) [root@bb11t1-mgmt traffic_tool]#



Can you let me know what could be the problem here?

Regards
Sai

Dmitry Golubenko

Dec 5, 2022, 9:05:57 PM
to salt-...@googlegroups.com

> Hi,
>
> Here are some observations:
>
> I tried with 800 hosts and it works perfectly fine and the time taken
> seems to be impressive:

Check
https://docs.saltproject.io/en/latest/topics/tutorials/intro_scale.html

and search this mailing list; some years ago there were discussions related to this topic, and one of the participants was 'Volker' if I remember correctly.




Aprameya NDS

Dec 12, 2022, 10:06:14 PM
to Salt-users
Hi,

I have gone through the scale tuning doc and configured most of the settings mentioned in the master and minion configs, but it is still not helping.

The batching mentioned in the doc does not apply here, as we are using cmd_async.

What we have observed is that once the host count is increased (to 987, the number we currently have), it starts failing right at the start, after 3-4 iterations.

ipdb> c
  1%|▊                                                                                                                                    | 6/987 [00:22<1:00:25,  3.70s/it]

Traceback (most recent call last):
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 1901, in pub
    payload = channel.send(payload_kwargs, timeout=timeout)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/utils/asynchronous.py",


We have a for loop over the hosts and trigger cmd_async one host at a time; it crashes after 5-6 iterations when the host count is increased. With 900 hosts it works perfectly.

Can you please give some explanation as to what the problem could be?
We are blocked and do not know how to proceed from here.

Regards
Sai

Aprameya NDS

Dec 12, 2022, 10:08:10 PM
to Salt-users
Adding the complete traceback:

ipdb> c
  1%|▊                                                                                                                                    | 6/987 [00:22<1:00:25,  3.70s/it]
Traceback (most recent call last):
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 1901, in pub
    payload = channel.send(payload_kwargs, timeout=timeout)
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/utils/asynchronous.py", line 125, in wrap
    raise exc_info[1].with_traceback(exc_info[2])
Traceback (most recent call last):
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 396, in run_job
    **kwargs
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 1905, in pub
    "Salt request timed out. The master is not responding. You "
salt.exceptions.SaltReqTimeoutError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "clean.py", line 69, in <module>
    run()
  File "clean.py", line 15, in run

    salt_handle = salt.client.LocalClient()
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 495, in cmd_async
    tgt, fun, arg, tgt_type, ret, jid=jid, kwarg=kwarg, listen=False, **kwargs
  File "/root/virtualenv3.7.7/lib/python3.7/site-packages/salt/client/__init__.py", line 409, in run_job
    raise SaltClientError(general_exception)
salt.exceptions.SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

If you suspect this is an IPython 7.20.0 bug, please report it at:
    https://github.com/ipython/ipython/issues
or send an email to the mailing list at ipyth...@python.org

You can print a more detailed traceback right now with "%tb", or use "%debug"
to interactively debug it.

Extra-detailed tracebacks for bug-reporting purposes can be enabled via:
    %config Application.verbose_crash=True

[ERROR   ] Message timed out

Nerigal

Dec 13, 2022, 10:46:57 AM
to salt-...@googlegroups.com


We have a for loop over the hosts and trigger cmd_async one host at a time; it crashes after 5-6 iterations when the host count is increased. With 900 hosts it works perfectly.

Could it be a case where you end up in an infinite loop when you have too many minions, because the loop starts over while the last run is not yet complete?
Do you have any kind of validation before starting the loop?


salt.exceptions.SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.

I understand you are using cmd_async, but it definitely looks like something is flooding the event bus.
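
One thing I would try first is throttling the publishes so you are not firing ~2000 jobs at the master back to back. A rough, untested sketch (hosts, src and trg stand for the values from your script):

```
import time
import salt.client

salt_handle = salt.client.LocalClient()

hosts = [...]  # your ~1000 minion ids
src, trg = 'salt://scratchpad/clean.sh', '/root/clean.sh'  # your paths

CHUNK = 100   # publishes between pauses; tune for your master
PAUSE = 2.0   # seconds to let the req channel and event bus drain

for i, source in enumerate(hosts, 1):
    salt_handle.cmd_async(source, 'cp.get_file', [src, trg])
    salt_handle.cmd_async(source, 'ping_minion.clean_ping')
    if i % CHUNK == 0:
        time.sleep(PAUSE)   # crude back-pressure on the master
```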


That said, there are a lot of things that can be done to help your Salt master handle more returns from minions at once, but you need to make sure the state or module being executed is clean.

You may have done this already, but I believe it is still worth mentioning.

First, I would think about spreading out the minions' re-authentication at that scale:

recon_default: 1000
recon_max: 59000
recon_randomize: True


Then you should think about putting the cachedir on a RAM drive.
***NOTE***: if you do that, you really need a fast external system to use as the ext_job_cache.

This is a good way to avoid being slowed down by disk I/O. Each job return for every minion is saved in a single file, so over time this directory can grow quite large; the number of files and directories scales with the number of jobs published.

The default cachedir is:

cachedir: /var/cache/salt


But you can change it to a RAM drive; you want to use tmpfs:

mount -t tmpfs -o size=20g tmpfs /mnt/tmp

20 GB is a totally arbitrary value; you should base the size on the actual size of your current cachedir.


The Salt master will love you if you have as many cores as workers. You also need to make sure you have plenty of RAM: enough to cover the RAM drive and to NOT have to use swap at all.

For that you can also set:

sysctl vm.swappiness=10


By default some distributions used as servers set this to 60 (Ubuntu, for example); you want it as low as possible to prevent any unnecessary disk activity.


Then you can also think about increasing the number of worker_threads on the master (the default is 5). Don't increase this value higher than the number of cores minus 1.

This is one of many settings under the large-scale tuning section of the master config that can be analysed in your case.


Aprameya NDS

Dec 14, 2022, 2:04:21 PM
to Salt-users
Hi,
Thanks for the reply.

We will try the RAM drive part you suggested and get back; we have already applied most of the other configs mentioned.

One more observation: we had these two statements in a single for loop:

    salt_handle.cmd_async(f"{source}", 'cp.get_file', [src, trg])
    salt_handle.cmd_async(f"{source}", 'ping_minion.clean_ping')

This causes problems when the host count is 980+.

But when we split the two statements into separate for loops, it works. Can you comment on why it does not work with both statements in the same loop? It is an async call and should not be blocking.

Is it something to do with ZMQ, since the same source is targeted twice for two different commands?
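
For reference, the working split version looks roughly like this (a sketch; salt_handle, hosts, src and trg are as in the snippet above):

```
# Working variant: one full pass per command instead of both calls per host.
for source in hosts:
    salt_handle.cmd_async(f"{source}", 'cp.get_file', [src, trg])

for source in hosts:
    salt_handle.cmd_async(f"{source}", 'ping_minion.clean_ping')
```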

Regards
Sai

Aprameya NDS

Jan 10, 2023, 7:05:59 PM
to Salt-users
Hi Team,

Any update on the query I asked in my previous mail:

"""
But when we split the two statements into separate for loops, it works. Can you comment on why it does not work with both statements in the same loop? It is an async call and should not be blocking.

Is it something to do with ZMQ, since the same source is targeted twice for two different commands?
"""

Regards
Sai