(virtualenv3.7.7) [root@bb11t1-mgmt traffic_tool]# python clean.py
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 947/947 [00:19<00:00, 49.46it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 947/947 [00:18<00:00, 52.51it/s]
(virtualenv3.7.7) [root@bb11t1-mgmt traffic_tool]#
We have a for loop over the hosts and trigger cmd_async one host at a time; it crashes after 5-6 iterations when the host count is increased. With 900 hosts it works perfectly.
Could it be a case where you end up in an infinite loop when you have too many minions, because the loop starts over while the last one has not completed yet?
Do you have any kind of validation before starting the loop?
salt.exceptions.SaltClientError: Salt request timed out. The master is not responding. You may need to run your command with `--async` in order to bypass the congested event bus. With `--async`, the CLI tool will print the job id (jid) and exit immediately without listening for responses. You can then use `salt-run jobs.lookup_jid` to look up the results of the job in the job cache later.
I understand you are using cmd_async, but it definitely looks like something is flooding the event bus.
That said, there are a lot of things that can be done to help your Salt master handle more minions returning at once, but you need to make sure the state or module you execute is clean.
You may have done this already, but I believe it is still worth mentioning.
First, at that scale, I would think about spreading out the minions' re-authentication over time:
recon_default: 1000
recon_max: 59000
recon_randomize: True
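To see what these minion settings actually do, here is a minimal sketch, assuming (per my reading of the minion config docs) that with `recon_randomize: True` each minion picks a random reconnect delay between recon_default and recon_default + recon_max milliseconds:

```python
import random

# minion config values from above (milliseconds)
recon_default = 1000
recon_max = 59000

# With recon_randomize: True, each minion waits a random interval in
# [recon_default, recon_default + recon_max] ms before re-authenticating,
# spreading thousands of reconnects over a ~60 second window instead of
# letting them all stampede the master at once.
delay_ms = random.randint(recon_default, recon_default + recon_max)
print(delay_ms)
```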
Then you should think about putting the cachedir onto a RAM drive.
***NOTE***: if you do that, you really need an external, fast system to use as your ext_job_cache.
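As an illustration, an external job cache can be pointed at a returner such as Redis in the master config. This is only a sketch: the hostname is hypothetical, and any other external returner (mysql, etc.) works the same way:

```yaml
# /etc/salt/master -- push job returns to an external cache
ext_job_cache: redis
redis.host: redis-cache.example.com   # hypothetical: your external, fast box
redis.port: 6379
redis.db: 0
```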
But this is a good way to avoid being slowed down by disk I/O.
Each job return from every minion is saved in its own file, so over time this directory can grow quite large: the number of files and directories scales with the number of published jobs.
The default cache dir is
cachedir: /var/cache/salt
but you can change it to a RAM drive, and you want to use tmpfs:
mount -t tmpfs -o size=20g tmpfs /mnt/tmp
20 GB is a totally arbitrary value; you should base the size on the actual size of your current cachedir.
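If you want the mount to survive reboots, a matching fstab entry might look like this. A sketch only: it mounts directly over the default cachedir (you could equally mount at /mnt/tmp and point cachedir there, as above), and the size should come from checking `du -sh /var/cache/salt` on your master:

```
# /etc/fstab -- tmpfs for the Salt job cache (20g is an example value)
tmpfs  /var/cache/salt  tmpfs  size=20g,mode=0755  0  0
```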
The Salt master will love you if you have as many cores as workers.
Then you also need to make sure that you have plenty of RAM: enough to cover that RAM drive and to NOT have to touch swap at all.
For that, you can also set:
sysctl vm.swappiness=10
By default this is set to 60 on OSes that are the wrong choice for a server, like Ubuntu.
You want it as low as possible to prevent any unnecessary disk activity.
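The sysctl command above only lasts until the next reboot; to make the setting persistent, drop it into a sysctl.d file (the filename here is arbitrary):

```
# /etc/sysctl.d/99-salt-tuning.conf
vm.swappiness = 10
```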
Then you can also think about increasing the number of
worker_threads
on the master; by default it should be 5.
Don't increase this value higher than the number of cores minus one.
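As a concrete example, assuming a hypothetical 16-core master and following the cores-minus-one rule above:

```yaml
# /etc/salt/master
worker_threads: 15   # 16 cores - 1
```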
This is one of many settings under the Large-Scale Tuning Settings section of the master config that could be analysed in your case.
--
You received this message because you are subscribed to the Google Groups "Salt-users" group.