Hello all,
I am starting this discussion to explore possible approaches to make Salt job execution more reliable and to ensure the necessary actions are taken on failure. First, let me briefly describe the problem I am trying to solve. PLEASE redirect me to any existing post that already covers my use-case.
I have a salt-master and around 10-15 salt-minions (which might scale up going forward). I have a Python script that calls salt-states on these salt-minions based on some meta-information (a VMTYPE value hardcoded in a file on each salt-minion). All these VMs (including the salt-master) are spawned on OpenStack.
I have run into the following issues, for each of which I need to figure out a reliable approach for all salt jobs:
1) Some salt-states take more than 4-5 minutes (depending on the overall load on the compute node's disk). During this time I have no way of knowing whether the job has finished or not.
2) Even if the job finishes and returns to the salt-master, I am NOT sure whether all of the sub-states executed successfully.
3) I also have a few 3rd-party daemons (nginx, rsyslog, etc.) running on these minions. I want to restart them if they are not running on a minion. For this, there should be a notification from the salt-minion to the master, telling the master to restart these services.
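For issue 2, the return Salt sends back from a state.apply/state.highstate run is a nested dict keyed by minion ID and state ID, where each sub-state carries a `result` flag. A minimal sketch of walking that structure to find failed sub-states; the sample data below is illustrative, not real Salt output:

```python
# Sketch: inspect a state.apply return to find failed sub-states.
# Each sub-state outcome has a 'result' key; False (or None, which Salt
# uses for test-mode/unresolved states) is treated here as a failure.

def failed_states(job_return):
    """Return {minion_id: [state_ids that did not succeed]}."""
    failures = {}
    for minion_id, states in job_return.items():
        bad = [state_id for state_id, outcome in states.items()
               if not outcome.get("result")]
        if bad:
            failures[minion_id] = bad
    return failures

# Illustrative highstate return for two minions:
sample = {
    "minion-1": {
        "service_|-nginx_|-nginx_|-running": {"result": True, "changes": {}},
        "service_|-rsyslog_|-rsyslog_|-running": {"result": False,
                                                  "comment": "service died"},
    },
    "minion-2": {
        "service_|-nginx_|-nginx_|-running": {"result": True, "changes": {}},
    },
}

print(failed_states(sample))
# → {'minion-1': ['service_|-rsyslog_|-rsyslog_|-running']}
```

The same walk could also collect which services need a restart (issue 3) by matching on the `service_` state IDs.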
So far I have explored this approach:
1) I can list the currently running jobs on a minion. Based on the job ID, I can decide to wait, or to kill the job and re-run the salt-state (BUT how can I be sure whether the salt job is actually progressing or not?)
2) I am aware that there is a Salt event module that can fire events to the salt-master. BUT, can anyone explain, or point me to, an example that addresses my use-case?
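For point 2, one pattern I have seen is a minion firing a custom event with `salt-call event.send custom/service/down '{"service": "nginx"}'`, and the master reacting to it via the reactor system. A rough sketch, where the event tag and file paths are my own assumptions:

```yaml
# Master config (e.g. /etc/salt/master.d/reactor.conf) -- map a custom
# event tag to a reactor SLS file; tag and paths are assumptions:
reactor:
  - 'custom/service/down':
      - /srv/reactor/restart_service.sls
```

```yaml
# /srv/reactor/restart_service.sls -- restart the reported service on
# the minion that fired the event. For minion events, data['id'] is the
# sending minion and data['data'] is the payload it sent:
restart_reported_service:
  local.service.restart:
    - tgt: {{ data['id'] }}
    - arg:
      - {{ data['data']['service'] }}
```

This keeps the restart decision on the master while the detection happens on the minion (e.g. from a beacon or a cron'd health check).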
I am also open to any better approach that is asynchronous in nature, so that the salt-master isn't stuck waiting for a response from one salt-minion and thereby delaying salt-state execution on the other minions.
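On the asynchronous side: Salt can dispatch a job and hand back a jid immediately (e.g. `salt --async`, or `LocalClient.cmd_async` from Python), after which the caller polls `salt-run jobs.lookup_jid <jid>` until every targeted minion has returned or a timeout expires. A minimal sketch of that poll-with-timeout loop, with the lookup call stubbed out by a plain callable so the pattern is visible without a running master; names here are my own, not Salt API:

```python
import time

def wait_for_job(lookup, jid, minions, timeout=300, interval=5):
    """Poll `lookup(jid)` until every expected minion has returned, or
    give up after `timeout` seconds. `lookup` stands in for a real call
    such as `salt-run jobs.lookup_jid <jid>` and must return a
    {minion_id: return_data} dict of the returns seen so far."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        returned = lookup(jid)
        if set(minions) <= set(returned):
            return returned               # every targeted minion answered
        time.sleep(interval)
    missing = set(minions) - set(lookup(jid))
    raise TimeoutError(f"jid {jid}: no return from {sorted(missing)}")

# Stubbed lookup for illustration: both minions have already returned.
fake_store = {"20240101010101": {"minion-1": True, "minion-2": True}}
result = wait_for_job(fake_store.get, "20240101010101",
                      ["minion-1", "minion-2"], timeout=10, interval=1)
print(result)
# → {'minion-1': True, 'minion-2': True}
```

Running one such wait per jid (or driving it from an event loop) means a slow minion only delays its own job, not the dispatch to the others.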
Looking forward to hearing back from the experts here!
TIA