Discussion about making salt-state execution MORE reliable


Amit Bhardwaj

Mar 26, 2018, 7:27:20 AM
to Salt-users
Hello all, 

I am starting this discussion to explore possible approaches to making salt job execution more reliable and ensuring the necessary actions are taken upon failures. To that end, I would like to brief everyone on the problem I am trying to solve. PLEASE redirect me to any other post that has already covered my use case.

I have a salt-master and around 10-15 salt-minions (which might scale up going forward). I have a Python script that calls salt-states on these salt-minions based on some meta-information (a VMTYPE value hardcoded in a file on each salt-minion). All these VMs (including the salt-master) are spawned on OpenStack. 

These are the issues for which I need to figure out a reliable approach for all salt jobs:

1) Some salt-states take more than 4-5 minutes (depending on the overall load on the compute node's disk). During this time I have no way of knowing whether the job has finished or not. 
2) Even if the job finishes and returns to the salt-master, I am NOT sure whether all the sub-states executed successfully. 
3) I have a few third-party daemons (nginx, rsyslog, etc.) running on these minions as well. I want to restart them if they are not running on a minion. For this there should be a notification from the salt-minion to the master telling the master to restart these services. 

So far I have explored the following:

1) I can get the currently running jobs on a minion. Based on the job_id, I can decide to wait, or kill the job and re-run the salt-state (BUT how can I be sure whether the salt job is actually progressing or not?). A rough sketch of the commands I mean follows this list.
2) I am aware that there is a salt event module that can fire events to the salt-master. BUT can anyone explain, or point me to, an example that applies this to my use case? 
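
For 1), I mean something along these lines (the jid and minion name below are placeholders):

salt-run jobs.active                                        # which jobs are still running, and on which minions
salt 'app-minion-1' saltutil.kill_job 20180326072720123456  # give up on a stuck job
salt 'app-minion-1' state.apply myapp.deploy                # re-run the state afterwards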

I am also open to any better approach that is asynchronous in nature, so that the salt-master isn't stuck waiting for a response from one salt-minion and thereby delaying salt-state execution on the other minions. 
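
For the asynchronous part, I am thinking of something along the lines of the salt CLI's --async flag (target and sls names are placeholders):

salt --async 'app-minion-*' state.apply myapp.deploy    # returns a jid immediately instead of blocking
salt-run jobs.lookup_jid 20180326072720123456           # poll the result later using that jid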

Looking forward to hearing back from the experts here! 

TIA

Jeremy McMillan

Mar 27, 2018, 9:58:31 AM
to Salt-users
  1. Listen to the events.
    1. Increase the timeout on the salt command so that it tolerates 4-5 minute state.apply jobs while it listens to events for you.
    2. Run your state.apply in an orchestration job that can do the right thing in case of failures (notifications via salt events?) - see the sketch after this list.
  2. Write your states to use requisites for ordering.
    1. If you wire those requisites up, Salt will tell you if and how things go wrong, since a failed requisite fails everything that depends on it.
    2. Consider using some "onfail" states to help with the robustness of your state.apply jobs (rough sketch after this list).
  3. Configure beacons and reactors to restart services if necessary.
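
To make 1.2 and 2 concrete, here is a rough sketch. All file paths, state IDs, targets, and the myorg/... event tags are made up for illustration; the state modules themselves (pkg.installed, file.managed, service.running, event.send, salt.state) are standard.

A minion-side SLS using requisites for ordering, with an onfail notification back to the master's event bus:

# /srv/salt/app/deploy.sls  (hypothetical path)
nginx_pkg:
  pkg.installed:
    - name: nginx

nginx_conf:
  file.managed:
    - name: /etc/nginx/nginx.conf
    - source: salt://app/files/nginx.conf
    - require:
      - pkg: nginx_pkg

nginx_service:
  service.running:
    - name: nginx
    - enable: True
    - watch:
      - file: nginx_conf

notify_master_of_failure:
  event.send:
    - name: myorg/app/nginx/failed
    - onfail:
      - service: nginx_service

And an orchestration wrapper that applies it and falls back to a remediation SLS on failure:

# /srv/salt/orch/provision.sls  (hypothetical path)
apply_app_states:
  salt.state:
    - tgt: 'app-minion-*'
    - sls: app.deploy

remediate_on_failure:
  salt.state:
    - tgt: 'app-minion-*'
    - sls: app.remediate
    - onfail:
      - salt: apply_app_states

Run it with "salt-run state.orchestrate orch.provision" and watch the master's event bus (salt-run state.event pretty=True) for the failure tags.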
Salt jobs from a single state.apply with multiple targets run in parallel across minions; they do not wait for each other. Individual states get compiled on each minion into a list of serially executed "low chunks", which execute one after another unless you have told them to run in parallel (multiprocessing "threads") on the minion, which is usually not the case.
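
If you do want one long-running chunk to stop blocking the rest, recent Salt releases accept parallel as a global state argument; a minimal sketch (the script path is made up):

long_running_sync:
  cmd.run:
    - name: /usr/local/bin/sync_data.sh
    - parallel: True

The chunk is forked off and the remaining low chunks continue; anything with a requisite on it still waits for it to finish.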

Shane Gibson

Mar 27, 2018, 10:24:51 AM
to Salt-users
On Monday, March 26, 2018 at 4:27:20 AM UTC-7, Amit Bhardwaj wrote:

1) Some salt-states take more than 4-5 minutes (depending on the overall load on the compute node's disk). During this time I have no way of knowing whether the job has finished or not. 
2) Even if the job finishes and returns to the salt-master, I am NOT sure whether all the sub-states executed successfully. 

Amit - since you mention "jobs" - I'm assuming you know about the Jobs subsystem:


If your return from the salt command has timed out while the job is still running in the background, the job keeps running; you can access its state and return information from the jobs subsystem afterwards.  There are some limitations in the job subsystem and job cache.  See the "Managing the Job Cache" docs:
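
For example (the jids below are placeholders):

salt-run jobs.list_jobs                         # recent jobs held in the job cache
salt-run jobs.lookup_jid 20180327102451123456   # the full per-state return data for one job
salt-run jobs.print_job 20180327102451123456    # same, plus metadata such as target, function and start time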


If those limitations become an issue, you might need to explore the external job cache / returner capabilities to capture job run information in a longer-lived store (or if you need to manage job run audit and compliance):
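
As a rough sketch, pointing the master's job cache at an external store looks like this (the mysql returner is just one option; the host and credentials are placeholders, and the returner needs its database schema and the Python MySQL bindings installed):

# /etc/salt/master  (sketch)
master_job_cache: mysql
mysql.host: 'db.example.internal'
mysql.user: 'salt'
mysql.pass: 'changeme'
mysql.db: 'salt'
mysql.port: 3306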


 
3) I have a few third-party daemons (nginx, rsyslog, etc.) running on these minions as well. I want to restart them if they are not running on a minion. For this there should be a notification from the salt-minion to the master telling the master to restart these services. 


Depending on which Linux distro and version you are running, you might explore using native tools like systemd to enforce that a given daemon is always running, by having systemd restart it properly should it die.  Salt would then be best served by ensuring that the configuration of the systemd unit file is correct to catch these cases.  This is a standard pattern to consider following: 1) use the native tools to do the important things like service restarts, 2) use Salt to enforce that the configuration of these tools is correct for your use case. 
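
As a sketch of that pattern, with nginx as the example (the drop-in path and values are illustrative):

nginx_restart_policy:
  file.managed:
    - name: /etc/systemd/system/nginx.service.d/restart.conf
    - makedirs: True
    - contents: |
        [Service]
        Restart=on-failure
        RestartSec=5s

reload_systemd_units:
  module.run:
    - name: service.systemctl_reload
    - onchanges:
      - file: nginx_restart_policy

nginx_running:
  service.running:
    - name: nginx
    - enable: True
    - require:
      - file: nginx_restart_policy

systemd then handles the moment-to-moment restarts, and Salt only has to assert that the policy and the service state are correct.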

The lighter weight you make Salt, and the less work it's trying to do, the better overall experience you'll have with it.  I'm not saying that Salt isn't capable of enforcing that services are running or not, but applying each tool carefully, with the idea of reducing the burden on all of the various pieces and parts, makes for a much lighter-weight and cleaner system. 

~~shane 

Jeremy McMillan

Mar 27, 2018, 3:47:03 PM
to Salt-users
+1 Shane's suggestion to use systemd.

Consider this: if systemd fails to restart a daemon multiple times, it will stop trying. Restarting it manually will also (predictably) fail. This probably means the daemon is unreliable, or it has been subjected to configuration or operational load that makes it unreliable. The goal of the salt states that set up the daemon and configure it should be to make it robust so that it does not require active management. Do you have memory contention? Maybe the service that dies is a victim and not the culprit, and you need to set limits on something else?

Here you can trigger something like:
journalctl -u service-name.service -b

This will allow you to collect some failure details, and possibly send them somewhere useful (an email returner?).
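
As a rough sketch of the beacon/reactor wiring (the exact beacon config syntax differs between Salt releases, the event tag glob may need adjusting to what your version emits, and the paths and service names are just examples):

# /etc/salt/minion.d/beacons.conf  (on each minion)
beacons:
  service:
    - services:
        nginx: {}
        rsyslog: {}
    - interval: 30

# /etc/salt/master.d/reactor.conf  (on the master)
reactor:
  - 'salt/beacon/*/service/':
    - /srv/reactor/service_down.sls

# /srv/reactor/service_down.sls
restart_and_collect_logs:
  local.cmd.run:
    - tgt: {{ data['id'] }}
    - arg:
      - 'systemctl restart nginx; journalctl -u nginx.service -b --no-pager | tail -n 50'

In a real setup you would pull the failed service's name out of the beacon payload instead of hardcoding nginx, and you could point the output at an email/SMTP returner if you have one configured.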

Amit Bhardwaj

Apr 3, 2018, 8:18:27 AM
to Salt-users
Thanks, Jeremy and Shane, for the valuable inputs. 

I was just wondering whether salt can send different retcodes depending on the failure type.
For example, if pkg.install fails due to unresolved dependencies, error code A would be sent, whereas if pkg.install fails due to, say, a repository response failure, error code B would be sent. 

Based on that, I could handle the further actions to be taken with much more control. 

Regards
Amit