Monitoring Saltstack


Charles Baker

Oct 8, 2014, 12:20:25 PM
to salt-...@googlegroups.com
How do you guys monitor saltstack itself?

I know I can use `salt-run manage.down' to get an idea of which minions aren't responding, but it's possible to get different results with runs very close together. I imagine this is due to the `test.ping' to individual minions timing out. This could lead to a lot of false positives and extraneous alerts when the results are fed to a monitoring system.
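One way to damp the flapping described above is to alert only on minions that show up as down in several consecutive runs. The following is a minimal sketch, not an established recipe: the function names, the retry count, and the delay are illustrative assumptions, and it assumes `salt-run manage.down --out=json` prints a JSON list of minion IDs.

```python
import json
import subprocess
import time


def down_in_every_run(runs):
    """Return, sorted, the minions that were reported down in every run."""
    if not runs:
        return []
    persistent = set(runs[0])
    for run in runs[1:]:
        persistent &= set(run)
    return sorted(persistent)


def manage_down():
    """Ask the master which minions are down.

    Assumption: the runner output is a JSON list of minion IDs.
    """
    out = subprocess.check_output(["salt-run", "manage.down", "--out=json"])
    return json.loads(out.decode() or "[]")


def persistent_down(attempts=3, delay=30):
    """Run manage.down several times and keep only the consistent results."""
    runs = []
    for i in range(attempts):
        if i:
            time.sleep(delay)  # give slow minions a chance to answer next time
        runs.append(manage_down())
    return down_in_every_run(runs)
```

Feeding only the result of `persistent_down()` to the monitoring system trades a slower alert for fewer false positives.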

What methods do you all use to monitor minion health? How do you keep track of failed state runs? Is there a set of best practices for this that I've just not come across? Thanks.

--
Charles H. Baker
864.990.1297
Knowing is not enough; we must apply. Willing is not enough; we must do. Bruce Lee

Stephen Spencer

Oct 8, 2014, 12:28:26 PM
to salt-...@googlegroups.com
I'm not in production with my Salt work yet, but my plan is to create a Nagios plugin that will periodically issue a salt-call to fetch a default grain or some such. If it fails, the counter starts. Whether it will attempt to restart the minion on its own or just make Nagios squawk from the first reasonable certainty of failure is a policy decision that hasn't been made yet.
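A rough sketch of what such a plugin could look like, assuming it runs on the minion itself and keeps its counter in a small JSON file. The state-file location, the thresholds, and the choice of `grains.item os` as the probe are illustrative assumptions, not a finished design.

```python
import json
import os
import subprocess
import tempfile

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes

# Hypothetical location for the failure counter.
STATE_PATH = os.path.join(tempfile.gettempdir(), "check_salt_minion.json")


def probe(timeout=60):
    """Return True if salt-call can fetch a grain within the timeout."""
    try:
        subprocess.run(
            ["salt-call", "grains.item", "os"],
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        return True
    except (OSError, subprocess.SubprocessError):
        return False


def update_failures(state_path, succeeded):
    """Persist and return the number of consecutive failed checks."""
    count = 0
    if not succeeded:
        try:
            with open(state_path) as fh:
                count = json.load(fh).get("failures", 0)
        except (OSError, ValueError):
            count = 0  # no usable state file yet
        count += 1
    with open(state_path, "w") as fh:
        json.dump({"failures": count}, fh)
    return count


def nagios_status(failures, warn_after=1, crit_after=3):
    """Map the consecutive-failure count to a Nagios exit code."""
    if failures >= crit_after:
        return CRITICAL
    if failures >= warn_after:
        return WARNING
    return OK
```

A plugin entry point would call `nagios_status(update_failures(STATE_PATH, probe()))` and exit with the result; restarting the minion at some threshold could hook in at the same place once the policy is decided.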

-S

--
You know, I used to think it was awful that life was so unfair. Then I thought, wouldn't it be much worse if life were fair, and all the terrible things that happen to us come because we actually deserve them? So, now I take great comfort in the general hostility and unfairness of the universe.

Dan Sheridan

Oct 13, 2014, 7:01:10 AM
to salt-...@googlegroups.com
Hi,

I'm doing a test.ping from Nagios to see if a minion is responding, and a state.highstate test=true to see if it is up-to-date. I've uploaded them as Gists, but should probably get round to making a proper repo for them:

https://gist.github.com/djs52/705cd764ac782c1ab461
https://gist.github.com/djs52/e0f753e0d54f8e1890c9

The only caveat is that the Nagios user needs to be able to execute Salt commands -- the client_acl needs to be set and permissions on /var/run/salt need to be appropriate.
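For readers who can't open the Gists, here is a minimal sketch of the general idea behind the up-to-date check. It is not the code from the Gists. It assumes the check has already run something like `salt <minion> state.highstate test=True --out=json` and parsed the per-minion return into a dict keyed by state ID, where `result` is `None` for a state that would make changes in test mode and `False` for a failure. The mapping to WARNING and CRITICAL is an arbitrary choice for illustration.

```python
def classify_highstate(ret):
    """Split one minion's test-mode highstate return into (pending, failed)."""
    if not isinstance(ret, dict):
        # Assumption: a non-dict return (e.g. a list of strings) means the
        # highstate could not be rendered or compiled.
        return [], ["highstate did not return per-state results"]
    pending = sorted(k for k, v in ret.items() if v.get("result") is None)
    failed = sorted(k for k, v in ret.items() if v.get("result") is False)
    return pending, failed


def nagios_result(pending, failed):
    """Return a (exit_code, message) pair in Nagios plugin style."""
    if failed:
        return 2, "CRITICAL: %d state(s) failed" % len(failed)
    if pending:
        return 1, "WARNING: %d state(s) would change" % len(pending)
    return 0, "OK: minion is up to date"
```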

    Dan.

Charles Baker

Oct 13, 2014, 12:22:14 PM
to salt-...@googlegroups.com
Thanks for sharing, Dan. That looks to be a good approach.

Valentin Bud

Oct 14, 2014, 3:02:10 AM
to salt-...@googlegroups.com
Hello Charles,

I use M/Monit [1] to monitor the local salt-minion daemon, and also the salt-master
daemon on the Salt Master node.

I deploy the M/Monit configurations using Salt, of course :). It has worked well for the past
10 months.


Best,
Valentin

Denis Witt

Oct 17, 2014, 7:33:15 AM
to salt-...@googlegroups.com
On Mon, 13 Oct 2014 04:01:10 -0700 (PDT)
Dan Sheridan <dan.sh...@postman.org.uk> wrote:

> https://gist.github.com/djs52/705cd764ac782c1ab461
> https://gist.github.com/djs52/e0f753e0d54f8e1890c9

Thanks for sharing. One thing I noticed: if you check the highstate
output and have states whose functions always run (apt update via
pkg.refresh_db, for example, or cmd.* without any dependencies), you
will always end up with a CRITICAL state in Nagios.

An option to exclude those states would be nice.

By the way, what timeout value do you use with NRPE (if you use it)?
300, as in the script that waits for the highstate to return?

Regards.

Dan Sheridan

Oct 17, 2014, 11:48:29 AM
to salt-...@googlegroups.com
Yes, this is something that's been bothering me too, so I've added something very simple to match against states and ignore them (https://gist.github.com/djs52/e0f753e0d54f8e1890c9). It's hardcoded, but would probably be better as a configuration file or something.
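As an illustration of the match-and-ignore idea (a sketch, not the code in the Gist): given a list of state IDs, i.e. the keys of a highstate return, drop any that match a list of regular expressions. The patterns below are made-up examples and would be better loaded from a configuration file.

```python
import re

# Hypothetical ignore list: every cmd.* state, plus anything that calls
# pkg.refresh_db.
IGNORE_PATTERNS = [r"^cmd_\|-", r"pkg\.refresh_db"]


def drop_ignored(state_ids, patterns=IGNORE_PATTERNS):
    """Return the state IDs that match none of the ignore patterns."""
    compiled = [re.compile(p) for p in patterns]
    return [s for s in state_ids if not any(c.search(s) for c in compiled)]
```

Applying the filter before counting changed states keeps always-running states from turning the check CRITICAL.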

My Nagios server is my Salt master so all the checks run locally without NRPE. I have the check interval set to 30 minutes with a retry of 5 minutes. A service_check_timeout of 3 minutes seems to be working for me at the moment but YMMV.

    Dan.

Denis Witt

Oct 20, 2014, 11:51:34 AM
to salt-...@googlegroups.com
On Fri, 17 Oct 2014 08:48:29 -0700 (PDT)
Dan Sheridan <dan.sh...@postman.org.uk> wrote:

> Yes, this is something that's been bothering me too, so I've added
> something very simple to match against states and ignore them
> (https://gist.github.com/djs52/e0f753e0d54f8e1890c9). It's hardcoded,
> but would probably be better as a configuration file or something.

Hi Dan,

thanks a lot, I'll take a closer look at it later.

3 minutes sounds reasonable, too. I think my check interval will be
once every 12 hours.

Regards.