How to monitor Consul health with Nagios?


Cameron Davison

Oct 30, 2014, 11:17:59 AM
to consu...@googlegroups.com
Does anyone know of any Nagios plugins for monitoring a Consul cluster? Is there a recommended API for checking the health of a Consul cluster? I know that there are the Serf health statuses, but I am not even sure what I would consider healthy or unhealthy. For a cluster with 3 servers and many clients, I think it would be nice to treat 1 server node being down as minor and 2 or more as an outage. For clients, maybe the cluster should be considered unhealthy if more than half of them are unreachable?

Cameron

Andrew Watson

Oct 30, 2014, 11:45:26 AM
to Cameron Davison, consu...@googlegroups.com
I'd think that if you ask multiple nodes for the member list from their own perspective and the lists differ, then you should trigger warnings.
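
A rough sketch of that comparison in Python, assuming each agent is asked for its view through the `/v1/agent/members` HTTP endpoint and that members are compared by their `Name` field. The agent URLs and the helper names `member_names` and `missing_members` are hypothetical, chosen only for illustration.

```python
import json
import urllib.request


def member_names(agent_url, timeout=5):
    """Return the set of member names one agent currently knows about."""
    url = agent_url + "/v1/agent/members"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return {m["Name"] for m in json.loads(resp.read().decode("utf-8"))}


def missing_members(views):
    """Given {agent: set_of_names}, return {agent: names_it_does_not_see}.

    The reference list is the union of all views; agents whose view
    matches the union are left out of the result.
    """
    everyone = set().union(*views.values())
    return {
        agent: everyone - seen
        for agent, seen in views.items()
        if everyone - seen
    }
```

A check could build `views` by calling `member_names` for each agent (for example `http://10.0.0.1:8500`, an assumed address) and raise a warning whenever `missing_members(views)` is not empty.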


Armon Dadgar

Oct 30, 2014, 2:30:37 PM
to Cameron Davison, Andrew Watson, consu...@googlegroups.com
Hey Cameron,

Monitoring the health of a cluster is rather tricky. I think that a good first step is to check
that the /v1/status/leader endpoint is returning a node. If it is not, then your cluster is in an
outage situation and that should be critical.

After that, you can check for failed servers by using the /v1/health/service/consul endpoint
to query all the known Consul servers. Then you can use another request to /v1/health/service/consul?passing=1
to get only the ones passing health checks.

Comparing the two results lets you determine how many servers have failed, and warn / fail as appropriate.
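
A minimal sketch of those checks as a Nagios-style plugin in Python. It assumes a local agent at `http://127.0.0.1:8500` (Consul's default HTTP port) and uses the standard Nagios exit codes. The threshold that turns a warning into a critical (fewer passing servers than a majority) is an assumption chosen to match the "1 down is minor, 2 or more is an outage" idea for a 3-server cluster, not something the API prescribes. The function names are illustrative only.

```python
import json
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: a local agent listening on Consul's default HTTP port.
CONSUL_URL = "http://127.0.0.1:8500"


def fetch_json(path, base_url=CONSUL_URL, timeout=5):
    """GET a Consul HTTP API path and decode the JSON body."""
    with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def evaluate(leader, all_servers, passing_servers):
    """Map the three API responses to a (status, message) pair."""
    if not leader:
        return CRITICAL, "CRITICAL: no leader elected"
    total = len(all_servers)
    passing = len(passing_servers)
    failed = total - passing
    if failed <= 0:
        return OK, "OK: %d/%d servers passing" % (passing, total)
    # Assumption: fewer passing servers than a majority is an outage.
    if passing < total // 2 + 1:
        return CRITICAL, "CRITICAL: %d/%d servers failed" % (failed, total)
    return WARNING, "WARNING: %d/%d servers failed" % (failed, total)


def main():
    try:
        leader = fetch_json("/v1/status/leader")
        # The health queries are not needed once we know there is no leader.
        all_servers = fetch_json("/v1/health/service/consul") if leader else []
        passing_servers = (
            fetch_json("/v1/health/service/consul?passing=1") if leader else []
        )
    except (OSError, ValueError) as exc:
        # Network errors and undecodable responses are reported as UNKNOWN.
        print("UNKNOWN: cannot query Consul: %s" % exc)
        return UNKNOWN
    status, message = evaluate(leader, all_servers, passing_servers)
    print(message)
    return status
```

To run it as a plugin, end the file with `import sys; sys.exit(main())` so that Nagios sees the status as the process exit code. Keeping `evaluate` separate from the HTTP calls makes the thresholds easy to adjust and to test without a running cluster.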

Hope that helps!

Best Regards,
Armon Dadgar