Consul Fails to Query Service Health - consul_up is down ~40% of time


Nikash Narula

Dec 22, 2022, 5:38:35 AM
to Prometheus Users
We have an alert that fires when the average of consul_up over the past 5 minutes drops below 90%: (avg_over_time(consul_up{job="consul-exporter"}[5m]) * 100) < 90.
consul_up fluctuates continuously between 1 (up) and 0 (down), so the alert fires often even though all Consul health checks are passing (we had a Consul support engineer verify this).
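To show how few failed probes it takes to trip this expression, here is a minimal Python sketch of the same arithmetic. It assumes a 15-second scrape interval (not stated in this thread), which gives 20 samples per 5-minute window; the function names are illustrative only.

```python
def consul_up_percent(samples):
    """Mimic avg_over_time(consul_up[5m]) * 100 for a list of 0/1 samples."""
    if not samples:
        raise ValueError("no samples in window")
    return 100 * sum(samples) / len(samples)


def alert_fires(samples, threshold=90):
    """True when the windowed percentage is strictly below the threshold."""
    return consul_up_percent(samples) < threshold


# Assumption: 15 s scrape interval, so 20 samples in a 5 m window.
window = [1] * 17 + [0] * 3        # three failed probes out of twenty
print(consul_up_percent(window))   # 85.0
print(alert_fires(window))         # True
```

Under that assumption, three failed probes in any 5-minute window are enough to fire the alert, while two (exactly 90%) are not.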
Environment (Prod):
  • System information: Linux 5.8.0-1041-aws x86_64

  • consul_exporter version: 0.7.1

  • Consul version: Consul v1.8.0

  • Prometheus version: prometheus, version 2.28.1 (branch: HEAD, revision: b0944590a1c9a6b35dc5a696869f75f422b107a1)

consul_uptime_graph.png
consul_uptime_value.png
prometheus_config.txt

Ben Kochie

Dec 22, 2022, 5:44:46 AM
to Nikash Narula, Prometheus Users
This means your consul_exporter is having problems talking to your consul server. Only you can debug this communication problem.


Brian Candler

Dec 22, 2022, 6:57:22 AM
to Prometheus Users
To expand on that slightly:
- if the "up" metric were fluctuating between 0 and 1, then that would be a problem with prometheus talking to consul_exporter
- but what you're seeing is the "consul_up" metric fluctuating between 0 and 1.  "consul_up" is a metric returned by consul_exporter.  This means that consul_exporter is working just fine, and prometheus is also able to talk to it just fine; but the testing done by consul_exporter is giving 0 and 1 results.
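
The distinction between the two metrics can be written down as a small decision function. This is only a sketch of the reasoning above; the function name and return strings are made up for illustration.

```python
def locate_fault(up, consul_up):
    """Map the two metric values to the link that most likely needs debugging.

    up        -- Prometheus' own scrape-health metric for the exporter target
    consul_up -- metric reported by consul_exporter about its Consul query
    """
    if up == 0:
        # The scrape itself failed, so consul_up was not collected this time.
        return "prometheus -> consul_exporter"
    if consul_up == 0:
        # The scrape worked, but the exporter could not query Consul.
        return "consul_exporter -> consul"
    return "no fault indicated"


print(locate_fault(up=1, consul_up=0))  # consul_exporter -> consul
```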

So you need to focus your attention on consul_exporter itself, and how it communicates with the target consul cluster.

Looking at logs from the consul_exporter container would be a good start.  Looking at tcpdump of traffic between the consul_exporter container and the consul cluster it's talking to would be another thing to try.
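
Alongside logs and tcpdump, one rough way to reproduce the failures is to send repeated HTTP requests to Consul from the exporter's host and record failures and latency. The sketch below is an assumption-laden example, not what consul_exporter does internally: the address http://127.0.0.1:8500 is a placeholder, /v1/status/leader is used simply because it is a cheap Consul status endpoint, and the 0.5 s timeout is an arbitrary value that you should set to whatever timeout your exporter is configured with.

```python
import http.client
import time
import urllib.request


def probe_once(url, timeout):
    """Return (ok, seconds) for a single HTTP GET against the Consul API."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            ok = resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        ok = False
    return ok, time.monotonic() - start


def run_probes(url, count, interval, timeout, probe=probe_once):
    """Call probe `count` times, sleeping `interval` seconds after each call."""
    results = []
    for _ in range(count):
        results.append(probe(url, timeout))
        time.sleep(interval)
    return results


def summarize(results):
    """Summarize a list of (ok, seconds) probe results."""
    failures = sum(1 for ok, _ in results if not ok)
    slowest = max((secs for _, secs in results), default=0.0)
    return {"probes": len(results), "failures": failures, "slowest_s": slowest}


# Example (assumed address; run this on the exporter's host):
# results = run_probes("http://127.0.0.1:8500/v1/status/leader",
#                      count=60, interval=1, timeout=0.5)
# print(summarize(results))
```

If the failure count here is non-zero while Consul itself reports healthy, that points at the network path or request latency rather than at Consul's health checks.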

We don't know how consul_exporter is locating the consul cluster, but if it's using DNS, then that's something else to check too.
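
If DNS is involved, a quick check is to resolve the Consul hostname many times and see whether lookups fail intermittently or return a changing set of addresses. A minimal sketch, assuming a placeholder hostname (the real one is whatever your exporter is configured with):

```python
import socket


def resolve_many(hostname, attempts=20, resolver=socket.getaddrinfo):
    """Resolve hostname repeatedly; return (addresses seen, failure count)."""
    addresses = set()
    failures = 0
    for _ in range(attempts):
        try:
            infos = resolver(hostname, None)
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            failures += 1
            continue
        for info in infos:
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address string.
            addresses.add(info[4][0])
    return addresses, failures


# Example with a placeholder name:
# print(resolve_many("consul.example.internal"))
```

Any failures, or addresses that do not belong to a live Consul server, would explain intermittent consul_up = 0 results.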