consul_exporter - Expose health statuses as values?

Matt Russi

Aug 16, 2021, 4:31:15 PM
to Prometheus Developers
Currently, the consul_exporter exposes 4 series per health_node and health_service status check, each with a label indicating the status (maintenance, warning, critical, or passing). In larger environments, this creates quite a few extra series.

As somewhat of a precedent, the status is already being mapped to a value for the consul_serf_lan_member_status metric (as Consul's API provides this mapping).
# HELP consul_serf_lan_member_status Status of member in the cluster. 1=Alive, 2=Leaving, 3=Left, 4=Failed.

I wanted to get some thoughts around this before pursuing a PR.

In my example, I used -2=maintenance, -1=warning, 0=critical, and 1=passing to fall in line with the Prometheus paradigm of up=0 (down) and up=1 (up). Since we have two additional values, the negative numbers play more nicely when trying to do a value mapping in Grafana. Not married to the values themselves though. :) 
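To make the mapping concrete, here is a minimal Go sketch of the value assignment described above. The function name `statusValue` and the handling of unrecognized statuses are assumptions for illustration only; they are not taken from the consul_exporter code base.

```go
package main

import "fmt"

// statusValue maps a Consul health check status string to the numeric value
// proposed in this thread. The name and the (value, ok) return shape are
// hypothetical; an actual PR might structure this differently.
func statusValue(status string) (float64, bool) {
	switch status {
	case "maintenance":
		return -2, true
	case "warning":
		return -1, true
	case "critical":
		return 0, true
	case "passing":
		return 1, true
	default:
		// Unknown status: report it as not mappable rather than guessing.
		return 0, false
	}
}

func main() {
	for _, s := range []string{"maintenance", "warning", "critical", "passing"} {
		v, _ := statusValue(s)
		fmt.Printf("%s=%g\n", s, v)
	}
}
```

Returning a second `ok` value keeps an unexpected status from being silently reported as 0 (critical).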

Present Example:
consul_health_node_status{check="serfHealth",node="example_node",status="critical"} 0
consul_health_node_status{check="serfHealth",node="example_node",status="maintenance"} 0
consul_health_node_status{check="serfHealth",node="example_node",status="passing"} 1
consul_health_node_status{check="serfHealth",node="example_node",status="warning"} 0

consul_health_service_status{check="service:10.0.0.1_443",node="example_node",service_id="10.0.0.1_443",service_name="auth_service",status="critical"} 0
consul_health_service_status{check="service:10.0.0.1_443",node="example_node",service_id="10.0.0.1_443",service_name="auth_service",status="maintenance"} 0
consul_health_service_status{check="service:10.0.0.1_443",node="example_node",service_id="10.0.0.1_443",service_name="auth_service",status="passing"} 1
consul_health_service_status{check="service:10.0.0.1_443",node="example_node",service_id="10.0.0.1_443",service_name="auth_service",status="warning"} 0

Proposed Example:
# HELP consul_health_node_status Status of health checks associated with a node. -2=maintenance, -1=warning, 0=critical, 1=passing
consul_health_node_status{check="serfHealth",node="example_node"} 1

# HELP consul_health_service_status Status of health checks associated with a service. -2=maintenance, -1=warning, 0=critical, 1=passing 
consul_health_service_status{check="service:10.0.0.1_443",node="example_node",service_id="10.0.0.1_443",service_name="auth_service"} 1

Matthias Rampke

Aug 17, 2021, 5:01:27 AM
to Matt Russi, Prometheus Developers
What would some common queries be that this affects, and how would they look in the future? For example, "what fraction of nodes is down" or "which nodes have multiple services down?"

/MR

Matt Russi

Sep 2, 2021, 8:42:38 PM
to Prometheus Developers
The query syntax would not be drastically impacted (though I understand there is still a change). The main benefit would be a 75% reduction in the number of series generated by these metrics (one series per check instead of four), which means less to store and compute.
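As a rough illustration of where the 75% figure comes from, the sketch below counts series before and after the change. The helper name `seriesReduction` and the check count of 2500 are made-up values for the example, not measurements from a real environment.

```go
package main

import "fmt"

// statusesPerCheck is the number of status label values the exporter
// currently emits per check (maintenance, warning, critical, passing).
const statusesPerCheck = 4

// seriesReduction returns the series count with one series per status label
// (current), with one series per check (proposed), and the fractional
// reduction between the two. The function itself is hypothetical.
func seriesReduction(checks int) (current, proposed int, reduction float64) {
	current = checks * statusesPerCheck
	proposed = checks
	if current == 0 {
		return current, proposed, 0
	}
	reduction = 1 - float64(proposed)/float64(current)
	return current, proposed, reduction
}

func main() {
	// 2500 checks is an arbitrary example figure.
	current, proposed, reduction := seriesReduction(2500)
	fmt.Printf("current=%d proposed=%d reduction=%.0f%%\n", current, proposed, reduction*100)
}
```

The reduction does not depend on the number of checks: any non-zero count gives the same 75%.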

> "what fraction of nodes is down"

Current:
    count (consul_health_node_status{status="critical"} == 1)
    /
    count (consul_health_node_status{status="passing"})

Proposed:
    count (consul_health_node_status == 0)
    /
    count (consul_health_node_status)

> "which nodes have multiple services down?"

Current:
    count by (node) (consul_health_service_status{status="critical"} == 1) > 1

Proposed:
    count by (node) (consul_health_service_status == 0) > 1

> What service checks are critical?

Current:
    consul_health_service_status{status="critical"} == 1

Proposed:
    consul_health_service_status == 0

Matt Russi

Oct 4, 2021, 5:26:05 PM
to Prometheus Developers
Bump. :) 
