Calculating Cluster uptime % for two node cluster

Shubham Shrivastav

Jul 29, 2022, 11:21:09 PM

to Prometheus Users

Hey guys,

I have custom metrics enabled for the individual nodes of a cluster.

# HELP platform_uptime_state Overall platform status is 1 when up, 0 otherwise
# TYPE platform_uptime_state gauge
platform_uptime_state{platform_version="6.4", node_id="101", cluster_id="1" } 1
platform_uptime_state{platform_version="6.4", node_id="102", cluster_id="1" } 0

Each cluster has two nodes.

Using ranged vector I could derive something like this to calculate uptime for an individual node.

sum_over_time((platform_uptime_state{node_id ="101"})[1h:15s]) / count_over_time((platform_uptime_state{node_id ="101"})[1h:15s])

But here's the formula, I'm trying to implement:

count of time series when the cluster has at least 1 node up over 1d
(eg. sum by (cluster_id) (platform_uptime_state) == 0)
/
count of total cluster ts over 1d

But this doesn't work,
Is there a better way to do this?

TIA,
Shubham

Brian Candler

Jul 30, 2022, 3:37:21 AM

to Prometheus Users

On Saturday, 30 July 2022 at 04:21:09 UTC+1 shrivasta...@gmail.com wrote:

> Using ranged vector I could derive something like this to calculate uptime for an individual node.
>
> sum_over_time((platform_uptime_state{node_id ="101"})[1h:15s]) / count_over_time((platform_uptime_state{node_id ="101"})[1h:15s])

Aside: avg_over_time is simpler.

Also in this particular case, there's no need for a subquery. A native range vector will work:

avg_over_time(platform_uptime_state{node_id ="101"}[1h])

There is a subtle difference: this doesn't resample the metric at 15 second intervals, but just takes all the existing data points in the timeseries over that range, with whatever timestamps they were recorded at.

But you would need a subquery if the expression is any more complex than just a plain metric, as you'll see shortly.

> But here's the formula, I'm trying to implement:
>
> count of time series when the cluster has at least 1 node up over 1d
> (eg. sum by (cluster_id) (platform_uptime_state) == 0)
> /
> count of total cluster ts over 1d
>
> But this doesn't work,

In what way "doesn't it work"? What output do you get? What happens if you try graphing the numerator and denominator separately? Or is the problem you don't know what to put for "count of total cluster ts over 1d" ?

I suggest you build the query up in stages, testing the query so far in the Prometheus web interface at each stage. If you're trying to detect when the cluster has *at least* one node up, then I'd start with a query like this:

max(platform_uptime_state)

That gives a single value across all nodes. If you have multiple clusters, then it would be:

max by (cluster_id) (platform_uptime_state)

Graph that. Check that it gives one value per cluster, and that the value is 0 when all nodes in that cluster are down and 1 when at least 1 node is up.

Once you're happy with that, try a subquery to evaluate this expression multiple times over the previous hour:

(max by (cluster_id) (platform_uptime_state))[1h:15s]

Does that work? (Note: the result is a range vector and the "graph" view in the web interface can't show this, but the "table" view will show the data points)

Now try adding up those points over the hour:

sum_over_time((max by (cluster_id) (platform_uptime_state))[1h:15s])

Is that correct? The result is an instant vector so you should be able to graph this one. At each point in the graph, it shows the result for the time from T-1h to T.

Now you know that 3600/15 = 240, so you could divide by 240, but it's simpler to change to

avg_over_time((max by (cluster_id) (platform_uptime_state))[1h:15s])

If that doesn't produce what you're looking for, then you can still follow the same logical process to end up with an expression which does.
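
To get from there to the daily percentage asked about in the original question, one possible sketch is to widen the subquery range to 1d and scale by 100. This is untested against your data, and it assumes the 15s resolution is still what you want over a full day. Note that steps where no node in the cluster has a sample are simply left out of the average rather than counted as down.

```promql
# Percentage of the last day in which each cluster had at least one node up.
# The inner max is 1 whenever any node in the cluster reports 1.
100 * avg_over_time((max by (cluster_id) (platform_uptime_state))[1d:15s])

# The same ratio written out as "up samples / total samples",
# mirroring the shape of the formula in the question.
100 *
  sum_over_time((max by (cluster_id) (platform_uptime_state))[1d:15s])
/
  count_over_time((max by (cluster_id) (platform_uptime_state))[1d:15s])
```

These are two separate queries and should return the same value per cluster_id; the first is just shorter. As before, test each one in the web interface before relying on it.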
