PromQL: multiple queries with dependent values


marc koser

Oct 13, 2022, 7:03:24 AM
to Prometheus Users
Is it possible to have one side of a query limit the results of another part of the same query?

For example, I want an alert query for when the count of all instances in a clustered service is not equal to what the instance is reporting, matched on a label whose value is equal on both sides of the query (in this case, there are many nodes that run the same service but are part of different clusters):

redis_cluster_known_nodes != scalar(count(up{service=~"redis-exporter"}))

The shared label value would be something like group="cluster-a", and the query should not evaluate metrics where group="cluster-b".

Thanks in advance!

Brian Candler

Oct 13, 2022, 9:13:42 AM
to Prometheus Users
> Is it possible to have one side of a query limit the results of another part of the same query?

Yes, but it depends on exactly what you mean. In particular, it depends on whether you can construct vectors for the LHS and RHS which have corresponding labels.

If you can give some specific examples of the metrics themselves - including all their labels - then we can see whether it's possible to do what you want in PromQL.  Right now the requirements are unclear.


> redis_cluster_known_nodes != scalar(count(up{service=~"redis-exporter"}))
> The shared label value would be something like, group="cluster-a" and should not evaluate metrics where group="cluster-b"

You need to arrange for both LHS and RHS to have some corresponding labels before you can combine them with any operator such as !=.  The RHS has no "group" label at the moment; in fact, it's not even a vector. But you could do:

    count by (group) (up{service="redis-exporter"})

Then, assuming that redis_cluster_known_nodes also has a "group" label, you can do:

    redis_cluster_known_nodes != on (group) count by (group) (up{service="redis-exporter"})

This will work as long as the LHS and RHS both have exactly *one* metric for a given value of the "group" label.

If the LHS has N values of "group" for 1 on the RHS, or vice versa, then you can use N:1 matching as described in the documentation ("group left" or "group right").

If there are multiple matches on both LHS and RHS for the same value of group, then the query will fail.  You will have to include some more labels in the on(...) list to get a unique match.
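To make the matching rules above concrete, here is a rough Python sketch of what `!= on (group)` does (illustrative made-up data, not Prometheus internals). It only enforces uniqueness on the RHS side; real 1:1 matching also requires unique keys on the LHS.

```python
# Sketch of `lhs != on (group) rhs`: pair each LHS sample with the single
# RHS sample sharing the same "group" value, and keep LHS samples whose
# value differs. Samples are modeled as (labels_dict, value) pairs.

def match_on_ne(lhs, rhs, on=("group",)):
    def key(labels):
        return tuple(labels.get(l) for l in on)

    rhs_by_key = {}
    for labels, value in rhs:
        k = key(labels)
        if k in rhs_by_key:
            # PromQL rejects many-to-many matches; mirror that here.
            raise ValueError(f"multiple RHS samples for {k}")
        rhs_by_key[k] = value

    out = []
    for labels, value in lhs:
        k = key(labels)
        if k in rhs_by_key and value != rhs_by_key[k]:
            out.append((labels, value))
    return out

lhs = [({"group": "cluster-a", "instance": "node-1"}, 10),
       ({"group": "cluster-b", "instance": "node-2"}, 16)]
rhs = [({"group": "cluster-a"}, 10),
       ({"group": "cluster-b"}, 10)]
print(match_on_ne(lhs, rhs))  # only the cluster-b sample differs from its match
```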

Brian Candler

Oct 13, 2022, 9:17:55 AM
to Prometheus Users
Sorry, second to last sentence was unclear.  What I meant was:

If the LHS vector contains N metrics with a particular value of the "group" label, which correspond to exactly 1 metric on the RHS with the matching label value, or vice versa, then you can use N:1 matching.

marc koser

Oct 17, 2022, 4:12:49 PM
to Prometheus Users
Thanks for the pointer Brian.

Following your suggestion, I updated my query to include `service` rather than `job` to cover the different values (representing each redis service on each `instance`), but I'm still not getting the results I expect:

query: 
redis_cluster_known_nodes != on (instance, service, group) count by (instance, service, group) (up{service=~"exporter-redis-.*"})

result:
{group="group-a", instance="node-1", service="exporter-redis-6379"} 10
{group="group-a", instance="node-1", service="exporter-redis-6380"} 10
{group="group-a", instance="node-2", service="exporter-redis-6379"} 11
{group="group-a", instance="node-2", service="exporter-redis-6380"} 16
{group="group-a", instance="node-3", service="exporter-redis-6379"} 16
{group="group-a", instance="node-3", service="exporter-redis-6380"} 16
{group="group-a", instance="node-4", service="exporter-redis-6379"} 16
{group="group-a", instance="node-4", service="exporter-redis-6380"} 16
{group="group-a", instance="node-5", service="exporter-redis-6379"} 16
{group="group-a", instance="node-5", service="exporter-redis-6380"} 16

I would expect only those whose count is != 10 to be included in the result.


Here's a metric sample of those used in the query:
``` 
up{group="group-a", instance="node-1", job="redis-cluster", service="exporter-redis-6379", team="sre"} 1
up{group="group-a", instance="node-1", job="redis-cluster", service="exporter-redis-6380", team="sre"} 1
up{group="group-a", instance="node-2", job="redis-cluster", service="exporter-redis-6379"} 1
up{group="group-a", instance="node-2", job="redis-cluster", service="exporter-redis-6380"} 1
up{group="group-a", instance="node-3", job="redis-cluster", service="exporter-redis-6379"} 1
up{group="group-a", instance="node-3", job="redis-cluster", service="exporter-redis-6380"} 1
up{group="group-a", instance="node-4", job="redis-cluster", service="exporter-redis-6379"} 1
up{group="group-a", instance="node-4", job="redis-cluster", service="exporter-redis-6380"} 1
up{group="group-a", instance="node-5", job="redis-cluster", service="exporter-redis-6379"} 1
up{group="group-a", instance="node-5", job="redis-cluster", service="exporter-redis-6380"} 1

redis_cluster_known_nodes{group="group-a", instance="node-1", job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
redis_cluster_known_nodes{group="group-a", instance="node-1", job="redis-cluster", service="exporter-redis-6380", team="sre"} 10
redis_cluster_known_nodes{group="group-a", instance="node-2", job="redis-cluster", service="exporter-redis-6379"} 11
redis_cluster_known_nodes{group="group-a", instance="node-2", job="redis-cluster", service="exporter-redis-6380"} 16
redis_cluster_known_nodes{group="group-a", instance="node-3", job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-3", job="redis-cluster", service="exporter-redis-6380"} 16
redis_cluster_known_nodes{group="group-a", instance="node-4", job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-4", job="redis-cluster", service="exporter-redis-6380"} 16
redis_cluster_known_nodes{group="group-a", instance="node-5", job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-5", job="redis-cluster", service="exporter-redis-6380"} 16
```

Brian Candler

Oct 18, 2022, 3:34:52 AM
to Prometheus Users
If you run the two halves of the query separately:

    redis_cluster_known_nodes

and

    count by (instance, service, group) (up{service=~"exporter-redis-.*"})

then I think the reason will become clear.

If that set of "up" metrics is complete, then I'd expect the "count by" results for node-1 to be

{group="group-a",instance="node-1",service="exporter-redis-6379"} 1
{group="group-a",instance="node-1",service="exporter-redis-6380"} 1

and these values (of 1) are clearly different to

redis_cluster_known_nodes{group="group-a", instance="node-1", job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
redis_cluster_known_nodes{group="group-a", instance="node-1", job="redis-cluster", service="exporter-redis-6380", team="sre"} 10

Aside: the "count by" seems superfluous here, since every "up" metric has a distinct combination of (instance,service,group).  I guess it ensures that up values of 0 are turned into 1.

Without knowing more about what you're trying to do and what these metrics represent, I can't really help.  A value of redis_cluster_known_nodes of 10 suggests there are 10 "nodes" of some sort, whatever they are.  But the "up" metric will only be 1 or 0 (success or fail on scrape).  If you had a separate scrape target for each node then you could count or sum these to get the number of nodes, but the list of "up" metrics you showed suggests there's only one scrape job for each instance+service combination.

So really it boils down to, what's a "node" and how do you count them?  Is a single "node" a whole cluster, or is a cluster a collection of nodes?

In particular, what do these metrics mean?

redis_cluster_known_nodes{group="group-a", instance="node-1", job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
redis_cluster_known_nodes{group="group-a", instance="node-2", job="redis-cluster", service="exporter-redis-6379"} 11
redis_cluster_known_nodes{group="group-a", instance="node-3", job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-4", job="redis-cluster", service="exporter-redis-6379"} 16
redis_cluster_known_nodes{group="group-a", instance="node-5", job="redis-cluster", service="exporter-redis-6379"} 16

They are all the same "service", but how come instance "node-1" contains or sees 10 "nodes", but instance "node-2" contains or sees 11 "nodes", and the other instances contain or see 16 "nodes"?  Perhaps this inconsistency is the error you're trying to detect - in which case, what do you think is the correct number of nodes?

Let's say 16 is the correct answer for group="group-a" and service="exporter-redis-6379".  Perhaps you didn't show the full set of "up" metrics.  In which case, I'd first try to build an "up" query which gives the expected answer 16 on the right-hand side.  Maybe something like this:

    count by (service, group) (up{service=~"exporter-redis-.*"})

What does that expression show?

When you have that part working, then we can work on matching the LHS.  Since each *instance* seems to have its own distinct idea of the total number of nodes, then I expect this requires an N:1 match on (group,service).  That is, there is 1 "should be" value for a given (service,group) on the RHS, and multiple nodes each with their own count of (service,group) on the LHS.

If that's the case, it might end up something like this:

    redis_cluster_known_nodes != on (service, group) group left() count by (service, group) (up{service=~"exporter-redis-.*"})

but at this point I'm just speculating.

Brian Candler

Oct 18, 2022, 3:36:49 AM
to Prometheus Users
Sorry, I missed an underscore there.

   redis_cluster_known_nodes != on (service, group) group_left() count by (service, group) (up{service=~"exporter-redis-.*"})
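To show what `group_left()` changes, a rough Python sketch of N:1 matching on (service, group) (made-up data, not Prometheus internals; the expected count of 10 is an assumption for illustration):

```python
# Sketch of `!= on (service, group) group_left()`: each LHS sample is
# compared against the single RHS sample sharing its (service, group)
# labels, and many LHS samples may match one RHS sample.

def match_group_left_ne(lhs, rhs, on=("service", "group")):
    def key(labels):
        return tuple(labels[l] for l in on)
    rhs_by_key = {key(labels): value for labels, value in rhs}
    return [(labels, value) for labels, value in lhs
            if key(labels) in rhs_by_key and value != rhs_by_key[key(labels)]]

# Five instances each report their own idea of the node count; the RHS
# carries one "should be" value per (service, group).
lhs = [({"service": "exporter-redis-6379", "group": "group-a",
         "instance": f"node-{i}"}, v)
       for i, v in enumerate([10, 11, 16, 16, 16], start=1)]
rhs = [({"service": "exporter-redis-6379", "group": "group-a"}, 10)]
bad = match_group_left_ne(lhs, rhs)
print(len(bad))  # the instances that disagree with the expected count
```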

marc koser

Oct 18, 2022, 7:27:10 AM
to Prometheus Users
> So really it boils down to, what's a "node" and how do you count them?  Is a single "node" a whole cluster, or is a cluster a collection of nodes?

A node is a redis service that is part of a cluster (identified by the `group` label), so a cluster is a collection of nodes. The total number of nodes is determinate and, under normal circumstances, a static value. But since a redis 'node' is never forgotten unless told to, I want to alert on this case, since it can skew the interpolation of other metrics.


> In particular, what do these metrics mean?
>
> redis_cluster_known_nodes{group="group-a", instance="node-1", job="redis-cluster", service="exporter-redis-6379", team="sre"} 10
> redis_cluster_known_nodes{group="group-a", instance="node-2", job="redis-cluster", service="exporter-redis-6379"} 11
> redis_cluster_known_nodes{group="group-a", instance="node-3", job="redis-cluster", service="exporter-redis-6379"} 16
> redis_cluster_known_nodes{group="group-a", instance="node-4", job="redis-cluster", service="exporter-redis-6379"} 16
> redis_cluster_known_nodes{group="group-a", instance="node-5", job="redis-cluster", service="exporter-redis-6379"} 16


This represents the state of all known redis nodes belonging to a single cluster relative to a running node.


> They are all the same "service", but how come instance "node-1" contains or sees 10 "nodes", but instance "node-2" contains or sees 11 "nodes", and the other instances contain or see 16 "nodes"?  Perhaps this inconsistency is the error you're trying to detect - in which case, what do you think is the correct number of nodes?

This is indeed the scenario I'm attempting to query for. In this case, when a node is joined to the cluster but is unreachable for any reason (e.g. redis is uninstalled / re-installed and the node rejoins the cluster), the node's ID changes (the new ID is valid and reachable; the old ID is no longer valid and is unreachable).

The correct value is 10: 5 `instance`s x 2 `service`s.

> Let's say 16 is the correct answer for group="group-a" and service="exporter-redis-6379".  Perhaps you didn't show the full set of "up" metrics.  In which case, I'd first try to build an "up" query which gives the expected answer 16 on the right-hand side.  Maybe something like this:
>
>     count by (service, group) (up{service=~"exporter-redis-.*"})
>
> What does that expression show?


{group="group-a", service="exporter-redis-6379"} 5
{group="group-a", service="exporter-redis-6380"} 5


> When you have that part working, then we can work on matching the LHS.  Since each *instance* seems to have its own distinct idea of the total number of nodes, then I expect this requires an N:1 match on (group,service).  That is, there is 1 "should be" value for a given (service,group) on the RHS, and multiple nodes each with their own count of (service,group) on the LHS.

That sounds accurate.


> If that's the case, it might end up something like this:
>
>     redis_cluster_known_nodes != on (service, group) group left() count by (service, group) (up{service=~"exporter-redis-.*"})
>
> but at this point I'm just speculating.


This gives the same result as before.

I'll keep plugging away at this to see what I can come up with.

marc koser

Oct 18, 2022, 3:12:27 PM
to Prometheus Users
Perhaps an easier option would be to compare redis_cluster_known_nodes against what it was n-time_interval_ago:
redis_cluster_known_nodes != avg_over_time (redis_cluster_known_nodes[1d:4h])

It's less than ideal, since it's not using a static, expected value of total cluster nodes, and it would also fire when the node count settles back to the expected value, but I can deal with that for now.

Thanks for your help! 

Brian Candler

Oct 19, 2022, 4:24:41 AM
to Prometheus Users
Or even:
redis_cluster_known_nodes != redis_cluster_known_nodes offset 5m

marc koser

Oct 26, 2022, 6:46:19 AM
to Prometheus Users
To close the loop on this, I was able to get this working using this query:

max by (group) (redis_cluster_known_nodes) != count by (group) (up{service=~"exporter-redis-.*"})
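The logic of that final query can be sketched in Python (illustrative data only, not Prometheus internals): both sides are aggregated down to the `group` label before comparing, so no per-sample label matching is needed.

```python
# Sketch of the final query: max by (group) collapses the per-instance
# node counts to one value per group on the LHS, count by (group) counts
# the scrape targets per group on the RHS, and mismatches would alert.

from collections import defaultdict

def max_by_group(samples):
    agg = {}
    for labels, value in samples:
        g = labels["group"]
        agg[g] = max(agg.get(g, value), value)
    return agg

def count_by_group(samples):
    agg = defaultdict(int)
    for labels, _ in samples:
        agg[labels["group"]] += 1
    return dict(agg)

# One service's node counts, plus 5 instances x 2 services of up{} targets.
known = [({"group": "group-a", "instance": f"node-{i}"}, v)
         for i, v in enumerate([10, 11, 16, 16, 16], start=1)]
up = [({"group": "group-a", "instance": f"node-{i}", "service": s}, 1)
      for i in range(1, 6)
      for s in ("exporter-redis-6379", "exporter-redis-6380")]

lhs, rhs = max_by_group(known), count_by_group(up)
mismatched = {g: lhs[g] for g in lhs if g in rhs and lhs[g] != rhs[g]}
print(mismatched)  # a stale node ID pushes max known above the expected 10
```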

Thanks for your insight on this Brian.