Joining two metrics in Prometheus and having all labels from both


Nick Carlton

Mar 6, 2023, 4:33:24 PM
to Prometheus Users
Hello Everyone!

I am polling F5 Pool and Node information from the snmp exporter and am managing to get the below metrics:

`ltmPoolMbrStatusEnabledState{instance="ltm01", job="f5_ltm_test", ltmPoolMbrStatusEnabledState="1", ltmPoolMbrStatusNodeName="/Common/VPN1", ltmPoolMbrStatusPoolName="/Common/Pool1", ltmPoolMbrStatusPort="4500", prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-0"}`

`ltmPoolMbrStatusAvailState{instance="ltm01", job="f5_ltm_test", ltmPoolMbrStatusAvailState="2", ltmPoolMbrStatusNodeName="/Common/VPN1", ltmPoolMbrStatusPoolName="/Common/Pool1", ltmPoolMbrStatusPort="4500", prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-0"}`

However, I would like to join the metrics together to get something like:

`ltmPoolMbrStatusState{instance="ltm01", job="f5_ltm_test", ltmPoolMbrStatusEnabledState="1", ltmPoolMbrStatusAvailState="2", ltmPoolMbrStatusNodeName="/Common/VPN1", ltmPoolMbrStatusPoolName="/Common/Pool1", ltmPoolMbrStatusPort="4500", prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-0"}`

Noting that `ltmPoolMbrStatusEnabledState` and `ltmPoolMbrStatusAvailState` occur in the final metric.

Is there a way to do this within the scrape job relabelling bits or does this have to be done with a query?

If so please could someone assist with the query I would need to join the two metrics. The common labels are `instance, ltmPoolMbrStatusNodeName, ltmPoolMbrStatusPoolName and ltmPoolMbrStatusPort`.

I have tried doing a join but it only seems to bring in one of the `ltmPoolMbrStatusEnabledState` and `ltmPoolMbrStatusAvailState` metrics depending on whether it's a left or right join, but I would like both to appear.

Thanks in advance.
Nick

Brian Candler

Mar 7, 2023, 3:31:39 AM
to Prometheus Users
It has to be done in a query - the relabelling phase of a scrape job cannot see other metrics.

What you are looking for is a one-to-many (or many-to-one) vector match, which can pick up labels from the "one" side and apply them to the "many".


> I have tried doing a join but it only seems to bring in one of the `ltmPoolMbrStatusEnabledState` and `ltmPoolMbrStatusAvailState` metrics depending on whether it's a left or right join, but I would like both to appear.

Can you show what query you did, the raw metrics which it used, what result you got, and what you wanted to get instead?

Nick Carlton

Mar 7, 2023, 3:58:21 AM
to Prometheus Users
Thanks for your response.

If I do `ltmPoolMbrStatusAvailState * on(instance,ltmPoolMbrStatusNodeName,ltmPoolMbrStatusPoolName,ltmPoolMbrStatusPort) group_right ltmPoolMbrStatusEnabledState` I get:

`{instance="ltm01", job="f5_ltm_test", ltmPoolMbrStatusEnabledState="1", ltmPoolMbrStatusNodeName="/Common/VPN1", ltmPoolMbrStatusPoolName="/Common/Pool1", ltmPoolMbrStatusPort="4500", prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-0"}`

So this has `ltmPoolMbrStatusEnabledState` but not `ltmPoolMbrStatusAvailState`. If I swap the metrics round:

`ltmPoolMbrStatusEnabledState * on(instance,ltmPoolMbrStatusNodeName,ltmPoolMbrStatusPoolName,ltmPoolMbrStatusPort) group_right ltmPoolMbrStatusAvailState` I get:

`{instance="ltm01", job="f5_ltm_test", ltmPoolMbrStatusAvailState="1", ltmPoolMbrStatusNodeName="/Common/VPN1", ltmPoolMbrStatusPoolName="/Common/Pool1", ltmPoolMbrStatusPort="4500", prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-0"}`

This has `ltmPoolMbrStatusAvailState` but not `ltmPoolMbrStatusEnabledState`

So it seems I can get one or the other, not both. Unless I'm forming my PromQL wrong?

Brian Candler

Mar 7, 2023, 7:06:51 AM
to Prometheus Users
To pick up labels from the other side, you need to list them as part of your group_right. e.g. (untested)

ltmPoolMbrStatusAvailState * on(instance,ltmPoolMbrStatusNodeName,ltmPoolMbrStatusPoolName,ltmPoolMbrStatusPort) group_right(foo,bar,baz) ltmPoolMbrStatusEnabledState

will pick up labels foo,bar,baz from the left side.


Here's a tested example:

node_filesystem_avail_bytes * on (instance) group_left(machine,release,version) node_uname_info

It's a many-to-one, where the left side is "many" and the right side is "one", and the given labels from node_uname_info are added to the labels from node_filesystem_avail_bytes.

Nick Carlton

Mar 7, 2023, 7:43:22 AM
to Brian Candler, Prometheus Users
Ah ok,

I would have thought this was one-to-one because for each metric that exists in ltmPoolMbrStatusAvailState there is one exact match within ltmPoolMbrStatusEnabledState. Not multiple.

Unless I’m reading the definitions wrong.

Thanks

--
You received this message because you are subscribed to a topic in the Google Groups "Prometheus Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/prometheus-users/qlHK7r1hQVs/unsubscribe.
To unsubscribe from this group and all its topics, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/4c22a279-a51c-47db-993e-9107e54e883en%40googlegroups.com.

Brian Candler

Mar 7, 2023, 7:47:30 AM
to Prometheus Users
Well, "many" means "one or more", and therefore "one" is a valid case of "many" :-)

But yes, you'll have to treat it as a many-to-one in order to pick up the extra labels.

If you know for sure that metrics A and B are always matched 1:1, then it doesn't matter whether you use group_left or group_right.  But you still need to provide the list of labels in brackets in order to pick up the labels from the other side.

Brian Candler

Mar 7, 2023, 7:53:05 AM
to Prometheus Users
Something like:

ltmPoolMbrStatusAvailState * on(instance,ltmPoolMbrStatusNodeName,ltmPoolMbrStatusPoolName,ltmPoolMbrStatusPort)
group_left(ltmPoolMbrStatusEnabledState) ltmPoolMbrStatusEnabledState

(Slightly confusing that "ltmPoolMbrStatusEnabledState" is both a label name, and a metric name)
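
One thing to watch with this form: `*` multiplies the two sample values, so the value of the result is the product of the two metrics. If only the extra label is wanted and the left-hand value should be kept as it is, a common idiom is to add a zeroed copy of the "one" side instead. A minimal sketch, assuming the two metrics match 1:1 on the four labels named in this thread:

```promql
# Keep the value of ltmPoolMbrStatusAvailState and copy the
# ltmPoolMbrStatusEnabledState label across from the other metric.
# (0 * metric) keeps that metric's labels but forces its value to 0,
# so adding it leaves the left-hand value unchanged.
  ltmPoolMbrStatusAvailState
+ on (instance, ltmPoolMbrStatusNodeName, ltmPoolMbrStatusPoolName, ltmPoolMbrStatusPort)
  group_left (ltmPoolMbrStatusEnabledState)
  (0 * ltmPoolMbrStatusEnabledState)
```

If the right-hand sample is ever NaN or infinite, the sum becomes NaN as well, so this only suits metrics with ordinary finite values.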

Nick Carlton

Mar 7, 2023, 8:28:01 AM
to Prometheus Users
Thanks,

The reason the metrics have the same name as the label is that, in order to bring the value in from the SNMP exporter as a label, you have to set the type as DisplayString. From what I have read, doing a 'join' would not allow you to get the value of the metric in as a label value.

I managed to get it working with:

`ltmPoolMbrStatusAvailState * on(instance,ltmPoolMbrStatusNodeName,ltmPoolMbrStatusPoolName,ltmPoolMbrStatusPort) group_right(ltmPoolMbrStatusAvailState) ltmPoolMbrStatusEnabledState`

However, from doing that I have realised that what I am trying to achieve will require the metric to be pre-existing rather than using the Query to build the metric.

The eventual aim was to be able to report on F5 pools that do not have any nodes left in the pool able to service a request. The nodes can be disabled, unavailable, or both.

Initially I was using:

`count by (instance, ltmPoolMbrStatusPoolName) (ltmPoolMbrStatusAvailState{ltmPoolMbrStatusNodeName!~".*MAINT.*"})` - This gets me the total pool members minus any maint servers no matter the availability

and

`count by (instance, ltmPoolMbrStatusPoolName) (ltmPoolMbrStatusAvailState{ltmPoolMbrStatusNodeName!~".*MAINT.*"}!= 1)` - This gets me the total pool members that are NOT available. Not a value of 1. With 1 meaning available.

If I put these together like:


`count by (instance, ltmPoolMbrStatusPoolName) (ltmPoolMbrStatusAvailState{ltmPoolMbrStatusNodeName!~".*MAINT.*"}) - count by (instance, ltmPoolMbrStatusPoolName) (ltmPoolMbrStatusAvailState{ltmPoolMbrStatusNodeName!~".*MAINT.*"}!= 1) == 0`

And checked if the value equals 0, that would tell me that the pool has no members that are AVAILABLE. However, I would not know if any nodes in the pool had been user disabled, so I wouldn't be able to trust the metric. If the value was 1, I would not know if that 1 node was disabled for example.


What I was wanting to achieve by having the values for each in a single metric was so that I could do this:

1. `count by (instance, ltmPoolMbrStatusPoolName) (ltmPoolMbrStatusAvailState{ltmPoolMbrStatusNodeName!~".*MAINT.*"})` - This gets me the total pool members minus maint servers no matter the availability

and

2. `count by (instance, ltmPoolMbrStatusPoolName) (ltmPoolMbrStatusAvailState{ltmPoolMbrStatusNodeName!~".*MAINT.*",ltmPoolMbrStatusAvailState="1",ltmPoolMbrStatusEnabledState!="1"})` - This gets me the total pool members that ARE available AND are NOT enabled. Meaning they are non functional

and

3. `count by (instance, ltmPoolMbrStatusPoolName) (ltmPoolMbrStatusAvailState{ltmPoolMbrStatusNodeName!~".*MAINT.*",ltmPoolMbrStatusAvailState!="1",ltmPoolMbrStatusEnabledState="1"})` - This gets me the total pool members that are NOT available AND are enabled. Meaning they are non functional

and

4. `count by (instance, ltmPoolMbrStatusPoolName) (ltmPoolMbrStatusAvailState{ltmPoolMbrStatusNodeName!~".*MAINT.*",ltmPoolMbrStatusAvailState!="1",ltmPoolMbrStatusEnabledState!="1"})` - This gets me the total pool members that are NOT available AND are NOT enabled. Meaning they are non functional

Then I could put that together and do `1 - 2 - 3 - 4 == 0`. If the final result was 0 then I could be pretty certain that no nodes are available to handle requests.

My issue with the above is that I don't have the `ltmPoolMbrStatusAvailState` metric with both the `ltmPoolMbrStatusAvailState` and `ltmPoolMbrStatusEnabledState` label values. I'm not sure how I would manage to get the merged metric in this way.

The other issue is that queries 2, 3 and 4 will not return a 0 value if, for example on number 2, there are no members that ARE available AND are NOT enabled, so no value is returned. I'm not sure how that would be handled in the maths operation, as it would likely be "no data" rather than 0.

Hope that makes sense! It's quite the mess.

Thanks
Nick

Brian Candler

Mar 7, 2023, 10:26:11 AM
to Prometheus Users
> My issue with the above is that I don't have the `ltmPoolMbrStatusAvailState` metric with both the `ltmPoolMbrStatusAvailState` and `ltmPoolMbrStatusEnabledState` label values. I'm not sure how I would manage to get the merged metric in this way.

You can nest queries as far as you need, including the group_left/group_right query shown before.  And if that makes them unmaintainable, you can use recording rules to generate new metrics containing the results of those queries.
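
As a rough sketch of that nesting, assuming (as stated earlier in the thread) that a label value of "1" means available or enabled respectively: label matchers can only be written on a selector, not on the result of an expression, so the filters go on each metric before the join rather than after it.

```promql
# Pool members (excluding MAINT nodes) that are available but NOT enabled,
# counted per pool. The matchers sit on the two selectors because
# {...} cannot be applied to the output of the join itself.
count by (instance, ltmPoolMbrStatusPoolName) (
    ltmPoolMbrStatusAvailState{ltmPoolMbrStatusNodeName!~".*MAINT.*", ltmPoolMbrStatusAvailState="1"}
  * on (instance, ltmPoolMbrStatusNodeName, ltmPoolMbrStatusPoolName, ltmPoolMbrStatusPort)
    group_left (ltmPoolMbrStatusEnabledState)
    ltmPoolMbrStatusEnabledState{ltmPoolMbrStatusEnabledState!="1"}
)
```

Only members present on both filtered sides survive the join, so this corresponds to query 2 in the list above; the other combinations follow by changing the two matchers.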

> The other issue is that queries 2, 3 and 4 will not return a 0 value if, for example on number 2, there are no members that ARE available AND are NOT enabled, so no value is returned. I'm not sure how that would be handled in the maths operation, as it would likely be "no data" rather than 0.

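count(foo) counts the number of elements in instance vector foo. If foo is an empty vector, the result is also empty (no data) rather than 0; for an ungrouped count you can get a 0 back with count(foo) or vector(0).

That trick doesn't carry over to count by (foo) (bar), which gives a separate count for each value of the 'foo' label in the 'bar' metric (with missing/empty foo label counted as a separate value): vector(0) has no labels, so it can't supply a zero for each group. If there are no 'bar' metrics at all, then you'll get an empty result set.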

If there is *some* metric "bar" that you can depend on always being present, then you can do something like
    foo or on (x,y,z) bar * 0
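
Applied to a per-pool count, that pattern might look like the sketch below. It assumes the unfiltered `ltmPoolMbrStatusAvailState` metric is present for every pool member, so that it can play the role of "bar", and it borrows the `!= 1` filter from the earlier query as the "foo" side.

```promql
# Left:  per-pool count of non-MAINT members that are not available;
#        pools with no such member are simply missing from this vector.
# Right: per-pool count of all non-MAINT members, multiplied by 0, which
#        supplies a 0 for every pool the left side left out.
  count by (instance, ltmPoolMbrStatusPoolName) (
    ltmPoolMbrStatusAvailState{ltmPoolMbrStatusNodeName!~".*MAINT.*"} != 1
  )
or
  count by (instance, ltmPoolMbrStatusPoolName) (
    ltmPoolMbrStatusAvailState{ltmPoolMbrStatusNodeName!~".*MAINT.*"}
  ) * 0
```

Because both sides are aggregated by the same labels, their label sets match exactly and a plain `or` is enough; `on (...)` is only needed when the two sides carry different labels.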

Bear in mind that things which *look* like Boolean operations in PromQL, aren't.

foo < 50      # return the instance vector "foo" trimmed to include only those timeseries whose value is < 50 (i.e. it's a filter)

foo or bar   # union of all values of "foo", plus those values of "bar" which don't have exactly matching label sets in "foo"
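
For comparison, getting an actual 0/1 result out of a comparison needs the `bool` modifier on the operator. A small sketch using the same placeholder metric name `foo`:

```promql
foo < 50        # filter: keeps only the series of "foo" whose value is below 50
foo < bool 50   # keeps every series of "foo"; the value becomes 1 if below 50, else 0
```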

Nick Carlton

Mar 7, 2023, 11:20:14 AM
to Brian Candler, Prometheus Users
This is the first I’ve heard of recording rules. So I would be able to convert the group_right query into a new metric using this method?

The documentation seems to suggest it’s a colon string with an expression. Does that mean the metric becomes the string containing the colons? And I guess these need to exist within the rule files like for alertmanager?

I think doing that would be a better solution; however, I still need to work out the issue with not returning a zero value. I understand the premise of `foo or on (x,y,z) bar * 0` but not well enough to substitute the metrics and labels from my existing count by. Are you able to provide further guidance on that?

Massively appreciate your help

Thanks
Nick 


Brian Candler

Mar 7, 2023, 11:31:48 AM
to Prometheus Users
On Tuesday, 7 March 2023 at 16:20:14 UTC Nick Carlton wrote:
> This is the first I’ve heard of recording rules. So I would be able to convert the group_right query into a new metric using this method?

Yes.

> The documentation seems to suggest it’s a colon string with an expression.

Not necessarily: the string you saw is just the metric name, and you can name your new metrics however you like.  However, colons are valid in metric names, and indeed are "reserved for user defined recording rules" (i.e. regular exporters are not supposed to use them).

Using a metric name with colons helps to distinguish synthetic metrics, reduces the risk of name clashes, and may include a potted history of how they were derived, as recommended in the Prometheus best-practice guidance on naming recording rules.

The actual value of the generated metric is defined by the "expr:" part of the recording rule.

Nick Carlton

Mar 7, 2023, 11:47:34 AM
to Brian Candler, Prometheus Users
Thanks I’ll have a look into that.

Are you able to advise on substituting my query into the format `foo or on (x,y,z) bar * 0` like you suggested?

I tried one but got "unexpected expression", so I’m obviously getting the syntax wrong!

I suppose that if there are, for example, no nodes that are enabled but not available, would Prometheus actually know about the pools that don’t match this? In theory they would not be returned at all in the query, never mind the count by.


Brian Candler

Mar 7, 2023, 12:00:54 PM
to Prometheus Users
On Tuesday, 7 March 2023 at 16:47:34 UTC Nick Carlton wrote:
> I suppose that if there are, for example, no nodes that are enabled but not available, would Prometheus actually know about the pools that don’t match this? In theory they would not be returned at all in the query, never mind the count by.

That's what I mean by having a stable metric, one which is always present, that you can combine with "or" or "unless".

If you just care about the 'instance' label, then the "up" metric, which is created by Prometheus for every target it attempts to scrape, can serve this purpose.  This allows you to alert on the *absence* of a particular metric from any target, since you know which targets are being scraped.
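
A minimal sketch of that kind of absence check, assuming the job name `f5_ltm_test` seen in the metrics at the top of the thread:

```promql
# Targets that Prometheus is scraping for this job but which currently
# expose no ltmPoolMbrStatusAvailState series at all.
up{job="f5_ltm_test"} unless on (instance) ltmPoolMbrStatusAvailState
```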

In your case, you need to identify a particular metric which is always present for each load balancer pool (or whatever it is you're monitoring) regardless of its state.

But no, sorry, I am not going to write this for you :-)  It would involve digging further into the details of these F5 metrics than I care to.  Good luck!

Nick Carlton

Mar 7, 2023, 1:39:06 PM
to Brian Candler, Prometheus Users
Thanks, 

I think I understand what you mean now.

In theory, both of the metrics that I’m using would work without any manipulation. For example, ltmPoolMbrStatusAvailState will return everything I care about and all the labels I need to match on; it’s only once I manipulate the values that it stops showing me everything.

So untested but:

`ltmPoolMbrStatusAvailState or on (instance,ltmPoolMbrStatusNodeName,ltmPoolMbrStatusPoolName,ltmPoolMbrStatusPort) metricfromrecordingrule * 0`

Then the recording rule would be for the initial join:

`ltmPoolMbrStatusAvailState * on(instance,ltmPoolMbrStatusNodeName,ltmPoolMbrStatusPoolName,ltmPoolMbrStatusPort) group_right(ltmPoolMbrStatusAvailState) ltmPoolMbrStatusEnabledState`

I imagine putting the above query where `metricfromrecordingrule` is in the main query would cause other issues, so I’m best off using a recording rule to get that data. That also allows me to use metricfromrecordingrule to do my checking for:

- enabled but unavailable
- disabled but available
- disabled and unavailable

And then I can do my subtraction based on that.

Sorry I’m new to more advanced querying. Is that looking along the right lines?

