Binary expressions and missing metrics

Federico Buti

Mar 3, 2022, 6:12:11 AM
to Prometheus Users
Hi list,

For a monitored system we set up a rule as follows:

absent(our_metric{environment="pro",service="bar",stack="foo"}) and on(stack, environment) up{service="bar",source="app"} == 1

This is one of the few absence rules we have in our ruleset. It is also a bit special because the exporter uses the absence of the metric to indicate a problem, something the guidelines discourage. But that goes beyond my question anyway.

Using a binary AND operator seems to work fine, cutting out the cases in which the node is not scrapable. However, this morning the node went missing. We probably had a misconfiguration in our provisioning, which we are currently investigating.

As the node went missing the second operand of the binary operator could not be evaluated, simply because it was neither `1`, nor `0`. Or, in other words, the following was holding true:

absent(up{service="bar",source="app"}) == 1

I understand an alert can resolve if the related metric goes stale but I'm not sure how the logic should translate in this case. On the surface I would not expect the AND expression to fire as we are not able to say the "up" metric is really 1.

But maybe I'm missing the point here?

Thanks in advance,
F.

Brian Candler

Mar 3, 2022, 11:01:29 AM
to Prometheus Users
You can use the PromQL browser in the Prometheus web UI to debug this, since you can view the value of an expression at any previous point in time.

Try the two halves separately:

absent(our_metric{environment="pro",service="bar",stack="foo"}) 

up{service="bar",source="app"} == 1

Then try the whole expression at that point in time.  Either view the graph, or view the instant query and set the instant time to when there was a problem.

> As the node went missing the second operand of the binary operator could not be evaluated, simply because it was neither `1`, nor `0`

The expression:
    up{service="bar",source="app"} == 1
can only ever have the value 1 or be missing.  metric == constant is a filter, not a boolean.  The value it returns is the value of the LHS, or no value if the filter condition is not met.
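
To make the filter behaviour concrete, here is a minimal sketch in Python. It is a toy model, not Prometheus code: an instant vector is assumed to be a list of (labels, value) pairs, and the `instance` label and sample values are made up for illustration. It contrasts the plain comparison with the `bool` modifier, which is the form that does return 0 or 1.

```python
# Toy model of an instant vector: a list of (labels, value) pairs.
# Illustration only; this is not how Prometheus is implemented.

def filter_eq(vector, constant):
    """Model `vector == constant`: keep matching samples with their LHS value."""
    return [(labels, value) for labels, value in vector if value == constant]

def bool_eq(vector, constant):
    """Model `vector == bool constant`: keep every sample, value becomes 1 or 0."""
    return [(labels, 1.0 if value == constant else 0.0) for labels, value in vector]

# Hypothetical sample data: one healthy target and one failing target.
up = [
    ({"service": "bar", "source": "app", "instance": "a"}, 1.0),
    ({"service": "bar", "source": "app", "instance": "b"}, 0.0),
]

filtered = filter_eq(up, 1)  # only instance "a" survives, with value 1
as_bool = bool_eq(up, 1)     # both instances survive, with values 1 and 0

print(filtered)
print(as_bool)
```

In the filter form the failing target simply disappears from the result; it never shows up with a value of 0.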

Possibly you want to remove the "== 1" entirely:

absent(our_metric{environment="pro",service="bar",stack="foo"}) and on(stack, environment) up{service="bar",source="app"}

"and" expressions behave in a corresponding way:

    foo and bar

This is *not* boolean.  Rather, it takes the vector of timeseries "foo" and matches them up with the vector of timeseries "bar".  All those elements of foo which have exactly matching label sets with bar, are passed through unchanged.  Anything else is dropped.

So it's just a filter: "give me all values of foo, where there is also a value present for bar".  It does not have true/false values either as its input or its output.
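
A rough Python sketch of that filtering, under the same toy assumptions (vectors as lists of (labels, value) pairs, metric names not modelled, made-up `job` and `instance` labels):

```python
# Toy model: an instant vector is a list of (labels, value) pairs.
# Metric names are left out, since they are not part of the match.

def vector_and(lhs, rhs):
    """Model `lhs and rhs` with default matching on the full label set."""
    rhs_label_sets = {frozenset(labels.items()) for labels, _ in rhs}
    return [
        (labels, value)
        for labels, value in lhs
        if frozenset(labels.items()) in rhs_label_sets
    ]

# Hypothetical sample data.
foo = [
    ({"job": "x", "instance": "a"}, 10.0),
    ({"job": "x", "instance": "b"}, 20.0),
]
bar = [
    ({"job": "x", "instance": "b"}, 0.0),  # value 0 still counts as present
]

matched = vector_and(foo, bar)     # instance "b" of foo, value 20
nothing = vector_and(foo, [])      # empty RHS, so nothing can match

print(matched)
print(nothing)
```

Note that the value on the right-hand side plays no part: a sample with value 0 matches just as well as one with value 1, and the value that comes out is always the one from the left.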

> Or, in other words, the following was holding true:
> absent(up{service="bar",source="app"}) == 1

How do you know?  The "up" metric is always present for a target, whether or not scraping is successful: it would only not be present if you removed the target from the scrape job.  This could be the case if you are using some dynamic service discovery, and the service went away.  But then your real problem is how to stop services vanishing from service discovery.

Anyway, you can tell for sure by looking at historical values of these queries:

up{service="bar",source="app"}
absent(up{service="bar",source="app"})


baca...@gmail.com

Mar 4, 2022, 2:23:16 AM
to Prometheus Users
Hi Brian,

thanks a lot for your reply.

I re-read my original mail and I recognize I should probably have given less information and gone straight to the point. That probably created a bit of confusion. E.g. I never intended the up metric, or any other metric, to be considered a boolean. My bad. I'll try to get straight to the point this time.

> This is *not* boolean.  Rather, it takes the vector of timeseries "foo" and matches them up with the vector of timeseries "bar".  All those elements of foo which have exactly matching label sets with bar, are passed through unchanged.  Anything else is dropped.

Right, and my question is the following. It is mostly to understand the underlying behaviour, not because I have any particular problem to resolve.
Assuming the second metric goes missing, how is the binary expression evaluated exactly? In the "normal" case, i.e. "foo and bar", we would not have points, but in the case of "absent(foo) and bar", from my tests, it seems to me the "bar" filtering is simply ignored.

I can guess that is because "absent" is not really a metric per se and thus we are comparing two empty sets of labels - effectively reducing "absent(foo) and bar" to "absent(foo)".
I'd say, it would make sort of sense, right?

Cheers,
F.

Brian Candler

Mar 4, 2022, 3:46:19 AM
to Prometheus Users
> Assuming the second metric goes missing how is the binary expression evaluated exactly?

The same as it always is.  Remember that the left-hand side and the right-hand side are both vectors, containing zero or more values, each value having a distinct set of labels. Noting the documentation here:

    vector1 and vector2 results in a vector consisting of the elements of vector1 for which there are elements in vector2 with exactly matching label sets. Other elements are dropped. The metric name and values are carried over from the left-hand side vector.

Therefore, if the RHS of "and" is an empty vector, then the result of the entire "and" expression is an empty vector - since there is nothing in vector2 for vector1 to match.

> In the "normal" case, i.e. "foo and bar" we would not have points but in the case of "absent(foo) and bar", from my tests, it seems to me the "bar" filtering is simply ignored.

I don't understand what you mean by that. Can you give examples of the LHS and the RHS vectors, and the combined expression, which don't behave how you expect?

Note that "foo and bar" and "absent(foo) and bar" will both be empty if bar is empty, as just described.

"absent(foo)" is an unusual function:
- if the input vector has one or more values, i.e. any non-empty vector, its output is an empty vector (no values)
- if the input vector is empty, its output is a one-element vector with a single value "1". The label set of that value depends on the exact form of the expression inside the parentheses; it tries to do "the right thing" but at worst you could have value 1 with empty label set {}

In your case,

    absent(our_metric{environment="pro",service="bar",stack="foo"})

will return
    {environment="pro",service="bar",stack="foo"} 1

i.e. a single-element vector with empty metric name, those labels, and the value 1.
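
The two cases of absent() can be sketched in Python as follows. This is a simplified toy model: the label derivation is reduced to "copy the equality matchers of the selector", which is an assumption that covers a plain selector like the one above but not every expression form.

```python
# Toy model: an instant vector is a list of (labels, value) pairs.

def absent(vector, equality_matchers):
    """Model absent(selector).

    `vector` is what the selector returned; `equality_matchers` are the
    label="value" matchers written in the selector (simplified assumption).
    """
    if vector:
        return []  # something is present, so absent() yields nothing
    return [(dict(equality_matchers), 1.0)]  # one element with value 1

matchers = {"environment": "pro", "service": "bar", "stack": "foo"}

# our_metric is missing: the selector returns an empty vector.
when_missing = absent([], matchers)

# our_metric is present (hypothetical value 42): absent() returns nothing.
when_present = absent([(matchers, 42.0)], matchers)

print(when_missing)
print(when_present)
```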

Going back to the whole original expression:

    absent(our_metric{environment="pro",service="bar",stack="foo"}) and on(stack, environment) up{service="bar",source="app"} == 1

ISTM that is saying you want to generate an alert if our_metric{environment="pro",service="bar",stack="foo"} is missing, but only if metric up{service="bar",source="app"} exists *and* has value 1. That means the alert is suppressed if either:
(a) up{service="bar",source="app"} exists but its value is not 1
(b) up{service="bar",source="app"} does not exist - i.e. that expression returns an empty vector. ("up" is a special metric in Prometheus; if it doesn't exist, it means there is no configured scrape job with those labels)

If that's not what you want, then think about what you actually want, and then how to express that.  For example, if you want to suppress the alert in case (a) but not in case (b), then you can do this:

    absent(our_metric{environment="pro",service="bar",stack="foo"}) unless on(stack, environment) up{service="bar",source="app"} != 1
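
To compare the two forms side by side, here is a small Python sketch that evaluates both against the three possible states of "up" while our_metric is missing. It is a toy model under stated assumptions: vectors are lists of (labels, value) pairs, and the label values on the hypothetical "up" series are chosen to line up with the absent() output.

```python
# Toy model: an instant vector is a list of (labels, value) pairs.
# Illustration only; this is not Prometheus code.

def match_key(labels, on):
    # A label that is not set is treated like an empty string.
    return tuple(labels.get(name, "") for name in on)

def and_on(lhs, rhs, on):
    """Model `lhs and on(...) rhs`: keep LHS samples that have an RHS match."""
    rhs_keys = {match_key(labels, on) for labels, _ in rhs}
    return [(l, v) for l, v in lhs if match_key(l, on) in rhs_keys]

def unless_on(lhs, rhs, on):
    """Model `lhs unless on(...) rhs`: keep LHS samples with no RHS match."""
    rhs_keys = {match_key(labels, on) for labels, _ in rhs}
    return [(l, v) for l, v in lhs if match_key(l, on) not in rhs_keys]

ON = ("stack", "environment")
UP_LABELS = {"environment": "pro", "service": "bar", "stack": "foo", "source": "app"}

# our_metric is missing in every case, so absent(...) yields this vector.
absent_result = [({"environment": "pro", "service": "bar", "stack": "foo"}, 1.0)]

cases = {
    "up is 1": [(UP_LABELS, 1.0)],
    "up is 0": [(UP_LABELS, 0.0)],
    "up is missing": [],
}

results = {}
for name, up in cases.items():
    up_eq_1 = [(l, v) for l, v in up if v == 1]  # up{...} == 1
    up_ne_1 = [(l, v) for l, v in up if v != 1]  # up{...} != 1
    fires_with_and = bool(and_on(absent_result, up_eq_1, ON))
    fires_with_unless = bool(unless_on(absent_result, up_ne_1, ON))
    results[name] = (fires_with_and, fires_with_unless)
    print(f"{name:14} and: {fires_with_and!s:5} unless: {fires_with_unless}")
```

The two forms agree when "up" is 1 or 0 and differ only when "up" is missing: the "and" form stays silent, the "unless" form fires.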

------
If you don't mind, I will make an observation about the use of "and on(...)".  Since the LHS and RHS are vectors, an expression needs to identify corresponding values in the LHS vector and the RHS vector, to generate a vector of results. The on(...) part is for when the LHS and RHS vectors don't have exactly the same label sets, and you need to ignore some when matching them up. I think you know all this already.

I find your expression rather confusing, because:
- we know that any values in the LHS vector must have labels {environment="pro",service="bar",stack="foo"}
- we know that any values in the RHS vector must have labels {service="bar",source="app"}
- "on(stack,environment)" says to pair up LHS and RHS values where the "stack" and "environment" labels match
- therefore, the RHS vector must also have stack="foo" and environment="pro"
- as this is a one-to-one vector match: it will fail if a particular pair of (stack,environment) labels returns multiple values for the LHS and one or more for the RHS, or vice versa. Therefore we know (stack,environment) must be a unique match for a given service (*)

Therefore, implicitly I think all of (environment, service, stack) must match, i.e. this expression is the same as:

    absent(our_metric{environment="pro",service="bar",stack="foo"}) and on(environment, service, stack) up{environment="pro",service="bar",stack="foo",source="app"} == 1

And this can be simplified to:

    absent(our_metric{environment="pro",service="bar",stack="foo"}) and on(environment, service, stack) up{source="app"} == 1

I find the second version easier to read and reason about, because the environment/service/stack matching is all in one place, but you may disagree :-)
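
As a sanity check on sample data (not a proof), the following Python sketch evaluates the original form and the simplified form against the same set of "up" series. It is a toy model: vectors are lists of (labels, value) pairs, and the extra series (service "baz", environment "dev", source "node") are hypothetical.

```python
# Toy model: an instant vector is a list of (labels, value) pairs.

def select(vector, **matchers):
    """Model a selector with equality matchers, e.g. up{source="app"}."""
    return [
        (l, v) for l, v in vector
        if all(l.get(name, "") == want for name, want in matchers.items())
    ]

def and_on(lhs, rhs, on):
    """Model `lhs and on(...) rhs`."""
    rhs_keys = {tuple(l.get(n, "") for n in on) for l, _ in rhs}
    return [(l, v) for l, v in lhs if tuple(l.get(n, "") for n in on) in rhs_keys]

def eq_1(vector):
    """Model `vector == 1`."""
    return [(l, v) for l, v in vector if v == 1]

# Output of absent(our_metric{...}) when our_metric is missing.
lhs = [({"environment": "pro", "service": "bar", "stack": "foo"}, 1.0)]

def original(up):
    rhs = eq_1(select(up, service="bar", source="app"))
    return and_on(lhs, rhs, on=("stack", "environment"))

def simplified(up):
    rhs = eq_1(select(up, source="app"))
    return and_on(lhs, rhs, on=("environment", "service", "stack"))

# Hypothetical "up" series; only the first one is the target we care about.
up_series = [
    ({"environment": "pro", "service": "bar", "stack": "foo", "source": "app"}, 1.0),
    ({"environment": "pro", "service": "baz", "stack": "foo", "source": "app"}, 1.0),
    ({"environment": "dev", "service": "bar", "stack": "foo", "source": "app"}, 1.0),
    ({"environment": "pro", "service": "bar", "stack": "foo", "source": "node"}, 1.0),
]
without_target = up_series[1:]  # the target we care about has vanished

print(original(up_series) == simplified(up_series))
print(original(without_target) == simplified(without_target))
```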

(*) This does provide another reason why an alert could fail to trigger.  If the "and" expression returns multiple values for the same (stack,environment) pair on either the LHS or the RHS, with at least one match on the other side, then the whole expression will generate an error.

However, I think it's unlikely in this particular case. We know the LHS can only possibly return a single-element vector, so this error condition could only occur if up{service="bar",source="app"} == 1 returns multiple values with the same pair of (stack,environment) labels. That is, it would only be a problem if you had something like this:
up{environment="pro",service="bar",stack="foo",source="app",xxx="yyy"} 1
up{environment="pro",service="bar",stack="foo",source="app",xxx="zzz"} 1

Federico Buti

Mar 4, 2022, 5:00:12 AM
to Brian Candler, Prometheus Users
Hi Brian.

Thanks for the super-deep dive into the topic! This is simply awesome. And sorry for the mail address mismatch... too many mail accounts! :-D

On Fri, 4 Mar 2022 at 09:46, Brian Candler <b.ca...@pobox.com> wrote:
> > Assuming the second metric goes missing how is the binary expression evaluated exactly?
>
> The same as it always is.  Remember that the left-hand side and the right-hand side are both vectors, containing zero or more values, each value having a distinct set of labels. Noting the documentation here:
>
>     vector1 and vector2 results in a vector consisting of the elements of vector1 for which there are elements in vector2 with exactly matching label sets. Other elements are dropped. The metric name and values are carried over from the left-hand side vector.
>
> Therefore, if the RHS of "and" is an empty vector, then the result of the entire "and" expression is an empty vector - since there is nothing in vector2 for vector1 to match.
>
> > In the "normal" case, i.e. "foo and bar" we would not have points but in the case of "absent(foo) and bar", from my tests, it seems to me the "bar" filtering is simply ignored.
>
> I don't understand what you mean by that. Can you give examples of the LHS and the RHS vectors, and the combined expression, which don't behave how you expect?

I was referring to "absent(foo) and bar", which was the source of my original question. On the surface it seemed to me that the LHS was firing even though the RHS was empty. But your detailed explanation below forced me to double-check again in the expression browser, and now I see the RHS wasn't really empty as I first (erroneously) reported. That matches the documentation you mentioned and makes everything click perfectly in my head. It was dumb of me, but I guess stuff happens. Thanks a lot.



> Note that "foo and bar" and "absent(foo) and bar" will both be empty if bar is empty, as just described.
>
> "absent(foo)" is an unusual function:
> - if the input vector has one or more values, i.e. any non-empty vector, its output is an empty vector (no values)
> - if the input vector is empty, its output is a one-element vector with a single value "1". The label set of that value depends on the exact form of the expression inside the parentheses; it tries to do "the right thing" but at worst you could have value 1 with empty label set {}
>
> In your case,
>
>     absent(our_metric{environment="pro",service="bar",stack="foo"})
>
> will return
>     {environment="pro",service="bar",stack="foo"} 1
>
> i.e. a single-element vector with empty metric name, those labels, and the value 1.
>
> Going back to the whole original expression:
>
>     absent(our_metric{environment="pro",service="bar",stack="foo"}) and on(stack, environment) up{service="bar",source="app"} == 1
>
> ISTM that is saying you want to generate an alert if our_metric{environment="pro",service="bar",stack="foo"} is missing, but only if metric up{service="bar",source="app"} exists *and* has value 1. That means the alert is suppressed if either:
> (a) up{service="bar",source="app"} exists but its value is not 1
> (b) up{service="bar",source="app"} does not exist - i.e. that expression returns an empty vector. ("up" is a special metric in Prometheus; if it doesn't exist, it means there is no configured scrape job with those labels)

Yes, I was interested in having (a). Then yesterday we experienced (b) because of a provisioning problem and I wrote to the list to understand that case better, just to improve my knowledge. We do NOT want targets to disappear, which would lead to (b), of course, but that is an investigation we are doing on our side to avoid the problem in the future.
 


> If that's not what you want, then think about what you actually want, and then how to express that.  For example, if you want to suppress the alert in case (a) but not in case (b), then you can do this:
>
>     absent(our_metric{environment="pro",service="bar",stack="foo"}) unless on(stack, environment) up{service="bar",source="app"} != 1
>
> ------

Cool! I've always struggled a bit with "unless" but I can totally give it a go for this case. As I should have mentioned, I want to move away from absent altogether, but that is something that is not going to happen soon due to the way the exporter is written at the moment, unfortunately.
 


> If you don't mind, I will make an observation about the use of "and on(...)".  Since the LHS and RHS are vectors, an expression needs to identify corresponding values in the LHS vector and the RHS vector, to generate a vector of results. The on(...) part is for when the LHS and RHS vectors don't have exactly the same label sets, and you need to ignore some when matching them up. I think you know all this already.
>
> I find your expression rather confusing, because:
> - we know that any values in the LHS vector must have labels {environment="pro",service="bar",stack="foo"}
> - we know that any values in the RHS vector must have labels {service="bar",source="app"}
> - "on(stack,environment)" says to pair up LHS and RHS values where the "stack" and "environment" labels match
> - therefore, the RHS vector must also have stack="foo" and environment="pro"
> - as this is a one-to-one vector match: it will fail if a particular pair of (stack,environment) labels returns multiple values for the LHS and one or more for the RHS, or vice versa. Therefore we know (stack,environment) must be a unique match for a given service (*)
>
> Therefore, implicitly I think all of (environment, service, stack) must match, i.e. this expression is the same as:
>
>     absent(our_metric{environment="pro",service="bar",stack="foo"}) and on(environment, service, stack) up{environment="pro",service="bar",stack="foo",source="app"} == 1
>
> And this can be simplified to:
>
>     absent(our_metric{environment="pro",service="bar",stack="foo"}) and on(environment, service, stack) up{source="app"} == 1
>
> I find the second version easier to read and reason about, because the environment/service/stack matching is all in one place, but you may disagree :-)

Not really sure why I should disagree here! :-D
This is a great insight and a source of reflection for us to improve our rule set. We have a few binary expressions using "and" for which the reasoning applied here could be taken into account. If anything it simplifies/shortens the expression a lot, which is always a plus, imo.

Thanks a lot for your huge help!
F.


Brian Candler

Mar 4, 2022, 5:44:44 AM
to Prometheus Users
Glad it makes sense now. It was definitely a bump in the learning curve for me :-)

Regards, Brian.