Offset alert never clearing

cam

Dec 13, 2024, 3:49:33 AM
to Prometheus Users
Hello all,

I have a rule which is trying to count time series that match a certain regexp and spot when this changes, to raise an alert more or less immediately (i.e. no for clause). This is counting a custom socket count metric that we need to catch any changes in.

  - alert: outboundSocketCountChange
    expr: (count({__name__=~"tcpsocket(.+)Inbound"} offset 30s) - count({__name__=~"tcpsocket(.+)Inbound"})) != bool 0
    labels:
      severity: critical
    annotations:
      summary: OB socket count has changed

It triggers fine when the value changes, but it then appears to be stuck in firing rather than resolving when the next evaluation completes. Graphing the PromQL shows exactly what I would expect: a single spike to 1 when the value changes and then back to zero. I would expect the alert to clear when it hits that zero.

Scrape and evaluation intervals are both set to 15s. Prom v2.45.

Am I missing something here? 

cam

Dec 13, 2024, 4:39:02 AM
to Prometheus Users
This took about a week to appear on the list? In the meantime, I have come up with the following:

  - alert: outboundSocketCountChange
    expr: ((count({__name__=~"tcpsocket(.+)Inbound"} offset 30s) - count({__name__=~"tcpsocket(.+)Inbound"})) != bool 0) == 1

    labels:
      severity: critical
    annotations:
      summary: OB socket count has changed

This does what I need but it makes me think I do not really understand how expr works in prom rules - is it something that simply evaluates to either 1 or 'true' as a go bool type?

c

Brian Candler

Dec 13, 2024, 7:00:42 AM
to Prometheus Users
> I do not really understand how expr works in prom rules - is it something that simply evaluates to either 1 or 'true' as a go bool type?

No. It's not boolean logic at all.

PromQL works with *vectors*: a vector contains zero or more values, each with a distinct set of labels. An alert fires whenever the vector is non-empty, regardless of the value. That is, a value of 0 triggers an alert just as much as a value of 1000. It's the presence or absence of a value which controls alerting.

Take, for example, the PromQL query "foo". It might return the following, all current values of metric foo:

foo{instance="aaa"} 7
foo{instance="bbb"} 3
foo{instance="ccc"} 1

That's a vector with three values.

Now take the PromQL query "foo > 2". It returns a vector with 2 values:

foo{instance="aaa"} 7
foo{instance="bbb"} 3

If you use "foo > 2" as an alerting expression, then you'll have two alerts firing.  If the value of foo{instance="bbb"} drops to 2 or less, then the alerting expression returns an instant vector with only one value, so the bbb alert resolves, but the aaa alert continues.

This is the reason why "resolved" messages show the most recent value which triggered the alert, not the current (non-alerting) value. The current value is below the threshold, so is filtered out entirely from the PromQL results.
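
A minimal sketch of how the "bool" modifier changes this, reusing the foo values above. With "bool" the comparison stops filtering and instead returns 0 or 1 for every element, so the result vector is never empty. That is why an alert expression ending in "!= bool 0" keeps firing: a result of 0 is still a result.

```promql
# Filter comparison: elements that fail the test are dropped from the result.
foo > 2
#   foo{instance="aaa"} 7
#   foo{instance="bbb"} 3

# With "bool": every element is kept, the value becomes 1 (true) or 0 (false),
# and the metric name is dropped.
foo > bool 2
#   {instance="aaa"} 1
#   {instance="bbb"} 1
#   {instance="ccc"} 0
```

Used as an alerting expression, the second query would fire three alerts, including one for ccc with value 0.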

Now, an expression like count({__name__=~"tcpsocket(.+)Inbound"}) also gives a vector as its result. If there are no timeseries inside the parentheses, then it is the empty vector. If there are one or more timeseries, then you get a single-element vector containing a single value (which is the count of timeseries) and an empty label set.  You can try this for yourself in the PromQL query browser:

count({__name__=~"blah_nonexistent(.*)"})   #   empty result
count({__name__=~"node_filesystem(.*)"})    #    {} 1234   where {} means "empty label set"

Now, when you do a binary operation between two vector values, by default the result vector has one entry for every label set which matches exactly between the LHS and RHS vectors. Any label set on the LHS which is not matched on the RHS, or vice versa, is discarded and gives no value in the result vector.  But in this case, since the LHS and RHS will (almost) always have a single entry with empty label set, it will match.
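
As an illustration of that matching rule, here is a sketch using the foo values from earlier and a hypothetical second metric "bar" (the metric and its values are invented for this example):

```promql
# Assumed current values of the hypothetical metric bar:
#   bar{instance="aaa"} 2
#   bar{instance="bbb"} 3
#   bar{instance="ddd"} 10

foo - bar
#   {instance="aaa"} 5
#   {instance="bbb"} 0
# instance="ccc" (only in foo) and instance="ddd" (only in bar) have no
# partner on the other side, so they produce no result at all.
```

Note that the bbb element is present with value 0: a zero result is not the same thing as a missing one.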

Therefore, what I think you want is simply:

expr: count({__name__=~"tcpsocket(.+)Inbound"} offset 30s) != count({__name__=~"tcpsocket(.+)Inbound"})

That should do what you want *unless* __name__=~"tcpsocket(.+)Inbound" matches no timeseries at all, in which case the vector will be empty (on either the LHS or the RHS) and therefore the count() will be empty, and there's nothing to match to the other side.  If this is an important case for you then you can fake up a vector with empty labels:

expr: (count({__name__=~"tcpsocket(.+)Inbound"} offset 30s) or vector(0)) != (count({__name__=~"tcpsocket(.+)Inbound"}) or vector(0))

Again, PromQL's "or" operator doesn't behave like a boolean operator. What "or" does is to match the vectors on the LHS and the RHS:
- for any value on the LHS, use the value and label set from the LHS in the result (whether or not it matches something in the RHS)
- for any value on the RHS, whose label set does not exist in the LHS, then add it to the result.

vector(0) is a static value: an instant vector containing one element whose label set is empty with value 0.  So if a count() doesn't return an element with empty label set, "... or vector(0)" will add one with value 0, and the comparison then has something to match on both sides. Note that "or" binds more loosely than "!=", so each "count(...) or vector(0)" needs its own parentheses; without them the fallback is applied to the result of the comparison, and the alert fires permanently with value 0.
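
A short sketch of "or" acting as a union of label sets, reusing the queries and example values from earlier in this message (the figure 1234 is the same illustrative count as above, not a real measurement):

```promql
# LHS is empty, so the element from vector(0) is added.
count({__name__=~"blah_nonexistent(.*)"}) or vector(0)
#   {} 0

# LHS already has an element with the empty label set, so it wins
# and vector(0) contributes nothing.
count({__name__=~"node_filesystem(.*)"}) or vector(0)
#   {} 1234

# No element of foo has the empty label set, so the result is all of
# foo plus the extra element from vector(0).
foo or vector(0)
#   foo{instance="aaa"} 7
#   foo{instance="bbb"} 3
#   foo{instance="ccc"} 1
#   {} 0
```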

Colm McCartan

Dec 13, 2024, 7:33:26 AM
to Brian Candler, Prometheus Users
This is incredibly helpful, thanks for taking the time to write it. I don't think there is anything like this level of description of how expr works in the docs, but I may have missed it.

You also correctly anticipated that the missing-time-series scenario was an issue for me in this work, so thanks for that too.

cam

Chris Siebenmann

Dec 13, 2024, 10:19:11 AM
to Colm McCartan, Brian Candler, Prometheus Users, Chris Siebenmann
> This is incredibly helpful, thanks for taking the time to write it. I
> don't think there is anything like this level of description of how
> expr works in the docs, but I may have missed it.

To be clear here, and to help locate it in the documentation, this isn't
a special behaviour of alert rule expressions, it's a general feature of
PromQL, the query language. You'll see the same issues anywhere you use
PromQL, such as in Grafana dashboards, ad-hoc queries through the
Prometheus web interface, and so on. It's covered in the general PromQL
documentation to some degree, in that they say that these things are
vector matching, but I think the current PromQL documentation doesn't
specifically cover some of the surprises that you can get here (arguably
it's not the right place for it and it would be better covered in some
sort of introduction/tutorial for PromQL, which doesn't currently
exist).

(There are various clever tricks to deal with the label and set union
issues, although I don't think anyone has collected them all in one
place.)

- cks