Missing data points make the alert resolve


林浩

Jun 3, 2020, 3:49:11 AM
to Prometheus Users

We use node_exporter to monitor D-state processes on the OS. When the number of D-state processes exceeds 500, it triggers a pager via an alert rule like this:

  - alert: Node_Process_In_D_State_Count_Critical
    expr: node_processes_state{state='D'} > 500
    for: 10m

The problem is that when the OS gets into a bad state (too many D-state processes), the node_exporter agent seems to get into a bad state as well, and it can NOT report the D-state process metric to the Prometheus server correctly.
In the screenshot below you can see some data points missing. This causes alert flapping: when data is missing, the alert gets resolved.

Is there any way to avoid the alert auto-resolving when some data points are missed?

[screenshot: Jietu20200603-154527.jpg]

Ben Kochie

Jun 3, 2020, 4:05:24 AM
to 林浩, Prometheus Users
You can use something like `avg_over_time(node_processes_state{state='D'}[10m])` to smooth over missed scrapes. Depending on how sensitive you want this to be, you can also do `max_over_time()`.
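A sketch of the adjusted rule along these lines, keeping the 10m window and threshold from the original rule (swap in max_over_time for a more sensitive variant, as Ben notes):

```yaml
# Sketch: smooth over missed scrapes with a range function, so a few
# missing samples inside the window don't clear the alert condition.
- alert: Node_Process_In_D_State_Count_Critical
  expr: avg_over_time(node_processes_state{state='D'}[10m]) > 500
  for: 10m
```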

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/f8560b07-9b00-4dfc-9671-667368ddd530%40googlegroups.com.

Brian Candler

Jun 3, 2020, 5:55:35 AM
to Prometheus Users
Are you sure the data is missing? The query

node_processes_state{state='D'} > 500

will only show values in the graph which are over 500; you will see gaps when the value is below 500.  What do you see if you graph

node_processes_state{state='D'}

instead?

林浩

Jun 3, 2020, 6:26:19 AM
to Prometheus Users
Thanks Ben! That looks like a good suggestion. I'm still learning PromQL, and I see a function called absent(). Would that function help in my case?

On Wednesday, June 3, 2020 at 4:05:24 PM UTC+8, Ben Kochie wrote:
You can use something like `avg_over_time(node_processes_state{state='D'}[10m])` to smooth over missed scrapes. Depending on how sensitive you want this to be, you can also do `max_over_time()`.


林浩

Jun 3, 2020, 6:29:53 AM
to Prometheus Users
Brian, I'm sure the data is missing. If I graph node_processes_state{state='D'}, it looks like this:

[screenshot: prom1.jpg]

On Wednesday, June 3, 2020 at 5:55:35 PM UTC+8, Brian Candler wrote:

Ben Kochie

Jun 3, 2020, 6:40:29 AM
to 林浩, Prometheus Users
No, absent() doesn't help here. The general practice is to have a separate alert on the `up` metric to detect failed scrapes.
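A minimal sketch of such a scrape-failure rule, assuming the node_exporter targets are scraped under a job named `node` (the job label and 5m duration here are illustrative, not from this thread):

```yaml
# Illustrative companion rule: page when scrapes of node_exporter fail,
# so a wedged exporter raises its own alert instead of silently
# letting the D-state alert resolve.
- alert: NodeExporterDown
  expr: up{job="node"} == 0
  for: 5m
```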


Brian Candler

Jun 8, 2020, 8:22:46 AM
to Prometheus Users
OK - then as Ben says, use avg_over_time or max_over_time.

Rajesh Reddy Nachireddi

Jun 8, 2020, 1:30:10 AM
to Brian Candler, Prometheus Users
But again, this always makes the queries complex and doesn't give reliable results.

If we use absent(), the queries are complex, and it isn't much use when we aren't expecting a boolean.

avg_over_time or max_over_time smooth the data, but that isn't reliable either.

Do we have a way to distinguish "data not present" from "condition no longer matched" when resolving? It would also be helpful to have a timer-based resolve, similar to `for:` on firing.

I want to hear from the community and the maintainers.



On Wed, Jun 3, 2020 at 5:52 PM Brian Candler <b.ca...@pobox.com> wrote:
OK - then as Ben says, use avg_over_time or max_over_time.


Julien Pivotto

Jun 8, 2020, 1:58:11 AM
to Rajesh Reddy Nachireddi, Brian Candler, Prometheus Users
Hello

This is a use case for a last_over_time function which does not exist and has been rejected in the past.


Regards

Rajesh Reddy Nachireddi

Jun 8, 2020, 2:09:26 AM
to Julien Pivotto, Brian Candler, Prometheus Users
Thanks Julien. I used a similar approach for testing, but it is really difficult to write 3 to 4 rules just to raise/clear one alert.

Another proposal was to count the number of times the condition matches within the `for:` duration. That would work for discrete data as well, instead of streaming data only.

Brian Candler

Jun 8, 2020, 2:26:39 AM
to Prometheus Users
On Monday, 8 June 2020 06:30:10 UTC+1, Rajesh Reddy Nachireddi wrote:
> it would be helpful to have timer based resolved similar to firing.

- although I posted that under alertmanager, and really it belongs under prometheus.

Brian Brazil

Jun 8, 2020, 3:23:31 AM
to Rajesh Reddy Nachireddi, Brian Candler, Prometheus Users
On Mon, 8 Jun 2020 at 07:41, Rajesh Reddy Nachireddi <rajesh...@gmail.com> wrote:
> Ok, thanks for the clarification. It is a necessary feature rather than a good-to-have feature.
>
> As prometheus usage is spreading to multiple industries, would it be possible to consider https://github.com/prometheus/alertmanager/issues/204 under prometheus?

As Ben said, this is a case for avg_over_time or max_over_time. Looking at just the last point would be too fragile, and once an alert fires, adding additional semantics is only rearranging the deckchairs. See https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 and https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

Brian

> The approach you mentioned doesn't scale well in large enterprise environments.
>
> On Mon, Jun 8, 2020 at 12:06 PM Brian Candler <b.ca...@pobox.com> wrote:
>> On 08/06/2020 07:31, Rajesh Reddy Nachireddi wrote:
>>> Thanks Brian. Do we have this issue open under prometheus, and with working examples?
>>
>> No. Brian Brazil considers this feature unnecessary.

Brian Candler

Jun 8, 2020, 3:47:44 AM
to Brian Brazil, Rajesh Reddy Nachireddi, Prometheus Users
On 08/06/2020 08:23, Brian Brazil wrote:
> As Ben said this is a case for avg_over_time or max_over_time. Looking
> at just the last point would be too fragile, and once an alert fires
> adding additional semantics is only rearranging the deckchairs. See
> https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 and
> https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped

I understand; however, I am still unconvinced by the asymmetry: a rule has to be firing for the `for:` duration before an alert is triggered, but if it dips below the threshold for one evaluation cycle, it's immediately cleared.

If avg_over_time or max_over_time were sufficient, there would be no need for the `for:` clause.

Brian Brazil

Jun 8, 2020, 3:53:47 AM
to Brian Candler, Rajesh Reddy Nachireddi, Prometheus Users
Not quite; `for` and `*_over_time` do different things. For example, consider a brand-new target: avg_over_time could fire instantly off a single sample, whereas `for` on top of that gives time for a bit more history to build up. There are a few other races that `for` helps with, and in general I'd use a `for` of at least 5m just for the sake of reducing false positives. When working with gauges you need both.
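Putting both pieces together, a rule in the spirit of this advice might look like the sketch below (the window and durations are illustrative choices, not from this thread):

```yaml
# Sketch: the range function rides over missed scrapes, while `for`
# requires the condition to hold across several evaluation cycles
# before the alert fires.
- alert: Node_Process_In_D_State_Count_Critical
  expr: max_over_time(node_processes_state{state='D'}[10m]) > 500
  for: 5m
```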
