Discrepancy in Alert Rule Evaluation.


yagyans...@gmail.com

Nov 7, 2020, 3:16:31 AM
to Prometheus Users

Hi. I am using Blackbox Exporter v0.18.0 for generating Host Down alerts. Below is the configured rule.
  - alert: HostDown
    expr: probe_success{job=~"Ping-All-Servers"} == 0
    for: 1m
    labels:
      severity: "CRITICAL"
    annotations:
      summary: "Server is Down - *{{ $labels.instance }}*"
      identifier: "*Cluster:* `{{ $labels.cluster }}`, *node:* `{{ $labels.node }}` "

Now, when I check my Prometheus alerts page http://x.x.x.x:9090/alerts, I see 7-8 HostDown alerts in PENDING state every time, yet when I check my Blackbox Exporter's debug log at the same time, I don't see any probe failures for my ICMP module for those instances.
Am I missing something here?

Thanks in advance!

yagyans...@gmail.com

Nov 7, 2020, 3:17:10 AM
to Prometheus Users
Prometheus Version - 2.20.1

Brian Candler

Nov 7, 2020, 3:42:36 AM
to Prometheus Users
Go into the Prometheus query browser (front page in the web interface, normally port 9090), and enter the query:

probe_success{job=~"Ping-All-Servers"}

and switch to graph mode.  Is the line going up and down?  Then probes are failing.

If you want to see logs of these failures, then on the blackbox_exporter you'll need to add --log.level=debug to its command line args.
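For example, the invocation would look roughly like this (the config file name here is just a placeholder):

./blackbox_exporter --config.file=blackbox.yml --log.level=debug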

Alternatively, if you are testing with curl, you can add "&debug=true" to the URL.  e.g.

curl -g 'localhost:9115/probe?module=foo&target=bar&debug=true'

Do this repeatedly until you see a failure, and the failure logs will be included in the HTTP response.

Note that the blackbox exporter by default sets a deadline of 0.5 seconds less than the scrape interval.  So if you have a very short scrape interval (say 1s) then each probe only has 0.5s to complete.
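To illustrate the interplay (just a sketch; the 0.5s comes from blackbox_exporter's --timeout-offset flag, which defaults to 0.5):

scrape_configs:
  - job_name: Ping-All-Servers
    scrape_interval: 5s
    scrape_timeout: 5s    # passed to the exporter by Prometheus in a header
    # blackbox_exporter subtracts its --timeout-offset (0.5s by default),
    # so each probe gets roughly 4.5s to complete.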

yagyans...@gmail.com

Nov 7, 2020, 3:49:15 AM
to Prometheus Users
Hi Brian,

My Blackbox exporter is already running in debug log mode and still I don't see any probe-failed logs for that period.
Also, I have run the query for some of the instances that I saw in PENDING state, but I do not see any failures there either; probe_success is constantly 1 for them, without any dips in between.

Brian Candler

Nov 7, 2020, 4:09:20 AM
to Prometheus Users
The PromQL query    probe_success{job=~"Ping-All-Servers"} == 0

is a filter.  It returns the set of timeseries where the job label matches "Ping-All-Servers" and the value is zero.  It cannot return a non-empty set of results unless those conditions are met.
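For example, if a probe for one target fails, the query returns a single series with value 0, something like (labels invented here for illustration):

probe_success{cluster="dc1", instance="10.0.0.5", job="Ping-All-Servers", node="node-01"}  0

If no probe is failing at evaluation time, the result set is empty and no alert fires.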

What's your rule evaluation interval, and what's the scrape interval for your blackbox job?
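(For reference, both live in prometheus.yml, roughly like this - the values below are only placeholders:

global:
  evaluation_interval: 1m
scrape_configs:
  - job_name: Ping-All-Servers
    scrape_interval: 5s

A rule group can also override the evaluation interval with its own "interval" field.)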

Can you show a screenshot of the pending alerts? (In the main Prometheus web interface, the "Alerts" tab)

Brian Candler

Nov 7, 2020, 4:11:44 AM
to Prometheus Users
On Saturday, 7 November 2020 08:49:15 UTC, yagyans...@gmail.com wrote:
My Blackbox exporter is already running in debug log mode and still I don't see any probe-failed logs for that period.

But is this the same blackbox exporter which is also showing panics in its logs?

Yagyansh S. Kumar

Nov 7, 2020, 4:51:36 AM
to Brian Candler, Prometheus Users
Yes, this is the same blackbox exporter that I mentioned in that thread.
My evaluation_interval is 1 minute and my scrape_interval for the job is 5 seconds currently.

One weird and interesting thing I have noticed: currently I have 2 Prometheus instances, both scraping the same targets, one running version 2.12.0 and the other 2.20.1, and I am seeing this behaviour only on 2.20.1; on 2.12.0 I have not seen any discrepancy like this.
Here is the snapshot of the PENDING alerts.
image.png



Below is the snapshot of probe_success for these instances from both the Prometheus instances.
From 2.20.1
Only the third instance shows probe_success 0 in between, but even that does not correspond with the timing of it being in PENDING state for HostDown.
image.png


probe_success for the same 3 servers from Prometheus instance v2.12.0.
No zeros, even for the third instance.
image.png

Brian Candler

Nov 7, 2020, 5:01:53 AM
to Prometheus Users
You won't necessarily see all the failures on that graph.  With a 5-second scrape interval, a 6 hour window contains 4,320 scrapes - more than the number of points fetched.  Many of the points will be skipped over.

I suggest you graph this instead:

min_over_time(probe_success[5m])

(Otherwise, you'd need to zoom in much closer and then scroll left and right)

Once you've sorted that, it becomes easier to compare the two prometheus servers.

Note: are these two servers talking to the *same* blackbox exporter - i.e. making remote connections over the network? Or does each prometheus server have its own blackbox exporter?  If they are separate blackbox exporter instances then there's likely some difference between the two, or the environment in which they are running.

Yagyansh S. Kumar

Nov 7, 2020, 5:15:57 AM
to Brian Candler, Prometheus Users
Yes, both Prometheus instances are indeed talking to the same BBE. In fact, both have the exact same configuration file and are scraping the exact same targets.

Here is the graph for the modified query. Failures are visible for 2.20.1 but none for 2.12.0.

2.12.0
image.png

2.20.1
image.png


Why is there a difference between the instances when they are talking to the same BBE?
Also, what do you think about the evaluation_interval and scrape_interval combination? Should I change something to solve this?

Brian Candler

Nov 7, 2020, 8:03:41 AM
to Prometheus Users
Try looking at scrape_duration_seconds{job="Ping-All-Servers"}.  Maybe it's borderline to the scrape interval.

What does min_over_time(up{job="Ping-All-Servers"}[5m]) show?  In other words, is it the scrape to BBE which is failing, or the BBE probe? (I think the latter).

Is there a different network path between the two prometheus servers and BBE?

It still bothers me that BBE is logging panics.  Something weird is going on in your BBE.  Could even be a hardware problem.

I think you should paste your entire scrape config and BBE config, in case something else jumps out.

Yagyansh S. Kumar

Nov 7, 2020, 8:35:47 AM
to Brian Candler, Prometheus Users
Try looking at scrape_duration_seconds{job="Ping-All-Servers"}.  Maybe it's borderline to the scrape interval.
>> That's interesting. Here are the top 20 scrape_duration_seconds, maxed over the last 1 hour, by instance. Close to 5 seconds. Can this lead to some issue? But again the question remains: why isn't Prometheus 2.12.0 having a problem with it?
I'll try and correlate this with probe_success being zero.
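A query of roughly this shape produces that top-20 view (the [1h] window and topk() are assumed here, not necessarily the exact query used):

topk(20, max_over_time(scrape_duration_seconds{job="Ping-All-Servers"}[1h]))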
image.png

What does min_over_time(up{job="Ping-All-Servers"}[5m]) show?  In other words, is it the scrape to BBE which is failing, or the BBE probe? (I think the latter).

 Is there a different network path between the two prometheus servers and BBE?
>> No. All are in the same DC and under the same cluster of ESX hosts.

It still bothers me that BBE is logging panics.  Something weird is going on in your BBE.  Could even be a hardware problem.
>> I can try and run the BBE on a different machine.
 
I think you should paste your entire scrape config and BBE config, in case something else jumps out.
>> My Prometheus scrape config is quite huge and messy at the moment (around 3k lines). I am attaching the blackbox jobs from the full scrape config. Let me know if you still need the entire scrape config, and I'll share that too.
Attached bbe.yml and scrape.yml with the Blackbox jobs.

But after everything, one question still remains - why isn't Prometheus 2.12.0 showing this discrepancy in evaluation?

scrape.yml
bbe.yml

Brian Candler

Nov 7, 2020, 11:41:21 AM
to Prometheus Users
On Saturday, 7 November 2020 13:35:47 UTC, Yagyansh S. Kumar wrote:
Try looking at scrape_duration_seconds{job="Ping-All-Servers"}.  Maybe it's borderline to the scrape interval.
>> That's interesting. Here are the top 20 scrape_duration_seconds, maxed over the last 1 hour, by instance. Close to 5 seconds. Can this lead to some issue?

Possibly. Maybe the scrape timeout handling has changed slightly between those versions of Prometheus.  I would in any case be concerned about the scrape duration being so close to the scrape interval, although failed scrapes should still show as "up == 0".

However, I note that the scrape.yml you posted shows the Ping-All-Servers job with a scrape interval of 10s, not 5s.

I also notice your module config has:

  icmp_prober:
    prober: icmp
    timeout: 30s
    icmp:
      preferred_ip_protocol: ip4

I *think* the timeout is clipped to just under the scrape interval, so it should work, but I'd be inclined to set it lower anyway (say 3s); if you don't get a reply within 3s, you're unlikely to get one.
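For example, something like:

  icmp_prober:
    prober: icmp
    timeout: 3s
    icmp:
      preferred_ip_protocol: ip4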

Since this test only does one ping, I would *expect* it to fail from time to time, and hence the alert goes into "pending" state until the "for: 1m" has run its course.

Yagyansh S. Kumar

Nov 7, 2020, 12:16:06 PM
to Brian Candler, Prometheus Users


On Sat, 7 Nov, 2020, 10:11 pm Brian Candler, <b.ca...@pobox.com> wrote:
On Saturday, 7 November 2020 13:35:47 UTC, Yagyansh S. Kumar wrote:
Try looking at scrape_duration_seconds{job="Ping-All-Servers"}.  Maybe it's borderline to the scrape interval.
>> That's interesting. Here are the top 20 scrape_duration_seconds, maxed over the last 1 hour, by instance. Close to 5 seconds. Can this lead to some issue?

Possibly. Maybe the scrape timeout handling has changed slightly between those versions of Prometheus.  I would in any case be concerned about the scrape duration being so close to the scrape interval, although failed scrapes should still show as "up == 0".
     >> Let me see these intervals for the older version of Prometheus also.

However, I note that the scrape.yml you posted shows the Ping-All-Servers job with a scrape interval of 10s, not 5s.
>> Sorry, my bad. I changed it to 10s after your last email to see what the top scrape duration is now. Interestingly, I did it for 2.12.0 first, and below are the results. The top 20 are close to 10s now, but there is no discrepancy in this version.

2.12.0
image.png

2.20.1
image.png

I also notice your module config has:

  icmp_prober:
    prober: icmp
    timeout: 30s
    icmp:
      preferred_ip_protocol: ip4

I *think* the timeout is clipped to just under the scrape interval, so it should work, but I'd be inclined to set it lower anyway (say 3s); if you don't get a reply within 3s, you're unlikely to get one.
>> Yes, I agree. To eliminate its role, I'll change it. I also noticed that my effective timeout is 4.5 seconds (because the scrape interval is 5s) for my ICMP module.

Since this test only does one ping, I would *expect* it to fail from time to time, and hence the alert goes into "pending" state until the "for: 1m" has run its course.
>> Hm, fair enough. But while examining this just 10 minutes back, I found that 2 instances even went into FIRING for HostDown, and indeed it was false, and *they didn't go into FIRING on 2.12.0*. Unfortunately, I forgot to take a screenshot at that time. I'll provide more details once I get some FIRING alerts again.

This puts me in a bit of a pickle, because I was about to update my other instance to 2.20.1 as well, since 2.12.0 isn't optimized for remote write and goes down with OOM if I run remote write on that version. But now I am hesitant after this fiasco, because a lot of people receive the notifications, so a false alert, especially a HostDown, won't look good.

Also, one more thing. I don't think this should affect alert evaluation, but I'd like to mention that I am doing remote write from my Prometheus 2.20.1 to VictoriaMetrics.


Brian Candler

Nov 7, 2020, 1:02:53 PM
to Prometheus Users
I don't think it's a false alert.  If it's the rule you showed, then the only way you can get an alert is if the metric probe_success has value zero.  You should try to understand why BBE is returning zero; if necessary use tcpdump or wireshark to capture the HTTP traffic to and from it.
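For example, something along these lines on the blackbox exporter host (9115 being its default port):

tcpdump -i any -A 'tcp port 9115'

or capture to a file and inspect it in Wireshark:

tcpdump -i any -w bbe.pcap 'tcp port 9115'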

But you also need to resolve the issue with BBE panicking - does it log a backtrace when it does this? If so, showing the backtrace could help identify what's going on.

Yagyansh S. Kumar

Nov 8, 2020, 7:10:54 AM
to Brian Candler, Prometheus Users
I'll try and get a backtrace and post it here.

But the question still remains: if BBE is returning probe_success 0, why is it doing so only for 2.20.1 🙄?



On Sat, 7 Nov, 2020, 11:33 pm Brian Candler, <b.ca...@pobox.com> wrote:
I don't think it's a false alert.  If it's the rule you showed, then the only way you can get an alert is if the metric probe_success has value zero.  You should try to understand why BBE is returning zero; if necessary use tcpdump or wireshark to capture the HTTP traffic to and from it.

But you also need to resolve the issue with BBE panicking - does it log a backtrace when it does this? If so, showing the backtrace could help identify what's going on.


Brian Candler

Nov 8, 2020, 7:17:12 AM
to Prometheus Users
On Sunday, 8 November 2020 12:10:54 UTC, Yagyansh S. Kumar wrote:
I'll try and get a backtrace and post it here.

But the question still remains: if BBE is returning probe_success 0, why is it doing so only for 2.20.1 🙄?


It could be that 2.12 is missing the data point (scrape) entirely.

Can you try this on both servers?  If this graph has dips in it, it means that a data point was missed.  With a 10s scrape interval, it would dip from 30 to 29 if one point is missed.

count_over_time(probe_success{job="Ping-All-Servers"}[5m])

Yagyansh S. Kumar

Nov 8, 2020, 7:25:45 AM
to Brian Candler, Prometheus Users
>> Pretty scary result for 2.20.1.
 
  2.20.1
image.png

2.12.0
image.png
