Random alerts missing

Paras pradhan

Sep 15, 2022, 12:33:51 PM
to Prometheus Users
Hello,

We use prometheus, alertmanager and blackbox-exporter to check whether hosts respond to ICMP. The host count is 1K+. We noticed that sometimes, at random, alerts are not generated (prometheus dashboard --> alerts) when the hosts/targets are actually down. Restarting prometheus, alertmanager and blackbox-exporter fixes the issue. I don't see anything that stands out in the logs. How do I troubleshoot this, and is there anything like cached data in prometheus that needs to be cleared?

Thanks
Paras.

Julius Volz

Sep 19, 2022, 3:35:06 AM
to Paras pradhan, Prometheus Users
Hi Paras,

Could you share more information about your setup:

* What's the alerting rule that isn't working as intended?
* For how long were the hosts down without getting alerted on?
* What did the underlying metrics (e.g. "up" for the exporter's own scrape health and "probe_success" for the backend probe health) collected by the Blackbox Exporter look like at the time when the alert should have been firing, but didn't?

One possibility is that your Blackbox exporter itself couldn't be scraped anymore, in which case its "up" metric would be 0 and the "probe_success" metric would be absent (and thus any alerts based on that metric would never fire).
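
A minimal sketch of rules that would cover this blind spot. It assumes the probes run under the job label `blackbox_icmp-server` (the label quoted in the alerting rule later in this thread); the group name, alert names, and `for` durations are illustrative placeholders, not part of the original setup.

```yaml
groups:
  - name: blackbox-scrape-health          # hypothetical group name
    rules:
      # Fires when Prometheus cannot scrape the exporter for a given target.
      # In that case probe_success is not recorded for that target, so a
      # rule on probe_success == 0 alone stays silent.
      - alert: BlackboxScrapeFailed       # hypothetical alert name
        expr: up{job="blackbox_icmp-server"} == 0
        for: 5m                           # illustrative hold time
      # Fires only when the job has no probe_success series at all,
      # since absent() returns a result only if nothing matches the selector.
      - alert: BlackboxProbeMetricAbsent  # hypothetical alert name
        expr: absent(probe_success{job="blackbox_icmp-server"})
        for: 5m                           # illustrative hold time
```

The first rule works per target because `up` is still written (as 0) when a scrape fails; the second is a coarser safety net for the whole job disappearing.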

Regards,
Julius

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/6bfb92dc-2a18-44d9-8fda-d6f84efba0e7n%40googlegroups.com.


--
Julius Volz
PromLabs - promlabs.com

Paras pradhan

Sep 19, 2022, 2:21:29 PM
to Julius Volz, Prometheus Users
Hello Julius

* The rule is something like this:

groups:
  - name: ServerDown
    rules:
      - alert: Server-InstanceDown
        expr: probe_success{job="blackbox_icmp-server"} == 0
        for: 1m

* When alerting is not working, hosts stay down for hours without an alert until I restart prometheus and the blackbox exporters. After restarting, everything is normal.

* The underlying metric (probe_success) goes to 0 when a host is down, but the alert doesn't change to Pending/Firing.

Thanks
Paras.

Brian Candler

Sep 19, 2022, 4:31:58 PM
to Prometheus Users
Prometheus version? Alertmanager version?

What if you enter the query
    probe_success{job="blackbox_icmp-server"} == 0
in the prometheus web interface (PromQL browser) while the problem is happening?  Does it show any results?

Paras pradhan

Sep 19, 2022, 4:39:11 PM
to Brian Candler, Prometheus Users
Prometheus : 2.38.0
Alertmanager : 0.24.0
Blackbox: 0.22.0

probe_success{job="blackbox_icmp-server"} returns 0. I can see it.

Thanks
Paras.

Brian Candler

Sep 19, 2022, 4:44:08 PM
to Prometheus Users
"Restarting prometheus, alertmanager and blackbox-exporter fixes the issue"

Which one of these fixes the issue?  From what you've said, I am guessing that restarting only prometheus would do it - since you're saying you see no alerts in the Prometheus UI, not even in "pending" state.

Paras pradhan

Sep 19, 2022, 4:53:46 PM
to Brian Candler, Prometheus Users
Correct. Restarting prometheus does fix it.

Brian Candler

Sep 19, 2022, 5:03:20 PM
to Prometheus Users
Are you collecting prometheus' own metrics? Something like this:

  - job_name: prometheus
    scrape_interval: 1m
    static_configs:
      - targets: ['localhost:9090']

If you are, then there are various metrics you should check, including:
prometheus_rule_evaluations_total
prometheus_rule_evaluation_failures_total
prometheus_rule_group_iterations_total
prometheus_rule_group_iterations_missed_total

For the rule / rule group in question, check which of these are incrementing during the problem period. If the 'failures' or 'missed' are incrementing, that points to a problem.  Similarly if the 'evaluations_total' or 'iterations_total' *isn't* incrementing.
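
These checks can also be turned into alerts of their own. The following is a sketch under assumptions: it requires that Prometheus scrapes its own metrics, the group name, alert names, range windows, and `for` durations are illustrative placeholders, and the last rule assumes every rule group is evaluated at least once every 10 minutes.

```yaml
groups:
  - name: prometheus-rule-health        # hypothetical group name
    rules:
      # Some rule in a group returned an error during evaluation.
      - alert: RuleEvaluationFailures   # hypothetical alert name
        expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 5m
      # A group skipped scheduled iterations, e.g. because the previous
      # iteration was still running.
      - alert: RuleGroupIterationsMissed  # hypothetical alert name
        expr: increase(prometheus_rule_group_iterations_missed_total[5m]) > 0
        for: 5m
      # A group's iteration counter has stopped moving.
      # Assumes evaluation intervals well below 10 minutes.
      - alert: RuleGroupNotIterating    # hypothetical alert name
        expr: increase(prometheus_rule_group_iterations_total[10m]) == 0
        for: 10m
```

Note that these rules are themselves evaluated by the rule manager, so they would not fire if rule evaluation stopped entirely; in that situation the graphs of the raw counters remain the more reliable signal.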

Also, have a look at error output from prometheus while the problem is occurring:
journalctl -fu prometheus

Paras pradhan

Sep 19, 2022, 5:14:06 PM
to Brian Candler, Prometheus Users
Getting "Empty Query Results" at this moment. I will check when I notice the problem again. 

Thanks for your input !
Paras.

Brian Candler

Sep 19, 2022, 5:20:36 PM
to Prometheus Users
You should be getting results all the time, even when things are working.  If you are not, then it means those metrics are missing, which means most likely you are not collecting them.

You'll need a scrape job like the one I posted.

Paras pradhan

Sep 19, 2022, 5:40:17 PM
to Brian Candler, Prometheus Users
Yes. This is what I have
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

Paras pradhan

Sep 19, 2022, 5:55:34 PM
to Brian Candler, Prometheus Users
None of these metrics are recognized. What am I missing?

prometheus_rule_evaluations_total
prometheus_rule_evaluation_failures_total
prometheus_rule_group_iterations_total
prometheus_rule_group_iterations_missed_total

Thanks

Brian Candler

Sep 19, 2022, 6:12:25 PM
to Prometheus Users
What does
   up{job="prometheus"}
show?

If it's 0, then you have a problem with prometheus scraping itself.  What error do you see in the targets list in the web interface?  Maybe you've configured it to listen on a different port, or on a different path (--web.external-url), or with TLS or basic auth.

If it's 1, then it appears you have no alerting rules.  Which would explain why you don't get any alerts.
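
The first case, a self-scrape that fails without anyone noticing, can be surfaced by a rule along these lines. This is a sketch only: the job label matches the `prometheus` scrape job shown earlier in the thread, while the group name, alert name, and hold time are illustrative assumptions.

```yaml
groups:
  - name: prometheus-self-monitoring   # hypothetical group name
    rules:
      # up is still recorded (as 0) when a scrape fails, so this rule can
      # fire even though none of Prometheus' own metrics are being ingested.
      - alert: PrometheusSelfScrapeDown   # hypothetical alert name
        expr: up{job="prometheus"} == 0
        for: 5m                           # illustrative hold time
```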

Paras pradhan

Sep 19, 2022, 6:20:20 PM
to Brian Candler, Prometheus Users
It returns zero, and in targets it's "prometheus (0/1)" too. I have basic auth enabled. Is it possible to use basic auth and enable the prometheus scraping itself?

Thanks
Paras.


Brian Candler

Sep 20, 2022, 3:01:10 AM
to Prometheus Users
On Monday, 19 September 2022 at 23:20:20 UTC+1 pradha...@gmail.com wrote:
> I have basic auth enabled. Is it possible to use basic auth and enable the prometheus scraping itself?

scrape_configs:
  - job_name: "prometheus"
    basic_auth:
      username: "foo"
      password: "bar"
      # or: password_file: /etc/prometheus/scrape.password
    static_configs:
      - targets: ["localhost:9090"]

Paras pradhan

Sep 27, 2022, 3:48:06 PM
to Brian Candler, Prometheus Users
Thanks Brian, I just enabled it and will be watching the failure and missed totals. Not seeing anything at this moment. Thanks for your help.
