Random alerts missing

Paras pradhan

Sep 15, 2022, 12:33:51 PM

to Prometheus Users

Hello,

We use prometheus, alertmanager and blackbox-exporter to check whether hosts respond to ICMP. We have 1K+ hosts. We noticed that sometimes, at random, alerts are not generated (prometheus dashboard --> alerts) when the hosts/targets are actually down. Restarting prometheus, alertmanager and blackbox-exporter fixes the issue. I don't see anything that stands out in the logs. How do I troubleshoot this, and is there anything like cached data in prometheus that needs to be cleared?

Thanks

Paras.

Julius Volz

Sep 19, 2022, 3:35:06 AM

to Paras pradhan, Prometheus Users

Hi Paras,

Could you share more information about your setup:

* What's the alerting rule that isn't working as intended?

* For how long were the hosts down without getting alerted on?

* What did the underlying metrics (e.g. "up" for the exporter's own scrape health and "probe_success" for the backend probe health) collected by the Blackbox Exporter look like at the time when the alert should have been firing, but didn't?

One possibility is that your Blackbox exporter itself couldn't be scraped anymore, in which case its "up" metric would be 0 and the "probe_success" metric would be absent (and thus any alerts based on that metric would never fire).
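
A minimal sketch of rules that would surface that case, assuming the job name blackbox_icmp-server quoted elsewhere in this thread. The group name, alert names and "for" durations are hypothetical placeholders, not taken from anyone's actual setup:

```yaml
groups:
  - name: BlackboxScrapeHealth        # hypothetical group name
    rules:
      # Fires when Prometheus cannot scrape the exporter for a target.
      # In that state probe_success is not recorded at all, so a rule
      # on probe_success == 0 has nothing to match.
      - alert: BlackboxScrapeFailed   # hypothetical alert name
        expr: up{job="blackbox_icmp-server"} == 0
        for: 5m
      # Fires when no probe_success series exists for the job at all.
      - alert: BlackboxProbeMetricAbsent   # hypothetical alert name
        expr: absent(probe_success{job="blackbox_icmp-server"})
        for: 5m
```

The first rule works per target because Prometheus records "up" itself for every scrape attempt, whether or not the scrape succeeds.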

Regards,

Julius

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/6bfb92dc-2a18-44d9-8fda-d6f84efba0e7n%40googlegroups.com.

--

Julius Volz

PromLabs - promlabs.com

Paras pradhan

Sep 19, 2022, 2:21:29 PM

to Julius Volz, Prometheus Users

Hello Julius

* The rule is something like this:

- name: ServerDown
  rules:
  - alert: Server-InstanceDown
    expr: probe_success{job="blackbox_icmp-server"} == 0
    for: 1m

* When alerting is not working, hosts stay down for hours without an alert until I restart prometheus and the blackbox exporter. After restarting, everything is normal.

* The underlying metric (probe_success) goes to 0 when a host is down, but the alert doesn't change to Pending/Firing.

Thanks

Paras.

Brian Candler

Sep 19, 2022, 4:31:58 PM

to Prometheus Users

Prometheus version? Alertmanager version?

What if you enter the query

probe_success{job="blackbox_icmp-server"} == 0

in the prometheus web interface (PromQL browser) while the problem is happening? Does it show any results?

Paras pradhan

Sep 19, 2022, 4:39:11 PM

to Brian Candler, Prometheus Users

Prometheus : 2.38.0

Alertmanager : 0.24.0

Blackbox: 0.22.0

probe_success{job="blackbox_icmp-server"} returns 0. I see it.

Thanks

Paras.

Brian Candler

Sep 19, 2022, 4:44:08 PM

to Prometheus Users

"Restarting prometheus, alertmanager and blackbox-exports fixes the issue"

Which one of these fixes the issue? From what you've said, I am guessing that restarting only prometheus would do it - since you're saying you see no alerts in the Prometheus UI, not even in "pending" state.

Paras pradhan

Sep 19, 2022, 4:53:46 PM

to Brian Candler, Prometheus Users

Correct. Restarting prometheus does fix it.

Brian Candler

Sep 19, 2022, 5:03:20 PM

to Prometheus Users

Are you collecting prometheus' own metrics? Something like this:

- job_name: prometheus
  scrape_interval: 1m
  static_configs:
  - targets: ['localhost:9090']

If you are, then there are various metrics you should check, including:

prometheus_rule_evaluations_total

prometheus_rule_evaluation_failures_total

prometheus_rule_group_iterations_total

prometheus_rule_group_iterations_missed_total

For the rule / rule group in question, check which of these are incrementing during the problem period. If the 'failures' or 'missed' are incrementing, that points to a problem. Similarly if the 'evaluations_total' or 'iterations_total' *isn't* incrementing.
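
As a rough sketch, the same checks could be turned into alerting rules so the problem is flagged without someone watching the graphs. This assumes the self-scrape job is working and uses the metric names listed above; the group name, alert names, windows and "for" durations are hypothetical choices:

```yaml
groups:
  - name: RuleEvaluationHealth            # hypothetical group name
    rules:
      # At least one rule evaluation failed in the last 5 minutes.
      - alert: RuleEvaluationFailures     # hypothetical alert name
        expr: increase(prometheus_rule_evaluation_failures_total[5m]) > 0
        for: 5m
      # A rule group skipped scheduled iterations, usually because
      # the previous iteration was still running.
      - alert: RuleGroupIterationsMissed  # hypothetical alert name
        expr: increase(prometheus_rule_group_iterations_missed_total[5m]) > 0
        for: 5m
```

One caveat: these rules are evaluated by the same Prometheus they are watching, so if rule evaluation stalls entirely they may not fire either. Checking the metrics by hand, as described above, is still worthwhile.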

Also, have a look at error output from prometheus while the problem is occurring:

journalctl -fu prometheus

Paras pradhan

Sep 19, 2022, 5:14:06 PM

to Brian Candler, Prometheus Users

Getting "Empty Query Results" at this moment. I will check when I notice the problem again.

Thanks for your input!

Paras.

Brian Candler

Sep 19, 2022, 5:20:36 PM

to Prometheus Users

You should be getting results all the time, even when things are working. If you are not, then it means those metrics are missing, which means most likely you are not collecting them.

You'll need a scrape job like the one I posted.

Paras pradhan

Sep 19, 2022, 5:40:17 PM

to Brian Candler, Prometheus Users

Yes. This is what I have

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

Paras pradhan

Sep 19, 2022, 5:55:34 PM

to Brian Candler, Prometheus Users

None of these metrics are recognized. What am I missing?

prometheus_rule_evaluations_total

prometheus_rule_evaluation_failures_total

prometheus_rule_group_iterations_total

prometheus_rule_group_iterations_missed_total

Thanks

Brian Candler

Sep 19, 2022, 6:12:25 PM

to Prometheus Users

What does

up{job="prometheus"}

show?

If it's 0, then you have a problem with prometheus scraping itself. What error do you see in the targets list in the web interface? Maybe you've configured it to listen on a different port, or on a different path (--web.external-url), or with TLS or basic auth.

If it's 1, then it appears you have no alerting rules, which would explain why you don't get any alerts.
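
A failed self-scrape is easy to miss because nothing else depends on it. A minimal sketch of a rule that would flag it, assuming the job is named "prometheus" as in the scrape config posted in this thread; the group name, alert name and "for" duration are hypothetical:

```yaml
groups:
  - name: SelfScrapeHealth               # hypothetical group name
    rules:
      # Prometheus records "up" for its own job even when the scrape
      # fails, so this fires while the self-scrape target is down
      # (for example when authentication is rejected).
      - alert: PrometheusSelfScrapeDown  # hypothetical alert name
        expr: up{job="prometheus"} == 0
        for: 5m
```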

Paras pradhan

Sep 19, 2022, 6:20:20 PM

to Brian Candler, Prometheus Users

It returns zero, and the targets page shows "prometheus (0/1)" too. I have basic auth enabled. Is it possible to use basic auth and enable the prometheus scraping itself?

Thanks

Paras.

Brian Candler

Sep 20, 2022, 3:01:10 AM

to Prometheus Users

On Monday, 19 September 2022 at 23:20:20 UTC+1 pradha...@gmail.com wrote:

I have basic auth enabled. Is it possible to use basic auth and enable the prometheus scraping itself?

Yes: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config

scrape_configs:
  - job_name: "prometheus"
    basic_auth:
      username: "foo"
      password: "bar"
      # or: password_file: /etc/prometheus/scrape.password
    static_configs:
      - targets: ["localhost:9090"]

Paras pradhan

Sep 27, 2022, 3:48:06 PM

to Brian Candler, Prometheus Users

Thanks Brian, I just enabled it and will watch the failures and missed totals. I'm not seeing anything at the moment. Thanks for your help.
