Show scrape errors in prometheus metrics

Tom Liefheid

Mar 6, 2021, 5:16:04 AM
to Prometheus Users
Hi,

Can scrape errors be seen in Prometheus's own metrics?
I sometimes have network-level issues that leave my Prometheus unable to scrape targets, which causes it to send alerts.

It would be useful to have a label or something similar on a Prometheus metric so I can visualise the errors in my Grafana instance and pinpoint scrape issues more easily in the future.

Thanks,
Tom

Ben Kochie

Mar 6, 2021, 5:45:43 AM
to Tom Liefheid, Prometheus Users
Yes, this is what the `up` metric provides. There's also `scrape_duration_seconds` that provides the time it took to perform the scrape. This makes it easier to see timeouts.
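
A minimal sketch of alerting on `up`, assuming a rule file loaded through `rule_files`. The group name `scrape-health` and the alert name `TargetDown` are hypothetical, and the `for` duration and severity are placeholder values to adjust:

```yaml
groups:
  - name: scrape-health          # hypothetical group name
    rules:
      - alert: TargetDown        # hypothetical alert name
        # up is 1 when the last scrape succeeded and 0 when it failed
        expr: up == 0
        for: 5m                  # placeholder; tune to your scrape interval
        labels:
          severity: high
        annotations:
          summary: "Scrape of {{ $labels.instance }} (job {{ $labels.job }}) is failing"
```

Because the expression keeps the `job` and `instance` labels of `up`, the resulting alert names the target that could not be scraped.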

Evelyn Pereira Souza

Mar 7, 2021, 12:38:26 AM
to promethe...@googlegroups.com
On 06.03.21 11:45, Ben Kochie wrote:
> Yes, this is what the `up` metric provides. There's also
> `scrape_duration_seconds` that provides the time it took to perform the
> scrape. This makes it easier to see timeouts
Hi,

A few additions from
https://www.omerlh.info/2019/03/04/keeping-prometheus-in-shape/

- Use scrape_duration_seconds for monitoring
- Use sample_limit to drop problematic targets
- Use scrape_samples_scraped to monitor the number of samples exposed by
a specific target
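
For the second point, the per-job setting in the Prometheus scrape configuration is `sample_limit`. A minimal sketch, assuming a statically configured job; the job name, target address, and the limit of 1000 are hypothetical values:

```yaml
scrape_configs:
  - job_name: awesome-service              # hypothetical job name
    # If a scrape returns more samples than this (after metric relabeling),
    # the whole scrape is treated as failed and `up` becomes 0 for that target.
    sample_limit: 1000                     # hypothetical limit
    static_configs:
      - targets: ["awesome-service:8080"]  # hypothetical target
```

A target that exceeds the limit then shows up as a failed scrape, so it is caught by any alerting already in place on `up`.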

alert: ScrapeDuration
expr: max(scrape_duration_seconds) > 15
for: 5m
labels:
  severity: high
annotations:
  summary: "Prometheus Scrape Duration is getting near the limit"


alert: TeamAwesomeScrapeSampleSize
expr: max(scrape_samples_scraped{kubernetes_namespace="awesome"}) > 1000
for: 5m
labels:
  severity: high
annotations:
  summary: "Oh No! One of our services is exposing too many metrics!"

kind regards
Evelyn

Tom Liefheid

Mar 16, 2021, 5:04:13 AM
to Prometheus Users
Thanks for your answers,

In my current setup, running Prometheus in HA, I have one instance that can't scrape the apps while the other one can. I want to find out which one is unable to scrape them so I can restart it. I don't see anything in the logs that reflects the issue. It would be nice if we could 'translate' the output of the /targets page into some kind of metric, if that makes sense.

On Sunday, March 7, 2021 at 06:38:26 UTC+1, Evelyn Pereira Souza wrote:

Stuart Clark

Mar 16, 2021, 5:35:50 AM
to Tom Liefheid, Prometheus Users
On 16/03/2021 09:04, Tom Liefheid wrote:
> Thanks for your answers,
>
> In my current setup, running prometheus in HA, i have 1 instance who
> can't scrape apps, but the other one can. I want to find out which one
> isn't able to scrape the apps, so i can restart it. i don't see
> anything in the logs that reflect the issues. it would be nice if we
> could 'translate' the output of the /targets page to some kind of
> metric, if that makes sense
All scrapes automatically produce the "up" metric, so a value of 0 would
indicate a failure (as you would see with red sections of the target
page). You should see labels for the job/target which is failing. It can
be a useful metric to alert on, and then look at logs/the target page to
try to figure out why the scrape is failing.

--
Stuart Clark

Tom Liefheid

Mar 16, 2021, 6:14:24 AM
to Prometheus Users
Yes, but running Prometheus in HA doesn't let me see which Prometheus instance has the issue, as only one is failing.

On Tuesday, March 16, 2021 at 10:35:50 UTC+1, Stuart Clark wrote:

Stuart Clark

Mar 16, 2021, 7:23:52 AM
to Tom Liefheid, Prometheus Users
On 16/03/2021 10:14, Tom Liefheid wrote:
> yes, but running a HA prometheus doesn't let me see which
> prom-instance has the issues, as only 1 is failing

How do you mean? You can query each instance directly, or the external labels should
indicate which instance has the failing metric...
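
A minimal sketch of the external-labels approach, assuming each HA replica has its own `prometheus.yml`. The label name `replica` and the value `prometheus-a` are hypothetical; the point is only that each instance gets a different value:

```yaml
global:
  external_labels:
    # Hypothetical label; set a different value on each replica,
    # for example prometheus-a on one and prometheus-b on the other.
    replica: prometheus-a
```

External labels are attached to data leaving the instance (alerts sent to Alertmanager, federation, remote write) rather than to results of local queries, so an alert on `up == 0` would carry the label of the replica that fired it. This only helps if the label is not dropped again by `alert_relabel_configs`, which some HA setups do for deduplication.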

--
Stuart Clark