Blackbox_Exporter 0.24.0 - probe endpoint every 15s with 30s timeout?


Alexander Wilke

Jan 9, 2024, 5:04:42 AM
to Prometheus Users
Hello,
I want to use blackbox_exporter and http prober to login to an API.

My goal is to perform the login every 15s, which could be:

xx:yy:00
xx:yy:15
xx:yy:30
xx:yy:45

I could solve this with scrape_interval: 15s.
But in addition I want to allow a scrape timeout of 30s, which is longer than the scrape_interval.
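
For illustration, this is roughly the scrape job I have in mind (job name, module and target are placeholders; as far as I understand, this exact combination is rejected because scrape_timeout is larger than scrape_interval):

  - job_name: 'api-login'                  # placeholder name
    scrape_interval: 15s
    scrape_timeout: 30s                    # what I would like, but larger than scrape_interval
    metrics_path: /probe
    params:
      module: [http_2xx]                   # assuming an http prober module
    static_configs:
      - targets:
        - https://api.example.com/login    # placeholder API login URL
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115        # assuming blackbox_exporter listens here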

If I run two blackbox probes in parallel with scrape_interval: 30s and scrape_timeout: 30s, this will work, but both probes will start more or less at the same time:

xx:yy:00
xx:yy:30
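
Concretely, that would just be two copies of the job, something like this (only the scheduling fields shown; the rest would be identical to a normal blackbox http job):

  - job_name: 'api-login-a'        # placeholder names
    scrape_interval: 30s
    scrape_timeout: 30s
    # ... same params, static_configs and relabel_configs as the job sketched above
  - job_name: 'api-login-b'
    scrape_interval: 30s
    scrape_timeout: 30s
    # ... same params, static_configs and relabel_configs as the job sketched above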

The idea behind that is:
In general the API response for login is very fast. For whatever reason sometimes it takes 30s or more. I do not want the probe to just fail after 15s but want to see and understand how long a login request takes.

If I abort a long-lasting request or do a parallel login, it may work very fast. So it is probably not a problem with the API in general but with a specific user session or other unknown circumstances. So I want short scrape intervals, but the timeout sometimes needs to be higher, OR I need several blackbox probes which do not start at the same time but are spread out evenly.

Any ideas?
Is this possible with Prometheus 2.48.1 and blackbox_exporter 0.24.0?



Brian Candler

Jan 9, 2024, 9:43:51 AM
to Prometheus Users
Unfortunately, the timeout can't be longer than the scrape interval, firstly because this would require overlapping scrapes, and secondly the results could be returned out-of-order: e.g.

xx:yy:00 scrape 1: takes 25 seconds, gives result at xx:yy:25
xx:yy:15 scrape 2: takes 5 seconds, gives result at xx:yy:20

> If I run two blackbox_probes in parallel with scrape_interval: 30s and scrape_timeout: 30s this will work but both probes will start more or less at the same time.

Actually I think you'll find they'd be evenly spread out over the scrape interval - try it.

For example, make a single scrape job with a 60 second scrape interval, and list the same target 4 times - but make sure you apply some distinct label to each instance, so that they generate 4 separate timeseries.  You can then look at the raw timestamps in the database to check the actual scrape times: easiest way is by using the PromQL web interface and supplying a range vector query, like probe_success{instance="foo"}[5m].  This has to be in table view, not graph view.  Don't mix any other targets into that scrape job, because they'll be spread together.
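
If you'd rather see all four sample timestamps side by side than read them off the range vector output, a query along these lines works too (sketch; the job name is just an assumption):

  timestamp(probe_success{job="blackbox-spread-test"})

That returns, per series, the timestamp of its latest sample; with 4 copies spread over a 60 second interval, they should differ by roughly 15 seconds.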

Alternatively, KISS: use a 15 second scrape interval, and simply accept that "scrape failed" = "took longer than 15 seconds". Does it really matter whether it was 20 seconds or 25 seconds? Can you get that information from somewhere else if needed, e.g. web server logs?
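
And if you want a rough idea of how often that happens, something like this would do (sketch; matchers depend on your config):

  1 - avg_over_time(probe_success{job="blackbox"}[1d])

i.e. the fraction of probes in the last day that failed, which with a 15s timeout includes every login that took longer than 15 seconds.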

Brian Candler

Jan 9, 2024, 9:45:42 AM
to Prometheus Users
(Thinks: maybe it's *not* necessary to apply distinct labels? This feels wrong somehow, but I can't pinpoint exactly why it would be bad)

Alexander Wilke

Jan 9, 2024, 6:35:09 PM
to Prometheus Users
Hello,
it's only partly working, I think. If I add the same target several times to the same job, then Prometheus treats targets with exactly the same name as one.
This results in one target in Prometheus' web UI target list, and tcpdump confirms only one scrape per 60s:

      - targets:
        - pfsense.oberndorf.ca:443        # pfsense webui tcp tls test
        - pfsense.oberndorf.ca:443        # pfsense webui tcp tls test
        - pfsense.oberndorf.ca:443        # pfsense webui tcp tls test
        - pfsense.oberndorf.ca:443        # pfsense webui tcp tls test

If I use this, I have 4 different names for the same target, which results in 4 scrapes. However, at most 4 permutations are possible this way, I think, and with plain http only 2:

        scheme: https
      - targets:
        - pfsense.oberndorf.ca:443        # pfsense webui tcp tls test
        - https://pfsense.oberndorf.ca        # pfsense webui tcp tls test
        - https://pfsense.oberndorf.ca:443        # pfsense webui tcp tls test
        - pfsense.oberndorf.ca        # pfsense webui tcp tls test


And at least in my test they do not spread out as evenly as I hoped, and in addition I now have 4 different instances.
Maybe I could fix this by relabeling the "instance" field, but that sounds as wrong as relabeling the "job".

same_target_4times.JPG


Back to your question:
"Does it really matter whether it was 20 seconds or 25 seconds?"

I don't know if this is relevant. It's a rare issue and I am in discussion with the vendor of the API/appliance. However, it could maybe give me some more indication if the API would respond after, let's say, 50s or 3 minutes.
If scrape_timeout is reached, the exporter sends a RST, if I remember correctly, which is good for closing the connection, but it will also close the connection to the API, and the API server maybe just writes "client closed connection" or something similar to its log.

I don't know if it is really a problem if the answers of two parallel probes overlap (timeout longer than the interval), because the connections use different source ports, and Prometheus allows "out-of-order" ingestion, if I remember correctly.
Perhaps it could lead to many unclosed connections, which need memory: let's say the interval is 1s and the timeout is 60s, then there could be 60 connections in parallel.

Maybe a longer timeout than scrape_interval could be handled like this:

scrape_interval: 15s
scrape_timeout: 60s

If the scrape duration is longer than scrape_interval, check whether the probe finished before scrape_timeout, and if so do the next scrape according to scrape_interval.
If the scrape duration is longer than scrape_interval and the probe is still running (shorter than scrape_timeout), skip the next scrape until the timeout is reached or the scrape has succeeded.

However this would not allow parallel scrapes.


Probably this is a rare scenario, and debugging an API with blackbox_exporter was only an idea. I just wanted to ask whether I am missing something :-)

Thanks for sharing your ideas.

Brian Candler

Jan 10, 2024, 3:23:38 AM
to Prometheus Users
If the scrape job is removing duplicate targets, then try giving them distinct labels as I originally suggested:

     - labels:
         subinstance: "1"     # quoted, since label values must be strings
       targets:
        - pfsense.oberndorf.ca:443
     - labels:
         subinstance: "2"
       targets:
        - pfsense.oberndorf.ca:443
     - labels:
         subinstance: "3"
       targets:
        - pfsense.oberndorf.ca:443
     - labels:
         subinstance: "4"
       targets:
        - pfsense.oberndorf.ca:443

I haven't tested this, but if this works it will give you 4 distinct timeseries, which is safe.

If you wanted to experiment with dropping the subinstance label so as to merge them into one big timeseries then you'd have to do that in metric relabelling rules. However I remember why that's probably a bad idea: it's because the timestamps could be out-of-order when one scrape has a long latency but the next one doesn't.  So better to keep these as separate timeseries, and merge them when querying by using aggregation functions.
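
For example, to look at the probe latency across all four copies as if it were one series (query sketch, assuming the subinstance label from the config above):

  max by (instance) (probe_duration_seconds{job="blackbox"})

(or min/avg, depending on what you're interested in).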

If that doesn't work then another option might be to try using fragments in the URL, e.g. 'https://pfsense.oberndorf.ca#1',  'https://pfsense.oberndorf.ca#2' etc.
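
i.e. something like this in the static config (untested, just spelling out the suggestion):

      - targets:
        - https://pfsense.oberndorf.ca#1
        - https://pfsense.oberndorf.ca#2
        - https://pfsense.oberndorf.ca#3
        - https://pfsense.oberndorf.ca#4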