Grafana data mismatch in different time frames.


sreehari M V

Mar 7, 2022, 9:17:34 AM
to Prometheus Users
Hi Team,

We use the Discrete plugin (panel) in Grafana to display data from blackbox_exporter and track endpoint (URL) availability; the Prometheus data retention period is 50 days. The panel shows the URL's available and unavailable time as percentages.

The issue is that shorter downtime (e.g. a 502 return code for 1 hour) gets ignored when we select a larger time range in Grafana (above 1 month), and the panel shows the URL as 100% available. But if we select a smaller time frame in Grafana, the URL's unavailable time is displayed.

We suspect an issue with the query used in the panel (below). Can somebody please provide a solution for this issue?

PromQL query used in the Grafana Discrete plugin:
probe_httpd_status_code{instance="https://xxxxxxx",job=blackbox-generic-endpoints"}


Prometheus Version - 2.31.0
Blackbox exporter - 0.13.0
Grafana Version - 6.7.4
Scrape_interval: 30s

Thanks and regards
SreeHari

Brian Candler

Mar 7, 2022, 1:30:30 PM
to Prometheus Users
My guess is: when the plugin queries over a large time range, it is sending a large step time to the prometheus API, which is skipping over the times of interest.

Now, you can argue that this is a problem with the way that the panel queries Prometheus. However, querying a 1 month range with a 30 second step would be extremely inefficient (returning ~86,000 data points).  So really, it would be better if you were to have a *counter* of how many times a 502 status code is returned, and then the plugin can calculate a rate over each step.

You can use a recording rule, running at the same interval as your blackbox scrapes, to increment a counter for each 502 response from blackbox_exporter.
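
For example (an untested sketch; the rule name probe_http_5xx is just illustrative, and strictly speaking it records a per-scrape 0/1 failure indicator rather than a true counter), a rule group evaluated at the same 30s interval as the scrapes could look like:

groups:
  - name: blackbox-availability
    interval: 30s   # match the blackbox scrape interval
    rules:
      - record: probe_http_5xx
        expr: probe_http_status_code{job="blackbox-generic-endpoints"} >= bool 500

That records 1 for every scrape that returned a 5xx status and 0 otherwise (use == bool 502 if you only care about 502s). Summing the series over any window with sum_over_time then gives the number of failed scrapes in that window.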

(Incidentally, the query that you've posted is syntactically invalid - it has mismatched quotes)

sreehari M V

Mar 28, 2022, 7:15:10 AM
to Brian Candler, Prometheus Users
Hi Brian,

Thanks for your reply. Could you share a sample config/query to fix this issue, if possible? I am a beginner and did not fully understand your reply.

Thanks and regards
Sreehari



Brian Candler

Mar 28, 2022, 7:40:51 AM
to Prometheus Users
Your problem is this: suppose you're recording blackbox_exporter output, and for simplicity I'll choose probe_success, which looks something like this (1 for OK, 0 when there's a problem):

---------_-------_----------------------_-------------------------_----------------

You're then viewing it in Grafana across a very wide time range, which picks out individual data points for each pixel:

-    -    -    -    -    -    -    -    _    -    -    -    -    -    -    -    -

If you zoom out a long way, you can see it is likely to skip over points where the value was zero.  This is bound to happen when taking samples in this way.

In an ideal world, you'd make each failure event increment a counter:
                                                                  _________________
                 _______________________--------------------------                 
_________--------

Then when you look over any time period, you can see how many failures occurred within that window.  I think that's the best way to approach the problem.  Since blackbox_exporter doesn't expose a counter like this, you'd have to synthesise one, e.g. using a recording rule.
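
Once such a series exists (taking the hypothetical probe_http_5xx rule sketched in my earlier reply, which records 1 for a failed scrape and 0 otherwise), a panel query along these lines would give roughly the seconds of downtime in the selected dashboard range, assuming a 30-second scrape interval:

sum_over_time(probe_http_5xx{instance="https://xxxxxxx",job="blackbox-generic-endpoints"}[$__range]) * 30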

Assuming you only have the existing timeseries, then as a workaround for probe_success, you could try using something like this:

min_over_time(probe_success[$__interval])

$__interval is the time span in Grafana of one data point (and changes with the graph resolution). With this query, it "looks back" in time before each point, and if *any* of the data points is zero, the result will be zero for that point; if they are all 1 then the result will be 1. But you may find that if you zoom in too close, you get gaps in your graph.

Or you can use:

avg_over_time(probe_success[$__interval])

In this case, if one point covers 4 samples, and the samples were 1 1 0 1, then you will get a data point showing 0.75 as the availability.
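
And if what you ultimately want is an availability percentage for the whole selected range rather than per graph point, an instant query roughly like this (again only a sketch; adjust the selectors to your setup) would give it directly:

avg_over_time(probe_success{instance="https://xxxxxxx",job="blackbox-generic-endpoints"}[$__range]) * 100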

Now, that isn't going to work for probe_http_status_code, which has values like 200 or 404 or 503; an "average" of these isn't helpful.  But you could do:

max_over_time(probe_http_status_code{instance="https://xxxxxxx",job="blackbox-generic-endpoints"}[$__interval])

Then you'll get whatever is the highest status code over that time range.  That is, if the results for the time window covered by one point in the graph were 200 200 404 200 503 200, then you'll see 503 for that point.  That may be good enough for what you need.
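
If all you want is a count of failing scrapes over the selected range, and you don't want to set up a recording rule, a subquery can approximate it (a sketch only; the 30s resolution is meant to match your scrape interval):

count_over_time((probe_http_status_code{instance="https://xxxxxxx",job="blackbox-generic-endpoints"} >= 500)[$__range:30s])

Each 30-second step at which the latest status code was 500 or above contributes one to the count, so the result is roughly the number of failed scrapes in the window.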

sreehari M V

Mar 31, 2022, 5:56:40 AM
to Brian Candler, Prometheus Users
Thank you, Brian.

I have tested the mentioned queries and results are attached.

Query 1 (old query):
probe_http_status_code{instance="https://xxxxxxxxxx",job="blackbox-generic-endpoints"}
- Total downtime in the month, summing one-day time frames: 26 minutes
- Downtime shown in a 1-month time frame: 4 hrs

Query 2:
max_over_time(probe_http_status_code{instance="https://xxxxxxxxxx",job="blackbox-generic-endpoints"}[$__interval])
- Total downtime in the month, summing one-day time frames: 34 minutes
- Downtime shown in a 1-month time frame: 12 hrs

Query 3:
min_over_time(probe_http_status_code{instance="https://xxxxxxxxxx",job="blackbox-generic-endpoints"}[$__interval])
- Total downtime in the month, summing one-day time frames: 18 minutes
- Downtime shown in a 1-month time frame: 2 hrs (URL unavailable)

The actual downtime for this endpoint/URL is around 30 minutes, which roughly matches the first two queries when we sum the downtime values of one-day time frames across the month (details attached).

However, in a one-month time frame the first two queries do not show the exact downtime and report more than 4 hours of downtime (results attached).

Could you please suggest another solution or provide a fix for this issue?

Regards.
Sreehari

result1.jpg
result2.jpg

Brian Candler

Mar 31, 2022, 9:53:00 AM
to Prometheus Users
> Could you please suggest another solution or provide a fix for this issue.

Not really, I'm afraid, because your question is really about Grafana, not about Prometheus.

All the raw data is present and correct in Prometheus, but Grafana isn't querying it in the right way; the fix is on the Grafana side.

If it were me, I'd connect to the Prometheus API, run a query to collect the raw data, analyse it the way I want, and generate the result.  A range vector query sent to the instant query endpoint will return all the data points in the given time window, with their actual collection timestamps, e.g.
probe_http_status_code{instance="https://xxxxxxxxxx",job="blackbox-generic-endpoints"}[30d]
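
For instance, something along these lines (the host and port are placeholders for your Prometheus server) returns every stored sample and its timestamp for the last 30 days as JSON:

curl -G 'http://your-prometheus:9090/api/v1/query' \
  --data-urlencode 'query=probe_http_status_code{instance="https://xxxxxxxxxx",job="blackbox-generic-endpoints"}[30d]'

You can then post-process the returned values however you like, e.g. count the samples with a 5xx status and multiply by the 30-second scrape interval to estimate total downtime.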