1 minute scrape, [1m] == no data ([90s] has data)


Laurent Demailly

Oct 1, 2020, 3:21:43 PM
to Prometheus Users
I described the problem in detail in https://github.com/prometheus/prometheus/issues/8001 (which was closed, but see the details there).

In short, the default install from helm has a 1-minute scrape interval, which makes the istio and kiali dashboards "empty" because they use [1m] in their queries.

Given that a 1m scrape interval means the data is on average 30s old, I don't think returning "no data" for queries is very useful behavior. Even if I were scraping every 10 minutes, I would expect any resolution I ask for to be extrapolated. But I guess I have the wrong expectations?

Can someone talk me through why "no data" is the right answer for [1m] while there is data for [90s]?

sample query:
sum(rate(istio_tcp_received_bytes_total{reporter="source"}[90s])) by (destination_workload, destination_workload_namespace, destination_service)

Thanks a lot
Laurent

Brian Brazil

Oct 1, 2020, 3:27:19 PM
to Laurent Demailly, Prometheus Users

Brian Candler

Oct 2, 2020, 3:38:22 AM
to Prometheus Users
On Thursday, 1 October 2020 20:21:43 UTC+1, Laurent Demailly wrote:
> I described the problem in detail in https://github.com/prometheus/prometheus/issues/8001 (which was closed, but see the details there).
>
> In short, the default install from helm has a 1-minute scrape interval, which makes the istio and kiali dashboards "empty" because they use [1m] in their queries.

You need to use at least [2m] for a rate query, if scraping at 1 minute intervals.
 
> Given that a 1m scrape interval means the data is on average 30s old, I don't think returning "no data" for queries is very useful behavior. Even if I were scraping every 10 minutes, I would expect any resolution I ask for to be extrapolated. But I guess I have the wrong expectations?
>
> Can someone talk me through why "no data" is the right answer for [1m] while there is data for [90s]?

rate() calculates the average rate between the *first* and *last* data points in the given time window.
irate() calculates the average rate between the *last two* data points in the given time window.

It uses the timestamps of the actual stored data points to calculate the rate, i.e. (v2-v1)/(t2-t1)    (**)

However, you need at least two data points to get an answer.  If your data is scraped at 1 minute intervals, then a 1-minute window will only ever contain one data point.  A 90-second window will sometimes contain two data points (in which case a rate is available) and sometimes only one (in which case there is no answer).  If you graph this, the line will have gaps: to draw a point at time T, the rate shown is for the window between T-90s and T, which sometimes exists and sometimes doesn't.
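
To make the window arithmetic concrete, here is a rough Python sketch (not Prometheus's actual code). It assumes a hypothetical counter scraped exactly every 60 seconds that grows by 10 per second, treats the window as excluding its start and including its end, and ignores both counter resets and the extrapolation that the real rate() applies. The names `window_samples` and `simple_rate` are made up for this illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def window_samples(samples: List[Sample], end: float, window: float) -> List[Sample]:
    """Samples with a timestamp in (end - window, end] (assumed boundary rule)."""
    start = end - window
    return [(t, v) for t, v in samples if start < t <= end]


def simple_rate(samples: List[Sample], end: float, window: float) -> Optional[float]:
    """(v2 - v1) / (t2 - t1) between the first and last sample in the window.

    Returns None when the window holds fewer than two samples, which is the
    "no data" case discussed above.
    """
    pts = window_samples(samples, end, window)
    if len(pts) < 2:
        return None
    (t1, v1), (t2, v2) = pts[0], pts[-1]
    return (v2 - v1) / (t2 - t1)


# Hypothetical data: one scrape every 60 s, counter increases by 600 per scrape.
samples = [(60.0 * i, 600.0 * i) for i in range(10)]

# Evaluate every 15 s, as a graph with a 15 s step would.
eval_times = list(range(300, 420, 15))
for window in (60, 90, 120):
    results = [simple_rate(samples, end, window) for end in eval_times]
    print(f"[{window}s]", results)
```

Under these assumptions the 60-second window never yields a value, the 90-second window alternates between a rate and nothing (the gaps in the graph), and the 120-second window always has two points to work with.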

This is maybe surprising at first.  But it is consistent: for example, count_over_time(foo[1m]) will tell you the number of data points *within the window*.

When you do an instant query, the value of a metric at query time T is the nearest *previous* value of the metric.  So you might have expected rate(foo[1m]) to take the value of foo at the end of the window and the value of foo at the start of the window, and calculate the rate between those.  But that's not how it works, for several reasons.  One is that it would have to look backwards *before* the start of the window to find the previous value (an instant query, by default, looks back up to 5 minutes).  Another is that the rate would bounce up and down as points enter and leave the window, whereas Prometheus calculates an accurate rate between two timestamped values.
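
For comparison, the "nearest previous value" behaviour of an instant query can be sketched like this. It is a simplified illustration only: `instant_value` is a made-up helper, the 300-second default stands in for the 5-minute lookback mentioned above, and staleness handling is ignored.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def instant_value(samples: List[Sample], t: float, lookback: float = 300.0) -> Optional[float]:
    """Return the most recent sample value at or before t, if it is recent enough.

    `samples` must be sorted by timestamp. Returns None when there is no
    earlier sample or the latest one is older than the lookback period.
    """
    timestamps = [ts for ts, _ in samples]
    i = bisect_right(timestamps, t)  # number of samples with timestamp <= t
    if i == 0:
        return None
    ts, value = samples[i - 1]
    if t - ts > lookback:
        return None
    return value


# Hypothetical series with two scrapes, 60 s apart.
series = [(0.0, 1.0), (60.0, 2.0)]
print(instant_value(series, 90.0))   # uses the sample from t=60
print(instant_value(series, 400.0))  # last sample is too old
```

A range selector such as foo[1m] does no such lookback: it only sees samples whose timestamps fall inside the window.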

(**) That is a simplified description, because there is additional work to handle counter resets.  Basically, only periods of time within the window where the counter is not decreasing are considered, and an average rate is calculated from these.
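
As a hedged sketch of the reset handling, one way to do it is to treat any decrease as the counter restarting from zero, so the whole post-reset value counts as increase. The helper name `increase_with_resets` and the sample values are invented for this example, and the real implementation does more (for instance extrapolating towards the window edges).

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def increase_with_resets(samples: List[Sample]) -> float:
    """Total counter increase across the samples, compensating for resets.

    Assumption: a value lower than the previous one means the counter
    restarted from zero, so the new value itself is the increase since
    the reset.
    """
    total = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            total += value          # reset: count everything since zero
        else:
            total += value - prev   # normal case: plain difference
        prev = value
    return total


# Hypothetical counter that resets between t=60 and t=120.
points = [(0.0, 100.0), (60.0, 160.0), (120.0, 20.0), (180.0, 80.0)]
increase = increase_with_resets(points)
rate = increase / (points[-1][0] - points[0][0])
print(increase, rate)
```

Without the compensation, (v2-v1)/(t2-t1) on these points would give a negative rate, which makes no sense for a counter.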

For a slightly longer description, see:
