Why don't I see gaps in instance vectors if Prometheus itself is down for < 5 minutes?

vteja...@gmail.com

Jan 18, 2022, 6:54:07 PM
to Prometheus Users
Hi,

I'm running Prometheus (2.32.0-beta.0) without changing any CLI flags. If I shut it down for less than 5 minutes, I don't see a gap in my graphs. If I shut it down for more than 5 minutes, I see a gap. Is this expected behaviour?

Thanks,
Teja
(screenshot attached: ppk.PNG)

Brian Candler

Jan 19, 2022, 2:59:57 AM
to Prometheus Users
Yes.  Google "Prometheus Staleness".

In short: the value of a metric at time T is the most recent value recorded on or before time T, and prometheus will look backwards in time up to --query.lookback-delta (default 5 minutes) to find this value.
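For example, the window can be set explicitly when starting Prometheus (the config file path below is just a placeholder):

    # 5m is already the default; shown here only to make the flag visible
    prometheus --config.file=/etc/prometheus/prometheus.yml --query.lookback-delta=5m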

Brian Candler

Jan 19, 2022, 3:57:40 AM
to Prometheus Users
I believe the following is true as well:

- if prometheus does a scrape and a metric is missing (in the exporter output) which was present in a previous scrape, it's immediately marked as "stale", i.e. a staleness marker is inserted into the timeseries.

- however in your case, you're turning off the prometheus server, so there's no scraping taking place.  You just get a point in the timeseries at the time of the last scrape before prometheus shut down, and then at the first scrape after prometheus starts up.  There is no indication within the timeseries data itself that anything is missing; therefore, PromQL queries will look back up to 5 minutes.
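To make the difference concrete, here is a rough sketch using the HTTP query API (the metric name "demo_metric" and localhost:9090 are just placeholders):

    # Case 1: the exporter dropped demo_metric but Prometheus kept scraping the target.
    # A staleness marker is written at the next scrape, so an instant query
    # returns an empty result straight away:
    curl -s 'http://localhost:9090/api/v1/query?query=demo_metric'

    # Case 2: Prometheus itself was stopped for ~3 minutes and then restarted.
    # No staleness marker exists, so the same query still returns the last
    # pre-shutdown sample until it falls outside --query.lookback-delta (5m by default):
    curl -s 'http://localhost:9090/api/v1/query?query=demo_metric'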

vteja...@gmail.com

Jan 20, 2022, 6:45:46 AM
to Prometheus Users
Thanks for the explanation. I thought staleness applied only to Prometheus targets; I hadn't imagined the concept extending to Prometheus restarts and unavailability. So you're saying 'staleness' also applies to Prometheus's own availability.

Is there a way to know whether a metric or series is marked stale, either through a PromQL query over the HTTP API (e.g. with curl) or through the expression browser?

After reading a few Prometheus docs and Q&As: is it still good practice to keep scrape intervals at around 2 minutes and leave the default query lookback-delta flag unchanged?

vteja...@gmail.com

Jan 20, 2022, 8:27:29 AM
to Prometheus Users
With this approach, how do users know what actually happened? Why did Prometheus fall back to the query lookback: because a target was unavailable or unreachable, or because Prometheus itself was unavailable?

Brian Candler

Jan 20, 2022, 11:45:49 AM
to Prometheus Users
On Thursday, 20 January 2022 at 11:45:46 UTC vteja...@gmail.com wrote:
Thanks for the explanation. I thought staleness applied only to Prometheus targets; I hadn't imagined the concept extending to Prometheus restarts and unavailability. So you're saying 'staleness' also applies to Prometheus's own availability.

No, I'm saying the opposite.

If prometheus fails to scrape a metric which it scraped before in the same scrape job, it inserts a staleness marker.  However if you stop and start prometheus, then there is no staleness marker to write.

Prometheus therefore falls back to its normal default behaviour, which is to look back up to 5 minutes for the previous valid data point.

> With this approach, how do users know what actually happened? Why did Prometheus fall back to the query lookback: because a target was unavailable or unreachable, or because Prometheus itself was unavailable?

None of those.  It's quite simply because time series consist of values at particular points in time, e.g. X1 at T1, X2 at T2, X3 at T3, where Tn are the exact times they were scraped.

When you ask for the value of a timeseries at some arbitrary time T, there is almost certainly not going to be any data point which exists at exactly time T (it would be extremely unlikely).  Therefore, Prometheus defines the value of a timeseries at time T to be the value of the *most recent data point* at or before time T.  But it also constrains itself to looking back no more than 5 minutes (this is tunable) so as not to expend an unlimited amount of effort looking for a data point hours or even years earlier.
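As an illustration of that rule, an instant query can be pinned to an explicit evaluation time via the HTTP API's "time" parameter (the metric name and timestamp below are made up):

    # Evaluate demo_metric "as of" a given instant; Prometheus returns the most recent
    # sample at or before that time, provided it lies within the lookback window:
    curl -s 'http://localhost:9090/api/v1/query?query=demo_metric&time=2022-01-20T02:00:00Z'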

Think about what happens when prometheus draws a graph.  It samples the timeseries at a series of steps across a time window: say at time 01:00, 01:30, 02:00, 02:30, 03:00 etc. (times here are minutes:seconds).  The start/end times and the size of the steps will be determined by your graphing software and your screen resolution.

Now say you are scraping data points at 1 minute intervals, and points were read in as X1 at 01:17, X2 at 02:18, X3 at 03:17.

The graph will show:
01:00 - no data (no value within the previous 5 minutes)
01:30 - value is X1
02:00 - value is X1
02:30 - value is X2
03:00 - value is X2
03:30 - value is X3

Note that a timeseries has no idea of what its "scrape interval" is, because there isn't one.  Although *normally* they are scraped at *roughly* regular intervals, nothing enforces this.  You could have a scrape job running at 1m intervals, and then switch it to 15s intervals for a while, and then switch it back to 1m intervals.  All the points will be saved in the timeseries.  But if you shut down prometheus, well, there's no way of knowing this has occurred.  There will be a larger interval between scrapes than "normal", but as far as prometheus knows, you might just have missed a couple of scrapes, or increased the scrape interval for a little while.
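You can see the stepping behaviour directly with a range query over the HTTP API; the sketch below is only illustrative (metric name, timestamps and step are placeholders):

    # Sample demo_metric every 30 seconds across a window; each step takes the most
    # recent sample at or before that step, subject to the lookback limit:
    curl -s 'http://localhost:9090/api/v1/query_range?query=demo_metric&start=1642640400&end=1642641000&step=30s'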

vteja...@gmail.com

Jan 20, 2022, 9:01:37 PM
to Prometheus Users
Thank you very much for the detailed explanation!

I will write here what I understood, please shout if I am wrong:
  1. If we have a self-monitoring job in Prometheus and if it restarts
       a. if restart time > 5 minutes, we see gaps, and no staleness markers are applied by Prometheus because its own process was restarted
       b. if restart time <= 5 minutes, there are no gaps in the graphs; Prometheus fills in the last known (most recently scraped) values.
  2. If a series is marked stale, Prometheus writes a special NaN value (a staleness marker) into the TSDB for that series.
  3. Gaps in graphs mean that the target is unavailable or unreachable.
A few more questions on this subject:
  1. Is there a metric that gives us a hint about the number of stale series?
  2. How do we know if a series is marked stale?
  3. Is it a good idea to adjust the query lookback-delta CLI flag?
  4. Can I set the scrape interval of a job to 20 minutes? At the moment, the query lookback delta can't be adjusted per scrape job.

Brian Candler

Jan 21, 2022, 3:17:33 AM
to Prometheus Users
> Can I set the scrape interval of a job to 20 minutes? At the moment, the query lookback delta can't be adjusted per scrape job.

It's not recommended to scrape less frequently than once every 2 minutes.  With the default 5-minute lookback, this gives a degree of robustness against losing a single scrape.

In theory, you could set the lookback to say 50 minutes and then scrape every 20 minutes.  Like I say, it's not recommended, and as you've observed, this is a global setting.

Is there a particular reason why scraping every 2 minutes can't be done?  Don't worry about TSDB storage.  Prometheus does delta-compression, so if repeated scrapes of the same exporter give the same value, the difference between them is zero and it uses hardly any storage at all (just the timestamp deltas).
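For what it's worth, a per-job scrape interval is just part of the scrape config; the fragment below is a sketch (job name, target and file path are made up), and promtool can sanity-check it before you merge the scrape_configs entry into your real prometheus.yml:

    cat > /tmp/scrape-sketch.yml <<'EOF'
    scrape_configs:
      - job_name: 'expensive-exporter'   # hypothetical job name
        scrape_interval: 2m              # <= 2m keeps a single missed scrape inside the 5m lookback
        static_configs:
          - targets: ['localhost:9100']  # hypothetical target
    EOF
    promtool check config /tmp/scrape-sketch.yml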

If the problem is that your scrape task is expensive to run, then run it from a cronjob and put the output somewhere where it can be scraped (e.g. node_exporter textfile collector).  This is a good idea anyway for expensive metrics, as it avoids DoS problems if multiple clients are scraping the same exporter.
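A rough sketch of that pattern (script name, schedule and directory are assumptions; node_exporter would need to be started with --collector.textfile.directory pointing at the same path):

    # crontab entry: run the expensive collection every 20 minutes and rename the output
    # into place atomically so node_exporter never reads a half-written file
    */20 * * * * /usr/local/bin/expensive_metrics.sh > /var/lib/node_exporter/textfile/expensive.prom.tmp && mv /var/lib/node_exporter/textfile/expensive.prom.tmp /var/lib/node_exporter/textfile/expensive.prom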

I can't really answer your other questions about staleness markers.  My understanding is that staleness markers are not exposed to users (even though internally they're a special kind of NaN); so if you query a timeseries which is stale, I would expect that the vector result would not include that timeseries - it would be just as if the timeseries did not exist at that point in time.  In other words, I'd expect that count(foo) would give the number of timeseries for metric "foo" which are not stale.  But that's just my expectation; you should test it if it matters to you.  It's completely different from the question you originally raised about stopping and restarting prometheus.
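If you want to test that expectation, something like this would do it ("foo" and localhost:9090 are placeholders):

    # If staleness markers drop a series out of instant vectors, count(foo) should
    # only count the non-stale series for metric "foo":
    curl -s 'http://localhost:9090/api/v1/query?query=count(foo)'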

vteja...@gmail.com

Jan 21, 2022, 8:28:24 AM
to Prometheus Users
Sounds reasonable; scraping at around 2-minute intervals should be fine, and storage shouldn't be a problem.
I do wonder about heavy ingestion loads on Prometheus when it's deployed as a standalone instance; I will test this and see whether there's an impact.

I agree that I deviated a bit from my original question. But you helped me understand the concept and reasoning.

Thanks a ton Brian Candler, really appreciate your efforts!

Cheers,
Teja
