Facing 5m staleness issue even with 2.x

Aniket Kulkarni

Apr 19, 2022, 6:03:00 AM

to promethe...@googlegroups.com

Hi,

I have referred to the links below.

I understand this was a problem with 1.x

https://github.com/prometheus/prometheus/issues/398

I also got this link as a solution

https://promcon.io/2017-munich/talks/staleness-in-prometheus-2-0/

No doubt it's a great session, but I am still not clear on what change I have to make, and where.

I also didn't find the Prometheus docs helpful for this.

I am using the following tech stack:

Gatling -> graphite-exporter -> Prometheus -> Grafana

I am still facing the staleness issue. Please guide me on the solution, or on any extra configuration that is needed.

I am using the default storage that ships with Prometheus, not an external one.

Stuart Clark

Apr 19, 2022, 7:14:32 AM

to Aniket Kulkarni, promethe...@googlegroups.com

Could you describe a bit more of the problem you are seeing and what you
are wanting to do?

All time series will be marked as stale if they have not been scraped
for a while, which causes data to stop being returned by queries. This
is important because things like labels will change over time (especially
for things like Kubernetes, which include pod names). It is expected that
targets will be regularly scraped, so things shouldn't otherwise
disappear (unless there is an error, which should be visible via
something like the "up" metric).

As the standard staleness interval is 5 minutes, it is recommended that
the maximum scrape period should be no more than 2 minutes (to allow for
a failed scrape without the time series being marked as stale).
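
The arithmetic behind this recommendation can be sketched as follows. This is a minimal illustration, not Prometheus code: the function name is made up for this example, and it assumes that after a failed scrape the newest sample is two scrape intervals old just before the next successful scrape, and that the 5-minute window is left at the default mentioned above.

```python
from datetime import timedelta

# Default window mentioned above; assumed unchanged in this sketch.
STALENESS_WINDOW = timedelta(minutes=5)

def survives_missed_scrapes(scrape_interval, missed=1, window=STALENESS_WINDOW):
    """Return True if the gap left by `missed` consecutive failed scrapes
    still fits strictly inside the staleness window."""
    # After `missed` failures, the newest sample is (missed + 1) intervals
    # old just before the next successful scrape lands.
    worst_gap = scrape_interval * (missed + 1)
    return worst_gap < window

# A 2-minute interval leaves a 4-minute gap; a 3-minute one leaves 6 minutes.
print(survives_missed_scrapes(timedelta(minutes=2)))
print(survives_missed_scrapes(timedelta(minutes=3)))
```

The comparison is strict, so an interval of exactly 2.5 minutes is treated as too long; that is a deliberately conservative choice in this sketch.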

--
Stuart Clark

Aniket Kulkarni

Apr 19, 2022, 7:26:31 AM

to Stuart Clark, promethe...@googlegroups.com

Thanks for the response, Stuart.

To explain further:

I am load testing an application with Gatling scripts (similar to JMeter).

I want real-time monitoring of this load test.

For this, Gatling supports the Graphite writer protocol (it can't talk to Prometheus directly, hence I have used graphite-exporter in between).

Prometheus then collects the metrics sent by Gatling and provides them to Grafana to plot the graphs.

The problem is that I am getting graphs, but even after my load test has finished, I see the last value repeating on the graph for 5 minutes.

This is a known issue with Prometheus, so I am confused about how to resolve it. Does any configuration need to be added to the prometheus.yml file?

Please let me know if you need any further details.

Brian Candler

Apr 19, 2022, 9:14:23 AM

to Prometheus Users

This is an issue with graphite-exporter, not prometheus or staleness.

The problem is this: if your application simply stops sending data to graphite-exporter, then graphite-exporter has no idea whether the time series has finished or not, so it keeps exporting it for a while.

See https://github.com/prometheus/graphite_exporter#usage

"To avoid using unbounded memory, metrics will be garbage collected five minutes after they are last pushed to. This is configurable with the --graphite.sample-expiry flag."

Once graphite-exporter stops exporting the metric, then on the next scrape prometheus will see that the timeseries has gone, and it will immediately mark it as stale (i.e. has no more values), and everything is fine.

Therefore, reducing --graphite.sample-expiry may help, although you need to know how often your application sends graphite data; if you set this too short, then you'll get gaps in your graphs.
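
To make the trade-off concrete, here is a rough simulation in Python. It is a simplified model, not graphite-exporter's actual logic: the function name and all numbers are invented for illustration, and it assumes the exporter stops exposing a metric exactly `expiry` seconds after the last push, whereas the real timing may differ slightly.

```python
def exposed_at_scrapes(push_times, expiry, scrape_interval, end):
    """For each scrape time from 0 to `end`, report whether the exporter
    would still expose a sample, assuming a metric is dropped `expiry`
    seconds after the most recent push (all values in seconds)."""
    results = []
    t = 0
    while t <= end:
        earlier = [p for p in push_times if p <= t]
        last_push = max(earlier) if earlier else None
        exposed = last_push is not None and (t - last_push) < expiry
        results.append((t, exposed))
        t += scrape_interval
    return results

# Hypothetical run: pushes every 10 s from t=0 to t=60, scraped every 15 s.
pushes = list(range(0, 61, 10))
for expiry in (300, 15, 5):
    visible = [t for t, ok in exposed_at_scrapes(pushes, expiry, 15, 120) if ok]
    print(f"expiry={expiry}s -> sample visible at scrapes {visible}")
```

With these made-up numbers, a 300-second expiry keeps the last value visible at every scrape through t=120 (the "repeating last value" effect), a 15-second expiry drops it by the scrape at t=75, and a 5-second expiry already misses the scrape at t=15 while the test is still running, which is the kind of gap described above.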

Another option you could try is to get your application to send a "NaN" value at the end of this run. But technically this is a real NaN value, not a staleness marker (staleness markers are internally represented as a special kind of NaN, but that's an implementation detail that you can't rely on). Still, a NaN may be enough to stop Grafana showing any values from this point onwards.
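
A sketch of what sending such a final NaN could look like, written in Python for illustration. Several things here are assumptions rather than facts from this thread: the metric path is hypothetical, the host and port are guesses about where graphite-exporter listens for Graphite plaintext input (check your own --graphite.listen-address setting), and whether the exporter accepts and exposes a "NaN" value should be verified before relying on it.

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one sample in the Graphite plaintext protocol:
    '<metric.path> <value> <unix_timestamp>' followed by a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"

def send_end_marker(path, host="localhost", port=9109):
    """Send a single NaN sample for `path` over TCP.
    host and port are assumptions; adjust them to your exporter setup."""
    line = graphite_line(path, "NaN")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

# Hypothetical usage at the end of a load test (metric path is made up):
# send_end_marker("gatling.mysimulation.allRequests.ok.count")
```

One NaN line would be needed per metric path that the test wrote to, since each path becomes its own time series.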

Aniket Kulkarni

Apr 19, 2022, 10:55:12 AM

to promethe...@googlegroups.com

Thanks a lot, Brian.

Setting the --graphite.sample-expiry flag solved the issue.

For now, I have set it to 15 seconds. Any guidance on how to choose the correct value would be appreciated.

Brian Candler

Apr 19, 2022, 12:46:19 PM

to Prometheus Users

It depends on:

1. How often Gatling sends its graphite metrics

2. How often Prometheus scrapes graphite-exporter

If Prometheus is scraping graphite-exporter every 15 seconds, then you'll need to set --graphite.sample-expiry to at least 15 seconds; otherwise you may lose the last metric value written by Gatling.
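
The two factors above can be combined into a simple lower bound. The Python sketch below is only a rule of thumb derived from this thread, not an official formula: the helper names and the optional safety margin are invented for illustration, and the `15s` duration syntax is an assumption about how the flag parses its value.

```python
def minimum_sample_expiry(push_interval_s, scrape_interval_s, margin_s=0):
    """Lower bound (in seconds) for --graphite.sample-expiry: no shorter
    than the push interval (to avoid gaps) and no shorter than the scrape
    interval (to avoid losing the last value). margin_s is an optional,
    hypothetical safety margin."""
    if push_interval_s <= 0 or scrape_interval_s <= 0:
        raise ValueError("intervals must be positive")
    return max(push_interval_s, scrape_interval_s) + margin_s

def expiry_flag(seconds):
    """Render the flag as it might appear on the exporter command line."""
    return f"--graphite.sample-expiry={seconds}s"

# Example: the application pushes every second, Prometheus scrapes every 15 s.
print(expiry_flag(minimum_sample_expiry(push_interval_s=1, scrape_interval_s=15)))
# prints: --graphite.sample-expiry=15s
```

A larger value than this bound is safer against gaps, at the cost of the last value lingering on the graph for longer after the test ends.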
