Hi all, hoping someone can help me out on moving forward with my issues.
Apologies for the generalities. I have started experiencing the issues below but am at a loss as to how to determine the root cause. I am fairly new to this tech stack and have tried everything I can think of; please see below.
Issue:
Within the last couple of months our graphs have started showing erratic and/or missing data, to the point that they are unreliable. Both Grafana and Prometheus graphs show these odd patterns.




Setup:
We are running graphite_exporters on our app hosts, which Prometheus scrapes every minute. Data is remote-written to InfluxDB for backup and long-term retention. Retention on Prometheus is set to between ~2 weeks and ~2 months (depending on the env), so anything older than the retention setting comes from InfluxDB.
- graphite_exporters are set to produce metrics every minute
- prometheus is set to scrape every minute
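
Since both sides run on a one-minute cycle, a small drift between when a host pushes to graphite_exporter and when Prometheus scrapes it could make one scrape see no new value and the next scrape collapse two pushes into one. The sketch below only illustrates that effect, assuming the exporter exposes the most recently received value; the function name and all timestamps are made up for illustration.

```python
from bisect import bisect_right

def stale_scrapes(push_times, scrape_times):
    """Return the scrape times at which no new push had arrived
    since the previous scrape (times in seconds)."""
    pushes = sorted(push_times)
    stale = []
    seen_before = 0  # pushes already visible at the previous scrape
    for t in sorted(scrape_times):
        seen_now = bisect_right(pushes, t)  # pushes with time <= t
        if seen_now == seen_before:
            stale.append(t)
        seen_before = seen_now
    return stale

# Hypothetical timings: the third push arrives 5 s late (125 instead of 120),
# so the scrape at 124 sees nothing new.
pushes = [0, 60, 125, 180]
scrapes = [5, 65, 124, 185]
print(stale_scrapes(pushes, scrapes))  # [124]
```

If something like this were the cause, changing the scrape interval alone would not necessarily help, which would be consistent with the interval tests listed under troubleshooting.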
Theories:
I am going back and forth between 2 theories: either we have reached a limit (scraping?) or for some reason we have started to experience this:
https://github.com/prometheus/prometheus/issues/2364. I am starting to think it's not the latter, because our graphs were fine until just ~1-2 months ago.
One box is ingesting 4 million. On this box, one set of rules (rules/agrregation.rules) shows an evaluation time of about 6 minutes, while the rest of the rules on this box, and all rules on the other boxes, report evaluation times in milliseconds.
The other box is ingesting 2 million, and it now looks like it is also starting to exhibit these issues.
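
A rule group that takes about 6 minutes to evaluate may matter on its own. As a rough back-of-the-envelope sketch: assuming the evaluation interval is 60 seconds (the post does not state it) and that a new evaluation cannot start until the previous one has finished, far fewer evaluations complete per hour than the interval suggests, which would leave gaps in any series those rules produce. The function names are hypothetical.

```python
import math

def evaluations_per_hour(eval_duration_s, eval_interval_s):
    """Evaluations that can complete in one hour, assuming cycles start on
    interval boundaries and never overlap with the previous cycle."""
    # A slow cycle occupies a whole number of intervals (at least one).
    intervals_per_cycle = max(1, math.ceil(eval_duration_s / eval_interval_s))
    return 3600 // (intervals_per_cycle * eval_interval_s)

def expected_vs_actual(eval_duration_s, eval_interval_s):
    """Return (evaluations expected from the interval, evaluations that fit)."""
    expected = 3600 // eval_interval_s
    return expected, evaluations_per_hour(eval_duration_s, eval_interval_s)

# Assumed numbers: 6 minute evaluation, 60 s interval.
print(expected_vs_actual(360, 60))   # (60, 10)
# A group that evaluates in milliseconds keeps up with the interval.
print(expected_vs_actual(0.05, 60))  # (60, 60)
```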
It's hard to troubleshoot: when I set different time frames or load the same data at different times of the day, the graphs totally change shape, which hints at the 2364 issue above.
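
One possible reading of the shape changes is that each load samples the data at different instants. A range query is evaluated at start, start + step, and so on up to end, so two loads a little apart in time with the same step hit different timestamps. A minimal sketch of that effect, assuming integer Unix timestamps in seconds; the helper names are hypothetical and this is not how Grafana or Prometheus is implemented internally.

```python
def sample_times(start, end, step):
    """Instants at which a range query is evaluated: start, start+step, ... <= end."""
    return list(range(start, end + 1, step))

def aligned_range(start, end, step):
    """Snap start and end down to multiples of step so that repeated loads
    evaluate at the same instants."""
    return start - start % step, end - end % step

step = 300
first_load = (1000, 1600)   # hypothetical query window
second_load = (1070, 1670)  # same window, loaded 70 s later

print(sample_times(*first_load, step))   # [1000, 1300, 1600]
print(sample_times(*second_load, step))  # [1070, 1370, 1670]

# After alignment both loads evaluate at identical instants.
print(aligned_range(*first_load, step))   # (900, 1500)
print(aligned_range(*second_load, step))  # (900, 1500)
```

If aligned queries still change shape between loads, the cause is more likely in the stored data than in how the graphs are drawn.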
Troubleshooting:
- I have tested changing the scrape interval from 1m (our running setting) to 2m, 3m, 5m, and 10m. The graph issues remain unchanged.
- parsed /var/log/messages (no errors stand out)
- parsed the prometheus.log files (no errors stand out)
- parsed the influxdb.log files (no errors stand out)
- parsed the grafana logs (no errors stand out) ...although since the graphs in Prometheus exhibit the same behavior, I doubt this is a Grafana issue.
- prometheus targets healthy and produce metrics
- restarted prometheus, influxdb, and grafana; no change to the graph issues
- verified scrape configs
- verified aggregation rules
- looked at resources on server and in prometheus health graphs, looks like servers aren't even breaking a sweat
- agents working as expected (graphite_exporter)
- metrics load on the graphite_exporter's web scrape URL
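
One more check that could narrow this down is to look at the raw stored samples rather than the graphs, and see whether the timestamps themselves have holes. The sketch below flags gaps larger than the expected interval. It assumes the Prometheus HTTP API is reachable and uses an instant query with a range selector to get raw samples; the base URL, selector, and function names are placeholders, not values from our setup.

```python
import json
import urllib.parse
import urllib.request

def find_gaps(values, expected_interval_s, tolerance=1.5):
    """Return (previous_ts, next_ts) pairs where consecutive samples are
    further apart than tolerance * expected_interval_s."""
    gaps = []
    for (prev_ts, _), (next_ts, _) in zip(values, values[1:]):
        if next_ts - prev_ts > tolerance * expected_interval_s:
            gaps.append((prev_ts, next_ts))
    return gaps

def fetch_raw_samples(base_url, selector, window="1h"):
    """Fetch raw samples for each matching series via /api/v1/query.
    base_url and selector are placeholders supplied by the caller."""
    params = urllib.parse.urlencode({"query": f"{selector}[{window}]"})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}") as resp:
        body = json.load(resp)
    # Each series carries a list of [timestamp, "value"] pairs.
    return [series["values"] for series in body["data"]["result"]]

# Made-up samples scraped every 60 s with a hole between 120 and 360.
samples = [[0, "1"], [60, "1"], [120, "2"], [360, "3"], [420, "3"]]
print(find_gaps(samples, 60))  # [(120, 360)]
```

Gaps in the raw samples would point at scraping or ingestion, while clean raw samples with broken graphs would point at the queries or the aggregation rules.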
If anyone has any feedback or ideas, I would greatly appreciate it. Again, apologies; I do not know what else I can provide here to make the issue clearer.
Thanks,