Prometheus / Grafana / Graphite_Exporter - Missing Data


rbend...@gmail.com

Feb 25, 2019, 2:17:11 PM
to Prometheus Users
Hi all, hoping someone can help me out on moving forward with my issues.

Apologies for the generalities. I have started experiencing the issues below but am at a loss as to how to determine the root cause. I am fairly new to this tech stack and have tried everything I can think of; please see below.

Issue:

Within the last couple of months our graphs have become erratic and/or show missing data, to the point that they are unreliable. Both the Grafana and the Prometheus graphs show these odd patterns.

Capture_graph_issue_001.JPG



Capture_graph_issue_002.JPG


Capture_graph_issue_003.JPG


Capture_graph_issue_004.JPG



Setup:

We are running graphite_exporters on our app hosts, which Prometheus scrapes every minute. Data is remotely written to InfluxDB for backup and long-term retention. Retention on Prometheus is set to between ~2 weeks and ~2 months (depending on the environment), so data older than the configured retention comes from InfluxDB.

- graphite_exporters are set to produce metrics every minute
- prometheus is set to scrape every minute

Theories:

I am going back and forth between two theories: either we have reached a limit (scraping?), or for some reason we have started to experience this: https://github.com/prometheus/prometheus/issues/2364. I am starting to think it's not the latter, since our graphs were OK all the way up until just ~1-2 months ago.

One box is ingesting 4 million; on this box, one rule set (rules/aggregation.rules) shows an evaluation time of around 6 minutes. The rest of the rules on this box, and all rules on the other boxes, report evaluation times in milliseconds.
The other box is ingesting 2 million, and it now looks like this box may also be starting to exhibit these issues.

It's hard to troubleshoot: when I set different time frames, or load the same data at different times of the day, the graphs completely change shape, which hints at the 2364 issue above.

Troubleshooting:

- I have tested changing the scrape interval from 1m (our running setting) to 2m, 3m, 5m, and 10m. The graph issues remain unchanged.
- parsed /var/log/messages (no errors stand out)
- parsed the prometheus.log files (no errors stand out)
- parsed the influxdb.log files (no errors stand out)
- parsed the Grafana logs (no errors stand out), although since the graphs in Prometheus exhibit the same behavior I doubt this is a Grafana issue
- prometheus targets healthy and produce metrics
- restarted Prometheus, InfluxDB, and Grafana; no change to the graph issues
- verified scrape configs
- verified aggregation rules
- looked at resources on server and in prometheus health graphs, looks like servers aren't even breaking a sweat
- agents working as expected (graphite_exporter)
- metrics load on the graphite_exporter's web scrape URL
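
One more check that might help narrow this down is to look for holes in the raw sample timestamps of a single series, independent of how any graph renders them. Below is a minimal sketch of that idea; the function name, the tolerance value, and the sample timestamps are all made up for illustration, and in practice the timestamps would have to come from wherever you can export raw samples.

```python
from typing import List, Tuple


def find_gaps(timestamps: List[float], expected_interval: float,
              tolerance: float = 0.5) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where two consecutive samples are further
    apart than expected_interval * (1 + tolerance)."""
    threshold = expected_interval * (1 + tolerance)
    ordered = sorted(timestamps)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > threshold:
            gaps.append((previous, current))
    return gaps


# Hypothetical sample timestamps (in seconds) for a 60s scrape interval;
# the sample that would be expected at 180 is missing.
samples = [0, 60, 120, 240, 300]
print(find_gaps(samples, expected_interval=60))
```

If the raw samples show no gaps but the graphs still do, that would point toward query or rendering behavior rather than scraping.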

If anyone has any feedback or ideas I would greatly appreciate it. Again, apologies; I do not know what else I can provide here to make the issue clearer.

Thanks,

Joy Bhattacherjee

Feb 26, 2019, 8:26:30 AM
to rbend...@gmail.com, Prometheus Users
Try setting up a dashboard around the following metrics:
 - prometheus_config_last_reload_successful
 - alertmanager_config_last_reload_successful
 - sum(rate(prometheus_target_scrape_pool_sync_total[$__interval]))
 - avg(scrape_duration_seconds) by (service)
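
For spot-checking these expressions outside Grafana, here is a rough sketch that builds instant-query URLs for the Prometheus HTTP API (`/api/v1/query`). The base URL `http://localhost:9090` and the `5m` window are assumptions for illustration only; `$__interval` is a Grafana variable, so it has to be replaced with a fixed window when querying Prometheus directly.

```python
from urllib.parse import urlencode

# Expressions copied from the list above.
EXPRESSIONS = [
    "prometheus_config_last_reload_successful",
    "alertmanager_config_last_reload_successful",
    "sum(rate(prometheus_target_scrape_pool_sync_total[$__interval]))",
    "avg(scrape_duration_seconds) by (service)",
]


def build_query_url(base_url: str, expression: str, window: str = "5m") -> str:
    """Build an instant-query URL, swapping Grafana's $__interval
    for a fixed window (assumed default: 5m)."""
    expr = expression.replace("$__interval", window)
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": expr})


# The base URL below is a placeholder; point it at your own Prometheus server.
for expression in EXPRESSIONS:
    print(build_query_url("http://localhost:9090", expression))
```

Each printed URL can be opened with curl or a browser; the alertmanager_* metric will only return data if Prometheus is scraping Alertmanager.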

We found that the moment config-reloader fails to reload the config, or
either the Alertmanager or the Prometheus config has an unparseable entry,
Prometheus will start throwing a bunch of errors and start missing 1m-wide chunks of data.

You can grep the logs for the error, which will look like:

caller=manager.go:675 component="rule manager" msg="loading groups failed" err="group... 
caller=main.go:625 err="error loading config from \"/etc/prometheus/...\"
one or more errors occurred while applying the new configuration..
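
If grep is not convenient, the same search can be sketched in a few lines of Python. The marker strings are taken from the example log lines above; the function name and the sample lines are hypothetical, and real log lines will carry more fields than shown here.

```python
from typing import Iterable, List

# Substrings taken from the example log lines above.
ERROR_MARKERS = (
    "loading groups failed",
    "error loading config",
    "one or more errors occurred while applying the new configuration",
)


def find_reload_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain any of the reload-failure markers."""
    return [line.rstrip("\n") for line in lines
            if any(marker in line for marker in ERROR_MARKERS)]


# Hypothetical log excerpt; in practice, pass an open log file instead.
sample_log = [
    'caller=main.go:1 msg="Server is ready to receive web requests."\n',
    'caller=manager.go:675 component="rule manager" msg="loading groups failed"\n',
]
for hit in find_reload_errors(sample_log):
    print(hit)
```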


rbend...@gmail.com

Feb 26, 2019, 1:09:04 PM
to Prometheus Users
Thank you for the suggestion... I'll get right on this today.