Hi there,
I recently hit a missing-data-point issue with Prometheus and would like some help here. Thanks.
Issue:
Increasing scrape_interval in Prometheus resulted in missing data points.
My scenario:
I am using the Prometheus CloudWatch Exporter together with Prometheus to fetch the AWS CloudWatch CPUUtilization metric for EC2 instances. The key configs for the Exporter and Prometheus were initially as follows:
Config                          Value
scrape_interval (Prometheus)    120s
scrape_timeout (Prometheus)     60s
delay_seconds (Exporter)        600s
range_seconds (Exporter)        600s
period_seconds (Exporter)       60s
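For concreteness, here is a minimal sketch of what those settings look like in the two config files. The region, dimensions, job name, and target address are assumptions for illustration, not copied from my actual setup.

CloudWatch Exporter config.yml:

    region: us-east-1                    # assumed region
    metrics:
      - aws_namespace: AWS/EC2
        aws_metric_name: CPUUtilization
        aws_dimensions: [InstanceId]
        aws_statistics: [Average]
        period_seconds: 60
        range_seconds: 600
        delay_seconds: 600

Prometheus scrape job in prometheus.yml:

    scrape_configs:
      - job_name: cloudwatch             # assumed job name
        scrape_interval: 120s
        scrape_timeout: 60s
        static_configs:
          - targets: ['localhost:9106']  # the exporter's default port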
It was working fine with this set of configs, meaning the metrics I got from CloudWatch had no missing data points.
Later on, I increased the Prometheus scrape_interval to 320s and kept all other configs the same. I needed to do this for reasons I won't go into here. After this change, the same metric started to show missing values, as shown below:
(attached graph)
You can see the missing data around 11:30 and between 12:30 and 13:00. There are more of these data gaps in the metric, and I noticed that the length of each gap seems to match the scrape_interval config. For example, the first gap above runs from 11:24:26 to 11:30:08 and the second from 12:44:14 to 12:50:53; both are close to, but not exactly, the scrape_interval of 320s.
Is this a known issue? It is making my graphs look bad, and the Prometheus logs don't provide much useful information as far as I can tell. Any pointers on how to investigate this? Thanks!
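In case it helps to narrow things down, below is a sketch of queries that could be used to check whether the exporter or the scrape itself is dropping samples. The metric name follows the CloudWatch Exporter's naming convention for AWS/EC2 CPUUtilization (Average), and the "cloudwatch" job label is an assumption:

    # Timestamp of the newest sample Prometheus holds for the series
    timestamp(aws_ec2_cpuutilization_average)

    # Samples ingested per 30m window; with a 320s scrape_interval this
    # should be roughly 5-6 if no scrapes are being missed
    count_over_time(aws_ec2_cpuutilization_average[30m])

    # Scrape health and duration for the exporter job
    up{job="cloudwatch"}
    scrape_duration_seconds{job="cloudwatch"}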
The maximum scrape interval is 5 minutes (otherwise time series will be marked as stale), however it is recommended to have a maximum of 2-2.5 minutes to allow for a single scrape failure (which can happen due to a timeout or slight network issue) without staleness. Is there a reason you are trying to increase the scrape interval above 2 minutes?
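For context on where the 5-minute figure comes from: it is Prometheus's query lookback delta, the window an instant query searches backwards for the most recent sample, so any gap of more than 5 minutes between samples shows up as a hole in the graph. A sketch of the flag that controls it (raising it papers over sparse scrapes but also delays staleness detection, so shortening the scrape interval is usually the better fix):

    # --query.lookback-delta defaults to 5m
    prometheus \
      --config.file=/etc/prometheus/prometheus.yml \
      --query.lookback-delta=10m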
-- Stuart Clark
I have one more question about node_exporter: if I want to get EC2 instance CPU metrics for lots of clusters, do I need to run node_exporter on every node in every cluster? From the node_exporter docs, it looks like one exporter only collects metrics for the node it is running on, which means in my case I would need to install node_exporter on every node across all clusters. If that is the case, node_exporter might not work for me -- I can't run a node_exporter on every node. The CloudWatch exporter, on the other hand, can do this, since I only need one exporter instance to collect the CPU metrics for all EC2 instances in a region, but it's just slow.
Yes, you would install the node exporter on each EC2 instance. A common way to do that is to build it into the AMIs you are using, or to use cloud-init to add it on startup. In addition to CPU you get a lot more metrics that CloudWatch isn't able to supply - full details about networking, memory, disk, systemd, etc.
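To avoid listing every instance by hand, Prometheus can discover the node exporters via EC2 service discovery. A minimal sketch, assuming the exporters listen on the default port 9100 and that the region and credentials are adjusted for your environment:

    scrape_configs:
      - job_name: node
        ec2_sd_configs:
          - region: us-east-1            # assumed region
            port: 9100                   # node_exporter default port
        relabel_configs:
          # use the instance's Name tag as a friendlier label (assumes the tag exists)
          - source_labels: [__meta_ec2_tag_Name]
            target_label: instance_name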
CloudWatch is known to be slow, not just the actual API calls but also the time it takes for metrics to be available (a value returned by the API might be comparatively old rather than real-time). Using the node exporter is also likely to be cheaper, as the only cost is network bandwidth rather than the various API calls.
-- Stuart Clark