Increasing scrape_interval causing missing data points for metrics

chuanjia xing

Mar 22, 2021, 5:48:27 PM
to Prometheus Users

Hi there, 

      I recently hit a missing-data-point issue using Prometheus and would like to get some help here. Thanks.

Issue:

Increasing scrape_interval in prometheus resulted in missing data points.

My scenario:

I am using the Prometheus CloudWatch Exporter plus Prometheus to fetch AWS CloudWatch metrics for EC2 instance CPUUtilization. The key configs for the Exporter and Prometheus were initially as follows:

Config                              Value

scrape_interval (Prometheus)        120s
scrape_timeout (Prometheus)         60s
delay_seconds (Exporter)            600s
range_seconds (Exporter)            600s
period_seconds (Exporter)           60s

It works fine with this set of configs, meaning the metrics I get from CloudWatch have no missing data points.
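For reference, the two configs look roughly like this (a sketch only; the job name, exporter address, and region below are placeholders for illustration):

    # prometheus.yml (Prometheus side)
    scrape_configs:
      - job_name: cloudwatch          # placeholder job name
        scrape_interval: 120s
        scrape_timeout: 60s
        static_configs:
          - targets: ['cloudwatch-exporter:9106']   # exporter's default port

    # CloudWatch exporter config (Exporter side)
    region: us-east-1                 # example region
    delay_seconds: 600
    range_seconds: 600
    period_seconds: 60
    metrics:
      - aws_namespace: AWS/EC2
        aws_metric_name: CPUUtilization
        aws_dimensions: [InstanceId]
        aws_statistics: [Average]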

Later on, I increased the Prometheus scrape_interval to 320s and kept all other configs the same. I need to do this for reasons I won't go into here. After this change, the same metrics started to show missing values, as shown below:

(attached graph)

You can see the missing data around 11:30 and between 12:30 and 13:00.

There are more of these data gaps in the metrics. Something I noticed is that the length of each gap seems to match the scrape_interval config. For example, the first gap above is from 11:24:26 to 11:30:08; the second is from 12:44:14 to 12:50:53. Both gap lengths are close to, but not exactly, the scrape_interval of 320s.

Is this a known issue? It is making my graphs look bad. The Prometheus logs don't provide much useful information, as far as I can find.

Any pointers on how to investigate this issue? Thanks!


(Attachment: Screen Shot 2021-03-22 at 2.18.45 PM.png)

Stuart Clark

Mar 22, 2021, 6:07:08 PM
to chuanjia xing, Prometheus Users

The maximum scrape interval is 5 minutes (otherwise time series will be marked as stale); however, it is recommended to keep it to 2-2.5 minutes at most, to allow for a single scrape failure (which can happen due to a timeout or a slight network issue) without staleness. Is there a reason you are trying to increase the scrape interval above 2 minutes?
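As an illustration only (my reading of the defaults, not something specific to your setup): graph queries only look back 5 minutes for the most recent sample, so with a 320s interval even a single slow or failed scrape leaves a hole of roughly one scrape interval, which lines up with the gap lengths you measured. A query-side workaround is a range function, for example something like

    avg_over_time(aws_ec2_cpuutilization_average[15m])

(assuming the exporter's default metric name for EC2 CPUUtilization), or raising the --query.lookback-delta flag on Prometheus, but neither is a substitute for keeping the interval within the limits above.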

-- 
Stuart Clark

chuanjia xing

Mar 22, 2021, 6:22:29 PM
to Prometheus Users
Thanks for your quick response Stuart!
The reason I increased the scrape_interval beyond 2 minutes is that I have several AWS regions to query for EC2 CPUUtilization metrics, and for some regions the Exporter takes ~3 minutes to return the CloudWatch metrics. If it takes 3 minutes, then on the Prometheus side I have to set scrape_timeout to more than 3 minutes, otherwise the scrape will time out; and scrape_interval needs to be no less than scrape_timeout, otherwise Prometheus will reject the config, so I have to set a longer scrape_interval here.

So in my case, where the CloudWatch Exporter takes a long time to get the metrics, do you see any way I can get around the missing-data issue? Thanks!

Ben Kochie

Mar 22, 2021, 6:35:24 PM
to chuanjia xing, Prometheus Users
You should gather CPU utilization from the node_exporter, not CloudWatch. This is much more scalable and won't run into these problems.


chuanjia xing

Mar 22, 2021, 6:53:22 PM
to Prometheus Users
Thanks. The reason I am using the CloudWatch exporter is that I want to get CPUUtilization metrics per cluster / service, not at the node level.
I haven't used node_exporter before; I am not sure whether it can give me CPUUtilization metrics per cluster / service?

Stuart Clark

Mar 22, 2021, 7:02:03 PM
to chuanjia xing, Prometheus Users
On 22/03/2021 22:53, chuanjia xing wrote:
> Thanks. The reason I am using cloudwatch exporter is because I want to
> get cpuutilization metrics per cluster / service, not on the node level.
> I haven't used node_exporter before, not sure if I can get
> cpuutilization metrics for per cluster / service?

Node exporter gathers metrics for a single EC2 instance, but once
scraped you can use PromQL to aggregate things together as desired. A
common method is to scrape the instances using the EC2 service discovery
mechanism and use relabelling to add labels from various tags (for example
cluster and service tags).
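
For example, a sketch along these lines (region, port, and tag names are placeholders; use whatever tags your instances actually carry):

    scrape_configs:
      - job_name: node
        ec2_sd_configs:
          - region: us-east-1        # example region
            port: 9100               # node_exporter's default port
        relabel_configs:
          # copy the EC2 "cluster" and "service" tags onto every scraped series
          - source_labels: [__meta_ec2_tag_cluster]
            target_label: cluster
          - source_labels: [__meta_ec2_tag_service]
            target_label: service

You can then aggregate in PromQL, e.g. something like

    100 * avg by (cluster, service) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))

to get average CPU utilisation per cluster and service.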

--
Stuart Clark

Stuart Clark

Mar 22, 2021, 7:03:48 PM
to chuanjia xing, Prometheus Users
On 22/03/2021 22:53, chuanjia xing wrote:
> Thanks. The reason I am using cloudwatch exporter is because I want to
> get cpuutilization metrics per cluster / service, not on the node level.
> I haven't used node_exporter before, not sure if I can get
> cpuutilization metrics for per cluster / service?

I would suggest only using the Cloudwatch exporter for things which you
can't get elsewhere. So for example use node exporter, MySQL exporter,
etc. for the main metrics, but the cloudwatch exporter for things like
ALB or NAT gateway metrics (which aren't available elsewhere).
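
For example, a CloudWatch exporter config limited to ALB metrics might look roughly like this (a sketch; region, metric, and dimensions are just illustrative):

    region: us-east-1                  # example region
    metrics:
      - aws_namespace: AWS/ApplicationELB
        aws_metric_name: RequestCount
        aws_dimensions: [LoadBalancer]
        aws_statistics: [Sum]

With only a handful of such metrics the scrapes stay fast, while the high-volume per-instance data comes from the node exporter.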

--
Stuart Clark

chuanjia xing

Mar 22, 2021, 7:09:59 PM
to Prometheus Users
Thanks Stuart. I didn't know node exporter could also collect metrics at the instance level. If it can get per-instance CPU metrics faster than the CloudWatch exporter, that should satisfy my requirements. I'll take a look at node exporter then.

chuanjia xing

Mar 22, 2021, 7:30:01 PM
to Prometheus Users
I have one more question about node_exporter: if I want to get EC2 instance CPU metrics for lots of clusters, do I need to run node_exporter on every node in all clusters? From the node_exporter docs, it looks like one exporter only collects metrics for the node it's running on, which means in my case I would need to install node_exporter on every node in all clusters.
If that is the case, then node_exporter might not work for my case -- I can't run a node_exporter on every node. The CloudWatch exporter can do this, since I only need one exporter instance to collect all EC2 instance CPU metrics in one region; it's just slow.

Stuart Clark

Mar 22, 2021, 8:02:43 PM
to chuanjia xing, Prometheus Users
On 22/03/2021 23:30, chuanjia xing wrote:
> I have one more question about node_exporter: if I want to get EC2 instance CPU metrics for lots of clusters, do I need to run node_exporter on every node in all clusters? From the node_exporter docs, it looks like one exporter only collects metrics for the node it's running on, which means in my case I would need to install node_exporter on every node in all clusters.
> If that is the case, then node_exporter might not work for my case -- I can't run a node_exporter on every node. The CloudWatch exporter can do this, since I only need one exporter instance to collect all EC2 instance CPU metrics in one region; it's just slow.

Yes, you would install the node exporter on each EC2 instance. A common way to do that is to build it into the AMIs you are using, or to use cloud-init to add it on startup. In addition to CPU you get a lot more metrics that CloudWatch isn't able to supply - full details about networking, memory, disk, systemd, etc.

Cloudwatch is known to be slow, not just the actual API calls but also the time it takes for metrics to be available (a value returned by the API might be comparatively old rather than being real-time). Using the Node exporter is also likely to be cheaper as the only costs are network bandwidth rather than the various API calls.
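
If you go the cloud-init route, a minimal user-data sketch could look something like this (the node_exporter version and download URL are just an example; pin whatever release you actually want):

    #cloud-config
    write_files:
      - path: /etc/systemd/system/node_exporter.service
        content: |
          [Unit]
          Description=Prometheus Node Exporter
          [Service]
          ExecStart=/usr/local/bin/node_exporter
          Restart=always
          [Install]
          WantedBy=multi-user.target
    runcmd:
      # download the node_exporter binary and install it (example version)
      - curl -sL https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz | tar -xzf - --strip-components=1 -C /usr/local/bin node_exporter-1.1.2.linux-amd64/node_exporter
      - systemctl daemon-reload
      - systemctl enable --now node_exporter

Baking the same thing into the AMI avoids the download at boot time.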

-- 
Stuart Clark

chuanjia xing

Mar 22, 2021, 8:25:31 PM
to Prometheus Users
Thanks Stuart. I'll need to think about whether it's doable in my case to run node_exporter on each EC2 instance. I am on an infra team, and doing that will have a lot of impact which I need to evaluate. But thanks for your suggestions.

One more question regarding the CloudWatch exporter: in my case, another option (actually my first choice) to get cluster / service level CPU metrics is, instead of querying EC2 instance metrics, to collect AutoScalingGroup CPU metrics, which would be faster since the number of ASGs is much smaller than the number of EC2 instances. But unfortunately the CloudWatch exporter doesn't support ASG metrics directly, since the AWS API it uses doesn't support ASG: https://docs.aws.amazon.com/resourcegroupstagging/latest/APIReference/supported-services.html

I am actually wondering whether I can get around this limitation by using a different AWS API. Do you know if that is doable? (I can ask this question in a separate conversation if needed.)

Thanks.
