What causes "counter jumped backwards" in node-exporter

Bruce Merry

Jul 20, 2020, 2:11:38 PM
to Prometheus Users
We've recently upgraded a fleet of machines to both Ubuntu 18.04 (Linux 5.3) and node-exporter to 1.0.1 (run under Docker). We're now seeing a lot of messages like these:

CPU Idle counter jumped backwards, possible hotplug event, resetting CPU stats
CPU User counter jumped backwards
CPU System counter jumped backwards
CPU Iowait counter jumped backwards
CPU SoftIRQ counter jumped backwards

I examined a handful of messages and the counter seems to go backwards by a very small amount (0.01). Over the last 3 days, 74% of the messages are from one machine and 24% from another, with multiple machines making up the rest.

I assume that this is something odd in the system itself and that node-exporter is just reporting what it sees. But any tips on what might cause it / how to fix it, and whether it is worth worrying about? Is it likely to be caused by running under Docker?

Here's the uname and docker-compose.yml for completeness:

Linux imgr1 5.3.0-62-generic #56~18.04.1-Ubuntu SMP Wed Jun 24 16:17:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

version: '2'
services:
  node-exporter:
    image: prom/node-exporter:v1.0.1
    command:
      - '--collector.textfile.directory=/textfile'
      - '--no-collector.zfs'
      - '--no-collector.wifi'
      - '--no-collector.bcache'
      - '--path.rootfs=/host'
      - '--collector.vmstat.fields=^(oom_kill|pgpg|pswp|nr|pg.*fault).*'
      - '--collector.filesystem.ignored-fs-types=^(tmpfs|autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|fuse.lxcfs|nfs|nsfs|ceph|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$$'
    network_mode: host
    pid: host
    volumes:
      - /var/spool/node-exporter:/textfile:ro
      - /:/host:ro,rslave

Thanks
Bruce

Ben Kochie

Jul 20, 2020, 2:24:59 PM
to Bruce Merry, Prometheus Users
Thanks for posting this. I've been looking for more cases where people see these issues.

Linux does not do any mutex locking of the CPU metric counters, so we added some checks to the node_exporter to detect and mitigate spurious counter resets.
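Roughly, the mitigation works like this (a simplified sketch only, not node_exporter's actual code; the type and field names here are made up): an idle counter that jumps backwards is treated as a possible hotplug and triggers a reset, while a small dip in any other counter is simply ignored by keeping the previous value.

```go
package main

import "fmt"

// perCPUStat holds per-mode CPU time in seconds, as parsed from
// /proc/stat. Names are illustrative, not node_exporter's own.
type perCPUStat struct {
	User, System, Idle, Iowait float64
}

// update decides what to keep after a new sample: an idle jump-back
// triggers a full reset (possible hotplug), while a dip in any other
// counter is ignored by retaining the previous value.
func update(prev, next perCPUStat) (perCPUStat, bool) {
	if next.Idle < prev.Idle {
		return next, true // reset CPU stats, as the log message says
	}
	if next.User < prev.User {
		next.User = prev.User
	}
	if next.System < prev.System {
		next.System = prev.System
	}
	if next.Iowait < prev.Iowait {
		next.Iowait = prev.Iowait
	}
	return next, false
}

func main() {
	prev := perCPUStat{User: 100.00, System: 50.00, Idle: 1000.00, Iowait: 10.00}
	// A 0.01 s backwards jump on iowait, like the ones in the logs above.
	next := perCPUStat{User: 100.05, System: 50.01, Idle: 1000.20, Iowait: 9.99}
	kept, reset := update(prev, next)
	fmt.Println(kept.Iowait, reset) // the dip is dropped, no full reset
}
```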

In all of my testing and evidence, I've only seen this happen on iowait data, so it's interesting that you see it on other counters too. I don't think your Docker use has any impact; I'd suspect it has more to do with the underlying server environment. Is this bare metal? VMs? What hypervisor?

See the discussion and code in this issue: https://github.com/prometheus/node_exporter/issues/1686




Bruce Merry

Jul 20, 2020, 2:46:33 PM
to Prometheus Users
On Monday, July 20, 2020 at 8:24:59 PM UTC+2 sup...@gmail.com wrote:
Thanks for posting this. I've been looking for more cases where people see these issues.

Linux does not do any mutex locking of the CPU metric counters, so we added some checks to the node_exporter to detect and mitigate spurious counter resets.

In all of my testing and evidence, I've only seen this happen on iowait data, so it's interesting that you see it on other counters too. I don't think your Docker use has any impact; I'd suspect it has more to do with the underlying server environment.

Looking at the last week of logs, I see 41% for user, 31% for idle, 26% for system, 1.4% for iowait, 0.3% for softirq.
 
Is this bare metal? VMs? What hypervisor?

It's bare metal. What's also odd is that it's restricted to one batch of machines that share the same hardware and run the same workloads. Other machines (with different hardware and workloads) don't exhibit these warnings, despite having been deployed at the same time with the same kernel and OS. The affected machines have 12 cores with hyperthreading enabled (so 24 virtual cores), while the other machines generally have up to 8 cores and no HT, so the affected machines possibly have a higher chance of running into race conditions on the kernel data structures.
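One way to check whether the kernel itself emits these jumps, independent of node-exporter, would be a small watcher that re-reads /proc/stat and prints any per-CPU field that decreases between samples. A rough diagnostic sketch (my own, not an official tool):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readStat parses the per-CPU lines of /proc/stat into raw tick counts.
func readStat(path string) (map[string][]uint64, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	out := make(map[string][]uint64)
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 || !strings.HasPrefix(fields[0], "cpu") {
			continue
		}
		vals := make([]uint64, 0, len(fields)-1)
		for _, s := range fields[1:] {
			v, err := strconv.ParseUint(s, 10, 64)
			if err != nil {
				return nil, err
			}
			vals = append(vals, v)
		}
		out[fields[0]] = vals
	}
	return out, sc.Err()
}

// jumps reports every per-CPU field that went backwards between samples.
func jumps(prev, cur map[string][]uint64) []string {
	var out []string
	for cpu, vals := range cur {
		p, ok := prev[cpu]
		if !ok {
			continue
		}
		for i, v := range vals {
			if i < len(p) && v < p[i] {
				out = append(out, fmt.Sprintf("%s field %d: %d -> %d", cpu, i, p[i], v))
			}
		}
	}
	return out
}

func main() {
	prev, err := readStat("/proc/stat")
	if err != nil {
		panic(err)
	}
	// A handful of one-second samples; extend the loop for real monitoring.
	for i := 0; i < 5; i++ {
		time.Sleep(time.Second)
		cur, err := readStat("/proc/stat")
		if err != nil {
			panic(err)
		}
		for _, j := range jumps(prev, cur) {
			fmt.Println("counter jumped backwards:", j)
		}
		prev = cur
	}
}
```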

Let me know if there are other details you'd like to investigate.

Cheers
Bruce

Ben Kochie

Jul 21, 2020, 3:53:41 AM
to Bruce Merry, Prometheus Users
On Mon, Jul 20, 2020 at 8:46 PM Bruce Merry <bme...@ska.ac.za> wrote:
On Monday, July 20, 2020 at 8:24:59 PM UTC+2 sup...@gmail.com wrote:
Thanks for posting this. I've been looking for more cases where people see these issues.

Linux does not do any mutex locking of the CPU metric counters, so we added some checks to the node_exporter to detect and mitigate spurious counter resets.

In all of my testing and evidence, I've only seen this happen on iowait data, so it's interesting that you see it on other counters too. I don't think your Docker use has any impact; I'd suspect it has more to do with the underlying server environment.

Looking at the last week of logs, I see 41% for user, 31% for idle, 26% for system, 1.4% for iowait, 0.3% for softirq.

That's extremely interesting; I have only ever seen these for iowait, usually when CPU use is moderate and there is a small background rate of iowait.

What is the typical CPU utilization for these nodes? Do you notice any correlation between the CPUs that jump backwards and the load on those CPUs at the time? My question is: when a CPU jumps backwards, is it under high or low utilization?
 
 
Is this bare metal? VMs? What hypervisor?

It's bare metal. What's also odd is that it's restricted to one batch of machines that have the same hardware and run the same workloads. Other machines (with different hardware and workloads) don't exhibit these warnings despite having been deployed at the same time with the same kernel and OS. The affected machines have 12 cores and hyperthreading enabled (so 24 virtual cores) while the other machines generally have up to 8 cores and no HT, so possibly the affected machines have a higher chance of running into race conditions for the kernel data structures.

Let me know if there are other details you'd like to investigate.

It would be interesting to get some more details on which kernel versions are involved. From our other investigations, I don't recall there being much change in the kernel code here, but it might be helpful.

When idle jumps backwards, how much does it jump back by? What are the absolute values for the counter before and after the jump back? Right now we reset the counters if idle jumps back any amount, assuming this happens when the kernel hotplugs a CPU. But this was a very big assumption based on some limited testing. We might want to change things to only reset everything if there's a jump back of more than X%.
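As a sketch of that idea (the threshold value and names here are hypothetical, not anything node_exporter currently implements): only treat an idle jump-back as a real reset if it exceeds some fraction of the counter's value.

```go
package main

import "fmt"

// jumpBackTolerance is a hypothetical relative threshold: idle may
// dip by up to this fraction of its value without forcing a reset.
const jumpBackTolerance = 0.001 // 0.1%, purely illustrative

// shouldReset reports whether an idle jump-back is large enough to
// treat as a genuine counter reset rather than a one-tick race artifact.
func shouldReset(prevIdle, nextIdle float64) bool {
	return nextIdle < prevIdle*(1-jumpBackTolerance)
}

func main() {
	// A 0.01 s dip on an idle counter around 1.5 million seconds is
	// far below any sensible threshold: no reset.
	fmt.Println(shouldReset(1.5e6, 1.5e6-0.01))
	// A counter that falls to near zero clearly did reset.
	fmt.Println(shouldReset(1.5e6, 12.3))
}
```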
 

Cheers
Bruce


Bruce Merry

Jul 21, 2020, 11:37:03 AM
to Prometheus Users
Let me get back to you on this in a few days - we have structured logs in Elasticsearch, so it should be possible to answer all your questions, but I'll need to find a few hours to figure out how to use the tools.

Kernel is
Linux imgr1 5.3.0-62-generic #56~18.04.1-Ubuntu SMP Wed Jun 24 16:17:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux


Bruce Merry

Jul 22, 2020, 4:14:32 AM
to Prometheus Users
On Tuesday, July 21, 2020 at 9:53:41 AM UTC+2 sup...@gmail.com wrote:

What is the typical CPU utilization for these nodes? Do you notice any correlation between the CPUs that jump backwards and the load on those CPUs at the time? My question is: when a CPU jumps backwards, is it under high or low utilization?

The workloads are fairly I/O-heavy, so the CPUs are seldom maxed out. The jumps occur both when the system is idle and when the CPUs are working - I've checked that this is the case at the individual-CPU level too. At the moment I'm just eyeballing graphs, with no hard statistics to tell whether there is a correlation, but nothing is blatantly jumping out at me.

It does appear that when a jump occurs, it occurs on multiple CPUs at the same time (even across sockets), and sometimes in multiple counters for the same CPU.

 
When idle jumps backwards, how much does it jump back by? What are the absolute values for the counter before and after the jump back? Right now we reset the counters if idle jumps back any amount, assuming this happens when the kernel hotplugs a CPU. But this was a very big assumption based on some limited testing. We might want to change things to only reset everything if there's a jump back of more than X%.

Again, this is just from eyeballing logs, but I've looked at maybe 100 instances and they all jump back by 0.01 seconds (which is also the counter resolution). The idle counters are in the range 1-2 million, the others in the range 10k-100k.
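For what it's worth, 0.01 s is exactly one clock tick at the usual USER_HZ of 100, the unit /proc/stat reports in, so every observed jump looks like a single-tick discrepancy. Relative to the counter magnitudes quoted above, that is tiny (quick arithmetic, assuming USER_HZ=100):

```go
package main

import "fmt"

func main() {
	const userHZ = 100.0 // typical USER_HZ; /proc/stat counts in ticks
	tick := 1.0 / userHZ // 0.01 s: matches the observed jump size
	// Relative size of a one-tick jump against the counters quoted above.
	fmt.Printf("relative to idle (~1.5e6 s): %.1e\n", tick/1.5e6)
	fmt.Printf("relative to others (~50k s): %.1e\n", tick/5e4)
}
```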
 
Cheers
Bruce