--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/2aac7b8b-8780-4a2f-973d-b0344720b876n%40googlegroups.com.
Thanks for posting this. I've been looking for more cases where people see these issues.

Linux does not do any mutex locking of the CPU metric counters, so we added logic to the node_exporter to detect and mitigate spurious counter resets.

In all of my testing and evidence, I've only seen this happen on iowait data, so it's interesting that you see it on other events as well. I don't think your Docker use has any impact; I'd suspect it has more to do with the underlying server environment.
Is this bare metal? VMs? What hypervisor?
On Monday, July 20, 2020 at 8:24:59 PM UTC+2 sup...@gmail.com wrote:
> In all of my testing and evidence, I've only seen this happen on iowait data.

Looking at the last week of logs, I see 41% for user, 31% for idle, 26% for system, 1.4% for iowait, 0.3% for softirq.

> Is this bare metal? VMs? What hypervisor?

It's bare metal. What's also odd is that it's restricted to one batch of machines that have the same hardware and run the same workloads. Other machines (with different hardware and workloads) don't exhibit these warnings despite having been deployed at the same time with the same kernel and OS. The affected machines have 12 cores with hyperthreading enabled (so 24 virtual cores), while the other machines generally have up to 8 cores and no HT, so the affected machines possibly have a higher chance of running into race conditions on the kernel data structures.

Let me know if there are other details you'd like to investigate.
Cheers,
Bruce
What is the typical CPU utilization for these nodes? Do you notice any correlation between CPUs whose counters jump backwards and the load on that CPU at the time? In other words, when a CPU jumps backwards, is it under high or low utilization?
When idle jumps backwards, how much does it jump back by? What are the absolute values of the counter before and after the jump? Right now we reset the counters if idle jumps back by any amount, on the assumption that this happens when the kernel hot-plugs a CPU. But that was a very big assumption based on limited testing. We might want to change things to only reset everything if there's a jump back of more than X%.