False Positive Alerts for node CPU Usage for one node

James S

Jul 8, 2021, 9:31:24 AM
to Prometheus Users
We are getting false positive alerts for only one node, all the time. We do not have this issue with other nodes.

The rule we have configured for CPU usage is:

- alert: NodeCPUUtilWar
  expr: instance:node_cpu_utilisation:rate1m > 0.8
  for: 5m

- record: instance:node_cpu_utilisation:rate1m
  expr: 1 - avg without (cpu, mode) (rate(node_cpu_seconds_total{job="node_exporter", mode="idle"}[1m]))


Stuart Clark

Jul 8, 2021, 9:50:37 AM
to James S, Prometheus Users
What makes you say it is a false positive? What does the graph of that
metric show?

--
Stuart Clark

James S

Jul 8, 2021, 11:07:43 AM
to Prometheus Users
We do not see any stress on the cluster, and we do not see this behavior in GCP Cloud Monitoring.

Stuart Clark

Jul 8, 2021, 12:13:23 PM
to James S, Prometheus Users
On 2021-07-08 16:07, James S wrote:
> We do not see any stress on the cluster and we do not see this in GCP
> cloud monitoring this behavior.
>

What does the graph of the metric look like?

Is this a single or multiple CPU machine?
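
For reference, one way to check the CPU count from the exporter's own data is to count the idle series per instance. This is only a sketch, and it assumes the job="node_exporter" label used in the rule earlier in the thread:

```promql
# Number of CPUs node_exporter reports per instance (one idle series per CPU).
count without (cpu, mode) (node_cpu_seconds_total{job="node_exporter", mode="idle"})
```

A value that differs from the machine's expected core count would be worth investigating.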

James S

Jul 8, 2021, 12:48:05 PM
to Prometheus Users
It is a 4-CPU machine.
The Grafana graph:
DAD8018C-0081-4856-8D89-C4700BB65F23.png
GCP monitoring:
78903591-5A30-42EC-934D-ED4065F3B46B.png

James S

Jul 8, 2021, 12:49:54 PM
to Prometheus Users

GCP monitoring CPU usage for the node
78903591-5A30-42EC-934D-ED4065F3B46B.png

James S

Jul 8, 2021, 2:54:55 PM
to Prometheus Users

I have changed the query to:
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (node) / sum(kube_node_status_capacity_cpu_cores) by (node)

But the result is the same; the problem is not fixed.

Laurent Dumont

Jul 9, 2021, 6:19:19 AM
to James S, Prometheus Users
I don't know how GCP calculates its CPU metrics, but node_cpu_seconds_total appears to contain separate counters for each CPU mode (user, kernel, interrupt, and so on). Maybe you can graph each mode separately and see whether one is much higher (https://www.robustperception.io/understanding-machine-cpu-usage).
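
A per-mode breakdown of this kind could be graphed with something like the following. It is only a sketch, assuming the job="node_exporter" label from the original rule and a 5m window picked here purely for illustration:

```promql
# Fraction of time spent in each non-idle mode, averaged over the CPUs of each instance.
# The result keeps the mode label, so each mode is drawn as its own series.
avg without (cpu) (
  rate(node_cpu_seconds_total{job="node_exporter", mode!="idle"}[5m])
)
```

On Linux, node_exporter typically reports modes such as user, system, iowait, irq, softirq, and steal. If a mode like iowait or steal is high on just the one node, that could be one possible explanation for a difference from a tool that counts CPU time differently.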
