Hi,
I'm trying to figure out why data is sometimes missing from a dashboard backed by Prometheus. Our setup is a more or less standard prometheus-operator Helm chart. It defines the following recording rule:
record: instance:node_cpu_utilisation:rate1m
expr: 1 - avg without(cpu, mode) (rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m]))
There are 9 nodes in the cluster, but the dashboard panel that displays this metric only shows 7 of them. Switching the panel to the raw expression shows all 9 as expected (see the query sketch after the list below). Noteworthy things:
- there are no errors in the logs and no failed rule evaluations
- the issue has been showing up (almost) consistently for more than 2 hours now
- on two occasions during this period one of the missing nodes appeared in the recorded series for what seems to be a single scrape interval and then dropped out again immediately
- the issue persists after a Prometheus restart
- other rules defined in the same group seem to be affected in the same way (e.g. instance:node_network_receive_bytes_excluding_lo:rate1m, which calculates network usage in the same fashion)
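To illustrate the comparison, a query along these lines (just a sketch, with the matchers taken from the rule above) returns the instances that are present in the raw expression but missing from the recorded series:

(1 - avg without(cpu, mode) (rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m])))
unless on(instance)
instance:node_cpu_utilisation:rate1m

Given the symptoms above, it should return exactly the two missing nodes.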
This cluster suffered from performance issues in the past and had its scrape/evaluation interval extended to 90s. During that period instance:node_cpu_utilisation:rate1m didn't record any data, because the 1m range is shorter than the actual scrape/evaluation interval, so rate() never had the two samples it needs inside the window. The problem became apparent after switching back to the original 30s scrape/evaluation interval: at that point all 9 nodes should have had their CPU usage recorded again, but only 7 appeared.
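For context on that last point, a quick way to sanity-check the sample density (again just a sketch) is to count the samples that fall into the rule's window:

count_over_time(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m])

With a 30s interval this gives roughly 2 per series; during the 90s period it could never have exceeded 1, which is not enough for rate() to produce a value.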
Has anybody encountered a similar situation?
Thanks,
Vojta