Hi,
I'm trying to figure out why data is sometimes missing from a dashboard backed by Prometheus. Our setup is a more or less standard prometheus-operator Helm chart. It defines the following recording rule:
record: instance:node_cpu_utilisation:rate1m
expr: 1 - avg without(cpu, mode) (rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m]))
There are 9 nodes in the cluster, but the dashboard panel that displays this metric only shows 7 of them. Switching the panel to the raw expression shows all 9 as expected (see the query sketch after the list below). Noteworthy things:
- there are no errors in the logs and no failed rule evaluations
- the issue has been showing up (almost) consistently for more than 2 hours now
- on two occasions during this period one of the missing nodes appeared in the recorded series for what seems to be a single scrape interval and then dropped out again immediately
- the issue persists after a Prometheus restart
- other rules defined in the same group seem to be affected in the same way (e.g. instance:node_network_receive_bytes_excluding_lo:rate1m, which calculates network usage in the same fashion)
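To illustrate the comparison, a query along these lines (just a sketch, with the matchers taken from the rule above) returns the instances that are present in the raw expression but missing from the recorded series:

(1 - avg without(cpu, mode) (rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m])))
unless on(instance)
instance:node_cpu_utilisation:rate1m

Given the symptoms above, it should return exactly the two missing nodes.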
This cluster suffered from performance issues in the past and had its scrape/evaluation interval extended to 90s. During that period instance:node_cpu_utilisation:rate1m didn't record any data, because the 1m range is shorter than the actual scrape/evaluation interval, so rate() never had the two samples it needs inside the window. The problem became apparent after switching back to the original 30s scrape/evaluation interval: at that point all 9 nodes should have had their CPU usage recorded again, but only 7 appeared.
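For context on that last point, a quick way to sanity-check the sample density (again just a sketch) is to count the samples that fall into the rule's window:

count_over_time(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m])

With a 30s interval this gives roughly 2 per series; during the 90s period it could never have exceeded 1, which is not enough for rate() to produce a value.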
Has anybody encountered a similar situation?
Thanks,
Vojta