Recording rule data seem incomplete

vojta....@gmail.com

Sep 6, 2020, 5:38:10 AM
to Prometheus Users
Hi,

I'm trying to figure out why data are sometimes missing in a dashboard backed by Prometheus. Our setup is more or less the standard prometheus-operator Helm chart. It defines the following recording rule:

record: instance:node_cpu_utilisation:rate1m
expr: 1 - avg without(cpu, mode) (rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m]))

There are 9 nodes in the cluster, but the dashboard that displays this metric shows only 7 of them. Switching the dashboard to use the expression directly shows all 9 nodes as expected. Noteworthy things:

  • there are no exceptions in the log and no failed rule evaluations
  • the issue has shown up (almost) consistently for more than 2 hours now
  • on two occasions in this period, one of the missing nodes appeared in the recorded series for what seems to be one scrape interval and then dropped out again immediately
  • after a Prometheus restart, the issue persists
  • other rules defined in the same group seem to be affected in the same way (e.g. instance:node_network_receive_bytes_excluding_lo:rate1m, which calculates network usage in the same fashion)
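
To make the gap concrete, a query along these lines should list the instances that the raw expression returns but the recorded series lacks. This is only a sketch: it assumes both sides carry an instance label, and it reuses the rule expression exactly as quoted above.

```
# Instances present in the raw expression but absent from the recorded series
(
  1 - avg without(cpu, mode) (
    rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m])
  )
)
unless on(instance)
  instance:node_cpu_utilisation:rate1m
```

An empty result would mean the two agree at that evaluation time. Graphing it over the affected period should show when each node drops out.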

This cluster suffered some performance issues in the past and had the scrape/evaluation interval extended to 90s. During that period instance:node_cpu_utilisation:rate1m didn't record any data (because its 1m range was shorter than the actual scrape/evaluation interval). The problem became apparent after switching back to the original 30s scrape/evaluation interval. At that point all 9 nodes should have had their CPU usage displayed correctly, but only 7 appeared.
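
Since rate() needs at least two samples inside its range to return a value, one way to check whether the same effect is still in play for some nodes is to count the samples that fall in the 1m window per instance. A minimal sketch, assuming the same job and mode labels as in the rule:

```
# Per instance, the smallest number of samples any idle-CPU series has in the last 1m.
# Anything below 2 means rate(...[1m]) returns nothing for that series.
min by (instance) (
  count_over_time(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m])
) < 2
```

Instances with no samples at all in the window will not appear in this result, so an instance missing from both this query and the recorded series would point at scraping rather than at the rule.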

Has anybody encountered a similar situation?

Thanks,
Vojta
