--
You received this message because you are subscribed to the Google Groups "ovn-kubernetes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ovn-kubernete...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ovn-kubernetes/8C86C0B4-E36A-4425-97E4-376E9D63214F%40nvidia.com.
Hello Casey,
Thanks!
With the plethora of metrics that we are going to collect across the various layers of the OVN control plane (OVN K8s, OVN, and OVS), we can build a lot of alerts on top of them. In addition to what you have already mentioned, a few other alerts could be:
Raft:
- too many elections over a short period of time (this should also give us an idea of the optimal setting for the election timer)
- followers lagging behind the leader in terms of log entries (we can use the e2e_timestamp that ovnkube-master writes to NB, or actually just use the log_entry_index from the cluster/status output)
- db_size is too high, in which case perhaps force a compaction; in a 1000-node cluster a large db_size could be a problem
- a network partition between the RAFT cluster nodes (two different cluster_id values will be reported)
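To make the Raft checks above a bit more concrete, here is a rough sketch of how they could be evaluated once the per-server values are scraped. Everything in it is an assumption for illustration: the RaftMember fields, the raft_alerts helper, and the three thresholds are hypothetical names and numbers, not existing ovn-kubernetes metrics or recommended settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RaftMember:
    """State of one RAFT server. Field names are hypothetical."""
    name: str
    cluster_id: str
    role: str                 # "leader" or "follower"
    log_index: int            # last log entry index seen by this server
    db_size_bytes: int
    elections_last_hour: int


# Hypothetical thresholds; they would need tuning per deployment.
MAX_ELECTIONS_PER_HOUR = 3
MAX_FOLLOWER_LAG = 100
MAX_DB_SIZE_BYTES = 512 * 1024 * 1024


def raft_alerts(members: List[RaftMember]) -> List[str]:
    """Return one alert string per violated condition."""
    alerts = []

    # More than one cluster_id suggests the cluster has split.
    cluster_ids = sorted({m.cluster_id for m in members})
    if len(cluster_ids) > 1:
        alerts.append(f"partition: cluster_ids={cluster_ids}")

    # Use the highest index reported by any leader as the reference point.
    leader_index = max(
        (m.log_index for m in members if m.role == "leader"), default=None
    )

    for m in members:
        if m.elections_last_hour > MAX_ELECTIONS_PER_HOUR:
            alerts.append(f"elections: {m.name}={m.elections_last_hour}/h")
        if (
            leader_index is not None
            and m.role == "follower"
            and leader_index - m.log_index > MAX_FOLLOWER_LAG
        ):
            lag = leader_index - m.log_index
            alerts.append(f"lag: {m.name} behind by {lag} entries")
        if m.db_size_bytes > MAX_DB_SIZE_BYTES:
            alerts.append(f"db_size: {m.name}={m.db_size_bytes} bytes")

    return alerts
```

In practice these conditions would more likely be written as alerting rules over the exported metrics; the function form is only meant to pin down what each condition compares.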
Ovn-controller:
- the number of geneve_ports should be the same as the number of nodes in the cluster
- a lot of recomputes (LogicalFlow to OpenFlow flows)
and many more. Let me get these changes upstream, and we can then collaborate on the alerts.
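For the two ovn-controller checks, a minimal sketch could look like the following. The function names, the sample format, and the recompute threshold are all hypothetical. The expected port count is passed in by the caller, so the exact comparison rule can be adjusted, and the recompute counter is assumed to be a monotonically increasing value sampled over time.

```python
from typing import List, Tuple

# Hypothetical threshold; it would need tuning per deployment.
MAX_RECOMPUTES_PER_MINUTE = 5.0


def recompute_rate_per_minute(samples: List[Tuple[float, int]]) -> float:
    """Rate from (unix_seconds, cumulative_count) samples, oldest first."""
    if len(samples) < 2:
        return 0.0
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed = t1 - t0
    # Ignore windows with a bad clock or a counter reset.
    if elapsed <= 0 or c1 < c0:
        return 0.0
    return (c1 - c0) * 60.0 / elapsed


def controller_alerts(
    node: str,
    geneve_ports: int,
    expected_ports: int,
    samples: List[Tuple[float, int]],
) -> List[str]:
    """Return one alert string per violated condition on this node."""
    alerts = []
    if geneve_ports != expected_ports:
        alerts.append(
            f"geneve_ports: {node} has {geneve_ports}, expected {expected_ports}"
        )
    rate = recompute_rate_per_minute(samples)
    if rate > MAX_RECOMPUTES_PER_MINUTE:
        alerts.append(f"recomputes: {node} at {rate:.1f}/min")
    return alerts
```

Using the first and last sample of a window keeps the rate calculation simple; a real alert would probably rely on the monitoring system's own rate function instead.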
Regards,
~Girish