OVN-K8s/OVN/OVS Prometheus metrics

Girish Moodalbail

Jun 10, 2020, 3:09:31 PM

to ovn-kub...@googlegroups.com

Hello all,

Please find below the document that captures all the OVN-K8s/OVN/OVS metrics that we have added support for and are going to submit a PR for:

https://docs.google.com/document/d/1BAsjLOpAeSyIq2UcyukPK7PgKCqgAKuHcz3962DtoUo/edit?usp=sharing

Please go through it and comment on those metrics. Also, let us know if we have missed any metrics.

Regards,
~Girish

Casey Callendrello

Jun 16, 2020, 7:20:21 AM

to Girish Moodalbail, ovn-kub...@googlegroups.com

Girish,

This is really helpful, thanks!

Looking at all these metrics, I started thinking of possible alerts we could write. Cases we should catch include:

Raft:

- No active leader

- Raft has been one failure away from being broken (no redundancy) for a long time. (need more metrics)

OVN-Northd:

- Too many failed transactions

OVN-Controller:

- Too many packets being dropped

- Unable to reach sbdb (do we have a metric for this?)

- Too far behind sbdb (do we have a metric for this?)
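The two Raft cases above could be evaluated roughly as follows. This is a minimal sketch in Python, assuming each database server's role ("leader", "follower", or "candidate") and the number of healthy members are already available from the scraped metrics. The function names and input shapes are hypothetical and are not taken from the metrics document.

```python
from typing import Mapping


def has_active_leader(roles: Mapping[str, str]) -> bool:
    """Return True if exactly one server reports the 'leader' role.

    `roles` maps a (hypothetical) server name to its reported Raft role.
    """
    return sum(1 for role in roles.values() if role == "leader") == 1


def quorum(cluster_size: int) -> int:
    """Smallest number of members that forms a Raft majority."""
    return cluster_size // 2 + 1


def spare_members(healthy: int, cluster_size: int) -> int:
    """Number of additional members that can fail before quorum is lost.

    0 means the cluster still works but has no redundancy;
    a negative value means quorum is already lost.
    """
    return healthy - quorum(cluster_size)


if __name__ == "__main__":
    roles = {"db-0": "leader", "db-1": "follower", "db-2": "follower"}
    print("leader present:", has_active_leader(roles))
    print("spare members with 2 of 3 healthy:", spare_members(2, 3))
```

An alert for "no redundancy for a long time" would then fire when `spare_members` stays at 0 for longer than some chosen duration.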

What other cases can we catch? It seems like ovn-controller might have a lot of possibilities.

--Casey


Girish M G (GmG)

Jun 18, 2020, 2:58:41 PM

to Casey Callendrello, Girish Moodalbail, ovn-kub...@googlegroups.com

Hello Casey,

Thanks!

With the plethora of metrics that we are going to collect across the various layers of the OVN control plane (OVN K8s, OVN, and OVS), we can build a lot of alerts based on these metrics. In addition to what you have already mentioned, a few other alerts could be:

Raft:

- too many elections over a short period of time (this should give an idea of the optimal setting for the election timer)

- the followers are behind the leader in terms of log entries (we can use the e2e_timestamp that gets written to NB by ovnkube-master, or actually use the log_entry_index from the cluster/status output)

- if db_size is too high, then perhaps force compaction. In a 1000-node cluster, a large db_size could be a problem

- a network partition between the Raft cluster nodes (two different cluster_id values will be reported)
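As a rough illustration of the Raft checks listed above, here is a minimal Python sketch. It assumes the Raft term, the log entry index, and the cluster ID of each server have already been scraped into plain Python values. All function names, input shapes, and any thresholds applied to the results are hypothetical. Counting elections as term increases relies on the standard Raft behavior that every election starts a new term.

```python
from typing import Dict, List, Tuple


def elections_in_window(samples: List[Tuple[float, int]], window_s: float) -> int:
    """Count Raft term increases within the last `window_s` seconds.

    `samples` is a list of (unix_timestamp, term) pairs sorted by timestamp.
    """
    if not samples:
        return 0
    cutoff = samples[-1][0] - window_s
    recent = [term for ts, term in samples if ts >= cutoff]
    # Terms only grow, so last minus first is the number of new terms.
    return recent[-1] - recent[0]


def follower_lag(log_index: Dict[str, int], leader: str) -> Dict[str, int]:
    """Return how many log entries each follower is behind the leader."""
    lead = log_index[leader]
    return {name: lead - idx for name, idx in log_index.items() if name != leader}


def looks_partitioned(cluster_ids: Dict[str, str]) -> bool:
    """True if the servers report more than one distinct cluster ID."""
    return len(set(cluster_ids.values())) > 1


if __name__ == "__main__":
    terms = [(0.0, 5), (60.0, 5), (120.0, 7), (180.0, 8)]
    print("elections in last 120s:", elections_in_window(terms, 120.0))
    print("lag:", follower_lag({"db-0": 100, "db-1": 98, "db-2": 40}, "db-0"))
    print("partitioned:", looks_partitioned({"db-0": "ab12", "db-1": "ab12"}))
```

The db_size case is a plain threshold comparison and is left out; what counts as "too high" depends on the cluster.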

Ovn-controller:

- the number of geneve_ports should be the same as the number of nodes in the cluster

- a lot of recomputes (LogicalFlow to OpenFlow flows)
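The two ovn-controller checks above might look like this in Python. This is only a sketch: the per-node port counts and the recompute counter readings are assumed to be available already, and the names and input shapes are hypothetical. The expected port count is passed in as a parameter, so a deployment that counts tunnels differently can supply its own value.

```python
from typing import Dict


def geneve_port_mismatches(ports_per_node: Dict[str, int], expected: int) -> Dict[str, int]:
    """Return the nodes whose Geneve port count differs from `expected`."""
    return {node: n for node, n in ports_per_node.items() if n != expected}


def recompute_rate(previous: int, current: int, interval_s: float) -> float:
    """Recomputes per second between two readings of a monotonic counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # A counter that went backwards means the process restarted; report 0.
    return max(current - previous, 0) / interval_s


if __name__ == "__main__":
    ports = {"node-a": 3, "node-b": 3, "node-c": 2}
    print("mismatched nodes:", geneve_port_mismatches(ports, expected=len(ports)))
    print("recomputes/s:", recompute_rate(100, 160, 30.0))
```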

and many more. Let me get these changes upstream, and we can then collaborate on the alerts.

Regards,

~Girish
