OVN-K8s/OVN/OVS Prometheus metrics

Girish Moodalbail

Jun 10, 2020, 3:09:31 PM
to ovn-kub...@googlegroups.com
Hello all,

Please find below the document that captures all the OVN-K8s/OVN/OVS metrics that we have added support for and are going to submit a PR for:

https://docs.google.com/document/d/1BAsjLOpAeSyIq2UcyukPK7PgKCqgAKuHcz3962DtoUo/edit?usp=sharing

Please go through it and comment on those metrics. Also, let us know if we have missed any metrics.

Regards,
~Girish


Casey Callendrello

Jun 16, 2020, 7:20:21 AM
to Girish Moodalbail, ovn-kub...@googlegroups.com
Girish,
This is really helpful, thanks!

Looking at all these metrics, I started thinking about possible alerts we could write. Cases we should catch include:

Raft:
- No active leader
- Raft has been about to break (no redundancy) for a long time (needs more metrics)

OVN-Northd:
- Too many failed transactions

OVN-Controller:
- Too many packets being dropped
- Unable to reach sbdb (do we have a metric for this?)
- Too far behind sbdb (do we have a metric for this?)
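
As a rough illustration of the two Raft cases above, here is a minimal Python sketch. The function names, the role strings, and the idea that member roles and a healthy-member count are available as inputs are all assumptions for illustration; none of them come from the metrics document.

```python
from typing import Mapping


def no_active_leader(roles: Mapping[str, str]) -> bool:
    """Return True when no cluster member reports the 'leader' role.

    `roles` maps a (hypothetical) member name to its reported Raft role.
    """
    return not any(role == "leader" for role in roles.values())


def redundancy_left(healthy_members: int, cluster_size: int) -> int:
    """Return how many more members can fail before quorum is lost.

    Raft needs a strict majority, so quorum is cluster_size // 2 + 1.
    0 means "no redundancy"; a negative value means quorum is already lost.
    """
    quorum = cluster_size // 2 + 1
    return healthy_members - quorum
```

An alert for "no redundancy for a long time" could then fire when `redundancy_left(...)` stays at 0 for longer than some chosen duration.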

What other cases can we catch? It seems like ovn-controller might have a lot of possibilities.

--Casey



Girish M G (GmG)

Jun 18, 2020, 2:58:41 PM
to Casey Callendrello, Girish Moodalbail, ovn-kub...@googlegroups.com

Hello Casey,

Thanks!

With the plethora of metrics that we are going to collect across the various layers of the OVN control plane (OVN-K8s, OVN, and OVS), we can build a lot of alerts. In addition to the ones you have already mentioned, a few other alerts could be:


Raft:
- Too many elections over a short period of time (this should give an idea of the optimal setting for the election timer)
- The followers are behind the leader in terms of log entries (we can use the e2e_timestamp that ovnkube-master writes to NB, or actually use the log_entry_index from the cluster/status output)
- If db_size is too high, then perhaps force compaction; in a 1000-node cluster a large db_size could be a problem
- A network partition between the Raft cluster nodes (two different cluster_id values will be reported)
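
To make the Raft checks above a bit more concrete, here is a minimal Python sketch. It assumes the election timestamps, per-member log indexes, and per-member cluster IDs have already been collected from the metrics; the function names, thresholds, and input shapes are hypothetical and not taken from the proposed metrics.

```python
from typing import Iterable, List, Mapping, Sequence


def elections_in_window(election_times: Sequence[float],
                        now: float, window_s: float) -> int:
    """Count elections whose timestamp falls within the last `window_s` seconds."""
    return sum(1 for t in election_times if now - window_s <= t <= now)


def lagging_followers(log_index: Mapping[str, int],
                      leader: str, max_lag: int) -> List[str]:
    """Return followers whose log index trails the leader's by more than `max_lag`."""
    leader_index = log_index[leader]
    return sorted(
        member for member, index in log_index.items()
        if member != leader and leader_index - index > max_lag
    )


def is_partitioned(cluster_ids: Iterable[str]) -> bool:
    """Return True when the members report more than one distinct cluster ID."""
    return len(set(cluster_ids)) > 1
```

For example, `elections_in_window(times, now, 600.0) > 3` could serve as a "too many elections" condition; the window and threshold would need tuning against the election timer in use.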


OVN-Controller:
- The number of geneve_ports should be the same as the number of nodes in the cluster
- A lot of recomputes (LogicalFlow to OpenFlow flows)
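
The two ovn-controller checks above could be sketched as follows. This is only an illustration: it assumes the geneve port count and a monotonically increasing recompute counter are available as plain numbers, and the function names and the caller-supplied `expected` value are hypothetical.

```python
def geneve_port_mismatch(geneve_ports: int, expected: int) -> bool:
    """Return True when the observed geneve port count differs from the expected one.

    `expected` is supplied by the caller, derived from the number of nodes.
    """
    return geneve_ports != expected


def recompute_rate(prev_count: int, prev_ts: float,
                   cur_count: int, cur_ts: float) -> float:
    """Return recomputes per second between two samples of a counter.

    If the counter went backwards, assume it was reset (for example after a
    restart) and count only what accumulated since the reset.
    """
    elapsed = cur_ts - prev_ts
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing timestamps")
    delta = cur_count - prev_count
    if delta < 0:
        delta = cur_count
    return delta / elapsed
```

A "lot of recomputes" alert would then compare `recompute_rate(...)` against a threshold chosen for the cluster size.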


There are many more. Let me get these changes upstream, and we can then collaborate on the alerts.


Regards,

~Girish

