[Please Review] Generic Operators Health metric - Proposal

Shirly Radco

Feb 13, 2023, 6:36:23 AM
to operator-framework-olm-dev
Hi,

I'm the Observability team lead for the OpenShift Virtualization operator,
and I'm currently working on adding a health metric for the operator.

I know that OLM currently has no reliable way to tell whether an operator is healthy.

I would like to propose asking operators to report their health in a metric with a generic name, operator_health_status.
It would carry the operator name label, the same as it is reported in the csv_succeeded metric, and it would have 3 valid values: 0/1/2, corresponding to Healthy/Warning/Critical.
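
To make the shape of the metric concrete, here is a minimal sketch of the value mapping and of one sample rendered in the Prometheus text exposition format. It uses only the Python standard library rather than a real metrics client. The label key `name`, the `OperatorHealth` type, the `render_health_sample` helper, and the operator name in the usage line are my assumptions for illustration; the actual label should match whatever csv_succeeded uses.

```python
from enum import IntEnum


class OperatorHealth(IntEnum):
    """Values proposed in this thread for operator_health_status."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def render_health_sample(operator_name: str, status: OperatorHealth) -> str:
    """Render one sample line in the Prometheus text exposition format.

    The label key "name" is an assumption made for this sketch.
    """
    # Label values must escape backslash, double quote, and newline.
    escaped = (
        operator_name.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'operator_health_status{{name="{escaped}"}} {int(status)}'


if __name__ == "__main__":
    # Hypothetical operator name, for illustration only.
    print(render_health_sample("example-operator.v1.0.0", OperatorHealth.WARNING))
```

Each operator would still decide internally how it arrives at Healthy, Warning, or Critical; only the metric name, label, and value range would be shared.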

This would let each operator check its status in its own way and still report it in a single metric that OLM and OCP can use.
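
As a rough illustration of what a generic consumer could then do, the sketch below takes operator_health_status values keyed by operator name and reports the most severe one, without any per-operator logic. The `overall_health` helper, the "worst status wins" rule, and the operator names are assumptions of mine, not something OLM or OCP does today.

```python
STATUS_NAMES = {0: "Healthy", 1: "Warning", 2: "Critical"}


def overall_health(samples: dict[str, int]) -> str:
    """Return the most severe status among all reported operators."""
    for operator, value in samples.items():
        if value not in STATUS_NAMES:
            raise ValueError(
                f"{operator}: invalid operator_health_status value {value}"
            )
    # Higher numbers are more severe, so the maximum is the worst status.
    # With no samples at all, this sketch reports "Healthy"; a real
    # consumer might prefer to treat that case as unknown.
    return STATUS_NAMES[max(samples.values(), default=0)]


if __name__ == "__main__":
    # Hypothetical operator names and values, for illustration only.
    print(overall_health({"operator-a": 0, "operator-b": 2, "operator-c": 1}))
```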

A few operators already report a health metric, but each uses a different name, so it is impossible to use them in a generic way. For example: ceph_health_status, odf_system_health_status, etc.

I think it would be good to set this as the standard way to report operator health metrics.

I can also add this as a best practice for operators in the Operator Observability Best Practices in Operator SDK.

Please let me know what you think and how I can promote this suggestion.

Best regards,
Shirly Radco