Hi,
I'm the Observability team lead for the OpenShift Virtualization operator,
and I'm currently working on adding a health metric for the operator.
As far as I know, OLM currently has no reliable way to tell whether an operator is healthy.
I would like to propose that operators report their health in a metric with a generic name, operator_health_status.
It will carry the operator name label, with the same value that is reported in the csv_succeeded metric, and it can have 3 valid values, 0/1/2, which correspond to Healthy/Warning/Critical.
This lets each operator keep its own way of determining its status while reporting the result in a single metric that OLM and OCP can use.
A few operators already report a health metric, but each uses a different name, so the metrics cannot be used in a generic way. For example: ceph_health_status, odf_system_health_status, etc.
I think it would be good to set this as the standard way to report operator health metrics.
I can also add this as a best practice for operators to the
Operator Observability Best Practices in the Operator SDK.
Please let me know what you think and how I can promote this suggestion.
Best regards,
Shirly Radco