Consultation around a generic metric for Kubernetes operators

Shirly Radco

Mar 26, 2023, 2:24:53 AM
to Prometheus Users
Hi,

Short summary:
Can we have a metric that reports 3 values (0/1/2), to indicate status instead of using labels or adding the status to the metric name?

Full story:
I'm working on creating a general recommendation for reporting a Kubernetes operator health metric.

The full proposal is here: https://github.com/operator-framework/operator-sdk/pull/6315/files

I proposed recommending that operators add a new health metric, named as follows:
<operator-name-prefix>_operator_health_status [1]

I proposed that the values of this metric would indicate the health status:
  * `0` - Indicates that the operator is healthy and working as expected.
  * `1` - Indicates that the operator has some issues that need to be addressed and can potentially lead to loss of functionality.
  * `2` - Indicates that the operator is unhealthy and there is a loss of functionality that should be addressed.
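
To make the proposal concrete, here is a rough sketch of how the single-gauge encoding could look in the Prometheus text exposition format. It is only an illustration: the `myoperator` prefix, the `Health` names and the `render_health_gauge` helper are hypothetical and not part of the proposal, and a real operator would normally use a Prometheus client library instead of formatting the output by hand.

```python
from enum import IntEnum


class Health(IntEnum):
    """Proposed health values; the member names are illustrative assumptions."""
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def render_health_gauge(prefix: str, status: Health) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    name = f"{prefix}_operator_health_status"
    return (
        f"# HELP {name} Operator health: 0=healthy, 1=degraded, 2=unhealthy.\n"
        f"# TYPE {name} gauge\n"
        f"{name} {int(status)}\n"
    )


if __name__ == "__main__":
    # "myoperator" is a hypothetical operator name prefix.
    print(render_health_gauge("myoperator", Health.DEGRADED), end="")
```

With this encoding, an alert on loss of functionality would compare the sample value against 2, and a warning would compare it against 1.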

There is a disagreement about this, since there are no examples that I could find that have a third optional value.
Usually these metrics are represented as Boolean (Healthy/Unhealthy) or the status is stated in the metric name.

The reviewers believe it's not recommended to have more than 2 possible values (Boolean).
I see a few issues with this:
1. The metric is sent from different operators and it would be problematic to have a label to indicate the level of health in a consistent way.
2. I don't see an issue with querying Prometheus with more than 2 values. It might be more efficient than filtering with labels.

I would appreciate your insights on this, considering that the metric is sent from multiple sources that are all developed separately.

Thank you,
Shirly Radco

[1] I proposed a different prefix per operator and a shared suffix, since I know there is an issue with sending the same metric name to Prometheus with different help text.
Since we can't enforce the help text to be exactly the same, the suffix should be enough to display all the operators' health metrics in the same panel.
Also, it would be easier to identify the origin of metrics that have an issue.

Brian Candler

Mar 26, 2023, 10:44:25 AM
to Prometheus Users
> There is a disagreement about this, since there are no examples that I could find that have a third optional value.

The closest example I can think of is Nagios plugins (0=ok, 1=warning, 2=critical, 3=unknown). See nrpe_exporter:

And, I guess things like ifOperStatus from snmp_exporter.

I'd say having a 0/1/2 status isn't necessarily "wrong", and in Grafana you can map these numbers to strings and/or colours.

However, you're also right to say this isn't normal recommended practice. Typically you'd have a set of timeseries, one per state, and set one to 1 and the others to 0. Client libraries tend to call this group of metrics an "enum", e.g.
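
A minimal sketch of that one-series-per-state pattern, written out by hand in the Prometheus text exposition format. The state names, the `state` label name, the `myoperator` prefix and the `render_health_stateset` helper are all assumptions made for illustration; a client library's enum type would generate equivalent output, possibly with a different label name.

```python
# Hypothetical state names, mirroring the 0/1/2 values discussed above.
STATES = ("healthy", "degraded", "unhealthy")


def render_health_stateset(prefix: str, current: str) -> str:
    """Render one series per state: 1 for the current state, 0 for the rest."""
    if current not in STATES:
        raise ValueError(f"unknown state: {current!r}")
    name = f"{prefix}_operator_health_status"
    lines = [
        f"# HELP {name} Current operator health state.",
        f"# TYPE {name} gauge",
    ]
    for state in STATES:
        value = 1 if state == current else 0
        # The label name "state" is an assumption, not a fixed convention.
        lines.append(f'{name}{{state="{state}"}} {value}')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_health_stateset("myoperator", "degraded"), end="")
```

With this layout, a query selects a state by label (for example `state="unhealthy"`) and checks for the value 1, rather than comparing against a numeric code.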

I wouldn't worry about efficiency. Prometheus timeseries are very cheap, especially when the metric values are mostly constant.

Shirly Radco

Mar 29, 2023, 11:23:18 AM
to Prometheus Users
Thank you Brian for your help with this.
I really appreciate it.

Shirly
