Guidance on structure for deprecated API use metric


Jordan Liggitt

May 1, 2020, 3:19:45 PM
to Han Kang, Frederic Branczyk, kubernetes-sig-...@googlegroups.com
Following up on the discussion from yesterday's SIG meeting about the metric for use of deprecated APIs. 

The goal is to make it easy for an admin to observe what deprecated APIs are being called:
  • optionally filtered to APIs removed in version X
  • optionally filtered by request verb (e.g. write requests only)
  • optionally filtered by scope (e.g. cluster-wide lists vs namespaced lists)
  • with the ability to observe the total count of requests made to a given deprecated API over time
A few possible approaches have presented themselves:
  1. My original idea was to record a subset of apiserver_request_total in a new counter metric. It was pointed out in the SIG meeting that duplicating the verb/scope/count tracking data between the two metrics wasn't great.

  2. @dashpole suggested a gauge metric with a constant value of 1, labeled with group / version / resource / subresource / removed_version that could be joined to apiserver_request_total. Experimenting with this seemed to work well and let me do queries like this (a sketch of what the gauge series themselves might look like follows this list):
    apiserver_requested_deprecated_apis{removed_version="1.22"} * on(group,version,resource,subresource) group_right() apiserver_request_total

  3. Clayton suggested adding deprecated / removed_version labels to apiserver_request_total, since the values for those labels should be constant for all series with identical group / version / resource / subresource labels. There were some concerns that the cardinality of that metric is already so high (conservatively, 2000+ series for a given group/version/resource/subresource) that adding even constant-value labels to the existing series is not ideal.
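
For concreteness, here is a rough sketch of what the option 2 gauge series might look like on the wire. The label names are the ones from this thread and the example resources are purely illustrative; the final metric name and shape may differ:

  # one constant-value series per deprecated group/version/resource/subresource (illustrative values)
  apiserver_requested_deprecated_apis{group="extensions", version="v1beta1", resource="ingresses", subresource="", removed_version="1.22"} 1
  apiserver_requested_deprecated_apis{group="rbac.authorization.k8s.io", version="v1beta1", resource="roles", subresource="", removed_version="1.22"} 1
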
I'd like feedback from Frederic and Han on which direction to pursue.

Thanks,
Jordan

Han Kang

May 1, 2020, 6:11:32 PM
to Jordan Liggitt, Frederic Branczyk, kubernetes-sig-instrumentation
Yes, my personal preference is to avoid 3 (I like 2), for the following reasons:
  • We should be far more deliberate with what I consider one of the most critical apiserver metrics. The potential for exacerbating existing cardinality issues on one of the most important metrics for SLIs/SLOs concerns me, to say the least.
  • As Jordan mentioned, there is some concern (he was probably referring to me) about adding constant label* metadata. Today our metric looks like this:
apiserver_request_total { client, code, component, contentType, group, resource, scope, subresource, verb, version }
 
Which means that, given all the unique combos of group/version/resource/subresource, we would then be duplicating this label and value across the unique combos of the remaining dimensions { verb, client, code, component, contentType }. Each of these time series and its unique label values is stored in memory, so there is some threshold at which the unique combos of values for the labels { verb, client, code, component, contentType } will cause the actual memory footprint (and payload size of the endpoint) to be larger than if we just broke this metric off completely.

Let's say we fix one unique combination of { group/version/resource/subresource } (so I will omit those labels from my example). Then if we had the following series:
apiserver_request_total { client="a", code="200", component="aggregator", contentType="unbounded", scope="resource", verb="PUT" }
apiserver_request_total { client="a", code="200", component="aggregator", contentType="unbounded", scope="resource", verb="POST" } 
We would have to do this:
apiserver_request_total { client="a", code="200", component="aggregator", contentType="unbounded", scope="resource", verb="PUT", removed_version="1.22"}
apiserver_request_total { client="a", code="200", component="aggregator", contentType="unbounded", scope="resource", verb="POST", removed_version="1.22"}
 
This adds up quite poorly in worst-case scenarios.
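
To put rough numbers on that (purely illustrative, using Jordan's conservative figure of 2000+ series per group/version/resource/subresource combination): option 3 stamps the removed_version value onto every one of those ~2000 existing series for each deprecated resource, while option 2 adds exactly one new series per deprecated resource and leaves apiserver_request_total alone. For, say, 10 deprecated resources that is on the order of 20,000 extra label copies versus 10 extra series.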
 

But even setting aside that last point (which I believe is valid), I would really love for us to be more deliberate with SLO/SLI metrics. For me at least, it feels quite dangerous to keep shoving everything into one of the two most widely used kube-apiserver metrics for alerting. The blast radius if we screw those metrics up is really terrible, since basically everyone alerts against them.

*Technically, I don't think it is actually a 'constant' label as Prometheus defines it, since we would actually have more than one value for this label (varying with the value of the resource label), even though we would not be increasing the total number of time series for the metric.

Thanks,
Han
--
- Han

Frederic Branczyk

May 4, 2020, 5:57:35 AM
to kubernetes-sig-instrumentation
The biggest thing that worries me about all of this is actually not so much the metrics themselves, but what we intend to do with them. We've been trying to do alerts like this in OpenShift, with not a whole lot of success so far, I would say. The fact that there have been one or more requests to an API during the lifetime of the apiserver does not mean it's actively being used, and many of those requests are out of the administrator's control anyway. What would an administrator do if they get an alert for it? Interactions via kubectl are probably the easiest to fix, but what about controllers/operators? A user can't just change those; they will most likely need to be fixed in code. The combination of the above tended to result in the alerts just being silenced and/or ignored, as it's not possible to track individual clients with this (nor should we try to with metrics; as you mentioned, this is really a case for the audit log).

Just looking at the metrics, I think I like the balance of option 2 the most as well (not extending the existing request metrics, and not adding a whole lot of new series), but I'm still unsure whether it will actually have the result we desire. At the very least we should document how to go from the metric to the audit log entry (simple grepping would be sufficient, I think).

Jordan Liggitt

May 5, 2020, 10:46:32 AM
to kubernetes-sig-instrumentation
I agree the metrics by themselves are not sufficient, which is why the KEP also adds kubectl warnings (for user visibility) and audit annotations (to let admins identify specific problematic clients).

> What would an administrator do about that if they get an alert for it?

I envision several ways admins could use this or respond to the metric:
  • Requiring the metric to be clean after a CI run to ensure nothing is making use of deprecated APIs
  • Adding a warning before upgrading to version X if requests to deprecated APIs removed in version X were made recently
  • Using a cheap metric check to trigger a more expensive audit log sweep to identify/notify problematic clients (a sketch of such a check follows this list)
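A minimal sketch of what such a cheap check could look like, assuming the option 2 gauge and the label names used earlier in this thread (the final names may differ):

  # illustrative: fires/fails if any API slated for removal in 1.22 received requests in the last 4h
  (
    apiserver_requested_deprecated_apis{removed_version="1.22"}
      * on(group, version, resource, subresource) group_right()
    increase(apiserver_request_total[4h])
  ) > 0

With a window covering the CI run, an empty result from the same expression would serve as the "metric is clean" check.
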
> Just looking at the metrics, I think I like the balance of 2 the most as well (not extending the existing request metrics, and not adding a whole lot of new series)

Sounds good.

> At the very least we should document how to go from the metric to the audit log entry

Definitely, I plan to add documentation for the following:
  • Detecting if deprecated requests have been made to an API server
  • Joining the deprecated request metric to the count metric to get more details about volume, read/write nature, scope, etc. (an example query is sketched after this list)
  • Filtering audit events to requests for deprecated APIs to locate clients
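
As an illustration of the second item, a join like the following could break deprecated-API request volume out by verb and scope (again assuming the option 2 metric and the label names from this thread; treat it as a sketch, not the final documented query):

  # per-deprecated-API request rate, broken out by verb and scope (illustrative)
  sum by (group, version, resource, subresource, verb, scope) (
    apiserver_requested_deprecated_apis
      * on(group, version, resource, subresource) group_right()
    rate(apiserver_request_total[5m])
  )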

Frederic Branczyk

May 5, 2020, 11:13:27 AM
to Jordan Liggitt, kubernetes-sig-instrumentation
Sounds perfect. Happy with that strategy.
