Hi,
It was suggested that I reach out to this group about this issue after a conversation in the
#sig-api-machinery Slack channel. Grafana recently ran into
issue #123571, which had already been reported and closed because it was not considered a bug at the time.
Running an extension server outside the cluster seems to be supported, both in code, based on the
handling of ExternalName in ResolveCluster, and in
this section of the docs, which describes running the extension server in the cluster as the most common way, but not the only way:
"The most common way to implement the APIService is to run an extension API server in Pod(s) that run in your cluster."
Based on that information, this seems like a bug to me. Am I interpreting that correctly? The main issue seems to be this
line in the TLS Config for the kube-aggregator proxy client. I
opened a PR to attempt to address this issue by setting the ServerName in TLS Config to the externalName set in the Service when available.
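To make the idea concrete, here is a minimal sketch of the hostname selection. It is not the code from the PR or from kube-aggregator: the `service` struct, `serverNameFor`, and `tlsConfigFor` are hypothetical names made up for illustration, and the example hostnames are placeholders. Only `tls.Config` and its `ServerName` field come from the Go standard library.

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// service is a hypothetical, stripped-down stand-in for a Kubernetes Service.
// It is NOT the real corev1.Service type.
type service struct {
	Name         string
	Namespace    string
	ExternalName string // empty unless the Service is of type ExternalName
}

// serverNameFor picks the hostname the serving certificate is verified against:
// the externalName when one is set, otherwise the in-cluster
// "{name}.{namespace}.svc" hostname.
func serverNameFor(svc service) string {
	if svc.ExternalName != "" {
		return svc.ExternalName
	}
	return svc.Name + "." + svc.Namespace + ".svc"
}

// tlsConfigFor builds a client TLS config whose ServerName follows the rule above.
func tlsConfigFor(svc service) *tls.Config {
	return &tls.Config{ServerName: serverNameFor(svc)}
}

func main() {
	inCluster := service{Name: "metrics", Namespace: "monitoring"}
	external := service{Name: "metrics", Namespace: "monitoring", ExternalName: "api.example.com"}

	fmt.Println(tlsConfigFor(inCluster).ServerName) // metrics.monitoring.svc
	fmt.Println(tlsConfigFor(external).ServerName)  // api.example.com
}
```

With this rule, a Service without an externalName keeps the in-cluster hostname, so only ExternalName Services would see a change.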
This change could be backwards incompatible if a user has generated a server certificate for the remote extension server that includes only the in-cluster "{name}.{namespace}.svc" hostname and not the host set in the Service's externalName. It's currently unclear how big an issue this would be, since issuing such a certificate looks like a way to work around the existing behavior. The only backwards-compatible approach I could think of with the current changes in my PR would be to retry the request with the existing behavior if the new behavior fails.
Any additional context, feedback, or suggestions on how/where to get feedback on this issue would be very much appreciated.
Thanks!
-Todd