Hi sig-network.
I have been thinking about this for a while, but have not been ready
to really write it yet. Here goes. Sorry it got long.
TL;DR: I think we can do better topology almost-automatically, and I
think we should try.
Long form:
I am not sure that we should proceed with the topology API at this
time. I am not sure we should rip it out yet, either, but I want to
discuss.
First, let's look at the motivating use-cases.
1) Per-node services
2) Avoiding cross-zone traffic when in-zone endpoints would suffice
Were there more? Every case I can recall devolves into one of those two.
Sometimes it is nice to step back and think about a more perfect
world. In that world, how would topology (specifically use-case #2)
be handled? I think it would be automatic. There wouldn't be an API
at all - it would Just Work. And in fact, we have many of the pieces
to make it Just Work already, but not glued together. If we have an
HPA we have a metric and a threshold. If we could collect stats and
compare them to that threshold, we could bias traffic in real-time to
make better decisions. Imagine that - the load would almost be ...
balanced!
In thinking about this, I realized that I have had this conversation
before. The last time was in the context of CPU scheduling and
explicit CPU pinning APIs. For years now I have been a major
roadblock to people who wanted to add such APIs to k8s. My argument
has always been that I believe we can significantly improve 85% of
cases with no API at all, if we just focus on that. Until we prove
that we've run out of runway on the automatic path, we shouldn't be
adding explicit APIs. APIs are forever.
In fact, the API that we're proposing for topology forces a pretty
crappy tradeoff onto the user. THEY have to ensure it stays balanced
or else WE will do the wrong thing. To be fair, there are projects
like descheduler and scheduling API changes coming that may make this
better (though I might make the same automatic-vs-manual argument
there).
If we had a properly smart LB, I bet it could do better automatically
than users can do explicitly. That, hilariously, ends with us saying
"don't use the topology API, it's worse than automatic". How can we
empower smart LBs and give ourselves enough info to become smarter?
Let's put per-node services aside as a special-case. If we have to
handle that explicitly, I think that's acceptable.
Imagine we had environment providers describe the cost metric for
significant topology labels. I think we could use that to bias
routing (e.g. probability in iptables) such that the vast majority of
connections stay in-zone when possible, but spill-over when needed
(rather than forcing users to balance).
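To make "probability in iptables" a bit more concrete: kube-proxy already spreads connections across endpoints with a chain of `-m statistic --mode random --probability P` rules, where each rule is only evaluated if the earlier ones didn't match. Biasing by weight just changes how the per-rule P values are derived. A minimal sketch (the function name and the 3:1 weighting are made up for illustration):

```python
def iptables_probabilities(weights):
    """Convert per-endpoint weights into the sequential --probability
    values an iptables rule chain needs.  Because rule i is only
    reached if rules 0..i-1 did not match, rule i's probability must
    be weights[i] divided by the weight still remaining, which makes
    endpoint i's overall chance weights[i] / sum(weights)."""
    probs = []
    remaining = sum(weights)
    for w in weights:
        probs.append(w / remaining)
        remaining -= w
    return probs

# Three in-zone endpoints weighted 3x over one cross-zone endpoint:
# each in-zone endpoint gets 30% of connections, the cross-zone one 10%.
print(iptables_probabilities([3, 3, 3, 1]))
```

With equal weights this degenerates to the 1/n, 1/(n-1), ... sequence kube-proxy emits today, which is a nice sanity check.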
For a half-baked example:
Assume a cluster has nodes in 2 zones: A and B. Each zone has 2 nodes
A1, A2, and B1, B2.
The cluster has a service "foo", small enough to fit into one EPSlice.
We (k8s) define a topology resource and providers publish instances of
that resource which indicate that traffic between zones has a cost
metric of 2 (just an example).
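For a sense of shape only - this resource does not exist, and every name
here is invented:

```yaml
# Entirely hypothetical provider-published topology cost resource.
kind: TopologyCost
apiVersion: topology.k8s.io/v1alpha1
metadata:
  name: zone-costs
spec:
  key: topology.kubernetes.io/zone
  crossings:
    - from: A
      to: B
      cost: 2    # crossing zones is "2x as expensive" as staying local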
We add a "weight" to each endpoint in a slice.
EPSlice controller is configured to consider zone as a significant
topology. When writing EPSlices for service foo, it writes 2 slices -
one for each zone - each labeled with both "service=foo" and
"zone={A,B}" (its own zone).
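Concretely, the slice for zone A might look something like this - the
"weight" field, the label keys, and the per-zone slice scheme are all
hypothetical, not today's EndpointSlice API:

```yaml
# Hypothetical EPSlice written by the controller for zone-A clients.
apiVersion: discovery.k8s.io/v1beta1
kind: EndpointSlice
metadata:
  name: foo-zone-a
  labels:
    service: foo
    zone: A                     # "biased for clients in zone A"
addressType: IPv4
endpoints:
  - addresses: ["10.0.1.5"]     # endpoint in zone A
    weight: 9
  - addresses: ["10.0.2.7"]     # endpoint in zone B - spill-over only
    weight: 1
```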
Assuming each zone has the same number of endpoints, the slices will
be identical but with opposite weights. If the zones are unbalanced,
the weights would differ - something like a function of the number of
clients in the zone (cores?), the number of endpoints in the zone, and
the number of endpoints outside it. There is prior art to look at.
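To make "a function of clients and endpoints per zone" concrete, here is
one toy version. Everything about it - the function shape, the even
split of spill-over across other zones - is an assumption for
illustration; a real version would plug the provider's cost metric in
where the even split is:

```python
def zone_weights(zone, clients, endpoints):
    """Toy weight function for the EPSlice serving `zone`.

    clients:   dict zone -> number of client cores in that zone
    endpoints: dict zone -> number of ready endpoints in that zone

    Returns dict zone -> weight to apply to endpoints from that zone
    when routing traffic originating in `zone`.  In-zone endpoints
    always get weight 1.0; other zones get weight > 0 only when this
    zone has less capacity than its share of demand warrants.
    """
    total_clients = sum(clients.values())
    total_endpoints = sum(endpoints.values())
    demand = clients[zone] / total_clients        # share of total demand here
    supply = endpoints[zone] / total_endpoints    # share of total capacity here
    weights = {z: 0.0 for z in endpoints}
    weights[zone] = 1.0
    if supply < demand:
        # Spill the uncovered fraction of demand to other zones,
        # split evenly among them for simplicity.
        spill = (demand - supply) / demand
        others = [z for z in endpoints if z != zone and endpoints[z] > 0]
        for z in others:
            weights[z] = spill / len(others)
    return weights

print(zone_weights("A", {"A": 4, "B": 4}, {"A": 2, "B": 2}))
# balanced: {'A': 1.0, 'B': 0.0} - zone A never needs to cross

print(zone_weights("B", {"A": 4, "B": 4}, {"A": 3, "B": 1}))
# unbalanced: {'A': 0.5, 'B': 1.0} - B spills some traffic into A
```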
kube-proxy would also be configured to consider zone as a significant
topology. When selecting EPSlices for "foo" it would select
"service=foo" and "zone={A,B}" (matching its own zone). It would see
the "best" endpoints for it to use, with biases. Probability would be
skewed towards same-zone, but if there are not enough endpoints it may
include some from other zones. Non-deterministic, but probabilistic.
In the case of the balanced cluster, there's no need for clients in A
to go to endpoints in B. We don't even really need the feedback loop
(though that would be better). If the cluster became unbalanced (e.g.
more endpoints in A), clients in A would stay within A, but clients in
B would get a small chance of crossing into A.
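The data-plane side of that is just a weighted random draw. A toy
simulation of the unbalanced case (endpoint names and the specific
weights are invented; assume B's slice gives its one local endpoint
weight 1.0 and splits a 0.5 cross-zone weight across A's three
endpoints):

```python
import random
from collections import Counter

# What a kube-proxy in zone B might see for "foo" in the unbalanced
# case: one local endpoint, three cross-zone endpoints in A.
endpoints = ["b1", "a1", "a2", "a3"]
weights = [1.0, 0.5 / 3, 0.5 / 3, 0.5 / 3]   # invented for illustration

random.seed(0)  # deterministic for the example
picks = Counter(random.choices(endpoints, weights=weights, k=100_000))
cross_zone = (picks["a1"] + picks["a2"] + picks["a3"]) / 100_000
print(f"connections leaving zone B: {cross_zone:.1%}")
# with these weights, roughly a third of B's traffic spills into A
```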
Users don't have to specify anything, it just happens naturally. No
API. We'd need a special API for "per-node" but that seems OK as a
special case.
There's an open question about whether this optimization function
should be in every proxy or centralized in the controller. Centralized
seems better at high scale, worse at low scale. Doing it in each
proxy means that proxies can be smarter (e.g. could consider local
load or even ingest global load metrics). Controller could probably
also do that, but less fine-grained. It could do subsetting, but not
per-node.
NOW, if that still isn't good enough, then it may be time for an explicit API.
Lastly, I think it's worthwhile to discuss normalizing topology keys.
I've always said that it's arbitrary, but in truth there's been low
demand for other keys and other systems have standardized on 2 or 3
level hierarchies. Maybe we should standardize
on 2 or 3 levels? We already define region and zone. xDS defines
sub-zone (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/base.proto#envoy-v3-api-msg-config-core-v3-locality).
If we normalize topology, we may consider deprecating
"kubernetes.io/hostname" in favor of "topology.kubernetes.io/node" or
similar.
If we normalize topology, there are possible alternatives to
topologyKeys. For example, we could simply enumerate a set of
balancing algorithms that understand region/zone.
There's a lot in here. Thoughts?