Rethinking topology API


Tim Hockin

Mar 11, 2020, 1:11:50 AM
to kubernetes-sig-network
Hi sig-network.

I have been thinking about this for a while, but have not been ready
to really write it yet. Here goes. Sorry it got long.

TL;DR: I think we can do better topology almost-automatically, and I
think we should try.

Long form:

I am not sure that we should proceed with the topology API at this
time. I am not sure we should rip it out yet, either, but I want to
discuss.

First, let's look at the motivating use-cases.

1) Per-node services
2) Avoiding cross-zone traffic when in-zone endpoints would suffice

Were there more? Every case I can recall devolves into one of those two.

Sometimes it is nice to step back and think about a more perfect
world. In that world, how would topology (specifically use-case #2)
be handled? I think it would be automatic. There wouldn't be an API
at all - it would Just Work. And in fact, we have many of the pieces
to make it Just Work already, but not glued together. If we have an
HPA we have a metric and a threshold. If we could collect stats and
compare them to that threshold, we could bias traffic in real-time to
make better decisions. Imagine that - the load would almost be ...
balanced!

In thinking about this, I realized that I have had this conversation
before. The last time was in the context of CPU scheduling and
explicit CPU pinning APIs. For years now I have been a major
roadblock to people who wanted to add such APIs to k8s. My argument
has always been that I believe we can significantly improve 85% of
cases with no API at all, if we just focus on that. Until we prove
that we've run out of runway on the automatic path, we shouldn't be
adding explicit APIs. APIs are forever.

In fact, the API that we're proposing for topology forces a pretty
crappy tradeoff onto the user. THEY have to ensure it stays balanced
or else WE will do the wrong thing. To be fair, there are projects
like descheduler and scheduling API changes coming that may make this
better (though I might make the same automatic-vs-manual argument
there).

If we had a properly smart LB, I bet it could do better automatically
than users can do explicitly. That, hilariously, ends with us saying
"don't use the topology API, it's worse than automatic". How can we
empower smart LBs and give ourselves enough info to become smarter?

Let's put per-node services aside as a special-case. If we have to
handle that explicitly, I think that's acceptable.

Imagine we had environment providers describe the cost metric for
significant topology labels. I think we could use that to bias
routing (e.g. probability in iptables) such that the vast majority of
connections stay in-zone when possible, but spill-over when needed
(rather than forcing users to balance).

For a half-baked example:

Assume a cluster has nodes in 2 zones: A and B. Each zone has 2 nodes
A1, A2, and B1, B2.

Cluster has a service "foo", which is small enough to fit into one EPslice.

We (k8s) define a topology resource and providers publish instances of
that resource which indicate that traffic between zones has a cost
metric of 2 (just an example).
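
To make that slightly more concrete, here is a rough Go sketch; the
ZoneCost type and everything else in it is made up for illustration,
nothing like it exists in the API today:

// Hypothetical sketch: one possible shape for a provider-published
// topology cost table. None of these types exist in Kubernetes.
package main

import "fmt"

// ZoneCost says "traffic from FromZone to ToZone has relative cost Cost".
type ZoneCost struct {
    FromZone, ToZone string
    Cost             int // e.g. 1 within a zone, 2 across zones
}

// costBetween looks up the published cost, defaulting to 1 (same zone / unknown).
func costBetween(costs []ZoneCost, from, to string) int {
    for _, c := range costs {
        if c.FromZone == from && c.ToZone == to {
            return c.Cost
        }
    }
    return 1
}

func main() {
    costs := []ZoneCost{{"A", "B", 2}, {"B", "A", 2}}
    fmt.Println(costBetween(costs, "A", "B")) // 2
    fmt.Println(costBetween(costs, "A", "A")) // 1
}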

We add a "weight" to each endpoint in a slice.

EPSlice controller is configured to consider zone as a significant
topology. When writing EPSlice for service foo, it writes 2 slices -
one for each zone. It labels the slices as both "service=foo" but
also "zone={A,B}".

Assuming each zone has the same number of endpoints, the slices will
be identical but with opposite weights. If the zones are unbalanced,
the weights would be different. Something like a function of number
of clients in the zone (cores?) and number of endpoints in the zone
and number of endpoints not in the zone. There is prior art to look
at.
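
For illustration only, a minimal sketch of one such weight function,
assuming client demand is proportional to cores per zone and that every
endpoint should receive an equal share of total demand (the function and
its inputs are assumptions, not real controller logic):

// Hypothetical sketch, not the real controller logic: compute how much of a
// zone's client traffic should stay in-zone, assuming client demand is
// proportional to cores in the zone and every endpoint should receive an
// equal share of the total demand.
package main

import "fmt"

func localFraction(coresByZone, endpointsByZone map[string]float64, zone string) float64 {
    var totalCores, totalEndpoints float64
    for _, c := range coresByZone {
        totalCores += c
    }
    for _, e := range endpointsByZone {
        totalEndpoints += e
    }
    if totalEndpoints == 0 || coresByZone[zone] == 0 {
        return 0
    }
    // Fair share of client demand (in "cores") each endpoint should see.
    perEndpoint := totalCores / totalEndpoints
    // Demand the zone's own endpoints can absorb at that fair share.
    localCapacity := endpointsByZone[zone] * perEndpoint
    if frac := localCapacity / coresByZone[zone]; frac < 1 {
        return frac // spill the rest to other zones
    }
    return 1 // everything can stay in-zone
}

func main() {
    cores := map[string]float64{"A": 10, "B": 10}
    // Balanced endpoints: all of zone B's traffic stays in B.
    fmt.Println(localFraction(cores, map[string]float64{"A": 5, "B": 5}, "B")) // 1
    // Unbalanced (more endpoints in A): zone B spills 40% of its traffic into A.
    fmt.Println(localFraction(cores, map[string]float64{"A": 7, "B": 3}, "B")) // 0.6
}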

kube-proxy would also be configured to consider zone as a significant
topology. When selecting EPslices for "foo" it would select
"service=foo" and "zone={A,B}" (matching its own zone). It would see
the "best" endpoints for it to use, with biases. Probability would be
skewed towards same-zone, but if there are not enough endpoints it may
include some from other zones. Non-deterministic, but probabilistic.
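
A minimal sketch of the proxy side, assuming per-endpoint weights like
the above; WeightedEndpoint and the numbers are invented, and
kube-proxy's iptables mode would only approximate this with
random-probability rules:

// Hypothetical sketch of the proxy side: a probabilistic pick over weighted
// endpoints. WeightedEndpoint and the weights below are made up.
package main

import (
    "fmt"
    "math/rand"
)

type WeightedEndpoint struct {
    IP     string
    Weight float64
}

// pick chooses an endpoint with probability proportional to its weight.
func pick(eps []WeightedEndpoint) WeightedEndpoint {
    var total float64
    for _, e := range eps {
        total += e.Weight
    }
    r := rand.Float64() * total
    for _, e := range eps {
        if r < e.Weight {
            return e
        }
        r -= e.Weight
    }
    return eps[len(eps)-1] // floating-point fallback
}

func main() {
    // Same-zone endpoints weighted heavily; one cross-zone endpoint kept as spill-over.
    eps := []WeightedEndpoint{
        {IP: "10.0.1.5", Weight: 0.45}, // zone A
        {IP: "10.0.1.6", Weight: 0.45}, // zone A
        {IP: "10.0.2.7", Weight: 0.10}, // zone B: small spill-over chance
    }
    fmt.Println(pick(eps).IP)
}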

In the case of the balanced cluster, there's no need for clients in A
to go to endpoints in B. We don't even really need the feedback loop
(though that would be better). If the cluster became unbalanced (e.g.
more endpoints in A), clients in A would stay within A, but clients in
B would get a small chance of crossing into A.

Users don't have to specify anything, it just happens naturally. No
API. We'd need a special API for "per-node" but that seems OK as a
special case.

There's an open question about whether this optimization function
should be in every proxy or centralized in the controller. Centralized
seems better at high scale, worse at low scale. Doing it in each
proxy means that proxies can be smarter (e.g. could consider local
load or even ingest global load metrics). Controller could probably
also do that, but less fine-grained. It could do subsetting, but not
per-node.

NOW, if that still isn't good enough, then it may be time for an explicit API.

Lastly, I think it's worthwhile to discuss normalizing topology keys.
I've always said that it's arbitrary, but in truth there's been low
demand for other keys and other systems have standardized on 2 or 3
level hierarchies. Maybe we should standardize
on 2 or 3 levels? We already define region and zone. xDS defines
sub-zone (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/base.proto#envoy-v3-api-msg-config-core-v3-locality).

If we normalize topology, we may consider deprecating
"kubernetes.io/hostname" in favor of "topology.kubernetes.io/node" or
similar.

If we normalize topology, there are possible alternatives to
topologyKeys. For example, we could simply enumerate a set of
balancing algorithms that understand region/zone.

There's a lot in here. Thoughts?

Mikaël Cluseau

Mar 11, 2020, 2:42:07 AM
to Tim Hockin, kubernetes-sig-network
Hi Tim,

in short, here are my thoughts around all that :-)

On the global automation idea, I agree with everything but one point: the approach seems to optimize for global load balance, but I think one of the things users also express with topology is a latency or cost requirement. The API could be higher level, expressing just the goal with a single field on the Service, something like the IP ToS field (lowdelay, throughput, reliability, lowcost, plus "balance" because that's k8s business).
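
To make that suggestion concrete, a tiny sketch of what such a field
could look like; this is entirely hypothetical, no such field or type
exists in the Service API:

// Hypothetical sketch of a ToS-style "goal" field on a Service.
// TrafficGoal and its values are made up for illustration only.
package main

import "fmt"

type TrafficGoal string

const (
    GoalLowDelay    TrafficGoal = "LowDelay"
    GoalThroughput  TrafficGoal = "Throughput"
    GoalReliability TrafficGoal = "Reliability"
    GoalLowCost     TrafficGoal = "LowCost"
    GoalBalance     TrafficGoal = "Balance" // default: just balance the load
)

func main() {
    // A service owner would express intent, not topology mechanics.
    goal := GoalLowCost
    fmt.Println("optimize traffic for:", goal)
}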

On the smarter proxy question, I think it's something that will be possible in the future at a reasonable cost per node. Not everything will have to be in it, of course, just the node-local relevant parts. I think the optimization function should be split roughly as follows: (1) the controller does everything that's not node-specific (stats and maybe some scoring, i.e. per zone/region/rack); (2) the proxy takes this information to compute the optimum for the local node. To me, avoiding (2) means that (a) we have a load on the controller that will be hard to predict (memory? API IOPS?) and (b) a lot of fluctuating data will have to be stored and published by the API.



Sandor Szuecs

Mar 11, 2020, 5:20:42 AM
to Tim Hockin, kubernetes-sig-network
Tim, thanks for taking the time to write this e-mail!

On Wed, 11 Mar 2020 at 06:11, 'Tim Hockin' via kubernetes-sig-network <kubernetes-...@googlegroups.com> wrote:
> Hi sig-network.
>
> I have been thinking about this for a while, but have not been ready
> to really write it yet.  Here goes.  Sorry it got long.
>
> TL;DR: I think we can do better topology almost-automatically, and I
> think we should try.
>
> Long form:
>
> I am not sure that we should proceed with the topology API at this
> time.  I am not sure we should rip it out yet, either, but I want to
> discuss.
>
> First, let's look at the motivating use-cases.
>
> 1) Per-node services
> 2) Avoiding cross-zone traffic when in-zone endpoints would suffice
>
> Were there more?  Every case I can recall devolves into one of those two.

In my case it would be only 2). And to be honest, in our AWS measurements cross-AZ traffic performed similarly to intra-AZ traffic.
Adding maybe 1ms is not a problem in our case, but there will be others who care about it.
In our case the main drivers for using topology would be cost reduction and maybe reliability (AZ outage).

> Sometimes it is nice to step back and think about a more perfect
> world.  In that world, how would topology (specifically use-case #2)
> be handled?  I think it would be automatic.  There wouldn't be an API
> at all - it would Just Work.  And in fact, we have many of the pieces
> to make it Just Work already, but not glued together.  If we have an
> HPA we have a metric and a threshold.  If we could collect stats and
> compare them to that threshold, we could bias traffic in real-time to
> make better decisions.  Imagine that - the load would almost be ...
> balanced!

--------------8<--------------

> NOW, if that still isn't good enough, then it may be time for an explicit API.
>
> Lastly, I think it's worthwhile to discuss normalizing topology keys.
> I've always said that it's arbitrary, but in truth there's been low
> demand for other keys and other systems have standardized on 2 or 3
> level hierarchies.  Maybe we should standardize
> on 2 or 3 levels?

Region, zone, rack (?) and maybe cluster?
I don't really need rack, because we didn't measure inter-zone traffic being a problem for us. It might be interesting for people running their own DCs.
I have been thinking for a while now about how to do multi-cluster applications, and I believe that would be a good case for a cluster topology level.

> We already define region and zone.  xDS defines
> sub-zone (https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/core/v3/base.proto#envoy-v3-api-msg-config-core-v3-locality).
>
> If we normalize topology, we may consider deprecating
> "kubernetes.io/hostname" in favor of "topology.kubernetes.io/node" or
> similar.

I am fine with that.
 

> If we normalize topology, there are possible alternatives to
> topologyKeys.  For example, we could simply enumerate a set of
> balancing algorithms that understand region/zone.
>
> There's a lot in here.  Thoughts?

Thanks for your write-up and the examples that make it easier to follow!

Best, sandor


--
Sandor Szücs | 418 I'm a teapot

ayodele abejide

Mar 11, 2020, 9:47:41 AM
to kubernetes-sig-network
> Lastly, I think it's worthwhile to discuss normalizing topology keys.
> I've always said that it's arbitrary, but in truth there's been low
> demand for other keys and other systems have standardized on 2 or 3
> level hierarchies.

If topology remains arbitrary, what you have discussed here can be used
(or abused) to solve this issue:
https://github.com/kubernetes/kubernetes/issues/85395

A user could define an arbitrary topology, assign it costs (weights), and use
that to influence traffic patterns.

Dan Winship

Mar 11, 2020, 11:05:00 AM
to Tim Hockin, kubernetes-sig-network
On 3/11/20 1:11 AM, 'Tim Hockin' via kubernetes-sig-network wrote:
> kube-proxy would also be configured to consider zone as a significant
> topology. When selecting EPslices for "foo" it would select
> "service=foo" and "zone={A,B}" (matching its own zone). It would see
> the "best" endpoints for it to use, with biases. Probability would be
> skewed towards same-zone, but if there are not enough endpoints it may
> include some from other zones. Non-deterministic, but probabilistic.
>
> In the case of the balanced cluster, there's no need for clients in A
> to go to endpoints in B. We don't even really need the feedback loop
> (though that would be better). If the cluster became unbalanced (e.g.
> more endpoints in A), clients in A would stay within A, but clients in
> B would get a small chance of crossing into A.
>
> Users don't have to specify anything, it just happens naturally. No
> API. We'd need a special API for "per-node" but that seems OK as a
> special case.

Why do we need a special API for "per-node"? Kube-proxy knows which
endpoint IPs are local (same node) and which aren't, so it could just
weight the local ones higher.

-- Dan

Tim Hockin

Mar 11, 2020, 2:48:23 PM
to Dan Winship, kubernetes-sig-network
Maybe? I am not sure, and I guess it would be a bit of a research
project as we define the weight function.

E.g. Imagine a single-zone cluster that has one 32-core node and one 4
core node. There's a service running with 2 replicas. Naively, one
would expect the 2 replicas to be about evenly loaded, but clearly the
larger node could host more clients than the smaller node. That would
cause overload on one replica and underload on the other. The way I
was thinking of this function would consider the total compute (36
cores) and the total capacity (2 replicas) and bias each node based on
that. The smaller node could always choose its local replica, but the
larger node would choose its local replica 56% of the time and the
other replica 44% of the time - resulting in about 50% of compute
capacity routing to each replica.
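
A tiny sketch of that arithmetic, using only the fair-share rule
described above (nothing here is real kube-proxy logic):

// Hypothetical sketch of the 32-core / 4-core example: each node sends traffic
// to its local replica only up to that replica's fair share of total compute,
// and spills the remainder to the other replica.
package main

import "fmt"

func main() {
    nodeCores := map[string]float64{"big": 32, "small": 4}
    totalCores, replicas := 36.0, 2.0
    fairSharePerReplica := totalCores / replicas // 18 cores' worth of clients each

    for node, cores := range nodeCores {
        localFrac := fairSharePerReplica / cores
        if localFrac > 1 {
            localFrac = 1 // the small node can always use its local replica
        }
        fmt.Printf("%s node: %.0f%% local, %.0f%% remote\n",
            node, localFrac*100, (1-localFrac)*100)
    }
    // big node:   56% local, 44% remote (18/32)
    // small node: 100% local, 0% remote
}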

Now, this logic gets harder with more than 2 nodes, and I am not sure
whether node is really part of the calculus because of cardinality,
but I think you get the point.

Casey Callendrello

Aug 13, 2020, 9:52:53 AM
to Tim Hockin, Dan Winship, kubernetes-sig-network
Sorry to necro-thread, but I've been idly thinking about this for a while. It hearkens back to my days at $CDN.

Topology-aware routing splits clients and servers into buckets - we're finally diverging from kube-proxy's unweighted round robin. This means we need to think a bit more carefully about how it all fits together.

What we have is a marriage problem - we would like to assign each client bucket to the "best" available server bucket. Best, in this case, would be something that minimizes some topology cost function without exceeding capacity or fairness bounds. In general, to do this, we need at least 4 things:

0. A global load-balancing decision
1. current headroom per server bucket
2. demand attribution - how much traffic is coming from each bucket?
3. a topology cost function

Additionally, one might like to see

4. capacity estimates
5. fairness function & bounds

Let's look into each of these individually.

0. A global load-balancing decision

Because we're marrying clients to servers, no one node knows enough to compute a full load-balancing picture. Just like every Real Network in existence (Andromeda, for example), you need a single global bucketing decision. You don't need to balance every request; rather, you need to assign client buckets to server buckets and push that state out. Then, the individual load balancers (i.e. kube-proxy) can round-robin within that decision.

1. Current headroom:

We don't actually care how much load a particular pod is serving. If one particular bucket is hot, but our metrics are good, then we don't care. Uneven load distribution is fine.

Rather, we care about when a "bucket" is over capacity. Fortunately, the autoscaling efforts have already explored this space and decided on a solution around a single Prometheus metric, as you pointed out, Tim. So, we need some sort of performance metric and bounds for it. This doesn't have to be bit/sec or conns/sec. It can be latency %-ile, or even memory usage.

2. Demand attribution

If a server bucket exceeds its bounds (e.g. request latency is too high), then we need to re-compute the marriage. One or more client buckets need either more server buckets, or need fewer competing clients. Except where the load balancing is injective, this is difficult to accomplish without demand attribution.

Fortunately, we *could* get an approximate measure for this from kube-proxy via iptables / ipvs counters. Not perfect, but not bad. If we assume requests are broadly equivalent in terms of demand generated, this would give us enough basic information to go on.

3. Cost function

This is, indeed, the whole point of the exercise. As an end-user, I might care about minimizing request latency. Or, I might want to avoid usage of expensive interconnects. Or both. Back in my $CDN days, this was a bit easier - we just checked our bandwidth contract (modulo 95/5... maybe AWS users do have it easier).

So we want to minimize the cost. What do we do when we can't assign a client to a server in the same bucket? Is it okay to send a small bit of load to an "expensive" server, if it minimizes overall AWS bill, or is that going to blow your latency budget and should be avoided? This goes hand-in-hand with fairness bounds.

We need some way to express costs between topology levels, and this is not a trivial exercise.

4. Capacity estimates

If the load balancing is computed frequently, you can be purely reactive and don't need an estimate for each bucket's size. But knowing available capacity per bucket makes this easier.

5. Fairness bounds

Without fairness bounds, the cost function may do something you don't expect, like assign one bucket of clients to servers that are very far away. Overall minimization functions can do funny things. If you care about spreading the pain, you also need some way to tell the load balancer not to exceed a certain cost for any single client-server pair.
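
To tie a few of these together, a deliberately naive sketch; everything
in it is invented for illustration. It assigns each client bucket to the
cheapest server bucket that still has headroom (1), charges one unit of
demand per bucket (2), uses a topology cost function (3), and refuses
pairs beyond a fairness bound (5):

// Hypothetical sketch of a greedy client-to-server bucket assignment.
package main

import "fmt"

type serverBucket struct {
    zone     string
    headroom float64 // remaining capacity (point 1)
}

type clientBucket struct{ name, zone string }

// assign greedily maps each client bucket to the cheapest server bucket with
// headroom left, skipping pairs whose cost exceeds the fairness bound maxCost.
func assign(clients []clientBucket, servers []serverBucket, cost func(from, to string) float64, maxCost float64) map[string]string {
    out := map[string]string{}
    for _, c := range clients {
        best := -1
        for i, s := range servers {
            if s.headroom <= 0 || cost(c.zone, s.zone) > maxCost {
                continue
            }
            if best < 0 || cost(c.zone, s.zone) < cost(c.zone, servers[best].zone) {
                best = i
            }
        }
        if best >= 0 {
            out[c.name] = servers[best].zone
            servers[best].headroom-- // crude demand attribution (point 2): one unit per bucket
        }
    }
    return out
}

func main() {
    servers := []serverBucket{{"zone-a", 1}, {"zone-b", 2}}
    cost := func(from, to string) float64 {
        if from == to {
            return 1
        }
        return 2 // cross-zone interconnect (point 3)
    }
    // Prints map[a1:zone-a a2:zone-b b1:zone-b]: a2 spills over once zone-a is full.
    fmt.Println(assign([]clientBucket{{"a1", "zone-a"}, {"a2", "zone-a"}, {"b1", "zone-b"}}, servers, cost, 2))
}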


Conclusion

So, this has gotten long, but it's been fun to write. I hope it's been helpful. Where do I think we should go from this?

Well, the API for expressing topology hierarchies seems reasonable. However, if we want kube-proxy to function outside the most trivial of load-balancing scenarios, we need, at the very least, a headroom metric and a cost function. Thanks to the work done by the autoscaling teams, we actually are closer to this than we think.


Andrew Kim

Aug 13, 2020, 10:49:26 AM
to Casey Callendrello, Tim Hockin, Dan Winship, kubernetes-sig-network
IMO I think the current state of Service Topology is actually pretty
good and most of the concerns around traffic weight distribution can
be solved by leveraging existing scheduler features (pod
affinity/anti-affinity). Many users work around the uneven traffic
distribution when using externalTrafficPolicy=Local in this way and it
has been working for the most part - though admittedly this is just
based on what I've been seeing so correct me if I'm wrong here.

I would be in favor of keeping the Service Topology API simple and
seeing if we can rely on existing scheduler / auto scaling features to
ensure traffic distribution is even.

Antonio Ojea

Aug 14, 2020, 11:51:57 AM
to Andrew Kim, Casey Callendrello, Tim Hockin, Dan Winship, kubernetes-sig-network
I have 2 questions: 

1. What is the line between this being a network problem or a pod scheduling problem?
All the examples seem to be solvable with one or the other, with the network solution always depending on the pod scheduling, but I may be missing something :/

2. Traffic entropy matters for any traffic engineering application; the entropy of an internet backbone or a relatively "stable" environment is not the same as that of a bursty environment full of unrelated applications with a lot of churn.
Is the goal to enable this optimization for ALL cluster traffic or just some of it? Should we define a threshold for when traffic is stable enough to apply this kind of optimization?

And one thought on the standardization question: I think 3 levels of hierarchy is better; who knows what the multi-cluster work will bring in this regard, so better not to cap ourselves if there is no need :-)

Tim Hockin

Aug 14, 2020, 12:10:59 PM
to Andrew Kim, Casey Callendrello, Dan Winship, kubernetes-sig-network
Inline responses, of course.

On Thu, Aug 13, 2020 at 7:49 AM Andrew Kim <kim.an...@gmail.com> wrote:
>
> IMO I think the current state of Service Topology is actually pretty
> good and most of the concerns around traffic weight distribution can
> be solved by leveraging existing scheduler features (pod
> affinity/anti-affinity). Many users work around the uneven traffic
> distribution when using externalTrafficPolicy=Local in this way and it
> has been working for the most part - though admittedly this is just
> based on what I've been seeing so correct me if I'm wrong here.
>
> I would be in favor of keeping the Service Topology API simple and
> seeing if we can rely on existing scheduler / auto scaling features to
> ensure traffic distribution is even.

Do you mean `topologyKeys` or just the labels? My problem with
per-service topology keys is a) it is too rigid (same or bust); b) it
doesn't need to be an API (we can infer it).
Nice write-up. I agree overall, though I suspect we can get a LOT of
value without any feedback, which is what the EPSlice subsetting was
exploring. After that, we can incrementally improve with feedback.
It makes some dubious assumptions - the cost between any 2 zones or
regions is equal, for example. Within a cluster that is likely to be
true. In a multi-cluster world, clearly not.

So in terms of next steps, I would like us to convince ourselves that
the EPS subsetting is or is not a good vehicle for delivering
endpoints (it's basically EDS, in the limit). Then we can optimize
the heuristic to consider feedback and cost metrics.

Andrew Kim

Aug 14, 2020, 12:42:20 PM
to Tim Hockin, Casey Callendrello, Dan Winship, kubernetes-sig-network
> Do you mean `topologyKeys` or just the labels?

I meant `topologyKeys`.

> My problem with
> per-service topology keys, is a) it is too rigid (same or bust); b) it
> doesn't need to be an API (we can infer it).

I may be missing context here, can you clarify what you mean by "same
or bust" in this scenario?
I thought topologyKeys is a prioritized list so the worst case for
"bust" would be falling back to cluster-wide (assuming '*' is the last
topology key), which is no worse than what we do today for in-cluster
traffic.

Tim Hockin

Aug 14, 2020, 12:49:45 PM
to Andrew Kim, Casey Callendrello, Dan Winship, kubernetes-sig-network
On Fri, Aug 14, 2020 at 9:42 AM Andrew Kim <kim.an...@gmail.com> wrote:
>
> > Do you mean `topologyKeys` or just the labels?
>
> I meant `topologyKeys`.
>
> > My problem with
> > per-service topology keys, is a) it is too rigid (same or bust); b) it
> > doesn't need to be an API (we can infer it).
>
> I may be missing context here, can you clarify what you mean by "same
> or bust" in this scenario?

It will always choose an endpoint in the same zone before considering
any endpoint in another zone, which we can demonstrate is somewhat
dumb, and which shifts the burden of doing the right thing onto users,
when I really think we can do better automatically.

In an ideal world of perfectly balanced scheduling it might be OK, but
we don't live in that world.

> I thought topologyKeys is a prioritized list so the worst case for
> "bust" would be falling back to cluster-wide (assuming '*' is the last
> topology key), which is no worse than what we do today for in-cluster
> traffic.

If you have 10 nodes in zone A and 10 nodes in zone B, but scheduling has
ended up with 1 service endpoint in zone A and 9 in zone B, all 10
zone A nodes will pound the one replica.
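
A quick sketch of the arithmetic in that scenario, assuming every node
generates equal demand (numbers are per-replica load in "client nodes"):

// Hypothetical sketch: strict "same zone first" vs. spreading each zone's
// client load across all endpoints in proportion to endpoint count.
package main

import "fmt"

func main() {
    clientsA, clientsB := 10.0, 10.0
    endpointsA, endpointsB := 1.0, 9.0

    // Strict topologyKeys: each zone's clients stay on their own zone's endpoints.
    fmt.Printf("strict: A replica carries %.1f, each B replica carries %.1f\n",
        clientsA/endpointsA, clientsB/endpointsB) // 10.0 vs ~1.1

    // Proportional: all 20 client nodes spread over all 10 endpoints.
    perEndpoint := (clientsA + clientsB) / (endpointsA + endpointsB)
    fmt.Printf("proportional: every replica carries %.1f\n", perEndpoint) // 2.0
}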

I am not against something like that as an opt-in (MAYBE) but I don't
think it is a sane default. What was shown at sig-net last week seems
much saner.

Andrew Kim

Aug 14, 2020, 12:55:24 PM
to Tim Hockin, Casey Callendrello, Dan Winship, kubernetes-sig-network
Makes sense -- thanks for clarifying Tim!

Tim Hockin

Aug 14, 2020, 1:19:41 PM
to Andrew Kim, Casey Callendrello, Dan Winship, kubernetes-sig-network
As with all things, debate is welcome, but I have become firmly
"less-is-more" when it comes to API :)

Andrew Kim

Aug 14, 2020, 1:43:50 PM
to Tim Hockin, Casey Callendrello, Dan Winship, kubernetes-sig-network
I agree about saner defaults, but I don't think there's a silver
bullet here and there are enough use-cases that would warrant an API
in the form of something like topologyKeys.

As an API I think it's pretty clean / simple and most of the edge
cases could be solved by using scheduling features like pod affinity /
anti-affinity.

Admittedly I haven't spent too much time looking at the EPS subsetting
work so will definitely add that to my reading list before I start a
debate :)

Tim Hockin

Aug 14, 2020, 1:49:48 PM
to Andrew Kim, Casey Callendrello, Dan Winship, kubernetes-sig-network
There's clearly a need for a way to express "same node".

It's NOT clear whether there will really be a need for topology keys
in general, especially since we KEP'ed (PR coming eventually) that
topology keys are standard.

My instinct is to rip out the topology keys API, focus on doing it
automatically, and add something for same-node. But I am not sure that
topologyKeys (or something very much like it) is WRONG, and ripping it
out now would be thrash. So I am inclined to just leave it for now,
explore how to do better automatically, then revisit.

Andrew Kim

Aug 14, 2020, 1:55:01 PM
to Tim Hockin, Casey Callendrello, Dan Winship, kubernetes-sig-network
Yeah makes sense -- and agreed that most of the topologyKeys use-cases
are actually just "same node" in-cluster.

Rob Scott

Sep 22, 2020, 7:44:53 PM
to Andrew Kim, Tim Hockin, Casey Callendrello, Dan Winship, kubernetes-sig-network, Rick Chen
Hey Everyone,

We ended up proposing a new KEP based on this. Rick Chen (ckyiricky) did a lot of awesome work to evaluate potential approaches here and we worked together to make this proposal. Although it's not completely automatic, I think it's a reasonable evolution and simplification of the API. Would love your feedback here: https://github.com/kubernetes/enhancements/pull/2005

Thanks!

Rob
