Re: [kubernetes/kubernetes] kubernetes svc endpoints are not updated when one of the API servers is down (when running HA API) (#56584)

k8s-ci-robot

Nov 29, 2017, 2:14:56 PM

@calvix: Reiterating the mentions to trigger a notification:
@kubernetes/sig-api-machinery-bugs

In response to this:

/sig api-machinery
@kubernetes/sig-api-machinery-bugs

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Yu Liao

Dec 4, 2017, 4:28:27 PM

/cc @mikedanese
/sub

Mike Danese

Dec 4, 2017, 4:32:27 PM

This is a dup of #22609 and should be fixed by #51698, which is currently alpha.
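
For later readers: #51698 is the lease-based endpoint reconciler. If I remember correctly it is selected with the --endpoint-reconciler-type flag on kube-apiserver (alpha when this was written, and I believe it later became the default), so enabling it looks roughly like the sketch below; check the flag against your release's documentation first.

    # Rough sketch, assuming --endpoint-reconciler-type is available in your release.
    # The lease reconciler drops apiservers from the kubernetes service endpoints when
    # they stop renewing their lease, instead of relying on a fixed master count.
    kube-apiserver \
      --advertise-address=<this master's IP> \
      --endpoint-reconciler-type=lease \
      <other existing flags unchanged>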

calvix

Jan 4, 2018, 3:05:34 AM

closing dup

calvix

Jan 4, 2018, 3:06:01 AM

Closed #56584.

Aaron Weisberg

Dec 23, 2019, 2:26:44 PM

Is there a manual fix for this? It seems like any manual IP changes to the kubernetes endpoints object get reverted.

Daniel Strivelli

Dec 14, 2020, 1:41:34 PM

Is there a manual fix for this for those that are running an older version? It seems like any manual IP changes to the kubernetes endpoints object get reverted.

Did you ever find a resolution to this? We don't see proper "load balancing" happening between the 3 API servers listed in the endpoints. One of our masters is pegged super high while the others just sit barely idling. We have a load balancer in front of the control-plane for externally originated API requests, but that doesn't appear to apply to the internal traffic.

I need to get to the bottom of why one master is pegged, or figure out how to adjust the endpoints of the kubernetes service to point to the load balancer IPs instead of the direct master IPs.
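
For what it's worth, the addresses that in-cluster clients are handed come from the endpoints of the default kubernetes service, which the apiserver's reconciler keeps overwriting; you can inspect them like this (read-only, nothing here changes anything):

    # Which apiserver IPs does the built-in reconciler currently publish for in-cluster clients?
    kubectl -n default get endpoints kubernetes -o yaml
    kubectl -n default describe service kubernetes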

Daniel Smith

Dec 14, 2020, 2:08:48 PM

One of our masters is pegged super high while the others just sit barely idling.

Causes:

  1. Co-located controller manager / scheduler which have the lease and are not going through the load balancer
  2. Rolling restart of apiservers leaves most clients connected to a single apiserver.
  3. Clients are using HTTP2 and route their many requests over a single connection, exacerbating 2

I think in 1.20 it should be safe to turn back on the probabilistic GOAWAY, which will fix 2 / 3 over time. To fix 1 you also have to send those components' traffic through the load balancer.
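
(For later readers: as far as I know the probabilistic GOAWAY mentioned here is controlled by the kube-apiserver --goaway-chance flag. A minimal sketch, assuming the flag exists in your release; verify before relying on it:)

    # Have the apiserver occasionally send an HTTP/2 GOAWAY so long-lived client
    # connections are forced to reconnect and spread back out across apiservers.
    # The value is a per-request probability; 0.001 means roughly 1 request in 1000.
    kube-apiserver \
      --goaway-chance=0.001 \
      <other existing flags unchanged>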

There is probably a better bug to attach this info to. @liggitt @deads2k

Jordan Liggitt

Dec 14, 2020, 2:25:40 PM

I think #94532 (comment) is the http/2 issue we need a new Go version for in order to enable probabilistic GOAWAY.

Daniel Smith

Dec 14, 2020, 2:28:51 PM

Yeah, I've conflated this with the http2 disconnection detection change that made it into 1.20 (which, to be fair, is going to be needed by clients in the scenario that triggers my 2 above).

Daniel Strivelli

Dec 14, 2020, 3:34:46 PM

@lavalamp thanks for the reply! We are running a pretty old cluster (1.12, long story, working toward upgrades), so I'm not sure if that changes anything.

Going through your list:

  1. Co-located controller manager / scheduler which have the lease and are not going through the load balancer

Isn't this a common pattern for HA masters? I've never seen an architecture where the controller-manager and scheduler pods for each master were not co-located on the same host.
Maybe I'm misunderstanding. Regardless, are you aware of anything I can check to see if this is the case?

  2. Rolling restart of apiservers leaves most clients connected to a single apiserver.

Could this happen by patching your master hosts one by one and rebooting them in a rolling fashion, causing clients to switch apiservers again and again until they coalesce on one master? If that is the case, how would you recommend avoiding that issue? Reboot simultaneously??

  3. Clients are using HTTP2 and route their many requests over a single connection, exacerbating 2

I'm generally unfamiliar with HTTP2 and Kubernetes clients' use of it internally, so I'm unsure if this is a culprit in 1.12.

I think in 1.20 it should be safe to turn back on the probabilistic GOAWAY, which will fix 2 / 3 over time.

1.12, womp womp.

To fix 1 you also have to send those components' traffic through the load balancer.

How is this achieved? Everything I've found points to the internal kubernetes service being used to communicate back to the control plane, but I can't change the endpoints without the reconciler setting them back.

Truly appreciate your time and willingness to provide insight.

Daniel Smith

Dec 14, 2020, 4:33:36 PM

I don't know that we've specifically done anything about this issue since 1.12, but I really recommend you upgrade before spending much more time on this!

Co-located controller manager / scheduler which have the lease and are not going through the load balancer

I think you're talking about the lease? How would one prevent them from landing on the same master? Or how could you split them once the leases are co-located?

The lease is one aspect; there's no built-in prevention (unless you count resource limits causing one of the components to OOM if both leases end up on the same node). Just delete one of the leases and wait.
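
(A hedged example of "delete one of the leases": in releases of this vintage the leader-election record is usually an Endpoints object in kube-system rather than a Lease, so forcing a re-election looks roughly like this; the object names may differ on your distro:)

    # Force the controller-manager (or scheduler) leader election to re-run; the new
    # leader may come up on a different master. Assumes endpoints-based leader election.
    kubectl -n kube-system delete endpoints kube-controller-manager
    kubectl -n kube-system delete endpoints kube-scheduler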

A bigger deal is that these components are likely connecting directly to apiserver rather than through the load balancer. See below.

Rolling restart of apiservers leaves most clients connected to a single apiserver.

Could this happen by patching your master hosts one by one and rebooting them in a rolling fashion, causing clients to switch apiservers again and again until they coalesce on one master?

Right.

If that is the case, how would you recommend avoiding that issue? Reboot simultaneously??

No, then you have a (short) outage... There's really not a good fix for this. The probabilistic GOAWAY is the best we've got. Or will have, anyway. If this is really the problem (which I doubt) you should be able to even out the load a bit by restarting the most heavily loaded apiserver. That only helps if that apiserver is taking more than 50% of the load. But I doubt this is your problem.
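
(If you want to confirm whether one apiserver really is carrying most of the connections before restarting it, a rough check on each master, assuming 6443 is the secure port:)

    # Count established connections to the local apiserver; compare the number across masters.
    ss -tn state established '( sport = :6443 )' | wc -l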

To fix 1 you also have to send those components' traffic through the load balancer.

How is this achieved? Everything I've found points to the internal kubernetes service being used to communicate back to the control plane,

I'd be surprised if your controller manager / scheduler are connecting that way. Take a look at their flags and/or kubeconfig files. This is probably your biggest contributor.
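
(A sketch of what to look for; the file paths and the load balancer address below are placeholders, not your actual setup:)

    # See which server the controller manager and scheduler kubeconfigs point at
    grep 'server:' /etc/kubernetes/controller-manager.conf /etc/kubernetes/scheduler.conf

    # If they name a single master directly, point them at the load balancer instead, e.g.
    #   server: https://<load-balancer-address>:6443
    # and restart the components.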

In a really extreme case you could run a couple controller managers with different leases and sets of controllers. But this isn't a good experiment to do if prod is on fire, and I think you should spend your time upgrading... :)

Daniel Strivelli

Dec 14, 2020, 4:40:00 PM

Thanks again @lavalamp, total boss.

Daniel Smith

Dec 14, 2020, 4:45:27 PM

Good luck with your upcoming 8 consecutive upgrades! You'll be the boss after that...
