Proposal for acknowledging terminating pods in rolling update as an optional behaviour


Filip Krepinsky

Jan 19, 2022, 11:41:48 AM
to kubernete...@googlegroups.com
Hi,

In some cases people are surprised that their deployment can momentarily have more pods during a rollout than specified (replicas - maxUnavailable <= x <= replicas + maxSurge). The culprits are Terminating pods, which can keep running in addition to the Running and starting pods; for example, with replicas=10, maxSurge=2 and maxUnavailable=1 you would expect between 9 and 12 pods, but Terminating pods can push the actual total above 12.
Even though Terminating pods are not considered part of a deployment, this can cause problems with resource usage and scheduling:

1. Unnecessary autoscaling of nodes in tight environments, driving up cloud costs. This hurts especially if
 - you roll out multiple deployments at the same time, or
 - you have generous termination periods and your pods take a long time to shut down (example here: https://github.com/kubernetes/kubernetes/issues/95498#issuecomment-814048997)
 
relevant issues:
- https://github.com/kubernetes/kubernetes/issues/95498
- https://github.com/kubernetes/kubernetes/issues/99513
- https://github.com/kubernetes/kubernetes/issues/41596
- https://github.com/kubernetes/kubernetes/issues/97227
 
2. A problem also arises in contentious environments where pods are fighting for resources. The exponential backoff for pods that have not started yet can grow to large values and unnecessarily delay their start until they pop from the queue once there are computing resources to run them. This can slow down the rollout considerably.

relevant issue: https://github.com/kubernetes/kubernetes/issues/98656

In that issue the resources were limited by a quota, but the same can happen for other reasons as well. In our use case we noticed that this can also occur in high availability scenarios where pods are expected to run only on certain nodes and pod anti-affinity forbids running two pods on the same node.


For all of these issues it could make sense to wait for a pod to terminate before scheduling a new one. Even though some of them can be partially mitigated by a proper setup of maxUnavailable and maxSurge, that is not applicable to all of them.

I would like to propose a new opt-in behaviour that would solve this. The Deployment controller would include Terminating pods in the computation of currently running replicas when deciding whether the new ReplicaSet should scale up (or the old one, in the case of proportional scaling).
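To make this concrete, here is a minimal sketch of the proposed scale-up check. It is illustrative only; the function name and signature are made up and this is not the actual deployment controller code.

package main

import "fmt"

// canScaleUpNewRS sketches the proposed opt-in check: Terminating pods are
// counted against the replicas + maxSurge budget together with Running and
// starting pods, so a replacement pod is only created once an old pod is
// actually gone.
func canScaleUpNewRS(replicas, maxSurge, running, starting, terminating int32) bool {
	return running+starting+terminating < replicas+maxSurge
}

func main() {
	// replicas=10, maxSurge=2: 11 pods are Running or starting and 3 are
	// still Terminating. The current behaviour would create another pod
	// (11 < 12); with the proposed behaviour the rollout waits (14 >= 12).
	fmt.Println(canScaleUpNewRS(10, 2, 10, 1, 3)) // false
}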

This could be configured, for example, in .spec.strategy.rollingUpdate.scalingPolicy with the possible values (a rough sketch of the API follows below):
1. IgnoreTerminatingPods - the default and current behaviour
2. WaitForTerminatingPods - the new behaviour described above
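For illustration only (this field and these constants do not exist today; the names just mirror the values above), the API addition could look roughly like this:

package appsv1

type ScalingPolicy string

const (
	// IgnoreTerminatingPods keeps the current behaviour: Terminating pods are
	// not counted when deciding whether the new ReplicaSet may scale up.
	IgnoreTerminatingPods ScalingPolicy = "IgnoreTerminatingPods"
	// WaitForTerminatingPods counts Terminating pods against the
	// replicas + maxSurge budget, delaying new pods until old ones are gone.
	WaitForTerminatingPods ScalingPolicy = "WaitForTerminatingPods"
)

// RollingUpdateDeployment is sketched here only to show where the new field
// would live; the real type is in k8s.io/api/apps/v1.
type RollingUpdateDeployment struct {
	// ... existing maxUnavailable and maxSurge fields ...

	// ScalingPolicy selects how Terminating pods are treated during a rollout.
	// Defaults to IgnoreTerminatingPods.
	// +optional
	ScalingPolicy ScalingPolicy `json:"scalingPolicy,omitempty"`
}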

The disadvantage of this feature is a slower rollout in environments that are not resource constrained, so using it would be advisable only for use cases similar to the ones mentioned above.

Please let me know if you would benefit from this feature or if you see any problems associated with it.

Filip

Filip Krepinsky

Feb 2, 2022, 3:07:18 PM
to kubernete...@googlegroups.com
As was discussed in the last sig-apps meeting, I have created an issue to track this in order to obtain more feedback from the community before proceeding further [1].

I have also submitted an update to the documentation to explicitly mention the behaviour of terminating pods during a rollout [2].
