Descheduler Question


cs

Jan 25, 2021, 8:36:37 AM
to kubernetes-sig-scheduling
Hi,

I'm struggling a bit with, for example, the following setup:

I have a 3-node Kubernetes cluster and configured a deployment with, for example, 6 replicas. After the first start, all 3 nodes hold 2 pods each, which is great - I tried it with podAntiAffinity, topologySpreadConstraints and/or nodeAffinity.
If I now shut down one node, each of the still-online nodes gets another pod - works as intended.
But when the previously shut-down node comes back online, I want the pod distribution to go back to 2 pods on each node. All I achieved, however, was that the node that came back online got either 1 or 0 pods.

I must be missing something, can anybody help please?

Br,
cs

Mike Dame

Jan 25, 2021, 8:44:28 AM
to cs, kubernetes-sig-scheduling
Hi,
It sounds like you are seeing an issue with the topologySpreadConstraint strategy, is that correct? This is my guess because PodAffinity and NodeAffinity wouldn't necessarily deschedule any pods when a new node comes online.

Could you share the deployment yaml for these pods, so we can see how your topology spread constraints are configured? The node yamls would also be helpful.

Thanks!



--

Mike Dame

Sr. Software Engineer, OpenShift

Red Hat Westford, MA

GitHub: @damemi   

cs

Jan 25, 2021, 10:29:56 AM
to kubernetes-sig-scheduling
Hi,

thanks for the help.
Probably. This is the YAML file of the deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: logstash
  namespace: test-elk
  labels:
    app: logstash
spec:
  replicas: 6
  selector:
    matchLabels:
      app: logstash
  template:
    metadata:
      annotations:
        co.elastic.logs/module: logstash
      labels:
        app: logstash
        stackmonitoring: logstash
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - my-first-node
                - my-second-node
                - my-third-node
      # these tolerations tell Kubernetes to evict pods if a node is unreachable or not ready for about 10 seconds
      # (by default this is 5 minutes)
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: 10
      topologySpreadConstraints:
        # with 6 replicas and 3 nodes, the following constraint should schedule 2 pods
        # on each node - if a node goes down, its 2 pods should be moved to the
        # 2 nodes that are still available
        # try not to schedule more than 2 per node
        - maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: logstash
        # but do not schedule more than 3 on one node - e.g. with 7 replicas, one pod will stay Pending
        - maxSkew: 3
          # 'node' is just a node label: on the first node it's node1, on the second node2, ...
          # it did not work with kubernetes.io/hostname - I saw e.g. 4 pods on one node
          topologyKey: node
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: logstash
      containers:
      - image: docker.elastic.co/logstash/logstash:7.10.1
        name: logstash
        ports:
        - containerPort: 9600
          name: https
          protocol: TCP
        livenessProbe:
          httpGet:
            path: /
            port: 9600
          initialDelaySeconds: 90
          periodSeconds: 5
        readinessProbe:
          httpGet:
            path: /
            port: 9600
          initialDelaySeconds: 90
          periodSeconds: 5



And I installed descheduler via helm-chart with following values.yaml

[...default values above...]
deschedulerPolicy:
  strategies:
    RemoveDuplicates:
      enabled: false
    RemovePodsViolatingNodeTaints:
      enabled: false
    RemovePodsViolatingNodeAffinity:
      enabled: true
      params:
        nodeAffinityType:
        - requiredDuringSchedulingIgnoredDuringExecution
    RemovePodsViolatingInterPodAntiAffinity:
      enabled: false
    LowNodeUtilization:
      enabled: false
      params:
        nodeResourceUtilizationThresholds:
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
    RemovePodsViolatingTopologySpreadConstraint:
      enabled: true
      params:
        # covers both hard and soft constraints - hard means 'whenUnsatisfiable: DoNotSchedule',
        # soft means 'whenUnsatisfiable: ScheduleAnyway'
        includeSoftConstraints: true
[...default values below...]

The plan behind the configured topologySpreadConstraints is explained in the comments. I thought the descheduler would reschedule those pods to balance them again - but since the second constraint tells Kubernetes that 3 pods per node are OK too, it does nothing.

And I know that I could use just the first topologySpreadConstraint and set it to DoNotSchedule - but then if a node goes down, the 2 pods stay Pending as long as the node is not back online. Why is it so hard and complicated to balance a few deployment pods evenly over the available nodes?

thanks for any help in advance!


br,
clem

Mike Dame

Jan 25, 2021, 2:16:54 PM
to cs, kubernetes-sig-scheduling
Hi,
Thanks for sharing the yaml. That is very helpful!

As I read it, you are using MaxSkew to set the maximum number of pods per node, but that is not what it does. MaxSkew says that no two topology domains may have pod counts that differ by more than that value.

So, say you have 6 pods on 3 nodes (sizes 2,2,2) and one node goes down (new sizes 3,3). When that node comes back up, the sizes are (3,3,0), which is still within your MaxSkew of 3.
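
To put numbers on it, the skew for a constraint is just the difference between the most- and least-populated topology domains:

  skew = max(3, 3, 0) - min(3, 3, 0) = 3 - 0 = 3  ->  satisfies maxSkew: 3, so no violation
  with maxSkew: 1, the same (3,3,0) state gives skew 3 > 1  ->  a violation that the descheduler's topology spread strategy would act on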

If you set the first constraint to DoNotSchedule, then when a node goes down those extra 2 pods should still be scheduled, because with 3 pods on each node they are still within the MaxSkew of 2. If the pods are stuck in Pending, it may be that the other 2 nodes are being filtered out for some reason (though I don't see anything in the pod YAML that would indicate that).

Does that help answer your question?

cs

Jan 26, 2021, 2:24:07 AM
to kubernetes-sig-scheduling
Hi,

thank you very much for the clarification. Hmm, after reading the topologySpreadConstraints documentation the first time, I thought it should work with maxSkew 2 as well - but it didn't, which is why I came up with the initial config with the second constraint using maxSkew 3.

I now tried it again, with no affinities and just:

      topologySpreadConstraints:

        - maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: logstash


As soon as I stop Docker on one node, reboot it, or shut it down, the two pods that were running on that node switch to "Terminating" and 2 new pods pop up as Pending (the ready state is 0/1 after a few seconds in the Terminating state):

NAMESPACE  NAME                                                 READY  UP-TO-DATE   AVAILABLE  AGE
test-elk   deployment.apps/logstash                             4/6    6            4          12m
NAMESPACE  NAME                                                 READY  STATUS       RESTARTS   AGE
test-elk   pod/logstash-849fc74bf5-2jfpx                        0/1    Pending      0          6s
test-elk   pod/logstash-849fc74bf5-5jf5d                        1/1    Running      0          12m
test-elk   pod/logstash-849fc74bf5-cd6nv                        1/1    Running      0          12m
test-elk   pod/logstash-849fc74bf5-jnqks                        1/1    Terminating  0          4m36s      < pod where node is unreachable/not ready
test-elk   pod/logstash-849fc74bf5-m5mnz                        0/1    Pending      0          6s
test-elk   pod/logstash-849fc74bf5-pv8tp                        1/1    Running      1          12m
test-elk   pod/logstash-849fc74bf5-tpzjn                        1/1    Running      1          12m
test-elk   pod/logstash-849fc74bf5-wv92s                        1/1    Terminating  0          4m36s      < pod where node is unreachable/not ready


If I try with maxSkew 3 - yay, the pods are started on the other nodes, but when the downed node comes back, they stay on those nodes.

I'm banging my head, pulling my hair and I think I'm just too stupid for that feature :D

Btw, Kubernetes is version 1.18.12 - could that be the culprit?

Mike Dame

Jan 26, 2021, 9:23:15 AM
to cs, kubernetes-sig-scheduling
Hi,
Thanks for the update. I'm sorry for the confusion but I think this is making some progress.

It's correct that with MaxSkew=3 the descheduler will leave the new pods on their nodes even when the downed node comes back (because a difference of 3-0 is within the acceptable skew).

However, I am not sure why your new pods are stuck in Pending. Are the old pods in Terminating for a long time (longer than a few seconds)? And do the new ones stay in Pending even after the old pods finally terminate?

If you can provide the `kubectl describe` output for those Pending pods after the old ones finally terminate, that could help us debug further. Your TopologySpread config looks fine to me now, so I am thinking the Pending issue is something else.

cs

Jan 27, 2021, 2:44:25 AM
to kubernetes-sig-scheduling
Hi,

thank you so much for your patience with me.

* started the deployment with 6 replicas and maxSkew 2 - pods evenly distributed on the nodes - 2, 2, 2
* stopped Docker on the last node - expecting the pods from that node to be started on the other nodes - 2, 2, 0 (node not-ready)

but the pods that ran on the last node are stuck in "Terminating" - here is the kubectl describe output for one of those pods

(obfuscated some stuff)  
 
Name:                      logstash-849fc74bf5-ks7wd
Namespace:                 test-elk
Priority:                  0
Node:                      node-3/10.12.14.70
Start Time:                Wed, 27 Jan 2021 08:01:39 +0100
Labels:                    app=logstash
                           pod-template-hash=849fc74bf5
                           stackmonitoring=logstash
Annotations:               cni.projectcalico.org/podIP:
                           cni.projectcalico.org/podIPs:
                           co.elastic.logs/module: logstash
Status:                    Terminating (lasts 10m)
Termination Grace Period:  30s
IP:                        192.168.193.88
IPs:
  IP:           192.168.193.88
Controlled By:  ReplicaSet/logstash-849fc74bf5
Containers:
  logstash:
    Container ID:   docker://5e6cda23b5034997430339d4f9706f1b752d42b5a9138e50bb0c0f2539e52714
    Image:          docker.elastic.co/logstash/logstash:7.10.1
    Image ID:       docker-pullable://docker.elastic.co/logstash/logstash@sha256:a9ac93266b783eb26b7ebf1d4635f3f0a8ab094f12ef7faf138f8d7d248aa339
    Port:           9600/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 27 Jan 2021 08:01:46 +0100
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://:9600/ delay=90s timeout=1s period=5s #success=1 #failure=3
    Readiness:      http-get http://:9600/ delay=90s timeout=1s period=5s #success=1 #failure=3
    Environment:
        [...]
    Mounts:
      /etc/logstash/certificates/test-elk-es-http-certs-public.crt from cert-ca (ro,path="tls.crt")
      /etc/logstash/patterns/containerlog_patterns from config-volume (ro,path="containerlog_patterns")
      /etc/logstash/patterns/syslog_patterns from config-volume (ro,path="syslog_patterns")
      /usr/share/logstash/config/logstash.yml from config-volume (ro,path="logstash.yml")
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-m97dk (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      logstash-configmap
    Optional:  false
  cert-ca:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  test-elk-es-http-certs-public
    Optional:    false
  default-token-m97dk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-m97dk
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 10s
                 node.kubernetes.io/unreachable:NoExecute for 10s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  25m                   default-scheduler  0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 2 node(s) didn't match pod topology spread constraints.
  Warning  FailedScheduling  25m                   default-scheduler  0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 2 node(s) didn't match pod topology spread constraints.
  Normal   Scheduled         23m                   default-scheduler  Successfully assigned test-elk/logstash-849fc74bf5-ks7wd to node-3
  Normal   Pulled            22m                   kubelet            Container image "docker.elastic.co/logstash/logstash:7.10.1" already present on machine
  Normal   Created           22m                   kubelet            Created container logstash
  Normal   Started           22m                   kubelet            Started container logstash
  Warning  DNSConfigForming  11m (x13 over 23m)    kubelet            Search Line limits were exceeded, some search paths have been omitted, the applied search line is: test-elk.svc.cluster.local svc.cluster.local cluster.local [...]
  Normal   SandboxChanged    11m                   kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Killing           11m                   kubelet            Stopping container logstash
  Warning  Unhealthy         10m (x3 over 11m)     kubelet            Liveness probe failed: Get http://192.168.193.88:9600/: dial tcp 192.168.193.88:9600: connect: invalid argument
  Warning  Unhealthy         2m59s (x99 over 11m)  kubelet            Readiness probe failed: Get http://192.168.193.88:9600/: dial tcp 192.168.193.88:9600: connect: invalid argument


and here is a kubectl describe of a Pending pod

Name:           logstash-849fc74bf5-dwh6k
Namespace:      test-elk
Priority:       0
Node:           <none>
Labels:         app=logstash
                pod-template-hash=849fc74bf5
                stackmonitoring=logstash
Annotations:    co.elastic.logs/module: logstash
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/logstash-849fc74bf5
Containers:
  logstash:
    Image:      docker.elastic.co/logstash/logstash:7.10.1
    Port:       9600/TCP
    Host Port:  0/TCP
    Liveness:   http-get http://:9600/ delay=90s timeout=1s period=5s #success=1 #failure=3
    Readiness:  http-get http://:9600/ delay=90s timeout=1s period=5s #success=1 #failure=3
    Environment:
      [...]
    Mounts:
      /etc/logstash/certificates/test-elk-es-http-certs-public.crt from cert-ca (ro,path="tls.crt")
      /etc/logstash/patterns/containerlog_patterns from config-volume (ro,path="containerlog_patterns")
      /etc/logstash/patterns/syslog_patterns from config-volume (ro,path="syslog_patterns")
      /usr/share/logstash/config/logstash.yml from config-volume (ro,path="logstash.yml")
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-m97dk (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      logstash-configmap
    Optional:  false
  cert-ca:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  test-elk-es-http-certs-public
    Optional:    false
  default-token-m97dk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-m97dk
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 10s
                 node.kubernetes.io/unreachable:NoExecute for 10s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  21m   default-scheduler  0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 2 node(s) didn't match pod topology spread constraints.
  Warning  FailedScheduling  21m   default-scheduler  0/4 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 2 node(s) didn't match pod topology spread constraints.



I also force-deleted one of the "Terminating" pods, but still none of the Pending pods started.

What am I missing? :(

cs

Jan 27, 2021, 4:00:22 AM
to kubernetes-sig-scheduling
Oh dear - I think I found something *facepalm*

I have a cluster with 3 worker nodes and 1 master.

As I used this topologySpreadConstraint:


        topologySpreadConstraints:
        - maxSkew: 2
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: logstash


the master node was also taken into account. I never thought about that and assumed that, since pods cannot be scheduled on the master anyway, it would not be counted :|

I now changed the topologyKey to node, where each real worker node has a label (node=node1, node=node2, ...).
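
For reference, I applied those labels with something like this (node names taken from my earlier nodeAffinity, so adjust to your own):

  kubectl label node my-first-node node=node1
  kubectl label node my-second-node node=node2
  kubectl label node my-third-node node=node3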

After deleting the old config and applying the new one, I now got a pod distribution of 2, 1, 3 - not perfect, but OK-ish. The "monk" in me would like to see 2, 2, 2 from the start.

After I stopped Docker on the last node again, the new distribution was 2, 2, 0.

Then I started Docker again and got 2, 2, 2.

If I now stop Docker on the last node once again, it's 2, 2, 0 again - the descheduler says it is balanced, so it does nothing.

Wei Huang

Jan 27, 2021, 12:58:51 PM
to kubernetes-sig-scheduling
Hi Clemens,

You're right. Tainted nodes are also included in the "searching scope" of PodTopologySpread - no matter what kind of taint it is.

It's documented in the official docs: https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/ (the 2nd bullet of the #known-limitations section).
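
If you want to keep kubernetes.io/hostname as the topologyKey, one way to keep the master out of the searching scope is to pair the constraint with a nodeAffinity that filters on the master's label, since nodes filtered out by nodeSelector/nodeAffinity are skipped in the skew calculation. A minimal sketch, assuming a kubeadm-style node-role.kubernetes.io/master label (check your master's actual labels):

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # keeps the master out of both scheduling and the spread calculation
              # (the label key is an assumption - adjust to your cluster)
              - key: node-role.kubernetes.io/master
                operator: DoesNotExist
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: logstash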

Mike Dame

Jan 27, 2021, 1:17:07 PM
to Wei Huang, kubernetes-sig-scheduling
So, that is why the 2 pods are stuck in Pending -- the node isn't actually "down" from the scheduler's point of view, it just fails the taint/toleration checks. And because the other nodes have 2 pods on them already, no more can be scheduled without violating MaxSkew=2.

Would going back to `ScheduleAnyway`, along with MaxSkew=2 and topologyKey=node, solve this? Then the pods should spread evenly when the node is available, and still spread across the other 2 nodes when it isn't (does the scheduler still attempt to spread evenly between the remaining nodes when the skew is unsatisfiable?).

Regarding the (2,1,3) distribution, a MaxSkew of 1 should keep them at (2,2,2).
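
Just to spell that out, a minimal sketch of the constraint I mean (using your worker-only node label as the topologyKey) would be:

      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: node
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: logstash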

Wei Huang

Jan 27, 2021, 1:53:34 PM
to Mike Dame, kubernetes-sig-scheduling
ScheduleAnyway may help in this particular case (2,1,3), but it may not fit every distribution case, as it's best-effort.

Actually, when I raised https://github.com/kubernetes/kubernetes/issues/80921, a rough idea was to add a third mode (in addition to DoNotSchedule and ScheduleAnyway) to exclude nodes with system-applied taints (or a plugin argument to exclude the system-applied taints).

Regards,
--------------
Wei Huang
hwe...@gmail.com

cs

Jan 28, 2021, 1:54:15 AM
to kubernetes-sig-scheduling
Hi,

yeah, I must have missed the taint thing :| - sorry for all the confusion *shame, shame, shame - ding ding*.

Well, I think I will stick with maxSkew 1 and 'DoNotSchedule'. It's just unfortunate that it's not easily possible in Kubernetes to temporarily move pods from a failing node to another node and automatically move them back, with an even spread, when the node is ready again.

So why all the hassle? In our production ECK/ELK environment our Logstash instances will have a lot to do, and I fear that if too many pods are missing for a while, our buffer (a RabbitMQ cluster on other machines) will queue up too many messages in worst-case scenarios. And since those Logstash pods will use quite a lot of resources, not all "missing" pods should be moved to the other nodes.
It would be nice if Kubernetes let you set a minimum and maximum number of pods per available node for a deployment for cases like this. Yes, I know - autoscaling may be interesting for that - I'll have a look at it later.
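
If I do look at autoscaling, a minimal sketch of an HPA for this deployment could look like the following (this assumes metrics-server is installed and that the logstash container gets CPU requests set - right now the pods are BestEffort, so CPU utilization can't be computed; the min/max values are just placeholders):

apiVersion: autoscaling/v2beta2   # what Kubernetes 1.18 serves
kind: HorizontalPodAutoscaler
metadata:
  name: logstash
  namespace: test-elk
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: logstash
  minReplicas: 4      # placeholder values
  maxReplicas: 9
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70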

Maybe I'll just solve it by writing a small script that checks the state and "rebalances" things via a simple cron job. At least I think I now understand how topologySpreadConstraints really work, and we will definitely use this feature for other deployments too.
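
In case anyone else wants to go the cron route, here is a rough sketch of what I have in mind - the ServiceAccount name, image, and schedule are placeholders, and it assumes RBAC that lets the job patch the deployment in test-elk:

apiVersion: batch/v1beta1          # CronJob is still batch/v1beta1 on 1.18
kind: CronJob
metadata:
  name: logstash-rebalance
  namespace: test-elk
spec:
  schedule: "0 * * * *"            # hourly - pick whatever interval fits
  jobTemplate:
    spec:
      template:
        spec:
          # placeholder ServiceAccount - needs permission to patch deployments
          serviceAccountName: deployment-restarter
          restartPolicy: OnFailure
          containers:
          - name: kubectl
            # any image that ships kubectl works; this one is an assumption
            image: bitnami/kubectl:1.18
            # a rolling restart lets the scheduler spread the pods again
            # across whatever nodes are currently available
            command: ["kubectl", "rollout", "restart", "deployment/logstash", "-n", "test-elk"]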

Thanks again for all the help and patience!