Pod anti-affinity feature question, and how to implement failure domains in on-premise enterprise deployments


krma...@gmail.com

Jan 30, 2017, 1:52:51 AM
to Kubernetes developer/contributor discussion
Hi All
I am trying to understand the pod anti-affinity feature, mostly as a way to implement DaemonSet behavior using Deployments so I can take advantage of rolling updates in the short term, until DaemonSet updates land.


Here is what I would like to express:
- Deploy only one instance of Pod1 on any node with label X.
- When doing a rolling update, make the pods in only one failure domain unavailable at a time. (I know zones are mentioned for GCE, but how does one implement this on bare-metal, on-premise deployments?)


The way I am thinking of implementing it is the following (a rough sketch of the anti-affinity part follows the list):

- Label all nodes with the label failure-domain.beta.kubernetes.io/zone, depending on the rack they are in. (Would this work on bare-metal deployments to spread the pods across failure domains and only bring down one failure domain at a time?)
- Create a Deployment with a replica count equal to the number of nodes carrying a specific label, say X (not a zone-related label).
- Label the pods of the Deployment with k1=v1.
- Add a pod anti-affinity rule saying: run on a node with label X, but only if no pod with label k1=v1 is already running there. This means that if a node already has a pod like the one about to be scheduled, don't schedule another one on it.
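Here is a rough sketch of what I mean by that last rule, assuming the affinity field syntax (on 1.5 the same thing would have to go through the alpha scheduler annotation instead); the names, image, and replica count are illustrative:

apiVersion: extensions/v1beta1    # Deployment API group of the 1.5/1.6 era; adjust for newer clusters
kind: Deployment
metadata:
  name: pod1                      # illustrative name
spec:
  replicas: 5                     # kept equal (by hand) to the number of nodes carrying label X
  template:
    metadata:
      labels:
        k1: v1
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k1
                operator: In
                values:
                - v1
            topologyKey: kubernetes.io/hostname   # at most one k1=v1 pod per node
      containers:
      - name: pod1
        image: my-image:latest    # illustrative image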


Would this work? The only other problem I see is that in my case, if I add a new node or a node gets stuck, I need to adjust the Deployment's replica count to make sure the rollout succeeds.


I agree this would be a hack, but I need it urgently until DaemonSet supports upgrades.
Let me know whether this way of implementing DaemonSet-like behavior would work.

-Mayank




David Oppenheimer

Jan 30, 2017, 3:41:07 AM
to krma...@gmail.com, Kubernetes developer/contributor discussion
I assume "deploy only one instance of Pod1 on any node with label X" means (1) constrain the pods to only run on nodes with label X and (2) run only one Pod1 pod on each such node. In that case:
(1) you can get "constrain the pods to only run on nodes with label X" using requiredDuringSchedulingIgnoredDuringExecution node affinity (see the "Node affinity" section here)
(2) you can get "run at most one Pod1 pod per node" using (1) plus requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity with a TopologyKey of kubernetes.io/hostname and key=k1, operator=In, values=v1 (see the "Inter-pod affinity/anti-affinity" section here)

For spreading across failure domains, yes, if you're not running on a cloud provider that populates the failure-domain.beta.kubernetes.io/zone label, then you can manually populate it yourself with whatever value you want (rack is fine) and the scheduler will automatically spread based on it, on a best-effort basis. If you want to require spreading one pod per rack instead of doing it on a best-effort basis, then you need to explicitly set requiredDuringSchedulingIgnoredDuringExecution pod anti-affinity in your pod template, using failure-domain.beta.kubernetes.io/zone as the TopologyKey and key=k1, operator=In, values=v1. Of course, since presumably no node is in more than one rack, if you do this you presumably don't need (2) above.
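For illustration, the "require one per rack" variant might look like this in the pod template (assuming the 1.6+ affinity field syntax, and using x=true as a stand-in for whatever "label X" really is):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: x                  # placeholder for "label X"
          operator: In
          values:
          - "true"
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: k1
          operator: In
          values:
          - v1
      topologyKey: failure-domain.beta.kubernetes.io/zone   # at most one k1=v1 pod per rack/zone
      # swap in kubernetes.io/hostname here if you want the "at most one per node" rule from (2) instead

On bare metal the zone label can be applied by hand, e.g. kubectl label nodes <node-name> failure-domain.beta.kubernetes.io/zone=rack-1 (with rack-1 being whatever rack identifier you use).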





krma...@gmail.com

Feb 6, 2017, 4:23:11 PM
to Kubernetes developer/contributor discussion, krma...@gmail.com
Thanks David, I will give this a try.

krma...@gmail.com

Feb 13, 2017, 12:01:59 AM
to Kubernetes developer/contributor discussion, krma...@gmail.com
One more question, David/Kargakis.
Nowhere is it documented that a Kubernetes Deployment rolling update will only bring down (make unavailable) one node per failure zone at a time, as determined by the label failure-domain.beta.kubernetes.io/zone. Can someone confirm this behavior? Will the rolling update for DaemonSets behave similarly?

-Mayank

David Oppenheimer

Feb 13, 2017, 12:15:16 AM
to krma...@gmail.com, Kubernetes developer/contributor discussion
I would suggest making that feature request in the corresponding GitHub issue if you think it is important.



Michail Kargakis

Feb 13, 2017, 4:03:48 AM
to krma...@gmail.com, Kubernetes developer/contributor discussion
There is no such guarantee offered by Deployments, AFAIK.


Brendan Burns

Feb 14, 2017, 12:25:36 AM
to Kubernetes developer/contributor discussion, krma...@gmail.com
Currently, if you are concerned about this, the best thing is probably to set maxUnavailable to zero and maxSurge > 0. That will ensure that you always have sufficient replicas across failure domains (at the cost of running more replicas than you need while the rollout is proceeding).
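A minimal sketch of that update strategy on the Deployment (the surge value of 1 is just an example):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 0   # never take an old replica down before its replacement is ready
    maxSurge: 1         # allow one replica above the desired count while the rollout proceeds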

--brendan

krma...@gmail.com

Feb 14, 2017, 4:15:13 AM
to Kubernetes developer/contributor discussion, krma...@gmail.com

Thanks, all of you. I have opened the following issue: https://github.com/kubernetes/kubernetes/issues/41394, and I am also happy to drive and implement it if there is enough agreement and interest.

Brendan, no, that doesn't meet our requirements. Azure, when I last used it, had the concept of failure domains and update domains; I would like to see something similar in rolling updates.

krma...@gmail.com

Mar 21, 2017, 3:11:54 AM
to Kubernetes developer/contributor discussion, krma...@gmail.com
Hi David
Reopening this thread with more questions, and thanks again for all the previous answers.

In your example above using pod anti-affinity, how would we make sure that the spreading of pods allows more than one pod per node?

So let's say I create 3 failure zones, each zone with a single host. I want all pods to be evenly distributed across these three zones. So if I have 6 replicas, I want two pods per host/zone. How would we model this using pod anti-affinity? I understand that in this case we can set the topologyKey to the zone, but what would the pod anti-affinity labels be?

-Mayank

David Oppenheimer

Mar 21, 2017, 3:26:45 AM
to krma...@gmail.com, Kubernetes developer/contributor discussion
You can use preferredDuringSchedulingIgnoredDuringExecution anti-affinity. The label selector should match the pods in the collection you are trying to spread (e.g. give each of the pods a label like "foo=bar" and then use key=foo, operator=In, values=bar as the selector). Use the node label for zone as the topologyKey. The scheduler will try to spread the pods of the collection across zones. (It should also try to spread across nodes due to the built-in priority function that does that, but you can in addition add your own anti-affinity like the above, with the node label for node name as the topologyKey.)
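As a sketch, the pod template for that might carry something like the following (the foo=bar label, the weight of 100, and the 1.6+ field syntax are assumptions):

metadata:
  labels:
    foo: bar
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100                 # 1-100; higher means a stronger preference
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: foo
              operator: In
              values:
              - bar
          topologyKey: failure-domain.beta.kubernetes.io/zone   # prefer spreading across zones
      # optionally add a second weighted term with topologyKey: kubernetes.io/hostname
      # to also prefer spreading across nodes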



krma...@gmail.com

Mar 21, 2017, 4:29:07 AM
to Kubernetes developer/contributor discussion, krma...@gmail.com
Thanks David.
The documentation at https://kubernetes.io/docs/user-guide/node-selection/ says the following:

"The rules are of the form “this pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y.” Y is expressed as a LabelSelector with an associated list of namespaces (or “all” namespaces)."

So I interpreted the pod anti-affinity rule as "this pod should not run in X (failure zone) if that zone is already running one or more pods that meet rule Y (the label selector for the pods)", which is why I thought it doesn't allow running more than one pod of the same type per node. I guess the way you describe it above is more general than the description in the documentation, which should be updated :-).

I will give this a try. Let me know whether my understanding is right and the documentation is indeed wrong.


-Mayank

David Oppenheimer

Mar 21, 2017, 4:41:24 AM
to krma...@gmail.com, Kubernetes developer/contributor discussion
On Tue, Mar 21, 2017 at 1:29 AM, <krma...@gmail.com> wrote:
Thanks David.
The documentation at https://kubernetes.io/docs/user-guide/node-selection/ says the following:

"The rules are of the form “this pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y.” Y is expressed as a LabelSelector with an associated list of namespaces (or “all” namespaces)."

So I interpreted the pod anti-affinity rule as "this pod should not run in X (failure zone) if that zone is already running one or more pods that meet rule Y (the label selector for the pods)", which is why I thought it doesn't allow running more than one pod of the same type per node. I guess the way you describe it above is more general than the description in the documentation, which should be updated :-).

Yeah, what you quoted applies to requiredDuringSchedulingIgnoredDuringExecution. For preferredDuringSchedulingIgnoredDuringExecution, just replace "not run" with "try not to run." The subsequent paragraph tries to explain that, but I agree it's not clear.

It would be great if you could send a PR to make the documentation clearer.

BTW, we have a feature in progress that will allow you to specify a number with requiredDuringSchedulingIgnoredDuringExecution anti-affinity, rather than it always being 1.

 

krma...@gmail.com

Mar 24, 2017, 12:23:02 AM
to Kubernetes developer/contributor discussion, krma...@gmail.com

Yes, I will definitely make a PR once I understand it well enough.

More questions :-)

- Thanks for the PR. That's awesome that we have a PR that will allow specifying a number with requiredDuringSchedulingIgnoredDuringExecution anti-affinity, rather than it always being 1. One thing not clear to me is: are you saying that PR is specifically for requiredDuringSchedulingIgnoredDuringExecution and not for preferredDuringSchedulingIgnoredDuringExecution? Why can't we specify a number per zone for pod anti-affinity with preferredDuringSchedulingIgnoredDuringExecution?

- Is the default out-of-the-box scheduling using the failure-domain.beta.kubernetes.io labels supposed to work as expected in 1.5.3 as well?

In my test on a GCE cluster with 5 nodes and one master, I made the master schedulable as well. Then I added 3 fault zones with two nodes each. When I scale a Deployment from 6 to 9, I see that of the 3 new replicas, 2 go to the same zone.

When I do the same test without the master, with only the minions, each set of 3 replicas goes to a different zone. Is there anything special about the master that changes the logic? I am looking at the code in selector_spreading.go.

-Mayank

David Oppenheimer

Mar 24, 2017, 2:21:24 AM
to krma...@gmail.com, Kubernetes developer/contributor discussion
On Thu, Mar 23, 2017 at 9:23 PM, <krma...@gmail.com> wrote:

Yes, I will definitely make a PR once I understand it well enough.

More questions :-)
- Thanks for the PR. That's awesome that we have a PR that will allow specifying a number with requiredDuringSchedulingIgnoredDuringExecution anti-affinity, rather than it always being 1. One thing not clear to me is: are you saying that PR is specifically for requiredDuringSchedulingIgnoredDuringExecution and not for preferredDuringSchedulingIgnoredDuringExecution?


Yes.

Why can't we specify a number per zone for pod anti-affinity with preferredDuringSchedulingIgnoredDuringExecution?

Can you explain your use case for that? The expected scenario is that someone has N failure domains and M > N pods, so saying "try to put X pods per failure domain" is the same as saying "spread evenly across all failure domains". So it's not clear why you would need a number.

- Is the default out-of-the-box scheduling using the failure-domain.beta.kubernetes.io labels supposed to work as expected in 1.5.3 as well?

Out-of-the-box it will spread across hosts and zones. See CalculateSpreadPriority() in the file you mentioned.

In my test on a GCE cluster with 5 nodes and one master, I made the master schedulable as well. Then I added 3 fault zones with two nodes each. When I scale a Deployment from 6 to 9, I see that of the 3 new replicas, 2 go to the same zone.

When I do the same test without the master, with only the minions, each set of 3 replicas goes to a different zone. Is there anything special about the master that changes the logic? I am looking at the code in selector_spreading.go.

Kubernetes decides where to schedule a pod based on a variety of factors -- namely all of the default priority functions (assuming you haven't changed the defaults). There are a number of others besides spreading across failure domains. For example, trying to balance the resource consumption on each node. The master already has a lot of pods on it, so Kubernetes will initially try to avoid the master until the other nodes are more full.