Smart scheduling of host networking pods or daemon sets


Yaroslav Molochko

Oct 23, 2016, 3:05:47 AM
to Kubernetes user discussion and Q&A
Hello, I need some advice on the best solution for my use case. Every compute node has the same role, but the number of processes must differ according to the number of CPU cores. We deal with a large amount of traffic and delays are critical, so we use host networking. So basically I need to create, let's say, a replication controller with n replicas, and I need to place (n * CPU cores) instances of some app per node. And what is most important: I need rolling updates, and it seems DaemonSet doesn't support them.
I mentioned host networking because it means we can place an instance with port 10000 on a node, but we can't place another instance of this app with port 10000 there; it must be some other instance with port 10001 or another port.

So if we say we have a node with 4 CPU cores, we will have:
app1 with configuration to listen on 10000 port
app1 with configuration to listen on 10001 port
app1 with configuration to listen on 10002 port
app1 with configuration to listen on 10003 port
+ several other apps 

I'm okay with developing the missing part; the question is: what is the best place to put this functionality? I see 2 possible options:
 - Scheduler: extra scheduling rules that would prevent applications of the same type from being placed on the same node
 - DaemonSet: extend it with rolling-update functionality

Is there anything I'm missing or should consider?

David Oppenheimer

Oct 23, 2016, 3:14:36 AM
to kubernet...@googlegroups.com
A hacky solution would be the following. Let's assume the largest number of CPUs on any node in your cluster is M, and you have N nodes in the cluster. Then you can create M ReplicaSets, each with replicas=N and each using a different hostPort.

So for example if M=4 and N=3, then you create

ReplicaSet A: {replicas = 3, hostPort = 10000, limit = 1000 milliCpu}
ReplicaSet B: {replicas = 3, hostPort = 10001, limit = 1000 milliCpu}
ReplicaSet C: {replicas = 3, hostPort = 10002, limit = 1000 milliCpu}
ReplicaSet D: {replicas = 3, hostPort = 10003, limit = 1000 milliCpu}

(I'm mixing stuff that's specified in the container/podspec/pod but you get the idea.)
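For concreteness, one of these ReplicaSets might look roughly like this. This is a sketch only: the API version matches k8s of that era, and the names, labels, and image are placeholders.

```yaml
# Sketch of ReplicaSet A — name, labels, and image are placeholders.
apiVersion: extensions/v1beta1
kind: ReplicaSet
metadata:
  name: app1-port-10000
spec:
  replicas: 3                  # N, the number of nodes
  template:
    metadata:
      labels:
        app: app1
        slot: "10000"
    spec:
      hostNetwork: true
      containers:
      - name: app1
        image: example/app1    # placeholder image
        ports:
        - containerPort: 10000
          hostPort: 10000      # one distinct hostPort per ReplicaSet
        resources:
          limits:
            cpu: 1000m         # 1000 milliCpu
```

ReplicaSets B, C, and D would be identical except for the port.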

Some of the pods created by these ReplicaSets will go pending and never schedule, because there aren't enough CPUs for them. But if N and M aren't huge, this is probably fine.

BTW, note that a node with X CPUs can't necessarily run X 1000-milliCpu containers, because system pods take up some resources.



--
You received this message because you are subscribed to the Google Groups "Kubernetes user discussion and Q&A" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-users+unsubscribe@googlegroups.com.
To post to this group, send email to kubernetes-users@googlegroups.com.
Visit this group at https://groups.google.com/group/kubernetes-users.
For more options, visit https://groups.google.com/d/optout.

Yaroslav Molochko

Oct 23, 2016, 3:36:54 AM
to kubernet...@googlegroups.com
Thank you for the suggestion, I’ve thought of it as well, but it doesn’t guarantee that deployment will happen in the order you described, unless you specify each and every node in the nodeSelector for each and every ReplicaSet, which makes the whole idea of k8s meaningless.

For instance, if you run 100 pods over 50 nodes, the scheduler will try to place as many pods as it can, no matter what port each is listening on. Of course pods with the same port will not be able to run on the same node, and there is a possibility that after a couple of liveness probes the scheduler will place a pod somewhere else, but there is no guarantee that the new node won’t have that port occupied already.

It will converge in some time of course, but this means that each new deployment/upgrade will lead to low performance (not all processes handling traffic) and may even lead to downtime. That is why I’m thinking of extending the basic scheduler or DaemonSet functionality with what we need.


David Oppenheimer

Oct 23, 2016, 4:08:51 AM
to kubernet...@googlegroups.com
I may be missing something, but I don't understand why you say it won't work, or why you say you need to use nodeSelector.

First, let me clarify one thing: hostPort is a scheduler predicate. At scheduling time, when the scheduler is considering scheduling pod A onto node B, it rejects node B if any of the hostPorts already used by containers on node B is requested by any container in pod A. This code is here. It is not the case that A will schedule onto node B and then fail its healthcheck; instead, A will never schedule onto B.

So using hostPort will ensure that no more than one pod from each ReplicaSet schedules onto each node. If the number of replicas of the ReplicaSet is greater than or equal to the number of nodes, then exactly one pod should schedule onto each node (some may be pending, of course). If every container requests 1000 milliCpu (and there is one container per pod), then the number of pods per node should equal the number of CPUs (with the caveat I mentioned at the end of my previous email).
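For reference, the hostPort the predicate looks at is the one declared on the container's ports. A sketch of the relevant pod-spec fragment, using the values from the earlier example (image name is a placeholder):

```yaml
# Fragment of a pod spec — the scheduler's port predicate rejects any node
# where this hostPort is already claimed by another pod.
containers:
- name: app1
  image: example/app1          # placeholder
  ports:
  - containerPort: 10000
    hostPort: 10000
```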

If you still believe there is a problem, can you give a concrete example?


On Sun, Oct 23, 2016 at 12:36 AM, Yaroslav Molochko <ono...@gmail.com> wrote:
Thank you for the suggestion, I’ve thought of it as well, but it doesn’t guarantee that deployment will happen in the order you described, unless you specify each and every node in the nodeSelector for each and every ReplicaSet, which makes the whole idea of k8s meaningless.

For instance, if you run 100 pods over 50 nodes, the scheduler will try to place as many pods as it can, no matter what port each is listening on. Of course pods with the same port will not be able to run on the same node, and there is a possibility that after a couple of liveness probes the scheduler will place a pod somewhere else, but there is no guarantee that the new node won’t have that port occupied already.

It will converge in some time of course, but this means that each new deployment/upgrade will lead to low performance (not all processes handling traffic) and may even lead to downtime. That is why I’m thinking of extending the basic scheduler or DaemonSet functionality with what we need.



Yaroslav Molochko

Oct 23, 2016, 4:17:20 AM
to kubernet...@googlegroups.com
Wow, I’ve missed the port predicate completely.
I’ll need to do some experimenting of course, but at least in theory it should work.

Thank you very much! 


Rodrigo Campos

Oct 23, 2016, 11:21:53 AM
to kubernet...@googlegroups.com
Sorry, can you elaborate on why host network?

What is the problem with a deployment that reserves the CPU it needs (so k8s will only schedule pods if the CPU is available, you won't have more pods per node than fit, etc.)?

Also, the scheduler by default tries to spread the deployment across nodes.

I really don't see why what you want to achieve can't be done with just a deployment. Am I missing something?

Yaroslav Molochko

Oct 23, 2016, 3:20:29 PM
to kubernet...@googlegroups.com
About host networking:
We deal with some sort of VPN solution; that is why the client source address is a must, and extra delay affects user experience under heavy load.
We did an experiment under docker-compose, and it works way better with host networking for us.

As David Oppenheimer pointed out, there is a nodePort predicate, which makes your and his solution with the required CPU parameter work, at least in theory (I did not try it yet). I’d completely missed this nodePort predicate, which made the solution seem less preferable because of long convergence time; with the predicate, the solution is completely valid.



Rodrigo Campos

Oct 23, 2016, 4:16:47 PM
to kubernet...@googlegroups.com
On Sun, Oct 23, 2016 at 12:20:08PM -0700, Yaroslav Molochko wrote:
> About host networking
> We deal with some sort of VPN solution, that is why client source address is a must as well as extra delay affects user experience under heavy load.

Which client source address? Which is the client and which one is the server in
the case you want to preserve the "source address"?

> We did an experiment under docker-compose and it is working way better with host networking for us.

I don't know what docker compose does, but it might not be comparable.

In kubernetes you have plenty of options for networking: you can use the AWS VPC
or the private Google Cloud network, with kubernetes on those cloud providers at
least. And there are tons of other options too (like Weave, etc., and those may
affect performance, I don't know).

Also, coreos uses flannel and I doubt you will see an impact with that either.

But in any case, I really doubt the docker-compose experiment implies
hostNetworking in kubernetes is faster.

>
> As David Oppenheimer pointed out, there is nodePort predicate,

nodePort? hostPort maybe?

> which makes your and his solution with required CPU parameter working, at
> least in theory (did not try it yet). I’ve completely missed this nodePort
> predicate which made this solution less preferable, because of long
> convergency time, with the predicate - solution is completely valid.

Not sure what you want to say with this. Can you please elaborate? Why is node
port important to you? Why not use the scheduler to spread the pods across
different VMs? And why do you have a hard requirement on that? Are you sure
it's a hard requirement?



Thanks a lot,
Rodrigo

Yaroslav Molochko

Oct 23, 2016, 6:15:43 PM
to kubernet...@googlegroups.com
I’m really happy that this topic has raised some attention. I can’t show all the cards, but as I’ve said previously, we are dealing with VPN and anonymity on the internet. What this means: we must make sure none of the user’s information is exposed, otherwise in some countries a person can even be beheaded. And obviously, countries like this try to block our solution. So, each node has about 100+ outgoing IPs, which are dynamic and managed by our security officers; they check banned and compromised IPs and change them on the hosts dynamically. We must also make sure that the source IP of the client arrives without changes. We also have pretty strict delay metrics which we must meet, otherwise techniques such as delay measurement may be used to identify a user; more info here: http://dimacs.rutgers.edu/Workshops/Anonymous/parv.pdf

Answering your questions:

On Oct 23, 2016, at 1:16 PM, Rodrigo Campos <rod...@sdfg.com.ar> wrote:

On Sun, Oct 23, 2016 at 12:20:08PM -0700, Yaroslav Molochko wrote:
About host networking
We deal with some sort of VPN solution, that is why client source address is a must as well as extra delay affects user experience under heavy load.

Which client source address? Which is the client and which one is the server in
the case you want to preserve the "source address”?

In our case it is both. We must make sure we identify the user and there is no MitM involved, and we must make sure we set the correct outgoing IP for that particular user.


We did an experiment under docker-compose and it is working way better with host networking for us.

I don't know what docker compose does, but it might not be comparable.

In kubernetes you have plenty of options for networking, and can use AWS vpc in
AWS or the private google cloud network, with kubernetes on those cloud
providers at least. And there are tons of other options also (liek weave, etc.
and those may affect performance, I don't know)


Even though AWS and GCP are two amazing cloud providers, and we would love to be able to migrate there, we can’t, for several reasons:
1. They charge a lot for traffic.
2. They are not available in all the countries where we try to be present.

Also, coreos uses flannel and I doubt you will see an impact with that either.

But in any case, I really doubt the docker-compose experiment implies
hostNetworking in kubernetes is faster.


As David Oppenheimer pointed out, there is nodePort predicate,

nodePort? hostPort maybe?

Excuse me for mixing this up; you are right, it is hostPort.


which makes your and his solution with required CPU parameter working, at
least in theory (did not try it yet). I’ve completely missed this nodePort
predicate which made this solution less preferable, because of long
convergency time, with the predicate - solution is completely valid.

Not sure what you want to say with this. Can you please elaborate? Why is node
port important to you? Why not use the scheduler to spread the pods in different
VMs? And why do you have a hard requirement on that? Are you sure it's a hard
requirement?

Sure, I’ll be happy to clear things up. As I’ve mentioned, we have pretty strict delay requirements, which led us to patching the kernel with our modified network stack (the 4.8 kernel has some of our patches, BTW), and a lot of our applications communicate with each other through unix sockets because it is almost 50% faster in our case. So, we have one application (let’s call it IN) which deals with customers’ encrypted connections, and we place as many instances of it as we have CPU cores. We also have an outgoing gateway app (let’s call it OUT) which applies some anonymizing rules and sends traffic through some specific IP. So basically IN communicates with OUT through unix sockets, and as OUT is not so CPU-demanding we run only one instance of it. We can’t run as many instances of OUT as we have CPUs because the OUT app is RAM-demanding due to extended caching. So what I thought of: I could run (CPU cores * ReplicaSets) replicas of IN, each with an exposed port (one for each IN instance on the node), and create one ReplicaSet for OUT with a hostPort, which would prevent other OUT instances from being placed on that node.

But if you have a better solution, I’m all ears.

Rodrigo Campos

Oct 23, 2016, 6:37:28 PM
to kubernet...@googlegroups.com
On Sun, Oct 23, 2016 at 03:15:31PM -0700, Yaroslav Molochko wrote:
> I’m really happy that this topic raised some attention. I can’t open all the
> cards, but as I’ve said previously we are dealing with VPN and anonymity in
> the internet. What this means - we must make sure none of user information is
> exposed outside, otherwise in some countries person can be even beheaded. And
> obviously, countries like this, try to block out our solution. So, each node
> has about 100+ outgoing IPs, which are dynamic and managed by our security
> officers, they check banned and compromised IPs and change them on the hosts
> dynamically. We must also make sure that source IP of the client is arrived
> without changes. We also have pretty strict delay metrics which we should
> follow, otherwise techniques such as delay measurement may be used to identify
> user, more info here: http://dimacs.rutgers.edu/Workshops/Anonymous/parv.pdf
> <http://dimacs.rutgers.edu/Workshops/Anonymous/parv.pdf>

Oh, cool use case :)

> Answering your questions:
>
> > On Oct 23, 2016, at 1:16 PM, Rodrigo Campos <rod...@sdfg.com.ar> wrote:
> >
> > On Sun, Oct 23, 2016 at 12:20:08PM -0700, Yaroslav Molochko wrote:
> >> About host networking
> >> We deal with some sort of VPN solution, that is why client source address is a must as well as extra delay affects user experience under heavy load.
> >
> > Which client source address? Which is the client and which one is the server in
> > the case you want to preserve the "source address”?
>
> In our case it is both. We must make sure we identified user and there is no
> MitM involved, as well as we must make sure we set correct outgoing IP for
> that particular user.

Oh, okay. I see


> >> We did an experiment under docker-compose and it is working way better with host networking for us.
> >
> > I don't know what docker compose does, but it might not be comparable.
> >
> > In kubernetes you have plenty of options for networking, and can use AWS vpc in
> > AWS or the private google cloud network, with kubernetes on those cloud
> > providers at least. And there are tons of other options also (liek weave, etc.
> > and those may affect performance, I don't know)
> >
>
> Even though AWS and GCP are two amazing cloud providers, and we would love to be able to migrate there, we can’t do this because of several reasons:
> 1. They charge a lot for traffic
> 2. They are not available in all the countries we try to be present.

Sure, just saying that there are options and that the docker-compose comparison
probably isn't fair. Just that.

>
> > Also, coreos uses flannel and I doubt you will see an impact with that either.
> >
> > But in any case, I really doubt the docker-compose experiment implies
> > hostNetworking in kubernetes is faster.
> >
> >>
> >> As David Oppenheimer pointed out, there is nodePort predicate,
> >
> > nodePort? hostPort maybe?
>
> Excuse me for the mixing up this, you are right it is hostPort.

No problem, wasn't sure I was following :-D

>
> >
> >> which makes your and his solution with required CPU parameter working, at
> >> least in theory (did not try it yet). I’ve completely missed this nodePort
> >> predicate which made this solution less preferable, because of long
> >> convergency time, with the predicate - solution is completely valid.
> >
> > Not sure what you want to say with this. Can you please elaborate? Why is node
> > port important to you? Why not use the scheduler to spread the pods in different
> > VMs? And why do you have a hard requirement on that? Are you sure it's a hard
> > requirement?
>
> Sure, will be happy to clear things up for you. As I’ve mentioned, we have
> pretty strict delay requirements, which lead us to patching kernel with our
> modified network stack (4.8 kernel has part of our patches BTW), and a lot of
> our applications are communicating between each other through unix sockets
> because it is almost 50% faster in our case.

Aham...

> So, we have 1 application (lets
> call it IN) which is dealing with customer’s encrypted connections, and we
> place as many instances as we have CPUs cores. We also have outgoing gateway
> app (lets call it OUT) which is applying some anonymizing rules and sent
> traffic through some specific IP. So basically IN communicate with OUT through
> unix sockets, and as OUT is not so CPU demanded we run only one instance of
> it. But we can’t run as many instance as CPUs because OUT app is RAM demanding
> due to extended caching. So what I’ve thought of, I could run (CPUCores *
> replicaSets) = amount of replicas, with exposed port (one for each IN instance
> on the node) and create one replicaSet for OUT with hostPort which would
> prevent other instances to be placed over that node.

Oh, okay. Now I understand. It makes sense, although you will probably lose the
IP address of the connecting client. There are efforts to preserve that (it's
easy with HTTP, as you can add it in a header), but I think they are alpha
right now.

Also, just as there is an "Ingress" kind, an "Egress" has been discussed too. You
can check its current state, or push it forward and make sure it works for you ;)

But, to have a solution today, what you say makes sense. I'm just not sure how
you will communicate between the IN and OUT pods if they are different pods,
given that you need unix sockets and it is so sensitive to performance.

I would consider doing: 1 pod that has several containers, several IN containers
that each reserve the CPU usage you want and one OUT container (that also
reserves the mem usage you want). All in one pod.

This way, you can communicate via unix sockets using an emptyDir volume or
HostPath if that is more performant. Also, the OUT container may need to use
hostNetwork to do the outgoing IP thing you need.

And if a logical host consists of several IN instances and one OUT instance, then
you really want them all in the same pod. That's what a pod tries to abstract, really.
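A sketch of that single-pod layout for, say, a 4-core node. The image names, resource numbers, and mount path are all placeholders, not anything from the thread:

```yaml
# One pod per node: several IN containers plus one OUT container,
# sharing an emptyDir for the unix sockets.
apiVersion: v1
kind: Pod
metadata:
  name: in-out                  # placeholder name
spec:
  hostNetwork: true             # OUT needs the node's outgoing IPs
  volumes:
  - name: sockets
    emptyDir: {}                # shared directory for the unix sockets
  containers:
  - name: out
    image: example/out          # placeholder
    resources:
      limits:
        memory: 4Gi             # OUT is RAM-heavy
    volumeMounts:
    - name: sockets
      mountPath: /var/run/app
  - name: in-0
    image: example/in           # placeholder
    resources:
      limits:
        cpu: 1000m              # one core per IN instance
    volumeMounts:
    - name: sockets
      mountPath: /var/run/app
  # ...repeat in-1 through in-3 for the remaining cores
```

The catch, as discussed below, is that the number of IN containers is baked into the pod template, so each node size needs its own template.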

Longer term, something like an egress kind might be a real win for this use
case, but it's not something available today.



Thanks a lot,
Rodrigo

Rodrigo Campos

Oct 23, 2016, 7:45:04 PM
to kubernet...@googlegroups.com
On Sun, Oct 23, 2016 at 11:37:17PM +0100, Rodrigo Campos wrote:
> But, to have a solution today, what you say makes sense. But I'm not sure how
> you will communicate between the IN and OUT pods if they are different pods and
> you need unix sockets and it is SO sensible to performance.
>
> I would consider doing: 1 pod that has several containers, several IN containers
> that each reserve the CPU usage you want and one OUT container (that also
> reserves the mem usage you want). All in one pod.
>
> This way, you can communicate via unix sockets using an emptyDir volume or
> HostPath if that is more performant. Also, the OUT container may need to use
> hostNetwork to do the outgoing IP thing you need.

Also, you may consider shared memory, etc. to communicate between the two, if
that is better for you performance-wise. And, of course, you can do that between
different containers in the same pod ;)

Yaroslav Molochko

Oct 23, 2016, 8:17:48 PM
to kubernet...@googlegroups.com
Thank you for your time and valuable suggestions, please find my comments below:

On Oct 23, 2016, at 3:37 PM, Rodrigo Campos <rod...@sdfg.com.ar> wrote:

But, to have a solution today, what you say makes sense. But I'm not sure how
you will communicate between the IN and OUT pods if they are different pods and
you need unix sockets and it is SO sensible to performance.

I was thinking of a shared folder from the host machine; we could make a dedicated volume for that shared folder, which can even be in tmpfs, just to avoid an inode-crawling attack vector.
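An emptyDir backed by tmpfs can be declared like this (sketch; the volume name is a placeholder):

```yaml
volumes:
- name: sockets
  emptyDir:
    medium: Memory    # tmpfs-backed, never touches the node's disk
```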

I would consider doing: 1 pod that has several containers, several IN containers
that each reserve the CPU usage you want and one OUT container (that also
reserves the mem usage you want). All in one pod.

This way, you can communicate via unix sockets using an emptyDir volume or
HostPath if that is more performant. Also, the OUT container may need to use
hostNetwork to do the outgoing IP thing you need.

And if a logical host consists of several IN and one OUT instances, then you
really want them all in the same pod. That what a pod tries to abstract, really.

Thanks for the suggestions. What bothers me, though, is that this may lead to extra work building dedicated pod configurations (meaning an extra ReplicaSet) for each type of node we have. Over years of evolution we’ve accumulated plenty of system types, from 1 core / 1 GB RAM to 32 cores / 64 GB RAM and everything in between :) That’s around a dozen configurations, so it is doable in general, but I would love to abstract the HW level completely.

That is why I’m thinking of some “smart” scheduling which would not require that maintenance overhead, but your solution may fit our needs if we don’t find a better way, and so far I’m running out of ideas :(

Rodrigo Campos

Oct 27, 2016, 9:35:59 AM
to kubernet...@googlegroups.com


On Sunday, October 23, 2016, Yaroslav Molochko <ono...@gmail.com> wrote:
Thank you for your time and valuable suggestions, please find my comments below:

On Oct 23, 2016, at 3:37 PM, Rodrigo Campos <rod...@sdfg.com.ar> wrote:

But, to have a solution today, what you say makes sense. But I'm not sure how
you will communicate between the IN and OUT pods if they are different pods and
you need unix sockets and it is SO sensible to performance.

I was thinking of a shared folder from the host machine; we could make a dedicated volume for that shared folder, which can even be in tmpfs, just to avoid an inode-crawling attack vector.

I would consider doing: 1 pod that has several containers, several IN containers
that each reserve the CPU usage you want and one OUT container (that also
reserves the mem usage you want). All in one pod.

This way, you can communicate via unix sockets using an emptyDir volume or
HostPath if that is more performant. Also, the OUT container may need to use
hostNetwork to do the outgoing IP thing you need.

And if a logical host consists of several IN and one OUT instances, then you
really want them all in the same pod. That what a pod tries to abstract, really.

Thanks for the suggestions. What bothers me, though, is that this may lead to extra work building dedicated pod configurations (meaning an extra ReplicaSet) for each type of node we have. Over years of evolution we’ve accumulated plenty of system types, from 1 core / 1 GB RAM to 32 cores / 64 GB RAM and everything in between :) That’s around a dozen configurations, so it is doable in general, but I would love to abstract the HW level completely.

Why would it lead to that? Can't you put more than one instance of an app on one node? Is this because of the IP address of each node?



Thanks a lot,
Rodrigo

Yaroslav Molochko

Oct 27, 2016, 1:01:27 PM
to kubernet...@googlegroups.com
I’ve actually read through the code and documentation over the past couple of days, and I’ve come to a pretty elegant solution (IMO):
Starting from 1.4, k8s provides inter-pod anti-affinity (still alpha, but the functionality is there).
So, I create node labels for node size, let’s say S, M, L, XL, …, with one label per size, so that a node can carry more than one size label: a node of size L will have the S, M, and L labels.

Then, I create a ReplicaSet with pod anti-affinity, which makes sure no more than one instance of the pod is located on a node, plus a node selector for size. So, let’s say size S has 1 core; then we get only 1 instance of the IN app there. Node size M has, let’s say, 2 cores; then we get one pod via the S selector and one via the M selector.

As a result, I’ll have exactly what I need: one instance per app configuration per node (due to host networking), and rolling updates out of the box.
So basically I was going to implement the same functionality, but it seems we are all set now.

I did not believe this functionality was present, but now I love k8s even more :)
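The scheme above might look roughly like this. A sketch only: the field syntax shown here is from later releases (in the 1.4 alpha, affinity is expressed via a pod annotation instead), and all label names are placeholders:

```yaml
# Pod template fragment for the size-S IN ReplicaSet.
metadata:
  labels:
    app: in-s
spec:
  nodeSelector:
    size-s: "true"     # placeholder label, set on every node of size S or larger
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: in-s  # at most one in-s pod per node
        topologyKey: kubernetes.io/hostname
```

Each size (M, L, XL, …) gets its own ReplicaSet with the corresponding labels, and the anti-affinity term caps each at one pod per node.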
