IPVS backend in kube-proxy


tomasz.p...@intel.com

unread,
Dec 12, 2016, 1:22:02 PM12/12/16
to kubernetes-sig-network
Folks,

There's a discussion going on about a new IPVS backend in kube-proxy. Our small research effort at Intel shows that it may really improve CPU utilization (especially in DR mode) compared with the current iptables mode. I've noticed that work has already started [1], but the code looks abandoned. Can we have this item on the agenda for the next sig-network meeting so we can discuss a few different feature sets for an ipvs backend?

Thanks in advance.

Thomas Graf

unread,
Dec 12, 2016, 4:59:01 PM12/12/16
to tomasz.p...@intel.com, kubernetes-sig-network
I'm happy to give a quick overview and demo of the BPF based
kube-proxy alternative we have been working on. It might be an
alternative to IPVS. The base numbers are equal or better than IPVS
and much better with DSR enabled. It works with IPv4 and IPv6.

The LB is not cilium specific at all and can work alongside any other
network plugin as well. The only cilium specific part is a couple of
hundred lines to listen for new services specs and translate that into
BPF configuration. It should be simple to separate that if there is
interest.

Let us know if there is interest and we'll talk about it for a couple
of minutes.

feisky

unread,
Dec 12, 2016, 10:20:45 PM12/12/16
to kubernetes-sig-network, tomasz.p...@intel.com
> I'm happy to give a quick overview and demo of the BPF based
> kube-proxy alternative we have been working on. It might be an
> alternative to IPVS. The base numbers are equal or better than IPVS and much better with DSR enabled. It works with IPv4 and IPv6.

Are there any issues or repos related to the BPF-based alternative?

Tim Hockin

unread,
Dec 12, 2016, 11:26:08 PM12/12/16
to tomasz.p...@intel.com, kubernetes-sig-network
An issue recently came to light that IPVS doesn't support port ranges,
which many people have asked for. Keep it in mind as prototypes
emerge.

winc...@gmail.com

unread,
Dec 13, 2016, 1:32:04 AM12/13/16
to kubernetes-sig-network
I am from Huawei. We have prototyped IPVS as a third kube-proxy load-balancing mode and validated various networking scenarios to be working, as well as NodePort and ExternalIP. During testing, we found IPVS is much better than iptables in terms of latency, CPU and memory utilization. I think the problem with iptables is that every packet on the host, regardless of whether it's related to a Kubernetes service or not, is evaluated against all of the iptables rules; when there are thousands of services this results in significant latency and CPU utilization. I would be glad to publish and integrate the implementation. Some test data FYI:

Metric                   Number of Services   LVS        IPTables
Time to access service   1000                 10 ms      7-18 ms
                         5000                 9 ms       15-80 ms
                         10000                9 ms       80-7000 ms
                         15000                9 ms       Unresponsive
                         50000                9 ms       Unresponsive
Memory Usage             1000                 386 MB     1.1 GB
                         5000                 N/A        1.9 GB
                         10000                542 MB     2.3 GB
                         15000                N/A        Out of Memory
                         50000                1272 MB    Out of Memory
CPU Usage                1000                 0%         N/A
                         5000                            50%-85%
                         10000                           50%-100%
                         15000                           N/A
                         50000                           N/A
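
For reference, this is roughly what an IPVS-based proxy would program for a single ClusterIP service with two endpoints (just a sketch, with made-up addresses; NAT/masquerade mode and round-robin scheduling). Lookups then go through IPVS's hash tables instead of a linear rule list:

# one virtual service per ClusterIP:port
ipvsadm -A -t 10.3.0.10:80 -s rr
# one real-server entry per endpoint; -m = masquerading (NAT)
ipvsadm -a -t 10.3.0.10:80 -r 10.244.1.5:8080 -m
ipvsadm -a -t 10.3.0.10:80 -r 10.244.2.7:8080 -m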

tomasz.p...@intel.com

unread,
Dec 13, 2016, 3:57:24 AM12/13/16
to kubernetes-sig-network, tomasz.p...@intel.com
Hey Tim,

You mean port ranges in Services?

Cheers


On Tuesday, December 13, 2016 at 5:26:08 AM UTC+1, Tim Hockin wrote:
> An issue recently came to light that IPVS doesn't support port ranges,
> which many people have asked for.  Keep it in mind as prototypes
> emerge.

On Mon, Dec 12, 2016 at 10:22 AM,  <tomasz.p...@intel.com> wrote:
> Folks,
>
> There's a discussion related to a new IPVS backend ip kube-proxy. Our small
> research at Intel shows that it may really improve CPU utilization
> (especially in DR mode) when compared with current iptables mode. I've
> notice that work has already started [1] but looks like code is abandoned.
> Can we have this item on the agenda for the next sig-networking meeting so
> we can discuss few different features set for ipvs backend ?
>
> Thanks in advance.
>
tomasz.p...@intel.com

unread,
Dec 13, 2016, 3:59:18 AM12/13/16
to kubernetes-sig-network
Hey Haibin,

This is exactly what we have seen in our research as well. So let's have a chat at the next sig meeting.

Cheers

Mingqiang Liang

unread,
Dec 13, 2016, 9:27:25 AM12/13/16
to tomasz.p...@intel.com, kubernetes-sig-network

Folks,

The reason my PR https://github.com/kubernetes/kubernetes/pull/30134 is in WIP status is that I use seesaw's ipvs package to sync ipvs configuration, which has a libnl compile and runtime dependency, so we would need to update some build/deploy scripts.

I have recently been trying to use a pure Go netlink approach (github.com/vishvananda/netlink/nl) to talk to the ipvs kernel module. Unfortunately, I am not a netlink expert; if anyone is familiar with netlink (for example, knows how to construct an "ipvsadm --restore" netlink request), it would definitely help a lot to speed up the development process.


On a side note, libnetwork has a netlink ipvs package we could leverage, see https://github.com/docker/libnetwork/tree/master/ipvs . But unfortunately, it only has Create/Update/Delete methods for the ipvs Service and Destination, and is missing Get methods for ipvs Service/Destination. The netlink request for Get is not hard to construct, but it's challenging to parse the netlink response.





--
Best Regards,
Liang Mingqiang

Thomas Graf

unread,
Dec 13, 2016, 9:40:35 AM12/13/16
to Mingqiang Liang, tomasz.p...@intel.com, kubernetes-sig-network
I'm the original author of libnl. Since ipvs is using libnl, the
easiest way to reverse engineer the Netlink messages being sent is to
run the ipvs package with the environment variable NLCB=debug. It will
print all Netlink messages sent to the kernel in decoded form to
stderr. Hope this helps.
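
For example, recent ipvsadm builds link against libnl, so something along these lines should dump the decoded messages for an add-service/add-destination request (a sketch; the addresses are made up, and it assumes an ipvsadm that uses libnl rather than the old setsockopt interface):

# decoded netlink messages are printed to stderr
NLCB=debug ipvsadm -A -t 10.3.0.10:80 -s rr 2> nl-add-service.txt
NLCB=debug ipvsadm -a -t 10.3.0.10:80 -r 10.244.1.5:8080 -m 2> nl-add-dest.txt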

Chris Marino

unread,
Dec 13, 2016, 10:23:21 AM12/13/16
to tomasz.p...@intel.com, kubernetes-sig-network
This is worth discussing.

I find this whole area extremely confusing and would like a more
complete understanding of what is actually going on here. Also, there
are separate issues here that are easy to conflate, but worth
understanding in isolation.

IPVS as another back end for kube-proxy would be pretty interesting.
However, it's not at all clear to me how the performance gains
described are a direct result of IPVS (vs other network
optimizations).

My (incomplete) understanding of DPDK and other network optimizations
(VPP, etc.) is that they are most effective when used within a
dedicated networking device. How these optimizations benefit the more
general use case where hosts mix workloads is not obvious to me.

Also, (guessing) the work to support service meshes is going to impact
kube-proxy as well, which makes me think that maybe some of these
things could be broken out as completely separate kube-proxy
alternatives.

Unfortunately I have a conflict for this week's sig, but if this is on
the agenda, I will do my best to attend.

CM

Thomas Graf

unread,
Dec 13, 2016, 10:32:34 AM12/13/16
to Chris Marino, tomasz.p...@intel.com, kubernetes-sig-network
On 13 December 2016 at 16:23, Chris Marino <ch...@romana.io> wrote:
> This is worth discussing.
>
> I find this whole area extremely confusing and would like a more
> complete understanding of what is actually going on here. Also, there
> are separate issues here that are easy to conflate, but worth
> understanding in isolation.
>
> IPVS as another back end for kube-proxy would be pretty interesting.
> However, its not at all clear to me how the performance gains
> described are a direct result of IPVS (vs other network
> optimizations).

The performance gain comes from avoiding sequential lists. In iptables
context, some of them can be avoided with ipset but in particular for
DNAT many of them remain. This limits scale. IPVS, nftables, BPF, ...
work around this by providing more complex data structures such as
hash tables to optimize this.
This is *not* about the fixed cost of the datapath itself and is thus
unrelated to other network optimizations.

> My (incomplete) understanding of DPDK and other network optimizations
> (VPP, etc.) is that they are most effective when used within a
> dedicated networking device. How these optimizations benefit the more
> general use case where hosts mix workloads is not obvious to me.

This used to be mostly true but is no longer the case. Though this is
not related to the IPVS vs iptables performance difference that is
being referred to in this thread.
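
A quick way to see the difference on a node is to compare what each mode hands the kernel (a sketch; KUBE-SERVICES is the chain the current iptables proxier generates):

# iptables mode: packets are matched against this chain rule by rule
iptables-save -t nat | grep -c '^-A KUBE-SERVICES'
# IPVS: virtual services and connections are looked up in hash tables
ipvsadm -L -n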

Tomasz Pa

unread,
Dec 13, 2016, 11:30:36 AM12/13/16
to kubernetes-sig-network, tomasz.p...@intel.com

On Tuesday, December 13, 2016 at 4:23:21 PM UTC+1, Chris Marino wrote:

> IPVS as another back end for kube-proxy would be pretty interesting.
> However, it's not at all clear to me how the performance gains
> described are a direct result of IPVS (vs other network
> optimizations).

The easiest way to get Kubernetes onto DPDK is OVS, but then we would need yet another backend for doing nat/snat the OVS way :P Performance gains would be noticeable on network-intensive workloads, but DPDK can also ensure very stable jitter, as packet switching would be done in user space, and with the help of Intel RDT and cpuset you can ensure dedicated CPU resources.


> My (incomplete) understanding of DPDK and other network optimizations
> (VPP, etc.) is that they are most effective when used within a
> dedicated networking device. How these optimizations benefit the more
> general use case where hosts mix workloads is not obvious to me.

It depends on how network-intensive your workload is :P For regular use cases (excluding load balancers and highly loaded memcache), replacing iptables with ipvsadm should be enough.
I will try to come up with a few slides showing potential use cases for DPDK and where it really benefits.
 

TP 

Tomasz Pa

unread,
Dec 13, 2016, 11:31:54 AM12/13/16
to kubernetes-sig-network, ch...@romana.io, tomasz.p...@intel.com

On Tuesday, December 13, 2016 at 4:32:34 PM UTC+1, Thomas Graf wrote:
> The performance gain comes from avoiding sequential lists. In iptables
> context, some of them can be avoided with ipset but in particular for
> DNAT many of them remain. This limits scale. IPVS, nftables, BPF, ...
> work around this by providing more complex data structures such as
> hash tables to optimize this.
> This is *not* about the fixed cost of the datapath itself and is thus
> unrelated to other network optimizations.

Using ipsets would mean that we would lose our load-balancing ability :P

Chris Marino

unread,
Dec 13, 2016, 11:54:33 AM12/13/16
to Tomasz Pa, kubernetes-sig-network, tomasz.p...@intel.com
Hi Tomasz, so the performance gains you see use OVS and DPDK? That
doesn't come as any real surprise, but it kind of illustrates the
point. The performance gains are the result of a combination of
things, some of which might never be deployed by an operator.

CM

Tomasz Pa

unread,
Dec 13, 2016, 11:57:30 AM12/13/16
to kubernetes-sig-network, ss7...@gmail.com, tomasz.p...@intel.com
Hey Chris,

For ipvs you see performance gains without DPDK; they are most noticeable on bigger, denser deployments.

Cheers

Guru Shetty

unread,
Dec 13, 2016, 12:16:18 PM12/13/16
to Tomasz Pa, kubernetes-sig-network, tomasz.p...@intel.com
On 13 December 2016 at 08:30, Tomasz Pa <ss7...@gmail.com> wrote:

> On Tuesday, December 13, 2016 at 4:23:21 PM UTC+1, Chris Marino wrote:
>
>> IPVS as another back end for kube-proxy would be pretty interesting.
>> However, it's not at all clear to me how the performance gains
>> described are a direct result of IPVS (vs other network
>> optimizations).
>
> The easiest way to get Kubernetes onto DPDK is OVS, but then we would need yet another backend for doing nat/snat the OVS way :P Performance gains would be noticeable on network-intensive workloads, but DPDK can also ensure very stable jitter, as packet switching would be done in user space, and with the help of Intel RDT and cpuset you can ensure dedicated CPU resources.

OVS uses a flow cache, so an increase in the number of services should ideally not affect performance. Here is a simple example of load-balancing:

I have Kubernetes integrated with OVS via OVN at [1]. It currently uses the Linux kernel datapath and is flow-cache based for both load-balancing and NAT (instead of iptables). The load-balancing can use some tweaks to make it perform better. But generally, since it is flow-cache based, an increase in the number of services should ideally not affect performance (I do not have performance benchmarks to back my statement).

Currently OVS DPDK does not have NAT support. There are patches which will likely go out for review soon. We should at least be able to quantify the performance improvements there.

>> My (incomplete) understanding of DPDK and other network optimizations
>> (VPP, etc.) is that they are most effective when used within a
>> dedicated networking device. How these optimizations benefit the more
>> general use case where hosts mix workloads is not obvious to me.
>
> It depends on how network-intensive your workload is :P For regular use cases (excluding load balancers and highly loaded memcache), replacing iptables with ipvsadm should be enough.
> I will try to come up with a few slides showing potential use cases for DPDK and where it really benefits.
>
> TP


Tim Hockin

unread,
Dec 13, 2016, 1:04:23 PM12/13/16
to tomasz.p...@intel.com, kubernetes-sig-network
> You mean port range in services ?

Yes - this is an oft-requested feature, and it's not explicitly
supported in IPVS, but it is possible in iptables. Here are the
issues I see:


1) The ability to remap ports has been pretty important for people to
be able to run webservers as non-root (bind/listen on :8080 in the
container) but still serve on port 80. IPVS supports remapping, but
only in masquerade mode (NAT) which should hit the same underlying
conntrack path. DSR mode can not do port remapping.

2) Lots of people are asking for port ranges on Services. IPTables
can translate one range to another range, but IPVS only supports
single-port and whole-IP (no remapping). Adding 1000 individual ports
seems like a bad idea (rough sketch below).

3) We're increasingly doing "interesting" things with iptables
(firewalling, node ports, etc) and the intersection with IPVS is
unclear to me.

All that said, the numbers look good. I wonder if we can make the
iptables implementation scale better. We have ideas on how to do
that, we just have not pursued them because nobody really runs 50000
services (yet).
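
To make (2) concrete: iptables can cover a whole range with one rule, while IPVS needs a virtual service per port. A sketch with made-up addresses (the DNAT rule keeps the original destination port rather than remapping it):

# iptables: a single rule forwards the whole range, ports preserved
iptables -t nat -A PREROUTING -d 10.3.0.10/32 -p tcp --dport 30000:30099 \
  -j DNAT --to-destination 10.244.1.5
# IPVS: one virtual service (plus real servers) for every port in the range
for p in $(seq 30000 30099); do
  ipvsadm -A -t 10.3.0.10:$p -s rr
  ipvsadm -a -t 10.3.0.10:$p -r 10.244.1.5:$p -m
done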

Ben Bennett

unread,
Dec 13, 2016, 1:09:39 PM12/13/16
to winc...@gmail.com, kubernetes-sig-network, Phil Cameron
I'd be interested to know the testing methodology you used.  We have tried 20,000 services and the numbers were nowhere near as bad.

We'll re-run the tests for this and publish the results along with our test setup.

-ben


Tim Hockin

unread,
Dec 13, 2016, 1:13:58 PM12/13/16
to Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
I've had the same question.  Folks have claimed 500ms and more just to do an iptables-restore.  I was NEVER able to repro this - I wonder if there's some kernel regression or patch that RH folks carry that causes problems?


Eric Paris

unread,
Dec 13, 2016, 1:17:49 PM12/13/16
to Tim Hockin, Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
I think we did find a 20%-ish improvement in our kernels when we turned
off kernel audit. But we never got anywhere near the Ubuntu kernel
numbers. So we still don't know what the difference is, and haven't really
done any digging....


Haibin Xie

unread,
Dec 13, 2016, 2:50:23 PM12/13/16
to kubernetes-sig-network, winc...@gmail.com, pcam...@redhat.com
The setup is: one master with two slaves running 8 pods; on top of the 8 pods we incrementally created services. The number is the total memory utilization on the slave, so there may be some noise. If you could, it would be good to break it down into iptables-specific virtual and physical memory usage.

Clayton Coleman

unread,
Dec 13, 2016, 3:18:56 PM12/13/16
to kubernetes-sig-network
I'd like to see continued discussion of BPF based approaches - one of the signature features that userspace has that iptables lacks is endpoint failover (connection refused = next endpoint tried). That leads to higher disruption than necessary when endpoints go down, and as IPVS can't handle this scenario I'm hoping BPF gives us more flexibility. Knowing whether that's possible would be important to us.

Thomas Graf

unread,
Dec 13, 2016, 4:18:28 PM12/13/16
to Clayton Coleman, kubernetes-sig-network
Clayton,

tl;dr I think it's possible although it will look slightly different
than the user space solution.

Let me try to give you an answer and you tell me whether this is what
you want ;-)

Taking your userspace proxy as base behaviour: I'm assuming you are
talking TCP here. It will retry the LB to backend connection n times
in a round robin fashion where the failover is probably triggered by
connect() returning an error. On the kernel side, each connect()
attempt will include several TCP retransmission attempts of the SYN
packets. The timeout is configurable.

Implementing this in BPF would look a bit different but would probably
arrive at a similar end result. I'm saying BPF here but this
applies to any programmable approach. The assumption would be that the
client retransmits multiple SYN packets until giving up. This is a
requirement of the TCP spec. The key is to ensure that each SYN is
balanced to a different backend to hopefully get a backend which is
alive. This is currently *NOT* what we are doing as we are taking the
(hardware) hash of the packet unmodified to select the backend. The
hash will be identical for all SYN retransmits. However, it will be
trivial to add some logic to recognize SYN retransmits and add a
different constant value to the hash for each retrans and upon success
of an established connection, carry that constant offset for the
lifetime of the connection to ensure that all subsequent packets go to
the same backend. It means carrying some state but it should still be
a lot more lightweight than a full blown socket based user space
proxy.
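
The client-side retry budget I'm relying on here is just the normal SYN retransmit behaviour, which is tunable:

# number of SYN retransmits before connect() gives up (default is 6,
# i.e. roughly two minutes with exponential backoff)
sysctl net.ipv4.tcp_syn_retries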


On 13 December 2016 at 21:18, Clayton Coleman <smarter...@gmail.com> wrote:
> I'd like to see continued discussion of BPF based approaches - one of the signature features that userspace has that iptables lacks is endpoint failover (connection refused = next endpoint tried). That leads to higher disruption than necessary when endpoints go down, and as IPVS can't handle this scenario I'm hoping BPF gives us more flexibility. Knowing whether that's possible would be important to us.
>

Dan Williams

unread,
Dec 13, 2016, 5:05:43 PM12/13/16
to Tim Hockin, Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
On Tue, 2016-12-13 at 10:13 -0800, 'Tim Hockin' via kubernetes-sig-
network wrote:
> I've had the same question.  Folks have claimed 500ms and more just
> to do
> an iptables-restore.  I was NEVER able to repro this - I wonder if
> there's
> some kernel regression or patch that RH folks carry that causes
> problems?

I think Haibin Xie's numbers are about something different.  iptables-
related things are:

1) iptables-restore time (eg, changing the iptables rules in the
kernel)

2) request/response time through a service (eg, a packet actually
traversing the iptables kernel data structures to the pod and back)

I usually talk about #1.

AIUI Haibin Xie is talking about #2.

Ben Bennett and Eric Paris are also talking about #2, since we've done
some internal benchmarking with large numbers of services and the
iptables proxy.  But our benchmarking was mostly about throughput, not
CPU/memory utilization and latency.

But I could be completely wrong?

Dan

Eric Paris

unread,
Dec 13, 2016, 5:09:32 PM12/13/16
to Dan Williams, Tim Hockin, Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
On Tue, 2016-12-13 at 16:05 -0600, Dan Williams wrote:
> On Tue, 2016-12-13 at 10:13 -0800, 'Tim Hockin' via kubernetes-sig-
> network wrote:
> > I've had the same question.  Folks have claimed 500ms and more just
> > to do
> > an iptables-restore.  I was NEVER able to repro this - I wonder if
> > there's
> > some kernel regression or patch that RH folks carry that causes
> > problems?
>
> I think Haibin Xie's numbers are about something different.
>  iptables-
> related things are:
>
> 1) iptables-restore time (eg, changing the iptables rules in the
> kernel)
>
> 2) request/response time through a service (eg, a packet actually
> traversing the iptables kernel data structures to the pod and back)
>
> I usually talk about #1.
>
> AIUI Haibin Xie is talking about #2.
>
> Ben Bennet and Eric Paris are also talking about #2, since we've done
> some internal benchmarking with large numbers of services and the
> iptables proxy.  But our benchmarking was mostly about throughput,
> not
> CPU/memory utilization and latency.
>
> But I could be completely wrong?

I talked about #1 earlier today on list (and #2 internally).

But agreed, they seem likely orthogonal...

-Eric

Thomas Graf

unread,
Dec 13, 2016, 5:16:13 PM12/13/16
to Dan Williams, Tim Hockin, Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
Dan,

On 13 December 2016 at 23:05, Dan Williams <dc...@redhat.com> wrote:
> On Tue, 2016-12-13 at 10:13 -0800, 'Tim Hockin' via kubernetes-sig-
> network wrote:
>> I've had the same question. Folks have claimed 500ms and more just
>> to do
>> an iptables-restore. I was NEVER able to repro this - I wonder if
>> there's
>> some kernel regression or patch that RH folks carry that causes
>> problems?
>
> I think Haibin Xie's numbers are about something different. iptables-
> related things are:
>
> 1) iptables-restore time (eg, changing the iptables rules in the
> kernel)

I'm pretty sure the issue you are seeing here is that iptables
has to allocate a relatively large chunk of memory each time it
replaces the entire table. iptables does not add/remove individual
rules but replaces entire tables. It does so by attempting kmalloc()
and then falling back on vmalloc(), because depending on the memory
situation of the machine it may have become very difficult for the
kernel to allocate a large contiguous chunk (>128MB).

You will thus likely see very different iptables-restore times
depending on the memory situation on the machine.
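
An easy way to check whether this is what you are hitting is to time a full restore of the current ruleset on a busy vs. a mostly idle machine (a quick sketch; this is the same kind of whole-table replace the proxier triggers):

iptables-save > /tmp/rules.v4
wc -l /tmp/rules.v4                    # rough idea of the table size
time iptables-restore < /tmp/rules.v4  # full table replace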

Bowei Du

unread,
Dec 13, 2016, 5:51:44 PM12/13/16
to Haibin Xie, kubernetes-sig-network
Hi Haibin,

Would it be possible to check in the yaml and scripts you used to run the benchmark? It will make it easier to repro your numbers.

Thanks,
Bowei



Bowei Du

unread,
Dec 13, 2016, 5:52:13 PM12/13/16
to Haibin Xie, kubernetes-sig-network
*by "check in" I mean post to GitHub or a gist

Dan Williams

unread,
Dec 13, 2016, 6:15:10 PM12/13/16
to Thomas Graf, Tim Hockin, Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
That could well be the case.  In the two machines I tested, IIRC:

1) 4C/8T/24GB: took about 350ms for 8000 iptables rules
2) 2C/4T/8GB: up to 700ms for 8000 iptables rules

Both running Fedora 24 with 4.7 kernels.

Tim cannot seem to reproduce anything greater than 100ms, but he's using
3.19 kernels.

Dan

Clayton

unread,
Dec 13, 2016, 8:14:48 PM12/13/16
to Thomas Graf, kubernetes-sig-network
That sounds like a net win. Tim might disagree with me ;) but I have people still running the userspace proxy because it's a better end-application experience. We've been discussing this from a kernel perspective to see whether network BPF could be backported specifically to enable these sorts of scenarios, so there's interest on our end in exploring the solution space.

Jeremy Eder

unread,
Dec 13, 2016, 8:26:02 PM12/13/16
to Eric Paris, Tim Hockin, Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
On Tue, Dec 13, 2016 at 1:17 PM, Eric Paris <epa...@redhat.com> wrote:
> I think we did find a 20%-ish improvement in our kernels when we turned
> off kernel audit. But we never got anywhere near the Ubuntu kernel
> numbers. So we still don't know what the difference is, and haven't really
> done any digging....

Where can I see more detail about this report? What was the test?

Tim Hockin

unread,
Dec 14, 2016, 12:47:15 AM12/14/16
to Clayton, Thomas Graf, kubernetes-sig-network
My preference order of cool tech to make work

* eBPF
* faster iptables (if someone can show me a benchmark, I know how to
fix some of it)
* IPVS
* userspace.

Userspace with:
- iptables-save and -restore would perform better
- TPROXY would potentially preserve source IP (rough sketch below)
- splice() would perform much better

I am intrinsically against userspace because it has to be running for
traffic to flow.
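
For context, the TPROXY variant would look roughly like this on the host side (a sketch; the port and mark values are made up, and the proxy itself has to bind with IP_TRANSPARENT):

# steer service traffic to a local transparent proxy without rewriting the dst
iptables -t mangle -A PREROUTING -d 10.3.0.10/32 -p tcp --dport 80 \
  -j TPROXY --on-port 10080 --tproxy-mark 0x1/0x1
# deliver marked packets to the local stack
ip rule add fwmark 0x1 lookup 100
ip route add local 0.0.0.0/0 dev lo table 100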

Ilya Dmitrichenko

unread,
Dec 14, 2016, 6:51:19 AM12/14/16
to Tim Hockin, Clayton, Thomas Graf, kubernetes-sig-network
Would it be correct to assume that an eBPF solution would also make it easier to report service-level metrics to user space?

Thomas Graf

unread,
Dec 14, 2016, 7:05:23 AM12/14/16
to Ilya Dmitrichenko, Tim Hockin, Clayton, kubernetes-sig-network
On 12/14/16 at 11:51am, Ilya Dmitrichenko wrote:
> Would it be correct to assume that eBPF solution would also make it easier
> to report any service-level metrics to the user-space?

I'm careful with saying "absolutely" because the capabilities are limited
by the view you have at L3/L4 since we are talking L4 LB in this case.
So something like HTTP request/response latency measurement would require
some additional work.

However, you can collect any statistic you want based on the view you have
and share it. As with tracepoints, this can be compiled in at runtime.
The real value here is that aggregation can happen in the kernel already,
f.e. you can collect a histogram about packet sizes passing through each
service without sending all packets to user space. User space only reads
the aggregated histogram periodically and could display it f.e. in weave
scope:

       1 -> 1     : 0        |                                      |
       2 -> 3     : 0        |                                      |
       4 -> 7     : 0        |                                      |
       8 -> 15    : 0        |                                      |
      16 -> 31    : 0        |                                      |
      32 -> 63    : 22       |                                      |
      64 -> 127   : 98       |                                      |
     128 -> 255   : 213      |                                      |
     256 -> 511   : 1444251  |********                              |
     512 -> 1023  : 660610   |***                                   |
    1024 -> 2047  : 535241   |**                                    |
    2048 -> 4095  : 19       |                                      |
    4096 -> 8191  : 180      |                                      |
    8192 -> 16383 : 5578023  |************************************* |
   16384 -> 32767 : 632099   |***                                   |
   32768 -> 65535 : 6575     |                                      |

This is an example of the recent lwt bpf series:

/* Excerpt from the lwt-bpf sample; assumes the usual helper and map
 * definitions (SEC(), struct bpf_elf_map, bpf_map_* helpers) that come
 * with the surrounding sample code. */
struct bpf_elf_map SEC("maps") lwt_len_hist_map = {
	.type       = BPF_MAP_TYPE_PERCPU_HASH,
	.size_key   = sizeof(__u64),
	.size_value = sizeof(__u64),
	.pinning    = 2,	/* pinned globally so user space can read it */
	.max_elem   = 1024,
};

/* branchless integer log2 */
static unsigned int log2(unsigned int v)
{
	unsigned int r;
	unsigned int shift;

	r = (v > 0xFFFF) << 4; v >>= r;
	shift = (v > 0xFF) << 3; v >>= shift; r |= shift;
	shift = (v > 0xF) << 2; v >>= shift; r |= shift;
	shift = (v > 0x3) << 1; v >>= shift; r |= shift;
	r |= (v >> 1);
	return r;
}

static unsigned int log2l(unsigned long v)
{
	unsigned int hi = v >> 32;

	if (hi)
		return log2(hi) + 32;
	else
		return log2(v);
}

SEC("len_hist")
int do_len_hist(struct __sk_buff *skb)
{
	__u64 *value, key, init_val = 1;

	/* histogram bucket = log2 of the packet length */
	key = log2l(skb->len);

	value = bpf_map_lookup_elem(&lwt_len_hist_map, &key);
	if (value)
		__sync_fetch_and_add(value, 1);
	else
		bpf_map_update_elem(&lwt_len_hist_map, &key, &init_val, BPF_ANY);

	return BPF_OK;
}
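
If someone wants to play with it, the sample that goes with the series attaches this as a lightweight-tunnel program on a route, roughly like this (from memory; file and device names are illustrative, and it needs a kernel with the lwt-bpf series plus a matching iproute2):

clang -O2 -target bpf -c lwt_len_hist_kern.c -o lwt_len_hist_kern.o
ip route add 10.244.1.0/24 encap bpf out obj lwt_len_hist_kern.o \
    section len_hist dev veth0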

Dan Williams

unread,
Dec 14, 2016, 2:23:10 PM12/14/16
to Tim Hockin, Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
Some quick, harder numbers to add to the debate, for 2000 services (~10000
iptables rules) versus no services:

* +7 to +15% additional CPU usage for kube-proxy
* +2GB system memory usage (MemFree before/after)
* +40MB RSS usage for kube-proxy
* +300ms required time to complete syncProxyRules() (15ms -> ~330ms)

Fedora 24 4.8.8 kernel, 4C/8T + 24GB RAM system.  Kube run using local-
up-cluster.sh.

Dan


Tim Hockin

unread,
Dec 14, 2016, 5:16:45 PM12/14/16
to Dan Williams, Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
Dan, do you have a test script that produces those numbers? I'll
happily run it on our VMs - I was doing 10k+ rules and I never got
more than 150ms for iptables-restore.

Dan Williams

unread,
Dec 14, 2016, 5:21:51 PM12/14/16
to Tim Hockin, Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
On Wed, 2016-12-14 at 14:16 -0800, Tim Hockin wrote:
> Dan, do you have a test script that produces those numbers?  I'll
> happily run it on our VMs - I was doing 10k+ rules and I never got
> more than 150ms for iptables-restore.

#!/bin/bash
# Pass "create" to create the services, or "delete" to tear them down.
for i in {1..2000}; do
  YAML="apiVersion: v1
kind: Service
metadata:
  labels:
    name: nginxservice${i}
  name: nginxservice${i}
spec:
  ports:
    - port: $(expr 82 + ${i})
      targetPort: 80
      protocol: TCP
  selector:
    app: nginx
  type: LoadBalancer"
  echo "${YAML}" | cluster/kubectl.sh "$1" -f - || exit 0
done

run with "create" or "delete" as the argument.  Then I just "tail -f
/tmp/kube-proxy.log" (since I'm doing local-up-cluster.sh) and watched
for the "syncProxyRules() took XXXX seconds" messages.

Dan

Dan Williams

unread,
Dec 14, 2016, 5:24:33 PM12/14/16
to Tim Hockin, Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
On Wed, 2016-12-14 at 16:21 -0600, Dan Williams wrote:
> On Wed, 2016-12-14 at 14:16 -0800, Tim Hockin wrote:
> >
> > Dan, do you have a test script that produces those numbers?  I'll
> > happily run it on our VMs - I was doing 10k+ rules and I never got
> > more than 150ms for iptables-restore.
>
> #!/bin/bash
> for i in {1..2000}; do

You'll need to increase your services CIDR to a /22 or something to
cover this.

I also needed to:

prlimit --pid <pidof kube-proxy> --nofile=20000:20000

or something similar too otherwise the proxy will run out of fds and
you'll see a message to that effect in the proxy logs.  I posted about
that on #sig-network today before I figured out what it was.

Dan

Tim Hockin

unread,
Dec 14, 2016, 5:26:31 PM12/14/16
to Dan Williams, Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
That's not going to show me everything you listed.

Dan Williams

unread,
Dec 14, 2016, 5:35:52 PM12/14/16
to Tim Hockin, Ben Bennett, winc...@gmail.com, kubernetes-sig-network, Phil Cameron
On Wed, 2016-12-14 at 14:26 -0800, 'Tim Hockin' via kubernetes-sig-
network wrote:
> That's not going to show me everything you listed.

The info gathering is literally just:

head -n 8 /proc/meminfo
ps a -o vsz,rss,args --cols 75 | grep output | grep kube | grep -v sudo
top (and watch kube-proxy CPU usage)

Luiz Filho

unread,
Dec 15, 2016, 7:11:30 PM12/15/16
to kubernetes-sig-network, tomasz.p...@intel.com
Liang Mingqiang,

You can check this out: https://github.com/qmsk/clusterf/tree/master/ipvs, it is a pure Go client for IPVS.

But instead of using it directly, I would suggest creating a standalone implementation in a different project, with a better API, proper tests and concurrency safety. It seems that ipvs is gaining some attention lately, and everyone would benefit from having a good, well-tested package.


Quinton Hoole

unread,
Apr 21, 2017, 4:15:48 PM4/21/17
to kubernetes-sig-network
We're trying to tie the various IPVS discussions together and produce a working implementation in 

https://github.com/kubernetes/kubernetes/issues/44063

Q
