Hello Han,
Thank you for the quick reply.
What you say makes sense as a short-term solution. Row filtering might not be that helpful for the topology we have in OVN K8s, so by disabling it we shouldn't incur that overhead.
If it is a quick fix to disable conditional monitoring, please send us the patch and we can test it against the OVN K8s topology at scale.
Thanks once again.
Regards,
~Girish
From: Han Zhou <hz...@ovn.org>
Date: Friday, January 10, 2020 at 12:23 PM
To: Girish Moodalbail <gmood...@nvidia.com>
Cc: Dan Williams <dc...@redhat.com>, Ben Pfaff <b...@ovn.org>, Numan Siddique <nusi...@redhat.com>, Mark Michelson <mmic...@redhat.com>, Dumitru Ceara <dce...@redhat.com>, "hz...@ovn.org" <hz...@ovn.org>, "agin...@ebay.com" <agin...@ebay.com>, sdn-dev
<sdn...@exchange.nvidia.com>, "ovn-kub...@googlegroups.com" <ovn-kub...@googlegroups.com>
Subject: Re: At scale SB ovsdb-server 100% CPU bound for 10s of minutes while adding a new data path
Hi Girish,
Thanks for reporting the findings!
It is straightforward for both the client side and the server side to always send all conditions - it is the declarative way, which is easier to implement and ensures correctness. However, as you noticed, it is less efficient.
Achieving what you proposed (i.e. sending only the changed conditions) would require another piece of incremental processing - for monitor conditions - which is a big change.
We haven't encountered this problem yet, even though we have more nodes, probably because we don't use a per-node logical switch; we have far fewer logical switches, which results in much smaller monitor conditions. We might hit the same problem if we had more logical switches.
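For context, a rough sketch of what a monitor condition update looks like on the wire. The monitor_cond_change method comes from the OVSDB conditional-monitoring extension; the table, clauses, and UUID placeholders below are illustrative only. Note the client resends the full clause list (roughly one clause per datapath it cares about) rather than a delta:

{"id": 42, "method": "monitor_cond_change",
 "params": [["monid", "OVN_Southbound"], ["monid", "OVN_Southbound"],
            {"Port_Binding": [{"where": [["datapath", "==", ["uuid", "<dp-1-uuid>"]],
                                         ["datapath", "==", ["uuid", "<dp-2-uuid>"]]]}]}]}

With a per-node logical switch, that "where" list grows with the node count, and every node resends the whole list whenever its condition changes.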
I have a different proposal to tackle this issue, at least for your scenario. In k8s (and many other use cases), every workload is supposed to be reachable from every other workload, which means each node needs the full-mesh data of the topology. In this case conditional monitoring doesn't help much: it doesn't reduce the amount of data each node monitors, but it introduces a lot of cost:
1. The condition filtering in ovsdb-server for each transaction is heavy.
2. The monitor cache on the server side cannot be shared across clients.
3. Monitor condition updates can be heavy (as brought up by this topic).
Because of this, we could provide an option for ovn-controller to disable conditional monitoring - it would still monitor only the tables and columns it needs, but do no row-level filtering. It would be a small change.
Does this sound reasonable?
Thanks,
Han
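The option described above is the one referred to later in this thread as ovn-monitor-all. A minimal sketch of enabling it on a node, assuming ovn-controller reads its settings from the local Open_vSwitch table as usual:

$ ovs-vsctl set open_vswitch . external_ids:ovn-monitor-all=true

With this set, ovn-controller requests whole-table monitors for the tables and columns it needs instead of per-row conditional monitors.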
Hello Han,
Thanks! We will give it a try and let you know ASAP.
Regards,
~Girish
From: Han Zhou <hz...@ovn.org>
Date: Monday, January 13, 2020 at 3:24 PM
To: Girish Moodalbail <gmood...@nvidia.com>
Cc: Han Zhou <hz...@ovn.org>, Dan Williams <dc...@redhat.com>, Ben Pfaff <b...@ovn.org>, Numan Siddique <nusi...@redhat.com>, Mark Michelson <mmic...@redhat.com>, Dumitru Ceara <dce...@redhat.com>, "agin...@ebay.com" <agin...@ebay.com>, sdn-dev <sdn...@exchange.nvidia.com>,
"ovn-kub...@googlegroups.com" <ovn-kub...@googlegroups.com>
Subject: Re: At scale SB ovsdb-server 100% CPU bound for 10s of minutes while adding a new data path
Hi Girish,
I sent out the patches in a series: https://patchwork.ozlabs.org/project/openvswitch/list/?series=152938
Somehow patchwork didn't show the patch 2/3 as part of the series: https://patchwork.ozlabs.org/patch/1222380/
Please let me know if it works for you.
Thanks,
Han
Hello Han,
Once again, thanks for the patches. With these patches the SB ovsdb-server is no longer 100% CPU bound like it was before.
However, I am seeing a different issue on the ovn-controller side. In the OVN K8s logical topology we have an L3 gateway router per node, so on the 600-node cluster we have 600 L3 gateways. Each L3 gateway is connected to the physical world by a logical switch and a localnet port on that logical switch.
Before your patch, every node had one pair of patch ports connecting the integration bridge and the physical bridge; ovn-controller creates these patch ports. From the ovn-controller man page:
external_ids:ovn-localnet-port in the Port table
The presence of this key identifies a patch port as one
created by ovn-controller to connect the integration
bridge and another bridge to implement a localnet
logical port. Its value is the name of the logical port
with type set to localnet that the port implements. See
external_ids:ovn-bridge-mappings, above, for more
information.
Each localnet logical port is implemented as a pair of
patch ports, one in the integration bridge, one in a
different bridge, with the same
external_ids:ovn-localnet-port value.
With your patch, since each node now gets all the rows from the tables it is interested in, I see 600 pairs of patch ports on each node.
$ ovs-vsctl list-ports br-int | grep patch-br-int-to-breth0 | wc -l
600
$ ovs-vsctl list-ports breth0 | grep patch | wc -l
644
Since the L3 gateway is pinned to a chassis, the localnet port that connects the gateway to the physical network also belongs to that chassis, so each node should see only one pair of patch ports. This is definitely not correct; let me know if you need more information.
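One way to cross-check which localnet logical port each patch port implements is to query the external_ids key described in the man page excerpt above (a sketch; run against the node's local OVS database):

$ ovs-vsctl --columns=name,external_ids find Port 'external_ids:ovn-localnet-port!=""'

On a healthy node this should list only the pair of patch ports for that node's own localnet port.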
Regards,
~Girish
Hello Han,
Thanks for the explanation. That was our understanding as well. Yes, we will need optimizations in ovn-controller when *monitor-all* is set (especially for the LogicalFlow table). In our setup we are seeing close to 600K OpenFlow rules on *br-int* after the patch; before, it was only 60K flows.
Let me try the workaround you suggest below and see how much it reduces the port count as well as the OpenFlow rule count.
Thanks,
~Girish
From: Han Zhou <hz...@ovn.org>
Date: Wednesday, January 15, 2020 at 11:46 PM
To: Girish Moodalbail <gmood...@nvidia.com>
Cc: Han Zhou <hz...@ovn.org>, Dan Williams <dc...@redhat.com>, Ben Pfaff <b...@ovn.org>, Numan Siddique <nusi...@redhat.com>, Mark Michelson <mmic...@redhat.com>, Dumitru Ceara <dce...@redhat.com>, "agin...@ebay.com" <agin...@ebay.com>, sdn-dev <sdn...@exchange.nvidia.com>,
"ovn-kub...@googlegroups.com" <ovn-kub...@googlegroups.com>
Subject: Re: At scale SB ovsdb-server 100% CPU bound for 10s of minutes while adding a new data path
Hi Girish,
The reason it creates 600 pairs of patch ports is that all port bindings are now monitored and processed by patch_run() in ovn-controller.
Previously, the datapaths connected by gateway-router ports were not regarded as local datapaths and were not monitored.
We can add an *optimization* in ovn-controller so that only port bindings residing on local datapaths are processed (and probably the same optimization for logical flow processing). Note: this optimization is needed only when "ovn-monitor-all" is set to true. I will work on a new patch for this.
A workaround (until the optimization is done) is for ovn-k8s not to use the same network_name for the localnet ports on different nodes. Instead, use a node-specific value for the localnet port's options:network_name and use it in the bridge-mappings in the OVS settings as well (see the sketch below).
Thanks,
Han
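A minimal sketch of that workaround, with hypothetical port, network, and node names (ln-worker-1 stands for the node's localnet logical switch port):

$ # on node worker-1: give its localnet port a node-specific network_name
$ ovn-nbctl lsp-set-options ln-worker-1 network_name=physnet-worker-1
$ # and map that network name to the physical bridge in this node's OVS settings
$ ovs-vsctl set open_vswitch . external_ids:ovn-bridge-mappings=physnet-worker-1:breth0

Since no other chassis has a bridge mapping for physnet-worker-1, only worker-1's ovn-controller creates the patch ports for that localnet port.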
Hi Han,
Please let us know when you have the patch ready for testing.
We tried the workaround on our 600+ node cluster:
# ovs-vsctl list-ports breth0 | grep patch | wc -l
1
The flow count on br-int is still around 600k:
# ovs-ofctl dump-aggregate br-int
NXST_AGGREGATE reply (xid=0x4): packet_count=1130981 byte_count=106970768 flow_count=611249
Regards,
Zhen
Thanks Han,
We will try and let you know.
Regards,
~Girish
From: Han Zhou <hz...@ovn.org>
Date: Tuesday, February 18, 2020 at 3:38 PM
To: Han Zhou <hz...@ovn.org>
Cc: "Zhen Wang (SW-CLOUD)" <zhe...@nvidia.com>, Girish Moodalbail <gmood...@nvidia.com>, Dan Williams <dc...@redhat.com>, Ben Pfaff <b...@ovn.org>, Numan Siddique <nusi...@redhat.com>, Mark Michelson <mmic...@redhat.com>, Dumitru Ceara <dce...@redhat.com>,
"agin...@ebay.com" <agin...@ebay.com>, sdn-dev <sdn...@exchange.nvidia.com>, "ovn-kub...@googlegroups.com" <ovn-kub...@googlegroups.com>
Subject: Re: At scale SB ovsdb-server 100% CPU bound for 10s of minutes while adding a new data path
Hi Winson/Girish,
I am guessing that the extra flows could be related to neighbour (MAC-binding) flows, so I just sent another patch:
Please apply both patches and test again:
Please try and let me know if you still see extra patch ports or extra flows.
Thanks,
Han
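If MAC-binding flows are indeed the culprit, one quick sanity check is to compare the number of rows in the southbound MAC_Binding table against the extra flow count (a sketch; assumes ovn-sbctl runs on a host that can reach the SB database):

$ ovn-sbctl --columns=_uuid list mac_binding | grep -c _uuid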
On Tue, Feb 18, 2020 at 2:32 PM Han Zhou <hz...@ovn.org> wrote:
>
> Hi Winson/Girish,
>
> I sent a patch to improve the ovn-monitor-all.
>
> https://patchwork.ozlabs.org/project/openvswitch/list/?series=159378
>
> It should not create extra patch ports any more. However, I don't see why it was creating extra OVS flows, since both logical flow processing and port-binding processing check the datapath and ensure that the logical flow/port binding belongs to a local datapath only. Since you have the k8s environment, could you help check which extra flows get installed when ovn-monitor-all is enabled?
>
> Thanks,
> Han
>
> On Thu, Feb 6, 2020 at 10:37 AM Han Zhou <hz...@ovn.org> wrote:
> > Sorry, I haven't had time for it yet. I will probably work on it next week.
> >
> > Thanks,
> > Han
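One way to answer Han's question about which extra flows get installed (a sketch; it relies on ovn-controller deriving each OpenFlow cookie from the UUID of the southbound Logical_Flow that produced the flow, so grouping by cookie shows which logical flows contribute most):

$ ovs-ofctl dump-flows br-int | grep -o 'cookie=0x[0-9a-f]*' | sort | uniq -c | sort -rn | head

A cookie that dominates the count can then be matched against the Logical_Flow table (e.g. with ovn-sbctl lflow-list) to see which logical datapath and stage it belongs to.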
Hi Han,
Thanks a lot for the patch!
I tested it on our k8s cluster with 640 nodes. With your patch:
#1, the extra patch ports on br-local/br-eth0 problem is fixed.
#2, the OpenFlow flow count on br-int dropped from 600k+ to 260k+:
ovs-ofctl dump-aggregate br-int
NXST_AGGREGATE reply (xid=0x4): packet_count=226736 byte_count=16314715 flow_count=260852
Hi Han,
> Hi Winson,
> Thanks for the update. I am glad that it helps. Do you think it is worth backporting to release 20.03?
> Thanks,
> Han
Yes, it is definitely needed for deployments with ovn-monitor-all="true".
Please backport this fix to 20.03.
Thanks,
Winson