At scale SB ovsdb-server 100% CPU bound for 10s of minutes while adding a new data path

Girish Moodalbail

Jan 10, 2020, 2:47:50 PM1/10/20
to Dan Williams, Ben Pfaff, Numan Siddique, Mark Michelson, Dumitru Ceara, hz...@ovn.org, agin...@ebay.com, sdn-dev, ovn-kub...@googlegroups.com
Hello all,

Apologies for reaching out directly. We think this issue is serious enough for the ovn-kubernetes project that we wanted to bring it to everyone's attention ASAP. We are doing scale testing, and in our 600-node cluster we see the OVN SB ovsdb-server at 100% CPU when we add a new worker node to the k8s cluster.

The rough topology for the ovn-kubernetes project is shown below:

    .-----.    .-----.   .-----.                   .-----.
   ( l3GW1 )  ( l3GW2 ) ( l3GW3 )    ........     (l3GW600)
    `-+---'    `-+---'   `--+--'                   `-+---'
      |          |          |                        |
      |          |          |                        |
      |          |          |                        |
 +----+----------+----------+------------------------+--------------+
 |                       Join (100.64.0.0/16)                       |
 +-------------------------------+----------------------------------+
                                 |
 .-------------------------------+---------------------------------.
(                 ovn_cluster_router (distributed router)           )
 `----+----------+----------+-------------------------+------------'
      |          |          |                         |
      |          |          |                         |
    +-+-+      +-+-+      +-+-+                    +--+--+
    |LS1|      |LS2|      |LS3|        ........    |LS600|
    +---+      +---+      +---+                    +-----+


In the topology above, one L3GW is pinned to each node. Each LSX logical switch is effectively pinned to a node by having all of its logical ports bound to that node (even though the switch itself is distributed). A distributed router, ovn_cluster_router, provides east-west connectivity.

Now say we add a 601st logical switch to the topology above (mimicking a new node joining the k8s cluster):

(1) ovn-nbctl lrp-add ovn_cluster_router rtos-node1 00:00:00:55:F9:4B 192.169.0.1/24
(2) ovn-nbctl ls-add node1 -- set logical_switch node1 other-config:subnet=192.169.0.0/24 \
        other-config:exclude_ips=192.169.0.2 external-ids:gateway_ip=192.169.0.1/24
(3) ovn-nbctl lsp-add node1 stor-node1 -- set logical_switch_port stor-node1 type=router \
        options:router-port=rtos-node1 addresses="00:00:00:55:F9:4B"
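
(As a quick sanity check -- these commands are only illustrative, reusing the names from the example above -- the new switch and its router leg can be verified with:)

$ ovn-nbctl show node1                          # new switch with stor-node1 attached
$ ovn-nbctl lrp-list ovn_cluster_router | grep rtos-node1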

We see that the SB ovsdb-server is 100% CPU bound for about 20 minutes. Note that we observe this behavior at lower node counts as well.

This is what we think is happening in the single-threaded ovsdb-server. After step (3) above, each node sends a monitor_cond_change request to the ovsdb-server to include any changes to the new datapath for the monitored tables (DNS, IP_Multicast, Logical_Flow, MAC_Binding, Multicast_Group, Port_Binding). The monitor_cond_change payload covers not just the new datapath but all 601 datapaths. It is a huge payload (around 2 MB of data). Please see here: https://drive.google.com/file/d/112bOGKIDyEsi1pBknwT2Vee8SuQ5JyQV/view?usp=sharing
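
(For readers without access to the capture, the request has roughly the shape sketched below. This is hand-written, not taken from the cluster; the monitor ID and UUIDs are placeholders. The point is that the full condition -- one clause per datapath -- is resent for every monitored table:)

{"id": 42, "method": "monitor_cond_change",
 "params": ["monid", "monid",
   {"Port_Binding": [{"where": [
       ["datapath", "==", ["uuid", "aaaaaaaa-0000-0000-0000-000000000001"]],
       ["datapath", "==", ["uuid", "aaaaaaaa-0000-0000-0000-000000000002"]],
       ...601 clauses in all...]}],
    "Logical_Flow": [{"where": [
       ["logical_datapath", "==", ["uuid", "aaaaaaaa-0000-0000-0000-000000000001"]],
       ...likewise for the remaining monitored tables...]}]}]}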

Now, ovsdb-server receives a monitor_cond_change with that huge payload from each of the 600 nodes, and it is 100% busy doing string operations:
$ ltrace -s1024 -f -p 11227 -c
^C% time     seconds  usecs/call     calls function
------ ----------- ----------- --------- --------------------
 39.00    5.406449          62     86907 strlen
 19.51    2.703908          62     43415 strncmp
 16.29    2.258175          62     36032 memcpy
 14.79    2.050832          59     34235 strcmp
  5.59    0.774294          59     12935 free
  2.15    0.298712          56      5279 malloc
  1.60    0.221739          60      3671 memset
  1.06    0.147539          57      2582 calloc
  0.00    0.000386          64         6 vsnprintf
  0.00    0.000121          60         2 snprintf
------ ----------- ----------- --------- --------------------
100.00   13.862155                225064 total

And the hot stack is in ovsdb_jsonrpc_monitor_cond_change() and the functions it calls:

11227 0.003651 strcmp("etor-GR_node40 ", "breth0_node39") = 3 <0.000159>
> exe(ovsdb_atom_compare_3way+0xd5) [44625d]
lib/ovsdb-data.c:249:16
> exe(ovsdb_clause_evaluate+0x167) [420e07]
ovsdb/condition.c:357:19
> exe(ovsdb_condition_match_any_clause+0x70) [4211e6]
ovsdb/condition.c:470:13
> exe(ovsdb_condition_empty_or_match_any+0x3a) [41024e]
ovsdb/condition.h:75:13
> exe(ovsdb_monitor_row_update_type_condition+0x94) [411ade]
ovsdb/monitor.c:841:15
> exe(ovsdb_monitor_compose_row_update2+0x7f) [411fe5]
ovsdb/monitor.c:1014:12
> exe(ovsdb_monitor_compose_cond_change_update+0x10c) [4126a1]
ovsdb/monitor.c:1171:24
> exe(ovsdb_monitor_get_update+0x1b0) [412931]
ovsdb/monitor.c:1242:21
> exe(ovsdb_jsonrpc_monitor_cond_change+0x3eb) [40de47]
ovsdb/jsonrpc-server.c:1618:19
> exe(ovsdb_jsonrpc_session_got_request+0x1be) [40c3f0]
ovsdb/jsonrpc-server.c:1006:17
> exe(ovsdb_jsonrpc_session_run+0xf6) [40b36b]
ovsdb/jsonrpc-server.c:556:17
> exe(ovsdb_jsonrpc_session_run_all+0x2d) [40b4c7]
ovsdb/jsonrpc-server.c:586:21
> exe(ovsdb_jsonrpc_server_run+0x136) [40af2b]
ovsdb/jsonrpc-server.c:401:9
> exe(main_loop+0x1a0) [404a41]
ovsdb/ovsdb-server.c:209:9
> exe(main+0x7c9) [40559a]
ovsdb/ovsdb-server.c:460:5
> libc.so.6(__libc_start_main+0xf2) [7f859fe70812]
> exe(_start+0x2d) [40456d]

We wanted to understand why monitor_cond_change sends such a huge payload. Shouldn't it just send the new datapath's UUID to the ovsdb-server?

Regards,
~Girish





Girish Moodalbail

Jan 10, 2020, 4:29:33 PM1/10/20
to Han Zhou, Dan Williams, Ben Pfaff, Numan Siddique, Mark Michelson, Dumitru Ceara, agin...@ebay.com, sdn-dev, ovn-kub...@googlegroups.com

Hello Han,

Thank you for the quick reply.

What you say makes sense as a short-term solution. The row filtering might not be that helpful for the topology we have in OVN K8s, so disabling it shouldn't cost us anything.

If disabling conditional monitoring is a quick fix, please send us the patch and we will test it against the OVN K8s topology at scale.

Thanks once again.

Regards,
~Girish

From: Han Zhou <hz...@ovn.org>
Date: Friday, January 10, 2020 at 12:23 PM
To: Girish Moodalbail <gmood...@nvidia.com>
Cc: Dan Williams <dc...@redhat.com>, Ben Pfaff <b...@ovn.org>, Numan Siddique <nusi...@redhat.com>, Mark Michelson <mmic...@redhat.com>, Dumitru Ceara <dce...@redhat.com>, "hz...@ovn.org" <hz...@ovn.org>, "agin...@ebay.com" <agin...@ebay.com>, sdn-dev <sdn...@exchange.nvidia.com>, "ovn-kub...@googlegroups.com" <ovn-kub...@googlegroups.com>
Subject: Re: At scale SB ovsdb-server 100% CPU bound for 10s of minutes while adding a new data path

Hi Girish,

Thanks for reporting the findings!

It is straightforward for both the client side and the server side to always send all conditions - it is the declarative way, which is easier to implement and makes correctness easier to ensure. However, as you noticed, it is less efficient. To achieve what you proposed (i.e. sending only the changed conditions) would require another piece of incremental processing - this time for monitor conditions - which is a big change.

We haven't encountered this problem yet, even though we have more nodes, probably because we don't use a per-node logical switch. We have far fewer logical switches, which results in much smaller monitor conditions. We might hit the same problem if we had more logical switches.

I have a different proposal to tackle this issue, at least for your scenario. In k8s (and many other use cases), every workload is supposed to be reachable from every other workload, which means each node needs the full-mesh data of the topology. In this case conditional monitoring doesn't help much: it doesn't reduce the amount of data each node has to monitor, but it introduces a lot of cost:

1. The condition filtering in ovsdb-server for each transaction is heavy.

2. The monitor cache on the server side cannot be shared across clients.

3. The monitor condition update can be heavy (as brought up by this topic).

Because of this, we may provide an option for ovn-controller to disable conditional monitoring - monitoring only the tables and columns it needs, with no row-level filtering. It would be a small change. Does this sound reasonable?

Thanks,
Han
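
(For reference: this option did land in OVN 20.03 as a per-chassis setting, read by ovn-controller from the local Open_vSwitch table. A minimal example of enabling it on a node:)

$ ovs-vsctl set open_vswitch . external-ids:ovn-monitor-all=true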



Phil Cameron

Jan 10, 2020, 4:50:38 PM1/10/20
to ovn-kub...@googlegroups.com
FYI, we are evaluating switch per namespace (New Big Deal...). There could be 10,000 namespaces and switches.

 



Mark Michelson

Jan 13, 2020, 8:12:36 AM1/13/20
to Han Zhou, Girish Moodalbail, Dan Williams, Ben Pfaff, Numan Siddique, Dumitru Ceara, agin...@ebay.com, sdn-dev, ovn-kub...@googlegroups.com
On 1/10/20 3:22 PM, Han Zhou wrote:
> [...]
> Hi Girish,
>
> Thanks for reporting the findings!
> It is straightforward for both client side and server side to always
> send all conditions - it is the declarative way, which is easier to
> implement and ensures correctness. However, as you noticed, it is less
> efficient.
> To achieve what you proposed (i.e. sending only the changed conditions)
> would be another piece of incremental processing - for monitor
> conditions - which is a big change.

Hi Han.

Out of curiosity, is the problem here with the OVSDB protocol itself, or
just with what is being done in ovn-controller?

In other words, if ovn-controller were modified to send
monitor_cond_change messages that specified only the new datapaths,
would that result in the expected behavior? Or would ovsdb-server
interpret the message to mean that we are now interested only in changes
to the new datapath and no longer in changes to the previously monitored
datapaths?


Girish Moodalbail

Jan 13, 2020, 6:32:44 PM1/13/20
to Han Zhou, Dan Williams, Ben Pfaff, Numan Siddique, Mark Michelson, Dumitru Ceara, agin...@ebay.com, sdn-dev, ovn-kub...@googlegroups.com

Hello Han,

Thanks! We will give it a try and let you know ASAP.

Regards,
~Girish

From: Han Zhou <hz...@ovn.org>
Date: Monday, January 13, 2020 at 3:24 PM
To: Girish Moodalbail <gmood...@nvidia.com>
Cc: Han Zhou <hz...@ovn.org>, Dan Williams <dc...@redhat.com>, Ben Pfaff <b...@ovn.org>, Numan Siddique <nusi...@redhat.com>, Mark Michelson <mmic...@redhat.com>, Dumitru Ceara <dce...@redhat.com>, "agin...@ebay.com" <agin...@ebay.com>, sdn-dev <sdn...@exchange.nvidia.com>, "ovn-kub...@googlegroups.com" <ovn-kub...@googlegroups.com>
Subject: Re: At scale SB ovsdb-server 100% CPU bound for 10s of minutes while adding a new data path

 

External email: Use caution opening links or attachments

 

Hi Girish,

 

Somehow patchwork didn't show the patch 2/3 as part of the series:  https://patchwork.ozlabs.org/patch/1222380/

 

Please let me know if it works for you.

 

Thanks,

Han

Girish Moodalbail

Jan 15, 2020, 7:54:44 PM1/15/20
to Han Zhou, Dan Williams, Ben Pfaff, Numan Siddique, Mark Michelson, Dumitru Ceara, agin...@ebay.com, sdn-dev, ovn-kub...@googlegroups.com

Hello Han,

Once again, thank you for the patches. With these patches the SB ovsdb-server is no longer 100% CPU bound as it was before.

However, I am seeing a different issue on the ovn-controller side. In the OVN K8s logical topology, we have one L3 gateway router per node, so in the 600-node cluster we have 600 L3 gateways. Each L3 gateway is connected to the physical world through a logical switch with a localnet port on it.

Before your patch, every node had one pair of patch ports connecting the integration bridge and the physical bridge. ovn-controller creates these patch ports. From the ovn-controller man page:

 

       external_ids:ovn-localnet-port in the Port table
              The presence of this key identifies a patch port as one
              created by ovn-controller to connect the integration
              bridge and another bridge to implement a localnet
              logical port. Its value is the name of the logical port
              with type set to localnet that the port implements. See
              external_ids:ovn-bridge-mappings, above, for more
              information.

              Each localnet logical port is implemented as a pair of
              patch ports, one in the integration bridge, one in a
              different bridge, with the same
              external_ids:ovn-localnet-port value.
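
(For context, a minimal localnet wiring looks roughly like this; physnet, breth0, ext-node1 and ln-node1 are illustrative names, not taken from our cluster:)

# Map the physical network name to the provider bridge on the chassis:
$ ovs-vsctl set open_vswitch . external-ids:ovn-bridge-mappings=physnet:breth0

# Attach the external logical switch to that network via a localnet port:
$ ovn-nbctl lsp-add ext-node1 ln-node1
$ ovn-nbctl lsp-set-type ln-node1 localnet
$ ovn-nbctl lsp-set-addresses ln-node1 unknown
$ ovn-nbctl lsp-set-options ln-node1 network_name=physnet

# ovn-controller then creates the patch-port pair between br-int and breth0.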

 

 

With your patch, since each node now gets all rows from the tables it is interested in, I see 600 pairs of patch ports on every node:

$ ovs-vsctl list-ports br-int | grep patch-br-int-to-breth0 | wc -l
600

$ ovs-vsctl list-ports breth0 | grep patch | wc -l
644

This is definitely not correct. Since the L3GW is pinned to a chassis, the localnet port that connects the gateway to the physical network also belongs to that chassis, so we should see only one pair of patch ports. Let me know if you need more information.

Regards,
~Girish

 

 


Girish Moodalbail

Jan 16, 2020, 12:15:00 PM1/16/20
to Han Zhou, Dan Williams, Ben Pfaff, Numan Siddique, Mark Michelson, Dumitru Ceara, agin...@ebay.com, sdn-dev, ovn-kub...@googlegroups.com

Hello Han,

Thanks for the explanation. That was our understanding as well. Yes, we will need optimizations in ovn-controller when *monitor-all* is set (especially for the Logical_Flow table). On our setup we are seeing close to 600K OpenFlow rules on *br-int* after the patch; before, it was only 60K flows.

Let me try the workaround you suggest below and see how much it reduces the port count as well as the OpenFlow rule count.

Thanks,
~Girish

From: Han Zhou <hz...@ovn.org>
Date: Wednesday, January 15, 2020 at 11:46 PM
To: Girish Moodalbail <gmood...@nvidia.com>
Cc: Han Zhou <hz...@ovn.org>, Dan Williams <dc...@redhat.com>, Ben Pfaff <b...@ovn.org>, Numan Siddique <nusi...@redhat.com>, Mark Michelson <mmic...@redhat.com>, Dumitru Ceara <dce...@redhat.com>, "agin...@ebay.com" <agin...@ebay.com>, sdn-dev <sdn...@exchange.nvidia.com>, "ovn-kub...@googlegroups.com" <ovn-kub...@googlegroups.com>
Subject: Re: At scale SB ovsdb-server 100% CPU bound for 10s of minutes while adding a new data path

Hi Girish,

The reason it creates 600 pairs of patch ports is that all port bindings are now monitored and processed by patch_run() in ovn-controller. Previously, the datapaths connected by gateway-router ports were not regarded as local datapaths and were not monitored.

We can add an *optimization* in ovn-controller so that only port bindings residing on local datapaths are processed (and probably the same optimization for logical-flow processing). Note: this optimization is needed only for the case where "ovn-monitor-all" is set to true. I will work on a new patch for this.

A workaround (until that optimization is done) is, in ovn-k8s, not to use the same network_name for the localnet ports on different nodes. Instead, use a node-specific name in options:network_name of each localnet port, and use it in the bridge-mappings in the OVS settings as well.

Thanks,
Han
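
(A sketch of that workaround with hypothetical names -- ln-node1, physnet_node1 and breth0 are placeholders; a later message in this thread shows a real mapping, e.g. physnet_Node0:brenp1s0:)

# On node1, give its localnet port a node-unique network name:
$ ovn-nbctl lsp-set-options ln-node1 network_name=physnet_node1

# Map only that network in node1's OVS, so localnet ports of other
# nodes do not match any bridge mapping on this chassis:
$ ovs-vsctl set open_vswitch . external-ids:ovn-bridge-mappings=physnet_node1:breth0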

winson wang

Feb 6, 2020, 1:16:55 PM2/6/20
to Han Zhou, Girish Moodalbail, Dan Williams, Ben Pfaff, Numan Siddique, Mark Michelson, Dumitru Ceara, agin...@ebay.com, sdn-dev, ovn-kub...@googlegroups.com

Hi Han,


On 1/16/20 9:14 AM, Girish Moodalbail wrote:
> [... earlier discussion trimmed; quoting Han's note:]
> We can add an *optimization* in ovn-controller so that only port
> bindings residing on local datapaths are processed. Note: this
> optimization is needed only when "ovn-monitor-all" is set to true.
> I will work on a new patch for this.

Please let me know when you have the patch ready for testing.

We tried the workaround on our 600+ node cluster:

ovn-bridge-mappings="physnet_Node0:brenp1s0,locnet_Node0:br-local"

# ovs-vsctl list-ports br-int | grep patch-br-int-to-breth0 | wc -l
1

# ovs-vsctl list-ports breth0 | grep patch | wc -l
1

The flow count on br-int is still around 600K, though:

# ovs-ofctl dump-aggregate br-int
NXST_AGGREGATE reply (xid=0x4): packet_count=1130981 byte_count=106970768 flow_count=611249

Regards,
Zhen

Girish Moodalbail

Feb 18, 2020, 6:40:12 PM2/18/20
to Han Zhou, Zhen Wang (SW-CLOUD), Dan Williams, Ben Pfaff, Numan Siddique, Mark Michelson, Dumitru Ceara, agin...@ebay.com, sdn-dev, ovn-kub...@googlegroups.com

Thanks Han,

We will try it and let you know.

Regards,
~Girish

From: Han Zhou <hz...@ovn.org>
Date: Tuesday, February 18, 2020 at 3:38 PM
To: Han Zhou <hz...@ovn.org>
Cc: "Zhen Wang (SW-CLOUD)" <zhe...@nvidia.com>, Girish Moodalbail <gmood...@nvidia.com>, Dan Williams <dc...@redhat.com>, Ben Pfaff <b...@ovn.org>, Numan Siddique <nusi...@redhat.com>, Mark Michelson <mmic...@redhat.com>, Dumitru Ceara <dce...@redhat.com>, "agin...@ebay.com" <agin...@ebay.com>, sdn-dev <sdn...@exchange.nvidia.com>, "ovn-kub...@googlegroups.com" <ovn-kub...@googlegroups.com>
Subject: Re: At scale SB ovsdb-server 100% CPU bound for 10s of minutes while adding a new data path

Hi Winson/Girish,

I am guessing that the extra flows could be related to neighbour (mac-binding) flows, so I just sent another patch.

Please apply both patches, test again, and let me know if you still see extra patch ports or extra flows.

Thanks,
Han


On Tue, Feb 18, 2020 at 2:32 PM Han Zhou <hz...@ovn.org> wrote:
>
> Hi Winson/Girish,
>
> I sent a patch to improve the ovn-monitor-all.
> https://patchwork.ozlabs.org/project/openvswitch/list/?series=159378
>
> It should not create extra patch ports any more. However, I didn't see why it was creating extra OVS flows, since in both logical-flow processing and port-binding processing the datapath is checked to ensure that the logical flow/port binding belongs to a local datapath only. Since you have the k8s environment, could you help check which extra flows get installed when ovn-monitor-all is enabled?
>
> Thanks,
> Han
>
> On Thu, Feb 6, 2020 at 10:37 AM Han Zhou <hz...@ovn.org> wrote:

> > Sorry, I haven't got time on it yet. I will work on it probably next week.
> >
> > Thanks,
> > Han

winson wang

Feb 20, 2020, 5:35:12 PM2/20/20
to Han Zhou, Girish Moodalbail, Dan Williams, Ben Pfaff, Numan Siddique, Mark Michelson, Dumitru Ceara, agin...@ebay.com, ovn-kub...@googlegroups.com

Hi Han,

Thanks a lot for the patch!

I tested it on our k8s cluster with 640 nodes. With your patch:

#1: The extra-patch-ports problem on br-local/breth0 is fixed.

#2: The OpenFlow flow count on br-int dropped from 600K+ to 260K+:

ovs-ofctl dump-aggregate br-int
NXST_AGGREGATE reply (xid=0x4): packet_count=226736 byte_count=16314715 flow_count=260852

ovn-controller CPU hot time is also reduced a lot with the patch.

Regards,
Winson

On 2/18/20 3:36 PM, Han Zhou wrote:
> [...]

winson wang

Feb 20, 2020, 6:55:16 PM2/20/20
to Han Zhou, Girish Moodalbail, Dan Williams, Ben Pfaff, Numan Siddique, Mark Michelson, Dumitru Ceara, agin...@ebay.com, ovn-kub...@googlegroups.com

Hi Han,

On 2/20/20 3:39 PM, Han Zhou wrote:
> Hi Winson,
>
> Thanks for the update. I am glad that it helps.
> Do you think it is worth backporting to release 20.03?

Yes, it is definitely needed for deployments with ovn-monitor-all="true". Please backport this fix to 20.03.

Thanks,
Winson

Dumitru Ceara

Feb 23, 2020, 6:24:02 AM2/23/20
to Girish Moodalbail, winson wang, Han Zhou, Dan Williams, Ben Pfaff, Numan Siddique, Mark Michelson, agin...@ebay.com, ovn-kub...@googlegroups.com
Hi Girish,

I went ahead and opened PR
https://github.com/ovn-org/ovn-kubernetes/pull/1089 to enable
ovn-monitor-all, now that Han's follow-up patches have been merged on
branch-20.03.

We had also identified OVSDB SB server row-level filtering as a scale
issue in our own internal OVN-only scale-testing environment, and
enabling ovn-monitor-all is an effective workaround until we address
conditional-monitoring scalability.

From a backwards-compatibility perspective, unconditionally enabling
ovn-monitor-all is not an issue because it has no effect on older OVN
versions (< OVN 20.03).

Regards,
Dumitru

On 2/21/20 12:55 AM, winson wang wrote:
> [...]

Girish Moodalbail

Feb 23, 2020, 10:21:27 AM2/23/20
to Dumitru Ceara, Zhen Wang (SW-CLOUD), Han Zhou, Dan Williams, Ben Pfaff, Numan Siddique, Mark Michelson, agin...@ebay.com, ovn-kub...@googlegroups.com
Thanks Dumitru. Your changes LGTM, and Dan Williams beat me to merging your PR by 16 seconds :)

Regards,
~Girish


Dumitru Ceara

Feb 24, 2020, 4:54:12 AM2/24/20
to Girish Moodalbail, Zhen Wang (SW-CLOUD), Han Zhou, Dan Williams, Ben Pfaff, Numan Siddique, Mark Michelson, agin...@ebay.com, ovn-kub...@googlegroups.com
On 2/23/20 4:21 PM, Girish Moodalbail wrote:
> Thanks Dumitru. Your changes LGTM, and Dan Williams beat me to merging your PR by 16 seconds :)
>
> Regards,
> ~Girish

Thanks Girish :)

Regards,
Dumitru