ovn-k8s scale: how to make a new ovn-controller process keep the previous OpenFlow flows in br-int


Winson Wang

Aug 5, 2020, 3:58:59 PM8/5/20
to ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou
Hello OVN Experts,

With ovn-k8s, we need the flows on br-int that the pods running on a k8s node depend on to stay in place at all times.
Is there an ongoing project to address this problem?
If not, I have a proposal, though I am not sure it is doable.
Please share your thoughts.

The issue:

In a large-scale ovn-k8s cluster there are 200K+ OpenFlow flows on br-int on every k8s node. When we restart ovn-controller for an upgrade using `ovs-appctl -t ovn-controller exit --restart`, existing traffic still works fine, since br-int keeps its installed flows.


However, when a new ovn-controller starts, it connects to the OVS IDL and does an engine init run, clearing all OpenFlow flows and installing flows based on the SB DB.

With a flow count above 200K, it took more than 15 seconds to get all the flows installed on the br-int bridge again.


Proposed solution for the issue:

When ovn-controller receives `exit --restart`, it will write an "ovs-cond-seqno" value via the OVS IDL, storing it in the external-ids column of the Open_vSwitch table. When the new ovn-controller starts, it will check whether "ovs-cond-seqno" exists in the Open_vSwitch table, and compare that seqno with the one from the OVS IDL to decide whether to force a recompute.


Test log:

Check the flow count on br-int every second:


packet_count=0 byte_count=0 flow_count=0

packet_count=0 byte_count=0 flow_count=0

packet_count=0 byte_count=0 flow_count=0

packet_count=0 byte_count=0 flow_count=0

packet_count=0 byte_count=0 flow_count=0

packet_count=0 byte_count=0 flow_count=0

packet_count=0 byte_count=0 flow_count=10322

packet_count=0 byte_count=0 flow_count=34220

packet_count=0 byte_count=0 flow_count=60425

packet_count=0 byte_count=0 flow_count=82506

packet_count=0 byte_count=0 flow_count=106771

packet_count=0 byte_count=0 flow_count=131648

packet_count=2 byte_count=120 flow_count=158303

packet_count=29 byte_count=1693 flow_count=185999

packet_count=188 byte_count=12455 flow_count=212764
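For reference, counters like these can be collected by polling `ovs-ofctl dump-aggregate br-int` in a loop (the exact command used for this log isn't stated, so that is an assumption). A minimal sketch, with the parsing helper demonstrated on a canned reply line:

```shell
# Poll aggregate flow stats once per second (requires a live br-int):
#   while sleep 1; do ovs-ofctl dump-aggregate br-int; done

# Helper to pull flow_count out of a dump-aggregate reply line:
parse_flow_count() {
    printf '%s\n' "$1" | grep -o 'flow_count=[0-9]*' | cut -d= -f2
}

# Demonstrate on a canned reply line (xid and prefix are illustrative):
parse_flow_count "NXST_AGGREGATE reply (xid=0x4): packet_count=2 byte_count=120 flow_count=158303"
# prints: 158303
```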




--
Winson

Han Zhou

Aug 5, 2020, 6:35:48 PM8/5/20
to Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou
Hi Winson,

Thanks for the proposal. Yes, the connection break during upgrading is a real issue in a large-scale environment. However, the proposal doesn't work: the "ovs-cond-seqno" belongs to the OVSDB IDL for the local conf DB, which is a completely different connection from the ovs-vswitchd OpenFlow connection.
To avoid clearing the OpenFlow table during ovn-controller startup, we can find a way to postpone clearing the OVS flows until the recompute in ovn-controller has completed, right before ovn-controller replaces them with the new flows. This should largely reduce the connection-break time during upgrading. Some changes in the ofctrl module's state machine are required, but I am not 100% sure this approach is applicable; I need to check more details.
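To make the "postpone clearing" idea concrete, here is a toy sketch in shell; the states, events, and messages are invented for illustration and are not the real ofctrl state machine:

```shell
# Toy state machine: keep the old flows installed until the first full
# recompute finishes, then clear and install in one step. All names here
# are hypothetical, not actual ovn-controller states.
state=WAIT_RECOMPUTE
handle_event() {
    case "$state,$1" in
        WAIT_RECOMPUTE,recompute_done)
            # Only now clear the stale flows and push the new table,
            # shrinking the window during which br-int is empty/partial.
            echo "clear old flows, install new flows"
            state=UPDATE_FLOWS ;;
        WAIT_RECOMPUTE,*)
            # Old flows stay installed; traffic keeps flowing.
            echo "keep old flows (got $1)" ;;
        UPDATE_FLOWS,*)
            echo "incremental update (got $1)" ;;
    esac
}

for ev in sb_update sb_update recompute_done sb_update; do
    handle_event "$ev"
done
```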

Thanks,
Han

Girish Moodalbail

Aug 5, 2020, 7:21:51 PM8/5/20
to Han Zhou, Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou


Thanks Han. Yes, postponing the clearing of OpenFlow flows until all of the logical flows have been translated to OpenFlow flows will reduce the connection downtime. The question, though, is: can we use a 'replace-flows' or 'mod-flows' equivalent, wherein unmodified flows remain intact so that sessions using those flows see no downtime?

Regards,
~Girish

Han Zhou

Aug 5, 2020, 8:36:13 PM8/5/20
to Girish Moodalbail, Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou
I am not sure about "replace-flows". However, I think these are independent optimizations, and postponing the clearing would solve the major part of the problem: I believe currently more than 90% of the time is spent waiting for the computation to finish while the OVS flows are already cleared, rather than on the one-time flow installation. But yes, that could be a further optimization.
 

Girish Moodalbail

Aug 5, 2020, 9:16:43 PM8/5/20
to Han Zhou, Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou
Agree.

Venugopal Iyer

Aug 6, 2020, 11:54:55 AM8/6/20
to Han Zhou, Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou

Hi, Han:

 

A comment inline:

 


[vi> ] Seems like we force a recompute today if the OVS IDL is reconnected. Would it be possible to defer the decision to recompute the flows based on the SB's nb_cfg we have synced with? That is, if our nb_cfg is in sync with the SB's global nb_cfg, can we skip the recompute? At least if nothing has changed since the restart, we won't need to do anything. We could stash nb_cfg in OVS (once ovn-controller receives confirmation from OVS that the physical flows for an nb_cfg update are in place), which should be cleared if OVS itself is restarted. (I mean, currently nb_cfg is used to check whether NB, SB, and Chassis are in sync; could we extend this to OVS/physical flows?)

Have not thought this through, though, so maybe I am missing something…

 

Thanks,

 

-venu

Han Zhou

Aug 6, 2020, 1:13:30 PM8/6/20
to Venugopal Iyer, Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou
nb_cfg is already used by ovn-controller to do that, with the help of the OpenFlow "barrier", but I am not sure it works 100% as expected.

The basic idea should work, but in practice we need to take care of generating the "installed" and "desired" flow tables in ovn-controller.
I'd start with "postpone clearing OVS flows", which seems the lower-hanging fruit, and then see whether any further improvement is needed.

Han Zhou

Aug 6, 2020, 1:22:29 PM8/6/20
to Numan Siddique, Venugopal Iyer, Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou


On Thu, Aug 6, 2020 at 9:15 AM Numan Siddique <num...@ovn.org> wrote:


We can also think about whether it's possible to do it the following way:
   - When ovn-controller starts, it will not clear the flows; instead it will get a dump of the flows from br-int and populate its installed-flows table with them.
   - Then, when it connects to the SB DB and computes the desired flows, it will sync the installed flows with the desired flows as it normally does.
   - If there is no difference between the desired and installed flows, there will be no datapath impact at all.

Although this would require careful thought and proper handling.
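This delta idea can be illustrated with plain text: treat the installed and desired flow sets as sorted lists and compute only the difference. A minimal sketch with made-up flow names (a real implementation would also have to compare actions, cookies, and metadata such as sb_uuid):

```shell
# Canned, sorted stand-ins for a "dump-flows" snapshot (installed) and the
# newly computed desired flows. The flow names are invented for illustration.
inst_f=$(mktemp); des_f=$(mktemp)
printf 'flow_a\nflow_b\nflow_c\n' > "$inst_f"
printf 'flow_b\nflow_c\nflow_d\n' > "$des_f"

# Flows present on br-int but no longer desired -> delete:
comm -23 "$inst_f" "$des_f"    # prints: flow_a
# Flows desired but not yet installed -> add:
comm -13 "$inst_f" "$des_f"    # prints: flow_d
# flow_b and flow_c are left untouched, so traffic using them sees no blip.

rm -f "$inst_f" "$des_f"
```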

Numan, as I responded to Girish, this avoids the time spent on the one-time flow installation after restart (the <10% part of the connection-break time), but I think the major problem currently is that >90% of the time is spent waiting for the computation to finish while the OVS flows are already cleared. It is certainly an optimization, but the most important thing now is to avoid that 90%. I will look at postponing the flow clearing first.
 


Han Zhou

Aug 6, 2020, 6:18:21 PM8/6/20
to Venugopal Iyer, Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou
(resend using my gmail so that it can reach the ovn-kubernetes group.)

I thought about it again, and it seems the idea of remembering nb_cfg doesn't work for the upgrade scenario. Even if nb_cfg is the same and we are sure the flows installed in OVS reflect that nb_cfg version, we cannot conclude that the OVS flows need no changes, because the new version of ovn-controller may translate the same SB data into different OVS flows. So clearing the flow table is still the right thing to do for an upgrade. (Syncing OVS flows back from ovs-vswitchd to ovn-controller could avoid clearing the whole table, but that's a different approach, as mentioned by Numan, and nb_cfg is not helpful there anyway.)

Thanks,
Han

Han Zhou

Aug 7, 2020, 2:46:21 PM8/7/20
to Numan Siddique, Venugopal Iyer, Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou
I thought about this again. It seems more complicated than it appeared; let me summarize:

The connection-break time during upgrading consists of two parts:
1) The time gap between clearing the flows and the start of installation of the fully computed flows, i.e. waiting for flow installation.
2) The time spent on the flow installation itself, which takes several rounds of the ovn-controller main loop. (I take back my earlier statement that this contributes only 10% of the total time. According to the log shared by Girish, it seems at least half of the time is spent here.)

For 1), postponing the flow clearing is the solution, but it is not as easy as I thought, because there is no easy way to determine whether ovn-controller has completed its initial computation.
When ovn-controller starts, it initializes the IDL connections to the SB and local OVS DBs and sends the initial monitor conditions to the SB DB. It may take several rounds of receiving SB notifications, updating monitor conditions, and computing before all required flows are generated. If we replace the flows in OVS before this is fully complete, we end up with the same problem. I can't think of an ideal, clean approach to solve this. However, a "not so good" solution could be to support an ovn-controller option that delays the clearing of OVS flows. It would then be the operator's job to figure out the best delay for the scale of their environment, to reduce the gap spent waiting for the new flow installation. This is not ideal, but I think it should be helpful for upgrading large-scale environments in practice. Thoughts?

For 2), Numan's suggestion of syncing back OVS flows before flow installation and installing only the delta (without clearing the flows) seems to be the perfect solution. However, there are some tricky parts that need to be considered:
1. Apart from OVS flows, the meter and group tables also need to be restored.
2. The installed flows in ovn-controller require metadata that is not available from OVS, such as sb_uuid.
3. The syncing itself may add significant cost and further delay the initialization.

Alternatively, for 2), I think we can probably utilize the "bundle" operation of OpenFlow to replace the flows in OVS atomically (on the ovs-vswitchd side), which should avoid the long connection break. I am not sure yet which approach is more applicable.

I'd also like to emphasize that even though the solution for 2) doesn't clear flows, it doesn't automatically avoid problem 1), because we would still need to figure out when the major flow computation is complete and ready to be installed/synced to OVS. Otherwise, we could replace the old, huge flow table with a small number of incomplete flows, which still results in the same connection break.

Thanks,
Han

Han Zhou

Aug 7, 2020, 4:04:03 PM8/7/20
to Numan Siddique, Venugopal Iyer, Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou


On Fri, Aug 7, 2020 at 12:35 PM Numan Siddique <num...@ovn.org> wrote:


I have another suggestion to handle this issue during upgrade.

Let's say br-int has ports p1, p2, ..., p10, which correspond to the logical ports p1, p2, ..., p10.

Then the following can be done

1. Create a temporary bridge - br-temp
    ovs-vsctl add-br br-temp

2. Create the ports p1, p2, ..., p10 in br-temp with different names but with external_ids:iface-id set properly.
   E.g.
   ovs-vsctl add-port br-temp temp-p1 -- set interface temp-p1 type=internal -- set interface temp-p1 external_ids:iface-id=p1
   ..
   ..
   ovs-vsctl add-port br-temp temp-p10 -- set interface temp-p10 type=internal -- set interface temp-p10 external_ids:iface-id=p10

    (I think this can be easily scripted)

3. Just before restarting ovn-controller, run: ovs-vsctl set open . external_ids:ovn-bridge=br-temp

4. Restart ovn-controller after upgrading

5. Wait until ovn-controller connects to the SB ovsdb-server and all the flows appear in br-temp

6. Switch back to the ovn bridge to br-int - ovs-vsctl set open . external_ids:ovn-bridge=br-int

7. Delete br-temp - ovs-vsctl del-br br-temp

Up to step 5 there should be no datapath impact, as br-int is untouched and all its flows remain. There could be some downtime after step 6, as ovn-controller may delete all the flows in br-int and re-add them, but the duration should be shorter.

Please note I have not tested this myself; it is worth trying in a small environment before attempting it on an actual deployment.

You could skip step 2, but if ovn-monitor-all is false you would still see some delay due to conditional monitoring.

This is entirely under the operator/admin's control, and no ovn-controller changes are needed. We can still work on approach (2) and handle all the tricky parts mentioned by Han, but that may take time.

Any thoughts on this? We used a similar approach in a migration script I worked on to migrate an existing OpenStack deployment from ML2/OVS to ML2/OVN.
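Step 2 can indeed be scripted; a hedged sketch that only prints the ovs-vsctl commands (so they can be reviewed, or piped to `sh`, before touching a real host), following the naming in the example above:

```shell
# Emit the step-2 "add-port" commands for ports p1..pN on br-temp.
# Printing instead of executing lets you inspect the commands safely first.
gen_temp_ports() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        echo "ovs-vsctl add-port br-temp temp-p$i" \
             "-- set interface temp-p$i type=internal" \
             "-- set interface temp-p$i external_ids:iface-id=p$i"
        i=$((i + 1))
    done
}

gen_temp_ports 10
```

Run against a real host with `gen_temp_ports 10 | sh` once the output looks right.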


Thanks Numan. This is interesting; I hadn't thought of this kind of approach before. It is an alternative solution to problem 1): it avoids the connection break while waiting for the flow computation. However, it has the same problem as the "delay time" option I proposed: it requires the operator's judgment about when the flow computation is complete (step 5). And the steps seem too heavy for a regular upgrade operation (though they sound reasonable for a migration). Would a "delay time" option simplify the operation, if it can be easily implemented? I could try that.

And yes, problem 2) still needs to be addressed (step 6). Any thoughts on the OpenFlow "bundle"?

Thanks,
Han



Venugopal Iyer

Aug 7, 2020, 4:50:41 PM8/7/20
to Han Zhou, Numan Siddique, Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou

Hi, Han:

[vi> ] So, if I understand it, ovn-controller completes all its iterations and then finally does a bundle (replace-flows)? That sounds good to me; since it is replace-flows, we don't flush, so we don't need any explicit delay time?

 

Thanks,

 

-venu

Venugopal Iyer

Aug 7, 2020, 4:56:15 PM8/7/20
to Venugopal Iyer, Han Zhou, Numan Siddique, Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou

Hi, Han:

 

An additional comment:

[vi> ] Though ovn-controller also needs to handle the case where the bundle op fails, since that will revert all the flows to what they were before?

 

-venu

 


Han Zhou

Aug 7, 2020, 5:35:29 PM8/7/20
to Venugopal Iyer, Numan Siddique, Winson Wang, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Dumitru Ceara, Han Zhou
We will still need the delay time to solve problem 1). As stated earlier, solving problem 2) alone wouldn't help, because we would end up replacing the flows too early, before ovn-controller completes all the required iterations. For example, we might replace the 200K+ flows with 100 flows while the rest are still being computed as new SB notifications arrive. We still need the delay time to make sure ovn-controller does not perform the bundle operation too early.

 



OpenFlow transaction failures should be handled properly. I think this should not be very different from how we handle a regular flow-add failure, although there might be some specific handling here, since the bundle operation should happen only once.

Thanks,
Han

 

 