[ovs-discuss][OVN] logical flow explosion in lr_in_ip_input table for dnat_and_snat IPs


Girish Moodalbail

Jun 3, 2020, 10:16:52 PM6/3/20
to ovn-kub...@googlegroups.com, ovs-discuss, Han Zhou, Numan Siddique, Dumitru Ceara
Hello all,

While working on an extension (see the diagram below) to the existing OVN logical topology for the ovn-kubernetes project, I am seeing an explosion of "Reply to ARP requests" logical flows in the `lr_in_ip_input` table for the distributed router (ovn_cluster_router) configured with a gateway port (rtol-LS).

                        internet          
               ---------+-------------->  
                        |                  
                        |                                  
      +----------localnet-port---------+  
      |LS                              |  
      +-----------------ltor-LS--------+  
                           |              
                           |              
 +---------------------rtol-LS------------+
 |           ovn_cluster_router           |
 |          (Distributed Router)          |
 +-rtos-ls0------rtos-ls1--------rtos-ls2-+
      |              |              |        
      |              |              |      
+-----+-+       +----+--+     +-----+-+    
|  LS0  |       |  LS1  |     |  LS2  |    
+-+-----+       +-+-----+     +-+-----+        
  |               |             |          
  p0              p1            p2        
 IA0             IA1           IA2        
 EA0             EA1           EA2 
(Node0)          (Node1)       (Node2)

In the topology above, each of the three logical switch ports has an internal address IAx and an external address EAx (a dnat_and_snat IP). They are all bound to their respective nodes (Nodex). A packet from `p0` heading towards the internet will be SNATed to EA0 on the local hypervisor and then sent out through the LS's localnet-port on that hypervisor. In other words, they are configured for distributed NATing.

I am seeing interesting "Reply to ARP requests" flows with arp.tpa set to each EAx. The flows look like this:

For EA0
priority=90, match=(inport == "rtos-ls0" && arp.tpa == EA0 && arp.op == 1), action=(/* ARP reply */)
priority=90, match=(inport == "rtos-ls1" && arp.tpa == EA0 && arp.op == 1), action=(/* ARP reply */)
priority=90, match=(inport == "rtos-ls2" && arp.tpa == EA0 && arp.op == 1), action=(/* ARP reply */)

For EA1
priority=90, match=(inport == "rtos-ls0" && arp.tpa == EA1 && arp.op == 1), action=(/* ARP reply */)
priority=90, match=(inport == "rtos-ls1" && arp.tpa == EA1 && arp.op == 1), action=(/* ARP reply */)
priority=90, match=(inport == "rtos-ls2" && arp.tpa == EA1 && arp.op == 1), action=(/* ARP reply */)

Similarly, for EA2.

So we have N * N "Reply to ARP requests" flows for N nodes, each with one dnat_and_snat IP.
This is causing scale issues.
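To make the scaling concrete, here is a back-of-the-envelope sketch in Python (an illustrative helper, not actual ovn-northd code; the function name and node counts are my own):

```python
# Sketch of the quadratic growth described above: each dnat_and_snat IP
# currently gets one "Reply to ARP requests" flow per logical router port.

def arp_reply_flow_count(num_nodes, nat_ips_per_node=1):
    router_ports = num_nodes                 # one rtos-lsX port per node
    nat_ips = num_nodes * nat_ips_per_node   # one EAx per node here
    return router_ports * nat_ips

print(arp_reply_flow_count(3))     # the 3-node diagram above: 9 flows
print(arp_reply_flow_count(1000))  # a 1000-node cluster: 1000000 flows
```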

If you look at the flows for `EA0`, I am confused as to why they are needed:
  1. When would one see an ARP request for EA0 from any of LS{0,1,2}'s logical switch ports?
  2. If the flows are needed at all, can't we remove the `inport` match altogether, since the flow is configured for every logical router port except the distributed gateway port rtol-LS? For that port, we could add a higher-priority rule with the action set to `next`.
  3. Say we don't need east-west NAT connectivity. Is there a way to have these ARPs learned dynamically, like we are doing for the join and external logical switches (the other thread [1])?
Regards,
~Girish


Han Zhou

Jun 4, 2020, 12:39:20 AM6/4/20
to Girish Moodalbail, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique, Dumitru Ceara
In general, these flows should be per router instead of per router port, since the NAT addresses are not attached to any particular router port. For distributed gateway ports, there will need to be per-port flows matching is_chassis_resident(gateway-chassis). I think this can be handled with:
- priority X + 20 flows for each distributed gateway port with is_chassis_resident(): reply to the ARP
- priority X + 10 flows for each distributed gateway port without is_chassis_resident(): drop
- priority X flows for each router (no need to match inport): reply to the ARP

This way there are N * (2D + 1) flows per router, where N = number of NAT IPs and D = number of distributed gateway ports. This would optimize the above scenario, where there is only 1 distributed gateway port but many regular router ports. Thoughts?
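For a quick sanity check of the N * (2D + 1) figure against the current per-port scheme (illustrative arithmetic only; the function names and cluster sizes are assumptions, not OVN code):

```python
def current_scheme(nat_ips, router_ports):
    # today: one ARP-reply flow per (NAT IP, router port) pair
    return nat_ips * router_ports

def proposed_scheme(nat_ips, gw_ports):
    # per NAT IP: a reply flow and a drop flow per distributed
    # gateway port, plus one per-router flow with no inport match
    return nat_ips * (2 * gw_ports + 1)

# 1000 nodes, one dnat_and_snat IP each, one distributed gateway port
print(current_scheme(1000, 1000))  # 1000000
print(proposed_scheme(1000, 1))    # 3000
```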

Thanks,
Han

Girish Moodalbail

Jun 4, 2020, 1:23:40 AM6/4/20
to Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique, Dumitru Ceara
Han, I think this will work.

Again, thanks for the quick reply.

Regards,
~Girish

Girish Moodalbail

Jun 15, 2020, 1:56:32 PM6/15/20
to Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique, Dumitru Ceara
Hello Han,

We went ahead and added support for this topology in the ovn-kubernetes project in this commit.

Han, I was curious to know whether the above fix is on your radar? Thanks.

The number of OpenFlow flows on each of the hypervisors is insanely high and is consuming a lot of memory.

Regards,
~Girish



 


Han Zhou

Jun 15, 2020, 3:55:51 PM6/15/20
to Girish Moodalbail, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique, Dumitru Ceara
Sorry Girish, I can't promise anything for now. I will see if I have time in the next couple of weeks, but anyone is welcome to volunteer on this if it is urgent.

--
You received this message because you are subscribed to the Google Groups "ovn-kubernetes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ovn-kubernete...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ovn-kubernetes/CAAF2STTOrzx-zy48TKpbxx4yxxQ_X5bN05VPqBHA79gpCBQfwg%40mail.gmail.com.

Girish Moodalbail

Jun 16, 2020, 11:18:49 AM6/16/20
to Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique, Dumitru Ceara
Thanks Han for the update.

Regards,
~Girish 

Dumitru Ceara

Jun 24, 2020, 11:55:40 AM6/24/20
to Girish Moodalbail, Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique
Hi Girish,

I sent a patch series to implement Han's suggestion:
https://patchwork.ozlabs.org/project/openvswitch/list/?series=185580
https://mail.openvswitch.org/pipermail/ovs-dev/2020-June/372005.html

It would be great if you could give it a run on your setup too.

Thanks,
Dumitru

Girish Moodalbail

Jun 24, 2020, 12:01:50 PM6/24/20
to Dumitru Ceara, Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique
Hello Dumitru,

Thank you. We will give it a try and let you know.

Regards,
~Girish

Girish Moodalbail

Jun 25, 2020, 3:35:00 PM6/25/20
to Dumitru Ceara, Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique
Hello Dumitru, Han,

So, we applied this patch set and gave it a spin on our large-scale cluster, and we saw a significant reduction in the number of logical flows in the lr_in_ip_input table. Before the patch there were around 1.6M flows in lr_in_ip_input; after the patch we see about 26K flows. That is a significant reduction in the number of logical flows.

In lr_in_ip_input, I see
  • priority 92 flows matching ARP requests for dnat_and_snat IPs on the distributed gateway port with is_chassis_resident(), with the corresponding ARP reply
  • priority 91 flows matching ARP requests for dnat_and_snat IPs on the distributed gateway port with !is_chassis_resident(), with a drop action
  • priority 90 flows matching ARP requests for dnat_and_snat IPs (no inport match), with the corresponding ARP replies
So far so good. 
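My reading of how those three priorities dispatch, as a toy model (the function and argument names are mine, and this is a simplification, not ovn-northd code):

```python
# Toy model of the ARP-request handling observed after the patch:
# the highest-priority matching flow wins.

def arp_request_action(on_gw_port, gw_resident_here):
    if on_gw_port and gw_resident_here:
        return "reply"   # priority 92: is_chassis_resident() matched
    if on_gw_port:
        return "drop"    # priority 91: gateway port, wrong chassis
    return "reply"       # priority 90: any other router port

print(arp_request_action(True, True))    # reply
print(arp_request_action(True, False))   # drop
print(arp_request_action(False, False))  # reply
```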

However, not directly related to this patch per se, but directly related to the behaviour of ARP and dnat_and_snat IPs: on each OVN chassis we are seeing a significant number of OpenFlow flows in table 27 (around 2.3M). This table is populated from logical flows in table=19 (ls_in_l2_lkup) of the logical switch pipeline.

The two logical flows in ls_in_l2_lkup that contribute the huge number of OpenFlow flows are below (for the entire logical flow entries, please see: https://gist.github.com/girishmg/57b3005030d421c59b30e6c36cfc9c18).

Priority=75 flow 
=============
This flow looks like the one below (where 169.254.0.0/29 is the dnat_and_snat subnet and 192.168.0.1 is the logical switch's gateway IP):

table=19(ls_in_l2_lkup      ), priority=75   , match=(flags[1] == 0 && arp.op == 1 && arp.tpa == { 169.254.3.107, 169.254.1.85, 192.168.0.1, 169.254.10.155, 169.254.1.6}), action=(outport = "stor-sdn-test1"; output;)

What this flow says is: for any ARP request from the switch for the default gateway or any of those 1-to-1 NAT addresses, send it out through the port towards the ovn_cluster_router's ingress pipeline. The question, though, is why any Pod on the logical switch would send an ARP for an IP that is not in its subnet. A packet from a Pod towards a non-subnet IP should trigger an ARP only for the default gateway IP.

Priority=80 Flow
=============
This flow looks like below

table=19(ls_in_l2_lkup      ), priority=80   , match=(eth.src == { 0a:58:c0:a8:00:01, 6a:93:f4:55:aa:a7, ae:92:2d:33:24:ea, ba:0a:d3:7d:bc:e8, b2:2f:40:4d:d9:2b} && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)

The question again for this flow is why there would be self-originated ARP requests for the dnat_and_snat IPs from inside the node's logical switch. I can see how this is a possibility on the switch that has the `localnet` port on it, to which the distributed router connects through a gateway port.
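For what it's worth, the table 27 blow-up is consistent with value-set expansion: as far as I understand, ovn-controller expands a match containing a set like `arp.tpa == {...}` into one OpenFlow flow per element unless conjunctive matching applies. A rough illustration (assumed numbers, not measurements):

```python
# Illustrative arithmetic: each logical switch carries one ls_in_l2_lkup
# flow whose arp.tpa set lists every dnat_and_snat IP in the cluster,
# and the set expands to one OpenFlow flow per IP on every chassis.

def expanded_openflow_flows(num_switches, cluster_wide_nat_ips):
    return num_switches * cluster_wide_nat_ips

# e.g. ~1000 node switches, each matching ~1000 cluster-wide NAT IPs
print(expanded_openflow_flows(1000, 1000))  # 1000000
```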

Regards,
~Girish


Dumitru Ceara

Jun 25, 2020, 3:48:13 PM6/25/20
to Girish Moodalbail, Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique
On 6/25/20 9:34 PM, Girish Moodalbail wrote:
> Hello Dumitru, Han,
>
> So, we applied this patchset and gave it a spin on our large scale
> cluster and saw a significant reduction in the number of logical flows
> in lr_in_ip_input table. Before this patch there were around 1.6M flows
> in lr_in_ip_input table. However, after the patch we see about 26K
> flows. So that is significant reduction in number of logical flows.
>
> In lr_in_ip_input, I see
>
> * priority 92 flows matching ARP requests for dnat_and_snat IPs on
> distributed gateway port with is_chassis_resident() and
> corresponding ARP reply
> * priority 91 flows matching ARP requests for dnat_and_snat IPs on
> distributed gateway port with !is_chassis_resident() and
> corresponding drop
> * priority 90 flow matching ARP request for dnat_and_snat IPs and
> corresponding ARP replies
>
> So far so good.

Hi Girish,

Great, thanks for testing out the series and confirming that it's
working ok.

>
> However, not directly related to this patch per-se but directly related
> to the behaviour of ARP and dnat_and_snat IP, on the OVN chassis we are
> seeing a significant number of OpenFlow flows in table 27 (around 2.3M
> OpenFlow flows). This table gets populated from logical flows in
> table=19 (ls_in_l2_lkup) of logical switch.
>
> The two logical flows in l2_in_l2_lkup that are contributing to huge
> number of OpenFlow flows are: (for the  entire logical flow entry,
> please
> see: https://gist.github.com/girishmg/57b3005030d421c59b30e6c36cfc9c18)
>
> Priority=75 flow 
> =============
> This flow looks like below (where 169.254.0.0/29 <http://169.254.0.0/29>
> is dnat_and_snat subnet and 192.168.0.1 is the logical_switch's gateway IP)
>
> table=19(ls_in_l2_lkup      ), priority=75   , match=(flags[1] == 0 &&
> arp.op == 1 && arp.tpa == { 169.254.3.107, 169.254.1.85, 192.168.0.1,
> 169.254.10.155, 169.254.1.6}), action=(outport = "stor-sdn-test1"; output;)
>
> What this flow says is that any ARP request packet from the switch
> heading towards the default gateway or any of those 1-to-1 nat send it
> out through the port towards  the ovn_cluster_router’s ingress pipeline.
> Question though is why any Pod on the logical switch would send an ARP
> for an IP that is not in its subnet. A packet from a Pod towards a
> non-subnet IP should ARP only for the default gateway IP.
>

This is a bug. I'll start working on a fix and send a patch for it soon.

> Priority=80 Flow
> =============
> This flow looks like below
>
> table=19(ls_in_l2_lkup      ), priority=80   , match=(eth.src == {
> 0a:58:c0:a8:00:01, 6a:93:f4:55:aa:a7, ae:92:2d:33:24:ea,
> ba:0a:d3:7d:bc:e8, b2:2f:40:4d:d9:2b} && (arp.op == 1 || nd_ns)),
> action=(outport = "_MC_flood"; output;)
>
> The question again for this flow is why will there be a self-originated
> arp requests for the dnat_and_snat IPs from inside of the node's logical
> switch. I can see how this is a possibility on the switch that has
> `localnet port` on it and to which the distributed router connects to
> through a gateway port. 
>

This is also a bug, similar to the one above: we should only deal with external_macs that might be used on this port. I'll fix it soon too.

Thanks,
Dumitru

kevin parker

Jun 26, 2020, 3:09:17 AM6/26/20
to ovn-kubernetes
Hello Girish,
Thanks for sharing the details on the number of flows. In a non-k8s environment with recursive DNS traffic we saw a similar pattern, where the flow counts were extremely high, around 700K. Do you mind sharing the memory usage and which component is consuming the most memory? Is it ovs-vswitchd?

Could you also share the kubelet "system-reserved" value?

Dumitru Ceara

Jul 8, 2020, 8:23:31 AM7/8/20
to Girish Moodalbail, Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique
Hi Girish,

I just sent a patch that should fix these two new issues you reported
above. Do you mind giving it a try when you get the chance?

https://patchwork.ozlabs.org/project/openvswitch/patch/1594210824-11382-1-g...@redhat.com/

Thanks,
Dumitru

Girish Moodalbail

Jul 8, 2020, 9:37:19 AM7/8/20
to Dumitru Ceara, Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique
Absolutely. Thank you so much.

Regards,
~Girish
 

Girish Moodalbail

Jul 13, 2020, 8:09:54 PM7/13/20
to Dumitru Ceara, Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique
Hello Han/Dumitru,

We tried the patch on our cluster, and we no longer see the explosion of OpenFlow rules on the integration bridge on each of the OVN chassis nodes.

Also, on the logical switch with the localnet port attached, we still have, as expected, a rule to flood the gratuitous ARPs sent for all SNAT and DNAT external IP addresses.

This looks great. Thank you both.

Regards,
~Girish

--
You received this message because you are subscribed to the Google Groups "ovn-kubernetes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ovn-kubernete...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ovn-kubernetes/7121360b-9f69-52be-70a4-7cb7e2f95eff%40redhat.com.

Dumitru Ceara

Jul 14, 2020, 3:46:40 AM7/14/20
to Girish Moodalbail, Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Numan Siddique
On 7/14/20 2:09 AM, Girish Moodalbail wrote:
> Hello Han/Dumitru,
>
> We tried the patch on our cluster, and we do not see the explosion of
> OpenFlow rules in the Integration Bridge on each of the OVN Chassis nodes.
>
> Also, on the logical switch with localnet port attached, as expected, we
> still have a rule to Flood the Gratuitous ARPs sent for all SNAT and
> DNAT extenal IP addresses.
>
> This looks great. Thank you both.
>
> Regards,
> ~Girish
>

Hi Girish,

Thanks for the confirmation!

Regards,
Dumitru

winson wang

Jul 14, 2020, 2:13:25 PM7/14/20
to kevin parker, ovn-kubernetes
Hi Kevin,


On 6/26/20 12:09 AM, kevin parker wrote:
>
> Hello Girish,
> Thanks for sharing the details on the number of flows.In a non k8s environment with DNS recursive traffic we saw a similar pattern where the flows were extremely high, around 700k.Do you mind sharing the memory usage and the components thats consuming huge memory is it ovs-vswitchd.

With the issues we found, on our 640-node k8s scale test cluster the OpenFlow flow count on each worker node's br-int was around 2.5M.

The components consuming huge amounts of memory are ovs-vswitchd and ovn-controller.

some old logs:

11:29:03 AM   UID       PID  minflt/s  majflt/s     VSZ     RSS  %MEM  Command
11:29:04 AM     0     13353      0.00      0.00 2342548 2303388 28.79  ovs-vswitchd
11:29:04 AM     0      7354      0.00      0.00 2603304 1867748 23.35  ovn-controller
11:29:04 AM     0     13299      0.00      0.00   74376   11016  0.14  ovsdb-server
11:29:04 AM     0     12125      0.00      0.00 1026056  116632  1.46  ovnkube

With all the patches applied and the feature enabled in ovn-k8s, our scale cluster is now at 1006 nodes and the OpenFlow flow count is around 230K.

11:11:03 AM   UID       PID  minflt/s  majflt/s     VSZ     RSS  %MEM  Command
11:11:04 AM     0     25147      0.00      0.00 1707668  234592  2.93  ovs-vswitchd
11:11:04 AM     0     21866      0.00      0.00  819496  492084  6.15  ovn-controller


> Could you also share kubelet "system-reserved" value.

It's a scale test environment; I did not specify "system-reserved" for the kubelet service.

Regards,

Winson


kevin parrikar

Sep 23, 2020, 10:29:20 AM9/23/20
to winson wang, ovn-kubernetes
Thanks Winson for sharing the details