[ovs-discuss][OVN] flow explosion in lr_in_arp_resolve table


Girish Moodalbail

May 1, 2020, 11:20:29 AM
to dis...@openvswitch.org, ovn-kub...@googlegroups.com
Hello all,

Say the logical topology is as defined below. We have a logical router connected to 1000 gateway routers through a join switch. This is a 1000-hypervisor OVN k8s cluster in which each gateway router is bound to its respective hypervisor.

+-----------+ +-----------+        +-----------+ +-----------+
|  Gateway  | |  Gateway  | .....  |  Gateway  | |  Gateway  |
| Router-1  | | Router-2  |        |Router-999 | |Router-1000|
+-----+-----+ +-----+-----+        +-----+-----+ +---+-------+
 100.64.0.2    100.64.0.3                |        100.64.x.y  
      |             |                    |           |        
+-----+-------------+--------------------+-----------+------+
|                    join logical_switch                    |
|                      (100.64.0.0/16)                      |
+----------------------------+------------------------------+
                             |                                
                             |                                
                        100.64.0.1                            
            +----------------+---------------+                
            |     logical router             |                
            +--------------------------------+    

If we now look at table=12 (lr_in_arp_resolve) in the ingress pipeline of Gateway Router-1, we will see 2000 logical flow entries (1000 for IPv4 and 1000 for IPv6), each of which resolves a next-hop IP address (corresponding to one of the 1000 routers) to its destination Ethernet address. For example, on Gateway Router-1:
(outport == rtoj-gr1 && reg0 == 100.64.0.3), action=(eth.dst=m3; next;)
(outport == rtoj-gr1 && reg0 == 100.64.0.4), action=(eth.dst=m4; next;)

Each router will have 2000 entries, so in total we will have 2000 * 1000 = 2M entries. That is a lot of flows for the OVN SB to digest.
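
The arithmetic above can be sketched as a back-of-the-envelope model. This is not OVN code: the function names are hypothetical, and the model assumes one IPv4 and one IPv6 entry per neighbouring router port, as in the description above.

```python
def flows_with_shared_join_switch(num_gateway_routers, address_families=2):
    """Static lr_in_arp_resolve entries summed over all gateway routers
    when every gateway router sits on one shared join switch."""
    # Each gateway router sees the (n - 1) other gateway routers plus the
    # single logical router port on the join switch, i.e. n neighbours.
    neighbours_per_router = (num_gateway_routers - 1) + 1
    per_router = neighbours_per_router * address_families
    return per_router * num_gateway_routers


def flows_with_split_join_switches(num_gateway_routers, address_families=2):
    """Same count when each gateway router has its own join switch."""
    # The only neighbour left on each join switch is the logical router.
    return 1 * address_families * num_gateway_routers


print(flows_with_shared_join_switch(1000))   # 2000000
print(flows_with_split_join_switches(1000))  # 2000
```

The first figure grows as O(n^2) and the second as O(n), which is the gap this thread is about.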

In the topology above, the only intended path is North-South between each gateway router and the logical router. There is no east-west traffic between the gateway routers.

We addressed this issue by creating 1000 join logical switches, each of which connects one gateway router to the logical router. However, this creates lots of logical resources and other scale issues in OVN. Also, there are other places in the OVN Kubernetes logical topology that we could optimize by creating just one logical switch instead of thousands of logical switches (for example, instead of a separate external logical switch with a localnet port connecting each gateway router to the physical network, we could have just one for the whole logical topology).

Is there another way to solve the above problem while keeping the single join logical switch?

Regards,
~Girish            

Girish Moodalbail

May 1, 2020, 12:37:20 PM
to ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com

Dan Winship

May 1, 2020, 5:02:32 PM
to Girish Moodalbail, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com
On 5/1/20 12:37 PM, Girish Moodalbail wrote:
> If we now look at table=12 (lr_in_arp_resolve) in the ingress pipeline
> of Gateway Router-1, then you will see that there will be 2000 logical
> flow entries...

> In the topology above, the only intended path is North-South between
> each gateway router and the logical router. There is no east-west
> traffic between the gateway routers

> Is there an another way to solve the above problem with just keeping the
> single join logical switch?

Two thoughts:

1. In openshift-sdn, the bridge doesn't try to handle ARP itself. It
just lets ARP requests pass through normally, and lets ARP replies pass
through normally as long as they are correct (ie, it doesn't let
spoofing through). This means fewer flows but more traffic. Maybe that's
the right tradeoff?

2. In most places in ovn-kubernetes, our MAC addresses are
programmatically related to the corresponding IP addresses, and in
places where that's not currently true, we could try to make it true,
and then perhaps the thousands of rules could just be replaced by a
single rule?
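
A minimal sketch of the idea in point 2, assuming (hypothetically) that each MAC address is a fixed two-byte prefix followed by the four bytes of the IPv4 address. The `0a:58` prefix and the function name are illustrative assumptions, not something stated in this thread. With a mapping like this, the next-hop MAC is a pure function of the next-hop IP, which is what would let one rule stand in for the per-neighbour entries.

```python
import ipaddress

# Assumed two-byte prefix (locally administered, unicast); illustrative only.
MAC_PREFIX = (0x0A, 0x58)


def mac_from_ipv4(ip):
    """Derive a MAC address from an IPv4 address: prefix + the 4 IP bytes."""
    octets = ipaddress.IPv4Address(ip).packed
    return ":".join(f"{byte:02x}" for byte in (*MAC_PREFIX, *octets))


print(mac_from_ipv4("100.64.0.3"))  # 0a:58:64:40:00:03
```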

-- Dan

Han Zhou

May 8, 2020, 2:24:04 AM
to Girish Moodalbail, ovn-kub...@googlegroups.com, ovs-discuss, Han Zhou
(Add the MLs back)

On Thu, May 7, 2020 at 4:01 PM Girish Moodalbail <gmood...@gmail.com> wrote:
> Hello Han,
>
> Sorry, I was monitoring the ovn-kubernetes google group and didn't see your emails till now.
>
> > On the other hand, why wouldn't splitting the join logical switch to 1000 LSes solve the problem? I understand that there will be 1000 more datapaths, and 1000 more LRPs, but these are all O(n), which is much more efficient than the O(n^2) exploding. What are the other scale issues created by this?
>
> Splitting the single join logical switch into 1000 different logical switches is how I have resolved the problem for now. However, with this design I see the following issues:
> (1) Complexity
>   - where one logical switch should have sufficed, we now need to create 1000 logical switches just to work around the O(n^2) logical flows
> (2) IPAM management
>   - before, I had one IP subnet, 100.64.0.0/16, for the single logical switch and depended on OVN IPAM to allocate IPs from that subnet
>   - now I need to first do subnet management (break the /16 into /29 CIDRs) in OVN K8s and then assign one subnet to each join logical switch
> (3) Each of these join logical switches is a distributed switch. The flows related to each one of them will be present on every hypervisor. This increases the number of OpenFlow flows. However, from the OVN K8s point of view, each logical switch is essentially pinned to one hypervisor, and its role is to connect that hypervisor's l3gateway to the distributed router.
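
The subnet management described in (2) can be sketched with Python's standard `ipaddress` module. This is only an illustration of carving the /16 into per-node /29 blocks; the function name is hypothetical, and it is not how OVN K8s actually implements the allocation.

```python
import ipaddress
from itertools import islice

join_supernet = ipaddress.ip_network("100.64.0.0/16")


def allocate_join_subnets(supernet, num_nodes, new_prefix=29):
    """Return the first num_nodes subnets of size new_prefix from supernet."""
    subnets = list(islice(supernet.subnets(new_prefix=new_prefix), num_nodes))
    if len(subnets) < num_nodes:
        raise ValueError("supernet is too small for the requested node count")
    return subnets


subnets = allocate_join_subnets(join_supernet, 1000)
print(subnets[0], subnets[1], subnets[-1])
# A /29 leaves 6 usable host addresses per join logical switch.
print(len(list(subnets[0].hosts())))
```

Each node would then need its subnet recorded and handed to its own join logical switch, which is the extra bookkeeping the single /16 with OVN IPAM avoided.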

> We are trying to simplify the OVN logical topology for OVN K8s so that the number of logical flows (and therefore the number of OpenFlow flows) is reduced, which in turn reduces the pressure on ovn-northd, the OVN SB DB, and finally the ovn-controller processes.
>
> Every node in an OVN K8s cluster adds 4 resources. So, in a 1000-node k8s cluster we will have 4000 + 1 (the distributed router). This ends up creating around 250K OpenFlow rules on each hypervisor. This number just supports the initial logical topology; I am not accounting for any flows that will be generated for k8s network policies, services, and so on.
>
> > In addition, Girish, for the external LS, I am not sure why it can't be shared, if all the nodes are connected to a single L2 network. (If they are connected to separate L2 networks, different external LSes should be created, at least according to the current OVN model.)
>
> Yes, the plan was to share the same external LS among all of the L3 gateway routers, since they are all on the same broadcast domain. However, we would end up with the same 2M logical flows, since a single external LS would connect all the L3 gateway routers on the same broadcast domain.
>
> In short, for a 1000-node K8s cluster, if we reduce the logical flow explosion, then we can reduce the number of logical resources in the OVN K8s topology by 1998 (1000 join LSes become 1 and 1000 external LSes become 1).


Ok, so now we are not satisfied with even O(n), and instead we want to make it O(1) for some of the resources.
I think the major problem is the per-node gateway routers. They do not seem really necessary in theory. Ideally the topology can be simplified with the concept of distributed gateway ports on a single logical router (the join router), and then we can remove all the join LSes and gateway routers, something like below:

    +------------------------------------------+                
    |        external logical switch           |                
    +-+-------------+--------------------+-----+   
      |             |                    |             
+-----+-----+ +-----------+        +-----+-----------+
| dgp1@node1| | dgp2@node2|   ...  |dgp1000@node1000 |
+-----+-----+ +-----+-----+        +-----+-----------+
      |             |                    |             
    +-+-------------+--------------------+-----+                
    |             logical router               |                
    +------------------------------------------+   

(dgp = distributed gateway port)

This way, you need only one router and one external logical switch, and there won't be the O(n^2) flow explosion for ARP resolution because there is only 1 LR. The number of logical routers and switches becomes O(1). The number of router ports is still O(n), but it is halved.

In reality, there are some problems with this solution that need to be addressed.

Firstly, it would require some change in OVN because currently OVN has a limitation that each LR can only have one gateway router port. However, there doesn't seem to be anything fundamental that would prevent us from removing that restriction to support multiple distributed gateway ports on a single LR. I'd like to hear from more OVN folks in case there is some reason we shouldn't do this.

The other thing that I am not so sure about is connecting the logical router to the external logical switch through multiple ports. This means we will have multiple ports of the logical router on the same subnet, which is something we traditionally don't do. However, I think this may work with an OVN static route with src routing and output_port specified, so that the LR knows which port (and chassis) to send the traffic out of, provided that there is only one nexthop, which is the default external GW. If multiple nexthops need to be supported, this won't work (and we will probably have to look at the solution that avoids the static neighbour table population).

Thanks,
Han

Girish M G (GmG)

May 8, 2020, 3:03:10 AM
to Han Zhou, Girish Moodalbail, ovn-kub...@googlegroups.com, ovs-discuss, Han Zhou
Hello Han,

I did consider distributed gateway ports. However, there are two issues with them:

1. In order to support K8s NodePort services we need to create a North-South LB, and the L3 gateway is a perfect solution for that. AFAIK, DGP doesn't support it.
2. Datapath performance would be bad with DGP. We want a packet meant for the host or the Internet to exit from the hypervisor on which the pod exists. The L3 gateway router provides us with this functionality. With DGP, and with OVN supporting only one instance of it, packets unnecessarily get forwarded over a tunnel to the DGP chassis for SNAT and then get forwarded back over the tunnel to the host, just to exit locally.

Also, I would like to clarify the topology of the external logical switch and l3gateway. The current topology is like this:

Topology (A)

                       10.10.10.0/24                        
   ------+-----------------+--------------------+------    
         |                 |                    |          
      localnet          localnet             localnet      
   +-----+-----+      +----+------+       +-----+-----+    
   | external  |      | external  |       | external  |    
   |    LS1    |      |    LS1    |       |    LS1    |    
   +-----+-----+      +----+------+       +-----+-----+    
         |                 |                    |          
     10.10.10.2        10.10.10.3           10.10.10.4      
        SNAT              SNAT                 SNAT        
   +-----+-----+     +-----+-----+        +-----+-----+    
   | l3gateway |     | l3gateway |        | l3gateway |    
   |   node1   |     |   node2   |        |   node3   |    
   +-----------+     +-----------+        +-----------+     


I would like to move to a topology like the one below, which is very similar to physical networking, where all tenants' VRFs SNAT to a common L2 in the DC.

Topology (B)
                       10.10.10.0/24                        
   -------------------------+--------------------------    
                            |                              
                         localnet                          
                      +-----+-----+                        
                      | external  |                        
         +------------+    LS1    +-------------+          
         |            +----+------+             |          
         |                 |                    |          
     10.10.10.2        10.10.10.3           10.10.10.4      
        SNAT              SNAT                 SNAT        
   +-----+-----+     +-----+-----+        +-----------+    
   | l3gateway |     | l3gateway |        | l3gateway |    
   |   node1   |     |   node2   |        |   node3   |    
   +-----------+     +-----------+        +-----------+   
 

I cannot do this because of the 2M logical flows that get created once we connect 1000 l3 gateway routers through a single logical switch.

Note: Topology (A) might still be relevant in certain DCs that don't stretch L2 across racks and have pure L3 in the core. There, everything upstream from the TOR towards the core switches is L3 and everything downstream from the TOR to the nodes is L2. Topology (B) above is just an optimization for end users who have a single stretched VLAN.

Regards,
~Girish

--
You received this message because you are subscribed to the Google Groups "ovn-kubernetes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ovn-kubernete...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ovn-kubernetes/CADtzDC%3Dp4fmsQPY38eezAqENG65ftXk6CAxKn%3DsF1X%3Dp92gw0A%40mail.gmail.com.

Girish Moodalbail

May 8, 2020, 3:17:32 AM
to Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Han Zhou
On Thu, May 7, 2020 at 11:24 PM Han Zhou <zho...@gmail.com> wrote:

Tim Rozet

May 8, 2020, 9:41:14 AM
to Girish Moodalbail, Han Zhou, ovn-kub...@googlegroups.com, ovs-discuss, Han Zhou
Girish, Han,
From my understanding, the GR (per node) <----> DR link is a local subnet. You don't want the overhead of many switch objects in OVN, but you also don't want all the GRs connecting to a single switch, to avoid a large L2 domain. Isn't the simple solution to allow connecting routers to each other without an intermediary switch?

Tim Rozet
Red Hat CTO Networking Team



Han Zhou

May 8, 2020, 2:03:07 PM
to Tim Rozet, Girish Moodalbail, ovn-kub...@googlegroups.com, ovs-discuss, Han Zhou
On Fri, May 8, 2020 at 6:41 AM Tim Rozet <tro...@redhat.com> wrote:
> Girish, Han,
> From my understanding, the GR (per node) <----> DR link is a local subnet. You don't want the overhead of many switch objects in OVN, but you also don't want all the GRs connecting to a single switch, to avoid a large L2 domain. Isn't the simple solution to allow connecting routers to each other without an intermediary switch?
>
> Tim Rozet
> Red Hat CTO Networking Team


Hi Tim,

Thanks for the suggestion. This should be an improvement, but it doesn't completely solve the problem mentioned by Girish.
- Subnet management for the large number of transit subnets is still needed.
- For the external logical switch, this doesn't help.
It is still O(n) in the number of datapaths, the same as the approach of splitting the join LS, but it is more efficient, because for each of the direct connections between the LR and the GRs, the cost of <patch_port - LS - patch_port> is avoided. I think it is worth trying.

Hi Girish, for the DGP solution, please see my comments below:
> 1. In order to support K8s NodePort services we need to create a North-South LB, and the L3 gateway is a perfect solution for that. AFAIK, DGP doesn't support it.

In fact, DGP supports LB (at least judging from the code: https://github.com/ovn-org/ovn/blob/master/northd/ovn-northd.c#L9318), but the ovn-nb manpage may need an update.

> 2. Datapath performance would be bad with DGP. We want a packet meant for the host or the Internet to exit from the hypervisor on which the pod exists. The L3 gateway router provides us with this functionality. With DGP, and with OVN supporting only one instance of it, packets unnecessarily get forwarded over a tunnel to the DGP chassis for SNAT and then get forwarded back over the tunnel to the host, just to exit locally.

This is related to the changes needed for DGP (the first point I mentioned in my previous email). In the diagram I drew, there will be 1000 DGPs, each residing on one chassis, just to make sure north-south traffic can be forwarded on the local chassis without going through a central node, just like how it works today in ovn-k8s. However, this may not be a small change, because today the NAT and LB processing on such LRs (LRs with a DGP) is based on the assumption that there is only one DGP. For example, the NB schema would also need to be changed so that the NAT/LB rules for a router can specify a DGP to determine the central processing location for those rules.

So, to summarize, if we can make multi-DGP work, it would be the best solution for the ovn-k8s scenario. If we can't (either because of a design problem, or because it is too much effort for the gains), maybe configurably avoiding the static neighbour flows is a good way to go. Both options require changes in OVN. Without changes in OVN, a further optimization of your current workaround is what Tim has suggested: replace the large number of small join LSes (and the LRPs and patch ports on both sides) with the same number of directly connected LRPs.

Thanks,
Han

Lorenzo Bianconi

May 8, 2020, 4:13:13 PM
to Numan Siddique, Han Zhou, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss
> On Wed, May 6, 2020 at 11:41 PM Han Zhou <hz...@ovn.org> wrote:
>
> >
> >
> > On Wed, May 6, 2020 at 12:49 AM Numan Siddique <num...@ovn.org> wrote:
> > >

[...]

> > > I forgot to mention, Lorenzo have similar ideas for moving the arp
> > resolve lflows for NAT entries to mac_binding rows.
> > >
> >
> > I am hesitate to the approach of moving to mac_binding as solution to this
> > particular problem, because:
> > 1. Although cost of each mac_binding entry may be much lower than a
> > logical flow entry, it would still be O(n^2), since LRP is part of the key
> > in the table.
> >
>
> Agree. I realize it now.

Hi Han and Numan,

What about moving to the mac_binding table just the entries related to NAT where
we configured the external MAC address, since this info is known in advance? I
can share a PoC I developed a few weeks ago.

Regards,
Lorenzo

>
> Thanks
> Numan
>
>
> > 2. It is better to separate the static and dynamic part clearly. Moving to
> > mac_binding will lose this clarity in data, and also the ownership of the
> > data as well (now mac_binding entries are added only by ovn-controllers).
> > Although I am not in favor of solving the problem with this approach
> > (because of 1)), maybe it makes sense to reduce number of logical flows as
> > a general improvement by moving all neighbour information to mac_binding
> > for scalability. If we do so, I would suggest to figure out a way to keep
> > the data clarity between static and dynamic part.
> >
> > For this particular problem, we just don't want the static part populated
> > because most of them are not needed except one per LRP. However, even
> > before considering optionally disabling the static part, I wanted to
> > understand firstly why separating the join LS would not solve the problem.
> >
> > >>
> > >>
> > >> Thanks
> > >> Numan
> > >>
> > >>>
> > >>> > 2. In most places in ovn-kubernetes, our MAC addresses are
> > >>> > programmatically related to the corresponding IP addresses, and in
> > >>> > places where that's not currently true, we could try to make it true,
> > >>> > and then perhaps the thousands of rules could just be replaced by a
> > >>> > single rule?
> > >>> >
> > >>> This may be a good idea, but I am not sure how to implement in OVN to
> > make it generic, since most OVN users can't make such assumption.
> > >>>
> > >>> On the other hand, why wouldn't splitting the join logical switch to
> > 1000 LSes solve the problem? I understand that there will be 1000 more
> > datapaths, and 1000 more LRPs, but these are all O(n), which is much more
> > efficient than the O(n^2) exploding. What's the other scale issues created
> > by this?
> > >>>
> > >>> In addition, Girish, for the external LS, I am not sure why can't it
> > be shared, if all the nodes are connected to a single L2 network. (If they
> > are connected to separate L2 networks, different external LSes should be
> > created, at least according to current OVN model).
> > >>>
> > >>> Thanks,
> > >>> Han
> > >>> _______________________________________________
> > >>> discuss mailing list
> > >>> dis...@openvswitch.org
> > >>> https://mail.openvswitch.org/mailman/listinfo/ovs-discuss


Tim Rozet

May 9, 2020, 9:14:37 AM
to Lorenzo Bianconi, Numan Siddique, Han Zhou, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss
So we can get rid of the join logical switch. This might be a dumb question, but why do we need an external switch? In the local gateway mode:

pod------logical switch----DR---join switch (to remove) --- GR 169.x.x.2---external switch---169.x.x.1 Linux host

There's no reason in the above to have an external switch that I can see.

Perhaps in the shared gateway mode it is necessary if all of the nodes externally attach to the same L2 network.

Tim Rozet
Red Hat CTO Networking Team


Girish Moodalbail

May 9, 2020, 8:01:32 PM
to Han Zhou, Tim Rozet, ovn-kub...@googlegroups.com, ovs-discuss, Han Zhou
Hello Han, Tim

Please see in-line:

 
> > Hello Han,
> >
> > I did consider distributed gateway ports. However, there are two issues with them:
> >
> > 1. In order to support K8s NodePort services we need to create a North-South LB, and the L3 gateway is a perfect solution for that. AFAIK, DGP doesn't support it.
>
> In fact, DGP supports LB (at least judging from the code: https://github.com/ovn-org/ovn/blob/master/northd/ovn-northd.c#L9318), but the ovn-nb manpage may need an update.

I see
 
 
> > 2. Datapath performance would be bad with DGP. We want a packet meant for the host or the Internet to exit from the hypervisor on which the pod exists. The L3 gateway router provides us with this functionality. With DGP, and with OVN supporting only one instance of it, packets unnecessarily get forwarded over a tunnel to the DGP chassis for SNAT and then get forwarded back over the tunnel to the host, just to exit locally.
>
> This is related to the changes needed for DGP (the first point I mentioned in my previous email). In the diagram I drew, there will be 1000 DGPs, each residing on one chassis, just to make sure north-south traffic can be forwarded on the local chassis without going through a central node, just like how it works today in ovn-k8s. However, this may not be a small change, because today the NAT and LB processing on such LRs (LRs with a DGP) is based on the assumption that there is only one DGP. For example, the NB schema would also need to be changed so that the NAT/LB rules for a router can specify a DGP to determine the central processing location for those rules.

Correct
 

> So, to summarize, if we can make multi-DGP work, it would be the best solution for the ovn-k8s scenario. If we can't (either because of a design problem, or because it is too much effort for the gains), maybe configurably avoiding the static neighbour flows is a good way to go. Both options require changes in OVN.

Han, optimizing the neighbor cache from the current O(n^2) to something scalable would be ideal for the short term. I am hoping that the changes to OVN will not be as complicated as the multi-DGP work and the other changes to OVN proposed in this email thread.

> Without changes in OVN, a further optimization of your current workaround is what Tim has suggested: replace the large number of small join LSes (and the LRPs and patch ports on both sides) with the same number of directly connected LRPs.

Han and Tim,

OVN only supports peering two distributed routers without a logical switch; it doesn't support connecting a distributed router and an l3 gateway router directly as peers. I remember very clearly this being mentioned in the ovn-architecture man page.

---------8<--------------8<---------------------
       The distributed router and the
       gateway router are  connected  by  another  logical  switch,  sometimes
       referred  to  as a ``join’’ logical switch. (OVN logical routers may be
       connected to one another directly, without an intervening  switch,  but
       the  OVN  implementation only supports gateway logical routers that are
       connected to logical switches. Using a join logical switch also reduces
       the  number  of  IP addresses needed on the distributed router.)
---------8<--------------8<---------------------

Before splitting the OVN join logical switch into several small logical switches, I did try directly connecting the LR to each of the node-specific LRs using a point-to-point link, but it didn't work. Since this was corroborated by the man page, I didn't debug the topology and moved on to splitting the `join` logical switch.

Regards,
~Girish 


Han Zhou

May 11, 2020, 1:27:40 AM
to Girish Moodalbail, Tim Rozet, ovn-kub...@googlegroups.com, ovs-discuss, Han Zhou
You are right. So this *improvement* would also need a change in OVN, and the benefit seems less obvious than that of the other two options.

>
> Regards,
> ~Girish
>

Tim Rozet

May 13, 2020, 1:44:00 PM
to Han Zhou, Girish Moodalbail, ovn-kub...@googlegroups.com, ovs-discuss, Han Zhou

Tim Rozet
Red Hat CTO Networking Team

Han Zhou

May 16, 2020, 3:36:22 AM
to Han Zhou, Dan Winship, Girish Moodalbail, ovs-discuss, ovn-kub...@googlegroups.com


On Tue, May 5, 2020 at 11:57 AM Han Zhou <hz...@ovn.org> wrote:

>
>
>
> On Fri, May 1, 2020 at 2:14 PM Dan Winship <danwi...@redhat.com> wrote:
> >
> > On 5/1/20 12:37 PM, Girish Moodalbail wrote:
> > > If we now look at table=12 (lr_in_arp_resolve) in the ingress pipeline
> > > of Gateway Router-1, then you will see that there will be 2000 logical
> > > flow entries...
> >
> > > In the topology above, the only intended path is North-South between
> > > each gateway router and the logical router. There is no east-west
> > > traffic between the gateway routers
> >
> > > Is there an another way to solve the above problem with just keeping the
> > > single join logical switch?
> >
> > Two thoughts:
> >
> > 1. In openshift-sdn, the bridge doesn't try to handle ARP itself. It
> > just lets ARP requests pass through normally, and lets ARP replies pass
> > through normally as long as they are correct (ie, it doesn't let
> > spoofing through). This means fewer flows but more traffic. Maybe that's
> > the right tradeoff?
> >
> The 2M entries here are not for the ARP responder, but are more equivalent to the neighbour table (or ARP cache) on each LR. The ARP responder resides in the LS (the join logical switch) and is O(n) instead of O(n^2), so it is not a problem here.
>
> However, a similar idea may work here to avoid the O(n^2) scale issue. The neighbour table in OVN actually has two parts: one is statically built, which is the 2M entries mentioned in this case, and the other is the dynamic ARP resolution, the mac_binding table, which is populated dynamically by handling ARP messages. To solve the problem here, it is possible to change OVN to support configuring an LR to avoid the static neighbour table and rely only on dynamic ARP resolution. In this case, all the gateway routers can be configured not to use static ARP resolution, and eventually there will be only 2 entries (one for IPv4 and one for IPv6) for each gateway router in the mac_binding table for the north-south traffic to the join router. (Of course, there will still be the same number of mac_bindings in each router for the external traffic on the other side of the gateway routers.)
>
> This change seems straightforward, but I am not sure if there are any corner cases.

Hi Girish,

For this use case, just set options:dynamic_neigh_routes=true for all the Gateway Routers. Could you try it in your scale environment and see if it solves the problem?
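
As a rough sketch of applying that setting across many gateway routers, the snippet below only builds the command lines and does not run them. The option name is copied from this thread and presumably depends on the patch under discussion. The router names are hypothetical placeholders. The `ovn-nbctl set Logical_Router <name> <column>:<key>=<value>` form is the generic database `set` command, used here on the assumption that the option lives in the router's `options` column.

```python
import shlex

# Option name exactly as written in this thread; assumed to require the patch.
OPTION = "options:dynamic_neigh_routes=true"


def build_set_commands(router_names):
    """Build one ovn-nbctl argv list per gateway router (nothing is executed)."""
    return [
        ["ovn-nbctl", "set", "Logical_Router", name, OPTION]
        for name in router_names
    ]


# Hypothetical gateway router names, for illustration only.
routers = [f"GR_node{i}" for i in range(1, 4)]
for argv in build_set_commands(routers):
    print(shlex.join(argv))
```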

Thanks,
Han

>
> > 2. In most places in ovn-kubernetes, our MAC addresses are
> > programmatically related to the corresponding IP addresses, and in
> > places where that's not currently true, we could try to make it true,
> > and then perhaps the thousands of rules could just be replaced by a
> > single rule?
> >

Girish Moodalbail

May 16, 2020, 1:25:29 PM
to Han Zhou, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com
Thanks Han for the patch. Will give it a try and let you know.

Regards,
~Girish 
 
>
> Thanks,
> Han

Girish Moodalbail

May 16, 2020, 3:13:12 PM
to Han Zhou, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com
Hello Han,

Can you please explain how the dynamic resolution of the IP-to-MAC will work with this new option set? 

Say the packet is being forwarded from router2 towards the distributed router. So, the nexthop (reg0) is set to IP1 and we need to find the MAC address M1 to set eth.dst to.

+----------------+        +----------------+
|   l3gateway    |        |   l3gateway    |
|    router2     |        |    router3     |
+-------------+--+        +-+--------------+
            IP2,M2         IP3,M3          
              |             |                            
           +--+-------------+---+          
           |    join switch     |          
           +---------+----------+          
                     |                      
                  IP1,M1                    
             +-------+--------+            
             |  distributed   |            
             |     router     |            
             +----------------+      

The MAC M1 will obviously not be in the MAC_Binding table. On the hypervisor where the packet originated, router2's port and the distributed router's port are both locally present. So, does this result in a PACKET_IN to ovn-controller, with the resolution happening there?

How will the resolution of IP3-to-M3 happen on gateway router2? Will an ARP request packet be broadcast on the join switch in this case?

Regards,
~Girish       

Han Zhou

May 17, 2020, 2:17:17 AM
to Girish Moodalbail, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com
On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail <gmood...@gmail.com> wrote:
Hello Han,

Can you please explain how the dynamic resolution of the IP-to-MAC will work with this new option set? 

Say the packet is being forwarded from router2 towards the distributed router? So, nexthop (reg0) is set to IP1 and we need to find the MAC address M1 to set eth.dst to.

+----------------+        +----------------+
|   l3gateway    |        |   l3gateway    |
|    router2     |        |    router3     |
+-------------+--+        +-+--------------+
            IP2,M2         IP3,M3          
              |             |                            
           +--+-------------+---+          
           |    join switch     |          
           +---------+----------+          
                     |                      
                  IP1,M1                    
             +-------+--------+            
             |  distributed   |            
             |     router     |            
             +----------------+      

MAC M1 will obviously not be in the MAC_Binding table yet. On the hypervisor where the packet originated, both router2's port and the distributed router's port are locally present. So, does this result in a PACKET_IN to ovn-controller, with the resolution happening there?

Yes, there will be a PACKET_IN, and then:
1. ovn-controller generates the ARP request for IP1 and sends it to OVS as a PACKET_OUT.
2. Although the ARP request is a broadcast, it is delivered to the distributed router pipeline only, because OVN has special handling for ARP requests that target router port IPs. (Without that special handling it would be broadcast to all GRs.)
3. The distributed router pipeline should learn the IP2-M2 binding (through a PACKET_IN to ovn-controller) and, in the same pipeline, send an ARP reply to router2.
4. The router2 pipeline handles the ARP reply and learns the IP1-M1 binding (through a PACKET_IN to ovn-controller).
 

How would the resolution of IP3-to-M3 happen on gateway router2? Will an ARP request be broadcast on the join switch in this case?

I think in the ovn-k8s use case, as you described before, this should not happen. However, if it does happen, it is similar to the steps above, except that in steps 2) and 3) the ARP request and reply are sent between the chassis through the tunnel. If this happens between all pairs of GRs, then there will again be O(n^2) MAC_Binding entries.

I haven't tested the GR scenario yet, so I can't guarantee it works as expected. Please let me know if you see any problems. I will submit a formal patch with more test cases once it is confirmed in your environment.

Thanks,
Han

Girish Moodalbail

May 17, 2020, 12:52:08 PM
to Han Zhou, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com
Thanks Han for the explanation. Yes, there is no east-west traffic between the GRs (I was just curious to know). So, if the ARP request/response between GR and DR is confined to the same chassis, then there shouldn't be O(n^2) explosion per-your explanation.

Will get back to you on how the test goes in the next few days.

Regards,
Girish

Girish Moodalbail

May 20, 2020, 2:09:20 AM
to Han Zhou, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com
Hello Han,

Please see in-line:

On Sat, May 16, 2020 at 11:17 PM Han Zhou <zho...@gmail.com> wrote:


On Sat, May 16, 2020 at 12:13 PM Girish Moodalbail <gmood...@gmail.com> wrote:
Hello Han,

Can you please explain how the dynamic resolution of the IP-to-MAC will work with this new option set? 

Say the packet is being forwarded from router2 towards the distributed router. The nexthop (reg0) is then set to IP1, and we need to find the MAC address M1 to set eth.dst to.

+----------------+        +----------------+
|   l3gateway    |        |   l3gateway    |
|    router2     |        |    router3     |
+-------------+--+        +-+--------------+
            IP2,M2         IP3,M3          
              |             |                            
           +--+-------------+---+          
           |    join switch     |          
           +---------+----------+          
                     |                      
                  IP1,M1                    
             +-------+--------+            
             |  distributed   |            
             |     router     |            
             +----------------+      

MAC M1 will obviously not be in the MAC_Binding table yet. On the hypervisor where the packet originated, both router2's port and the distributed router's port are locally present. So, does this result in a PACKET_IN to ovn-controller, with the resolution happening there?

Yes, there will be a PACKET_IN, and then:
1. ovn-controller generates the ARP request for IP1 and sends it to OVS as a PACKET_OUT.
2. Although the ARP request is a broadcast, it is delivered to the distributed router pipeline only, because OVN has special handling for ARP requests that target router port IPs. (Without that special handling it would be broadcast to all GRs.)
3. The distributed router pipeline should learn the IP2-M2 binding (through a PACKET_IN to ovn-controller) and, in the same pipeline, send an ARP reply to router2.
4. The router2 pipeline handles the ARP reply and learns the IP1-M1 binding (through a PACKET_IN to ovn-controller).

Unfortunately, the ARP request (who has IP1) from router2 is broadcast to all of the chassis through the Geneve tunnels. The other gateway routers learn the source MAC 'M2', so each gateway router now has an (IP2, M2) entry in the MAC_Binding table on its respective rtoj-<blah> router port. The MAC_Binding table will therefore end up with N x N entries, where N is the number of gateway routers.
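
The counting argument can be checked with a toy model. This is only a sketch in plain Python, not OVN code: the function name, the port labels and the "one ARP request per GR" assumption are invented for illustration.

```python
def count_mac_bindings(num_gateway_routers, broadcast):
    """Count MAC_Binding rows after every GR has ARPed once for the DR's IP.

    A row is modelled as a (learning_port, ip) pair. All names here are
    hypothetical; this is a counting model, not OVN code.
    """
    grs = [f"gr{i}" for i in range(1, num_gateway_routers + 1)]
    bindings = set()
    for sender in grs:
        sender_ip = f"ip-{sender}"
        # The DR receives the request and learns the sender's binding.
        bindings.add(("dr", sender_ip))
        # The ARP reply teaches the sender the DR's binding (IP1-M1).
        bindings.add((sender, "ip-dr"))
        if broadcast:
            # Every other GR on the join switch also sees the request
            # and learns the sender's binding on its own port.
            for other in grs:
                if other != sender:
                    bindings.add((other, sender_ip))
    return len(bindings)


if __name__ == "__main__":
    n = 50
    print("request reaches the DR only:", count_mac_bindings(n, broadcast=False))
    print("request broadcast to all GRs:", count_mac_bindings(n, broadcast=True))
```

In this model the row count is 2N when the request reaches only the distributed router, and N^2 + N when every gateway router also learns from it.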

Per your explanation above, the ARP request should not have been broadcast, right? Note that the direction of the ARP request is from the Gateway Router to the Distributed Router.

Regards,
~Girish

Venugopal Iyer

May 21, 2020, 1:33:51 PM
to Girish Moodalbail, Han Zhou, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com
Han,

just a quick question below..

________________________________________
From: ovn-kub...@googlegroups.com <ovn-kub...@googlegroups.com> on behalf of Girish Moodalbail <gmood...@gmail.com>
Sent: Tuesday, May 19, 2020 11:09 PM
To: Han Zhou
Cc: Han Zhou; Dan Winship; ovs-discuss; ovn-kub...@googlegroups.com
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

Hello Han,

Please see in-line:

On Sat, May 16, 2020 at 11:17 PM Han Zhou <zho...@gmail.com> wrote:


<vi> This is probably obvious and I am missing it, but..
<vi> I see the lflow that directs ARP requests to the router port instead of broadcasting them. However,
<vi> we also add flows to broadcast self-originated (unsolicited?) ARP requests (we should
<vi> not see this for router IPs, I suppose). But, given that we match only on the source
<vi> MAC address for such packets, how do they differ from the ARP
<vi> request generated for a router IP?

thanks,

-venu

--


You received this message because you are subscribed to the Google Groups "ovn-kubernetes" group.

To unsubscribe from this group and stop receiving emails from it, send an email to ovn-kubernete...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ovn-kubernetes/CAAF2STTq4WSwvwHbws5e0yozT7OM9RYcpWwaA2v49k83JDmEqA%40mail.gmail.com.

Han Zhou

May 21, 2020, 5:01:01 PM
to Venugopal Iyer, Dumitru Ceara, Girish Moodalbail, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com
Good catch! That seems to be the reason why it is broadcast. I thought the feature only allowed GARP to be broadcast, but it actually allows any (G)ARP, including regular ARP requests generated by the LRs. It could be an easy fix to commit 32f5ebb062 ("ovn-northd: Limit ARP/ND broadcast domain whenever possible."), but I am not sure whether there are other concerns with doing that. @Dumitru Ceara, please comment on whether we can restrict it to GARP only.

On the other hand, in this use case, if there is any ARP request from the distributed router to any of the GRs, then all the GRs should learn the IP1-M1 MAC binding and will not send ARP requests for IP1 any more, so this would not result in N x N MAC bindings, right? In the real use case, it may depend on which direction of traffic comes first. If it is always from external to k8s workloads first, then yes, it will eventually end up with N x N MAC bindings.

Tim Rozet

May 21, 2020, 5:35:25 PM
to Han Zhou, Venugopal Iyer, Dumitru Ceara, Girish Moodalbail, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
I think that if you directly connect the GRs to the DR, you don't need to learn any ARP with packet_in, and you can preprogram the static entries. Each GR will have 1 entry for the DR, while the DR will have N entries for N nodes.

The real issue with ARP learning comes from the GR-----External side. You have to learn these, and from my conversation with Girish it seems like every GR adds an entry for every ARP request it sees. This means that when one GR sends an ARP request to the external L2 network, every GR sees the request and adds an entry. I think the behavior should be:

GRs only add ARP entries when:
  1. An ARP Response is sent to it
  2. The GR receives a GARP broadcast and already has an entry in its cache for that IP (Girish mentioned this is similar to Linux arp_accept behavior)
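
The two rules above can be written down as a small predicate. This is a minimal sketch in plain Python of the proposed behavior, not of what OVN does today; `ArpOp`, `should_learn` and the parameter names are hypothetical.

```python
from enum import Enum


class ArpOp(Enum):
    REQUEST = 1
    REPLY = 2


def should_learn(op, spa, tpa, own_ips, cache):
    """Return True if the router should add or update a binding for spa.

    op      -- ArpOp of the received packet
    spa/tpa -- sender / target protocol (IP) address in the ARP payload
    own_ips -- set of IPs configured on the receiving router port
    cache   -- dict of already known ip -> mac bindings
    """
    # Rule 1: an ARP reply that is addressed to this router.
    if op is ArpOp.REPLY and tpa in own_ips:
        return True
    # Rule 2: a gratuitous ARP (SPA == TPA) for an IP that is already cached.
    if spa == tpa and spa in cache:
        return True
    # Anything else, e.g. another GR's ARP request, is not learned.
    return False
```

With this predicate, a broadcast ARP request from one GR to some external host leaves the caches of all other GRs unchanged.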
In addition, as Michael Cambria pointed out in our weekly meeting, these ARP cache entries should have expiry timers on them. If they are permanently learned, you will end up with a growing ARP table over time, and end up in the same place. We can probably just program the GR ARP flows with an idle_timeout and have the flow removed. What do you think?
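
For illustration, the idle-timeout idea looks roughly like the sketch below. It is a plain Python model with invented names (`ExpiringArpCache`, `idle_timeout_s`), not a description of how OVN stores MAC bindings. It also expires entries lazily on lookup, whereas an OpenFlow idle_timeout makes the switch remove the flow by itself.

```python
import time


class ExpiringArpCache:
    """IP-to-MAC cache whose entries lapse after idle_timeout_s without use."""

    def __init__(self, idle_timeout_s, clock=time.monotonic):
        self._timeout = idle_timeout_s
        self._clock = clock          # injectable so the behavior can be tested
        self._entries = {}           # ip -> (mac, time of last use)

    def learn(self, ip, mac):
        self._entries[ip] = (mac, self._clock())

    def lookup(self, ip):
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, last_used = entry
        now = self._clock()
        if now - last_used > self._timeout:
            del self._entries[ip]    # idle for too long: forget the binding
            return None
        self._entries[ip] = (mac, now)   # a hit refreshes the idle timer
        return mac

    def __len__(self):
        return len(self._entries)
```

An entry that keeps being used stays in the cache, while one that is no longer referenced disappears after the timeout, so the table cannot grow without bound.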

Should I file a bugzilla outlining the above so we can have proper tracking?

Thanks,

Tim Rozet
Red Hat CTO Networking Team


Venugopal Iyer

May 21, 2020, 5:58:57 PM
to Han Zhou, Dumitru Ceara, Girish Moodalbail, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com
Hi, Han:

________________________________________
From: ovn-kub...@googlegroups.com <ovn-kub...@googlegroups.com> on behalf of Han Zhou <zho...@gmail.com>
Sent: Thursday, May 21, 2020 2:00 PM
To: Venugopal Iyer; Dumitru Ceara
Cc: Girish Moodalbail; Han Zhou; Dan Winship; ovs-discuss; ovn-kub...@googlegroups.com


Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

Good catch! That seems to be the reason why it is broadcast. I thought the feature only allowed GARP to be broadcast, but it actually allows any (G)ARP, including regular ARP requests generated by the LRs. It could be an easy fix to commit 32f5ebb062 ("ovn-northd: Limit ARP/ND broadcast domain whenever possible."), but I am not sure whether there are other concerns with doing that. @Dumitru Ceara, please comment on whether we can restrict it to GARP only.

On the other hand, in this use case, if there is any ARP request from the distributed router to any of the GRs, then all the GRs should learn the IP1-M1 MAC binding and will not send ARP requests for IP1 any more, so this would not result in N x N MAC bindings, right? In the real use case, it may depend on which direction of traffic comes first. If it is always from external to k8s workloads first, then yes, it will eventually end up with N x N MAC bindings.

<vi> That's right. However, I am not sure why the MAC bindings are learnt from
<vi> ARP requests unconditionally - I thought you update a binding if you already have
<vi> it in the table, but don't add one unless you need to. Linux has "arp_accept", which
<vi> allows you to add entries if needed, but by default it doesn't.

"
arp_accept - BOOLEAN
Define behavior for gratuitous ARP frames who's IP is not
already present in the ARP table:
0 - don't create new entries in the ARP table
1 - create new entries in the ARP table

"

<vi> Shouldn't we have a similar knob for learning via ARP requests, which should be
<vi> "false" by default?


thanks,

-venu


Han Zhou

May 21, 2020, 7:43:07 PM
to Tim Rozet, Venugopal Iyer, Dumitru Ceara, Girish Moodalbail, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
On Thu, May 21, 2020 at 2:35 PM Tim Rozet <tro...@redhat.com> wrote:
I think that if you directly connect the GRs to the DR, you don't need to learn any ARP with packet_in, and you can preprogram the static entries. Each GR will have 1 entry for the DR, while the DR will have N entries for N nodes.

Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N ports on the DR and also requires a lot of small subnets, which is not desirable. And since changes are needed anyway in OVN to support that, we moved forward with the current approach of avoiding the static ARP flows to solve the problem instead of directly connecting GRs to DR.


The real issue with ARP learning comes from the GR-----External side. You have to learn these, and from my conversation with Girish it seems like every GR adds an entry for every ARP request it sees. This means that when one GR sends an ARP request to the external L2 network, every GR sees the request and adds an entry. I think the behavior should be:

GRs only add ARP entries when:
  1. An ARP Response is sent to it
  2. The GR receives a GARP broadcast and already has an entry in its cache for that IP (Girish mentioned this is similar to Linux arp_accept behavior)
For 2), this is expensive to do in OVN because OpenFlow doesn't support a match condition of the form "field1 == field2", which is required to check whether the incoming ARP request is a GARP, i.e. SPA == TPA. However, it is OK to support something similar to the Linux arp_accept configuration, but slightly different: in OVN we can make it configurable to allow/disable learning from all ARP requests to IPs not belonging to the router, including GARPs. Would that solve the problem here? (@Venugopal Iyer brought up the same thing about "arp_accept". I hope this reply addresses that as well.)
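
A rough sketch of the proposed knob, in plain Python rather than logical flows. The names `learns_from_arp_request` and `learn_from_foreign_requests` are made up for this example, and it assumes that requests targeting one of the router's own IPs are always learned from.

```python
def is_garp(spa, tpa):
    """A gratuitous ARP carries the same sender and target IP (SPA == TPA)."""
    return spa == tpa


def learns_from_arp_request(tpa, own_ips, learn_from_foreign_requests):
    """Decide whether to learn the sender's binding from an ARP request.

    tpa     -- target IP of the request
    own_ips -- set of IPs owned by the receiving router port
    learn_from_foreign_requests -- hypothetical per-router knob
    """
    if tpa in own_ips:
        # Assumed: a request for one of our own IPs is always learned from,
        # since we are about to answer the sender anyway.
        return True
    # Requests for any other IP, GARPs included, follow the knob. Only the
    # TPA is compared with a constant set, so no SPA == TPA match is needed.
    return learn_from_foreign_requests
```

The decision never calls `is_garp()`, which is the point of the proposal: a GARP from another host has a TPA that is not one of the router's IPs, so it is covered by the same switch as every other foreign request.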

In addition, as Michael Cambria pointed out in our weekly meeting, these ARP cache entries should have expiry timers on them. If they are permanently learned, you will end up with a growing ARP table over time, and end up in the same place. We can probably just program the GR ARP flows with an idle_timeout and have the flow removed. What do you think?

This has been discussed before, and it is also mentioned in TODO.rst. However, it has not been taken care of because no good solution has been found yet. It can be done, but it would be expensive and the gains are not worth the costs. Accepting ARP requests partially reduces the need for ARP expiration. It is true that it could still be a problem in some scenarios, but so far we haven't heard of any use case that has a hard dependency on this.
 
Should I file a bugzilla outlining the above so we can have proper tracking?

I think bugzilla is outside the control of the OVN community, so please feel free to file or not file ;)

Thanks,
Han

Venugopal Iyer

May 21, 2020, 8:45:12 PM
to Han Zhou, Tim Rozet, Dumitru Ceara, Girish Moodalbail, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
Hi, Han:

________________________________________

Sent: Thursday, May 21, 2020 4:42 PM
To: Tim Rozet
Cc: Venugopal Iyer; Dumitru Ceara; Girish Moodalbail; Han Zhou; Dan Winship; ovs-discuss; ovn-kub...@googlegroups.com; Michael Cambria


Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

On Thu, May 21, 2020 at 2:35 PM Tim Rozet <tro...@redhat.com> wrote:
I think that if you directly connect the GRs to the DR, you don't need to learn any ARP with packet_in, and you can preprogram the static entries. Each GR will have 1 entry for the DR, while the DR will have N entries for N nodes.

Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N ports on the DR and also requires a lot of small subnets, which is not desirable. And since changes are needed anyway in OVN to support that, we moved forward with the current approach of avoiding the static ARP flows to solve the problem instead of directly connecting GRs to DR.


The real issue with ARP learning comes from the GR-----External side. You have to learn these, and from my conversation with Girish it seems like every GR adds an entry for every ARP request it sees. This means that when one GR sends an ARP request to the external L2 network, every GR sees the request and adds an entry. I think the behavior should be:

GRs only add ARP entries when:

1. An ARP Response is sent to it
2. The GR receives a GARP broadcast and already has an entry in its cache for that IP (Girish mentioned this is similar to Linux arp_accept behavior)

For 2), this is expensive to do in OVN because OpenFlow doesn't support a match condition of the form "field1 == field2", which is required to check whether the incoming ARP request is a GARP, i.e. SPA == TPA. However, it is OK to support something similar to the Linux arp_accept configuration, but slightly different: in OVN we can make it configurable to allow/disable learning from all ARP requests to IPs not belonging to the router, including GARPs. Would that solve the problem here? (@Venugopal Iyer brought up the same thing about "arp_accept". I hope this reply addresses that as well.)

<vi> I can't think of any side effects of this, so it seems fine to me. I believe Linux behaves that way w.r.t. ARP requests
<vi> anyway (assuming I am reading it right).

https://elixir.bootlin.com/linux/v5.7-rc6/source/net/ipv4/arp.c (L874)


thanks,

-venu


Tim Rozet

May 21, 2020, 9:58:09 PM
to Venugopal Iyer, Han Zhou, Dumitru Ceara, Girish Moodalbail, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
On Thu, May 21, 2020 at 8:45 PM Venugopal Iyer <venug...@nvidia.com> wrote:
Hi, Han:

________________________________________
From: ovn-kub...@googlegroups.com <ovn-kub...@googlegroups.com> on behalf of Han Zhou <zho...@gmail.com>
Sent: Thursday, May 21, 2020 4:42 PM
To: Tim Rozet
Cc: Venugopal Iyer; Dumitru Ceara; Girish Moodalbail; Han Zhou; Dan Winship; ovs-discuss; ovn-kub...@googlegroups.com; Michael Cambria
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

On Thu, May 21, 2020 at 2:35 PM Tim Rozet <tro...@redhat.com> wrote:
I think that if you directly connect the GRs to the DR, you don't need to learn any ARP with packet_in, and you can preprogram the static entries. Each GR will have 1 entry for the DR, while the DR will have N entries for N nodes.

Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N ports on the DR and also requires a lot of small subnets, which is not desirable. And since changes are needed anyway in OVN to support that, we moved forward with the current approach of avoiding the static ARP flows to solve the problem instead of directly connecting GRs to DR.

Why is that not desirable? They would all be private /30 subnets (if using IPv4). With IPv6, it's even less of a concern from an addressing perspective.
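
To give a feel for the addressing, here is a small sketch that carves per-link /30s out of the 100.64.0.0/16 range used for the join switch earlier in the thread. It only uses Python's standard `ipaddress` module. The helper name and the choice of which host address goes to the DR side are assumptions for the example.

```python
import ipaddress
from itertools import islice


def point_to_point_links(supernet, count):
    """Return `count` (dr_ip, gr_ip) pairs, one /30 subnet per GR-DR link."""
    network = ipaddress.ip_network(supernet)
    links = []
    for subnet in islice(network.subnets(new_prefix=30), count):
        # A /30 has exactly two usable host addresses.
        dr_ip, gr_ip = subnet.hosts()
        links.append((str(dr_ip), str(gr_ip)))
    if len(links) < count:
        raise ValueError(f"{supernet} cannot hold {count} /30 subnets")
    return links


if __name__ == "__main__":
    links = point_to_point_links("100.64.0.0/16", 1000)
    print(links[0])
    print(links[-1])
```

A /16 holds 16384 such /30s, so address space is not the limiting factor for 1000 nodes; each link does need its own port on the DR, though.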

The real issue with ARP learning comes from the GR-----External side. You have to learn these, and from my conversation with Girish it seems like every GR adds an entry for every ARP request it sees. This means that when one GR sends an ARP request to the external L2 network, every GR sees the request and adds an entry. I think the behavior should be:

GRs only add ARP entries when:

  1.  An ARP Response is sent to it
  2.  The GR receives a GARP broadcast and already has an entry in its cache for that IP (Girish mentioned this is similar to Linux arp_accept behavior)

For 2), this is expensive to do in OVN because OpenFlow doesn't support a match condition of the form "field1 == field2", which is required to check whether the incoming ARP request is a GARP, i.e. SPA == TPA. However, it is OK to support something similar to the Linux arp_accept configuration, but slightly different: in OVN we can make it configurable to allow/disable learning from all ARP requests to IPs not belonging to the router, including GARPs. Would that solve the problem here? (@Venugopal Iyer brought up the same thing about "arp_accept". I hope this reply addresses that as well.)

I think the issue there is that if you have an external device that uses a VIP and it fails over, it will usually send a GARP to announce the MAC change. If you ignore the GARP in this case, what happens? You won't send another ARP request, because OVN programs the ARP entry forever and never expires it, right? So you won't learn the new MAC and will keep sending packets to a dead MAC?

<vi> I can't think of any side effects of this, so it seems fine to me. I believe Linux behaves that way w.r.t. ARP requests
<vi> anyway (assuming I am reading it right).

https://elixir.bootlin.com/linux/v5.7-rc6/source/net/ipv4/arp.c (L874)


thanks,

-venu

In addition, as Michael Cambria pointed out in our weekly meeting, these ARP cache entries should have expiry timers on them. If they are permanently learned, you will end up with a growing ARP table over time, and end up in the same place. We can probably just program the GR ARP flows with an idle_timeout and have the flow removed. What do you think?

This has been discussed before, and it is also mentioned in TODO.rst. However, it has not been taken care of because no good solution has been found yet. It can be done, but it would be expensive and the gains are not worth the costs. Accepting ARP requests partially reduces the need for ARP expiration. It is true that it could still be a problem in some scenarios, but so far we haven't heard of any use case that has a hard dependency on this.

Should I file a bugzilla outlining the above so we can have proper tracking?

I think bugzilla is out of the control of OVN community, so please feel free to file or not file ;)
 
Sorry folks from OVN had told me you use bugzilla to track OVN bugs, and not JIRA or Github. What bug tracking system do you use if not BZ? 

Girish Moodalbail

May 21, 2020, 10:12:53 PM5/21/20
to Tim Rozet, Venugopal Iyer, Han Zhou, Dumitru Ceara, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
On Thu, May 21, 2020 at 6:58 PM Tim Rozet <tro...@redhat.com> wrote:
On Thu, May 21, 2020 at 8:45 PM Venugopal Iyer <venug...@nvidia.com> wrote:
Hi, Han:

________________________________________
From: ovn-kub...@googlegroups.com <ovn-kub...@googlegroups.com> on behalf of Han Zhou <zho...@gmail.com>
Sent: Thursday, May 21, 2020 4:42 PM
To: Tim Rozet
Cc: Venugopal Iyer; Dumitru Ceara; Girish Moodalbail; Han Zhou; Dan Winship; ovs-discuss; ovn-kub...@googlegroups.com; Michael Cambria
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table




On Thu, May 21, 2020 at 2:35 PM Tim Rozet <tro...@redhat.com> wrote:
I think that if you directly connect GR to DR you don't need to learn any ARP with packet_in and you can preprogram the static entries. Each GR will have 1 entry for the DR, while the DR will have N entries for N nodes.

Hi Tim, as mentioned by Girish, directly connecting GRs to DR requires N ports on the DR and also requires a lot of small subnets, which is not desirable. And since changes are needed anyway in OVN to support that, we moved forward with the current approach of avoiding the static ARP flows to solve the problem instead of directly connecting GRs to DR.

Why is that not desirable? They are all private subnets with /30 (if using ipv4). If IPv6, it's even less of a concern from an addressing perspective.

It is not just about the subnet management but also about the additional logical flows that are created, which differ between the two ways of connecting DR and GR.

Say we have a fix that efficiently allows one to connect 1000s of GRs using a single logical switch; would you then rather use that instead of 1000 patch cables, each connecting a GR to the DR? It is not only the issue of subnet management for those 1000 point-to-point connections. Those 1000 patch ports are also local to each of the chassis, so we need to understand how many additional logical flows get created in the SB and how many OpenFlow flows get created on each of the 1000 chassis for those 1000 patch cables in such a topology.
 

The real issue with ARP learning comes from the GR-----External. You have to learn these, and from my conversation with Girish it seems like every GR is adding an entry on every ARP request it sees. This means 1 GR sends ARP request to external L2 network and every GR sees the ARP request and adds an entry. I think the behavior should be:

GRs only add ARP entries when:

  1.  An ARP Response is sent to it
  2.  The GR receives a GARP broadcast and already has an entry in its cache for that IP (Girish mentioned this is similar to Linux arp_accept behavior)

For 2), it is expensive to do in OVN because OpenFlow doesn't support a match condition of "field1 == field2", which is required to check if the incoming ARP request is a GARP, i.e. SPA == TPA. However, it is OK to support something similar to the Linux arp_accept configuration, but slightly different. In OVN we can make it configurable to allow/disable learning from all ARP requests to IPs not belonging to the router, including GARPs. Would that solve the problem here? (@Venugopal Iyer brought up the same thing about "arp_accept". I hope this reply addresses that as well.)

I think the issue there is that if you have an external device that is using a VIP and it fails over, it will usually send a GARP to announce the MAC change. If you ignore the GARP in this case, what happens? You won't send another ARP, because OVN programs the ARP entry forever and doesn't expire it, right? So you won't learn the new MAC and will keep sending packets to a dead MAC?

I think we will have to support GARP otherwise VIPs will not work like Tim mentions. If we do learn from GARP and as long as the GARP itself is not originated by any of the 1000s GRs, then we should be fine.

Regards,
~Girish
 

Han Zhou

May 21, 2020, 10:43:58 PM5/21/20
to Girish Moodalbail, Tim Rozet, Venugopal Iyer, Dumitru Ceara, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
Right, I didn't think this through. I thought it was just a configurable option, but it seems we will always need to support GARP, so the option becomes useless.
However, there is no easy way to achieve "do learn from GARP as long as the GARP itself is not originated by any of the 1000s of GRs", because OVN doesn't have knowledge of the use case. The requirement amounts to: don't learn neighbours from ARP requests if the ARP's source belongs to OVN routers. Firstly, this requirement is hard to understand for users outside this particular ovn-k8s setup. Secondly, implementing it already requires O(n^2) flows just to bypass the OVN-owned router IPs, which does not help with the original problem. We will have to figure out a clean way.
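
To put a rough number on the O(n^2) concern, here is a back-of-the-envelope sketch in Python. It assumes (as an illustration only, not as a description of how ovn-northd builds its flows) that each of the n routers on the switch would need one "do not learn" flow per other OVN-owned router IP. The function name is made up for this example.

```python
def bypass_flow_count(num_routers: int, address_families: int = 1) -> int:
    """Estimate the flows needed so that no router learns from another router's IP.

    Assumption: one flow per (router, other router IP) pair and address family.
    """
    per_router = (num_routers - 1) * address_families
    return num_routers * per_router


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, bypass_flow_count(n))
```

Under that assumption, n = 1000 comes to 999,000 flows per address family, which is the same order of magnitude as the static ARP flows this thread set out to eliminate.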

For the internal join switch this is easier. I think that letting LRs broadcast only GARP requests and ARP requests to unknown IPs (all others will be unicast) will solve the problem. But for the external logical switch, I have no idea. Can it be handled from the operator's perspective, by initiating a ping from external to the GR, so that the GR learns the external GW IP-MAC binding before it sends a broadcast to all neighbours?
 
Regards,
~Girish
 

Venugopal Iyer

May 22, 2020, 11:39:46 AM5/22/20
to Han Zhou, Girish Moodalbail, Tim Rozet, Dumitru Ceara, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
A couple of comments below:


<vi> I suppose the use of GARP as a request vs. a reply is not very clear; [1], Section 3 seems to offer a concise summary of this. If the application sends GARP as
<vi> a reply we are covered, but the question is what our response should be if the GARP is a request (which is allowed). Tim is right, we can't ignore
<vi> the request (more so since aging is not currently supported); however, "arp_accept" ignores the request only for creating a new cache entry, not for updating
<vi> an existing one (see last para below)

[2]


arp_accept - BOOLEAN
Define behavior for gratuitous ARP frames who's IP is not
already present in the ARP table:
0 - don't create new entries in the ARP table
1 - create new entries in the ARP table

Both replies and requests type gratuitous arp will trigger the
ARP table to be updated, if this setting is on.

If the ARP table already contains the IP address of the
gratuitous arp frame, the arp table will be updated regardless
if this setting is on or off.

<vi> if we lookup and get a hit, we should still process the GARP; only if we don't have a hit, we should ignore (instead of
<vi> creating an entry). BTW, do we update today? if I understand the use of reg9[2] / REGBIT_LOOKUP_NEIGHBOR_RESULT (assuming lookup_arp
<vi> returns 1 if entry exists), I am not sure it does? maybe I missed it ..

thanks,

-venu

[1]https://www.ietf.org/rfc/rfc5227.txt



--
You received this message because you are subscribed to the Google Groups "ovn-kubernetes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ovn-kubernete...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ovn-kubernetes/CADtzDCmKJ4JpZ-HfKhmb18LU3HmqAiAvUmFGnRrPcDF5M7u0yw%40mail.gmail.com.

Han Zhou

May 22, 2020, 4:51:12 PM5/22/20
to Venugopal Iyer, Girish Moodalbail, Tim Rozet, Dumitru Ceara, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
On Fri, May 22, 2020 at 8:39 AM Venugopal Iyer <venug...@nvidia.com> wrote:
A couple of comments below:




<vi> I suppose the use of GARP as a request vs. a reply is not very clear; [1], Section 3 seems to offer a concise summary of this. If the application sends GARP as
<vi> a reply we are covered, but the question is what our response should be if the GARP is a request (which is allowed). Tim is right, we can't ignore
<vi> the request (more so since aging is not currently supported); however, "arp_accept" ignores the request only for creating a new cache entry, not for updating
<vi> an existing one (see last para below)

[2]
arp_accept - BOOLEAN
        Define behavior for gratuitous ARP frames who's IP is not
        already present in the ARP table:
        0 - don't create new entries in the ARP table
        1 - create new entries in the ARP table

        Both replies and requests type gratuitous arp will trigger the
        ARP table to be updated, if this setting is on.

        If the ARP table already contains the IP address of the
        gratuitous arp frame, the arp table will be updated regardless
        if this setting is on or off.

<vi> if we lookup and get a hit, we should still process the GARP; only if we don't  have a hit, we should ignore (instead of
<vi> creating an entry). BTW, do we update today? if I understand the use of reg9[2] / REGBIT_LOOKUP_NEIGHBOR_RESULT (assuming lookup_arp
<vi> returns 1 if entry exists), I am not sure it does? maybe I missed it ..

thanks,

-venu

[1]https://www.ietf.org/rfc/rfc5227.txt


(Not sure why the indent format of your reply is not correct at least on my client - it mixes all previous replies together so one cannot tell which part was from whom, so I truncated all of them.)

Thanks Venu. I think this would work: we can add an option similar to, but different from, arp_accept (because it is not easy for OVN to tell whether a packet is a GARP in the ingress pipeline). The option could be named something like learn_from_arp_request.
When an ARP request is received, always check whether an entry already exists for the SPA. If it exists and the MAC is different, update the mac-binding entry. If the entry doesn't exist, check the option setting:
"true" - add a new entry.
"false" - if the TPA is on the router, add a new entry (it means the remote wants to communicate with this node, so it makes sense to learn the remote as well). Otherwise, ignore it and add no new entry.

Do you think this works?
Regarding your question on lookup_arp(): today it looks up the exact IP-MAC pair, just to avoid an unnecessary update if the pair already exists and has not changed.
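
The proposed rule can be sketched in Python as follows. This is only an illustration of the decision logic described above, not OVN code: the `Router` class, the `handle_arp_request` function and the returned strings are hypothetical names chosen for this example, and the mac-binding table is modelled as a plain dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    owned_ips: set                       # IPs configured on this router
    learn_from_arp_request: bool = True  # the proposed option
    mac_bindings: dict = field(default_factory=dict)  # IP -> MAC


def handle_arp_request(router: Router, spa: str, sha: str, tpa: str) -> str:
    """Apply the proposed learning rule to one received ARP request."""
    known_mac = router.mac_bindings.get(spa)
    if known_mac is not None:
        # An existing entry is always refreshed, whatever the option says.
        if known_mac == sha:
            return "unchanged"
        router.mac_bindings[spa] = sha
        return "updated"
    # No entry yet: learn only if the option allows it, or if the request
    # targets one of this router's own IPs.
    if router.learn_from_arp_request or tpa in router.owned_ips:
        router.mac_bindings[spa] = sha
        return "added"
    return "ignored"
```

With the option set to false, a broadcast ARP request between two other nodes is ignored, while a GARP for an IP that is already in the table still updates the MAC, which is what the VIP failover case needs.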

Thanks,
Han

Venugopal Iyer

May 22, 2020, 6:14:39 PM5/22/20
to Han Zhou, Girish Moodalbail, Tim Rozet, Dumitru Ceara, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria

Sorry, Han, for messing up the indents; it looks like my Outlook browser client is either not set correctly or doesn't work well.

 

Let me try from the app and see if it is any better..

 

From: ovn-kub...@googlegroups.com <ovn-kub...@googlegroups.com> On Behalf Of Han Zhou
Sent: Friday, May 22, 2020 1:51 PM
To: Venugopal Iyer <venug...@nvidia.com>
Cc: Girish Moodalbail <gmood...@gmail.com>; Tim Rozet <tro...@redhat.com>; Dumitru Ceara <dce...@redhat.com>; Han Zhou <hz...@ovn.org>; Dan Winship <danwi...@redhat.com>; ovs-discuss <ovs-d...@openvswitch.org>; ovn-kub...@googlegroups.com; Michael Cambria <mcam...@redhat.com>
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

 


 

 

 

On Fri, May 22, 2020 at 8:39 AM Venugopal Iyer <venug...@nvidia.com> wrote:

[vi> ] yes, I believe that should work.

 

Do you think this works?

Regarding your question on lookup_arp(), today it looks up for the same IP-MAC binding, just avoid unnecessary updating if the pair already existed and not changed.

thanks,

 

-venu

 

Thanks,

Han


Girish Moodalbail

May 22, 2020, 6:56:25 PM5/22/20
to Han Zhou, Venugopal Iyer, Tim Rozet, Dumitru Ceara, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
I think this should work as well.

For the single join switch connected to 1000 GRs, it should work as well (assuming your other fix for dynamic learning is present as well). However, in this case, even with this option set, we will still be sending the ARP broadcast out from Node1 to each of the other 999 nodes. After the packets have travelled through the tunnel, we are going to drop them on the target hypervisor if `learn_from_arp_request=false`. As I understand it, we are waiting for a reply from @Dumitru Ceara to understand why such a flow is required, correct?

Regards,
~Girish

Dumitru Ceara

May 25, 2020, 6:55:08 AM5/25/20
to Girish Moodalbail, Han Zhou, Venugopal Iyer, Tim Rozet, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
> from @Dumitru Ceara to understand why such a
> flow is required, correct?
>

As Han pointed out, commit 32f5ebb062 ("ovn-northd: Limit ARP/ND
broadcast domain whenever possible.") added logical flows in the LS
S_SWITCH_IN_L2_LKUP stage to explicitly flood ARP/ND requests originated
from router owned IP interfaces. This was done for a couple of reasons:

1. ARP requests for destinations/next-hops outside OVN need to be
flooded in the broadcast domain anyway and would otherwise match the
lowest priority rule in S_SWITCH_IN_L2_LKUP that would flood them
nevertheless.

2. OVN sends periodic GARP requests for router owned IPs (i.e., NAT
addresses and logical_router_port addresses) to update external
switch/router FDB/ARP caches in scenarios like VM migration:
6bfbb4c24187 ("ovn: Send GARP on localnet."). These packets should be
flooded in the broadcast domain too.

I think we have a few options:

1. Change OVN behavior and use GARP replies instead of GARP requests.
The effect should be (almost [1]) the same from the external devices
perspective but the advantage is that we can completely remove the
logical flows that match on self originated ARP packets. This is quite
easy to achieve and I have a patch ready for it if we decide to go this way.

2. Make the flows that match on self originated ARP traffic more
explicit and restrict them to GARP requests. For example, for a logical
router port with addresses MAC, IP1, IP2 and NAT entries with
external_mac MAC-E and external IP IP-E:

Right now we have a flow:
if "eth.src == {MAC, MAC-E} && (arp_req || nd_ns)" then "flood"

We could instead create:
if "eth.src == MAC && arp.tpa == {IP1, IP2} && arp_req" then "flood"
if "eth.src == MAC-E && arp.tpa == {IP-E} && arp_req" then "flood"

I would prefer option 1 above but I'd like to hear more opinions about
disadvantages of using GARP replies instead of GARP requests for OVN
owned IP addresses.

Option 2 is also relatively straightforward to implement but will
generate a few more logical flows, still O(N) though, with N="number of
logical routers connected to the logical switch".
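
As a rough illustration of option 2, the narrower match strings could be generated per logical router port along these lines. This is a sketch in Python, not ovn-northd code: the function name and the shape of `nat_entries` are assumptions made for the example, and `arp_req` is kept as the shorthand used above rather than exact logical flow syntax.

```python
def narrowed_flood_matches(mac, ips, nat_entries):
    """Build the option 2 match strings for one logical router port.

    mac:         MAC address of the logical router port
    ips:         IP addresses configured on that port
    nat_entries: list of (external_mac, external_ip) pairs (assumed shape)
    """
    # One flow for the port's own addresses...
    matches = [f"eth.src == {mac} && arp.tpa == {{{', '.join(ips)}}} && arp_req"]
    # ...plus one flow per NAT entry that has its own external MAC.
    for ext_mac, ext_ip in nat_entries:
        matches.append(f"eth.src == {ext_mac} && arp.tpa == {{{ext_ip}}} && arp_req")
    return matches


if __name__ == "__main__":
    for match in narrowed_flood_matches("MAC", ["IP1", "IP2"], [("MAC-E", "IP-E")]):
        print(match)
```

Each port contributes one flow plus one per NAT entry, so the total stays linear in the number of routers on the switch but grows with the number of NAT entries.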

Thanks,
Dumitru

[1] https://tools.ietf.org/html/rfc5227#page-15


> Regards,
> ~Girish

Venugopal Iyer

May 26, 2020, 12:23:10 PM5/26/20
to Dumitru Ceara, Girish Moodalbail, Han Zhou, Tim Rozet, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
Hi, Dumitru:

-----Original Message-----
From: Dumitru Ceara <dce...@redhat.com>
Sent: Monday, May 25, 2020 3:55 AM
To: Girish Moodalbail <gmood...@gmail.com>; Han Zhou <zho...@gmail.com>
Cc: Venugopal Iyer <venug...@nvidia.com>; Tim Rozet <tro...@redhat.com>; Han Zhou <hz...@ovn.org>; Dan Winship <danwi...@redhat.com>; ovs-discuss <ovs-d...@openvswitch.org>; ovn-kub...@googlegroups.com; Michael Cambria <mcam...@redhat.com>
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table



[vi> ] I'd prefer 1. too, unless we need to think about external devices, if at all, that don't support unsolicited replies .. in that case, we'll need to wait till their cache times out..

Thanks,

-venu

Girish Moodalbail

May 26, 2020, 3:09:03 PM5/26/20
to Dumitru Ceara, Han Zhou, Venugopal Iyer, Tim Rozet, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
Hello Dumitru,

There are several things that are being discussed on this thread. Let me see if I can tease them out for clarity.

1. All the router IPs are known to OVN (the join switch case)
2. Some IPs are known and some are not known (the external logical switch that connects to physical network case).

Let us look at each of the cases above:

1. Join Switch Case

+----------------+        +----------------+
|   l3gateway    |        |   l3gateway    |
|    router2     |        |    router3     |
+-------------+--+        +-+--------------+
            IP2,M2         IP3,M3          
              |             |                            
           +--+-------------+---+          
           |    join switch     |          
           +---------+----------+          
                     |                      
                  IP1,M1                    
             +-------+--------+            
             |  distributed   |            
             |     router     |            
             +----------------+      


Say, GR router2 wants to send the packet out to DR and that we don't have static mappings of MAC to IP in lr_in_arp_resolve table on GR router2 (with Han's patch of dynamic_neigh_routes=true for all the Gateway Routers). With this in mind, when an ARP request is sent out by router2's hypervisor the packet should be directly sent to the distributed router alone. Your commit 32f5ebb0622 (ovn-northd: Limit ARP/ND broadcast domain whenever possible) should have allowed only unicast. However, in ls_in_l2_lkup table we have

  table=19(ls_in_l2_lkup      ), priority=80   , match=(eth.src == { M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
  table=19(ls_in_l2_lkup      ), priority=75   , match=(flags[1] == 0 && arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2"; output;)

As you can see, the `priority=80` rule will always be hit, so the request is sent out to all the GRs; the `priority=75` rule is never hit. As a result, we will see ARP packets on the GENEVE tunnel. We therefore need to change the `priority=80` rule to match only GARP request packets. That way, we don't broadcast in the case where the IPs are known to OVN.
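
The shadowing can be reproduced with a toy priority lookup in Python. This is a simplified model, not how ovn-controller evaluates flows: packets are plain dictionaries, the `flags[1] == 0` and `nd_ns` conditions are left out, the action names are placeholders, and the narrowed priority-80 rule (matching only requests whose TPA is router2's own IP2) is one assumed way to express "GARP request only".

```python
def lookup(flows, pkt):
    """Return the action of the highest-priority flow whose match accepts pkt."""
    for _prio, match, action in sorted(flows, key=lambda f: f[0], reverse=True):
        if match(pkt):
            return action
    return "drop"


# Today's flows: any ARP request sourced from M2 is flooded.
CURRENT_FLOWS = [
    (80, lambda p: p["eth_src"] == "M2" and p["arp_op"] == 1, "flood"),
    (75, lambda p: p["arp_op"] == 1 and p["arp_tpa"] == "IP1", "unicast"),
]

# Assumed narrowing: flood only when router2 asks about its own IP (a GARP).
NARROWED_FLOWS = [
    (80, lambda p: p["eth_src"] == "M2" and p["arp_op"] == 1
         and p["arp_tpa"] == "IP2", "flood"),
    CURRENT_FLOWS[1],
]

arp_for_dr = {"eth_src": "M2", "arp_op": 1, "arp_tpa": "IP1"}
garp_from_gr2 = {"eth_src": "M2", "arp_op": 1, "arp_tpa": "IP2"}

if __name__ == "__main__":
    print(lookup(CURRENT_FLOWS, arp_for_dr))      # request for the DR, current flows
    print(lookup(NARROWED_FLOWS, arp_for_dr))     # same request, narrowed flows
    print(lookup(NARROWED_FLOWS, garp_from_gr2))  # GARP, narrowed flows
```

In this model the request for IP1 is flooded by the current flows and unicast by the narrowed ones, while the GARP from router2 is still flooded.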

2. External Logical Switch Case

                       10.10.10.0/24                        
   -------------------------+--------------------------    
                            |                              
                         localnet                          
                      +-----+-----+                        
                      | external  |                        
         +------------+    LS1    +-------------+          
         |            +-----+-----+             |          
         |                  |                   |          
     10.10.10.2         10.10.10.3          10.10.10.4      
        SNAT               SNAT                SNAT        
   +-----+-----+      +-----+-----+       +-----------+    
   | l3gateway |      | l3gateway |       | l3gateway |    
   |   node1   |      |   node2   |       |   node3   |    
   +-----------+      +-----------+       +-----------+   

In this case, we have some of the IPs in OVN and some in the physical network. If we fix (1) above, all the ARP requests for OVN's router IPs will be unicast. However, all the ARP requests to external IPs, say 10.10.10.1 on the "physical router", will be broadcast, and we will see these ARP broadcasts on all the L3 gateway routers. With 'learn_from_arp_request=false' [a], the MAC_Binding table will not explode for either ARP or GARP requests.

So, I don't think GARP requests versus replies is the issue here? Furthermore, learning from GARP replies is blocked on certain routers. For example, https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html says "By default, updating the ARP cache on GARP replies is disabled on the router." So, our NAT address mappings will not be learnt.

Regards,
~Girish


[a] - From Han's mail, the meaning of learn_from_arp_request=false --> if the TPA is on the router, add a new entry (it means the remote wants to communicate with this node, so it makes sense to learn the remote as well). Otherwise, ignore it and no new entry added.



Han Zhou

May 26, 2020, 3:42:54 PM5/26/20
to Girish Moodalbail, Dumitru Ceara, Venugopal Iyer, Tim Rozet, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
Hi Girish,

Thanks for the summary. I agree with you that GARP request vs. reply is irrelevant to the problem here.
Please see my comment inline below.

On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail <gmood...@gmail.com> wrote:
>
> Hello Dumitru,
>
> There are several things that are being discussed on this thread. Let me see if I can tease them out for clarity.
>
> 1. All the router IPs are known to OVN (the join switch case)
> 2. Some IPs are known and some are not known (the external logical switch that connects to physical network case).
>
> Let us look at each of the case above:
>
> 1. Join Switch Case
>
> +----------------+        +----------------+
> |   l3gateway    |        |   l3gateway    |
> |    router2     |        |    router3     |
> +-------------+--+        +-+--------------+
>             IP2,M2         IP3,M3          
>               |             |                            
>            +--+-------------+---+          
>            |    join switch     |          
>            +---------+----------+          
>                      |                      
>                   IP1,M1                    
>              +-------+--------+            
>              |  distributed   |            
>              |     router     |            
>              +----------------+      
>
>
> Say, GR router2 wants to send the packet out to DR and that we don't have static mappings of MAC to IP in lr_in_arp_resolve table on GR router2 (with Han's patch of dynamic_neigh_routes=true for all the Gateway Routers). With this in mind, when an ARP request is sent out by router2's hypervisor the packet should be directly sent to the distributed router alone. Your commit 32f5ebb0622 (ovn-northd: Limit ARP/ND broadcast domain whenever possible) should have allowed only unicast. However, in ls_in_l2_lkup table we have
>
>   table=19(ls_in_l2_lkup      ), priority=80   , match=(eth.src == { M2 } && (arp.op == 1 || nd_ns)), action=(outport = "_MC_flood"; output;)
>   table=19(ls_in_l2_lkup      ), priority=75   , match=(flags[1] == 0 && arp.op == 1 && arp.tpa == { IP1}), action=(outport = "jtor-router2"; output;)
>
> As you can see, `priority=80` rule will always be hit and sent out to all the GRs. The `priority=75` rule is never hit. So, we will see ARP packets on the GENEVE tunnel. So, we need to change `priority=80` to match GARP request packets. That way, for the known OVN IPs case we don't do broadcast.

Since the solution to case 2) below (i.e. learn_from_arp_request=false) solves the problem of case 1) too, I think we don't need this change just for case 1). As @Dumitru Ceara mentioned, there is some cost because it adds extra flows. It would be a significant number of flows if there are a lot of snat_and_dnat IPs. What do you think?

Girish Moodalbail

May 26, 2020, 4:07:15 PM5/26/20
to Han Zhou, Dumitru Ceara, Venugopal Iyer, Tim Rozet, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
Han, yes it will work. However, my only concern is that we would send all these ARP requests via the tunnel to each of the 1000 hypervisors, and these hypervisors will just drop them on the floor when they see learn_from_arp_request=false.

Han, Dumitru,

Why can't we swap the priorities of the above two flows so that the ARP request for a next-hop IP known to OVN will always be sent via `unicast`?

Regards,
~Girish

Han Zhou

May 26, 2020, 5:52:04 PM5/26/20
to Girish Moodalbail, Dumitru Ceara, Venugopal Iyer, Tim Rozet, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
I think maybe it is not a problem since it happens only once on the join switch. Once the MAC is learned, it won't broadcast again. It may be more of a problem on the external LS if periodic GARP is required there. However, I'd suggest running some tests to see whether it is really a problem before trying to solve it.

>
> Han, Dumitru,
>
> Why can't we swap the priorities of the above two flows so that the ARP request for NexHop IP known to OVN will be always sent via `unicast`?

If swapped, even GARP won't get broadcast. Maybe that's not the desired behavior.

>
> Regards,
> ~Girish
>
>>
>> >
>> > 2. External Logical Switch Case
>> >
>> >                        10.10.10.0/24                        
>> >    -------------------------+--------------------------    
>> >                             |                              
>> >                          localnet                          
>> >                       +-----+-----+                        
>> >                       | external  |                        
>> >          +------------+    LS1    +-------------+          
>> >          |            +-----+-----+             |          
>> >          |                  |                   |          
>> >      10.10.10.2         10.10.10.3          10.10.10.4      
>> >         SNAT               SNAT                SNAT        
>> >    +-----+-----+      +-----+-----+       +-----------+    
>> >    | l3gateway |      | l3gateway |       | l3gateway |    
>> >    |   node1   |      |   node2   |       |   node3   |    
>> >    +-----------+      +-----------+       +-----------+  
>> >
>> > In this case, we have some of the IPs in OVN and some in the physical network. If we fix (1) above, all the ARP requests for the OVN's router IPs will be unicast. However, all the ARP requests to external IPs, say 10.10.10.1 on the "physical router", will be broadcast. Now, we will see these ARP broadcasts on all the L3 gateway routers. With 'learn_from_arp_request=false' [a], then the MAC_Binding table will not explode for both ARP and GARP requests.
>> >
>> > So, I don't think GARP requests and replies are the issue here? Furthermore, learning from GARP replies is blocked on certain routers. For example, https://www.juniper.net/documentation/en_US/junose15.1/topics/concept/ip-gratuitous-arps-transmission-overview.html says "By default, updating the ARP cache on GARP replies is disabled on the router." So, our NAT address mappings will not be learnt.
>> >
>> > Regards,
>> > ~Girish
>> >
>> >
>> > [a] - From Han's mail, the meaning of learn_from_arp_request=false --> if the TPA is on the router, add a new entry (it means the
>> > >     remote wants to communicate with this node, so it makes sense to
>> > >     learn the remote as well). Otherwise, ignore it and no new entry added.
>> >
>> >
>> >
>

Dumitru Ceara

May 27, 2020, 4:10:31 AM5/27/20
to Han Zhou, Girish Moodalbail, Venugopal Iyer, Tim Rozet, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
Hi Girish, Han,

On 5/26/20 11:51 PM, Han Zhou wrote:
>
>
> On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail <gmood...@gmail.com
> <mailto:gmood...@gmail.com>> wrote:
>>
>>
>>
>> On Tue, May 26, 2020 at 12:42 PM Han Zhou <zho...@gmail.com
> <mailto:zho...@gmail.com>> wrote:
>>>
>>> Hi Girish,
>>>
>>> Thanks for the summary. I agree with you that GARP request v.s. reply
> is irrelevant to the problem here.

Well, actually I think GARP request vs reply is relevant (at least for
case 1 below) because if OVN would be generating GARP replies we
wouldn't need the priority 80 flow to determine if an ARP request packet
is actually an OVN self originated GARP that needs to be flooded in the
L2 broadcast domain.

On the other hand, router3 would be learning mac_binding IP2,M2 from the
GARP reply originated by router2 and vice versa so we'd have to restrict
flooding of GARP replies to non-patch ports.

I think the following might be a solution, although at the cost of
adding as many flows as there are dnat_and_snat IPs configured:

- priority 80: explicitly determine if an ARP request is a self
originated GARP for configured IP addresses and dnat_and_snat IPs (by
matching on all eth.src and arp.tpa pairs) and if so flood on all
non-patch ports.
- priority 75: if arp.tpa is owned by an OVN logical router port,
"unicast" it only on the patch port towards the router.
- priority 1: flood any broadcast packet.

Together with the learn_from_arp_request=false knob this would cover
both case 1 (join switch) and case 2 (external switch).
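
As a rough illustration of the three-tier proposal above, the sketch below models how a logical switch would pick the output ports for a broadcast ARP request. The `Switch` class, the port names and the MAC/IP values are hypothetical and only mimic the priority ordering; this is not actual OVN logical flow syntax or ovn-northd code.

```python
from dataclasses import dataclass


@dataclass
class Switch:
    """Hypothetical model of a logical switch; not an OVN data structure."""
    router_ips: dict   # IP -> (MAC, patch port) for IPs owned by OVN router ports
    patch_ports: set   # switch ports connected to OVN routers
    other_ports: set   # non-patch ports (e.g. localnet, VIFs)


def handle_arp_request(sw, eth_src, arp_tpa):
    """Return (priority, output ports) for a broadcast ARP request.

    Ingress-port exclusion is omitted to keep the sketch short.
    """
    owner = sw.router_ips.get(arp_tpa)
    # priority 80: self-originated GARP, i.e. eth.src and arp.tpa form a
    # configured (MAC, IP) pair -> flood on non-patch ports only.
    if owner is not None and owner[0] == eth_src:
        return 80, set(sw.other_ports)
    # priority 75: arp.tpa is owned by an OVN router port -> "unicast" it
    # on the patch port towards that router.
    if owner is not None:
        return 75, {owner[1]}
    # priority 1: any other broadcast packet is flooded.
    return 1, sw.patch_ports | sw.other_ports


sw = Switch(
    router_ips={"100.64.0.2": ("m2", "jtor-gr1"), "100.64.0.3": ("m3", "jtor-gr2")},
    patch_ports={"jtor-gr1", "jtor-gr2"},
    other_ports={"ext-port"},
)
garp = handle_arp_request(sw, "m3", "100.64.0.3")     # GR2 announcing its own IP
resolve = handle_arp_request(sw, "m2", "100.64.0.3")  # GR1 resolving GR2's IP
unknown = handle_arp_request(sw, "m2", "10.10.10.1")  # IP not owned by OVN
```

In this model only the `unknown` request reaches every router port, and that is the traffic the receiving routers would ignore with learn_from_arp_request=false.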

Wdyt?

>>
>>
>> Han, yes it will work. However, my only concern is that we would send
> all these ARP requests via tunnel to each of 1000 hypervisors and these
> hypervisors will just drop them on the floor. when they see
> learn_from_arp_request=false.
>
> I think maybe it is not a problem since it happens only once on the Join
> switch. Once the MAC is learned, it won't broadcast again. It may be
> more of a problem on the external LS if periodical GARP is required
> there. However, I'd suggest to have some test and see if it is really a
> problem, before trying to solve it.
>
>>
>> Han, Dumitru,
>>
>> Why can't we swap the priorities of the above two flows so that the
> ARP request for a next-hop IP known to OVN will always be sent via `unicast`?
>
> If swapped, even GARP won't get broadcasted. Maybe that's not the
> desired behavior.
>

This is definitely not desired as we'd be hitting the prio 75 flow that
would send the self originated GARP request (IPx) packet back towards
the router port that owns IPx.

>>
>> Regards,
>> ~Girish
>>
>>>
>>> >
>>> > 2. External Logical Switch Case
>>> >
>>> >                        10.10.10.0/24

Just as a side note, the above doesn't mean Juniper boxes don't support
learning from GARP replies, just that they'd need extra configuration. I
don't necessarily think that's a bad thing if properly documented in OVN
that we would be generating GARP replies.

Regards,
Dumitru

>>> >
>>> > Regards,
>>> > ~Girish
>>> >
>>> >
>>> > [a] - From Han's mail, the meaning of learn_from_arp_request=false
> --> if the TPA is on the router, add a new entry (it means the
>>> > >     remote wants to communicate with this node, so it makes sense to
>>> > >     learn the remote as well). Otherwise, ignore it and no new
> entry added.
>>> >
>>> >
>>> >
>>

Han Zhou

May 28, 2020, 2:35:08 AM5/28/20
to Dumitru Ceara, Girish Moodalbail, Venugopal Iyer, Tim Rozet, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria


On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara <dce...@redhat.com> wrote:
>
> Hi Girish, Han,
>
> On 5/26/20 11:51 PM, Han Zhou wrote:
> >
> >
> > On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail <gmood...@gmail.com
> > <mailto:gmood...@gmail.com>> wrote:
> >>
> >>
> >>
> >> On Tue, May 26, 2020 at 12:42 PM Han Zhou <zho...@gmail.com
> > <mailto:zho...@gmail.com>> wrote:
> >>>
> >>> Hi Girish,
> >>>
> >>> Thanks for the summary. I agree with you that GARP request v.s. reply
> > is irrelevant to the problem here.
>
> Well, actually I think GARP request vs reply is relevant (at least for
> case 1 below) because if OVN would be generating GARP replies we
> wouldn't need the priority 80 flow to determine if an ARP request packet
> is actually an OVN self originated GARP that needs to be flooded in the
> L2 broadcast domain.
>
> On the other hand, router3 would be learning mac_binding IP2,M2 from the
> GARP reply originated by router2 and vice versa so we'd have to restrict
> flooding of GARP replies to non-patch ports.
>

Hi Dumitru, the point was that, on the external LS, the GRs will have to send ARP requests to resolve unknown IPs (at least for the external GW), and these have to be broadcast, which will cause all the GRs to learn the MACs of all the other GRs. This is regardless of the GARP behavior. You are right that if we only consider the Join switch then GARP request vs. reply does make a difference. However, GARP request/reply may really be needed only on the external LS.

Would the "learn_from_arp_request=false" knob cover both cases? If yes, we don't need to add more priority-80 flows, or more accurately: whether to update the priority-80 flows is not directly related to the current problem.

Dumitru Ceara

May 28, 2020, 4:50:40 AM5/28/20
to Han Zhou, Girish Moodalbail, Venugopal Iyer, Tim Rozet, Han Zhou, Dan Winship, ovs-discuss, ovn-kub...@googlegroups.com, Michael Cambria
On 5/28/20 8:34 AM, Han Zhou wrote:
>
>
> On Wed, May 27, 2020 at 1:10 AM Dumitru Ceara <dce...@redhat.com
> <mailto:dce...@redhat.com>> wrote:
>>
>> Hi Girish, Han,
>>
>> On 5/26/20 11:51 PM, Han Zhou wrote:
>> >
>> >
>> > On Tue, May 26, 2020 at 1:07 PM Girish Moodalbail
> <gmood...@gmail.com <mailto:gmood...@gmail.com>
>> > <mailto:gmood...@gmail.com <mailto:gmood...@gmail.com>>> wrote:
>> >>
>> >>
>> >>
>> >> On Tue, May 26, 2020 at 12:42 PM Han Zhou <zho...@gmail.com
> <mailto:zho...@gmail.com>
>> > <mailto:zho...@gmail.com <mailto:zho...@gmail.com>>> wrote:
>> >>>
>> >>> Hi Girish,
>> >>>
>> >>> Thanks for the summary. I agree with you that GARP request v.s. reply
>> > is irrelevant to the problem here.
>>
>> Well, actually I think GARP request vs reply is relevant (at least for
>> case 1 below) because if OVN would be generating GARP replies we
>> wouldn't need the priority 80 flow to determine if an ARP request packet
>> is actually an OVN self originated GARP that needs to be flooded in the
>> L2 broadcast domain.
>>
>> On the other hand, router3 would be learning mac_binding IP2,M2 from the
>> GARP reply originated by router2 and vice versa so we'd have to restrict
>> flooding of GARP replies to non-patch ports.
>>
>
> Hi Dumitru, the point was that, on the external LS, the GRs will have to
> send ARP requests to resolve unknown IPs (at least for the external GW),
> and it has to be broadcasted, which will cause all the GRs learn all
> MACs of other GRs. This is regardless of the GARP behavior. You are
> right that if we only consider the Join switch then the GARP request
> v.s. reply does make a difference. However, GARP request/reply may be
> really needed only on the external LS.
>

Ok, but do you see an easy way to determine if we need to add the
logical flows that flood self originated GARP packets on a given logical
switch? Right now we add them on all switches.

>> >>> Please see my comment inline below.
>> >>>
>> >>> On Tue, May 26, 2020 at 12:09 PM Girish Moodalbail
>> > <gmood...@gmail.com <mailto:gmood...@gmail.com>
Yes, it would, except for the fact that the ARP requests would still be
flooded to all routers (and ignored at the destination), which is afaiu
what Girish was worried about. In order to address that part too, I'm
afraid we have to update the priority-80 flows.

Regards,
Dumitru

Dumitru Ceara

May 28, 2020, 7:26:53 AM5/28/20
to Daniel Alvarez Sanchez, Han Zhou, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria, Tim Rozet, Venugopal Iyer
On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
> Hi all
>
> Sorry for top posting. I want to thank you all for the discussion and
> give also some feedback from OpenStack perspective which is affected
> by the problem described here.
>
> In OpenStack, it's kind of common to have a shared external network
> (logical switch with a localnet port) across many tenants. Each tenant
> user may create their own router where their instances will be
> connected to access the external network.
>
> In such scenario, we are hitting the issue described here. In
> particular in our tests we exercise 3K VIFs (with 1 FIP) each spanning
> 300 LS; each LS connected to a LR (ie. 300 LRs) and that router
> connected to the public LS. This is creating a huge problem in terms
> of performance and tons of events due to the MAC_Binding entries
> generated as a consequence of the GARPs sent for the floating IPs.
>

Just as an addition to this, GARPs wouldn't be the only reason why all
routers would learn the MAC_Binding. Even if we wouldn't be sending
GARPs for the FIPs, when a VM that's behind a FIP would send traffic to
the outside, the router will generate an ARP request for the next hop
using the FIP-IP and FIP-MAC. This will be broadcasted to all routers
connected to the public LS and will trigger them to learn the
FIP-IP:FIP-MAC binding.

> Thanks,
> Daniel

Tim Rozet

May 28, 2020, 10:33:25 AM5/28/20
to Dumitru Ceara, Daniel Alvarez Sanchez, Han Zhou, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria, Venugopal Iyer
On Thu, May 28, 2020 at 7:26 AM Dumitru Ceara <dce...@redhat.com> wrote:
On 5/28/20 12:48 PM, Daniel Alvarez Sanchez wrote:
> Hi all
>
> Sorry for top posting. I want to thank you all for the discussion and
> give also some feedback from OpenStack perspective which is affected
> by the problem described here.
>
> In OpenStack, it's kind of common to have a shared external network
> (logical switch with a localnet port) across many tenants. Each tenant
> user may create their own router where their instances will be
> connected to access the external network.
>
> In such scenario, we are hitting the issue described here. In
> particular in our tests we exercise 3K VIFs (with 1 FIP) each spanning
> 300 LS; each LS connected to a LR (ie. 300 LRs) and that router
> connected to the public LS. This is creating a huge problem in terms
> of performance and tons of events due to the MAC_Binding entries
> generated as a consequence of the GARPs sent for the floating IPs.
>

Just as an addition to this, GARPs wouldn't be the only reason why all
routers would learn the MAC_Binding. Even if we wouldn't be sending
GARPs for the FIPs, when a VM that's behind a FIP would send traffic to
the outside, the router will generate an ARP request for the next hop
using the FIP-IP and FIP-MAC. This will be broadcasted to all routers
connected to the public LS and will trigger them to learn the
FIP-IP:FIP-MAC binding.

Yeah we shouldn't be learning on regular ARP requests.

Girish Moodalbail

Jun 3, 2020, 6:32:19 PM6/3/20
to Tim Rozet, Dumitru Ceara, Daniel Alvarez Sanchez, Han Zhou, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria, Venugopal Iyer
Hello all,

To kind of proceed with the proposed fixes, with minimal impact, is the following a reasonable approach?
  1. Add an option, namely dynamic_neigh_routers={true|false}, for a gateway router. With this option enabled, the next-hop IP's MAC will be learned through an ARP request on the physical network. The ARP request will be flooded on the L2 broadcast domain (for both join switch and external switch).

  2. Add an option, namely learn_from_arp_request={true|false}, for a gateway router. The option is interpreted as below:
    "true" - learn the MAC/IP binding and add a new MAC_Binding entry (default behavior)
    "false" - do not add a new MAC_Binding entry; only if there is already a MAC_Binding for that IP and the MAC is different, update that MAC/IP binding. The external entity might be trying to advertise the new MAC for that IP. (If we don't do this, then we will never learn External VIP to MAC changes)

    (Irrespective of whether learn_from_arp_request is true or false, always do this -- if the TPA is on the router, add a new entry (it means the remote wants to communicate with this node, so it makes sense to learn the remote as well))

For now, I think it is fine for ARP packets to be broadcast on the tunnel for the `join` switch case. If it becomes a problem, then we can start looking at changing the logical flows.
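
The learn_from_arp_request rules proposed above can be sketched roughly as follows. This is only a model of the decision described in this mail, covering ARP requests only; the function name, its parameters and the plain dict standing in for the MAC_Binding table are assumptions for illustration, not the actual OVN implementation.

```python
def on_arp_request(mac_binding, router_ips, learn_from_arp_request, spa, sha, tpa):
    """Apply one received ARP request to mac_binding (IP -> MAC), in place.

    spa/sha are the sender IP/MAC, tpa is the target IP.
    Returns True if the table changed.
    """
    already_known = spa in mac_binding
    if learn_from_arp_request or tpa in router_ips or already_known:
        # "true": always learn. "false": learn only when the request targets
        # one of this router's IPs, or refresh an entry that already exists.
        changed = mac_binding.get(spa) != sha
        mac_binding[spa] = sha
        return changed
    # "false" and the request is for someone else: no new entry is added.
    return False


router_ips = {"10.10.10.2"}  # hypothetical IPs owned by this gateway router
table = {}
# Request from 10.10.10.3 for the external router: ignored.
ignored = on_arp_request(table, router_ips, False, "10.10.10.3", "m3", "10.10.10.1")
# Request from 10.10.10.3 for this router's own IP: learned.
learned = on_arp_request(table, router_ips, False, "10.10.10.3", "m3", "10.10.10.2")
# Same IP later announces a different MAC: existing entry is updated.
updated = on_arp_request(table, router_ips, False, "10.10.10.3", "m3-new", "10.10.10.1")
```

With the option left at "true", the first request above would also have created an entry, which is the behavior that fills the MAC_Binding table on a shared switch.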

Thanks everyone for the lively discussion.

Regards,
~Girish


Han Zhou

Jun 3, 2020, 7:27:02 PM6/3/20
to Girish Moodalbail, Tim Rozet, Dumitru Ceara, Daniel Alvarez Sanchez, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria, Venugopal Iyer
Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry that I forgot to update here.

On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail <gmood...@gmail.com> wrote:
>
> Hello all,
>
> To kind of proceed with the proposed fixes, with minimal impact, is the following a reasonable approach?
>
> Add an option, namely dynamic_neigh_routers={true|false}, for a gateway router. With this option enabled, the next-hop IP's MAC will be learned through an ARP request on the physical network. The ARP request will be flooded on the L2 broadcast domain (for both join switch and external switch).
>

I am working on the formal patch.

> Add an option, namely learn_from_arp_request={true|false}, for a gateway router. The option is interpreted as below:\
> "true" - learn the MAC/IP binding and add a new MAC_Binding entry (default behavior)
> "false" - if there is a MAC_binding for that IP and the MAC is different, then update that MAC/IP binding. The external entity might be trying to advertise the new MAC for that IP. (If we don't do this, then we will never learn External VIP to MAC changes)
>
> (Irrespective of, learn_from_arp_request is true or false, always do this -- if the TPA is on the router, add a new entry (it means the remote wants to communicate with this node, so it makes sense to learn the remote as well))
>

I am working on this as well, but delayed a little. I hope to have something this week.

Girish Moodalbail

Jun 3, 2020, 8:04:10 PM6/3/20
to Han Zhou, Tim Rozet, Dumitru Ceara, Daniel Alvarez Sanchez, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria, Venugopal Iyer
No worries, thanks for the update Han.

Once you have the patch, we can test your changes on our cluster and provide you an update.

Regards,
~Girish

Venugopal Iyer

Jun 9, 2020, 12:06:24 PM6/9/20
to Han Zhou, Girish Moodalbail, Tim Rozet, Dumitru Ceara, Daniel Alvarez Sanchez, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria

Sorry for the delay, Han, a quick question below:

 

Sent: Wednesday, June 3, 2020 4:27 PM
To: Girish Moodalbail <gmood...@gmail.com>
Cc: Tim Rozet <tro...@redhat.com>; Dumitru Ceara <dce...@redhat.com>; Daniel Alvarez Sanchez <dalv...@redhat.com>; Dan Winship <danwi...@redhat.com>; ovn-kub...@googlegroups.com; ovs-discuss <ovs-d...@openvswitch.org>; Michael Cambria <mcam...@redhat.com>; Venugopal Iyer <venug...@nvidia.com>
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

 


Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry that I forgot to update here.


On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail <gmood...@gmail.com> wrote:
>
> Hello all,
>
> To kind of proceed with the proposed fixes, with minimal impact, is the following a reasonable approach?
>
> Add an option, namely dynamic_neigh_routers={true|false}, for a gateway router. With this option enabled, the next-hop IP's MAC will be learned through an ARP request on the physical network. The ARP request will be flooded on the L2 broadcast domain (for both join switch and external switch).

> 

 

I am working on the formal patch.

 

> Add an option, namely learn_from_arp_request={true|false}, for a gateway router. The option is interpreted as below:\
> "true" - learn the MAC/IP binding and add a new MAC_Binding entry (default behavior)
> "false" - if there is a MAC_binding for that IP and the MAC is different, then update that MAC/IP binding. The external entity might be trying to advertise the new MAC for that IP. (If we don't do this, then we will never learn External VIP to MAC changes)
>
> (Irrespective of, learn_from_arp_request is true or false, always do this -- if the TPA is on the router, add a new entry (it means the remote wants to communicate with this node, so it makes sense to learn the remote as well))

> 

 

I am working on this as well, but delayed a little. I hope to have something this week.

[vi> ] Just wanted to check if this should be learn_From_unsolicit_arp (unsolicited ARP request or reply) instead of learn_from_arp_request? This is just to protect against potential rogue use of GARP replies flooding the MAC bindings.

 

Thanks,

 

-venu

Han Zhou

Jun 9, 2020, 1:05:02 PM6/9/20
to Venugopal Iyer, Girish Moodalbail, Tim Rozet, Dumitru Ceara, Daniel Alvarez Sanchez, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria
On Tue, Jun 9, 2020 at 9:06 AM Venugopal Iyer <venug...@nvidia.com> wrote:

Sorry for the delay, Han, a quick question below:

 

From: ovn-kub...@googlegroups.com <ovn-kub...@googlegroups.com> On Behalf Of Han Zhou
Sent: Wednesday, June 3, 2020 4:27 PM
To: Girish Moodalbail <gmood...@gmail.com>
Cc: Tim Rozet <tro...@redhat.com>; Dumitru Ceara <dce...@redhat.com>; Daniel Alvarez Sanchez <dalv...@redhat.com>; Dan Winship <danwi...@redhat.com>; ovn-kub...@googlegroups.com; ovs-discuss <ovs-d...@openvswitch.org>; Michael Cambria <mcam...@redhat.com>; Venugopal Iyer <venug...@nvidia.com>
Subject: Re: [ovs-discuss] [OVN] flow explosion in lr_in_arp_resolve table

 


Hi Girish, yes, that's what we concluded in last OVN meeting, but sorry that I forgot to update here.


On Wed, Jun 3, 2020 at 3:32 PM Girish Moodalbail <gmood...@gmail.com> wrote:
>
> Hello all,
>
> To kind of proceed with the proposed fixes, with minimal impact, is the following a reasonable approach?
>
> Add an option, namely dynamic_neigh_routers={true|false}, for a gateway router. With this option enabled, the next-hop IP's MAC will be learned through an ARP request on the physical network. The ARP request will be flooded on the L2 broadcast domain (for both join switch and external switch).

> 

 

I am working on the formal patch.

 

> Add an option, namely learn_from_arp_request={true|false}, for a gateway router. The option is interpreted as below:\
> "true" - learn the MAC/IP binding and add a new MAC_Binding entry (default behavior)
> "false" - if there is a MAC_binding for that IP and the MAC is different, then update that MAC/IP binding. The external entity might be trying to advertise the new MAC for that IP. (If we don't do this, then we will never learn External VIP to MAC changes)
>
> (Irrespective of, learn_from_arp_request is true or false, always do this -- if the TPA is on the router, add a new entry (it means the remote wants to communicate with this node, so it makes sense to learn the remote as well))

> 

 

I am working on this as well, but delayed a little. I hope to have something this week.

[vi> ] Just wanted to check if this should be learn_From_unsolicit_arp (unsolicited ARP request or reply) instead of learn_from_arp_request? This is just to protect against potential rogue use of GARP replies flooding the MAC bindings.

 


Hi Venu, as discussed earlier in this thread it is hard to check if it is GARP in OVN from the router ingress pipeline. The proposal here cares about ARP request only. It seems the best option so far.

Han Zhou

Jun 10, 2020, 3:03:52 PM6/10/20
to Venugopal Iyer, Girish Moodalbail, Tim Rozet, Dumitru Ceara, Daniel Alvarez Sanchez, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria
Hi Girish, Venu,

I sent an RFC patch series for the solution discussed. Could you give it a try when you get the chance?

Thanks,
Han

Han Zhou

Jun 10, 2020, 3:04:35 PM6/10/20
to Venugopal Iyer, Girish Moodalbail, Tim Rozet, Dumitru Ceara, Daniel Alvarez Sanchez, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria
On Wed, Jun 10, 2020 at 12:03 PM Han Zhou <zho...@gmail.com> wrote:
Hi Girish, Venu,

I sent an RFC patch series for the solution discussed. Could you give it a try when you get the chance?

Girish Moodalbail

Jun 10, 2020, 3:14:32 PM6/10/20
to Han Zhou, Venugopal Iyer, Tim Rozet, Dumitru Ceara, Daniel Alvarez Sanchez, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria
Thanks Han. We will give this a try on our cluster and get back to you soon.

Regards,
~Girish

Girish Moodalbail

Jul 13, 2020, 9:37:21 PM7/13/20
to Han Zhou, Venugopal Iyer, Tim Rozet, Dumitru Ceara, Daniel Alvarez Sanchez, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria
Hello Han,

On the #openvswitch IRC channel I had provided an update that your patch is working great on our test setup. That update was for the L3 Gateway Router option called learn_from_arp_request="true|false". With that option in place, the number of entries in the MAC_Binding table has dropped significantly.

However, I had not provided an update on the single join switch tests. Sincere apologies for the delay. We just got that code to work last week, and we have an update. This is for the option called dynamic_neigh_routers="true|false" on the L3 Gateway Router. It works as expected.  With that option in place, for all of the L3 Gateway Routers I see just 3 entries as expected:

  table=12(lr_in_arp_resolve  ), priority=500  , match=(ip4.mcast || ip6.mcast), action=(next;)
  table=12(lr_in_arp_resolve  ), priority=0    , match=(ip4), action=(get_arp(outport, reg0); next;)
  table=12(lr_in_arp_resolve  ), priority=0    , match=(ip6), action=(get_nd(outport, xxreg0); next;)

Before, on a 1000 node cluster with 1000 Gateway Routers we would see 1000 entries per Gateway Router and therefore a total of 1M entries in the cluster. Now, that is not the case.
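
For reference, the before/after numbers quoted above work out as in the small sketch below. It assumes, as in this message, one static lr_in_arp_resolve entry per peer Gateway Router before the change (the count doubles if IPv4 and IPv6 entries are both present) and the 3 entries listed above after it. The function name is only illustrative.

```python
def total_arp_resolve_flows(num_gateway_routers, flows_per_router):
    """Cluster-wide lr_in_arp_resolve flow count under the stated assumptions."""
    return num_gateway_routers * flows_per_router


# Before: every Gateway Router carries one entry per Gateway Router.
before = total_arp_resolve_flows(1000, 1000)
# After: dynamic_neigh_routers leaves only the 3 generic entries per router.
after = total_arp_resolve_flows(1000, 3)
print(before, after)
```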

Thank you!

Regards,
~Girish

Tim Rozet

Jul 14, 2020, 10:33:45 AM7/14/20
to Girish Moodalbail, Han Zhou, Venugopal Iyer, Dumitru Ceara, Daniel Alvarez Sanchez, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria
Thanks for the update Girish. Are you planning on submitting an ovn-k8s patch to enable these?

Tim Rozet
Red Hat CTO Networking Team

Girish Moodalbail

Jul 14, 2020, 11:32:02 AM7/14/20
to Tim Rozet, Han Zhou, Venugopal Iyer, Dumitru Ceara, Daniel Alvarez Sanchez, Dan Winship, ovn-kub...@googlegroups.com, ovs-discuss, Michael Cambria
Yes, we are going to submit a patch to the ovn-k8s repo to enable those options on L3 Gateway Routers. I am going to wait until these changes make it into the OVN repo before submitting, since I don't know whether these options will be renamed.

Regards,
~Girish

Han Zhou

Jul 14, 2020, 12:20:55 PM7/14/20
to Girish Moodalbail, Dan Winship, Daniel Alvarez Sanchez, Dumitru Ceara, Michael Cambria, Tim Rozet, Venugopal Iyer, ovn-kub...@googlegroups.com, ovs-discuss
Thanks Girish for the update! I will submit formal patches for ovn.

Regards,
Han