Intermittent issues MetalLB vs Cisco ACI


iohenkies -

Oct 20, 2020, 5:45:54 AM
to metallb-users
Hi all,

You might remember me as the idiot from a few weeks ago who forgot to open TCP port 7946 after updating from MetalLB 0.8 to 0.9, but it seems there are other things in play here as well :) We're still experiencing intermittent problems (packet loss on one of the MetalLB-provisioned IPs). This is probably not a MetalLB issue, but I would like to have some more eyes on this one, if I can.

We've got 3 RKE Kubernetes clusters with K8s 1.18.9 and MetalLB 0.9.3 running in Layer 2 mode and installed by manifest. Behind MetalLB is Nginx ingress 0.34.1 routing traffic to our applications. The only thing that is not default about the MetalLB configuration is that MetalLB pods are bound to 3 specific nodes with affinity and tolerations.

      nodeSelector:
        node-role.domain.com/core: "true"
      tolerations:
        - key: node-role.kubernetes.io/controlplane
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/etcd
          operator: Exists
          effect: NoExecute

All nodes are VMs on a large VMware cluster. They have one NIC attached, with one MAC and one IPv4 address. Each cluster has its own floating IP. I've got a simple test app curling a random pod on a random node every second via one of the MetalLB-provisioned IPs. When the problems occur, about 10% of packets are lost. Restarting the controller or speakers does not solve the issue. And this is also part of the problem: 3 weeks went by in which the problems did not occur. Last week I had 3 days of packet loss on 1 of the clusters, with the same symptoms. Now it is stable again.
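For anyone curious, the test is essentially just a loop along these lines (the IP and hostname below are placeholders, not my real ones):

# minimal sketch of the probe; VIP and Host header are placeholders
VIP=203.0.113.10
while true; do
  if ! curl -s -o /dev/null --max-time 2 -H 'Host: testapp.example.com' "http://${VIP}/"; then
    echo "$(date -Is) request failed"
  fi
  sleep 1
done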

Although there is no reason for it (all nodes and deployments are perfectly healthy), at the time of the problems I often see in the MetalLB logs that the floating IP is being switched from one node to another. 'Why' is the big question for me. At the time of the initial problems, about a month ago, it seemed to be resolved by opening up the port as mentioned before and by this:
https://www.cisco.com/c/en/us/td/docs/switches/datacenter/aci/apic/sw/4-x/L3-configuration/Cisco-APIC-Layer-3-Networking-Configuration-Guide-411/Cisco-APIC-Layer-3-Networking-Configuration-Guide-411_chapter_010010.html#id_94860

The network guys also saw the frequent switching on their end, and after the 'fix' from the above link it stopped. As did the problems, for 3 weeks. When they resurfaced last week, without anything changing on the cluster or network side, we scheduled a call with Cisco enterprise support. The backend is all Cisco ACI. And this is probably the most important question I have for you: are there NO known problems with Cisco ACI and MetalLB, or K8s CNIs in general, that you know of? Since the problems did not manifest at the time of the call with Cisco, troubleshooting was hard. The only thing they could tell me at the time is that at several points in the logs they can see two MAC addresses associated with my floating IP at the same time. Why, they cannot tell. They call it a general network flow / layer 2 problem.

And this is where I'm left hanging. Cisco ACI also uses VXLAN overlay networking. Does this interfere with the Kubernetes overlay? We're using RKE with Canal (their default). The people at Rancher don't seem to think so. The in-cluster IP range is 10.43.0.0/16, which in theory overlaps with the LAN that is on 10.0.0.0/8 with all different kinds of masks. 10.43.x.x isn't in use in the LAN though, and I can't imagine it being a problem for traffic flowing from outside to inside the cluster, but maybe I'm wrong and maybe the ACI overlay does stuff with the packets we don't expect...

I'm open to any suggestions, opinions, remarks, etc. and thank you very much in advance.

Rodrigo Campos

Oct 20, 2020, 7:24:56 AM
to iohenkies -, metallb-users
On Tue, Oct 20, 2020 at 11:45 AM iohenkies - <iohe...@gmail.com> wrote:
>
> Hi all,
>
> You might remember me as the idiot from a few weeks ago that forgot to open up TCP port 7946 after updating from MetalLB 0.8 to 0.9 but it seems there are other things in play here as well :) We're still experiencing intermittent problems (packet loss on one of the MetalLB provisioned IPs). This is probably not a

Haha, no problem. It happens to all of us :)

>
> MetalLB issue but I would like to have some more eyes on this one, if I can.
>
> We've got 3 RKE Kubernetes clusters with K8s 1.18.9 and MetalLB 0.9.3 running in Layer 2 mode and installed by manifest. Behind MetalLB is Nginx ingress 0.34.1 routing traffic to our applications. The only thing that is not default about the MetalLB configuration is that MetalLB pods are bound to 3 specific nodes with affinity and tolerations.

Are the pods exposed by the LoadBalancer service running on
these nodes as well?

If they are not, they won't be advertised. And if they only run
there by chance (sometimes they do, sometimes they don't), the service
might only be advertised from time to time, or things like that.
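One way to check (the service name and namespace below are just examples for an nginx ingress install) is to compare where MetalLB says it is announcing the service from with where the pods actually run:

# which node does MetalLB announce the LoadBalancer IP from?
kubectl -n ingress-nginx describe svc ingress-nginx | grep -A10 '^Events'
# (look for a "nodeAssigned" / "announcing from node ..." event from the speaker)

# where do the backing pods actually run?
kubectl -n ingress-nginx get pods -o wide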

> All nodes are VMs on a large VMWare cluster. They have one NIC attached with one MAC and one IPv4 address. Each cluster has its own floating IP. I've got a simple test app curling a random pod on a random node every second via one of the MetalLB provisioned IPs. When the problems occur about 10% of packages are lost. Restarting controller or speakers do not solve the issues. And this is also a problem: 3 weeks have gone by that the problems did not occur. Last week I had 3 days of packet loss on 1 of the clusters, with the same symptoms. Now it is stable again.

Do you have metallb speakers and controller logs for when this happened?

> Although not necessary (all nodes and deployments are perfectly healthy), at the time of the problems, I often see in the MetalLB logs that the floating IP is being switched from one node to another.

This is in the MetalLB logs? Can you share them (not only this part, but
the whole logs from while this was happening, and a little before/after)?

> Network guys also saw the frequent switching on their end and after the 'fix' from the above link this stopped. As did the problems for 3 weeks. When they

Is it switching between kubernetes nodes that have MetalLB running? Or
are other nodes observed in the switching?

Is it possible that traffic between these 3 nodes is blocked or not
working when this happens?

> resurfaced last week, without anything changing cluster or network side, we scheduled a call with Cisco enterprise support. The backend is all Cisco ACI. And this is probably the most important question I have for you: are there NO known problems with Cisco ACI and MetalLB or K8s CNIs in general that you know of?

See the issues for each CNI listed here:
https://metallb.universe.tf/configuration/. Most configurations
are unaffected, but the edge cases that can bite you if your setup is
not straightforward are documented there.

> And this is where I'm left hanging. Cisco ACI also uses VXLAN overlay networking. Does this interfere with Kubernetes overlay? We're using RKE with canal (their default). The people at Rancher don't seem to think so. The in-cluster IP range is 10.43.0.0/16 which in theory overlaps with the LAN that is on 10.0.0.0/8 with all different kinds of masks. 10.43.x.x isn't in use in the LAN though and I can't imagine it being a problem for traffic flowing from outside to inside the cluster, but maybe I'm wrong and maybe the ACI overlay does stuff with the packages we don't expect...

I didn't follow: 10.0.0.0/8 is used by what? Overlaps are usually not
good. I'd verify whether MetalLB, or other nodes too, is answering ARP
requests for that IP when the "flip" happens (i.e. what are the MAC
addresses that it switches to? Only MetalLB nodes?)
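A quick way to check, from another machine on the same VLAN, is something like this (interface name and IP are examples):

# ask who owns the service IP and watch which MAC(s) answer
VIP=203.0.113.10        # the MetalLB floating IP
sudo arping -I ens192 -c 20 "$VIP"
# every reply prints the answering MAC; more than one MAC means more than one node is answering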

Rodrigo Campos

Oct 20, 2020, 7:34:10 AM
to iohenkies -, metallb-users
On Tue, Oct 20, 2020 at 11:45 AM iohenkies - <iohe...@gmail.com> wrote:
>
> Hi all,
>
> This is probably not a MetalLB issue but I would like to have some more eyes on this one, if I can.

MetalLB 0.9.4 was released a few days ago with some fixes for Layer2.
Please use 0.9.4 and try with and without the new fast dead node
algorithm (see instructions here:
https://metallb.universe.tf/release-notes/#version-0-9-2).
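If I remember correctly, the new algorithm is controlled by the METALLB_ML_BIND_ADDR environment variable on the speaker DaemonSet (the release notes above have the full manifest snippet). Roughly:

# is the memberlist-based detection enabled on the speakers?
kubectl -n metallb-system get ds speaker -o yaml | grep -A4 METALLB_ML_BIND_ADDR

# to try WITHOUT it, drop the env var and let the DaemonSet roll out again
kubectl -n metallb-system set env ds/speaker METALLB_ML_BIND_ADDR-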

Rodrigo Campos

Oct 20, 2020, 9:17:17 AM
to iohenkies -, metallb-users
Sorry for so many mails. I forgot to mention that MetalLB doesn't
officially support running on a subset of nodes, although with the new
dead node detection algorithm on layer 2 it will likely just work.

When you try to disable it, make sure to run metalLB on all nodes.
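With the nodeSelector from your snippet, that would be roughly this (untested sketch; 'speaker' is the DaemonSet name in the stock manifests):

# remove the custom nodeSelector so the speaker schedules on every node
# ('/' in the label key is escaped as '~1' in the JSON-Pointer path)
kubectl -n metallb-system patch ds speaker --type json \
  -p '[{"op":"remove","path":"/spec/template/spec/nodeSelector/node-role.domain.com~1core"}]'

# then verify you get one speaker pod per node
kubectl -n metallb-system get pods -o wide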

--
Rodrigo Campos

iohenkies -

Oct 20, 2020, 9:57:21 AM
to metallb-users
Hi Rodrigo,

Thanks and let me answer one by one.

1. The pods exposed by the service type load balancer, are running on
these nodes as well?
- Let's take the cluster that had problems last week. This is a simple cluster with 6 nodes. The first 3 nodes are our 'core' nodes with control plane, etcd and a couple of extra 'system'-related services like MetalLB and ingress. All other workloads and developer applications are on the other nodes. I believe this is a perfectly valid configuration?

2. Do you have metallb speakers and controller logs for when this happened?
- I do have fresh logs of the speaker and controller startups; these are attached. I do not have fresh logs from the time of the errors last week. I have attached errors from a month ago that show the problem. When I looked in the logs last week, I saw the same behavior, but somehow I cannot find those logs now :( The problem does not manifest at this time, but when it does, I will for sure restart the speakers again and record fresh logs.

3. This is on MetalLB logs?
- This is in the MetalLB logs indeed (attached as 'older_errors.txt'). It was also in the logs of the Cisco ACI.

4. Is it switching between kubernetes nodes that have MetalLB running?
- Yes, only Kubernetes nodes with the speaker pods on them, the 'core' nodes as mentioned above

5. Is it possible that traffic between these 3 nodes is blocked or not
working when this happens?
- In normal circumstances absolutely not. These nodes are on the same subnet and there are no firewalls in between. We even had them pinned on the same VMware ESX host, to no avail

6. See the issues for each CNI listed here:
- Yes, I know the page and Canal should be good. My main concern here was Cisco ACI. It uses VXLAN network overlay as well, and a colleague said that at least in the past this was a problem with other network overlays: ACI not being able to differentiate its own overlay network from the external (in our case Kubernetes) overlay network. All loose assumptions though

7. I didn't follow: 10.0.0.0/8 is used by what? Overlaps are usually not
good. I'd verify whether MetalLB, or other nodes too, is answering ARP
requests for that IP when the "flip" happens (i.e. what are the MAC
addresses that it switches to? Only MetalLB nodes?)
- 10.0.0.0/8 is in use by the LAN, but not the entire range of course. It is divided into many small subnets with many different masks. FWIW, 10.43.x.x is not (yet) in use on the LAN. I'm mentioning this because the in-cluster IP range of Canal is 10.43.0.0/16 by default, and my colleague and I were brainstorming whether this could be a problem. In theory the 2 overlap. My colleague thinks it can cause problems because of that. I think it cannot (at least not for traffic going towards the cluster), because external machines don't send to in-cluster IP addresses (the 10.43.x.x range) but to the floating IP they know. Hope this clears it up a little bit

8. New version!
- I will try it ASAP

iohenkies -

Oct 20, 2020, 9:59:38 AM
to metallb-users
No problem at all, I'm very glad you are taking the time. Are you saying to remove the taints/tolerations in such a way that speaker pods run on all nodes?

iohenkies -

Oct 20, 2020, 10:07:46 AM
to metallb-users
Ugh. And the attachments.

On Tuesday, October 20, 2020 at 1:24:56 PM UTC+2 rod...@kinvolk.io wrote:
speaker3.txt
speaker1.txt
controller.txt
older_errors.txt
speaker2.txt

iohenkies -

Oct 22, 2020, 3:39:01 AM
to metallb-users, Rodrigo Campos
Hi all. At this very moment my DEV and PRD clusters are unusable due to the problems.

- DEV is still running 0.9.3, METALLB_ML_BIND_ADDR ON and now the speaker pods on all nodes instead of the smaller subset (as proposed by Rodrigo)
- PRD is now running version 0.9.4 and I've tried with METALLB_ML_BIND_ADDR ON and OFF

We're trying to establish a troubleshooting session with Cisco ASAP but I already would like to let you know.

I've attached two MetalLB logs from the command k -n metallb-system logs --selector app=metallb --follow. One is from midway, when I had just discovered the problems, and the second is from right after restarting the controller and speakers. The only thing that strikes me as strange is the large number of:

{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:55:47.123678678Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:56:04.121236745Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:56:18.217862317Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:56:18.218531381Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:56:45.977524594Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:57:25.917597383Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:57:52.422352786Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:57:52.42283054Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:58:44.127818346Z"}

The difference from the situation during the previous problems, about a month ago, is that back then the responseMAC was switching between nodes. Now it responds with the same MAC every time.

1-log-midway.txt
2-log-start.txt

Rodrigo Campos

Oct 22, 2020, 10:48:07 AM
to iohenkies -, metallb-users
On Thu, Oct 22, 2020 at 9:38 AM iohenkies - <iohe...@gmail.com> wrote:
>
> Hi all. At this very moment my DEV and PRD cluster are unusable due to the problems.
>
> - DEV is still running 0.9.3, METALLB_ML_BIND_ADDR ON and now the speaker pods on all nodes instead of the smaller subset (as proposed by Rodrigo)
> - PRD is now running version 0.9.4 and I've tried with METALLB_ML_BIND_ADDR ON and OFF
>
> We're trying to establish a troubleshooting session with Cisco ASAP but I already would like to let you know.

Thanks for sharing. I will look at the logs in more detail in the next few days.

>
> I've attached two MetalLB logs from the command k -n metallb-system logs --selector app=metallb --follow. One is midway when I just discovered the problems and the second is right after restarting the controller and speakers. Only things that strikes me as strange is the large amount of:
>
> {"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:55:47.123678678Z"}
>
> Difference with the situation at the time of the previous problems about a month ago, is that the responseMAC was switching between nodes. Now it responds every time with the same MAC.

Ok, if the MAC is not switching, what is the problem you see? Have you
tried these steps:
https://metallb.universe.tf/configuration/troubleshooting/ ?

Also, I'd try to connect using the NodePort (if that port is open on
the nodes). If that still has issues, then MetalLB is for sure out of
the picture (as it might already be, since you no longer see the flipping
MAC address).
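Since the service is type LoadBalancer it already has nodePorts allocated; something like this should find them (service name and namespace are guesses for an nginx ingress install):

# list port -> nodePort mappings of the ingress service
kubectl -n ingress-nginx get svc ingress-nginx \
  -o jsonpath='{range .spec.ports[*]}{.port} -> {.nodePort}{"\n"}{end}'

# then hit a node directly, bypassing the MetalLB IP
curl -H 'Host: testapp.example.com' http://<node-ip>:<nodePort>/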


Thanks,
Rodrigo

iohenkies -

Oct 23, 2020, 8:02:24 AM
to metallb-users
Hi Rodrigo, and thanks again. Yesterday was an awful 12-hour workday with no progress, and in the end the problems automagically resolved themselves during the night. Needless to say, it is only a matter of time before they resurface. It would be awesome if you could take a look at the logs, to see if something weird sticks out.

Although I'm pretty convinced this is a network problem of some sort - layer 2, (G)ARP learning, Cisco ACI stuff - there are now 2 Cisco enterprise guys saying the same thing: Cisco ACI reports over and over again that our floating IP is in use by 2 MAC addresses. It is nothing the Cisco hardware does to cause this; this is how it is advertised TO the Cisco ACI FROM the network/cluster. Yesterday this problem was on 2 clusters and they noticed the same thing on both.

Yesterday this was the situation:
- TST running 0.9.3 but stable, no problems
- DEV running 0.9.3, METALLB_ML_BIND_ADDR ON and speaker pods on all nodes instead of the smaller subset
- PRD now running version 0.9.4 and I've tried with METALLB_ML_BIND_ADDR ON and OFF

Today it is running stable, all upgraded to 0.9.4 and with METALLB_ML_BIND_ADDR ON. But again: 0 confidence this will remain stable.

To respond to your questions:
- I've used the troubleshooting steps before and will do so again when the problems occur
- As far as I can see I don't have any NodePort services? I have MetalLB and behind it Nginx ingress on ports 80 and 443

Etienne Champetier

Oct 23, 2020, 8:22:40 AM
to iohenkies -, metallb-users
Hello iohenkies,


Le ven. 23 oct. 2020 à 08:02, iohenkies - <iohe...@gmail.com> a écrit :
Hi Rodrigo and thanks again. Yesterday was an awful 12 hour workday with no progress and in the end the problems automagically solved during the night. Needless to say it is only a matter of time before they resurface. It would be awesome if you could take a look at the logs, to see if something weird sticks out.

Although I'm pretty convinced this is a network problem of some sort - layer 2, (G)ARP learning, Cisco ACI stuff - there are now 2 Cisco enterprise guys saying the same thing: Cisco ACI reports over and over again that our floating IP is in use by 2 MAC addresses. Nothing the Cisco hardware does to cause this, this is how it is advertised TO the Cisco ACI FROM the network/cluster. Yesterday this problem was with 2 clusters and they noticed the same thing on both.

Do you have the 2 MACs? Have you identified which nodes they are on?
Do you have the times of the changes? Look at the logs 10s before/after on each node.

tcpdump ALL ARP traffic on all nodes with rotating captures, so you will know exactly what happens during the issues.

Do you have CPU/memory limits configured? (Yes by default in the manifests.)
Remove them.
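Something along these lines should do it (container index 0 assumed; adjust if your manifests differ):

# strip resource limits from the speaker DaemonSet and the controller Deployment
kubectl -n metallb-system patch ds speaker --type json \
  -p '[{"op":"remove","path":"/spec/template/spec/containers/0/resources/limits"}]'
kubectl -n metallb-system patch deploy controller --type json \
  -p '[{"op":"remove","path":"/spec/template/spec/containers/0/resources/limits"}]'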

Best
Etienne


iohenkies -

Oct 23, 2020, 8:35:19 AM
to metallb-users
Hi Etienne,

The MACs are easily recognizable and belong to the K8s nodes. The changes, or at least what Cisco describes as the double MAC address for the floating IP, happen constantly at the time of the problems. Yesterday the internal (non-Cisco) network guy literally said "Now there is one MAC for the IP. Oh, now there are two." And some time later: "Hmm, there is one again. Is it working now? Oh, now there are two MACs again." So from the ACI perspective it is not a constant condition, but it is happening constantly. It is 'flapping', as they call it.

- The logs from that time were already attached and I will attach them again: "...I've attached two MetalLB logs from the command k -n metallb-system logs --selector app=metallb --follow. One is midway when I just discovered the problems and the second is right after restarting the controller and speakers. Only things that strikes me as strange is the large amount of..."
- Could you please explain how to do this, especially the 'rotating captures'? "... tcpdump ALL arp traffic from all nodes with rotating captures ..."
- The manifests are indeed default except for the nodeSelector and tolerations. I will remove the limits!

iohenkies -

Oct 23, 2020, 8:36:02 AM
to metallb-users
Attachments.
1-log-midway.txt
2-log-start.txt

Etienne Champetier

Oct 23, 2020, 10:04:53 AM
to iohenkies -, metallb-users
Le ven. 23 oct. 2020 à 08:35, iohenkies - <iohe...@gmail.com> a écrit :
Hi Etienne,

The MACs are easily recognizable and from the K8s nodes. The changes, or at least how Cisco describes it the double MAC address for the floating IP, is constantly happening at the time of the problems. Yesterday the internal (non-Cisco) network guy literally said "Now there is one MAC for the IP. Oh now there are two." And some time later "Hmm there is one again. Is it working now? Oh, now there are two MACs again.". So it seems from the ACI perspective it is not a constant condition, but it is happening constantly. It is 'flapping' as they call it.

- The logs at the time was already attached and will attach it again "...I've attached two MetalLB logs from the command k -n metallb-system logs --selector app=metallb --follow. One is midway when I just discovered the problems and the second is right after restarting the controller and speakers. Only things that strikes me as strange is the large amount of..."

The missing memberlist "spam" in the log is a bad sign; it means that the nodes stop talking to each other and all think they are master.
0.9.4 rejoins periodically, so it'll recover automatically.
The root of the issue might be the limits, making memberlist fail (I haven't looked at the logs yet).
 
- Could you please explain how to this, especially the 'rotating captures'? "... tcpdump ALL arp traffic from all nodes with rotating captures ..."

something like
tcpdump -i INTF -s1500 -p arp -G 3600 -w /tmp/arp-nodeX-%Y-%m-%d_%H.%M.pcap

I'm using '-p' to not put the card in promiscuous mode, we might need to do 2 captures





iohenkies -

Oct 23, 2020, 10:37:05 AM
to metallb-users
Hi Etienne,

In the mails I couldn't find the 'memberlist spam' you are referring to. I only pulled large amounts of this from the logs:
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T07:35:15.209663756Z"}

Looking at the logs themselves, this might be what you mean (but I'm a bit confused because you said you didn't look at the logs yet):
{"caller":"net.go:210","component":"Memberlist","msg":"[DEBUG] memberlist: Stream connection from=10.11.16.61:46000","ts":"2020-10-22T07:35:01.016510916Z"}
{"caller":"net.go:785","component":"Memberlist","msg":"[DEBUG] memberlist: Initiating push/pull sync with: 10.11.16.63:7946","ts":"2020-10-22T07:35:01.008023999Z"}
{"caller":"net.go:210","component":"Memberlist","msg":"[DEBUG] memberlist: Stream connection from=10.11.16.63:54488","ts":"2020-10-22T07:35:03.282288846Z"}
{"caller":"net.go:785","component":"Memberlist","msg":"[DEBUG] memberlist: Initiating push/pull sync with: 10.11.16.62:7946","ts":"2020-10-22T07:35:03.282019153Z"}
{"caller":"net.go:210","component":"Memberlist","msg":"[DEBUG] memberlist: Stream connection from=10.11.16.62:52862","ts":"2020-10-22T07:35:13.579998673Z"}
{"caller":"net.go:785","component":"Memberlist","msg":"[DEBUG] memberlist: Initiating push/pull sync with: 10.11.16.63:7946","ts":"2020-10-22T07:35:13.579964703Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T07:35:15.209663756Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T07:35:28.013029166Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T07:35:28.569118671Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T07:35:30.168856668Z"}
{"caller":"net.go:210","component":"Memberlist","msg":"[DEBUG] memberlist: Stream connection from=10.11.16.61:60168","ts":"2020-10-22T07:35:31.018281454Z"}
{"caller":"net.go:785","component":"Memberlist","msg":"[DEBUG] memberlist: Initiating push/pull sync with: 10.11.16.62:7946","ts":"2020-10-22T07:35:31.009558109Z"}
{"caller":"net.go:785","component":"Memberlist","msg":"[DEBUG] memberlist: Initiating push/pull sync with: 10.11.16.62:7946","ts":"2020-10-22T07:35:33.28398487Z"}

But in any case I understand this is a big problem and would substantiate the view from the Cisco engineers:
"... means that the nodes stop talking to each other and all think they are master ..."

Now I'll remove the limits from the clusters as well. Not to jump to conclusions, but if this turns out to be the issue, I'll have no clue what is causing it or how our clusters are different from all those clusters that run MetalLB successfully with limits (although I don't mind disabling them).

iohenkies -

Oct 23, 2020, 10:46:56 AM
to metallb-users
Hi Etienne,

I guess we're "lucky": as soon as I updated the MetalLB manifests on DEV it started giving errors again. I have 3 short captures from 3 nodes attached. I will make more captures.

On Friday, October 23, 2020 at 4:04:53 PM UTC+2 echam...@anevia.com wrote:
arp-node2-2020-10-23_16.40.pcap
arp-node1-2020-10-23_16.40.pcap
arp-node3-2020-10-23_16.40.pcap

iohenkies -

Oct 23, 2020, 11:12:10 AM
to metallb-users
Some more packet captures. Since the problems are now constant, I assume this should be enough.

DEV is the one giving errors.

PRD is OK, but I wanted to attach it anyway because maybe you will notice an important difference or some kind of anomaly while it is in a healthy state

On Friday, October 23, 2020 at 4:04:53 PM UTC+2 echam...@anevia.com wrote:
dev-kube-03-2020-10-23_16.59.pcap
prd-kube-03-2020-10-23_17.01.pcap
dev-kube-02-2020-10-23_16.59.pcap
prd-kube-01-2020-10-23_17.01.pcap
dev-kube-02-2020-10-23_16.55.pcap
dev-kube-01-2020-10-23_16.55.pcap
dev-kube-01-2020-10-23_16.59.pcap
prd-kube-02-2020-10-23_17.01.pcap
dev-kube-03-2020-10-23_16.55.pcap

Etienne Champetier

Oct 23, 2020, 1:55:23 PM
to iohenkies -, metallb-users
Le ven. 23 oct. 2020 à 10:46, iohenkies - <iohe...@gmail.com> a écrit :
Hi Etienne,

I guess we're "lucky": as soon as I updated the MetalLB manifests on DEV it started giving errors again. I have 3 short captures from 3 nodes attached. I will make more captures.

When you restart the speakers, the VIP will move right away and then move back.
As soon as everything is stable again, the last node considered master will send GARPs for 5 seconds.
So you will see GARPs from multiple nodes during the rollout, but at the end, 5 GARPs in 5 seconds should be enough for the switch to update.

mergecap -w merged-arp.pcapng arp-node*
tshark -Y "arp.dst.proto_ipv4 == 10.11.112.74 or arp.src.proto_ipv4 == 10.11.112.74" -r merged-arp.pcapng

There is only 1 MAC for 10.11.112.74, it's 00:50:56:ab:81:a1

 




Etienne Champetier

Oct 23, 2020, 2:02:36 PM
to iohenkies -, metallb-users
Le ven. 23 oct. 2020 à 11:12, iohenkies - <iohe...@gmail.com> a écrit :
Some more packet captures. Since the problems are now constant, I assume this should be enough.

DEV is the one giving errors.

mergecap -w merged.pcapng dev-kube*
tshark -Y "arp.dst.proto_ipv4 == 10.11.112.74 or arp.src.proto_ipv4 == 10.11.112.74" -r merged.pcapng

There is only 1 MAC !!

I think you need to do a capture in the network at the same time as on the nodes, to see if something other than the nodes creates responses (VMware?? Cisco??).
Do the captures without the '-p'.





iohenkies

Oct 23, 2020, 4:49:39 PM
to Etienne Champetier, metallb-users
Hi Etienne,

Thank you very much for your time and effort. This weekend I’ll try to find and follow a crash course on network captures and analysis and figure out if this all answers some questions. I’ll report back for sure.

Kind regards,
iohenkies

On 23 Oct 2020, at 20:02, Etienne Champetier <echam...@anevia.com> wrote:



Etienne Champetier

Oct 23, 2020, 5:57:52 PM
to iohenkies, metallb-users
Hi again,

Le ven. 23 oct. 2020 à 16:49, iohenkies <iohe...@gmail.com> a écrit :
Hi Etienne,

Thank you very much for your time and effort. This weekend I’ll try to find and follow a crash course on network captures and analysis and figure out if this all answers some questions. I’ll report back for sure.

tshark is just command-line Wireshark, so you can use the same filter (-Y) in the GUI.

You said there were problems during the 2x5min captures (dev-kube*.pcap), but there is only 1 MAC, so there is some magic in this network.
Tell your networking & virtualization people to launch a 1h capture of all ARP traffic on the virtualization host / on the gateway if possible / somewhere; at the same time also capture on the 3 nodes and send the captures to them. If they see 1 MAC in the node captures and 2 MACs in another capture, then they need to look somewhere else.

Also, for the networking team / Cisco: MetalLB does GARP (gratuitous ARP), more precisely 1 GARP request and 1 GARP response, every 1.1s for 5s, when a node becomes master for this IP.
Maybe the GARP requests confuse "the network", or too many changes trigger some anti-flapping mechanism and in the end "the network" stays in a weird state.
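If you want to pull the GARPs out of a capture, a filter along these lines should work (field names from memory, double check in Wireshark):

# who sent gratuitous ARP for the VIP, and when
tshark -r merged.pcapng \
  -Y "arp.isgratuitous == 1 and arp.src.proto_ipv4 == 10.11.112.74" \
  -T fields -e frame.time -e arp.src.hw_mac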

=> pcap or it didn't happen !

iohenkies -

Nov 5, 2020, 3:58:47 PM
to metallb-users
Hi all, me again. Hope I can steal some of your time again.

Unfortunately the problems are not solved and it is getting very precarious now. Let me try to summarize:
* Etienne helped me with some pcap info that showed that there is always only one MAC address associated with the floating IP
* Last Friday I had another troubleshooting call with a network expert. Long story short, he says the Cisco people are wrong and he sees that the network is stable and converged, with indeed only 1 MAC for 1 floating IP
* This made me look at the clusters again. I thought I was on to something when I discovered I could reproduce and solve the issue:
- I'm able to reproduce the problem when I have more than 1 Nginx ingress controller and I change something random in the MetalLB config (forcing redeployment of pods)
- I'm able to solve the problem by scaling the Nginx ingress controller to 0 and then back to 2 or 3
- I could not reproduce this problem when I have only 1 Nginx ingress controller
* These last two points made me think it must have something to do with ingress, right?
* I've contacted the Nginx ingress developers and we ran some tests. Sorry, they say, it really is not an Nginx ingress problem, because:
- By scaling Nginx to 0 and back to 2 or 3 you are simply resetting the state and forcing MetalLB to reconfigure
=> Important to know: I've now set up a dedicated cluster with just one MetalLB speaker pinned to one node and one Nginx ingress pinned to the same node
- After a random time, my test app just stops responding. Times out in the browser. But since we now have one pod for each service, troubleshooting is hopefully easier
- So now I cannot reach my app via the floating IP, BUT I CAN directly curl the ingress nodeport with a curl direct-ip-of-a-node:31775 -H 'Host: testapp.domain.com'

The Nginx people say it is therefore most likely a MetalLB issue. But here we are again, of course. I'm hoping very much this info helps and that you still want to help.

Etienne Champetier

Nov 5, 2020, 4:46:53 PM
to iohenkies -, metallb-users
Hello,

Le jeu. 5 nov. 2020 à 15:58, iohenkies - <iohe...@gmail.com> a écrit :
Hi all, me again. Hope I can steal some of your time again.

Unfortunately problems are not solved and it is getting very precarious now. Let me try to summarize:
* Etienne helped me with some pcap info that showed that there is always only one MAC address associated with the floating IP
* Last Friday I had another troubleshooting call with a network expert. Long story short, he says Cisco people are wrong and he sees the network is stable, converged and indeed only 1 MAC with 1 floating IP
* This made me look to the clusters again. I thought I was on to something when I discovered I could reproduce and solve the issue:
- I'm able to reproduce the problem when I have more than 1 Nginx ingress controller and I change something random in the MetalLB config (forcing redeployment of pods)
- I'm able to solve the problem by scaling the Nginx ingress controller to 0 and then back to 2 or 3
- I could not reproduce this problem when I have only 1 Nginx ingress controller
* These last two points made me think it must have something to do with ingress, right?
* I've contacted the Nginx ingress developers and we ran some tests. Sorry they say, it really is not a Nginx ingress problem, because:
- By scaling Nginx to 0 and back to 2 or 3 you are simply resetting the state and forcing MetalLB to reconfigure
=> Important to know: I've now setup a dedicated cluster with just one MetalLB speaker pinned to one node and one Nginx ingress pinned to the same node
- After a random time, my test app just stops responding. Times out in the browser. But since we now have one pod for each service, troubleshooting is hopefully easier
- So now I cannot reach my app via the floating IP, BUT I CAN directly curl the ingress nodeport with a curl direct-ip-of-a-node:31775 -H 'Host: testapp.domain.com'

Nginx people say it therefore is most likely a MetalLB issue. But we were here again of course. I'm hoping very very much this info helps and you still want to help.

In between MetalLB and Nginx you have kube-proxy, with a lot of iptables rules, maybe some IPVS, so many things that can go wrong ;)
Are you using kube-proxy iptables or IPVS? If IPVS, look at https://github.com/metallb/metallb/issues/153#issuecomment-518651132
Are you using externalTrafficPolicy Local or Cluster? Can you try to switch between the 2?
What CNI / OS are you using?
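Quick ways to check (the kube-proxy metrics port and the service names below are from memory / examples, double check on RKE):

# active kube-proxy mode, straight from kube-proxy on a node
curl -s http://localhost:10249/proxyMode

# current externalTrafficPolicy of the ingress service, and how to flip it
kubectl -n ingress-nginx get svc ingress-nginx -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}'
kubectl -n ingress-nginx patch svc ingress-nginx -p '{"spec":{"externalTrafficPolicy":"Local"}}'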

Can you reproduce it on a single-host cluster? Then try the same config on bare metal or on a VM on your laptop (just to remove the whole VMware/Cisco part from the picture).
Make sure to have exactly the same kernel version if you start such tests.

 




iohenkies -

Nov 6, 2020, 7:37:23 AM
to metallb-users
Hi Etienne,

- I'm assuming kube-proxy iptables. I can't find how to see the active config, but at least I see --proxy-mode=ipvs is not being set
- I've tried Cluster and Local traffic policy. Both seem to have the same results
- We're running RKE clusters and the default there is Canal. We're running that
- OS is CentOS 7.7 and kernel is 3.10.0-1062.9.1.el7.x86_64

I'll try and organize a single node and local VM.

Something strange I just noticed on the dedicated test cluster where I have the single ingress controller / single speaker setup. This stopped functioning altogether again and I ran my

curl direct-ip-of-a-node:31775 -H 'Host: testapp.domain.com'

again. This does not work on one node! It gives a

curl: (7) Failed connect to 10.11.80.121:31775; Connection timed out

It only works from the node itself, not from anywhere else. I'm not absolutely sure this is related, but it sure is a strange issue on its own.

So of the 6 nodes, this curl to the nodeport does not work on the first node. This first node is, together with nodes 2 and 3, an etcd and control plane node with no other (workload-related) pods on it. So also no MetalLB and no ingress.

The nodes are 100% identical (configured with Ansible), and port 31775 is open and listening. On the network / firewalls all nodeports are allowed; I've double-checked this with the network folks. Also, my colleague did a short packet capture showing the request does arrive at the node. It goes wrong on the node itself. Again, nothing in the logs (messages, dmesg, kube-proxy, ingress, MetalLB, etc.). I'm stumped. This is a dedicated test cluster with only RKE, MetalLB and Nginx, with default settings.

Rodrigo Campos

Dec 8, 2020, 12:43:52 PM
to iohenkies -, metallb-users
Sorry for the late reply. If you can't connect to host-ip:nodePort, then that is a bigger issue, and you need that working reliably before MetalLB even comes into play. That is most likely related to your CNI or kube-proxy.
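A few things worth comparing between the failing node and a healthy one (just a sketch; conntrack-tools may need to be installed):

# is kube-proxy holding the nodePort open?
ss -lnt | grep 31775

# are the nodePort iptables rules programmed on this node?
iptables-save | grep -c 31775

# any conntrack insert failures / drops?
conntrack -S | head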

Did you find your network problem? Let us know if after fixing that MetalLB has any issues :)

Francesco Trebbi

Oct 26, 2022, 9:07:19 AM
to metallb-users
Does anybody know if the root cause was ever found?
I have exactly the same issue with MetalLB 0.9.5 + VMware + a Cisco physical switch.