Frequent switching of responseMAC with MetalLB version 0.9.3


iohenkies -

Sep 4, 2020, 2:58:24 AM
to metallb-users

Hi all, I just posted this on Slack as well, but since it hasn't been very active recently, I took a look at the website and found this mailing list. Since this is a pretty big issue for us, I'm trying my luck here as well ;)

So, these last couple of days we've been troubleshooting some intermittent connectivity issues in our 3 identical Kubernetes clusters (version 1.18.3, RKE), and it seems we've finally found the source of it.

We're running on-prem and are using MetalLB in Layer 2 mode. Recently we switched from the deprecated Helm chart (kubernetes-charts.storage.googleapis.com) with MetalLB version 0.8.1 to the maintained Helm chart (charts.bitnami.com) with MetalLB version 0.9.3. I've got a simple test application curling a pod every second (https://hub.docker.com/r/monachus/rancher-demo), and while preparing a demo I noticed that about 1% to 2% of the requests suddenly failed. Very long story short: this only happens with MetalLB 0.9.3 and not with MetalLB 0.8.1.
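
For context, our MetalLB configuration itself is nothing special: a single Layer 2 address pool, roughly along these lines (the address range here is just an illustration, not our real pool):

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default            # single pool handed out to LoadBalancer services
      protocol: layer2
      addresses:
      - 10.11.112.70-10.11.112.80   # illustrative range only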

Around every failed request we find the lines below in the logs, so I guess the immediate cause of the failures is clear, but we can't find any solid reason why this is happening. Also important to emphasize: a cluster with version 0.8.1 is rock solid for days, and that exact same cluster, now testing with version 0.9.3, gives us these problems.

{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.59385607Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.594421005Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.593812138Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.594421678Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:26.887565678Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:26.888564243Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:26.88753557Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:26.888830084Z"}

Any help is greatly appreciated. If this is just a bug that will be solved in version 0.9.4, we're of course also fine with that. We'd just like some confirmation/explanation of this behavior.

Johannes Liebermann

Sep 4, 2020, 2:12:41 PM
to iohenkies -, metallb-users
Hi there,

>Around every failed package we can find the below in the logs so I guess the reason of the failed packages is clear but why this is happening we cannot find any solid reason for it.

It's not clear to me from the logs you've pasted what the reason for the failures is. All I see is that the speaker is responding to ARP requests, which is its normal behavior assuming it's the active speaker.

If you suspect there is a bug in MetalLB, I encourage you to open an issue at https://github.com/metallb/metallb/issues and include the steps for reproducing the problem.



--
Johannes Liebermann

Kinvolk GmbH | Adalbertstr. 6a, 10999 Berlin | tel: +491755589364
Geschäftsführer/Directors: Alban Crequy, Chris Kühl, Iago López Galeiras
Registergericht/Court of registration: Amtsgericht Charlottenburg
Registernummer/Registration number: HRB 171414 B
Ust-ID-Nummer/VAT ID number: DE302207000

Rodrigo Campos

Sep 4, 2020, 3:43:04 PM
to iohenkies -, metallb-users
On Fri, Sep 4, 2020 at 8:58 AM iohenkies - <iohe...@gmail.com> wrote:
> Around every failed package we can find the below in the logs so I guess the reason of the failed packages is clear but why this is happening we cannot find any solid reason for it. Also important to emphasize: a cluster with version 0.8.1 is rock solid for days and that exact same cluster but now testing with version 0.9.3 gives us these problems.
>
> {"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.59385607Z"}
> {"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.594421005Z"}
> {"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.593812138Z"}
> {"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.594421678Z"}
> {"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:26.887565678Z"}
> {"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:26.888564243Z"}
> {"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:26.88753557Z"}
> {"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:26.888830084Z"}

Can you please share some information about your network setup? I guess you have more than one interface, and that is why the MAC address changes?

Also, you mention 1% traffic loss, but do you have any clue where in your network it is being dropped? Is the traffic reaching the Kubernetes worker announcing the IP for the service?

iohenkies -

Sep 7, 2020, 3:12:27 AM
to Rodrigo Campos, metallb-users
Hi guys, thanks for the feedback. Let me try and clarify a couple of things.

We're in a pretty large network. I don't have insight into all the details; I'm just the Linux guy. There is a big VMware environment with THz of compute power and TBs of memory and storage capacity. With a Rancher management cluster and 3 Kubernetes clusters, we've got a piece of this VMware platform. These clusters are semi-production, meaning about 25% is utilized and nothing is under pressure resource-wise. Exactly a week ago I discovered a lot of errors in my simple test application (https://hub.docker.com/r/monachus/rancher-demo) that I'm absolutely sure I did not have before; the app curls a pod every second and should return a 200. About 1% to 2% of the requests gave an error. Long story short:
- these last 7 days I did thousands of these tests on all 3 clusters
- running MetalLB 0.8.1 this results in 100% of 200s (see attached for one run)
- running MetalLB 0.9.3 this results in about 2% of errors

So 3 clusters with MetalLB 0.8.1: rock solid, not one error. The exact same 3 clusters with MetalLB 0.9.3: the described problems.

At the time of the errors in the test app, the log lines I sent earlier pop up. What I make of the log, and of course I could be wrong, is that the floating IP (10.11.112.74) is bound to my node with MAC address 00:50:56:ab:0a:2c, then switches to another node with MAC address 00:50:56:ab:ae:f1, and back and forth once more, all within 5 seconds at most. But at this time there is no reason to switch: all pods, workloads and nodes are stable, alive and kicking. And it only does so with version 0.9.3. All nodes have a single NIC and MAC address.
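
For what it's worth, this is also how I'd observe it from a client in the same subnet (IP and interface names taken from the logs above; just a sketch of how to check, nothing MetalLB-specific): watching the ARP cache entry for the service IP shows the MAC changing.

$ ip neigh show 10.11.112.74             # current ARP cache entry for the service IP
$ arping -c 1 -I ens192 10.11.112.74     # force a fresh ARP request and print who answers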

As a side note: before I could pinpoint this to MetalLB, I also talked to the network guys, the firewall guys and the VMware guys: they all assured me there were no issues or changes going on that could affect these 3 clusters. The 3 clusters also each reside in a different IP subnet, so such an issue or change would have to be network-wide. In short: I don't believe we should look for the cause somewhere else in the environment.

I hope I was able to clarify. I will try and test myself with other versions than 0.8.1 and 0.9.3 but time is an issue.
2020-09-07 08_33_59-Rancher Demo.png

Johannes Liebermann

Sep 7, 2020, 7:46:12 AM
to iohenkies -, Rodrigo Campos, metallb-users
Thanks for elaborating. I understand you're running a complex setup, however we have to focus on MetalLB...

If there is a bug in MetalLB which causes this behavior under stable network conditions, we should be able to reproduce it. Can you share some steps which we can take to see the bug? Without reproducing the problem, it is not clear to me how we can help.


iohenkies -

Sep 7, 2020, 10:19:35 AM
to Johannes Liebermann, Rodrigo Campos, metallb-users
I understand; I just wanted to make clear that this really is related to the MetalLB version.

With an existing Kubernetes cluster running, I think the steps to reproduce are easy, but we only have Rancher-managed clusters, so this might differ.

- RKE cluster, Kubernetes version 1.18.3
- MetalLB version 0.9.3
- Behind this, an NGINX ingress controller 0.34.1 as a Service of type LoadBalancer, which receives an IP from the MetalLB pool
- Test application https://hub.docker.com/r/monachus/rancher-demo with for instance 6 pods
- Configure an Ingress object to point to the test app and terminate TLS
- Go to your app from a browser and start watching cows appear
- Leave it running for a few hours (sometimes it takes a while before errors appear)
- Follow the logs with `k -n metallb-system logs --selector app.kubernetes.io/component=speaker --follow` at the same time

This is about it; I don't know if you are expecting more, please let me know if you do. For me, with version 0.9.3 about 2% of the requests result in an error. With 0.8.1 there are 0% errors.
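
If a browser is not convenient, the same check can be done with a simple curl loop against the hostname your Ingress serves (the hostname below is just a placeholder):

$ while true; do code=$(curl -sk -o /dev/null -w '%{http_code}' https://rancher-demo.example.com/); [ "$code" != "200" ] && echo "$(date -Is) got $code"; sleep 1; done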

Maybe this wasn't clear before, but I also get the exact same results with other apps; I just threw in this test app for convenience. For instance, while using linkerd.io and leaving the dashboard open (which also polls every second), I'm getting a lot of errors too, maybe even more than with the test app. It is 100% operational while using MetalLB 0.8.1.

iohenkies -

Sep 7, 2020, 10:23:00 AM
to Johannes Liebermann, Rodrigo Campos, metallb-users
Ah, and one addition. I've read the release notes and I believe the major difference between 0.8 and 0.9 is the METALLB_ML_BIND_ADDR setting.

As a test, I've deployed 0.9.3 again and disabled that specific setting as per the instructions, and I will now run the same tests again. I can report back within 24 hours.
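
For anyone following along: if I read the 0.9 manifests correctly, the setting is an environment variable on the speaker DaemonSet, roughly like the snippet below, and "disabling" it means removing that env var so the speaker falls back to the old leader-election behavior (at least, that is my understanding of the instructions):

        env:
        - name: METALLB_ML_BIND_ADDR   # removing this is what I mean by "disabled"
          valueFrom:
            fieldRef:
              fieldPath: status.podIP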

iohenkies -

Sep 8, 2020, 12:30:39 PM
to Johannes Liebermann, Rodrigo Campos, metallb-users
Hi. Today I again ran thousands of tests at the same time against 3 different clusters:
1. A cluster with MetalLB 0.8.1
2. A cluster with MetalLB 0.9.3 and METALLB_ML_BIND_ADDR OFF (turned off manually as per the instructions)
3. A cluster with MetalLB 0.9.3 and METALLB_ML_BIND_ADDR ON (the default)

The first two: 100% success rate.
The third: see attached.
2020-09-08 18_25_08-Rancher Demo.png

Russell Bryant

Sep 8, 2020, 5:40:30 PM
to iohenkies -, Johannes Liebermann, Rodrigo Campos, metallb-users
Interesting. When you have memberlist turned on, do you see memberlist events in your speaker logs? Are you seeing any unexpected NodeJoin / NodeLeave events?

--
Russell Bryant


iohenkies -

Sep 9, 2020, 3:29:43 AM
to Russell Bryant, Johannes Liebermann, Rodrigo Campos, metallb-users
Hi. I'm pretty sure these are the only types of events in the logs:


{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.59385607Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.594421005Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.593812138Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.594421678Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:26.887565678Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:26.888564243Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:26.88753557Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:26.888830084Z"}


But I will check again while monitoring today. This is what I concluded earlier about this behavior (I'm no networking pro or MetalLB expert):

What I make of the log, and of course I could be wrong, is that the floating IP (10.11.112.74) is bound to my node with MAC address 00:50:56:ab:0a:2c, then switches to another node with MAC address 00:50:56:ab:ae:f1, and back and forth once more, all within 5 seconds at most. But at this time there is no reason to switch: all pods, workloads and nodes are stable, alive and kicking. And it only does so with version 0.9.3 (as it seems now, with METALLB_ML_BIND_ADDR ON). All nodes have a single NIC and MAC address.

What also seems to be the case is that it can go without an error for a couple of hours, but once the first error hits, errors can pile up fast, with every 5th request or so failing. But I'm not completely sure of this; it might also be due to sample size.

Rodrigo Campos

Sep 9, 2020, 10:09:05 AM
to iohenkies -, Russell Bryant, Johannes Liebermann, metallb-users
Thanks a lot for the detailed explanation!

I think I have an idea of what the bug might be. Once you hit the issue, if you restart all the speaker pods (one by one), does that fix it by any chance?

If that fixes the problem, I might ask you to try a development image of MetalLB (PR not merged yet) if you have the time :)
--
Rodrigo Campos
---
Kinvolk GmbH | Adalbertstr.6a, 10999 Berlin | tel: +491755589364

Etienne Champetier

Sep 9, 2020, 10:52:10 AM
to iohenkies -, Rodrigo Campos, Russell Bryant, Johannes Liebermann, metallb-users
Hi All,

On Wed, Sep 9, 2020 at 10:09 AM, Rodrigo Campos <rod...@kinvolk.io> wrote:
Thanks a lot for the detailed explanation!

I think I have an idea what the bug can be. If once you hit the issue,
you restart all the speaker pods (one by one), does that fix it by any
chance?

Just restarting one speaker should be enough, I think, and you should see memberlist logs.
 


--

Etienne Champetier


Rodrigo Campos

Sep 9, 2020, 10:54:25 AM
to metallb-users
On Wednesday, September 9, 2020 at 4:52:10 PM UTC+2 echam...@anevia.com wrote:
Hi All,

On Wed, Sep 9, 2020 at 10:09 AM, Rodrigo Campos <rod...@kinvolk.io> wrote:
Thanks a lot for the detailed explanation!

I think I have an idea what the bug can be. If once you hit the issue,
you restart all the speaker pods (one by one), does that fix it by any
chance?

Just restarting 1 speaker should be enough I think, and you should see MemberList logs

Right! Just restarting any one speaker pod should do the trick, if my guess is correct. Please update if you can test that :)

Johannes Liebermann

Sep 10, 2020, 7:48:09 AM
to Rodrigo Campos, metallb-users
Thanks for the additional detailed info. It's clear to me that you're only experiencing the problem with v0.9.3. It's also likely that this behavior is related to a memberlist code path. What I'm still missing are steps to reproduce the problem. If you could share a list of steps for reproducing the bug, it would make a fix arrive faster. Example:

1. Deploy MetalLB v0.9.3 using manifest X.
2. Apply ConfigMap Y.
3. Do Z.

Once you send that we can open a bug and work on a fix.


iohenkies -

Sep 10, 2020, 1:54:09 PM
to Johannes Liebermann, Rodrigo Campos, metallb-users
Hi all.

First, thanks for the restart-speaker-pod trick. I haven't had the chance to run tests yet, but I will do so as soon as possible, restart one of the speaker pods, and report back with results and logs.

Second, I tried to write up reproduction steps in my email of Mon, Sep 7, 4:19 PM. In addition to that info, I've installed MetalLB in the following way:
$ helm repo add bitnami https://charts.bitnami.com/bitnami
$ kubectl create ns metallb-system
$ helm fetch --untar bitnami/metallb
$ vim metallb/values.yaml
$ helm install metallb -n metallb-system -f metallb/values.yaml bitnami/metallb

The values.yml is attached. Behind MetalLB is the NGINX ingress controller 0.34.1 as a Service of type LoadBalancer, which receives an IP from the MetalLB pool. The floating IP has an A record, and multiple applications have CNAMEs pointing to that A record.

Do you need more info? If you want, I can also give installation details for the ingress, but I think it's pretty much default as well.

values.yml

iohenkies -

Sep 11, 2020, 3:34:52 AM
to Johannes Liebermann, Rodrigo Campos, metallb-users
And here are the logs. The first part is while one or more errors occur; the second part is during and right after a speaker restart. This does not solve the error. Restarting all speakers doesn't solve the error either.
error_and_speaker_restart.txt

Johannes Liebermann

Sep 11, 2020, 2:56:43 PM
to iohenkies -, Rodrigo Campos, metallb-users
Looks like you aren't using the official manifests and container images. Does the problem occur when you deploy MetalLB using the official instructions (https://metallb.universe.tf/installation/)? If so, should I be able to reproduce the problem you're having by simply deploying MetalLB with the default config (unlikely, as this would probably affect all Layer 2 users)?

iohenkies -

Sep 13, 2020, 10:11:29 AM
to Johannes Liebermann, Rodrigo Campos, metallb-users
I'm a little confused as to why I thought Helm was an officially supported install method, but I did some work today.

On all 3 clusters I removed everything MetalLB-related and reinstalled with kubectl and the official manifests. I did add this to the official manifest:

      nodeSelector:
        node-role.domain.com/core: "true"
      tolerations:
        - key: node-role.kubernetes.io/controlplane
          operator: Exists
          effect: NoSchedule
        - key: node-role.kubernetes.io/etcd
          operator: Exists
          effect: NoExecute

This was added to both the controller Deployment and the speaker DaemonSet, to land the pods on my 'master' nodes. I assume and hope this is not a problem.
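
To double-check that the scheduling actually worked, I simply looked at where the pods landed:

$ kubectl -n metallb-system get pods -o wide   # NODE column should show the 'master' nodes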

Of course, I've also applied the unique per-cluster ConfigMap.

Hope all tests will be 100% successful in the coming days.

iohenkies -

Sep 16, 2020, 8:44:53 AM
to Johannes Liebermann, Rodrigo Campos, metallb-users
Hi all. First of all: thank you for your time; in case I haven't made this clear yet, I am very grateful for it.

Second, with the official manifests and settings plus the earlier-mentioned nodeSelector and tolerations, I was very sad to discover that the problems remained.

As said in my email of Sep 7, 2020, 9:12 AM, we're in a pretty large network on a big VMware cluster, of which I don't know all the ins and outs, and before bothering you all I talked to the network guys, firewall guys and VMware guys at our company. All was OK, they said. Now, with the correct MetalLB manifests and default settings in place but the error persisting, I had to bother them again.

I talked to one of the network guys and ran some more tests. In his switch port overview/monitoring tool he confirmed what I already saw in the MetalLB logging: the floating IP constantly switches MAC address. In the geographical area where the hardware for the VMware cluster is located, we have two physical locations. He discovered that the MAC 'flapping', as he calls it, only seems to happen when one of the Kubernetes master VMs (which also runs MetalLB) is not at the same location as the other two VMs. With vMotion these VMs can and will be moved regularly, and the cluster is set up so that vMotion should be able to do its job without any interruption. Other VMs, and even the Kubernetes VMs running etcd for instance, don't seem to be bothered by it.

It is actually also a much bigger problem than I anticipated: more users than I thought are using the clusters, and when I run my tests with Firefox instead of Chrome I can get an error rate of up to 50%. So the clusters are pretty much unusable.

The plan is to pin the Kubernetes VMs to one location, keep them together, and see what that does. I'll keep you posted.

iohenkies -

Sep 16, 2020, 9:19:19 AM
to Russell Bryant, Johannes Liebermann, Rodrigo Campos, metallb-users
Hi Russell, it is indeed a combined log:
k -n metallb-system logs --selector app=metallb --follow

On Wed, Sep 16, 2020 at 3:15 PM Russell Bryant <rbr...@redhat.com> wrote:
Can you clarify one thing, when you see the messages like this:

{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.594421005Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.593812138Z"}

Are those from the log of a single speaker, or did you combine the logs across multiple speaker instances?  Based on the interface name being the same, I'm guessing this was combined logs of 2 different speakers, but I wanted to make sure.

--
Russell Bryant


iohenkies -

Sep 17, 2020, 3:18:05 AM
to Russell Bryant, Johannes Liebermann, Rodrigo Campos, metallb-users
We've now placed the VMs running MetalLB not only in the same physical location but also on the same host. The problems persist. I'll talk to the network guys once more and maybe remove the nodeSelector and tolerations (since that is the only non-default thing at this moment).

Rodrigo Campos

Sep 17, 2020, 5:07:57 AM
to iohenkies -, Johannes Liebermann, metallb-users
On Wed, Sep 16, 2020 at 2:44 PM iohenkies - <iohe...@gmail.com> wrote:
>
> Hi all. First of all: thank you all for the time, in case I did not make this clear yet I am very grateful for your time.

Thank you too! :)

Were you able to try what we mentioned here:
https://groups.google.com/g/metallb-users/c/HAO0k7cCbDk/m/DCoVXzmqAwAJ
and see if, when you hit the problem and do that, it is solved for a
while? That would help us a lot to understand the root cause of the
issue and fix it :)

Russell Bryant

Sep 17, 2020, 5:08:22 AM
to iohenkies -, Johannes Liebermann, Rodrigo Campos, metallb-users
Can you clarify one thing, when you see the messages like this:
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:0a:2c","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.594421005Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.112.74","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:ae:f1","senderIP":"10.11.112.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-09-03T16:54:22.593812138Z"}

Are those from the log of a single speaker, or did you combine the logs across multiple speaker instances?  Based on the interface name being the same, I'm guessing this was combined logs of 2 different speakers, but I wanted to make sure.

--
Russell Bryant

iohenkies -

Sep 17, 2020, 5:39:49 AM
to Rodrigo Campos, Johannes Liebermann, metallb-users
Yes, I did do that. Restarting one or all of the speaker pods does not solve it, not even for a while. As said in my last email, I've also removed the nodeSelectors and tolerations, so the manifests and settings are 100% default. It does not solve the problems. At this time I cannot get in contact with the network guys; they don't understand how awful this problem is :(

iohenkies -

Sep 17, 2020, 7:59:30 AM
to Rodrigo Campos, Johannes Liebermann, metallb-users

iohenkies -

Sep 17, 2020, 11:14:51 AM
to Etienne Champetier, Rodrigo Campos, Johannes Liebermann, metallb-users
Argh, I didn't want to send it only in private. I'll paste the mail below. I'll have to check and see if there is some blocking going on and maybe ask the firewall people :|. Which ports should be open at least?

Previous mail:
Thanks for your feedback Etienne. I also thought the proposed solution from the Cisco document would just replace MetalLB. The network guy disagreed.

Removing CPU limits doesn't solve it.

Hereby the logs and the tcpdumps. I did this:
- Started pcap on nodes
- Redeployed MetalLB controller and speakers
- Launched test application
- Let it all run for a minute or 30
- There was about 15% errors during the capture

Amazingly enough, in the speaker logs I didn't notice the MAC switching, but I can see at least a couple of switches in the packet capture.

On Thu, Sep 17, 2020 at 5:09 PM Etienne Champetier <echam...@anevia.com> wrote:
Hi Iohenkies,

On Thu, Sep 17, 2020 at 8:44 AM, Etienne Champetier <echam...@anevia.com> wrote:
Hello Iohenkies,

On Thu, Sep 17, 2020 at 7:59 AM, iohenkies - <iohe...@gmail.com> wrote:

This would be to just replace MetalLB from a quick read, not to help MetalLB work

As we are out of simple ideas, I think we need:
1) full logs since speakers & controller start
2) tcpdump -i ethX -p arp -w arpnodeX.pcap on all nodes

Looking at what you provided in private
{"caller":"main.go:202","component":"MemberList","msg":"memberlist.go:245: [DEBUG] memberlist: Failed to join 10.11.112.103: dial tcp 10.11.112.103:7946: connect: no route to host","ts":"2020-09-17T13:33:00.934375645Z"}
{"caller":"main.go:202","component":"MemberList","msg":"net.go:785: [DEBUG] memberlist: Initiating push/pull sync with: 10.11.112.101:7946","ts":"2020-09-17T13:33:00.93463846Z"}
{"caller":"main.go:202","component":"MemberList","msg":"net.go:210: [DEBUG] memberlist: Stream connection from=10.11.112.101:50654","ts":"2020-09-17T13:33:00.934717441Z"}
{"caller":"main.go:202","component":"MemberList","msg":"memberlist.go:245: [DEBUG] memberlist: Failed to join 10.11.112.102: dial tcp 10.11.112.102:7946: connect: no route to host","ts":"2020-09-17T13:33:00.935523095Z"}
{"caller":"main.go:163","error ?":null,"msg":"Memberlist join","nb joigned":1,"op":"startup","ts":"2020-09-17T13:33:00.935552288Z"}
Any firewall blocking MemberlIst traffic ?



Have you tried to remove CPU limits on MetalLB components ?



controller.txt
arpnode-06.pcap
arpnode-05.pcap
arpnode-04.pcap
speaker-04.txt
speaker-05.txt
speaker-06.txt

Etienne Champetier

Sep 17, 2020, 12:55:50 PM
to iohenkies -, Rodrigo Campos, Johannes Liebermann, metallb-users
Hello Iohenkies,

On Thu, Sep 17, 2020 at 7:59 AM, iohenkies - <iohe...@gmail.com> wrote:
This would be to just replace MetalLB from a quick read, not to help MetalLB work

As we are out of simple ideas, I think we need:
1) full logs since speakers & controller start
2) tcpdump -i ethX -p arp -w arpnodeX.pcap on all nodes
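
To inspect the captures afterwards, something like this should show which MAC is answering for the service IP (adjust the file name and IP to your setup):

tcpdump -e -nn -r arpnodeX.pcap 'arp and host 10.11.112.74'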

Have you tried to remove CPU limits on the MetalLB components?




Etienne Champetier

Sep 17, 2020, 12:55:50 PM
to iohenkies -, Rodrigo Campos, Johannes Liebermann, metallb-users
On Thu, Sep 17, 2020 at 11:14 AM, iohenkies - <iohe...@gmail.com> wrote:
Argh I didn't want to only send it in private. I'll paste the mail below. I'll have to check and see if there is some blocking going on and mayme ask the firewall people :|. Which ports should be open at least?

No problem
On each node, TCP & UDP port 7946; this is controlled by METALLB_ML_BIND_PORT.
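
A quick way to check the TCP part from one node towards another (UDP is harder to verify with nc), for example:

nc -vz -w 2 10.11.112.102 7946   # run from another node; should report the port as open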

Etienne Champetier

Sep 17, 2020, 12:55:51 PM
to iohenkies -, Rodrigo Campos, Johannes Liebermann, metallb-users
Hi Iohenkies,


Looking at what you provided in private
{"caller":"main.go:202","component":"MemberList","msg":"memberlist.go:245: [DEBUG] memberlist: Failed to join 10.11.112.103: dial tcp 10.11.112.103:7946: connect: no route to host","ts":"2020-09-17T13:33:00.934375645Z"}
{"caller":"main.go:202","component":"MemberList","msg":"net.go:785: [DEBUG] memberlist: Initiating push/pull sync with: 10.11.112.101:7946","ts":"2020-09-17T13:33:00.93463846Z"}
{"caller":"main.go:202","component":"MemberList","msg":"net.go:210: [DEBUG] memberlist: Stream connection from=10.11.112.101:50654","ts":"2020-09-17T13:33:00.934717441Z"}
{"caller":"main.go:202","component":"MemberList","msg":"memberlist.go:245: [DEBUG] memberlist: Failed to join 10.11.112.102: dial tcp 10.11.112.102:7946: connect: no route to host","ts":"2020-09-17T13:33:00.935523095Z"}
{"caller":"main.go:163","error ?":null,"msg":"Memberlist join","nb joigned":1,"op":"startup","ts":"2020-09-17T13:33:00.935552288Z"}
Any firewall blocking memberlist traffic?


iohenkies -

Sep 18, 2020, 9:45:53 AM
to Etienne Champetier, Rodrigo Campos, Johannes Liebermann, metallb-users
O M G... This seems to solve the bulk of the problems. I feel stupid for having overlooked this all this time. My apologies for probably wasting your time here.

I went ad hoc from the 'stable' Helm repo to the 'bitnami' one, jumped from 0.8 to 0.9, and missed the fact that I now needed an open port. METALLB_ML_BIND_PORT was on my radar for a while (see my mail from Tue, Sep 8, 6:30 PM for instance), but I no longer know for sure why I diverted from that path.

What is still strange, though, is that the test app can now go for hours on end, on 3 clusters, with thousands of requests sent successfully, but as soon as one error occurs, a lot will follow. Do you think this is an entirely different/new problem?

Francesco Trebbi

Oct 26, 2022, 9:06:59 AM
to metallb-users
Does anybody know if the root cause has been found?
I have exactly the same issue with MetalLB 0.9.5 + VMware + a Cisco physical switch.
