We have tried METALLB_ML_BIND_ADDR both on and off, and the speaker pods now run on all nodes instead of the smaller subset (as proposed by Rodrigo).
We're trying to set up a troubleshooting session with Cisco ASAP, but I already wanted to let you know.
I've attached two MetalLB logs from the command k -n metallb-system logs --selector app=metallb --follow. One is from midway through, when I had just discovered the problems, and the second is from right after restarting the controller and speakers. The only thing that strikes me as strange is the large number of:
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:55:47.123678678Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:56:04.121236745Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:56:18.217862317Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:56:18.218531381Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:56:45.977524594Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:57:25.917597383Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:57:52.422352786Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:57:52.42283054Z"}
{"caller":"arp.go:102","interface":"ens192","ip":"10.11.16.64","msg":"got ARP request for service IP, sending response","responseMAC":"00:50:56:ab:16:16","senderIP":"10.11.16.1","senderMAC":"00:22:bd:f8:19:ff","ts":"2020-10-22T06:58:44.127818346Z"}
The difference from the situation during the previous problems, about a month ago, is that back then the responseMAC was switching between nodes. Now it responds every time with the same MAC.
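(As a sanity check, which MAC is actually answering for the service IP at any given moment can be watched from another host on the same L2 segment with something along these lines; the interface name and IP are taken from the logs above, and the exact flag spelling depends on which arping variant is installed:

  arping -I ens192 -c 5 10.11.16.64    # iputils arping: prints the replying MAC for each probe
  ip neigh show 10.11.16.64            # what the local kernel currently has cached for that IP

If the replies alternate between two MACs, that would match what the Cisco side is reporting.)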
Hi Rodrigo, and thanks again. Yesterday was an awful 12-hour workday with no progress, and in the end the problems automagically resolved themselves during the night. Needless to say, it is only a matter of time before they resurface. It would be awesome if you could take a look at the logs to see if something weird sticks out.
Although I'm pretty convinced this is a network problem of some sort - layer 2, (G)ARP learning, Cisco ACI stuff - there are now two Cisco enterprise guys saying the same thing: Cisco ACI reports over and over again that our floating IP is in use by two MAC addresses. It is nothing the Cisco hardware does that causes this; this is how it is advertised TO the Cisco ACI FROM the network/cluster. Yesterday this problem affected two clusters, and they noticed the same thing on both.
Hi Etienne, the MACs are easily recognizable and are from the K8s nodes. The changes - or at least, as Cisco describes it, the double MAC address for the floating IP - are happening constantly at the time of the problems. Yesterday the internal (non-Cisco) network guy literally said "Now there is one MAC for the IP. Oh, now there are two." And some time later: "Hmm, there is one again. Is it working now? Oh, now there are two MACs again." So from the ACI perspective it is not a constant condition, but it is happening constantly; it is 'flapping', as they call it.
- The logs from that time were already attached, and I'll attach them again: "...I've attached two MetalLB logs from the command k -n metallb-system logs --selector app=metallb --follow. One is from midway through, when I had just discovered the problems, and the second is from right after restarting the controller and speakers. The only thing that strikes me as strange is the large number of..."
- Could you please explain how to do this, especially the 'rotating captures' part? "...tcpdump ALL arp traffic from all nodes with rotating captures..."
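(For reference, a rotating capture of only ARP traffic is presumably something along these lines, per the tcpdump man page - not necessarily the exact invocation Etienne had in mind; the interface name is taken from the logs above:

  # ring buffer of ten 10 MB files per node, ARP only, no name resolution
  tcpdump -i ens192 -nn arp -C 10 -W 10 -w /var/tmp/arp-$(hostname).pcap

Run on every node; tcpdump appends a file number to the name and overwrites the oldest file once the tenth one is full, so the capture can run for a long time without filling the disk.)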
Hi Etienne,
I guess we're "lucky": as soon as I updated the MetalLB manifests on DEV it started giving errors again. I have 3 short captures from 3 nodes attached. I will make more captures.
Some more packet captures. Since the problems are now constant, I assume this should be enough. DEV is the one giving errors.
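(In case it is useful for the analysis, which MAC claims the service IP in these pcaps can be extracted with tshark along these lines; 10.11.16.64 is the service IP from the logs above:

  tshark -r capture.pcap -Y 'arp and arp.src.proto_ipv4 == 10.11.16.64' -T fields -e frame.time -e arp.opcode -e arp.src.hw_mac

Opcode 2 rows are ARP replies, and the last column shows which MAC answered for the IP at that moment.)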
Hi Etienne,
Thank you very much for your time and effort. This weekend I’ll try to find and follow a crash course on network captures and analysis and figure out if this all answers some questions. I’ll report back for sure.
Hi all, me again. Hope I can steal some of your time again. Unfortunately the problems are not solved, and it is getting very precarious now. Let me try to summarize:
* Etienne helped me with some pcap info that showed there is always only one MAC address associated with the floating IP.
* Last Friday I had another troubleshooting call with a network expert. Long story short: he says the Cisco people are wrong, and he sees that the network is stable, converged, and indeed has only one MAC for the floating IP.
* This made me look at the clusters again. I thought I was on to something when I discovered I could reproduce and solve the issue:
- I'm able to reproduce the problem when I have more than one Nginx ingress controller and I change something random in the MetalLB config (forcing a redeployment of the pods).
- I'm able to solve the problem by scaling the Nginx ingress controller to 0 and then back to 2 or 3 (see the commands sketched below).
- I could not reproduce the problem when I have only one Nginx ingress controller.
* These last two points made me think it must have something to do with ingress, right?
* I've contacted the Nginx ingress developers and we ran some tests. Sorry, they say, it really is not an Nginx ingress problem, because:
- By scaling Nginx to 0 and back to 2 or 3 you are simply resetting the state and forcing MetalLB to reconfigure.
=> Important to know: I've now set up a dedicated cluster with just one MetalLB speaker pinned to one node and one Nginx ingress pinned to the same node.
- After a random time, my test app just stops responding; it times out in the browser. But since we now have one pod for each service, troubleshooting is hopefully easier.
- So now I cannot reach my app via the floating IP, BUT I CAN curl the ingress NodePort directly with curl direct-ip-of-a-node:31775 -H 'Host: testapp.domain.com'.
The Nginx people say it therefore is most likely a MetalLB issue. But then we were back here again, of course. I'm hoping very, very much this info helps and you still want to help.
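(For completeness, the 'reset' workaround mentioned above boils down to something like the following; the namespace and deployment name are assumptions and will differ per setup:

  kubectl -n ingress-nginx scale deployment ingress-nginx-controller --replicas=0
  kubectl -n ingress-nginx scale deployment ingress-nginx-controller --replicas=2

The scale-down/up changes the service endpoints, which presumably forces the MetalLB speaker to re-evaluate and re-announce the floating IP, and that would explain why it clears the condition for a while.)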