Hi everybody.
Brian told us to move this
issue here, as this is a more appropriate place to discuss it.
We have the following issue with blackbox exporter.
We run blackbox-exporter inside a Docker container. Suddenly, without any changes to the running machine or the container,
the ping probe starts failing for one or more targets, while other targets remain OK.
But when I run the ping tool manually, both inside the Docker container and on the host OS outside the container, it succeeds.
When we restart the Docker container, the issue disappears, but it comes back after some time.
We saw this behavior for two of our internal IP targets simultaneously (both in the same datacenter) and later for public targets as well:
8.8.8.8 and 1.1.1.1.
I examined the problem with tcpdump, and it shows only echo request packets from the blackbox-exporter (no reply packets):
tcpdump -i eth0 -nn -s0 -X icmp and host 8.8.8.8
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:28:48.734661 IP 172.17.0.5 > 8.8.8.8: ICMP echo request, id 33313, seq 41979, length 36
0x0000: 4500 0038 f40e 4000 4001 8a90 ac11 0005 E..8..@.@.......
0x0010: 0808 0808 0800 7648 8221 a3fb 5072 6f6d ......vH.!..Prom
0x0020: 6574 6865 7573 2042 6c61 636b 626f 7820 etheus.Blackbox.
0x0030: 4578 706f 7274 6572 Exporter
15:28:48.977456 IP 172.17.0.5 > 8.8.8.8: ICMP echo request, id 33313, seq 41982, length 36
0x0000: 4500 0038 f41d 4000 4001 8a81 ac11 0005 E..8..@.@.......
0x0010: 0808 0808 0800 7645 8221 a3fe 5072 6f6d ......vE.!..Prom
0x0020: 6574 6865 7573 2042 6c61 636b 626f 7820 etheus.Blackbox.
0x0030: 4578 706f 7274 6572 Exporter
This is the tcpdump output when I run ping manually inside the container alongside the blackbox-exporter (the blackbox-exporter uses ICMP id 33313):
root @ /
[4] 🐳 → tcpdump -i eth0 icmp and host 1.1.1.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:48:50.599214 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 50421, length 36
14:48:51.384085 IP a382643a1270 > one.one.one.one: ICMP echo request, id 35072, seq 5, length 64
14:48:51.392669 IP one.one.one.one > a382643a1270: ICMP echo reply, id 35072, seq 5, length 64
14:48:51.599289 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 50435, length 36
14:48:52.384292 IP a382643a1270 > one.one.one.one: ICMP echo request, id 35072, seq 6, length 64
14:48:52.393031 IP one.one.one.one > a382643a1270: ICMP echo reply, id 35072, seq 6, length 64
14:48:52.599559 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 50449, length 36
14:48:53.384517 IP a382643a1270 > one.one.one.one: ICMP echo request, id 35072, seq 7, length 64
14:48:53.396626 IP one.one.one.one > a382643a1270: ICMP echo reply, id 35072, seq 7, length 64
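To make the pattern in this capture explicit, here is a minimal Python sketch that groups the tcpdump lines by ICMP id and lists the sequence numbers that never got a reply. It is not part of blackbox-exporter; the function name `unanswered_by_id` and the regular expression are just illustrative, and it assumes the default one-line tcpdump format shown above.

```python
import re
from collections import defaultdict

# Lines copied verbatim from the capture above.
CAPTURE = """\
14:48:50.599214 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 50421, length 36
14:48:51.384085 IP a382643a1270 > one.one.one.one: ICMP echo request, id 35072, seq 5, length 64
14:48:51.392669 IP one.one.one.one > a382643a1270: ICMP echo reply, id 35072, seq 5, length 64
14:48:51.599289 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 50435, length 36
14:48:52.384292 IP a382643a1270 > one.one.one.one: ICMP echo request, id 35072, seq 6, length 64
14:48:52.393031 IP one.one.one.one > a382643a1270: ICMP echo reply, id 35072, seq 6, length 64
14:48:52.599559 IP a382643a1270 > one.one.one.one: ICMP echo request, id 33313, seq 50449, length 36
14:48:53.384517 IP a382643a1270 > one.one.one.one: ICMP echo request, id 35072, seq 7, length 64
14:48:53.396626 IP one.one.one.one > a382643a1270: ICMP echo reply, id 35072, seq 7, length 64
"""

# Matches the "ICMP echo request/reply, id N, seq M" part of a tcpdump line.
LINE_RE = re.compile(r"ICMP echo (request|reply), id (\d+), seq (\d+)")


def unanswered_by_id(capture):
    """Return {icmp_id: sorted seq numbers that were requested but never answered}."""
    requests = defaultdict(set)
    replies = defaultdict(set)
    for line in capture.splitlines():
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip tcpdump banner lines and hex dumps
        kind, icmp_id, seq = match.group(1), int(match.group(2)), int(match.group(3))
        target = requests if kind == "request" else replies
        target[icmp_id].add(seq)
    return {icmp_id: sorted(seqs - replies[icmp_id]) for icmp_id, seqs in requests.items()}


if __name__ == "__main__":
    print(unanswered_by_id(CAPTURE))
    # {33313: [50421, 50435, 50449], 35072: []}
```

Every request with id 33313 (blackbox-exporter) stays unanswered, while every request with id 35072 (manual ping) gets its reply, even though both come from the same container and go to the same target.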
I also checked whether the ID field in the IP header is zero-filled, as was discussed in a very similar issue here: #360, but that is not our case.
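For anyone who wants to repeat that check, here is a minimal Python sketch that decodes the first packet from the hex dump above. The helper name `decode` is just illustrative, and it assumes the dump starts at the IPv4 header, as it does in the tcpdump output shown (no link-layer header printed).

```python
import struct

# Bytes of the first captured packet above: IPv4 header + ICMP header + payload.
PACKET_HEX = (
    "4500 0038 f40e 4000 4001 8a90 ac11 0005 "
    "0808 0808 0800 7648 8221 a3fb 5072 6f6d "
    "6574 6865 7573 2042 6c61 636b 626f 7820 "
    "4578 706f 7274 6572"
)


def decode(packet):
    """Extract the IPv4 ID and the ICMP echo fields from a raw IPv4 packet."""
    ihl = (packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    total_len, ip_id = struct.unpack("!HH", packet[2:6])
    icmp_type, icmp_code, checksum, icmp_id, seq = struct.unpack(
        "!BBHHH", packet[ihl:ihl + 8]
    )
    return {
        "ip_total_len": total_len,
        "ip_id": ip_id,
        "icmp_type": icmp_type,  # 8 = echo request
        "icmp_code": icmp_code,
        "icmp_id": icmp_id,
        "icmp_seq": seq,
        "payload": packet[ihl + 8:].decode("ascii"),
    }


if __name__ == "__main__":
    fields = decode(bytes.fromhex(PACKET_HEX))
    print(hex(fields["ip_id"]), fields["icmp_id"], fields["icmp_seq"])
    # 0xf40e 33313 41979
    print(fields["payload"])
    # Prometheus Blackbox Exporter
```

The IP ID is 0xf40e, so it is not zero, and the ICMP id and sequence number match what tcpdump prints for that packet.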
The only correlation we found in Grafana is with very short connection outages from the blackbox-exporter machine to
some of our internal DNS servers, monitored with the same blackbox-exporter (the spikes occur at the same time as the probes start failing).
I would like to investigate more deeply what is going on, but I have no idea where to look now.
Do you have any suggestions for what else to check or how to debug it?
Kind regards,
Tomáš Bartek