Hi,
I'm trying to track down an issue in the IP allocation code in swarm mode in 17.03. The symptom is that you get errors like the following in the logs:
dockerd[592]: time="2017-03-03T07:30:08.877627238+01:00" level=error msg="task allocation failure" error="failed during networktask allocation for task pptolnhiauhhben0i1t0sjb58: failed to allocate network IP for task pptolnhiauhhben0i1t0sjb58 network 004pvcgpxy5tm8xrq5y5m5caf: could not find an available IP"
This eventually happens to all containers, even though no containers are actually running. I've been instrumenting the code to see if I can work out what is going on. It appears that there is a discrepancy between the bitmap used to track allocations and the actually allocated IPs, so the allocator thinks nothing is available when something actually is.
So, the first thing I could use help with: is there some way I can dump the internal pool state so that I can check this theory? I was thinking of tweaking the "network inspect" code to include the pool information, but it appears that this information is hidden behind the allocator and would need to be passed via protobuf messages. Is this the only way or is there something better?
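To make the cross-check I have in mind concrete, here is a small standalone sketch. All names here (Pool, Inconsistencies) are mine, not libnetwork's; it just models a /24 pool as a bitmap next to the set of addresses actually owned by tasks, and reports positions where the two disagree:

```go
package main

import "fmt"

// Pool models a /24 pool the way I understand the allocator tracks it:
// a bitmap of claimed host offsets next to the addresses actually
// assigned to tasks. All names here are mine, not libnetwork's.
type Pool struct {
	bitmap    [256]bool      // offset -> claimed in the bitmap
	allocated map[int]string // offset -> task ID actually using it
}

// Inconsistencies returns offsets where the bitmap and the real
// allocations disagree -- exactly the discrepancy I suspect.
func (p *Pool) Inconsistencies() []int {
	var bad []int
	for off := 0; off < 256; off++ {
		_, inUse := p.allocated[off]
		if p.bitmap[off] != inUse {
			bad = append(bad, off)
		}
	}
	return bad
}

func main() {
	p := &Pool{allocated: map[int]string{3: "task-a"}}
	p.bitmap[3] = true  // consistent: claimed and in use
	p.bitmap[27] = true // leaked: claimed in the bitmap, no task owns it
	fmt.Println(p.Inconsistencies()) // [27]
}
```

Being able to get this kind of report out of a live daemon is what I'm after.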
Secondly, my current theory is that it has to do with the process of recovery after a node has failed. The reason is that the logs contain the following messages when one of the nodes (not the leader) is detected as failed:
(The RequestAddress lines below that print the allocated IP address come from instrumentation I added.)
Mar 7 12:49:09 b3-test dockerd[25717]: time="2017-03-07T12:49:09.752468244+01:00" level=debug msg="miss notification for dest IP, 10.255.0.3"
Mar 7 12:49:09 b3-test dockerd[25717]: time="2017-03-07T12:49:09.788254194+01:00" level=debug msg="75c58c8bd5e11d5d failed to send message to 24c4132b76046066 because it is unreachable [next = 49644, match = 49643, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]"
Mar 7 12:49:09 b3-test dockerd[25717]: time="2017-03-07T12:49:09+01:00" level=info msg="Firewalld running: false"
Mar 7 12:49:10 b3-test dockerd[25717]: time="2017-03-07T12:49:10.447459939+01:00" level=debug msg="memberlist: Failed UDP ping: ctmp-b1-test-bbdb797bca08 (timeout reached)"
Mar 7 12:49:10 b3-test dockerd[25717]: time="2017-03-07T12:49:10.788250415+01:00" level=debug msg="75c58c8bd5e11d5d failed to send message to 24c4132b76046066 because it is unreachable [next = 49644, match = 49643, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]"
Mar 7 12:49:10 b3-test dockerd[25717]: time="2017-03-07T12:49:10.947219122+01:00" level=info msg="memberlist: Suspect ctmp-b1-test-bbdb797bca08 has failed, no acks received"
Mar 7 12:49:11 b3-test dockerd[25717]: time="2017-03-07T12:49:11.447563924+01:00" level=debug msg="memberlist: Failed UDP ping: ctmp-b1-test-bbdb797bca08 (timeout reached)"
Mar 7 12:49:11 b3-test dockerd[25717]: time="2017-03-07T12:49:11.788054638+01:00" level=debug msg="75c58c8bd5e11d5d failed to send message to 24c4132b76046066 because it is unreachable [next = 49644, match = 49643, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]"
Mar 7 12:49:11 b3-test dockerd[25717]: time="2017-03-07T12:49:11.947410183+01:00" level=info msg="memberlist: Suspect ctmp-b1-test-bbdb797bca08 has failed, no acks received"
Mar 7 12:49:12 b3-test dockerd[25717]: time="2017-03-07T12:49:12.788246568+01:00" level=debug msg="75c58c8bd5e11d5d failed to send message to 24c4132b76046066 because it is unreachable [next = 49644, match = 49643, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]"
Mar 7 12:49:12 b3-test dockerd[25717]: time="2017-03-07T12:49:12.924723032+01:00" level=debug msg="memberlist: TCP connection from=172.16.102.102:56382"
Mar 7 12:49:12 b3-test dockerd[25717]: time="2017-03-07T12:49:12.927500266+01:00" level=debug msg="ctmp-b3-test-e605bab6f418: Initiating bulk sync for networks [w6rjj8erkvrasbfmq45cm9wea jj73g9sca5locqxh701z9vfnu cvc08ylrvac5dhruqqes6hqrg 1vhmng05l7u825kayllgb7mmr 004pvcgpxy5tm8xrq5y5m5caf raopgloeigrrtbiftoa9rtclf uqs0xq1sd3yfaai4ospjxbwiq g14009otlta8faghs50yfvan2] with node ctmp-b2-test-9ffb2d84eab3"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.309488333+01:00" level=debug msg="heartbeat expiration"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.354791895+01:00" level=debug msg="node status updated" method="(*Dispatcher).processUpdates" module=dispatcher node.id=gvyh45n4iccw9ip0u7p0k0w61
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.355389640+01:00" level=debug msg="75c58c8bd5e11d5d failed to send message to 24c4132b76046066 because it is unreachable [next = 49644, match = 49643, state = ProgressStateProbe, waiting = true, pendingSnapshot = 0]"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.366949348+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.2.0/24, <nil>, map[])"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.367000925+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.2.0/24, <nil>, map[]) -> 10.0.2.27"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.367022570+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.4.0/24, <nil>, map[])"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.367040371+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.4.0/24, <nil>, map[]) -> 10.0.4.11"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.367056298+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.1.0/24, <nil>, map[])"
followed by the allocation of 35 IP addresses in various networks. These IP addresses don't appear to be assigned to any service or container. I checked this using:
docker ps -aq | xargs docker inspect | grep IPAddress | grep -F '10.0.' | less
docker service ls -q | xargs docker service inspect | grep -w Addr | less
I haven't yet found what triggers this allocation.
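My working assumption for how this ends in "could not find an available IP" is simple leakage: if these recovery-time allocations are never released, every failover claims a few more bits until the pool is exhausted. A toy model of that (this is speculation about the mechanism, and the names are mine, not libnetwork's; the pool is shrunk to 8 addresses to keep it short):

```go
package main

import (
	"errors"
	"fmt"
)

// bitmapPool is a toy 8-address pool. Allocate claims the first free
// bit; Release clears it. Names are mine, not libnetwork's.
type bitmapPool struct{ bits [8]bool }

var errPoolFull = errors.New("could not find an available IP")

func (p *bitmapPool) Allocate() (int, error) {
	for i := range p.bits {
		if !p.bits[i] {
			p.bits[i] = true
			return i, nil
		}
	}
	return -1, errPoolFull
}

func (p *bitmapPool) Release(i int) { p.bits[i] = false }

func main() {
	p := &bitmapPool{}
	// Simulate repeated failovers: each round allocates two addresses
	// during recovery but only releases one, leaking a bit per round.
	for round := 0; ; round++ {
		_, err := p.Allocate() // the first allocation is never released
		if err != nil {
			fmt.Printf("round %d: %v\n", round, err)
			return
		}
		if b, err := p.Allocate(); err == nil {
			p.Release(b) // only the second allocation is undone
		}
	}
}
```

With one bit leaked per round, the pool is full after 8 rounds, which matches the "eventually everything fails" symptom even with no containers running.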
BTW, this also seems to cause issues with the tracking of memory/CPU reservations. I also see lines like the following straight after:
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.411551384+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.1.0/24, <nil>, map[])"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.411584304+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.1.0/24, <nil>, map[]) -> 10.0.1.27"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.411600322+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.6.0/24, <nil>, map[])"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.411616388+01:00" level=debug msg="RequestAddress(GlobalDefault/10.0.6.0/24, <nil>, map[]) -> 10.0.6.9"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.464243279+01:00" level=debug msg="assigning to node oiv4mjh3us46phi0gjnmz5jup" task.id=o2yb2qb3vdgel709hea2yp5zv
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.464296676+01:00" level=info msg="addTask o2yb2qb3vdgel709hea2yp5zv (node oiv4mjh3us46phi0gjnmz5jup) -= (47000000, 1000000) => (2973410752, 1928452000)"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.464322440+01:00" level=debug msg="assigning to node 0viyjq1wphptbllu3guubjqz1" task.id=zkokj14ww5zyg9ml6e205gzau
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.464338601+01:00" level=info msg="addTask zkokj14ww5zyg9ml6e205gzau (node 0viyjq1wphptbllu3guubjqz1) -= (54000000, 40569000) => (1660410752, 1835461000)"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.464359242+01:00" level=debug msg="assigning to node oiv4mjh3us46phi0gjnmz5jup" task.id=7nx9fqzvmvg8ood4chsun9fau
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.464371571+01:00" level=info msg="addTask 7nx9fqzvmvg8ood4chsun9fau (node oiv4mjh3us46phi0gjnmz5jup) -= (897000000, 20447000) => (2076410752, 1908005000)"
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.464399748+01:00" level=debug msg="assigning to node oiv4mjh3us46phi0gjnmz5jup" task.id=zkokj14ww5zyg9ml6e205gzau
Mar 7 12:49:13 b3-test dockerd[25717]: time="2017-03-07T12:49:13.464413581+01:00" level=info msg="addTask zkokj14ww5zyg9ml6e205gzau (node oiv4mjh3us46phi0gjnmz5jup) -= (54000000, 40569000) => (2022410752, 1867436000)"
Note that task zkokj14ww5zyg9ml6e205gzau has been assigned to two nodes and its reservation deducted from both (I have instrumented both addTask and removeTask). The second reservation is later undone; the first is not. This eventually leads to tasks failing to start because a node's memory/CPU reservations appear to be exhausted when they really aren't.
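To make the accounting problem concrete, here is a toy model of what I believe is happening (hypothetical names, not swarmkit's actual types): addTask subtracts the task's reservation from whatever node it lands on, so assigning the same task to two nodes and undoing only the second deduction leaves the first node permanently short:

```go
package main

import "fmt"

// node tracks available memory the way I understand the scheduler's
// accounting works; the names are mine, not swarmkit's.
type node struct{ availMem int64 }

func addTask(n *node, mem int64)    { n.availMem -= mem }
func removeTask(n *node, mem int64) { n.availMem += mem }

func main() {
	n1 := &node{availMem: 2_000_000_000}
	n2 := &node{availMem: 2_000_000_000}
	const taskMem = 54_000_000 // the reservation from the log above

	// The bug as I observe it: the same task is assigned to both nodes...
	addTask(n1, taskMem)
	addTask(n2, taskMem)
	// ...but only the second assignment is later undone.
	removeTask(n2, taskMem)

	// n1 is now short by taskMem even though the task isn't running
	// there; repeat this often enough and scheduling fails.
	fmt.Println(n1.availMem, n2.availMem) // 1946000000 2000000000
}
```

Each failover that double-assigns a task would leak one reservation this way, which is consistent with the slow drift I'm seeing.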
So my second question is: does anybody have an idea what might trigger the above behaviour on node restart?
Thanks in advance,
--
Martijn