Minions not able to connect back to master when losing connection

Olivier M

May 13, 2017, 11:26:00 AM
to Salt-users
Hello,

I've had this issue for a long time. In my setup, the master sits behind an unreliable internet line that sometimes loses connectivity.
Whenever the master becomes unreachable - either because of the line or a simple service restart - the minions fail to reconnect to it.

I use the following minion config:
default_include: minion.d/*.conf
master: master.xxxx
master_alive_interval: 30
master_tries: -1
ping_interval: 1
tcp_keepalive: True
tcp_keepalive_idle: 60
grains_refresh_every: 5
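
Something I have not tried yet is tuning the ZeroMQ reconnect backoff. The option names below are taken from the minion config documentation and the values are the documented defaults, so treat this as an untested sketch rather than something I have verified:

recon_default: 1000      # initial reconnect wait for the ZeroMQ transport, in ms
recon_max: 10000         # upper bound for the (randomized) reconnect wait, in ms
recon_randomize: True    # randomize the wait so minions don't all reconnect at once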

I've noticed that most of the time they come back after the grains_refresh_every delay, if that delay is small enough... but sometimes - actually quite often - they don't.
Some minions end up with this log entry:
2017-05-13 02:32:26,394 [salt.minion                                          ][ERROR   ][2812] ** Master Ping failed. Attempting to restart minion**

Others don't... The result is the same either way: the minion cannot be contacted until the salt-minion service is restarted.

Any idea what's going on? Is there a way to test, from the minion side, whether the connection to the master is still healthy, so that the service can be proactively restarted?
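
The closest thing I have found in the docs so far is status.master in the status execution module, which is supposed to report whether a TCP connection to the master's publish port currently exists. I have not verified how it behaves during an actual outage, so this is only an untested sketch of a watchdog:

# on the minion: returns True/False depending on whether a connection
# to the master is currently established
salt-call --local status.master master=master.xxxx connected=True

The idea would be to run this from cron and restart salt-minion whenever it reports False.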

Version reports:
# salt-minion --versions-report
Salt Version:
           Salt: 2016.11.4
 
Dependency Versions:
           cffi: 1.10.0
       cherrypy: Not Installed
       dateutil: 2.5.3
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.9.6
        libgit2: 0.25.1
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: 0.25.0
         Python: 2.7.13 (default, Dec 22 2016, 09:22:15)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.1
            ZMQ: 4.1.5
 
System Versions:
           dist:   
        machine: x86_64
        release: 4.9.19-0-virtgrsec
         system: Linux
        version: Not Installed
 
# salt --versions-report
Salt Version:
           Salt: 2016.11.4
 
Dependency Versions:
           cffi: 1.8.3
       cherrypy: Not Installed
       dateutil: 2.5.3
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.9.6
        libgit2: 0.25.1
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: 0.25.0
         Python: 2.7.13 (default, Dec 22 2016, 09:22:15)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.4.2
            ZMQ: 4.1.5
 
System Versions:
           dist:   
        machine: x86_64
        release: 4.9.24-2-virtgrsec
         system: Linux
        version: Not Installed
 

Chris Apsey

May 14, 2017, 12:14:08 AM
to Salt-users
+1 to this - I've noticed something similar as well. When we do rolling upgrades across our edge firewalls, the IPsec tunnels that connect minions to their master fail over (loss of connectivity for no more than a few seconds while the tunnels are rebuilt). Afterwards the minions don't respond to test.ping, even though the master's destination IP is exactly the same (the master also responds to ping, I can nc to 4505/4506, etc.). If I manually restart a minion, it picks up the master again immediately without issue.

When we switch back to the primary firewalls after the upgrade is complete, any minions that were manually restarted have to be manually restarted *again*, while those that weren't restarted after the first failover start responding to test.ping on their own as soon as the original firewall becomes the default gateway again.

It's not a failure of gratuitous ARP, as other non-salt traffic gets routed as expected.  This behavior has always perplexed me.
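
Next time it fails over I plan to compare what the minion's existing TCP sessions look like versus what can actually be opened fresh; something along these lines (4505/4506 being the standard publish/return ports, master address to taste):

# on a stuck minion: is there still an ESTABLISHED session to the master?
ss -tnp | grep -E ':450[56]'
# and can a brand-new session be opened?
nc -vz <master> 4505
nc -vz <master> 4506

My guess is the minion is left holding a half-open connection that the failover silently killed, so nothing arrives on its subscribe socket until a new session is established.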

Olivier M

May 14, 2017, 5:56:02 AM
to Salt-users
Oh how nice to see I'm not alone :)
So I did some more tests, since a simple master restart over the internet makes the problem 100% reproducible.

- Restart the master
- Run test.ping on all minions (7 over the internet, 2 on the LAN)
  - the 2 LAN minions answer immediately
  - of the 7 remote minions:
    - 3 reply OK
    - 4 reply "Minion did not return."
      - 1 shows [No response]
      - 3 show [Not connected]
- Send an event from one of the failing minions while listening for events on the master (exact commands below)
  - The event gets through and is received by the master!
  - If the minion was in [No response] => no change
  - If the minion was in [Not connected] => it changes to [No response]
- The minions still don't respond to test.ping
- After a while [No response] reverts to [Not connected], until you fire another event, which again gets through as if nothing were wrong
- The minions never come back until the salt-minion service is restarted
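
For anyone who wants to reproduce this, the event test was along these lines (the tag is arbitrary):

# on the master: watch the event bus
salt-run state.event pretty=True
# on a failing minion: fire a test event
salt-call event.send 'test/connectivity' '{"hello": "world"}'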

Anything else I could do?

Dmitri Maziuk

May 14, 2017, 12:49:52 PM
to salt-...@googlegroups.com
On 2017-05-14 04:56, Olivier M wrote:
> Oh how nice to see I'm not alone :)

FWIW I get a (very) occasional "minion did not return" from minions on
the LAN too, and so far it has seemed entirely random.

Dima

Nikita Bochenko

May 15, 2017, 3:18:36 AM
to Salt-users
We have a very similar "random" issue that happens a couple of times a week. We traced it down to snapshots being taken, which in turn make the master or minion unavailable for a couple of seconds or minutes (depending on the load), essentially breaking network connections during that window.

It would be helpful if there were a way to tune this. It could also be a bug, though I am not sure at the moment whether this behavior is intended.
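
If it can be tuned, I would expect it to be via the TCP keepalive options in the minion config. The option names below are from the minion config documentation; the values are only a guess at what might ride out our snapshot pauses, not something we have tested yet:

tcp_keepalive: True
tcp_keepalive_idle: 60     # seconds of idle before the first keepalive probe
tcp_keepalive_intvl: 10    # seconds between unanswered probes
tcp_keepalive_cnt: 5       # unanswered probes before the connection is declared dead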

Raine Curtis

May 16, 2017, 12:57:38 PM
to Salt-users
You might want to consider switching to the TCP transport (which is built on Tornado) instead of ZeroMQ. It is purely TCP-based and has fixed issues like these in the past.

You can enable this on both the minion and the master by updating the configuration and restarting each service, changing the transport setting

from:

transport: zeromq

to:

transport: tcp
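
For example (assuming the default config paths and systemd; as far as I know the master and its minions must agree on the transport):

# set in both /etc/salt/master and /etc/salt/minion
transport: tcp

# then restart each service
systemctl restart salt-master    # on the master
systemctl restart salt-minion    # on each minion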

Olivier M

May 16, 2017, 3:10:50 PM
to Salt-users
Thanks Raine for the idea.

I've modified my master and minion configs to use the tcp transport and re-ran my salt-master restart test.
It looks a bit better, but it's still random...
Across 9 minions I get:
  - 2 LAN minions reconnecting in around 30 s
  - 5 WAN minions reconnecting in about 2 min
  - 2 WAN minions taking up to 20 min to come back

So in the end all the minions do reconnect, but I can't make sense of this 20-minute delay...
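
One guess about the stragglers: 20 minutes is in the range of how long Linux keeps retransmitting on an established TCP connection before dropping it (net.ipv4.tcp_retries2, default 15, works out to roughly 13 to 30 minutes). I have not tested whether lowering it changes anything here, so this is only something to verify, not a recommendation:

# current retransmission limit for established connections
sysctl net.ipv4.tcp_retries2
# lower it to shorten how long a dead connection lingers
# (system-wide: affects every TCP connection on the host)
sysctl -w net.ipv4.tcp_retries2=8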