Minions not able to connect back to master when losing connection

Olivier M

May 13, 2017, 11:26:00 AM
to Salt-users
Hello,

I've had this issue for a long time. In my setup, the master sits behind an unreliable internet line that sometimes loses connectivity.
Whenever the master becomes unreachable - either because of the line or a simple service restart - the minions fail to reconnect to it.

I use the following minion config:
default_include: minion.d/*.conf
master: master.xxxx
master_alive_interval: 30
master_tries: -1
ping_interval: 1
tcp_keepalive: True
tcp_keepalive_idle: 60
grains_refresh_every: 5
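
Something I have not tried yet is tuning the ZeroMQ reconnect backoff. The option names below are taken from the minion config documentation and the values are the documented defaults, so treat this as an untested sketch rather than something I have verified:

recon_default: 1000      # initial reconnect wait for the ZeroMQ transport, in ms
recon_max: 10000         # upper bound for the (randomized) reconnect wait, in ms
recon_randomize: True    # randomize the wait so minions don't all reconnect at once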

I've noticed that most of the time they come back after the grains_refresh_every delay, if that delay is small enough... but sometimes - actually quite often - they don't.
Some minions end up with this log entry:
2017-05-13 02:32:26,394 [salt.minion                                          ][ERROR   ][2812] ** Master Ping failed. Attempting to restart minion**

Others don't... The result is the same either way: the minion cannot be contacted until the salt-minion service is restarted.

Any idea what's going on? Is there a way to test, from the minion side, whether the connection to the master is still healthy, so that the service can be proactively restarted?
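
The closest thing I have found in the docs so far is status.master in the status execution module, which is supposed to report whether a TCP connection to the master's publish port currently exists. I have not verified how it behaves during an actual outage, so this is only an untested sketch of a watchdog:

# on the minion: returns True/False depending on whether a connection
# to the master is currently established
salt-call --local status.master master=master.xxxx connected=True

The idea would be to run this from cron and restart salt-minion whenever it reports False.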

Version reports:
# salt-minion --versions-report
Salt Version:
           Salt: 2016.11.4
 
Dependency Versions:
           cffi: 1.10.0
       cherrypy: Not Installed
       dateutil: 2.5.3
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.9.6
        libgit2: 0.25.1
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: 0.25.0
         Python: 2.7.13 (default, Dec 22 2016, 09:22:15)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.1
            ZMQ: 4.1.5
 
System Versions:
           dist:   
        machine: x86_64
        release: 4.9.19-0-virtgrsec
         system: Linux
        version: Not Installed
 
# salt --versions-report
Salt Version:
           Salt: 2016.11.4
 
Dependency Versions:
           cffi: 1.8.3
       cherrypy: Not Installed
       dateutil: 2.5.3
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
          ioflo: Not Installed
         Jinja2: 2.9.6
        libgit2: 0.25.1
        libnacl: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.4.8
   mysql-python: Not Installed
      pycparser: 2.14
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: 0.25.0
         Python: 2.7.13 (default, Dec 22 2016, 09:22:15)
   python-gnupg: Not Installed
         PyYAML: 3.12
          PyZMQ: 16.0.2
           RAET: Not Installed
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.4.2
            ZMQ: 4.1.5
 
System Versions:
           dist:   
        machine: x86_64
        release: 4.9.24-2-virtgrsec
         system: Linux
        version: Not Installed
 

Chris Apsey

May 14, 2017, 12:14:08 AM
to Salt-users
+1 to this - I've noticed something similar as well. When we do rolling upgrades across our edge firewalls, the IPsec tunnels that connect minions to their master fail over (loss of connectivity for no more than a few seconds while the tunnels are rebuilt). Afterwards the minions don't respond to test.ping, even though the master's destination IP is exactly the same (the master also responds to ping, I can nc to 4505/4506, etc.). If I manually restart a minion, it picks up the master again immediately without issue.

When we switch back to the primary firewalls after the upgrade is complete, any minions that were manually restarted have to be manually restarted *again*, while those that weren't restarted after the first failover start responding to test.ping on their own as soon as the original firewall becomes the default gateway again.

It's not a failure of gratuitous ARP, as other non-salt traffic gets routed as expected.  This behavior has always perplexed me.
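
Next time it fails over I plan to compare what the minion's existing TCP sessions look like versus what can actually be opened fresh; something along these lines (4505/4506 being the standard publish/return ports, master address to taste):

# on a stuck minion: is there still an ESTABLISHED session to the master?
ss -tnp | grep -E ':450[56]'
# and can a brand-new session be opened?
nc -vz <master> 4505
nc -vz <master> 4506

My guess is the minion is left holding a half-open connection that the failover silently killed, so nothing arrives on its subscribe socket until a new session is established.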

Olivier M

May 14, 2017, 5:56:02 AM
to Salt-users
Oh how nice to see I'm not alone :)
So I did some more tests, since a simple master restart over the internet makes the problem 100% reproducible.

- Restart the master
- Run test.ping on all minions (7 over the internet, 2 on the LAN)
  - the 2 LAN minions answer immediately
  - of the 7 remote minions:
    - 3 reply OK
    - 4 reply "Minion did not return."
      - 1 shows [No response]
      - 3 show [Not connected]
- Send an event from one of the failing minions while listening for events on the master (exact commands below)
  - The event gets through and is received by the master!
  - If the minion was in [No response] => no change
  - If the minion was in [Not connected] => it changes to [No response]
- The minions still don't respond to test.ping
- After a while [No response] reverts to [Not connected], until you fire another event, which again gets through as if nothing were wrong
- The minions never come back until the salt-minion service is restarted
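
For anyone who wants to reproduce this, the event test was along these lines (the tag is arbitrary):

# on the master: watch the event bus
salt-run state.event pretty=True
# on a failing minion: fire a test event
salt-call event.send 'test/connectivity' '{"hello": "world"}'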

Anything else I could do?

Dmitri Maziuk

May 14, 2017, 12:49:52 PM
to salt-...@googlegroups.com
On 2017-05-14 04:56, Olivier M wrote:
> Oh how nice to see I'm not alone :)

FWIW I get a (very) occasional "minion did not return" from minions on
the LAN too, and so far it has seemed entirely random.

Dima

Nikita Bochenko

May 15, 2017, 3:18:36 AM
to Salt-users
We have a very similar "random" issue that happens a couple of times a week. We traced it down to snapshots being taken, which in turn make the master or minion unavailable for a couple of seconds or minutes (depending on the load), essentially breaking network connections during that window.

It would be helpful if there were a way to tune this. It could also be a bug, though I am not sure at the moment whether this behavior is intended.
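
If it can be tuned, I would expect it to be via the TCP keepalive options in the minion config. The option names below are from the minion config documentation; the values are only a guess at what might ride out our snapshot pauses, not something we have tested yet:

tcp_keepalive: True
tcp_keepalive_idle: 60     # seconds of idle before the first keepalive probe
tcp_keepalive_intvl: 10    # seconds between unanswered probes
tcp_keepalive_cnt: 5       # unanswered probes before the connection is declared dead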

Raine Curtis

May 16, 2017, 12:57:38 PM
to Salt-users
You might want to consider switching to the TCP transport (which is built on Tornado) instead of ZeroMQ. It is purely TCP-based and has fixed issues like these in the past.

You can enable this on both the minion and the master by updating the configuration and restarting each service, changing the transport setting

from:

transport: zeromq

to:

transport: tcp
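
For example (assuming the default config paths and systemd; as far as I know the master and its minions must agree on the transport):

# set in both /etc/salt/master and /etc/salt/minion
transport: tcp

# then restart each service
systemctl restart salt-master    # on the master
systemctl restart salt-minion    # on each minion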

Olivier M

May 16, 2017, 3:10:50 PM
to Salt-users
Thanks Raine for the idea.

I've modified my master and minion configs to use the tcp transport and re-ran my salt-master restart test.
It looks a bit better, but it's still random...
Across 9 minions I get:
  - 2 LAN minions reconnecting in around 30 s
  - 5 WAN minions reconnecting in about 2 min
  - 2 WAN minions taking up to 20 min to come back

So in the end all the minions do reconnect, but I can't make sense of this 20-minute delay...
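
One guess about the stragglers: 20 minutes is in the range of how long Linux keeps retransmitting on an established TCP connection before dropping it (net.ipv4.tcp_retries2, default 15, works out to roughly 13 to 30 minutes). I have not tested whether lowering it changes anything here, so this is only something to verify, not a recommendation:

# current retransmission limit for established connections
sysctl net.ipv4.tcp_retries2
# lower it to shorten how long a dead connection lingers
# (system-wide: affects every TCP connection on the host)
sysctl -w net.ipv4.tcp_retries2=8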