Salt master/minion communication broken after network reconfigure.

Danny Smit

Mar 27, 2018, 9:50:20 AM
to Salt-users
Hi all,

I'm trying to provision and configure CentOS-based machines using the Salt state network.managed. This state changes the network configuration from DHCP to a static configuration, which means it changes the node's IP address. The state itself works perfectly: network.managed triggers an implicit ifdown/ifup, so the new static address is activated on the fly. Communication between the master and minion machines over the new IP address also works immediately at the network level (ping and ssh work in both directions).

However, communication between the Salt master and minion seems to stall. After the network state is done, other state files are executed that require files or data from the Salt master (using states like file.managed). The very first state that does so stalls for about 4 minutes and finally results in a SaltReqTimeoutError in the minion log file. After that, the Salt run continues, and all other state files that use states like file.managed are able to communicate with the master just fine.

In the meantime, while the minion is stalled, the Salt master already reports "Minion did not return. [No response]". Even after the minion has continued and finished processing all of the states (so communication with the master seems to be restored), the master still fails to communicate with the minion: the following call keeps failing with "Minion did not return. [No response]" as the reply:

  salt 'minion.test.local' test.ping

The interesting thing is that restarting the salt-minion restores communication between master and minion. After a restart, a test.ping from master to minion is successful again.

I understand it is to be expected that the minion will attempt to reconnect to the master; however, it does cause one state to fail. Additionally, the master stays confused. Is that expected behavior? And if so, is there a way to speed up the process so that Salt recovers without one state failing?

Below is what the minion shows in its log for the very first state that runs after the network is reconfigured. I can post more logs and info if required.

2018-03-22 17:33:43,876 [salt.state       ][INFO    ][12036] Executing state file.blockreplace for [/etc/hosts]
2018-03-22 17:34:43,919 [salt.transport.zeromq][DEBUG   ][12036] SaltReqTimeoutError, retrying. (1/3)
2018-03-22 17:35:43,977 [salt.transport.zeromq][DEBUG   ][12036] SaltReqTimeoutError, retrying. (2/3)
2018-03-22 17:36:44,036 [salt.transport.zeromq][DEBUG   ][12036] SaltReqTimeoutError, retrying. (3/3)
2018-03-22 17:37:44,094 [salt.state       ][ERROR   ][12036] An exception occurred in this state: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/state.py", line 1837, in call
   **cdata['kwargs'])
 File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1794, in wrapper
   return f(*args, **kwargs)
 File "/usr/lib/python2.7/site-packages/salt/states/file.py", line 4184, in blockreplace
   context=context)
 File "/usr/lib/python2.7/site-packages/salt/states/file.py", line 1059, in _get_template_texts
   **kwargs
 File "/usr/lib/python2.7/site-packages/salt/modules/cp.py", line 301, in get_template
   **kwargs)
 File "/usr/lib/python2.7/site-packages/salt/fileclient.py", line 731, in get_template
   sfn = self.cache_file(url, saltenv, cachedir=cachedir)
 File "/usr/lib/python2.7/site-packages/salt/fileclient.py", line 189, in cache_file
   return self.get_url(path, '', True, saltenv, cachedir=cachedir)
 File "/usr/lib/python2.7/site-packages/salt/fileclient.py", line 495, in get_url
   result = self.get_file(url, dest, makedirs, saltenv, cachedir=cachedir)
 File "/usr/lib/python2.7/site-packages/salt/fileclient.py", line 1044, in get_file
   hash_server, stat_server = self.hash_and_stat_file(path, saltenv)
 File "/usr/lib/python2.7/site-packages/salt/fileclient.py", line 1285, in hash_and_stat_file
   return self.__hash_and_stat_file(path, saltenv)
 File "/usr/lib/python2.7/site-packages/salt/fileclient.py", line 1270, in __hash_and_stat_file
   return self.channel.send(load)
 File "/usr/lib/python2.7/site-packages/salt/utils/async.py", line 75, in wrap
   ret = self._block_future(ret)
 File "/usr/lib/python2.7/site-packages/salt/utils/async.py", line 85, in _block_future
   return future.result()
 File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
   raise_exc_info(self._exc_info)
 File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 876, in run
   yielded = self.gen.throw(*exc_info)
 File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 280, in send
   ret = yield self._crypted_transfer(load, tries=tries, timeout=timeout, raw=raw)
 File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 870, in run
   value = future.result()
 File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
   raise_exc_info(self._exc_info)
 File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 876, in run
   yielded = self.gen.throw(*exc_info)
 File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 248, in _crypted_transfer
   ret = yield _do_transfer()
 File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 870, in run
   value = future.result()
 File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
   raise_exc_info(self._exc_info)
 File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 876, in run
   yielded = self.gen.throw(*exc_info)
 File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 232, in _do_transfer
   tries=tries,
 File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 870, in run
   value = future.result()
 File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
   raise_exc_info(self._exc_info)
 File "<string>", line 3, in raise_exc_info
SaltReqTimeoutError: Message timed out
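
The timestamps in the log above can be checked with a few lines of Python. This is only a sketch: it shows that the four attempts (the initial request plus three retries) are each about 60 seconds apart, which adds up to the roughly 4-minute stall. Reading the 60 seconds as a per-attempt request timeout is an inference from the log, not something the log states.

```python
from datetime import datetime

# Timestamps copied from the minion log above.
LOG_TIMES = [
    "2018-03-22 17:33:43,876",  # file.blockreplace starts
    "2018-03-22 17:34:43,919",  # SaltReqTimeoutError, retrying (1/3)
    "2018-03-22 17:35:43,977",  # SaltReqTimeoutError, retrying (2/3)
    "2018-03-22 17:36:44,036",  # SaltReqTimeoutError, retrying (3/3)
    "2018-03-22 17:37:44,094",  # exception raised in the state
]


def parse(ts):
    """Parse a timestamp in the format used by the Salt log lines."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


times = [parse(t) for t in LOG_TIMES]

# Seconds between consecutive log lines, rounded to whole seconds.
gaps = [round((b - a).total_seconds()) for a, b in zip(times, times[1:])]

# Total time the state was blocked before it failed.
total = round((times[-1] - times[0]).total_seconds())

print(gaps)   # [60, 60, 60, 60]
print(total)  # 240
```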

-- 
Danny

Jeremy McMillan

Apr 2, 2018, 11:40:51 PM
to Salt-users
The salt-minion should probably be restarted after the network.managed state, to ensure that neither it nor the master has any stale sockets.


I would do the forked shell version rather than using at(1), which is no longer installed by default on most systems.

This will not solve the problem of losing the states that remain after network.managed and the minion restart.

I would try to set this up as a separate reactor-triggered run, perhaps triggered on minion start but targeted via a custom grain that indicates whether the network interface is static or DHCP.
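
The restart suggested above could be sketched as a state file. The following is a minimal sketch, assuming Salt's Python renderer (`#!py`) and a forked-shell restart along the lines mentioned above. The state IDs, interface name, and addresses are hypothetical placeholders, and the exact network.managed arguments depend on the platform and Salt version. It only addresses the stale sockets; any states that still have to run after the restart need a separate run.

```python
#!py
# Sketch of a Salt state file for the Python renderer: run() returns the
# same data structure a YAML state file would produce.

IFACE = "eth0"  # assumption: name of the interface being reconfigured

# Forked-shell restart: detach from the minion's stdio, then restart the
# minion from a background shell after a short delay, so the current job
# can return before the service goes down.
RESTART_CMD = """\
exec 0>&-  # close stdin
exec 1>&-  # close stdout
exec 2>&-  # close stderr
nohup /bin/sh -c 'sleep 10 && salt-call --local service.restart salt-minion' &
"""


def run():
    return {
        # Hypothetical ID; switches the interface to a static address.
        "static_network": {
            "network.managed": [
                {"name": IFACE},
                {"enabled": True},
                {"type": "eth"},
                {"proto": "none"},
                {"ipaddr": "192.0.2.10"},       # placeholder address
                {"netmask": "255.255.255.0"},   # placeholder netmask
            ]
        },
        # Restart the minion only when the network state reported changes.
        "restart_salt_minion": {
            "cmd.run": [
                {"name": RESTART_CMD},
                {"python_shell": True},
                {"onchanges": [{"network": "static_network"}]},
            ]
        },
    }
```

Because of the onchanges requisite, the restart is skipped on later runs where the interface is already static.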

Danny Smit

Apr 5, 2018, 10:12:35 AM
to salt-...@googlegroups.com
Thanks for the response. Restarting the minion definitely works
as a short-term fix.

I'm not yet familiar with the reactor system, but that looks like a
promising way to get around the issue. I assume that it could be used to
run the network configuration part first, then trigger a restart of
the minion, and after that run the remaining part of the configuration?

One thing that may complicate things a bit is that in my case
Foreman (https://theforeman.org/) is used in combination with
Salt, and it currently triggers a highstate autonomously. I'm not sure
yet how well this works with a two-step approach like this.