Hi all,
I'm trying to provision and configure CentOS-based machines using the Salt state network.managed. This state switches the network configuration from DHCP to a static configuration, which changes the node's IP address. The state itself works perfectly: network.managed triggers an implicit ifdown/ifup, so the new static address is activated on the fly. Network-level communication between the master and the minion at its new IP address also works immediately (ping and ssh in both directions).
However, communication between the Salt master and the minion seems to stall. After the network state finishes, other states run that need files or data from the master (using states like file.managed). The very first state that does so stalls for about 4 minutes and finally fails with a SaltReqTimeoutError in the minion log. After that, the run continues, and all later states that fetch from the master are able to communicate with it just fine.
Meanwhile, while the minion is stalled, the master already reports "Minion did not return. [No response]". Even after the minion has continued and finished processing all of the states (so minion-to-master communication seems to be restored), the master still fails to reach the minion. The following call keeps returning "Minion did not return. [No response]":
salt 'minion.test.local' test.ping
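To check whether the master ever recovers on its own, the call above could be repeated in a loop. This is only a rough sketch: the helper names, the attempt count and the delay are made up for illustration, and it assumes the `salt` CLI is on the master's PATH and prints `True` for a successful test.ping.

```python
import subprocess
import time


def minion_responds(minion_id, run=subprocess.run):
    """Run `salt <minion_id> test.ping` and report whether it printed True.

    Assumption: a successful ping shows "True" in stdout, while a failure
    shows "Minion did not return. [No response]" instead.
    """
    result = run(
        ["salt", minion_id, "test.ping"],
        capture_output=True,
        text=True,
    )
    return "True" in result.stdout


def wait_for_minion(minion_id, attempts=10, delay=30.0,
                    check=minion_responds, sleep=time.sleep):
    """Retry the ping; return the attempt number that succeeded, else None."""
    for attempt in range(1, attempts + 1):
        if check(minion_id):
            return attempt
        if attempt < attempts:
            # Wait before the next try, but not after the final one.
            sleep(delay)
    return None
```

Calling `wait_for_minion('minion.test.local')` on the master would then return the number of the first attempt that got a reply, or `None` if none did.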
The interesting thing is that restarting the salt minion restores communication between master and minion. After the restart, a test.ping from master to minion succeeds again.
I understand it is to be expected that the minion will attempt to reconnect to the master; however, it does cause one state to fail. Additionally, the master stays confused. Is that expected behavior? And if so, is there a way to speed up the process so that Salt recovers without a state failing?
Below is what the minion logs for the very first state that runs after the network is reconfigured. I can post more logs and info if required.
2018-03-22 17:33:43,876 [salt.state ][INFO ][12036] Executing state file.blockreplace for [/etc/hosts]
2018-03-22 17:34:43,919 [salt.transport.zeromq][DEBUG ][12036] SaltReqTimeoutError, retrying. (1/3)
2018-03-22 17:35:43,977 [salt.transport.zeromq][DEBUG ][12036] SaltReqTimeoutError, retrying. (2/3)
2018-03-22 17:36:44,036 [salt.transport.zeromq][DEBUG ][12036] SaltReqTimeoutError, retrying. (3/3)
2018-03-22 17:37:44,094 [salt.state ][ERROR ][12036] An exception occurred in this state: Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/salt/state.py", line 1837, in call
**cdata['kwargs'])
File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1794, in wrapper
return f(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/salt/states/file.py", line 4184, in blockreplace
context=context)
File "/usr/lib/python2.7/site-packages/salt/states/file.py", line 1059, in _get_template_texts
**kwargs
File "/usr/lib/python2.7/site-packages/salt/modules/cp.py", line 301, in get_template
**kwargs)
File "/usr/lib/python2.7/site-packages/salt/fileclient.py", line 731, in get_template
sfn = self.cache_file(url, saltenv, cachedir=cachedir)
File "/usr/lib/python2.7/site-packages/salt/fileclient.py", line 189, in cache_file
return self.get_url(path, '', True, saltenv, cachedir=cachedir)
File "/usr/lib/python2.7/site-packages/salt/fileclient.py", line 495, in get_url
result = self.get_file(url, dest, makedirs, saltenv, cachedir=cachedir)
File "/usr/lib/python2.7/site-packages/salt/fileclient.py", line 1044, in get_file
hash_server, stat_server = self.hash_and_stat_file(path, saltenv)
File "/usr/lib/python2.7/site-packages/salt/fileclient.py", line 1285, in hash_and_stat_file
return self.__hash_and_stat_file(path, saltenv)
File "/usr/lib/python2.7/site-packages/salt/fileclient.py", line 1270, in __hash_and_stat_file
return self.channel.send(load)
File "/usr/lib/python2.7/site-packages/salt/utils/async.py", line 75, in wrap
ret = self._block_future(ret)
File "/usr/lib/python2.7/site-packages/salt/utils/async.py", line 85, in _block_future
return future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 876, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 280, in send
ret = yield self._crypted_transfer(load, tries=tries, timeout=timeout, raw=raw)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 870, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 876, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 248, in _crypted_transfer
ret = yield _do_transfer()
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 870, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
raise_exc_info(self._exc_info)
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 876, in run
yielded = self.gen.throw(*exc_info)
File "/usr/lib/python2.7/site-packages/salt/transport/zeromq.py", line 232, in _do_transfer
tries=tries,
File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 870, in run
value = future.result()
File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 214, in result
raise_exc_info(self._exc_info)
File "<string>", line 3, in raise_exc_info
SaltReqTimeoutError: Message timed out
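For what it's worth, the "about 4 minutes" can be read straight off the timestamps in the log above. A small sketch of the arithmetic, using only those five timestamps (the function names are just for illustration):

```python
from datetime import datetime

# Timestamps copied from the minion log excerpt above.
LOG_TIMES = [
    "2018-03-22 17:33:43,876",  # file.blockreplace starts
    "2018-03-22 17:34:43,919",  # SaltReqTimeoutError, retrying. (1/3)
    "2018-03-22 17:35:43,977",  # SaltReqTimeoutError, retrying. (2/3)
    "2018-03-22 17:36:44,036",  # SaltReqTimeoutError, retrying. (3/3)
    "2018-03-22 17:37:44,094",  # state fails with the traceback
]


def parse(ts):
    """Parse a Salt log timestamp (comma-separated milliseconds)."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


def stall_intervals(times):
    """Return the seconds elapsed between consecutive log entries."""
    parsed = [parse(t) for t in times]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]


intervals = stall_intervals(LOG_TIMES)
total = sum(intervals)
print(len(intervals), "intervals, total stall of about", round(total), "seconds")
```

So the stall looks like four consecutive waits of roughly 60 seconds each (the initial request plus three retries), which adds up to the 4 minutes I observed.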
--
Danny