I have a TF 2.3+ training script that uses MultiWorkerMirroredStrategy to distribute training across a grid of small hosts.
If I start, say, 50 workers, I need to set TF_CONFIG on each worker to list all the hosts. If some of those workers never start, the whole job appears to just sit there (forever?).
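For context, this is roughly what each worker does (host names, the port, and the `WORKER_INDEX` variable are placeholders, not my real setup):

```python
import json
import os

import tensorflow as tf

NUM_WORKERS = 50
WORKER_INDEX = int(os.environ.get("WORKER_INDEX", "0"))  # unique 0..49 per host

# Every worker gets the same "cluster" list but its own "task" index;
# the host names and port below stand in for the real grid.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": [f"host{i}:12345" for i in range(NUM_WORKERS)]},
    "task": {"type": "worker", "index": WORKER_INDEX},
})

# tf.distribute.experimental.MultiWorkerMirroredStrategy in TF 2.3;
# promoted out of experimental in later 2.x releases.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...)  # blocks here if any listed worker never comes up
```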
I want to be able to get the script to continue to run by either:
* Ignoring the failed hosts and continuing with the workers that did start
* Allowing me to add new workers while Keras .fit() is running
...preferably both.
Are there any recovery options? Restarting the run wouldn't help, since that's just trial and error. If nothing like this exists, does it warrant a GitHub issue?