MultiWorkerMirroredStrategy won't start if some workers fail to start. How to recover?


Robert Lugg

Oct 15, 2020, 1:42:05 PM
to TensorFlow End Users - GETTING STARTED, TUTORIALS & HOW-TO'S
I have a TF 2.3+ training script that uses MultiWorkerMirroredStrategy to distribute training across a grid of small hosts.

If I start, say, 50 workers, I need to set TF_CONFIG on each worker so that it lists all of the hosts. If some workers don't start, the system appears to sit and wait (forever?).
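
For reference, a minimal sketch of that per-worker setup. The host names, the port, and the helper name make_tf_config are made up for illustration; the real host list would come from the grid scheduler. Every worker gets the same "worker" list and its own unique index.

```python
import json
import os


def make_tf_config(hosts, task_index, port=12345):
    """Return the TF_CONFIG JSON string for one worker in a static cluster.

    hosts, port, and this helper are illustrative assumptions, not part of
    TensorFlow itself; only the resulting JSON layout is what TF reads.
    """
    if not 0 <= task_index < len(hosts):
        raise ValueError("task_index must refer to an entry in hosts")
    cluster = {"worker": [f"{host}:{port}" for host in hosts]}
    task = {"type": "worker", "index": task_index}
    return json.dumps({"cluster": cluster, "task": task})


# Hypothetical host names standing in for the 50 grid hosts.
hosts = [f"node{i:02d}.example.com" for i in range(50)]

# On the worker with index 3, set this before the strategy is constructed.
os.environ["TF_CONFIG"] = make_tf_config(hosts, task_index=3)
```

Because the full worker list is baked into TF_CONFIG before training starts, a host that never comes up is still part of the cluster definition that the other workers hold.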

I want the script to keep running by either:
* Ignoring the failed hosts and continuing with the rest
* Allowing me to add new workers while Keras .fit() is running
...preferably both.

Are there any recovery options? Restarting the run wouldn't work, as that's just trial and error. If there isn't anything, does this warrant a GitHub issue?

Lance Norskog

Oct 16, 2020, 2:45:35 PM
to Robert Lugg, TensorFlow End Users - GETTING STARTED, TUTORIALS & HOW-TO'S
There are several projects for running distributed machine learning scripts reliably, with retries, etc. You probably want to use one of them. This is one toolkit that I've read about:


I have not used any of these projects.

Cheers,

Lance



--
Lance Norskog
lance....@gmail.com
Redwood City, CA