Extending tf.distribute.Strategy


Jan Betley

Apr 19, 2021, 11:41:47 AM
to TensorFlow Developers

We want to extend the TF code base to make it possible for TF developers to train over the Golem Network. The Golem Network is an open-source, fee-free, decentralized computation power market that seems like a very good fit whenever someone needs to train a big TF model.

What use case do we want to achieve?

(1) A TF developer has typical TF training data and a model architecture ready. He wants to use hundreds of multicore (and possibly multi-GPU) hardware boxes that are offered for rent on the Golem Network.
(2) The TF code base has been extended with Golem support, probably as an extension to an existing distribution strategy.
(3) The TF developer connects his crypto wallet to the Golem distribution strategy and specifies how many boxes he wants to get from the decentralized market (plus, additionally, how many CPUs/GPUs per box, etc.).
(4) Each of the requested Golem Provider Nodes loads and instantiates the specified Docker image. The image contains a TF instance configured to act as a worker.
(5) The Golem distribution strategy gets the boxes from the market, splits the training set across all the rented machines, sends the appropriate part to each box, and starts training.
(6) Technically speaking, the Golem Nodes form a VPN, so synchronous training with node<>node communication is possible if we want it.
(7) The model training experience is the same as with other distribution strategies.
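To make step (3) concrete, here is a purely hypothetical sketch of the developer-facing configuration we have in mind. Nothing here exists in TensorFlow or Golem today; the class name, fields, and prices are all invented for illustration:

```python
from dataclasses import dataclass

# Hypothetical sketch of a developer-facing configuration object.
# Every name and parameter here is invented; no such API exists yet.
@dataclass
class GolemStrategyConfig:
    wallet_address: str      # crypto wallet paying the providers
    num_boxes: int           # boxes rented from the decentralized market
    gpus_per_box: int = 0    # optional per-box hardware requirement
    docker_image: str = "golem/tf-worker:latest"  # worker image (made up)

    def budget_estimate(self, price_per_box_hour: float, hours: float) -> float:
        """Rough upper bound on the rental cost for a training run."""
        return self.num_boxes * price_per_box_hour * hours

# Example: 8 single-GPU boxes for a 10-hour run at 0.5 GLM per box-hour.
cfg = GolemStrategyConfig(wallet_address="0x...", num_boxes=8, gpus_per_box=1)
print(cfg.budget_estimate(price_per_box_hour=0.5, hours=10.0))  # 40.0
```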

Now, I would like to know your suggestions for doing (2). Where should we start? Which distribution strategy could we extend? What should we be aware of? Please mind that:
- We would like to start with a POC that will not necessarily outperform single-box training. If there is any gain from using the Golem Network compared to local model training, our first goal is accomplished.
- We do not know anything about TF internals, but we are not afraid : )
- We would like a quick win (a fast initial result) so we can test the market with the solution. Then we will be able to invest more, possibly rewriting everything from scratch.
- The node<>node communication can be slow, as the boxes can be on separate continents, etc.
- We are not sure whether we should go with an asynchronous or a synchronous strategy. While synchronous training is fully possible over the VPN, the current TF synchronous distribution strategies might assume a low-ping, high-throughput communication channel.

Going into the details, it seems that the best way would be to use tf.distribute.Strategy.
We considered a modification of either the MultiWorkerMirroredStrategy or the ParameterServerStrategy.
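Both candidates are driven by the same cluster-resolution mechanism: each worker process reads a TF_CONFIG environment variable describing the whole cluster and its own role in it. A minimal sketch of what a Golem layer would have to synthesize for each box (the host addresses are made up; the VPN from point 6 would supply the real ones):

```python
import json
import os

# Addresses the Golem VPN would assign to the rented boxes (made up here).
workers = ["10.30.0.1:12345", "10.30.0.2:12345", "10.30.0.3:12345"]

def make_tf_config(worker_index: int) -> str:
    """Build the TF_CONFIG JSON that worker `worker_index` must see."""
    return json.dumps({
        "cluster": {"worker": workers},
        "task": {"type": "worker", "index": worker_index},
    })

# On box 0, the Golem layer would set this before starting the TF process:
os.environ["TF_CONFIG"] = make_tf_config(0)

# Each worker then runs the standard multi-worker setup, e.g.:
#   import tensorflow as tf
#   strategy = tf.distribute.MultiWorkerMirroredStrategy()
#   with strategy.scope():
#       model = build_and_compile_model()
print(json.loads(os.environ["TF_CONFIG"])["task"]["index"])  # 0
```

The point of the sketch: the cluster-bootstrapping side is mostly plumbing, and the hard part is whatever happens inside the strategy's collective communication once training starts.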

So, we have a lot of questions : ) Starting with:
1. Is this a sensible and workable idea? Maybe there are some serious obstacles we don't see?
2. How hard is this task (from the TF point of view)? In the optimistic scenario, we'd only have to write a few functions and put them where the original communication happens. In the pessimistic scenario, we'd have to dig really deep into tf.distribute.Strategy.
3. Where to start?

Best regards,
Jan Betley


Jonathan
Apr 19, 2021, 1:22:45 PM
to Jan Betley, TensorFlow Developers
Hi Jan,

I understand why you believe this to be exciting, but I'm pretty sure this is not going to work, for a few reasons.

- Transferring 100s of GBs of data across the planet is going to be super costly in time and networking.
- Transferring your (customers'?) data across the planet poses a HUGE privacy concern.
- Communication is absolutely critical. Every worker needs to send and receive multiple GBs to update its weights (ring all-reduce or otherwise). You clearly can't do this over the internet, especially since you mentioned it can be slow. Some research has been conducted on low-bandwidth approaches, but that is research, not a proven approach right now.
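To put a rough number on the communication point: in a ring all-reduce, each worker sends and receives about 2(p-1)/p times the gradient size every step. A back-of-the-envelope calculation (the model size and link bandwidth are assumptions picked for illustration):

```python
# Per-step, per-worker traffic in a ring all-reduce is roughly
#   2 * (p - 1) / p * model_bytes.
params = 100_000_000          # assume a 100M-parameter model
model_bytes = params * 4      # fp32 gradients
p = 8                         # number of workers
traffic = 2 * (p - 1) / p * model_bytes   # bytes per worker per step

bandwidth = 100e6 / 8         # assume a 100 Mbit/s internet link, in bytes/s
seconds_per_step = traffic / bandwidth

print(round(traffic / 1e6))     # 700  (MB per worker per step)
print(round(seconds_per_step))  # 56   (seconds of pure transfer per step)
```

Under those assumptions, every synchronous step pays nearly a minute of pure network transfer before any compute happens, which is the core of the objection above.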

Sorry, but I don't believe it is worth investing your time in this.



radoslaw tereszczuk

Apr 20, 2021, 11:18:46 AM
to TensorFlow Developers, jonathan...@gmail.com, jan.b...@golem.network
I'm Jan's colleague, working at Golem. Thank you, Jonathan. I would say that your answer is spot on and in general you are right, but… there are some border cases where our idea makes a bit more sense than the general case that has already been discussed. We hope to find a use case that fits Golem here.

One of the base scenarios we are exploring is using heterogeneous boxes after hours at a university or office. Often there is a lot of powerful hardware sitting unused. When forming a TF training cluster on Golem, we can use a geolocation tag to build the cluster from boxes in a given location. Then the ping times and bandwidth between the boxes will make it much more feasible.

Additionally, the training set can be generated locally on the worker nodes by a simulator engine. This is the approach I used to train a CNN for genetic sequencing: there was simply not enough training data we could use. Our friends from SkyEngine also use training-set generation as a base strategy. So there are cases that do not need an extensive training-set transfer.
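A minimal sketch of that shard-local generation idea, assuming each worker receives an index and derives a deterministic, non-overlapping stream of synthetic samples from it. The generator body here is a stdlib placeholder; a real simulator engine would go where `random` is used:

```python
import random

def synthetic_samples(worker_index: int, n: int):
    """Yield n synthetic (features, label) pairs, seeded per worker so that
    no two workers produce the same stream and no data crosses the network."""
    rng = random.Random(worker_index)  # deterministic per-worker seed
    for _ in range(n):
        features = [rng.gauss(0.0, 1.0) for _ in range(4)]
        label = int(sum(features) > 0)
        yield features, label

# Workers 0 and 1 generate disjoint, reproducible shards entirely locally.
shard0 = list(synthetic_samples(0, 3))
shard1 = list(synthetic_samples(1, 3))
print(shard0 != shard1)  # True
# In TF, each box would wrap this in tf.data.Dataset.from_generator.
```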

Also, there are countries with very good networks, and 5G is coming. At Golem we are thinking ahead on a 5-10 year time frame, so I think that we should not stop…

We would like to continue with some form of POC here. We can assume that all the boxes are on the same Ethernet network. We know that for serious HPC plain Ethernet is not enough and one needs something like an InfiniBand interconnect, but we would like to try with a small POC; if the results are even 10% better than on a single box, that is OK.

Having said the above, can you give us any advice on which strategy we should base our work on, versus implementing a strategy from scratch?

Radek Tereszczuk