distributed swivel


hyper...@gmail.com

Jul 11, 2016, 1:17:53 AM
to Swivel Embeddings
Hi,

I'm curious if it's possible to alter https://github.com/tensorflow/models/blob/master/swivel/swivel.py to be distributed. I understand that this is meant to be used with a GPU but I'm wondering if it's practical to use a cluster of machines with many vCPUs.

I notice that these two locations pin ops to the CPU explicitly; presumably this is meant for a single machine with a GPU:

Would it be straightforward to keep the existing code and substitute a tf.Session connected to a distributed cluster, via tf.train.ClusterSpec, or is there a better way?

Thanks!

Best,
Isaac




Chris Waterson

Jul 11, 2016, 3:12:01 PM
to hyper...@gmail.com, Swivel Embeddings
On Sun, Jul 10, 2016 at 10:17 PM, <hyper...@gmail.com> wrote:
Hi

I'm curious if it's possible to alter https://github.com/tensorflow/models/blob/master/swivel/swivel.py to be distributed. I understand that this is meant to be used with a GPU but I'm wondering if it's practical to use a cluster of machines with many vCPUs.

Isaac, it's definitely possible -- and intended! -- for Swivel to be used in a distributed manner. Unfortunately, the way timing worked out, we created this version of the model before the distributed version of TF was publicly available.  I have not yet looked into getting this version of the model working in a distributed way, but it's definitely something that ought to be done.
 
I notice that these two locations pin ops to the CPU explicitly; presumably this is meant for a single machine with a GPU:

I don't think the device specification is necessary here (especially in the single-machine case). The important point is to make sure that these particular variables -- the embeddings -- are shared across workers.  For distributed training, this requires that they live on the parameter server.
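As a hedged illustration of that point (the shape and task index below are made up for the example, not taken from the Swivel code): pinning an embedding variable to the parameter-server job records the device on the variable at graph-construction time, so every worker reads and updates the same copy rather than holding a local replica:

```python
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    # Hypothetical shape: 10 rows of 4-dimensional embeddings.
    # Pinning the variable to the "ps" job means all workers share
    # this single copy instead of each keeping its own.
    with tf.device("/job:ps/task:0"):
        row_embedding = tf.Variable(tf.zeros([10, 4]),
                                    name="row_embedding")

# The device string is attached when the graph is built; actual
# placement only happens once a session connects to a real cluster.
print(row_embedding.device)
```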
 
Would it be straightforward to keep the existing code and substitute a tf.Session connected to a distributed cluster, via tf.train.ClusterSpec, or is there a better way?

I'm not yet completely up on TF's distributed API, but this seems like the right direction to pursue.  You'd want to make sure that the variables end up "with tf.device('/job:ps/task:%d' % ps_task_id)", and that the matmuls, etc. end up "with tf.device('/job:worker/task:%d' % worker_task_id)".
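A minimal sketch of that wiring with the distributed API (the host names, ports, shapes, and task indices below are invented for illustration, not taken from Swivel):

```python
import tensorflow as tf

# Hypothetical cluster: one parameter-server task, two worker tasks.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222"],
})

ps_task_id = 0      # which ps task holds the shared variables
worker_task_id = 0  # which worker this process is

graph = tf.Graph()
with graph.as_default():
    # Shared state -- the embeddings -- lives on the parameter server...
    with tf.device("/job:ps/task:%d" % ps_task_id):
        row_embedding = tf.Variable(tf.zeros([10, 4]), name="row_emb")
        col_embedding = tf.Variable(tf.zeros([10, 4]), name="col_emb")

    # ...while the matmuls and gradient ops run on this worker.
    with tf.device("/job:worker/task:%d" % worker_task_id):
        predictions = tf.matmul(row_embedding, col_embedding,
                                transpose_b=True)

# Each process would additionally start a
# tf.train.Server(cluster, job_name=..., task_index=...) and open its
# session against server.target; that step is omitted here because it
# needs live hosts to connect to.
```

Note that tf.train.replica_device_setter can round-robin variables across ps tasks automatically rather than pinning them by hand, which matters once there is more than one parameter server.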

Let me know if you get a chance to take a crack at it -- we would definitely accept a PR to make this work!  ;)

thanks!
chris


emile...@gmail.com

Dec 4, 2016, 12:28:58 PM
to Swivel Embeddings, hyper...@gmail.com
Would love to see this modified to be distributed! Perhaps I'll attempt it myself if I get to that point with TF.