Weights exploding when running Swivel with 8 GPUs


이정규

Mar 16, 2017, 2:03:34 AM
to Swivel Embeddings
Hello
I'm running the Swivel algorithm on my data. On a single GPU, training works without any problem, but when Swivel is run with 8 GPUs, the loss and weights grow larger and larger during training, and learning cannot proceed properly.

At first I posted an issue on the `tensorflow/models` repository, but I was directed to come here. That is the issue in which I described the symptoms in detail.



Chris Waterson

Mar 20, 2017, 7:41:41 PM
to 이정규, Swivel Embeddings
Hello, and sorry for the delayed response!

If I understand correctly, you're running Swivel in the models repo on a single machine with 8 GPUs? If so, that's impressive! There was recently a change to support multiple GPUs, and it's possible that the hogwild parameter updates are causing problems.

I've been running a version of Swivel that makes use of the distributed facilities of TensorFlow (e.g., tf.Supervisor and Supervisor.managed_session). This allows the gradient updates to be coordinated through the parameter server, and may ameliorate the problem.
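For reference, that pattern looks roughly like the following in the TF 1.x API of the time. This is a minimal sketch, not Swivel's actual training loop; the logdir and the loss are placeholders:

```python
import tensorflow as tf  # TF 1.x API

# Sketch only: a toy loss standing in for Swivel's training objective.
w = tf.Variable(tf.random_normal([10]))
loss = tf.reduce_sum(tf.square(w))
train_op = tf.train.AdagradOptimizer(0.1).minimize(loss)

# Supervisor handles variable initialization, checkpointing, and
# coordination; managed_session wires the session up to it.
sv = tf.train.Supervisor(logdir="/tmp/swivel_sup")  # placeholder logdir
with sv.managed_session() as sess:
    for _ in range(100):
        if sv.should_stop():
            break
        sess.run(train_op)
```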

I'll see about getting that version pushed into the repository.

chris

--
You received this message because you are subscribed to the Google Groups "Swivel Embeddings" group.
To unsubscribe from this group and stop receiving emails from it, send an email to swivel-embeddings+unsubscribe@googlegroups.com.
To post to this group, send email to swivel-embeddings@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/swivel-embeddings/a2397971-6446-48f5-84f4-a404fc2af4ee%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

이정규

Mar 20, 2017, 10:54:13 PM
to Swivel Embeddings

Thank you very much for your reply.

As you said, I'm working on a single machine with 8 GPUs.

I also suspected that the large number of parameter updates at a high learning rate at the beginning of training was causing the problem.

If I reduce the learning_rate, the phenomenon disappears, so that is what I'm doing for now (though of course convergence is slower).

Does "gradient updates being coordinated" mean an alternating least squares update?

I am very much looking forward to the repository push you mentioned. 

Thanks


Chris Waterson

Mar 20, 2017, 11:02:18 PM
to 이정규, Swivel Embeddings
On Mon, Mar 20, 2017 at 7:54 PM, 이정규 <swea...@gmail.com> wrote:

> Thank you very much for your reply.
>
> As you said, I'm working on a single machine with 8 GPUs.
>
> I also suspected that the large number of parameter updates at a high learning rate at the beginning of training was causing the problem.
>
> If I reduce the learning_rate, the phenomenon disappears, so that is what I'm doing for now (though of course convergence is slower).
>
> Does "gradient updates being coordinated" mean an alternating least squares update?

My understanding of how TF works without a parameter server is that the parameters are stored in shared memory, and gradient updates (including the Adagrad accumulator updates) are applied without any sort of explicit locking.

With a parameter server, a separate process actually maintains the parameters, and so while multiple worker threads work on separate submatrix shards to do matmuls etc., all updates are serialized through the parameter server. Parameter updates may clobber each other, but at least in an orderly way. Also, the Adagrad clamping is computed in the parameter server process, and so is serialized.
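The clobbering described here is the classic lost-update race, and it can be seen in a plain-Python sketch (no TensorFlow involved, and not Swivel code): several threads do read-modify-write increments on a shared value, first serialized through a single lock standing in for the parameter server, then unsynchronized:

```python
import threading

def run_workers(n_workers, n_updates, lock=None):
    """Each worker performs `param += 1` on a shared value n_updates times."""
    state = {"param": 0}

    def worker():
        for _ in range(n_updates):
            if lock is not None:
                with lock:  # serialized read-modify-write
                    state["param"] += 1
            else:
                # Unsynchronized: a thread switch between the read and the
                # write silently drops other workers' intervening updates.
                current = state["param"]
                state["param"] = current + 1

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["param"]

serialized = run_workers(8, 100_000, lock=threading.Lock())
hogwild = run_workers(8, 100_000)
print("serialized:", serialized)     # exactly 800000
print("unsynchronized:", hogwild)    # may fall short when updates clobber
```

With the lock, every one of the 800,000 increments lands; without it, updates may overwrite each other, which is the "orderly clobbering" versus hogwild distinction above.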

> I am very much looking forward to the repository push you mentioned.

Cool, I will follow up to this thread when that happens.

chris
 

Colin Evans

Mar 21, 2017, 12:04:44 PM
to Chris Waterson, 이정규, Swivel Embeddings
FYI, you can set `use_locking` on the optimizer to get synchronized access to the shared variables.
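Assuming the TF 1.x AdagradOptimizer that Swivel constructs, the flag would be passed like this (a sketch with a toy loss, not the repo's actual code):

```python
import tensorflow as tf  # TF 1.x API

# Sketch only: a toy loss standing in for Swivel's training objective.
w = tf.Variable(tf.zeros([10]))
loss = tf.reduce_sum(tf.square(w))

# use_locking=True (it defaults to False) makes the optimizer apply its
# variable and accumulator updates under a lock, so concurrent workers
# cannot interleave the read-modify-write.
opt = tf.train.AdagradOptimizer(learning_rate=1.0, use_locking=True)
train_op = opt.minimize(loss)
```

Note this serializes only each individual apply operation; it does not coordinate workers the way a parameter server does.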



