Model Parallelism vs. Data Parallelism


Jack DH

Nov 2, 2017, 11:59:00 AM
to Discuss
Hello. 

I'm working on a parallel cluster with TensorFlow. Is there any reason why we need data parallelism via in-graph replication?

If a job runs entirely within a single graph ("in-graph"), it seems it could be handled with model parallelism without any pain.

I understand that model parallelism depends heavily on RDMA, so it can cause latency issues on NUMA architectures, but it would be great if someone could share an opinion.

Martin Wicke

Nov 2, 2017, 12:07:44 PM
to Jack DH, Discuss
Note that placement is not automatic or automatically optimal, so if you want to do model parallelism, you have to come up with a parallelization scheme that works for your specific model. That's hard (human) work.

On the other hand, data parallelism can be made to work automatically, and work pretty well, assuming the workers available are similar.

Hence we like data parallelism.

Martin
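
A minimal sketch of the kind of hand-written placement this refers to, in the TF 1.x API of the time (the two-GPU split and layer sizes are illustrative assumptions, not a recommended scheme):

import tensorflow as tf

# Hand-written model parallelism: each layer is pinned to a device by
# the user. The split points are a per-model design decision; nothing
# here is chosen automatically.
x = tf.placeholder(tf.float32, shape=[None, 784])

with tf.device("/gpu:0"):
    w1 = tf.get_variable("w1", [784, 512])
    h1 = tf.nn.relu(tf.matmul(x, w1))

with tf.device("/gpu:1"):
    # h1 crosses the device boundary here; TensorFlow inserts the
    # transfer, but its cost is part of the user's placement decision.
    w2 = tf.get_variable("w2", [512, 10])
    logits = tf.matmul(h1, w2)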


Jack DH

Nov 2, 2017, 12:47:58 PM
to Discuss
Martin, thanks for sharing your thoughts!

Here is one more question: what do you think about using multiple GPUs within one graph?

By default, TensorFlow uses multiple GPUs for training when they are detected; GPUDirect P2P is then used automatically, since the TensorFlow kernels go through the CUDA API, so two GPUs can effectively be used as one. Because of this, I'm struggling to understand why data parallelism is needed.

Would there be any use case where data parallelism is preferred over GPUDirect within a single graph (in-graph)?
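
For what it's worth, a quick way to check where ops actually land is TensorFlow's device-placement logging; the matmul workload below is just a stand-in:

import tensorflow as tf

# Log every op's placement. With no explicit tf.device(), TensorFlow
# places GPU-capable ops on a single GPU by default rather than
# spreading them across all visible GPUs.
a = tf.random_normal([1000, 1000])
b = tf.matmul(a, a)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(b)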



Martin Wicke

Nov 2, 2017, 12:57:32 PM
to Jack DH, Discuss
If you only have one (effective) GPU, you don't need it. But most systems don't look like that, and model parallelism requires thought. 
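
For contrast, a minimal sketch of in-graph data parallelism in the TF 1.x "tower" style, where the same model function is replicated per GPU on a shard of the batch and the per-tower gradients are averaged once (model_fn and the two-GPU split are assumptions for illustration):

import tensorflow as tf

def model_fn(batch):
    # Hypothetical stand-in model; any graph-building function works.
    w = tf.get_variable("w", [784, 10])
    return tf.reduce_mean(tf.square(tf.matmul(batch, w)))

x = tf.placeholder(tf.float32, shape=[None, 784])
opt = tf.train.GradientDescentOptimizer(0.1)
shards = tf.split(x, 2)  # one slice of the batch per GPU

tower_grads = []
for i, shard in enumerate(shards):
    # Same subgraph replicated on each GPU, sharing one set of variables.
    with tf.device("/gpu:%d" % i), tf.variable_scope("model", reuse=(i > 0)):
        tower_grads.append(opt.compute_gradients(model_fn(shard)))

# Average the gradients across towers and apply them as a single update.
avg = [(tf.add_n([g for g, _ in gv]) / len(gv), gv[0][1])
       for gv in zip(*tower_grads)]
train_op = opt.apply_gradients(avg)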


Jack DH

Nov 2, 2017, 1:16:48 PM
to Discuss
Thanks Martin! 

I will test how much performance benefit model parallelism can achieve on a system where GPUDirect is not supported or RDMA cannot be utilized.

