Proxy thread error in NCCL after building TensorFlow 2.0 GPU from source

sayan mandal

Jul 18, 2019, 1:43:47 AM
to Discuss
Hi,
I built TensorFlow 2.0 from source for CUDA 10 and cuDNN 7. When I use NCCL to run a multi-host job, I get this error:

p40-gpu-0004:25127:25578 [2] NCCL INFO rank 2 nranks 8
p40-gpu-0004:25127:25578 [3] NCCL INFO rank 3 nranks 8
NCCL version 2.3.5+cudaCUDA_MAJOR.CUDA_MINOR
p40-gpu-0004:25127:25578 [0] NCCL INFO rank 0 nranks 8
p40-gpu-0004:25127:25578 [1] NCCL INFO rank 1 nranks 8
p40-gpu-0004:25127:25592 [1] NCCL INFO comm 0x7f7c082f9b00 rank 1 nranks 8
p40-gpu-0004:25127:25590 [3] NCCL INFO comm 0x7f649b64be50 rank 3 nranks 8
p40-gpu-0004:25127:25589 [2] NCCL INFO comm 0x7f7bf807a0e0 rank 2 nranks 8
p40-gpu-0004:25127:25591 [0] NCCL INFO comm 0x7f7c0c2c9fd0 rank 0 nranks 8
p40-gpu-0004:25127:25592 [1] NCCL INFO CUDA Dev 1, IP Interfaces : eth0(PXB)
p40-gpu-0004:25127:25590 [3] NCCL INFO CUDA Dev 3, IP Interfaces : eth0(PXB)
p40-gpu-0004:25127:25589 [2] NCCL INFO CUDA Dev 2, IP Interfaces : eth0(PXB)
p40-gpu-0004:25127:25591 [0] NCCL INFO CUDA Dev 0, IP Interfaces : eth0(PXB)
p40-gpu-0004:25127:25591 [0] NCCL INFO Using 256 threads
p40-gpu-0004:25127:25591 [0] NCCL INFO Min Comp Cap 6
p40-gpu-0004:25127:25591 [0] NCCL INFO Ring 00 :    0   1   2   3   4   5   6   7
p40-gpu-0004:25127:25591 [0] NCCL INFO Ring 00 : 7 -> 0 via NET/Socket/0
p40-gpu-0004:25127:25591 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
p40-gpu-0004:25127:25592 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
p40-gpu-0004:25127:25589 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via direct shared memory
p40-gpu-0004:25127:25587 [0] NCCL INFO Launch mode Group/CGMD

p40-gpu-0004:25127:25595 [0] bazel-out/k8-opt/bin/external/nccl_archive/transport/net_socket.cu.cc:189 NCCL WARN Message truncated : received 174080 bytes instead of 8192
p40-gpu-0004:25127:25595 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:28 -> 3
p40-gpu-0004:25127:25595 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/transport/net.cu.cc:474 -> 3

p40-gpu-0004:25127:25595 [0] bazel-out/k8-opt/bin/external/nccl_archive/transport.cu.cc:153 NCCL WARN bazel-out/k8-opt/bin/external/nccl_archive/transport.cu.cc:153 -> 3 [Proxy thread error]

p40-gpu-0004:25127:25595 [0] bazel-out/k8-opt/bin/external/nccl_archive/transport/net_socket.cu.cc:189 NCCL WARN Message truncated : received 976055552 bytes instead of 8192
p40-gpu-0004:25127:25595 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/_virtual_includes/include_hdrs/net.h:28 -> 3
p40-gpu-0004:25127:25595 [0] NCCL INFO bazel-out/k8-opt/bin/external/nccl_archive/transport/net.cu.cc:474 -> 3

p40-gpu-0004:25127:25595 [0] bazel-out/k8-opt/bin/external/nccl_archive/transport.cu.cc:153 NCCL WARN bazel-out/k8-opt/bin/external/nccl_archive/transport.cu.cc:153 -> 3 [Proxy thread error]
  

Regards,
Sayan Mandal

Final Year Undergraduate Student,
Computer Science and engineering,
Indian Institute Of Technology Kharagpur,
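
The two warnings in the log above report messages of 174080 and 976055552 bytes arriving where 8192 bytes were expected. Below is a minimal sketch, assuming log lines in the same format as the excerpt above, of pulling those warnings out of a longer NCCL log so the sizes can be compared across hosts. The function name `find_truncations` and the returned field names are hypothetical and not part of NCCL or TensorFlow.

```python
import re

# Pattern for the socket-transport warning seen in the log excerpt.
# Assumes the "host:pid:tid [device] ... NCCL WARN Message truncated" layout.
TRUNCATED = re.compile(
    r"^(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] .*"
    r"NCCL WARN Message truncated : received (?P<received>\d+) bytes "
    r"instead of (?P<expected>\d+)"
)


def find_truncations(log_text):
    """Return one dict per 'Message truncated' warning in an NCCL log."""
    hits = []
    for line in log_text.splitlines():
        m = TRUNCATED.search(line)
        if m:
            hits.append({
                "host": m.group("host"),
                "device": int(m.group("dev")),
                "received": int(m.group("received")),
                "expected": int(m.group("expected")),
            })
    return hits
```

This only summarizes what the log already says; it does not diagnose why the received size differs from the expected one.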

陶旭

Feb 28, 2020, 3:24:46 AM
to Discuss
Have you solved this problem? I tried it on the official Docker image and got stuck at the same point.

Sanjoy Das

Feb 28, 2020, 1:21:24 PM
to 陶旭, Ayush Dubey, Discuss
+Ayush Dubey, have you seen this before?

--
You received this message because you are subscribed to the Google Groups "Discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to discuss+u...@tensorflow.org.
To view this discussion on the web visit https://groups.google.com/a/tensorflow.org/d/msgid/discuss/ad87c156-1fd6-4ebf-b6ed-132367959d8b%40tensorflow.org.

Ayush Dubey

Feb 28, 2020, 1:27:17 PM
to Sanjoy Das, 陶旭, Discuss
No, I haven't. It seems like https://github.com/NVIDIA/nccl/issues/193 is related.

陶旭

Mar 1, 2020, 1:54:18 AM
to Discuss
Thanks for your reply. I looked up issue #193. It seems that hanging is the expected behavior?
I also added your link to 