No communication between compute Nodes

aaditya chapagain

Dec 30, 2020, 12:42:29 AM
to google-cloud-slurm-discuss
Hi,
I created a Slurm cluster using https://github.com/SchedMD/slurm-gcp , with partitions that have Volta GPUs.

When I train on 1 node it works fine, but when I use 2 or more nodes the communication between the nodes appears to hang and training gets stuck at the very beginning, with no error logs.
Here is some NCCL debug output:

slurm-train-compute-0-0:2139:2139 [0] NCCL INFO Bootstrap : Using [0]eth0:10.0.0.34<0>
slurm-train-compute-0-0:2139:2139 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
slurm-train-compute-0-0:2139:2139 [0] NCCL INFO NET/IB : No device found.
slurm-train-compute-0-0:2139:2139 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.34<0>
slurm-train-compute-0-0:2139:2139 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
slurm-train-compute-0-1:2119:2119 [1] NCCL INFO Bootstrap : Using [0]eth0:10.0.0.33<0>
slurm-train-compute-0-1:2119:2119 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
slurm-train-compute-0-1:2119:2119 [1] NCCL INFO NET/IB : No device found.
slurm-train-compute-0-1:2119:2119 [1] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.33<0>
slurm-train-compute-0-1:2119:2119 [1] NCCL INFO Using network Socket
slurm-train-compute-0-0:2139:2250 [0] NCCL INFO Channel 00/02 :    0   1
slurm-train-compute-0-0:2139:2250 [0] NCCL INFO Channel 01/02 :    0   1
slurm-train-compute-0-0:2139:2250 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
slurm-train-compute-0-0:2139:2250 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] -1/-1/-1->0->1|1->0->-1/-1/-1
slurm-train-compute-0-1:2119:2212 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
slurm-train-compute-0-1:2119:2212 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] 0/-1/-1->1->-1|-1->1->0/-1/-1
slurm-train-compute-0-0:2139:2250 [0] NCCL INFO Channel 00 : 1[50] -> 0[40] [receive] via NET/Socket/0
slurm-train-compute-0-1:2119:2212 [1] NCCL INFO Channel 00 : 0[40] -> 1[50] [receive] via NET/Socket/0
slurm-train-compute-0-0:2139:2250 [0] NCCL INFO Channel 00 : 0[40] -> 1[50] [send] via NET/Socket/0
slurm-train-compute-0-1:2119:2212 [1] NCCL INFO Channel 00 : 1[50] -> 0[40] [send] via NET/Socket/0
slurm-train-compute-0-0:2139:2250 [0] NCCL INFO Channel 01 : 1[50] -> 0[40] [receive] via NET/Socket/0
slurm-train-compute-0-1:2119:2212 [1] NCCL INFO Channel 01 : 0[40] -> 1[50] [receive] via NET/Socket/0
slurm-train-compute-0-1:2119:2212 [1] NCCL INFO Channel 01 : 1[50] -> 0[40] [send] via NET/Socket/0
slurm-train-compute-0-0:2139:2250 [0] NCCL INFO Channel 01 : 0[40] -> 1[50] [send] via NET/Socket/0
slurm-train-compute-0-0:2139:2250 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
slurm-train-compute-0-1:2119:2212 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
slurm-train-compute-0-1:2119:2212 [1] NCCL INFO comm 0x7f2e38000e00 rank 1 nranks 2 cudaDev 1 busId 50 - Init COMPLETE
slurm-train-compute-0-0:2139:2250 [0] NCCL INFO comm 0x7f8130000e00 rank 0 nranks 2 cudaDev 0 busId 40 - Init COMPLETE
slurm-train-compute-0-0:2139:2139 [0] NCCL INFO Launch mode Parallel
<some random model info>
slurm-train-compute-0-0:2139:2386 [1] NCCL INFO Channel 00/02 :    0   1
slurm-train-compute-0-0:2139:2386 [1] NCCL INFO Channel 01/02 :    0   1
slurm-train-compute-0-0:2139:2386 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
slurm-train-compute-0-0:2139:2386 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] -1/-1/-1->0->1|1->0->-1/-1/-1
slurm-train-compute-0-1:2119:2291 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
slurm-train-compute-0-1:2119:2291 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] 0/-1/-1->1->-1|-1->1->0/-1/-1
slurm-train-compute-0-1:2119:2291 [0] NCCL INFO Channel 00 : 0[50] -> 1[40] [receive] via NET/Socket/0
slurm-train-compute-0-0:2139:2386 [1] NCCL INFO Channel 00 : 1[40] -> 0[50] [receive] via NET/Socket/0
slurm-train-compute-0-1:2119:2291 [0] NCCL INFO Channel 00 : 1[40] -> 0[50] [send] via NET/Socket/0
slurm-train-compute-0-0:2139:2386 [1] NCCL INFO Channel 00 : 0[50] -> 1[40] [send] via NET/Socket/0
slurm-train-compute-0-1:2119:2291 [0] NCCL INFO Channel 01 : 0[50] -> 1[40] [receive] via NET/Socket/0
slurm-train-compute-0-0:2139:2386 [1] NCCL INFO Channel 01 : 1[40] -> 0[50] [receive] via NET/Socket/0
slurm-train-compute-0-1:2119:2291 [0] NCCL INFO Channel 01 : 1[40] -> 0[50] [send] via NET/Socket/0
slurm-train-compute-0-0:2139:2386 [1] NCCL INFO Channel 01 : 0[50] -> 1[40] [send] via NET/Socket/0
slurm-train-compute-0-0:2139:2386 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
slurm-train-compute-0-1:2119:2291 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
slurm-train-compute-0-0:2139:2386 [1] NCCL INFO comm 0x7f8060000e00 rank 0 nranks 2 cudaDev 1 busId 50 - Init COMPLETE
slurm-train-compute-0-1:2119:2291 [0] NCCL INFO comm 0x7f2de8000e00 rank 1 nranks 2 cudaDev 0 busId 40 - Init COMPLETE
slurm-train-compute-0-0:2139:2139 [1] NCCL INFO Launch mode Parallel

slurm-train-compute-0-1:2119:2292 [0] transport/net_socket.cc:414 NCCL WARN NET/Socket : message truncated : receiving 1048576 bytes instead of 524288
slurm-train-compute-0-1:2119:2292 [0] NCCL INFO include/net.h:28 -> 3
slurm-train-compute-0-1:2119:2292 [0] NCCL INFO transport/net.cc:357 -> 3
slurm-train-compute-0-1:2119:2292 [0] NCCL INFO proxy.cc:198 -> 3 [Proxy Thread]
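
For context, the output above is what NCCL's built-in debug logging produces; a minimal sketch of how it is typically enabled, assuming a bash job script, is:

    # Verbose NCCL logging (produces the INFO/WARN lines shown above)
    export NCCL_DEBUG=INFO
    # Optional: pin NCCL's socket transport to the VM's primary NIC (eth0 here)
    # in case interface auto-detection picks the wrong device
    export NCCL_SOCKET_IFNAME=eth0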

Logs in node-0:
[2020-12-30T05:11:27.540] error: Domain socket directory /var/spool/slurmd: No such file or directory
[2020-12-30T05:11:27.552] error: xcpuinfo_hwloc_topo_load: failed (load will be required after read failures).
[2020-12-30T05:11:27.552] error: Node configuration differs from hardware: Procs=16:16(hw) Boards=1:1(hw) SocketsPerBoard=16:1(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=1:2(hw)
[2020-12-30T05:11:27.552] Message aggregation disabled
[2020-12-30T05:11:27.605] CPU frequency setting not configured for this node
[2020-12-30T05:11:27.627] slurmd version 19.05.8 started
[2020-12-30T05:11:27.642] slurmd started on Wed, 30 Dec 2020 05:11:27 +0000
[2020-12-30T05:11:27.642] CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=60231 TmpDisk=409388 Uptime=22 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2020-12-30T05:11:51.871] task_p_slurmd_batch_request: 25
[2020-12-30T05:11:51.871] task/affinity: job 25 CPU input mask for node: 0x03FF
[2020-12-30T05:11:51.871] task/affinity: job 25 CPU final HW mask for node: 0x1F1F
[2020-12-30T05:11:51.872] _run_prolog: run job script took usec=216
[2020-12-30T05:11:51.872] _run_prolog: prolog with lock for job 25 ran for 0 seconds
[2020-12-30T05:11:51.873] Launching batch job 25 for UID 3521375391
[2020-12-30T05:11:51.937] [25.batch] task/cgroup: /slurm/uid_3521375391/job_25: alloc=57344MB mem.limit=57344MB memsw.limit=57344MB
[2020-12-30T05:11:51.943] [25.batch] task/cgroup: /slurm/uid_3521375391/job_25/step_batch: alloc=57344MB mem.limit=57344MB memsw.limit=57344MB
[2020-12-30T05:11:51.960] [25.batch] task_p_pre_launch: Using sched_affinity for tasks
[2020-12-30T05:11:52.045] launch task 25.0 request from UID:3521375391 GID:3521375391 HOST:10.0.0.34 PORT:53419
[2020-12-30T05:11:52.045] lllp_distribution jobid [25] implicit auto binding: sockets,one_thread, dist 1
[2020-12-30T05:11:52.045] _task_layout_lllp_cyclic 
[2020-12-30T05:11:52.045] _lllp_generate_cpu_bind jobid [25]: mask_cpu,one_thread, 0x1F1F
[2020-12-30T05:11:52.057] [25.0] task/cgroup: /slurm/uid_3521375391/job_25: alloc=57344MB mem.limit=57344MB memsw.limit=57344MB
[2020-12-30T05:11:52.062] [25.0] task/cgroup: /slurm/uid_3521375391/job_25/step_0: alloc=57344MB mem.limit=57344MB memsw.limit=57344MB
[2020-12-30T05:11:52.073] [25.0] task_p_pre_launch: Using sched_affinity for tasks

Logs in node-1:
[2020-12-30T05:11:27.384] error: Domain socket directory /var/spool/slurmd: No such file or directory
[2020-12-30T05:11:27.400] error: xcpuinfo_hwloc_topo_load: failed (load will be required after read failures).
[2020-12-30T05:11:27.400] error: Node configuration differs from hardware: Procs=16:16(hw) Boards=1:1(hw) SocketsPerBoard=16:1(hw) CoresPerSocket=1:8(hw) ThreadsPerCore=1:2(hw)
[2020-12-30T05:11:27.400] Message aggregation disabled
[2020-12-30T05:11:27.437] CPU frequency setting not configured for this node
[2020-12-30T05:11:27.456] slurmd version 19.05.8 started
[2020-12-30T05:11:27.468] slurmd started on Wed, 30 Dec 2020 05:11:27 +0000
[2020-12-30T05:11:27.468] CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=60231 TmpDisk=409388 Uptime=20 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)
[2020-12-30T05:11:52.048] launch task 25.0 request from UID:3521375391 GID:3521375391 HOST:10.0.0.34 PORT:50381
[2020-12-30T05:11:52.048] lllp_distribution jobid [25] implicit auto binding: sockets,one_thread, dist 1
[2020-12-30T05:11:52.048] _task_layout_lllp_cyclic 
[2020-12-30T05:11:52.048] _lllp_generate_cpu_bind jobid [25]: mask_cpu,one_thread, 0x1F1F
[2020-12-30T05:11:52.051] _run_prolog: run job script took usec=2614
[2020-12-30T05:11:52.051] _run_prolog: prolog with lock for job 25 ran for 0 seconds
[2020-12-30T05:11:52.097] [25.0] task/cgroup: /slurm/uid_3521375391/job_25: alloc=57344MB mem.limit=57344MB memsw.limit=57344MB
[2020-12-30T05:11:52.103] [25.0] task/cgroup: /slurm/uid_3521375391/job_25/step_0: alloc=57344MB mem.limit=57344MB memsw.limit=57344MB
[2020-12-30T05:11:52.115] [25.0] task_p_pre_launch: Using sched_affinity for tasks
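
As an aside, the "Node configuration differs from hardware" errors above can be cross-checked by comparing the node definition in slurm.conf with what slurmd itself detects on the VM; a quick check, assuming shell access on a compute node (the slurm.conf path may differ on slurm-gcp), is:

    # Print the hardware layout slurmd detects, in slurm.conf format
    slurmd -C
    # Compare against the configured node definition
    grep -i "NodeName" /etc/slurm/slurm.conf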

Any solutions to this would be really helpful. Thank you.

Joseph Schoonover

Jan 3, 2021, 2:10:52 PM
to aaditya chapagain, google-cloud-slurm-discuss
Hey Aaditya,
How are you launching the jobs? Can you share your sbatch or srun commands with the flags you are using?
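
For example, a typical two-node GPU submission looks roughly like this (partition name and script name are hypothetical):

    #!/bin/bash
    #SBATCH --partition=v100            # hypothetical GPU partition name
    #SBATCH --nodes=2                   # span two compute nodes
    #SBATCH --ntasks-per-node=1         # one launcher task per node
    #SBATCH --gres=gpu:1                # one GPU per node
    srun python train.py                # hypothetical training script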
