Hello Anne,

What is the output of srun --mpi=list ?
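If it lists pmi2 or pmix, a minimal batch script could let srun launch the MPI ranks directly and leave interface selection to the MPI library itself. Roughly, assuming your MPI was built with the matching PMI support (the node/task counts below are just an example):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# assumes srun --mpi=list shows pmi2 (use --mpi=pmix if that is what it shows)
srun --mpi=pmi2 {application}

With a direct srun launch, which fabric the ranks use is decided by the MPI library rather than by Slurm, so it is also worth checking how your MPI was configured.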
We have a CentOS 8.5 cluster, slurm 20.11, Mellanox ConnectX 6 HDR IB, and a Mellanox 32 port switch.
Our application is not scaling. I discovered that the process communications are going over ethernet, not IB: I took the ifconfig packet counts for the eno2 (ethernet) and ib0 (InfiniBand) interfaces at the end of a job and subtracted the counts from the beginning. We are using sbatch and launching with srun {application}.
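In script form, that check amounts to something like the following (a rough sketch using the per-interface counters under /sys/class/net; {application} is a placeholder as above):

#!/bin/bash
# snapshot transmit-byte counters for both interfaces before the run
eno2_tx_before=$(cat /sys/class/net/eno2/statistics/tx_bytes)
ib0_tx_before=$(cat /sys/class/net/ib0/statistics/tx_bytes)

srun {application}

# snapshot again afterwards and report how much each interface carried
# (counters are per node; this only covers the node running the batch script)
eno2_tx_after=$(cat /sys/class/net/eno2/statistics/tx_bytes)
ib0_tx_after=$(cat /sys/class/net/ib0/statistics/tx_bytes)
echo "eno2 tx bytes during job: $((eno2_tx_after - eno2_tx_before))"
echo "ib0  tx bytes during job: $((ib0_tx_after - ib0_tx_before))"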
If I interactively log in to a node and use the command

mpiexec -iface ib0 -n 32 -machinefile machinefile {application}
where machinefile contains 32 lines with the IB hostnames:

ne08-ib
ne08-ib
...
ne09-ib
ne09-ib
the application runs over ib and scales.
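For reference, the same launch driven from a batch script, with the machinefile generated from the allocation, would look roughly like this (the -ib suffix and the 16-ranks-per-node layout are assumptions based on the hostnames above):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16

# build a machinefile with one line per rank, using the -ib hostnames
# so mpiexec binds the ranks to the InfiniBand side of each node
scontrol show hostnames "$SLURM_JOB_NODELIST" | while read -r node; do
    for _ in $(seq "$SLURM_NTASKS_PER_NODE"); do
        echo "${node}-ib"
    done
done > machinefile

mpiexec -iface ib0 -n "$SLURM_NTASKS" -machinefile machinefile {application}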
/etc/slurm/slurm.conf uses the ethernet interface for administrative communications and allocation:
NodeName=ne[01-09] CPUs=32 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN
PartitionName=neon-noSMT Nodes=ne[01-09] Default=NO MaxTime=3-00:00:00 DefaultTime=4:00:00 State=UP OverSubscribe=YES
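(If it ever needs to be made explicit, slurm.conf also accepts NodeAddr/NodeHostname to pin Slurm's own traffic to the ethernet side; the address below is only a placeholder, in our case the ne0x names resolve to the eno2 addresses through DNS.)

# placeholder address, for illustration only
NodeName=ne01 NodeAddr=10.10.0.1 NodeHostname=ne01 CPUs=32 Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 State=UNKNOWN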
I've read this is the recommended configuration.
I looked for an srun parameter that would direct the job's communications over the IB interface when it is run through the Slurm queue.
I found the --network parameter:

srun --network=DEVNAME=mlx5_ib,DEVTYPE=IB
but there is not much documentation on this and I haven't been able to run a job yet.
Is this the way we should be directing srun to run the executable over InfiniBand?
Thanks in advance,
Anne Hammond
--
Regards,
--Dani_L.