Dear Community,
We are trying to enable GPU sharding.
Our compute nodes have 64 cores, 4 physical MI250X GPUs (8 logical GPUs), and 4 NUMA domains: 1 physical GPU (2 logical GPUs) per NUMA domain, and 1 logical GPU per L3 cache domain.
gres.conf
AutoDetect=rsmi
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD128 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD129 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD130 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD131 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD132 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD133 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD134 Count=4
NodeName=c6n[3339-3348] Name=shard File=/dev/dri/renderD135 Count=4
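For completeness, the matching slurm.conf fragment would look roughly like this (a sketch, not our verified config; the shard count of 32 is simply 8 logical GPUs x 4 shards each, and the other node attributes are assumed):

```
# slurm.conf sketch (assumed values)
GresTypes=gpu,shard
NodeName=c6n[3339-3348] Gres=gpu:8,shard:32 CPUs=64 ...
```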
If I request 2 cores with block:cyclic distribution, I get the expected result:
srun -N1 -n2 -c1 --cpu-bind=cores -m block:cyclic --pty bash
cpuset cgroup is 1,17
But if I add 2 shards to the request, I get this unexpected result:
srun -N1 -n2 -c1 --cpu-bind=cores --gres=shard:2 -m block:cyclic --pty bash
cpuset cgroup is 1-2
ROCR_VISIBLE_DEVICES=0
Is it possible to request 2 shards in round-robin fashion, so that a multi-GPU job runs on different GPUs?
srun -N1 -n2 -c1 --cpu-bind=cores --gres=shard:2 -m block:cyclic --pty bash
In practice, I would like to get this result:
cpuset cgroup is 1,17
ROCR_VISIBLE_DEVICES=0,1
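One workaround I am considering, in case the scheduler cannot do this itself, is a small per-task wrapper that overrides the visible device from SLURM_LOCALID (a sketch only; `gpu_wrap` is a made-up name, and it assumes srun exports SLURM_LOCALID per task and that tasks should spread over the 8 logical GPUs):

```shell
# Hypothetical per-task wrapper: map each task's SLURM_LOCALID to one of the
# 8 logical GPUs, then run the application with only that GPU visible.
gpu_wrap() {
  export ROCR_VISIBLE_DEVICES=$(( ${SLURM_LOCALID:-0} % 8 ))
  "$@"
}
```

Each task would then be launched as `gpu_wrap ./my_app`, so task 0 sees device 0, task 1 sees device 1, and so on.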
Thank you in advance,
Alessandro
--
slurm-users mailing list --
slurm...@lists.schedmd.com