Hello,
I have 4,000 single-CPU commands to execute. I created a partition of c2-standard-60 machines (60 CPUs each) and want to run one command on each CPU, so everything should fit on 67 machines (4000 jobs / 60 CPUs = 66.7, rounded up to 67 machines of 60 CPUs).
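As a quick check of the machine count above, here is a minimal shell sketch of the ceiling division. The function name nodes_needed is only illustrative and is not part of Slurm.

```shell
#!/bin/bash
# Hypothetical helper: number of machines needed so that each job gets its own CPU.
# Integer ceiling division: (jobs + cpus_per_node - 1) / cpus_per_node.
nodes_needed() {
    local jobs=$1 cpus_per_node=$2
    echo $(( (jobs + cpus_per_node - 1) / cpus_per_node ))
}

nodes_needed 4000 60   # prints 67
```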
My partition uses preemptible machines. I use the latest HPC image ("projects/schedmd-slurm-public/global/images/family/schedmd-slurm-20-11-7-hpc-centos-7") and install Slurm with Terraform. I set image_hyperthreads to true in the tfvars file.
My controller is a c2-standard-30 with the same image. My login node is a c2-standard-4. I launched my sbatch command from the login node.
I could not make it work with an array, so I created a long batch file that looks like this:
#!/bin/bash
#SBATCH --job-name=pc60m2
#SBATCH --partition=pc60
#SBATCH --ntasks-per-node=60
#SBATCH --ntasks=4000
#SBATCH --output=pc60m2_%j.txt
srun -n1 -N1 --exclusive sh mut2_0.sh &
srun -n1 -N1 --exclusive sh mut2_1.sh &
srun -n1 -N1 --exclusive sh mut2_2.sh &
[4000 lines like srun....]
wait
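A file like this does not have to be typed by hand; it can be generated with a loop. The following is only a sketch: the function name make_batch is illustrative, and it assumes the per-job scripts are named mut2_0.sh, mut2_1.sh, and so on, as the first three srun lines suggest. It reproduces the batch file as shown and does not change how Slurm schedules it.

```shell
#!/bin/bash
# Hypothetical generator for the batch file shown above.
# Assumption: the per-job scripts are named mut2_<i>.sh, with i starting at 0.
make_batch() {
    local njobs=$1
    local i
    # Header copied from the batch file above; only --ntasks varies.
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=pc60m2
#SBATCH --partition=pc60
#SBATCH --ntasks-per-node=60
#SBATCH --ntasks=${njobs}
#SBATCH --output=pc60m2_%j.txt
EOF
    # One backgrounded srun line per job, then a final wait.
    for (( i = 0; i < njobs; i++ )); do
        echo "srun -n1 -N1 --exclusive sh mut2_${i}.sh &"
    done
    echo "wait"
}

# Example: print a 3-job version of the file to stdout.
make_batch 3
```

Redirecting the output, for example make_batch 150 > test.sbatch (the file name is arbitrary), gives a file that can be passed to sbatch.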
A test set of 150 jobs (the first 150 srun lines) works perfectly well: it creates 3 machines and distributes 60 jobs to each of the first 2 machines and 30 jobs to the 3rd machine. I can see this by logging in to each machine.
However, when I scale up to the 4,000 jobs (all srun lines), it does not work. At first, Slurm spins up the correct number of machines (67) and starts to distribute 60 jobs to each machine. After a few minutes, however, each machine executes only one job, the scheduler becomes unstable, and the squeue command frequently returns the message "Socket timed out on send/recv operation". Eventually no job runs on any machine, even though squeue reports the jobs as running. I can scancel my run (state R), but I do not get my results because the machines are not running any of my jobs.
I tried another partition using c2-standard-4 machines, with the same image and the same configuration. In this case, it spins up 1,000 machines, but then exactly the same behavior occurs.
Does anyone have an idea why it does not work? Did I miss a parameter in the sbatch file?
Also, if there is a solution involving a job array, I would be very interested in it too, because even with a test set I could not make an array distribute jobs across all the CPUs of the machines.
Thanks in advance!
Best,
William