Job distribution on multi-core nodes


Will

Jul 7, 2021, 5:06:29 PM
to google-cloud-slurm-discuss
Hello,

I have 4,000 single-CPU commands to execute. I created a partition with c2-standard-60 machines and want to run one command per CPU -- this should all fit on 67 machines (4,000 jobs / 60 CPUs ≈ 67 machines of 60 CPUs).

My partition uses preemptible machines, I use the latest HPC image ("projects/schedmd-slurm-public/global/images/family/schedmd-slurm-20-11-7-hpc-centos-7"), and I install Slurm with Terraform. I set image_hyperthreads to true in the tfvars file.

My controller is a c2-standard-30 with the same image, and my login node is a c2-standard-4. I launched my sbatch command from the login machine.

I could not make it work with a job array, so I created a long batch file that looks like this:

#!/bin/bash
#SBATCH --job-name=pc60m2
#SBATCH --partition=pc60
#SBATCH --ntasks-per-node=60
#SBATCH --ntasks=4000
#SBATCH --output=pc60m2_%j.txt

srun -n1 -N1 --exclusive sh mut2_0.sh &
srun -n1 -N1 --exclusive sh mut2_1.sh &
srun -n1 -N1 --exclusive sh mut2_2.sh &
[4000 lines like srun....]
wait
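
(Equivalently, the 4,000 srun lines could be generated with a loop; a sketch, assuming the scripts are numbered mut2_0.sh through mut2_3999.sh:)

for i in $(seq 0 3999); do
  srun -n1 -N1 --exclusive sh mut2_${i}.sh &
done
wait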

A test set of 150 jobs (the first 150 srun lines) works perfectly well: it creates 3 machines and distributes 60 jobs to each of the first 2 machines and 30 jobs to the 3rd. I can see this by logging in to each machine.

However, when scaling up to all 4,000 jobs (all the srun lines), it does not work. At first Slurm spins up the correct number of machines (67) and starts distributing 60 jobs to each machine, but after a few minutes each machine is executing only 1 job, the scheduler becomes unstable and squeue frequently returns "Socket timed out on send/recv operation", and eventually no job runs on any machine even though squeue reports the jobs as running. I can scancel my run (state R), but I get no results because the machines never actually run my jobs.

I tried another partition using c2-standard-4 machines, with the same image and configuration. In this case it spins up 1,000 machines, but then the exact same behavior occurs...

Does anyone have an idea why this does not work? Did I miss a parameter in the sbatch file?
Also, if there is a solution involving job arrays, I would be very interested in that too, because even with a test set I could not get an array to distribute jobs across all the CPUs of a machine.

Thanks in advance!
Best,
William

Joseph Schoonover

Jul 7, 2021, 5:15:00 PM
to Will, google-cloud-slurm-discuss
Hey Will,
You likely need to remove the -N1 --exclusive flags from your srun calls - those flags indicate each job step needs a full node exclusively.
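
For example, each line in your batch file would become something like this (a sketch, keeping your script names):

srun -n1 sh mut2_0.sh &
srun -n1 sh mut2_1.sh &
[...]
wait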
Additionally, the socket timeout errors indicate a network issue. How big is your controller? Which VM image from SchedMD are you using?




Dr. Joseph Schoonover
Chief Executive Officer
Senior Research Software Engineer
j...@fluidnumerics.com

Joseph Schoonover

Jul 7, 2021, 5:15:39 PM
to Will, google-cloud-slurm-discuss
Apologies... I just saw your controller size.

Joseph Schoonover

Jul 7, 2021, 5:21:04 PM
to Will, google-cloud-slurm-discuss
When doing this with job arrays, you can try something like this:

#!/bin/bash
#SBATCH --job-name=pc60m2
#SBATCH --partition=pc60
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --array=0-4000
#SBATCH --output=pc60m2_%j.txt

sh mut2_${SLURM_ARRAY_TASK_ID}.sh &
wait

If you need to limit the number of array tasks running simultaneously, you can modify the --array option, e.g.

#SBATCH --array=0-4000%300

will only allow 300 tasks to run simultaneously.
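
(After submitting, a quick way to check how the array tasks land on the nodes -- the file name below is just a placeholder:)

sbatch array_job.sh        # the batch script above, saved under any name
squeue -r -u $USER         # -r lists one line per array task, including the node it runs on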

Alex Chekholko

Jul 7, 2021, 5:24:12 PM
to Joseph Schoonover, Will, google-cloud-slurm-discuss
I came across a related issue the other day; the default SLURM parameters baked into this config are pretty low:


alex@wm1-controller:~$ scontrol show config | grep -i Max
MaxArraySize            = 1001
MaxJobCount             = 10000
...

You'll need to bump those up. For example, I changed the values in /usr/local/etc/slurm/slurm.conf and ran:
root@wm1-controller:~# systemctl restart slurmctld
...

root@wm1-controller:~# scontrol show config | grep Max
MaxArraySize            = 10001
MaxJobCount             = 50000
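
(For reference, those are the two lines I edited in /usr/local/etc/slurm/slurm.conf -- the values themselves are just what I happened to pick:)

MaxArraySize=10001
MaxJobCount=50000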

In the case of the OP, I would just submit 4,000 single-CPU batch jobs.
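
A rough sketch of what that could look like, reusing the partition and script names from the original post:

for i in $(seq 0 3999); do
  sbatch -p pc60 -n 1 -c 1 -J pc60m2_${i} -o pc60m2_${i}.txt --wrap "sh mut2_${i}.sh"
done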

Will

Jul 7, 2021, 9:27:48 PM
to google-cloud-slurm-discuss
@Joseph: Thanks a lot!
"How big is your controller ?"
What type of machine and how big you would recommend the controller for managing this amount of jobs? 
I finally launched the with n1-standard-1cpu 4000 machines of 1 cpu using an array, however, I can't have control on my job now, any command results in 
"scancel: error: Kill job error on job id 24: Unable to contact slurm controller (connect failure)" I can't even cancel my jobs. I tried "sudo service slurmctld restart" on controller, the command works but any s (scancel) command give the same error. Do you think it is due to a weak controller machine (top command on controller does not show the machine has big usage cpu and memory)? 

@Alex: Thanks a lot! I've already bumped MaxArraySize and MaxJobCount to large values, since they were indeed set quite low by default!

Will

Jul 7, 2021, 9:36:31 PM
to google-cloud-slurm-discuss
@Joseph:

I've tried:
#!/bin/bash
#SBATCH --job-name=pc4m2
#SBATCH --partition=pc4
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --array=0-10%10
#SBATCH --output=pc4m2_%j.txt

sh mut2_${SLURM_ARRAY_TASK_ID}.sh  &
wait

As a test I used a partition with c2-standard-4 machines. I would expect 3 machines of 4 CPUs each: 2 machines running 4 jobs and 1 running 2 jobs.
However, Slurm created 10 machines with 4 CPUs each and launched only 1 job per machine, leaving the 3 remaining CPUs unused.
Any idea why?
Thanks in advance!


Alex Chekholko

Jul 7, 2021, 10:08:23 PM
to Will, google-cloud-slurm-discuss
Hi Will,

There is another dimension to your jobs: the RAM allocated per CPU. Since you don't specify it in your job spec, Slurm will use its default, so you will want to look at the DefMemPerCPU parameter, or else specify --mem-per-cpu in your job description. You probably want that value to match the CPU/RAM ratio of the instance type you are using.
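
For example, assuming a c2-standard-60 (4 GB of RAM per vCPU), something like the line below should leave all 60 CPUs usable while keeping a little memory headroom for the OS -- the exact value is my guess, not a tested number:

#SBATCH --mem-per-cpu=3500M

You can check what the nodes actually advertise, and what the default is, with:

scontrol show node | grep -i RealMemory
scontrol show config | grep -i DefMemPerCPU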

Regards,
Alex

Will

Jul 8, 2021, 11:05:30 AM
to google-cloud-slurm-discuss
Thanks Alex! Yes, this is indeed an extremely important dimension that I did not specify; I will try that.
For the moment, my jobs are running on the n1-standard-1 machines. I switched my controller from a c2-standard-30 to a c2-standard-60 and it seems to handle 2,000 single-CPU machines very well.

I still can't get jobs distributed across all the CPUs of a multi-CPU machine using:
#!/bin/bash
#SBATCH --job-name=pc60m2
#SBATCH --partition=pc60
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4Gb
#SBATCH --ntasks=1
#SBATCH --array=0-4000
#SBATCH --output=pc60m2_%j.txt

sh mut2_${SLURM_ARRAY_TASK_ID}.sh &
wait

it runs 1 job per machine, wasting 59 CPUs. Any guess why the array does not distribute jobs across the CPUs within a multi-CPU machine?
Best,
Will