[slurm-users] New slurm configuration - multiple jobs per host


Jake Jellinek

May 26, 2022, 3:13:02 PM
to slurm...@schedmd.com

Hi

I am just building my first Slurm setup and have got everything running – well, almost.

I have a two-node configuration. All of my setup exists on a single Hyper-V server and I have divided up the resources to create my VMs.

One node I will use for heavy-duty work; this is called compute001.

One node I will use for normal work; this is called compute002.

My compute node specification in slurm.conf is:

NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN
NodeName=compute001 CPUs=32
NodeName=compute002 CPUs=2

The partition specification is:

PartitionName=DEFAULT State=UP
PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE
PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE

I have added the OverSubscribe=FORCE option as I want more than one job to be able to land on my interactive/simulation queues.
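(As a side note on the syntax: OverSubscribe=FORCE can also take an optional per-resource job limit, written as FORCE:<count>; the value 4 below is purely an illustration, not something taken from this setup.)

PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE:4
PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE:4

With SelectType=select/cons_tres and CR_Core_Memory, oversubscription should only be needed when jobs are meant to share the same cores; jobs that each request fewer CPUs than a node has can already run side by side on different cores of that node.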

 

All of the nodes and the cluster master start up fine and they all talk to each other, but no matter what I do, I cannot get the cluster to accept more than one job per node.

Can you help me determine where I am going wrong?

Thanks a lot

Jake

The entire slurm.conf is pasted below:

# slurm.conf file generated by configurator.html.
ClusterName=pm-slurm
SlurmctldHost=slurm-master
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/home/slurm/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/cgroup
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
#
# LOGGING AND ACCOUNTING
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log

# COMPUTE NODES
NodeName=DEFAULT CPUs=1 RealMemory=1000 State=UNKNOWN
NodeName=compute001 CPUs=32
NodeName=compute002 CPUs=2

PartitionName=DEFAULT State=UP
PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE
PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE

Ole Holm Nielsen

May 26, 2022, 4:10:02 PM
to slurm...@lists.schedmd.com
Hi Jake,

Firstly, which Slurm version and which OS do you use?

Next, try simplifying by removing the oversubscribe configuration. Read the slurm.conf manual page about OverSubscribe; it looks a bit tricky.

The RealMemory=1000 is extremely low and might prevent jobs from
starting! Run "slurmd -C" on the nodes to read appropriate node
parameters for slurm.conf.

I hope this helps.

/Ole

Jake Jellinek

May 26, 2022, 4:37:25 PM
to Ole.H....@fysik.dtu.dk, Slurm User Community List
Hi Ole

I only added the oversubscribe option because without it, it didn't work - so, in fact, it appears not to have made any difference.

I thought the RealMemory option just said not to offer any jobs to a node that didn't have AT LEAST that amount of RAM.
My large node has more than 64GB RAM (and more will be allocated later), but I have yet to get to a memory issue…still working on cores.


jake@compute001:~$ slurmd -C
NodeName=compute001 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64359
UpTime=0-06:58:54
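(If those reported values were carried over into slurm.conf, the node line would presumably look roughly like the sketch below; the RealMemory value of 64000 is just the reported 64359 rounded down to leave a little headroom, not a measured figure.)

NodeName=compute001 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000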


Thanks
Jake

> On 26 May 2022, at 21:11, Ole Holm Nielsen <Ole.H....@fysik.dtu.dk> wrote:
>
> Hi Jake,

Lyn Gerner

Jun 2, 2022, 8:52:12 PM
to Slurm User Community List
Jake, my hunch is that your jobs are getting hung up on memory allocation, such that Slurm is assigning all of the node's memory to each job as it runs; you can verify with "scontrol show job". If that's what's happening, try setting a DefMemPerCPU value for your partition(s).
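For instance, a minimal sketch of that change against the partition lines from the original post (the DefMemPerCPU numbers below are placeholders to show the syntax, not tuned recommendations, and they need to fit within the node's real memory):

PartitionName=interactive Nodes=compute002 MaxTime=INFINITE OverSubscribe=FORCE DefMemPerCPU=500
PartitionName=simulation Nodes=compute001 MaxTime=30 OverSubscribe=FORCE DefMemPerCPU=2000

To see how much memory a running job was actually given, something like "scontrol show job <jobid>" and checking the MinMemory/TRES fields should show it (the <jobid> here is a placeholder).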

Best of luck,
Lyn

Jake Jellinek

Jun 3, 2022, 5:40:03 AM
to Slurm User Community List

Thanks Lyn – that was exactly the problem.

Jake
