[slurm-dev] heterogeneous number of processors per node, slurm won't use all processors


Andrew Petersen

Jul 18, 2014, 10:29:59 AM
to slurm-dev, Jonathan Wormald
Hello

Let's say my heterogeneous cluster has
n001 with 12 cores
n002 with 20 cores
How do I get Slurm to run a job on 12 cores of node 1 and 20 cores of node 2?  If I use -N 2 --hint=compute_bound, it will only run n001x12 and n002x12 when BatchHost=n001.  (If BatchHost is n002, it will run 20 cores on n001, causing oversubscription.)

I can do it with the low-level mpirun -machinefile
command, where machinefile has
n008:20
n001:12
However, Slurm seems to overrule this information.
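As a quick sanity check on the machinefile route, the per-node slot counts in a file like the one quoted above can be summed to confirm the total matches the intended job size (node names here are simply the ones quoted above):

```shell
# Hypothetical machinefile mirroring the one quoted above
cat > machinefile <<'EOF'
n008:20
n001:12
EOF

# Sum the per-node slot counts; a 32-core job needs this to print 32
awk -F: '{ total += $2 } END { print total }' machinefile
```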

Regards
Andrew Petersen

P.S., the output of
 scontrol show config
is:

Configuration data as of 2014-07-17T18:51:39
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = localhost
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = YES
AuthType                = auth/munge
BackupAddr              = (null)
BackupController        = (null)
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2014-06-19T10:53:14
CacheGroups             = 0
CheckpointType          = checkpoint/none
ClusterName             = slurm_cluster
CompleteWait            = 0 sec
ControlAddr             = fission
ControlMachine          = fission
CryptoType              = crypto/munge
DebugFlags              = (null)
DefMemPerNode           = UNLIMITED
DisableRootJobs         = NO
EnforcePartLimits       = NO
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
FastSchedule            = 0
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = gpu
GroupUpdateForce        = 0
GroupUpdateTime         = 600 sec
HASH_VAL                = Different Ours=0x2e2a4b6a Slurmctld=0xd9296c09
HealthCheckInterval     = 0 sec
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 30 sec
JobAcctGatherType       = jobacct_gather/linux
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend           = 0
JobRequeue              = 1
JobSubmitPlugins        = (null)
KillOnBadExit           = 0
KillWait                = 30 sec
Licenses                = (null)
MailProg                = /bin/mail
MaxJobCount             = 10000
MaxJobId                = 4294901760
MaxMemPerNode           = UNLIMITED
MaxStepCount            = 40000
MaxTasksPerNode         = 128
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
NEXT_JOB_ID             = 45294
OverTimeLimit           = 0 min
PluginDir               = /cm/shared/apps/slurm/2.3.4/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PreemptMode             = OFF
PreemptType             = preempt/none
PriorityType            = priority/basic
PrivateData             = none
ProctrackType           = proctrack/pgid
Prolog                  = (null)
PrologSlurmctld         = /cm/local/apps/cmd/scripts/prolog
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvOverRun             = 0 min
ReturnToService         = 2
SallocDefaultCommand    = (null)
SchedulerParameters     = (null)
SchedulerPort           = 7321
SchedulerRootFilter     = 1
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/linear
SelectTypeParameters    = CR_CPU
SlurmUser               = slurm(117)
SlurmctldDebug          = 3
SlurmctldLogFile        = /var/log/slurmctld
SlurmSchedLogFile       = (null)
SlurmctldPort           = 6817
SlurmctldTimeout        = 600 sec
SlurmdDebug             = 3
SlurmdLogFile           = /var/log/slurmd
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /cm/local/apps/slurm/var/spool
SlurmdTimeout           = 600 sec
SlurmdUser              = root(0)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 2.3.4
SrunEpilog              = (null)
SrunProlog              = (null)
StateSaveLocation       = /cm/shared/apps/slurm/var/cm/statesave
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = task/none
TaskPluginParam         = (null type)
TaskProlog              = (null)
TmpFS                   = /tmp
TopologyPlugin          = topology/none
TrackWCKey              = 0
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec

Slurmctld(primary/backup) at fission/(NULL) are UP/DOWN


Christopher Samuel

Jul 22, 2014, 8:49:12 PM
to slurm-dev

On 19/07/14 00:29, Andrew Petersen wrote:

> Lets say my heterogeneous cluster has n001 with 12 cores n002 with
> 20 cores How do I get slurm to run a job on 12 cores of node 1 and
> 20 cores on node 2?

I'm assuming you want a single MPI job using 32 cores across both nodes?

Does --ntasks=32 (and no node specification) not work for that?

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci

Andrew Petersen

Jul 23, 2014, 11:05:20 AM
to slurm-dev
Yes, I want a single 32-core MPI job.  No, that does not work.  To test this, I created a partition (called "test") that contains one 12-core node and one 20-core node.
 sbatch -p test --ntasks=32  lmpFission.pka1.sh
The job runs on 12 cores on each node.  If I rearrange the node order in the "test" partition so that BatchHost is a 20-core node, I get a similar problem, except it runs on 20 cores on each node, oversubscribing the 12-core node.
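One way to see exactly what Slurm handed the job, independent of what mpirun then does with it, is to sum SLURM_JOB_CPUS_PER_NODE inside the batch script.  That variable uses a compressed form such as "12,20" or "16(x2)", so the repetition syntax has to be expanded.  A rough sketch, with the variable simulated here at the value the two-node test partition should report if both nodes were allocated in full:

```shell
# Simulated value; inside a real batch script Slurm sets this itself
SLURM_JOB_CPUS_PER_NODE="12,20"

# Expand entries like "12" or "16(x2)" and sum the total core count
total=0
for field in $(echo "$SLURM_JOB_CPUS_PER_NODE" | tr ',' ' '); do
  cpus=$(echo "$field" | cut -d'(' -f1)                      # cores per node
  reps=$(echo "$field" | sed -n 's/.*(x\([0-9]*\)).*/\1/p')  # repeat count, if any
  total=$((total + cpus * ${reps:-1}))
done
echo "$total"
```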
Neither does
sbatch -p test --ntasks=32 --hint=compute_bound --hint=nomultithread -s lmpFission.pka1.sh
work.

Is the behaviour I am seeing unusual?  Will upgrading fix it?  I downloaded and untarred the 14.03.6 bz2 file, but upgrading is not straightforward in my case, so I don't want to upgrade until I have reasonable assurance that the new version is supposed to work the way I want.
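For reference, the behaviour above looks consistent with SelectType=select/linear from the config dump, which allocates whole nodes rather than individual cores.  The consumable-resources plugin is the mechanism Slurm documents for counting cores individually on nodes of different sizes; a sketch of the relevant slurm.conf lines (untested here, and the node lines are purely illustrative, only one piece of a real migration):

```
SelectType=select/cons_res
SelectTypeParameters=CR_Core
# Node definitions must carry accurate core counts for this to help:
NodeName=n001 CPUs=12 State=UNKNOWN
NodeName=n002 CPUs=20 State=UNKNOWN
```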

Christopher Samuel

Jul 27, 2014, 7:41:12 PM
to slurm-dev

On 24/07/14 01:05, Andrew Petersen wrote:

> Yes, I want a single 32 core mpi job. No, that does not work. To test
> this, I created a partition (called test) that contains one 12 core node
> and one 20 core node.

Oh dear. We've only got one cluster with a mix of CPU core counts,
so I'll try to find time to create a test partition there and check
whether we see the same thing.

We're (currently) on 2.6.5, what version are you on?