[slurm-dev] Configuration Issues


Carl E. Fields

Mar 30, 2015, 1:49:13 PM
to slurm-dev
Hello,

I have installed Slurm version 14.11.4 on a RHEL server with the following specs:


Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 23
Stepping:              6
CPU MHz:               2300.000
BogoMIPS:              4600.00
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0,1

I wish to dedicate one core to the controller and make the other core available for jobs that require one core.

I have configured everything; however, I believe there is an error in my slurm.conf file, because when I submit a job it sits in the queue with the reason shown below:

             JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
                20   compute calculat SlurmUse  PENDING       0:00     10:00      1 (Resources)



I believe I am not configuring the resources properly in my file, but I am unsure where the issue lies. I hope someone can help me configure my server properly. Thank you in advance.

My current slurm.conf file:



[SlurmUser@sod264 etc]$ cat slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=sod264
ControlAddr=129.XXX
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=0
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=SlurmUser
SlurmdUser=SlurmUser
StateSaveLocation=/var/spool/statesave
SwitchType=switch/none
TaskPlugin=task/none
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
#SelectType=select/serial
SelectType=select/cons_res
SelectTypeParameters=CR_CORE
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=MESA-Web
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
#
# COMPUTE NODES
NodeName=sod264 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=128940 TmpDisk=19895

PartitionName=compute Nodes=sod264 Default=YES STATE=UP
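
To cross-check the NodeName line against the hardware, the topology keywords can be derived from the lscpu output posted above. The following is only a minimal sketch, assuming bash and awk are available; `lscpu_to_slurm` is a hypothetical helper name and not part of Slurm.

```shell
#!/bin/bash
# lscpu_to_slurm: read `lscpu` output on stdin and print the matching
# slurm.conf topology keywords. Hypothetical helper, not a Slurm tool.
lscpu_to_slurm() {
    awk -F: '
        # "+ 0" forces a numeric value and discards the padding blanks.
        $1 == "CPU(s)"             { cpus    = $2 + 0 }
        $1 == "Socket(s)"          { sockets = $2 + 0 }
        $1 == "Core(s) per socket" { cores   = $2 + 0 }
        $1 == "Thread(s) per core" { threads = $2 + 0 }
        END {
            printf "CPUs=%d Sockets=%d CoresPerSocket=%d ThreadsPerCore=%d\n",
                   cpus, sockets, cores, threads
            # When CPUs is omitted, slurm.conf derives it from this product,
            # so a mismatch here is worth a second look.
            if (cpus != sockets * cores * threads)
                print "warning: CPUs differs from sockets*cores*threads" > "/dev/stderr"
        }'
}
```

Running `lscpu | lscpu_to_slurm` on sod264 should give values that can be compared with the NodeName entry. `slurmd -C` prints the hardware configuration as slurmd itself detects it, which is the more authoritative check.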




Kind Regards,

Carl



Uwe Sauter

Mar 30, 2015, 1:54:11 PM
to slurm-dev

It would be helpful to see how you submitted the job. And the output from "scontrol show job 20".

Regards,

Uwe

Carl E. Fields

Mar 30, 2015, 2:01:10 PM
to slurm-dev
Dear Uwe,

Thank you for your response.

I submitted the job with:

$ sbatch calculate.sh

and the output you requested:

[SlurmUser@sod264 services]$ scontrol show job 20
JobId=20 JobName=calculate.sh
   UserId=SlurmUser(3099) GroupId=SlurmUser(3099)
   Priority=4294901742 Nice=0 Account=slurmuser QOS=(null)
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2015-03-29T13:09:21 EligibleTime=2015-03-29T13:09:21
   StartTime=2015-03-31T10:56:06 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=compute AllocNode:Sid=sod264:792
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/var/www/virtual/mesa-web.asu.edu/html/services/calculate.sh
   WorkDir=/var/www/virtual/mesa-web.asu.edu/html/services/Work/
   StdErr=/var/www/virtual/mesa-web.asu.edu/html/services/Work//job.%J.err
   StdIn=/dev/null
   StdOut=/var/www/virtual/mesa-web.asu.edu/html/services/Work//job.%J.out

Thanks,

Carl

Mehdi Denou

Mar 30, 2015, 2:13:48 PM
to slurm-dev
Hi all,

Carl, could you post the output of sinfo?
-- 
---
Mehdi Denou
International HPC support
+336 45 57 66 56

Carl E. Fields

Mar 30, 2015, 2:17:52 PM
to slurm-dev
Hello,

Output below:

[SlurmUser@sod264 ~]$ sinfo -l

Mon Mar 30 11:13:12 2015
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT    SHARE     GROUPS  NODES       STATE NODELIST
compute*     up   infinite 1-infinite   no       NO        all      1     drained sod264

[SlurmUser@sod264 ~]$ 
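
To pick out drained nodes from output like this without reading the table by eye, the STATE column can be filtered. This is a rough sketch, assuming the long (`-l`) format shown here; `drained_nodes` is a made-up helper name.

```shell
#!/bin/bash
# drained_nodes: read `sinfo -l` output on stdin and print the NODELIST
# entry of every row whose STATE column mentions "drain".
# Hypothetical helper, not a Slurm tool.
drained_nodes() {
    awk '
        # Find the STATE column from the header row, wherever it sits.
        $1 == "PARTITION" {
            for (i = 1; i <= NF; i++) if ($i == "STATE") col = i
            next
        }
        # Rows before the header (such as the date line) are skipped
        # because col is still unset.
        col && $col ~ /drain/ { print $NF }'
}
```

With the output above, `sinfo -l | drained_nodes` would print `sod264`. `sinfo -R` goes one step further and lists the recorded reason for each down or drained node.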




Thank you,

Carl

Uwe Sauter

Mar 30, 2015, 2:20:47 PM
to slurm-dev

Hi,

Please post your answers to the list so others can help, too.

The reason for the node's IDLE+DRAIN state is given by the output of "scontrol show node":

Reason=Low socket*core*thread count, Low CPUs [SlurmUser@2015-03-11T22:15:12]

Why slurmctld decided to put the node into this state is a bit unclear to me, since slurm.conf contains:

NodeName=sod264 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=128940 TmpDisk=19895

And it correctly detects 2 CPUs (again "scontrol show node"):

CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.00


You might want to try *not* specifying the sockets/cores/threads in slurm.conf and letting slurmd detect the values on its own.

If someone else has a better idea, it's welcome.

Regards,

Uwe
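
One thing that may be worth checking: in the `scontrol show node` output quoted in this thread, the Reason is stamped 2015-03-11, while SlurmdStartTime is 2015-03-29. The drain could therefore be left over from an earlier configuration, since a drained node stays drained until an administrator resumes it. The sketch below is an assumption-laden illustration, not a confirmed fix; `node_reason` is a hypothetical helper that extracts the reason and its stamp so the two dates can be compared.

```shell
#!/bin/bash
# node_reason: read `scontrol show node` output on stdin (mail-quote "> "
# prefixes are tolerated) and print "reason|who@when".
# Hypothetical helper, not a Slurm tool.
node_reason() {
    sed -n 's/^[> ]*Reason=\(.*\) \[\(.*\)\][[:space:]]*$/\1|\2/p'
}

# Usage on a live system (needs the Slurm CLI, run as root or SlurmUser):
#   slurmd -C                                      # hardware as slurmd detects it
#   scontrol show node sod264 | node_reason        # why and when the node was drained
#   scontrol update NodeName=sod264 State=RESUME   # clear the drain once slurm.conf matches
```

If `slurmd -C` agrees with the NodeName line in slurm.conf, resuming the node should let the pending job start. If the two disagree, the configuration needs correcting first, otherwise the node is likely to be drained again.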


On 30.03.2015 at 20:03, Carl E. Fields wrote:
> Output of calculate.sh
>
> [SlurmUser@sod264 services]$ cat calculate.sh
> #!/bin/bash
> #SBATCH -A SlurmUser
> #SBATCH --workdir=/var/www/virtual/mesa-web.asu.edu/html/services/Work/
> #SBATCH -n 1
> #SBATCH -c 1
> #SBATCH --time=00:10:00
> #SBATCH --mail-type=ALL
> #SBATCH --mail-user=c...@asu.edu
> #SBATCH --error=job.%J.err
> #SBATCH --output=job.%J.out
> #SBATCH --export=MESA_DIR=/home/cefields/mesa
> #SBATCH --export=OMP_NUM_THREADS=1
> #SBATCH --export=MESASDK_ROOT=/home/cefields/mesasdk
> source $MESASDK_ROOT/bin/mesasdk_init.sh
>
> srun ./mv.sh
> echo 'WAITING 1 MINUTE FOR FILES TO BE MOVED...'
> srun ./mk
> echo 'make...'
> srun ./rn
> echo 'run...'
>
> State of node:
>
>
> [SlurmUser@sod264 services]$ scontrol show node sod264
> NodeName=sod264 Arch=x86_64 CoresPerSocket=2
> CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.00 Features=(null)
> Gres=(null)
> NodeAddr=sod264 NodeHostName=sod264 Version=14.11
> OS=Linux RealMemory=128940 AllocMem=0 Sockets=1 Boards=1
> State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=19895 Weight=1
> BootTime=2015-03-10T12:12:21 SlurmdStartTime=2015-03-29T13:08:30
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Low socket*core*thread count, Low CPUs [SlurmUser@2015-03-11T22:15:12]
>
> [SlurmUser@sod264 services]$
>
> Thanks,
>
> Carl
>
> On Mon, Mar 30, 2015 at 11:01 AM, Uwe Sauter <uwe.sa...@gmail.com> wrote:
>
> And what is the state of your node ("sinfo -l" output)? Or "scontrol show node sod264"?
>
>