[slurm-users] Slurm not starting


Elisabetta Falivene

Jan 15, 2018, 7:14:38 AM
to Slurm User Community List
I did an upgrade from wheezy to jessie (automatically, with a normal dist-upgrade) on a cluster with 8 nodes (up, running and reachable), and from Slurm 2.3.4 to 14.03.9. I overcame some problems booting the kernel (thank you very much to Gennaro Oliva, btw), and the system is now running correctly with kernel 3.16.0-4, but Slurm isn't starting. I tried restarting the services, but it seems unable to do so.

The error messages aren't helping me much in guessing what is going on. What should I check to find out what is failing?

Thank you 
Elisabetta

PS: Here are some tests I did

Running  
sinfo

returns

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up   infinite      8   unk* node[01-08]


Running 
systemctl status slurmctld.service

returns 

slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
   Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
  Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)

 slurmctld[2100]: cons_res: select_p_reconfigure
 slurmctld[2100]: cons_res: select_p_node_init
 slurmctld[2100]: cons_res: preparing for 1 partitions
 slurmctld[2100]: Running as primary controller
 slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
 slurmctld.service start operation timed out. Terminating.
Terminate signal (SIGINT or SIGTERM) received
 slurmctld[2100]: Saving all slurm state
 Failed to start Slurm controller daemon.
 Unit slurmctld.service entered failed state.

and running

/etc/init.d/slurmd status

returns

slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled)
   Active: failed (Result: exit-code) since Mon 2018-01-15 12:44:52 CET; 21min ago
  Process: 729 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)

slurmd.service: control process exited, code=exited status=1
systemd[1]: Failed to start Slurm node daemon.
Unit slurmd.service entered failed state.


Gennaro Oliva

Jan 15, 2018, 8:09:40 AM
to Slurm User Community List
Ciao Elisabetta,

On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
> Error messages are not much helping me in guessing what is going on. What
> should I check to get what is failing?

check slurmctld.log and slurmd.log, you can find them under
/var/log/slurm-llnl

> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> batch*       up   infinite      8   unk* node[01-08]
>
>
> Running
> systemctl status slurmctld.service
>
> returns
>
> slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>
>  slurmctld[2100]: cons_res: select_p_reconfigure
>  slurmctld[2100]: cons_res: select_p_node_init
>  slurmctld[2100]: cons_res: preparing for 1 partitions
>  slurmctld[2100]: Running as primary controller
>  slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>  slurmctld.service start operation timed out. Terminating.
> Terminate signal (SIGINT or SIGTERM) received
>  slurmctld[2100]: Saving all slurm state
>  Failed to start Slurm controller daemon.
>  Unit slurmctld.service entered failed state.

Do you have a backup controller?
Check your slurm.conf under:
/etc/slurm-llnl

Anyway, I suggest upgrading the operating system to stretch and fixing your configuration under a more recent version of Slurm.
Best regards
--
Gennaro Oliva

Williams, Jenny Avis

Jan 15, 2018, 8:17:26 AM
to Slurm User Community List
Elisabetta-

Start by focusing on slurmctld; slurmd won't be happy without it.
Start it manually in the foreground, as in
/usr/sbin/slurmctld -D -vvv

This assumes slurm.conf is in the default location.
Pardon brevity; on my phone
Jenny Williams



From: Elisabetta Falivene <e.fal...@ilabroma.com>
Sent: Monday, January 15, 2018 7:14 AM
To: Slurm User Community List
Subject: [slurm-users] Slurm not starting

Elisabetta Falivene

Jan 15, 2018, 9:30:55 AM
to Slurm User Community List
> Anyway I suggest to update the operating system to stretch and fix your
> configuration under a more recent version of slurm.

I think I'll soon get to that :)
b

Douglas Jacobsen

Jan 15, 2018, 9:59:09 AM
to Slurm User Community List
The fact that sinfo is responding shows that at least slurmctld is running. Slurmd, on the other hand, is not. Please also get the output of the slurmd log, or of running "slurmd -Dvvv".

Elisabetta Falivene

Jan 15, 2018, 10:30:50 AM
to Slurm User Community List
slurmd -Dvvv says

slurmd: fatal: Unable to determine this slurmd's NodeName

b

John Hearns

Jan 15, 2018, 10:36:34 AM
to Slurm User Community List
That's it. I am calling JohnH's Law:
"Any problem with a batch queueing system is due to hostname resolution"
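In that spirit, here is a minimal sketch of the hostname-vs-NodeName check behind this law. The node list mirrors the NodeName=node[01-08] line from this thread's slurm.conf, expanded by hand, and "node09" is a hypothetical hostname standing in for `hostname -s` so the sketch is self-contained; on a real node you would compare the actual short hostname against the real /etc/slurm-llnl/slurm.conf.

```shell
#!/bin/sh
# Check whether this machine's short hostname appears among the
# NodeName entries of slurm.conf. slurmd aborts with
# "Unable to determine this slurmd's NodeName" when it does not.
listed="node01 node02 node03 node04 node05 node06 node07 node08"
short="${SHORT_HOSTNAME:-node09}"   # on a real node: short=$(hostname -s)
if echo "$listed" | grep -qw "$short"; then
  verdict="listed"
else
  verdict="missing"
fi
echo "$short: $verdict"
```

On the headnode the verdict is "missing" by design, which is exactly why slurmd refuses to start there.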

Carlos Fenoy

Jan 15, 2018, 10:43:57 AM
to Slurm User Community List
Are you trying to start slurmd on the headnode or on a compute node?

Can you provide the slurm.conf file?

Regards,
Carlos
--
--
Carles Fenoy

Elisabetta Falivene

Jan 15, 2018, 10:50:30 AM
to Slurm User Community List
On the headnode. (I'm also noticing, and it seems worth mentioning since maybe the problem is the same, that even LDAP is not working as expected, giving an "invalid credential (49)" message, which is typical of this kind of problem. The upgrade to jessie must have touched something that is affecting the sanity of all my software :D)

Here is my slurm.conf.

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=anyone
ControlAddr=master
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=openmpi
MpiParams=ports=12000-12999
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=2
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=60
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=43200
#OverTimeLimit=0
SlurmctldTimeout=600
SlurmdTimeout=600
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
DefMemPerCPU=1000
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
#SchedulerPort=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
JobCompLoc=/var/log/slurm-llnl/JobComp.log
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/filetxt
#JobCompUser=
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP

Douglas Jacobsen

Jan 15, 2018, 10:51:01 AM
to Slurm User Community List
Please check your slurm.conf on the compute nodes; I think your compute node isn't appearing in slurm.conf properly.

Carlos Fenoy

Jan 15, 2018, 10:56:55 AM
to Slurm User Community List
Hi,

you cannot start slurmd on the headnode. Try running the same command on the compute nodes and check the output. If there is any issue, it should display the reason.

Regards,
Carlos
--
--
Carles Fenoy

Elisabetta Falivene

Jan 15, 2018, 10:57:35 AM
to Slurm User Community List
Googling a bit, the error "slurmd: fatal: Unable to determine this slurmd's NodeName" comes up when you try to run slurmd on the master, which shouldn't execute slurmd(?). It must be up on the nodes, not on the master.

Elisabetta Falivene

Jan 15, 2018, 12:23:00 PM
to Slurm User Community List
The deeper I go into the problem, the worse it seems... but maybe I'm a step closer to the solution.

I discovered that munge was disabled on the nodes (my fault; Gennaro pointed out the problem before, but I had enabled it back only on the master). Btw, it's very strange that the wheezy->jessie upgrade disabled munge on all the nodes and the master...

Unfortunately, re-enabling munge on the nodes didn't make slurmd start again.

Maybe filling in this setting could give me some info about the problem?
#SlurmdLogFile=

Thank you very much for your help. It is very precious to me.
betta

PS: some tests I made ->

Running on the nodes

slurmd -Dvvv

returns

slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: Considering each NUMA node as a socket
slurmd: debug:  CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
slurmd: Node configuration differs from hardware: CPUs=16:16(hw) Boards=1:1(hw) SocketsPerBoard=16:4(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
slurmd: topology NONE plugin loaded
slurmd: Gathering cpu frequency information for 16 cpus
slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: Considering each NUMA node as a socket
slurmd: debug:  CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug:  task/cgroup: now constraining jobs allocated cores
slurmd: task/cgroup: loaded
slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmd: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: Warning: Core limit is only 0 KB
slurmd: slurmd version 14.03.9 started
slurmd: Job accounting gather LINUX plugin loaded
slurmd: debug:  job_container none plugin loaded
slurmd: switch NONE plugin loaded
slurmd: slurmd started on Mon, 15 Jan 2018 18:07:17 +0100
slurmd: CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=15999 TmpDisk=40189 Uptime=1254
slurmd: AcctGatherEnergy NONE plugin loaded
slurmd: AcctGatherProfile NONE plugin loaded
slurmd: AcctGatherInfiniband NONE plugin loaded
slurmd: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
^Cslurmd: got shutdown request
slurmd: waiting on 1 active threads
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
^C^C^C^Cslurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: error: Unable to register: Unable to contact slurm controller (connect failure)
slurmd: debug:  Unable to register with slurm controller, retrying
slurmd: all threads complete
slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
slurmd: Munge cryptographic signature plugin unloaded
slurmd: Slurmd shutdown completing

which maybe is not as bad as it seems, since it may only indicate that slurmctld is not up on the master, right?

On the master running

service slurmctld restart

returns

Job for slurmctld.service failed. See 'systemctl status slurmctld.service' and 'journalctl -xn' for details.

and 

service slurmctld status

returns

slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
   Active: failed (Result: timeout) since Mon 2018-01-15 18:11:20 CET; 44s ago
  Process: 2223 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)

 slurmctld[2225]: cons_res: select_p_reconfigure
 slurmctld[2225]: cons_res: select_p_node_init
 slurmctld[2225]: cons_res: preparing for 1 partitions
 slurmctld[2225]: Running as primary controller
 slurmctld[2225]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
 systemd[1]: slurmctld.service start operation timed out. Terminating.
 slurmctld[2225]: Terminate signal (SIGINT or SIGTERM) received
 slurmctld[2225]: Saving all slurm state
 systemd[1]: Failed to start Slurm controller daemon.
 systemd[1]: Unit slurmctld.service entered failed state.

and 
journalctl -xn

returns no visible error

-- Logs begin at Mon 2018-01-15 18:04:38 CET, end at Mon 2018-01-15 18:17:33 CET. --
Jan 15 18:16:23 anyone.phys.uniroma1.it slurmctld[2286]: Saving all slurm state
Jan 15 18:16:23 anyone.phys.uniroma1.it systemd[1]: Failed to start Slurm controller daemon.
-- Subject: Unit slurmctld.service has failed
-- Defined-By: systemd
-- 
-- Unit slurmctld.service has failed.
-- 
-- The result is failed.
 systemd[1]: Unit slurmctld.service entered failed state.
 CRON[2312]: pam_unix(cron:session): session opened for user root by (uid=0)
CRON[2313]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
CRON[2312]: pam_unix(cron:session): session closed for user root
dhcpd[1538]: DHCPREQUEST for 192.168.1.101 from c8:60:00:32:c6:c4 via eth1
 dhcpd[1538]: DHCPACK on 192.168.1.101 to c8:60:00:32:c6:c4 via eth1
dhcpd[1538]: DHCPREQUEST for 192.168.1.102 from bc:ae:c5:12:97:75 via eth1
dhcpd[1538]: DHCPACK on 192.168.1.102 to bc:ae:c5:12:97:75 via eth1



Carlos Fenoy

Jan 15, 2018, 12:32:50 PM
to Slurm User Community List

It seems like the pid file paths in the systemd unit and in slurm.conf are different. Check whether they are the same and, if not, adjust the pid file paths in slurm.conf. That should prevent systemd from killing Slurm.
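For anyone following along, a sketch of that comparison. The heredocs below are stand-in sample files reproducing this thread's mismatch; on the real system you would read /etc/slurm-llnl/slurm.conf and the PIDFile= line of /lib/systemd/system/slurmctld.service instead.

```shell
#!/bin/sh
# Compare the pid file path slurm.conf declares with the one the
# systemd unit waits for.
cat > /tmp/sample-slurm.conf <<'EOF'
SlurmctldPidFile=/var/run/slurmctld.pid
EOF
cat > /tmp/sample-slurmctld.service <<'EOF'
[Service]
PIDFile=/var/run/slurm-llnl/slurmctld.pid
EOF
conf_pid=$(sed -n 's/^SlurmctldPidFile=//p' /tmp/sample-slurm.conf)
unit_pid=$(sed -n 's/^PIDFile=//p' /tmp/sample-slurmctld.service)
if [ "$conf_pid" = "$unit_pid" ]; then
  echo "pid paths match: $conf_pid"
else
  echo "MISMATCH: slurm.conf writes $conf_pid, unit waits for $unit_pid"
fi
```

When the paths differ, the daemon writes its pid where systemd never looks, so the unit's start operation times out and systemd kills the daemon, which matches the "start operation timed out. Terminating." messages above.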

Christopher Samuel

Jan 15, 2018, 5:02:34 PM
to slurm...@lists.schedmd.com
On 16/01/18 04:22, Elisabetta Falivene wrote:

> slurmd: debug2: _slurm_connect failed: Connection refused
> slurmd: debug2: Error connecting slurm stream socket at
> 192.168.1.1:6817: Connection refused

This sounds like the compute node cannot connect back to
slurmctld on the management node, you should check that the
IP address it is using is correct and that there are no firewall
rules on the management node preventing connections to port 6817.
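A minimal reachability check along those lines, run from a compute node; 192.168.1.1 and 6817 are taken from the slurmd log above, and this assumes bash and coreutils' timeout are available so no extra tools are needed.

```shell
#!/bin/bash
# Probe the controller's slurmctld port using bash's built-in /dev/tcp.
# Failure here means slurmctld is down, the address in slurm.conf is
# wrong, or a firewall is blocking the port.
host=192.168.1.1
port=6817
if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
  echo "$host:$port reachable"
else
  echo "$host:$port unreachable"
fi
```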

Good luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Elisabetta Falivene

Jan 16, 2018, 7:21:04 AM
to Slurm User Community List
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused

This sounds like the compute node cannot connect back to
slurmctld on the management node, you should check that the
IP address it is using is correct and that there are no firewall
rules on the management node preventing connections to port 6817.


Yes, exactly. That's not so strange, since slurmctld is not running on the master :) so I think I must resolve the issue on the master first and then worry about the nodes.

Elisabetta Falivene

Jan 16, 2018, 7:26:01 AM
to Slurm User Community List

It seems like the pidfile in systemd and slurm.conf are different. Check if they are the same and if not adjust the slurm.conf pid files. That should prevent systemd from killing slurm.

Ehm, sorry, how can I do this?

Elisabetta Falivene

Jan 16, 2018, 10:33:57 AM
to Slurm User Community List
Here is the solution and another (minor) problem!

Investigating in the direction of the pid problem, I found that the config contained

SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid

but the pid files were being looked for in /var/run/slurm-llnl, so in the slurm.conf of the master AND of the nodes I changed them to
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid

being again able to launch slurmctld on the master and slurmd on the nodes.
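One caveat with the /var/run/slurm-llnl paths: on jessie, /var/run is a tmpfs, so that directory has to be recreated at every boot with the right owner, or the daemons cannot write their pid files. The Debian packages may already take care of this; if not, a tmpfiles.d fragment like the following sketch does it (assuming SlurmUser=slurm; the file name is hypothetical).

```
# /etc/tmpfiles.d/slurm.conf (hypothetical name; applied at boot, or
# immediately with "systemd-tmpfiles --create")
d /var/run/slurm-llnl 0755 slurm slurm -
```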

At this point the nodes were all being set to drain automatically, giving an error like

error: Node node01 has low real_memory size (15999 < 16000)

so it was necessary to change, in the slurm.conf (master and nodes),
NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
to
NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN

Now, slurm works and the nodes are running. There is only one minor problem:

error: Node node04 has low real_memory size (7984 < 15999)
error: Node node02 has low real_memory size (3944 < 15999)

Two nodes are still put into drain state. The nodes suffered physical damage to some RAM modules and I had to physically remove them, so slurm thinks it is not a good idea to use the nodes.
Is it possible to make slurm use them anyway? I know I can scontrol update NodeName=node04 State=RESUME to put a node back into idle state, but as soon as the machine is rebooted or the service restarted, it is set to drain again.
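As an aside on where the 15999-vs-16000 numbers come from: Slurm checks RealMemory against the memory the node actually reports, which is usually slightly below the nominal module size. A minimal sketch of the manual check follows; on the node itself, `slurmd -C` prints the values slurmd detects.

```shell
#!/bin/sh
# Read MemTotal from /proc/meminfo and convert it to the MB figure
# that RealMemory in slurm.conf must not exceed.
mem_kb=$(sed -n 's/^MemTotal:[[:space:]]*\([0-9][0-9]*\) kB$/\1/p' /proc/meminfo)
mem_mb=$((mem_kb / 1024))
echo "RealMemory on this node should be at most ${mem_mb}"
```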

Thank you for your help!
b

Gennaro Oliva

Jan 16, 2018, 1:35:42 PM
to Slurm User Community List
Ciao Elisabetta,

On Tue, Jan 16, 2018 at 04:32:47PM +0100, Elisabetta Falivene wrote:
> being again able to launch slurmctld on the master and slurmd on the nodes.

great!

> NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
> to
> NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN
>
> Now, slurm works and the nodes are running. There is only one minor problem:
>
> error: Node node04 has low real_memory size (7984 < 15999)
> error: Node node02 has low real_memory size (3944 < 15999)
>
> Two nodes are still put into drain state. The nodes suffered physical
> damage to some RAM modules and I had to physically remove them, so slurm
> thinks it is not a good idea to use the nodes.
> Is it possible to make slurm use them anyway?

I think you can specify their properties on separate lines:

NodeName=node[01,03,05-08] CPUs=16 RealMemory=15999 State=UNKNOWN*
NodeName=node02 CPUs=16 RealMemory=3944 State=UNKNOWN*
NodeName=node04 CPUs=16 RealMemory=7984 State=UNKNOWN*

Elisabetta Falivene

Jan 17, 2018, 8:24:31 AM
to Slurm User Community List
Ciao Gennaro!
 
> NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
> to
> NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN
>
> Now, slurm works and the nodes are running. There is only one minor problem:
>
> error: Node node04 has low real_memory size (7984 < 15999)
> error: Node node02 has low real_memory size (3944 < 15999)
>
> Two nodes are still put into drain state. The nodes suffered physical
> damage to some RAM modules and I had to physically remove them, so slurm
> thinks it is not a good idea to use the nodes.
> Is it possible to make slurm use them anyway?

I think you can specify their properties on separate lines:

NodeName=node[01,03,05-08] CPUs=16 RealMemory=15999 State=UNKNOWN*
NodeName=node02 CPUs=16 RealMemory=3944 State=UNKNOWN*
NodeName=node04 CPUs=16 RealMemory=7984 State=UNKNOWN*


It was possible indeed! It only required typing "UNKNOWN" instead of "UNKNOWN*".
Problem fully solved!
Thank you very much!
Elisabetta 