[slurm-dev] Worker nodes down, draining?

Simpson Lachlan

Dec 17, 2015, 11:21:27 PM
to slurm-dev

I'm not 100% sure what I'm doing wrong, but I can't seem to get SLURM working in any real sense. I am getting more "SUCCESS" responses in the testing.

The symptoms are similar but inconsistent: there always seem to be a couple of nodes DOWN or DRAINING, and they never come back up. It seems to be a different node every time I reboot or restart the services.

Questions, based on the outputs below:

1. Why am I getting "Cray node selection plugin loaded"? I don't remember setting this, and I'm not using a Cray.
2. What do down and drain represent, and what does the trailing * mean?
3. What does "Low RealMemory" mean, and how can I fix it?
4. "error: If munged is up, restart with --num-threads=10": I see this a lot in each node's slurm-d.log, but I can't find anything online about how to fix it, and sudo systemctl start munge --num-threads=10 fails with "/bin/systemctl: unrecognised option '--num-threads=10'". Sure enough, I can't see that option anywhere in the init scripts.

Cheers
L.

===================================================
[ec2-user@slurm-head slurm-administration]$ sudo sinfo -a -R -l -v
-----------------------------
dead = false
exact = 0
filtering = true
format = %20E %12U %19H %6t %N
iterate = 0
long = true
no_header = false
node_field = false
node_format = false
nodes = n/a
part_field = false
partition = n/a
responding = false
states = down,drain,error
sort = (null)
summarize = false
verbose = 1
-----------------------------
all_flag = true
alloc_mem_flag = false
avail_flag = false
bg_flag = false
cpus_flag = false
default_time_flag =false
disk_flag = false
features_flag = false
groups_flag = false
gres_flag = false
job_size_flag = false
max_time_flag = false
memory_flag = false
partition_flag = false
priority_flag = false
reason_flag = true
reason_timestamp_flag = true
reason_user_flag = true
reservation_flag = false
root_flag = false
share_flag = false
state_flag = true
weight_flag = false
-----------------------------

Fri Dec 18 03:28:35 2015
sinfo: Cray node selection plugin loaded
REASON USER TIMESTAMP STATE NODELIST
Not responding slurm(1001) 2015-12-18T02:52:28 down* slurm-w1
Low RealMemory slurm(1001) 2015-12-18T03:09:08 drain slurm-test,slurm-w2

===================================================

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-w2
NodeName=slurm-w2 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
Gres=(null)
NodeAddr=slurm-w2 NodeHostName=slurm-w2 Version=15.08
OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31412 Sockets=8 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
BootTime=2015-12-17T23:51:43 SlurmdStartTime=2015-12-17T23:51:55
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2015-12-18T03:09:08]

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-w3
NodeName=slurm-w3 Arch=x86_64 CoresPerSocket=1
CPUAlloc=8 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
Gres=(null)
NodeAddr=slurm-w3 NodeHostName=slurm-w3 Version=15.08
OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31404 Sockets=8 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
BootTime=2015-12-17T23:51:54 SlurmdStartTime=2015-12-17T23:52:04
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s


[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-w1
NodeName=slurm-w1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
Gres=(null)
NodeAddr=slurm-w1 NodeHostName=slurm-w1 Version=15.08
OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31426 Sockets=8 Boards=1
State=DOWN* ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
BootTime=2015-12-17T23:51:37 SlurmdStartTime=2015-12-17T23:51:48
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Not responding [slurm@2015-12-18T02:52:28]

[ec2-user@slurm-head slurm-administration]$ sudo scontrol show node slurm-test
NodeName=slurm-test Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUErr=0 CPUTot=8 CPULoad=0.01 Features=(null)
Gres=(null)
NodeAddr=slurm-test NodeHostName=slurm-test Version=15.08
OS=Linux RealMemory=32014 AllocMem=0 FreeMem=31438 Sockets=8 Boards=1
State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
BootTime=2015-12-17T23:53:30 SlurmdStartTime=2015-12-17T23:53:40
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [slurm@2015-12-18T03:09:08]

===================================================

What I'm seeing in the logs:

from slurm-d.log

[2015-12-18T02:37:29.573] error: Unable to register: Protocol authentication error
[2015-12-18T02:37:30.705] error: If munged is up, restart with --num-threads=10
[2015-12-18T02:37:30.705] error: Munge decode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
[2015-12-18T02:37:30.705] error: authentication: Socket communication error



from slurm-ctld.log

[2015-12-18T03:42:28.584] error: Node slurm-w3 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.584] error: Node slurm-w2 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.585] error: Node slurm-w1 has low real_memory size (32013 < 32014)
[2015-12-18T03:42:28.586] error: Node slurm-test has low real_memory size (32013 < 32014)
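
The mismatch in these log lines (each node reports 32013 MB while slurm.conf declares RealMemory=32014) can be pulled out of the log with a small filter. This is only a sketch: the function name reported_memory is made up for illustration, and it assumes the message format shown in the lines above.

```shell
# Hypothetical helper (not part of Slurm): read slurmctld log lines on stdin
# and print, for each "low real_memory" error, the node name, the memory in MB
# that the node reported, and the value configured in slurm.conf.
reported_memory() {
    sed -n 's/.*Node \([^ ]*\) has low real_memory size (\([0-9]*\) < \([0-9]*\)).*/\1 reported=\2 configured=\3/p'
}
```

On the head node this could be run as reported_memory < /var/log/slurm/slurm-ctld.log. If the reported figure is consistently lower than the configured one, lowering RealMemory in slurm.conf to at most the reported value (slurmd -C on a node prints what it detects) and then resuming the drained nodes with scontrol update NodeName=<node> State=RESUME is one plausible way to clear the "Low RealMemory" drain.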


Simpson Lachlan

Dec 17, 2015, 11:21:31 PM
to slurm-dev

Just to further clarify: I'm running on CentOS 7.

slurmctld is running as the slurm user (SlurmUser) and each node's slurmd is running as root.

slurm.conf is:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.

ControlMachine=slurm-head
ControlAddr=115.146.87.234

AuthType=auth/munge
CacheGroups=0
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmd
SwitchType=switch/none

TaskPlugin=task/none

InactiveLimit=0
KillWait=30
MinJobAge=300


SlurmctldTimeout=120
SlurmdTimeout=300

Waittime=0

FastSchedule=1

SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear

AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster

JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none

SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurm-ctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurm-d.log
# COMPUTE NODES
NodeName=slurm-test,slurm-w1,slurm-w2,slurm-w3 CPUs=8 RealMemory=32014 Sockets=8 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=debug Nodes=slurm-test,slurm-w1,slurm-w2,slurm-w3 Default=YES MaxTime=INFINITE State=UP

Christopher Samuel

Dec 20, 2015, 7:11:01 PM
to slurm-dev

On 18/12/15 15:21, Simpson Lachlan wrote:

> Just to further clarify: I'm running on Centos 7.

I noticed the error about munge not running - be aware that the package
doesn't enable itself in systemd so you'll need to do:

systemctl enable munge.service
systemctl start munge.service
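
To confirm that the daemon actually came up on a node, one option is to check for the socket that slurmd complained about in the log. A minimal sketch, assuming the socket path from that error message as the default; the function name check_munge_socket is hypothetical:

```shell
# Hypothetical helper (not part of munge): report whether the munge socket
# exists. The default path is the one named in the slurmd error message;
# pass a different path as the first argument if your install differs.
check_munge_socket() {
    sock="${1:-/var/run/munge/munge.socket.2}"
    if [ -S "$sock" ]; then
        echo "ok: $sock"
    else
        echo "missing: $sock"
        return 1
    fi
}
```

If the socket is present, munge -n | unmunge performs a local encode/decode round trip as a further check.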

Best of luck!
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci

Simpson Lachlan

Dec 20, 2015, 8:05:57 PM
to slurm-dev
> -----Original Message-----
> From: Christopher Samuel [mailto:sam...@unimelb.edu.au]
>
> On 18/12/15 15:21, Simpson Lachlan wrote:
>
> > Just to further clarify: I'm running on Centos 7.
>
> I noticed the error about munge not running - be aware that the package doesn't
> enable itself in systemd so you'll need to do:
>
> systemctl enable munge.service
> systemctl start munge.service
>

Thanks Chris, I'll check this now and automate.

Cheers
L.