[slurm-users] Slurm not starting


Elisabetta Falivene

Jan 15, 2018, 7:14:38 AM
to Slurm User Community List
I did an upgrade from wheezy to jessie (automatically, with a normal dist-upgrade) on a cluster with 8 nodes (up, running and reachable), and from Slurm 2.3.4 to 14.03.9. I overcame some problems booting the kernel (thank you very much to Gennaro Oliva, btw), and the system is now running correctly with kernel 3.16.0-4, but Slurm isn't starting. I tried restarting the services, but it seems unable to do so.

The error messages aren't helping me much in guessing what is going on. What should I check to find out what is failing?

Thank you 
Elisabetta

PS: Here are some tests I did

Running  
sinfo

returns

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up   infinite      8   unk* node[01-08]


Running 
systemctl status slurmctld.service

returns 

slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
   Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
  Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)

 slurmctld[2100]: cons_res: select_p_reconfigure
 slurmctld[2100]: cons_res: select_p_node_init
 slurmctld[2100]: cons_res: preparing for 1 partitions
 slurmctld[2100]: Running as primary controller
 slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
 slurmctld.service start operation timed out. Terminating.
Terminate signal (SIGINT or SIGTERM) received
 slurmctld[2100]: Saving all slurm state
 Failed to start Slurm controller daemon.
 Unit slurmctld.service entered failed state.

and running

/etc/init.d/slurmd status

returns

slurmd.service - Slurm node daemon
   Loaded: loaded (/lib/systemd/system/slurmd.service; enabled)
   Active: failed (Result: exit-code) since Mon 2018-01-15 12:44:52 CET; 21min ago
  Process: 729 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)

slurmd.service: control process exited, code=exited status=1
systemd[1]: Failed to start Slurm node daemon.
Unit slurmd.service entered failed state.


Gennaro Oliva

Jan 15, 2018, 8:09:40 AM
to Slurm User Community List
Ciao Elisabetta,

On Mon, Jan 15, 2018 at 01:13:27PM +0100, Elisabetta Falivene wrote:
> Error messages are not much helping me in guessing what is going on. What
> should I check to get what is failing?

check slurmctld.log and slurmd.log, you can find them under
/var/log/slurm-llnl

> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> batch*       up   infinite      8   unk* node[01-08]
>
>
> Running
> systemctl status slurmctld.service
>
> returns
>
> slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
>    Active: failed (Result: timeout) since Mon 2018-01-15 13:03:39 CET; 41s ago
>   Process: 2098 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>
>  slurmctld[2100]: cons_res: select_p_reconfigure
>  slurmctld[2100]: cons_res: select_p_node_init
>  slurmctld[2100]: cons_res: preparing for 1 partitions
>  slurmctld[2100]: Running as primary controller
>  slurmctld[2100]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
>  slurmctld.service start operation timed out. Terminating.
> Terminate signal (SIGINT or SIGTERM) received
>  slurmctld[2100]: Saving all slurm state
>  Failed to start Slurm controller daemon.
>  Unit slurmctld.service entered failed state.

Do you have a backup controller?
Check your slurm.conf under:
/etc/slurm-llnl

Anyway, I suggest upgrading the operating system to stretch and fixing your configuration under a more recent version of Slurm.
Best regards
--
Gennaro Oliva

Williams, Jenny Avis

Jan 15, 2018, 8:17:26 AM
to Slurm User Community List
Elisabetta-

Start by focusing on slurmctld; slurmd won't be happy without it.
Start it manually in the foreground, as in
/usr/sbin/slurmctld -D -vvv

This assumes slurm.conf is in the default location.
Pardon brevity; on my phone
Jenny Williams



From: Elisabetta Falivene <e.fal...@ilabroma.com>
Sent: Monday, January 15, 2018 7:14 AM
To: Slurm User Community List
Subject: [slurm-users] Slurm not starting

Elisabetta Falivene

Jan 15, 2018, 9:30:55 AM
to Slurm User Community List
> Anyway I suggest to update the operating system to stretch and fix your
> configuration under a more recent version of slurm.

I think I'll soon get to that :)
b

Douglas Jacobsen

Jan 15, 2018, 9:59:09 AM
to Slurm User Community List
The fact that sinfo is responding shows that at least slurmctld is running. Slurmd, on the other hand, is not. Please also get the output of the slurmd log, or of running "slurmd -Dvvv".

Elisabetta Falivene

Jan 15, 2018, 10:30:50 AM
to Slurm User Community List
slurmd -Dvvv says

slurmd: fatal: Unable to determine this slurmd's NodeName

b

John Hearns

Jan 15, 2018, 10:36:34 AM
to Slurm User Community List
That's it. I am calling JohnH's Law:
"Any problem with a batch queueing system is due to hostname resolution"
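In that spirit, here is a minimal sketch of the hostname-vs-NodeName check behind this law. The node list mirrors the NodeName=node[01-08] line from this thread's slurm.conf, expanded by hand, and "node09" is a hypothetical hostname standing in for `hostname -s` so the sketch is self-contained; on a real node you would compare the actual short hostname against the real /etc/slurm-llnl/slurm.conf.

```shell
#!/bin/sh
# Check whether this machine's short hostname appears among the
# NodeName entries of slurm.conf. slurmd aborts with
# "Unable to determine this slurmd's NodeName" when it does not.
listed="node01 node02 node03 node04 node05 node06 node07 node08"
short="${SHORT_HOSTNAME:-node09}"   # on a real node: short=$(hostname -s)
if echo "$listed" | grep -qw "$short"; then
  verdict="listed"
else
  verdict="missing"
fi
echo "$short: $verdict"
```

On the headnode the verdict is "missing" by design, which is exactly why slurmd refuses to start there.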

Carlos Fenoy

Jan 15, 2018, 10:43:57 AM
to Slurm User Community List
Are you trying to start slurmd on the headnode or on a compute node?

Can you provide the slurm.conf file?

Regards,
Carlos
--
--
Carles Fenoy

Elisabetta Falivene

Jan 15, 2018, 10:50:30 AM
to Slurm User Community List
On the headnode. (I'm also noticing, and it seems worth mentioning since maybe the problem is the same, that even LDAP is not working as expected, giving an "invalid credential (49)" message, which is typical of this kind of problem. The upgrade to jessie must have touched something that is affecting the sanity of all my software :D)

Here is my slurm.conf.

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=anyone
ControlAddr=master
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=openmpi
MpiParams=ports=12000-12999
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/cgroup
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=2
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/cgroup
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=60
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=43200
#OverTimeLimit=0
SlurmctldTimeout=600
SlurmdTimeout=600
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
DefMemPerCPU=1000
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
#SchedulerPort=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
AccountingStorageLoc=/var/log/slurm-llnl/AccountingStorage.log
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
JobCompLoc=/var/log/slurm-llnl/JobComp.log
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/filetxt
#JobCompUser=
JobAcctGatherFrequency=60
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
PartitionName=batch Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP

Douglas Jacobsen

Jan 15, 2018, 10:51:01 AM
to Slurm User Community List
Please check your slurm.conf on the compute nodes; I think your compute node isn't appearing in slurm.conf properly.

Carlos Fenoy

Jan 15, 2018, 10:56:55 AM
to Slurm User Community List
Hi,

you cannot start slurmd on the headnode. Try running the same command on the compute nodes and check the output. If there is any issue, it should display the reason.

Regards,
Carlos
--
--
Carles Fenoy

Elisabetta Falivene

Jan 15, 2018, 10:57:35 AM
to Slurm User Community List
Googling a bit, the error "slurmd: fatal: Unable to determine this slurmd's NodeName" comes up when you try to run slurmd on the master, which shouldn't execute slurmd(?). It must be up on the nodes, not on the master.

Elisabetta Falivene

Jan 15, 2018, 12:23:00 PM
to Slurm User Community List
The deeper I go into the problem, the worse it seems... but maybe I'm a step closer to the solution.

I discovered that munge was disabled on the nodes (my fault; Gennaro pointed out the problem before, but I had enabled it back only on the master). Btw, it's very strange that the wheezy->jessie upgrade disabled munge on all the nodes and the master...

Unfortunately, re-enabling munge on the nodes didn't make slurmd start again.

Maybe filling in this setting could give me some info about the problem?
#SlurmdLogFile=

Thank you very much for your help. It is very precious to me.
betta

PS: some tests I made ->

Running on the nodes

slurmd -Dvvv

returns

slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: Considering each NUMA node as a socket
slurmd: debug:  CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
slurmd: Node configuration differs from hardware: CPUs=16:16(hw) Boards=1:1(hw) SocketsPerBoard=16:4(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:1(hw)
slurmd: topology NONE plugin loaded
slurmd: Gathering cpu frequency information for 16 cpus
slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug2: hwloc_topology_init
slurmd: debug2: hwloc_topology_load
slurmd: Considering each NUMA node as a socket
slurmd: debug:  CPUs:16 Boards:1 Sockets:4 CoresPerSocket:4 ThreadsPerCore:1
slurmd: debug:  Reading cgroup.conf file /etc/slurm-llnl/cgroup.conf
slurmd: debug:  task/cgroup: now constraining jobs allocated cores
slurmd: task/cgroup: loaded
slurmd: auth plugin for Munge (http://code.google.com/p/munge/) loaded
slurmd: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
slurmd: Munge cryptographic signature plugin loaded
slurmd: Warning: Core limit is only 0 KB
slurmd: slurmd version 14.03.9 started
slurmd: Job accounting gather LINUX plugin loaded
slurmd: debug:  job_container none plugin loaded
slurmd: switch NONE plugin loaded
slurmd: slurmd started on Mon, 15 Jan 2018 18:07:17 +0100
slurmd: CPUs=16 Boards=1 Sockets=16 Cores=1 Threads=1 Memory=15999 TmpDisk=40189 Uptime=1254
slurmd: AcctGatherEnergy NONE plugin loaded
slurmd: AcctGatherProfile NONE plugin loaded
slurmd: AcctGatherInfiniband NONE plugin loaded
slurmd: AcctGatherFilesystem NONE plugin loaded
slurmd: debug2: No acct_gather.conf file (/etc/slurm-llnl/acct_gather.conf)
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
^Cslurmd: got shutdown request
slurmd: waiting on 1 active threads
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
^C^C^C^Cslurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused
slurmd: debug:  Failed to contact primary controller: Connection refused
slurmd: error: Unable to register: Unable to contact slurm controller (connect failure)
slurmd: debug:  Unable to register with slurm controller, retrying
slurmd: all threads complete
slurmd: Consumable Resources (CR) Node Selection plugin shutting down ...
slurmd: Munge cryptographic signature plugin unloaded
slurmd: Slurmd shutdown completing

which maybe is not as bad as it seems, since it may only indicate that slurmctld is not up on the master, right?

On the master running

service slurmctld restart

returns

Job for slurmctld.service failed. See 'systemctl status slurmctld.service' and 'journalctl -xn' for details.

and 

service slurmctld status

returns

slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled)
   Active: failed (Result: timeout) since Mon 2018-01-15 18:11:20 CET; 44s ago
  Process: 2223 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)

 slurmctld[2225]: cons_res: select_p_reconfigure
 slurmctld[2225]: cons_res: select_p_node_init
 slurmctld[2225]: cons_res: preparing for 1 partitions
 slurmctld[2225]: Running as primary controller
 slurmctld[2225]: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=4,partition_job_depth=0
 systemd[1]: slurmctld.service start operation timed out. Terminating.
 slurmctld[2225]: Terminate signal (SIGINT or SIGTERM) received
 slurmctld[2225]: Saving all slurm state
 systemd[1]: Failed to start Slurm controller daemon.
 systemd[1]: Unit slurmctld.service entered failed state.

and 
journalctl -xn

returns no visible error

-- Logs begin at Mon 2018-01-15 18:04:38 CET, end at Mon 2018-01-15 18:17:33 CET. --
Jan 15 18:16:23 anyone.phys.uniroma1.it slurmctld[2286]: Saving all slurm state
Jan 15 18:16:23 anyone.phys.uniroma1.it systemd[1]: Failed to start Slurm controller daemon.
-- Subject: Unit slurmctld.service has failed
-- Defined-By: systemd
-- 
-- Unit slurmctld.service has failed.
-- 
-- The result is failed.
 systemd[1]: Unit slurmctld.service entered failed state.
 CRON[2312]: pam_unix(cron:session): session opened for user root by (uid=0)
CRON[2313]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
CRON[2312]: pam_unix(cron:session): session closed for user root
dhcpd[1538]: DHCPREQUEST for 192.168.1.101 from c8:60:00:32:c6:c4 via eth1
 dhcpd[1538]: DHCPACK on 192.168.1.101 to c8:60:00:32:c6:c4 via eth1
dhcpd[1538]: DHCPREQUEST for 192.168.1.102 from bc:ae:c5:12:97:75 via eth1
dhcpd[1538]: DHCPACK on 192.168.1.102 to bc:ae:c5:12:97:75 via eth1



Carlos Fenoy

Jan 15, 2018, 12:32:50 PM
to Slurm User Community List

It seems like the pid file paths in the systemd unit and in slurm.conf are different. Check whether they are the same and, if not, adjust the pid file paths in slurm.conf. That should prevent systemd from killing Slurm.
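For anyone following along, a sketch of that comparison. The heredocs below are stand-in sample files reproducing this thread's mismatch; on the real system you would read /etc/slurm-llnl/slurm.conf and the PIDFile= line of /lib/systemd/system/slurmctld.service instead.

```shell
#!/bin/sh
# Compare the pid file path slurm.conf declares with the one the
# systemd unit waits for.
cat > /tmp/sample-slurm.conf <<'EOF'
SlurmctldPidFile=/var/run/slurmctld.pid
EOF
cat > /tmp/sample-slurmctld.service <<'EOF'
[Service]
PIDFile=/var/run/slurm-llnl/slurmctld.pid
EOF
conf_pid=$(sed -n 's/^SlurmctldPidFile=//p' /tmp/sample-slurm.conf)
unit_pid=$(sed -n 's/^PIDFile=//p' /tmp/sample-slurmctld.service)
if [ "$conf_pid" = "$unit_pid" ]; then
  echo "pid paths match: $conf_pid"
else
  echo "MISMATCH: slurm.conf writes $conf_pid, unit waits for $unit_pid"
fi
```

When the paths differ, the daemon writes its pid where systemd never looks, so the unit's start operation times out and systemd kills the daemon, which matches the "start operation timed out. Terminating." messages above.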

Christopher Samuel

Jan 15, 2018, 5:02:34 PM
to slurm...@lists.schedmd.com
On 16/01/18 04:22, Elisabetta Falivene wrote:

> slurmd: debug2: _slurm_connect failed: Connection refused
> slurmd: debug2: Error connecting slurm stream socket at
> 192.168.1.1:6817: Connection refused

This sounds like the compute node cannot connect back to
slurmctld on the management node, you should check that the
IP address it is using is correct and that there are no firewall
rules on the management node preventing connections to port 6817.
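A minimal reachability check along those lines, run from a compute node; 192.168.1.1 and 6817 are taken from the slurmd log above, and this assumes bash and coreutils' timeout are available so no extra tools are needed.

```shell
#!/bin/bash
# Probe the controller's slurmctld port using bash's built-in /dev/tcp.
# Failure here means slurmctld is down, the address in slurm.conf is
# wrong, or a firewall is blocking the port.
host=192.168.1.1
port=6817
if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
  echo "$host:$port reachable"
else
  echo "$host:$port unreachable"
fi
```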

Good luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Elisabetta Falivene

Jan 16, 2018, 7:21:04 AM
to Slurm User Community List
slurmd: debug2: _slurm_connect failed: Connection refused
slurmd: debug2: Error connecting slurm stream socket at 192.168.1.1:6817: Connection refused

This sounds like the compute node cannot connect back to
slurmctld on the management node, you should check that the
IP address it is using is correct and that there are no firewall
rules on the management node preventing connections to port 6817.


Yes, exactly. That's not so strange, since slurmctld is not running on the master :) so I think I must resolve the issue on the master first and then worry about the nodes.

Elisabetta Falivene

Jan 16, 2018, 7:26:01 AM
to Slurm User Community List

It seems like the pidfile in systemd and slurm.conf are different. Check if they are the same and if not adjust the slurm.conf pid files. That should prevent systemd from killing slurm.

Ehm, sorry, how can I do this?

Elisabetta Falivene

Jan 16, 2018, 10:33:57 AM
to Slurm User Community List
Here is the solution and another (minor) problem!

Investigating in the direction of the pid problem, I found that the config contained

SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid

but the pid files were being looked for in /var/run/slurm-llnl, so in the slurm.conf of the master AND of the nodes I changed them to
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid

being again able to launch slurmctld on the master and slurmd on the nodes.
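One caveat with the /var/run/slurm-llnl paths: on jessie, /var/run is a tmpfs, so that directory has to be recreated at every boot with the right owner, or the daemons cannot write their pid files. The Debian packages may already take care of this; if not, a tmpfiles.d fragment like the following sketch does it (assuming SlurmUser=slurm; the file name is hypothetical).

```
# /etc/tmpfiles.d/slurm.conf (hypothetical name; applied at boot, or
# immediately with "systemd-tmpfiles --create")
d /var/run/slurm-llnl 0755 slurm slurm -
```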

At this point the nodes were all being set to drain automatically, giving an error like

error: Node node01 has low real_memory size (15999 < 16000)

so it was necessary to change, in the slurm.conf (master and nodes),
NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
to
NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN

Now, slurm works and the nodes are running. There is only one minor problem:

error: Node node04 has low real_memory size (7984 < 15999)
error: Node node02 has low real_memory size (3944 < 15999)

Two nodes are still put into drain state. The nodes suffered physical damage to some RAM modules and I had to physically remove them, so slurm thinks it is not a good idea to use the nodes.
Is it possible to make slurm use them anyway? I know I can scontrol update NodeName=node04 State=RESUME to put a node back into idle state, but as soon as the machine is rebooted or the service restarted, it is set to drain again.
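As an aside on where the 15999-vs-16000 numbers come from: Slurm checks RealMemory against the memory the node actually reports, which is usually slightly below the nominal module size. A minimal sketch of the manual check follows; on the node itself, `slurmd -C` prints the values slurmd detects.

```shell
#!/bin/sh
# Read MemTotal from /proc/meminfo and convert it to the MB figure
# that RealMemory in slurm.conf must not exceed.
mem_kb=$(sed -n 's/^MemTotal:[[:space:]]*\([0-9][0-9]*\) kB$/\1/p' /proc/meminfo)
mem_mb=$((mem_kb / 1024))
echo "RealMemory on this node should be at most ${mem_mb}"
```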

Thank you for your help!
b

Gennaro Oliva

Jan 16, 2018, 1:35:42 PM
to Slurm User Community List
Ciao Elisabetta,

On Tue, Jan 16, 2018 at 04:32:47PM +0100, Elisabetta Falivene wrote:
> being again able to launch slurmctld on the master and slurmd on the nodes.

great!

> NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
> to
> NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN
>
> Now, slurm works and the nodes are running. There is only one minor problem:
>
> error: Node node04 has low real_memory size (7984 < 15999)
> error: Node node02 has low real_memory size (3944 < 15999)
>
> Two nodes are still put into drain state. The nodes suffered physical
> damage to some RAM modules and I had to physically remove them, so slurm
> thinks it is not a good idea to use the nodes.
> Is it possible to make slurm use them anyway?

I think you can specify their properties on separate lines:

NodeName=node[01,03,05-08] CPUs=16 RealMemory=15999 State=UNKNOWN*
NodeName=node02 CPUs=16 RealMemory=3944 State=UNKNOWN*
NodeName=node04 CPUs=16 RealMemory=7984 State=UNKNOWN*

Elisabetta Falivene

Jan 17, 2018, 8:24:31 AM
to Slurm User Community List
Ciao Gennaro!
 
> NodeName=node[01-08] CPUs=16 RealMemory=16000 State=UNKNOWN
> to
> NodeName=node[01-08] CPUs=16 RealMemory=15999 State=UNKNOWN
>
> Now, slurm works and the nodes are running. There is only one minor problem:
>
> error: Node node04 has low real_memory size (7984 < 15999)
> error: Node node02 has low real_memory size (3944 < 15999)
>
> Two nodes are still put into drain state. The nodes suffered physical
> damage to some RAM modules and I had to physically remove them, so slurm
> thinks it is not a good idea to use the nodes.
> Is it possible to make slurm use them anyway?

I think you can specify their properties on separate lines:

NodeName=node[01,03,05-08] CPUs=16 RealMemory=15999 State=UNKNOWN*
NodeName=node02 CPUs=16 RealMemory=3944 State=UNKNOWN*
NodeName=node04 CPUs=16 RealMemory=7984 State=UNKNOWN*


It was possible indeed! It only required typing "UNKNOWN" instead of "UNKNOWN*".
Problem fully solved!
Thank you very much!
Elisabetta 