I have configured a cluster of 70 nodes with SLURM 2.1.16 and have
enabled power-saving. It seems to be working well in initial tests
although some of the nodes have gotten into a DOWN state. I tried
rebooting those nodes manually to get them back into the cluster but I
think SLURM is getting confused as to whether they are on or off.
I tried to reset all nodes to state RESUME to see if that fixes things.
SLURM subsequently tried to power them down but they were down already
so that gave an error. sinfo lists the status as follows:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal*   up    infinite     70 idle~ cn[01-70]
priority  up    infinite     70 idle~ cn[01-70]
I've submitted a job to see if that will cause SLURM to power them back
up but so far I'm only seeing messages like the following:
14:44:56 manager1 slurmctld[17540]: error: Nodes cn[01-70] not responding
What is the best way to tell SLURM that these machines are suspended and
ready to be resumed or will it eventually recognise that from their
current state?
Should I modify my suspend script to always return success even if it
can't actually complete the shutdown (as would be the case if the nodes
are already powered down)?
Thanks,
-stephen
--
Stephen Mulcahy Atlantic Linux http://www.atlanticlinux.ie
Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway)
In the sinfo output, the state suffix of "~" indicates that SLURM
considers the nodes to already be powered down (see man sinfo
for details).
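As a rough illustration of reading that suffix, the sketch below filters
sinfo output in the column layout shown earlier in this thread and prints
the node lists whose state ends in "~". The function name
powered_down_nodes is made up for this example, and the column positions
are an assumption based on the default output quoted above.

```shell
#!/bin/sh
# Print the NODELIST of every sinfo line whose STATE ends in "~".
# Assumes the default column layout quoted in this thread:
#   PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
# powered_down_nodes is a hypothetical helper name, not a SLURM command.
powered_down_nodes() {
    # Skip the header line, match a trailing "~" in column 5,
    # then drop duplicates (a node may sit in several partitions).
    awk '$1 != "PARTITION" && $5 ~ /~$/ { print $6 }' | sort -u
}
```

It would be used as "sinfo | powered_down_nodes"; for the output quoted
above it prints cn[01-70] once, even though both partitions list it.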
The exit codes from the suspend/resume scripts are logged by
slurmctld if non-zero. If the suspend script fails, there
is little SLURM can do about it. If the resume script
fails, SLURM will periodically re-run it and avoid spawning any
programs on the node until the slurmd daemon on that node
responds.
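On the question of whether the suspend script should report success for
nodes that are already off, one possible approach is sketched below. It
is only an illustration: node_is_off and power_off_node are hypothetical
placeholders (a single ping and an ssh poweroff are assumed here) and
would need to be replaced with whatever mechanism the site actually
uses, such as IPMI. The script succeeds for nodes that are already down
and returns non-zero only when a real power-off attempt fails, so that
failure still shows up in the slurmctld log.

```shell
#!/bin/sh
# Sketch of a suspend script that treats "already off" as success.
# node_is_off and power_off_node are HYPOTHETICAL placeholders;
# substitute the site's own check and power-off mechanism.

node_is_off() {
    # Assumption: a node that does not answer one ping is powered down.
    ! ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

power_off_node() {
    # Assumption: passwordless ssh to the node is available.
    ssh "$1" /sbin/poweroff
}

suspend_nodes() {
    rc=0
    for host in "$@"; do
        if node_is_off "$host"; then
            # Nothing to do, and not an error.
            echo "$host: already powered down, nothing to do"
        elif ! power_off_node "$host"; then
            # A genuine failure: report it and remember it.
            echo "$host: power-off failed" >&2
            rc=1
        fi
    done
    return $rc
}
```

The function takes individual host names. If the installed version of
scontrol supports "scontrol show hostnames" (worth checking on 2.1.16),
the script could end with something like
suspend_nodes $(scontrol show hostnames "$1") to expand the node list
that slurmctld passes in.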
________________________________________
From: owner-s...@lists.llnl.gov [owner-s...@lists.llnl.gov] On Behalf Of stephen mulcahy [smul...@atlanticlinux.ie]
Sent: Wednesday, December 08, 2010 6:55 AM
To: slur...@lists.llnl.gov
Subject: [slurm-dev] How to reset node states to idle~
Thanks for your response.
On a related note - what is the recommended way of resetting the state
of SLURM after doing the following on the master/controller node:
/etc/init.d/slurm-2.1.16 stop
/etc/init.d/slurm-2.1.16 startclean
where the compute nodes have already been powered down/suspended.
Will SLURM eventually figure out they are suspended and start them up
when required or do I need to manually start them? Or can I somehow
instruct SLURM that the nodes are in a powered down state?
Thanks,
-stephen
You can use the scontrol command to change node states if desired.
Note that setting a node state to power_down changes the state in
the internal tables, but does not execute the power down script.
scontrol update nodename="cn[01-70]" state=resume
scontrol update nodename="cn[01-70]" state=power_down
________________________________________
From: stephen mulcahy [smul...@atlanticlinux.ie]
Sent: Wednesday, December 08, 2010 8:47 AM
To: slur...@lists.llnl.gov
Cc: Jette, Moe
Subject: Re: [slurm-dev] How to reset node states to idle~