[slurm-dev] scontrol-slurmctld interactions

Lawrence Stewart

Aug 11, 2008, 3:02:53 PM

to slur...@lists.llnl.gov, Larry Stewart

We're trying to find a way to know when a cluster is really ready to
accept jobs.

When we reboot the entire cluster, we

1) scontrol update NodeName=<all nodes> State=Down
2) Halt all the nodes
3) scontrol update NodeName=<all nodes> State=Resume
4) Boot the nodes
5) Wait for sinfo to report everyone is idle

... but this doesn't work. The scontrol update State=Resume
seems to remember that it had talked to some nodes recently and
puts them right into Idle, rather than Idle* as I would expect.

My intuition is that I want a way for scontrol to tell slurmctld that
it should put nodes in down* until it hears from them (which will
be when /etc/init.d/slurmd runs on each node.)

... or a way to tell slurmctld to query all the nodes <right now>
instead of waiting for the polling interval.
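
Step 5 could be scripted roughly as follows. This is a minimal sketch,
not a tested tool: all_idle and wait_for_idle are hypothetical helper
names, the timeout and polling interval are arbitrary, and it assumes
sinfo accepts -h, -N and -o "%T" and marks non-responding nodes with a
trailing "*". It only gives a trustworthy answer if Resume really leaves
unbooted nodes in Idle* rather than Idle.

```shell
#!/bin/sh
# Sketch only: helper names and defaults are assumptions, not part of SLURM.

# Reads one node state per line on stdin (e.g. from: sinfo -h -N -o "%T").
# Succeeds only if at least one state was seen and every state is exactly
# "idle". A state such as "idle*" (not responding) or "down" fails the check.
all_idle() {
    seen=0
    # "read -r state _" keeps the first word and drops any padding.
    while read -r state _; do
        [ -n "$state" ] || continue
        seen=1
        [ "$state" = "idle" ] || return 1
    done
    # An empty listing counts as "not ready" rather than vacuously ready.
    [ "$seen" -eq 1 ]
}

# Polls sinfo until all nodes are idle or the timeout expires.
# $1 = timeout in seconds (assumed default 600)
# $2 = polling interval in seconds (assumed default 10)
wait_for_idle() {
    timeout=${1:-600}
    interval=${2:-10}
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if sinfo -h -N -o "%T" | all_idle; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}
```

It might be called as: wait_for_idle 900 15 && echo "cluster ready".
Keeping the parsing in all_idle, separate from the polling loop, means
the check can be tried against canned sinfo output without a cluster.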

--
-Larry / Sector IX

jet...@llnl.gov

Aug 11, 2008, 4:09:08 PM

to slur...@lists.llnl.gov

The attached patch should do the trick for you. With it, when you
resume a down node, slurmctld sets the state to IDLE* (idle and not
responding, so not usable) and pings the node immediately. Once the
node responds, the not-responding flag gets cleared.

I'm assuming that the process below does not include a restart of
the slurmctld daemon. Is that correct?
If so and you have ReturnToService=1 (or yes) then you can probably
just boot the nodes and skip steps 1-3.
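
To check the ReturnToService setting before relying on that shortcut,
something like the following may help. It is a sketch under assumptions:
return_to_service_enabled is a hypothetical helper name, and it assumes
that scontrol show config prints a line of the form
"ReturnToService = 1"; the exact formatting may differ between versions.

```shell
#!/bin/sh
# Sketch only: the helper name and the expected line format are assumptions.

# Reads "scontrol show config" style output on stdin and succeeds if the
# ReturnToService value is 1 (or yes). Fails if the key is missing.
return_to_service_enabled() {
    awk -F'=' '
        # Match the key exactly, ignoring padding around it.
        $1 ~ /^[ \t]*ReturnToService[ \t]*$/ {
            val = tolower($2)
            gsub(/[ \t]/, "", val)
        }
        END {
            if (val == "1" || val == "yes") exit 0
            exit 1
        }
    '
}
```

It might be used as:
scontrol show config | return_to_service_enabled && echo "just boot the nodes"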

larry.patch

Lawrence Stewart

Aug 11, 2008, 4:25:47 PM

to slur...@lists.llnl.gov, Larry Stewart

jet...@llnl.gov wrote:


Index: src/slurmctld/slurmctld.h
===================================================================
--- src/slurmctld/slurmctld.h	(revision 14732)
+++ src/slurmctld/slurmctld.h	(working copy)
@@ -237,6 +237,7 @@
 extern bitstr_t *idle_node_bitmap;	/* bitmap of idle nodes */
 extern bitstr_t *share_node_bitmap;	/* bitmap of sharable nodes */
 extern bitstr_t *up_node_bitmap;	/* bitmap of up nodes, not DOWN */
+extern bool ping_nodes_now;		/* if set, ping nodes immediately */
 
 /*****************************************************************************\
  *  PARTITION parameters and data structures
Index: src/slurmctld/controller.c
===================================================================
--- src/slurmctld/controller.c	(revision 14732)
+++ src/slurmctld/controller.c	(working copy)
@@ -143,6 +143,7 @@
 char *slurmctld_cluster_name = NULL; /* name of cluster */
 void *acct_db_conn = NULL;
 int accounting_enforce = 0;
+bool ping_nodes_now = false;
 
 /* Local variables */
 static int	daemonize = DEFAULT_DAEMONIZE;
@@ -1097,12 +1098,13 @@
 				unlock_slurmctld(node_write_lock);
 			}
 		}
-
-		if (difftime(now, last_ping_node_time) >= ping_interval) {
+		if ((difftime(now, last_ping_node_time) >= ping_interval) ||
+		    ping_nodes_now) {
 			static bool msg_sent = false;
 			if (is_ping_done()) {
 				msg_sent = false;
 				last_ping_node_time = now;
+				ping_nodes_now = false;
 				lock_slurmctld(node_write_lock);
 				ping_nodes();
 				unlock_slurmctld(node_write_lock);
Index: src/slurmctld/ping_nodes.c
===================================================================
--- src/slurmctld/ping_nodes.c	(revision 14732)
+++ src/slurmctld/ping_nodes.c	(working copy)
@@ -217,7 +217,8 @@
 			continue;
 		}
 
-		if (node_ptr->last_response >= still_live_time)
+		if ((!no_resp_flag) && 
+		    (node_ptr->last_response >= still_live_time))
 			continue;
 
 		/* Do not keep pinging down nodes since this can induce
Index: src/slurmctld/node_mgr.c
===================================================================
--- src/slurmctld/node_mgr.c	(revision 14732)
+++ src/slurmctld/node_mgr.c	(working copy)
@@ -1067,9 +1067,13 @@
 				node_ptr->node_state &= (~NODE_STATE_DRAIN);
 				node_ptr->node_state &= (~NODE_STATE_FAIL);
 				base_state &= NODE_STATE_BASE;
-				if (base_state == NODE_STATE_DOWN)
+				if (base_state == NODE_STATE_DOWN) {
 					state_val = NODE_STATE_IDLE;
-				else
+					node_ptr->node_state |= 
+							NODE_STATE_NO_RESPOND;
+					node_ptr->last_response = now;
+					ping_nodes_now = true;
+				} else
 					state_val = base_state;
 			}
 			if (state_val == NODE_STATE_DOWN) {

Thanks. In the meantime, I have evolved the following strategy:

scontrol update nodename=<all> state=down
halt nodes
/etc/init.d/slurmctld stop
rm /tmp/node_state
/etc/init.d/slurmctld start
boot nodes

which seems to set them all to "unknown" until they check in.

Opinion? I'd rather not have unnecessary source divergence.

jet...@llnl.gov

Aug 11, 2008, 4:31:05 PM

to slur...@lists.llnl.gov

Your strategy will result in all node state being lost (e.g. drained
nodes, reasons, etc.), which is not ideal. I'll get my patch into
slurm v1.3.7 and would recommend that you upgrade when that is
available (probably this week).

--

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Morris "Moe" Jette       jet...@llnl.gov                 925-423-4856
Integrated Computational Resource Management Group   fax 925-423-6961
Livermore Computing            Lawrence Livermore National Laboratory
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
"The problem with the world is that we draw the circle of our family
too small." - Mother Teresa
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Lawrence Stewart

Aug 11, 2008, 5:01:24 PM

to slur...@lists.llnl.gov, Larry Stewart

Got it. We may live with that, since we reconstruct the drain data and
reasons during the boot process anyway, to support booting with a
single command and without some nodes. At least we're up to 1.3.5 :-)

Lawrence Stewart

Aug 11, 2008, 5:07:09 PM

to Lawrence Stewart, slur...@lists.llnl.gov

Wow. Cancelling some running jobs on a different partition made this
immediate problem vanish.

Probably the moral is that I shouldn't muck around with the files in
/tmp, and should instead use /etc/init.d/slurmctld startclean
