[slurm-dev] Preemption and Power Saving

stephen mulcahy

Dec 9, 2010, 11:27:12 AM
to slur...@lists.llnl.gov
Hi,

I have installed SLURM 2.1.16 and enabled both Preemption and Power
Saving (both of which are really interesting features for us - thanks to
you folks for developing them). Each one seems to be working well in
our tests. However, when I combine them in the following scenario I run
into a problem.

Start Job 1 (using 65 of our 70 compute nodes).

Start Job 2 in a higher priority partition (using 40 of our 70 compute
nodes).

Job 2 successfully pre-empts Job 1 and runs to completion. While Job 2
is running, 25 of the nodes allocated to Job 1 are treated as idle and
are powered down, even though Job 1 is only suspended on them. This will
affect our suspended job when it resumes.

Is this the expected behaviour of SLURM? Is there any way for me to stop
power management from shutting down nodes that hold suspended jobs?

My Pre-emption config:

# SCHEDULING
#DefMemPerCPU=0
#EnablePreemption=no
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
SelectTypeParameters=CR_Memory
DefMemPerNode=0
PreemptMode=SUSPEND,GANG
PreemptType=preempt/partition_prio


My power-saving config:

# POWER SAVE SUPPORT FOR IDLE NODES (optional)
SuspendProgram=/shared/slurm/power/slurmSuspend.sh
ResumeProgram=/shared/slurm/power/slurmResume.sh
SuspendTimeout=180
ResumeTimeout=300
ResumeRate=100
#SuspendExcNodes=
#SuspendExcParts=
SuspendRate=100
SuspendTime=300


Thanks,

-stephen

--
Stephen Mulcahy Atlantic Linux http://www.atlanticlinux.ie
Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway)

Jette, Moe

Dec 10, 2010, 11:26:43 AM
to slur...@lists.llnl.gov
I am not planning to fix this in SLURM version 2.1, but the attached patch
will be applied to version 2.2. It adds a count of suspended jobs to the
node's data structure and avoids powering down any node with suspended
jobs.
________________________________________
From: owner-s...@lists.llnl.gov [owner-s...@lists.llnl.gov] On Behalf Of stephen mulcahy [smul...@atlanticlinux.ie]
Sent: Thursday, December 09, 2010 8:27 AM
To: slur...@lists.llnl.gov
Subject: [slurm-dev] Preemption and Power Saving
Attachment: suspend_power.patch
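
The guard described in this reply can be illustrated with a minimal sketch. This is not the attached patch (SLURM itself is written in C) and it does not use SLURM's data structures; the `Node` class, its field names, and `can_power_down` are hypothetical names invented here. It assumes Job 2 was placed on 40 of the 65 nodes held by Job 1, which is consistent with the 25 idle nodes reported above, and it assumes every node not running Job 2 has been idle for 600 seconds against the SuspendTime=300 from the posted config.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical per-node record; not SLURM's actual node structure."""
    name: str
    running_jobs: int = 0    # jobs actively executing on the node
    suspended_jobs: int = 0  # jobs preempted (suspended) but still resident
    idle_seconds: int = 0    # assumed time since the node last ran anything


def can_power_down(node: Node, suspend_time: int) -> bool:
    """Return True only if the node is idle and holds no suspended jobs."""
    if node.running_jobs > 0:
        return False
    if node.suspended_jobs > 0:
        # The guard: a suspended job still needs this node when it resumes.
        return False
    return node.idle_seconds >= suspend_time


# Scenario from this thread: 70 nodes, Job 1 suspended on 65 of them,
# Job 2 running on 40 of those 65 (an assumption, see the text above).
nodes = []
for i in range(70):
    node = Node(name=f"node{i:02d}", idle_seconds=600)
    if i < 65:
        node.suspended_jobs = 1  # Job 1, suspended
    if i < 40:
        node.running_jobs = 1    # Job 2, running
        node.idle_seconds = 0
    nodes.append(node)

to_power_down = [n.name for n in nodes if can_power_down(n, suspend_time=300)]
print(len(to_power_down))  # prints 5
```

With the guard in place, only the 5 nodes that never belonged to either job are eligible for power saving. Without the `suspended_jobs` check, the 25 nodes holding only the suspended Job 1 would also qualify, which matches the behaviour reported under 2.1.16.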

stephen mulcahy

Dec 10, 2010, 12:20:34 PM
to slur...@lists.llnl.gov, Jette, Moe
Hi Moe,

That sounds good - we'll roll the cluster to v2.2 when it is released.

Unfortunately, the cluster is back in production now, so I won't be able
to test this patch for some time. I'll let you know how it works when I
get a chance to test it.

Thanks,

-stephen

stephen mulcahy

Mar 1, 2011, 10:34:04 AM
to slur...@lists.llnl.gov, Jette, Moe
Hi Moe,

Just to confirm - we have upgraded the cluster from SLURM 2.1.16 to
SLURM 2.2.1, and the problem we saw with nodes holding pre-empted jobs
being powered down is no longer occurring, so this bug is indeed fixed.

Many thanks,

-stephen
