For some nodes with more RAM, we have configured the feature "fat",
e.g. like this:
NodeName=q[9-24] RealMemory=72000 Feature=fat,mem72GB,ibsw1 Weight=3
They are occupied by jobs today, so when I do e.g. a
srun -t 5 grep -i memtotal /proc/meminfo
I will be waiting and waiting and waiting. As expected!
But if the SLURM server (/usr/sbin/slurmctld and its partner
/usr/sbin/slurmdbd)
is restarted with a
service slurm restart
command, the waiting is over and the job is run on a "thin" node. Too bad!!!
It looks like a SLURM bug. Or should I look for errors in our configuration
files?
We run version 2.1.0 and
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
Best regards,
-- Lennart Karlsson
UPPMAX, Uppsala University
http://www.uppmax.uu.se
Thank you, Aaron!
Sorry about my incorrect explanation. My command actually was:
srun -C fat -t 5 grep -i memtotal /proc/meminfo
as you say.
Here are all the NodeName definitions:
NodeName=q[1-8] RealMemory=24000 Feature=thin,mem24GB,ibsw1 Weight=1
NodeName=q[9-24] RealMemory=72000 Feature=fat,mem72GB,ibsw1 Weight=3
NodeName=q[25-32] RealMemory=48000 Feature=fat,mem48GB,ibsw1 Weight=2
NodeName=q[33-40] RealMemory=48000 Feature=fat,mem48GB,ibsw2 Weight=2
NodeName=q[41-64] RealMemory=24000 Feature=thin,mem24GB,ibsw2 Weight=1
NodeName=q[65-96] RealMemory=24000 Feature=thin,mem24GB,ibsw3 Weight=1
NodeName=q[97-108] RealMemory=24000 Feature=thin,mem24GB,ibsw4 Weight=1
NodeName=q[109-140] RealMemory=24000 Feature=thin,mem24GB,ibsw5 Weight=1
NodeName=q[141-172] RealMemory=24000 Feature=thin,mem24GB,ibsw6 Weight=1
NodeName=q[173-204] RealMemory=24000 Feature=thin,mem24GB,ibsw7 Weight=1
NodeName=q[205-216] RealMemory=24000 Feature=thin,mem24GB,ibsw8 Weight=1
NodeName=q[217-232] RealMemory=24000 Feature=thin,mem24GB,ibsw4 Weight=1
NodeName=q[233-252] RealMemory=24000 Feature=thin,mem24GB,ibsw8 Weight=1
NodeName=q[253-284] RealMemory=24000 Feature=thin,mem24GB,ibsw9 Weight=1
NodeName=q[285-316] RealMemory=24000 Feature=thin,mem24GB,ibsw10 Weight=1
NodeName=q[317-348] RealMemory=24000 Feature=thin,mem24GB,ibsw11 Weight=1
And the job was started on node q210 (a thin node, in the q[205-216] range) when I restarted the SLURM server.
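
To double-check which features q210 carries according to the definitions above, a small script can help. This is only a sketch: it handles just the simple `name[first-last]` form used in these lines, not the full SLURM hostlist syntax, and the helper names (`parse_node_line`, `features_of`, `satisfies`) are made up for illustration and are not part of SLURM. Only two of the NodeName lines are copied in; the others follow the same pattern.

```python
import re

# Two of the NodeName lines from the message above.
SLURM_CONF = """\
NodeName=q[9-24] RealMemory=72000 Feature=fat,mem72GB,ibsw1 Weight=3
NodeName=q[205-216] RealMemory=24000 Feature=thin,mem24GB,ibsw8 Weight=1
"""


def parse_node_line(line):
    """Return (prefix, first, last, features) for a 'name[a-b]' NodeName line."""
    fields = dict(item.split("=", 1) for item in line.split())
    match = re.fullmatch(r"([A-Za-z]+)\[(\d+)-(\d+)\]", fields["NodeName"])
    if match is None:
        raise ValueError("unsupported NodeName form: " + fields["NodeName"])
    features = set(fields.get("Feature", "").split(",")) - {""}
    return match.group(1), int(match.group(2)), int(match.group(3)), features


def features_of(node, conf_text):
    """Return the feature set configured for one node, e.g. 'q210'."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", node)
    if match is None:
        raise ValueError("unsupported node name: " + node)
    name_prefix, number = match.group(1), int(match.group(2))
    for line in conf_text.splitlines():
        if not line.startswith("NodeName="):
            continue
        prefix, first, last, features = parse_node_line(line)
        if prefix == name_prefix and first <= number <= last:
            return features
    return set()  # node not covered by any NodeName line


def satisfies(node, constraint, conf_text):
    """True if the node has the single feature requested with 'srun -C'."""
    return constraint in features_of(node, conf_text)


if __name__ == "__main__":
    print(sorted(features_of("q210", SLURM_CONF)))
    print(satisfies("q210", "fat", SLURM_CONF))
```

According to these definitions, q210 does not have the feature "fat", so a job submitted with `-C fat` should not have been placed there.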
Best regards,
-- Lennart Karlsson
UPPMAX, Uppsala University
==============================================================
> On Wed, Jan 27, 2010 at 9:16 AM, Lennart Karlsson <Lennart....@it.uu.se> wrote:
>
>> Hi,
>>
>> For some nodes with more RAM, we have configured the feature "fat",
>> e.g. like this:
>> NodeName=q[9-24] RealMemory=72000 Feature=fat,mem72GB,ibsw1 Weight=3
>>
>> They are occupied by jobs today, so when I do e.g. a
>> srun -t 5 grep -i memtotal /proc/meminfo
>> I will be waiting and waiting and waiting. As expected!
>>
>> But if the SLURM server (/usr/sbin/slurmctld and its partner
>> /usr/sbin/slurmdbd)
>> is restarted with a
>> service slurm restart
>> command, the waiting is over and the job is run on a "thin" node. Too
>> bad!!!
>>
>> It looks like a SLURM bug. Or should I look for errors in our configuration
>> files?
>>
>> We run version 2.1.0 and
>>
>> FastSchedule=1
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>>
>> Best regards,
>> -- Lennart Karlsson
>> UPPMAX, Uppsala University
>> http://www.uppmax.uu.se
Moe
What may be a bug, though: if you update features on multiple nodes,
at least one of which has no feature defined in the config file, and
then run 'scontrol reconfigure', none of the nodes are reset, even
though some of them do have feature values in the config file.
From the scontrol man page:
Features=<features>
    Identify feature(s) to be associated with the specified node. Any
    previously defined feature(s) will be overwritten with the new value.
    NOTE: Features assigned via scontrol do not survive the restart of
    the slurmctld nor will they survive scontrol reconfigure if Features
    are defined in slurm.conf. Update slurm.conf with any changes meant
    to be persistent.
From: Doug.Parisek@bull.com
Sent by: owner-slurm-dev@lists.llnl.gov
To: slur...@lists.llnl.gov
Date: 01/27/2010 03:09 PM
Subject: [slurm-dev] Updated node features not removed by scontrol reconfigure
Please respond to: slurm-dev@lists.llnl.gov
I am able to replicate what you see.
I'm thinking that what we really want to do is preserve a node's
features as previously configured (either from reading slurm.conf
or as updated by scontrol). We should probably only use the value
in slurm.conf if slurmctld is cold-started or there was no previous
feature value but a value is set in slurm.conf. The change required
for this behavior is fairly small.
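
The rule proposed above can be written down as a small decision function. This is only a sketch of the proposal as worded in this message, not slurmctld code: the function name `features_after_restart`, its parameters, and the use of `None` (or an empty string) for "no value" are all assumptions made for illustration.

```python
def features_after_restart(conf_features, saved_features, cold_start):
    """Pick the feature string a node should have after slurmctld starts.

    conf_features:  value from slurm.conf, or None if none is set there
    saved_features: value the node had before the restart (read from
                    slurm.conf earlier or set via scontrol), or None
    cold_start:     True if slurmctld starts without its saved state
    """
    if cold_start:
        # Cold start: slurm.conf is the only source of truth.
        return conf_features
    if not saved_features:
        # No previous value: fall back to slurm.conf (which may also be unset).
        return conf_features
    # Otherwise keep whatever the node had before the restart.
    return saved_features


if __name__ == "__main__":
    # A node whose features were changed with scontrol keeps them on a
    # normal restart, but goes back to slurm.conf on a cold start.
    print(features_after_restart("thin,mem24GB", "fat", cold_start=False))
    print(features_after_restart("thin,mem24GB", "fat", cold_start=True))
```

Under this rule the slurm.conf value is used in only two cases, a cold start or a node with no previous feature value; in every other case the previously configured value wins.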
Comments?
Moe
--
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Morris "Moe" Jette jet...@llnl.gov 925-423-4856
Integrated Computational Resource Management Group fax 925-423-6961
Livermore Computing Lawrence Livermore National Laboratory
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
"The problem with the world is that we draw the circle of our family
too small." - Mother Teresa
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++