[slurm-users] "Low RealMem" after upgrade

Diego Zuccato

Oct 1, 2021, 4:23:37 AM
to Slurm User Community List
Hello all.

I just upgraded to Debian 11, which brings Slurm 21.08, and the newer nodes
upgraded without too many issues (just minor config changes, one being the
RealMemory value in slurm.conf, since for some reason the new slurmd
detects about 12 MB less memory than before).

But the older nodes are still marked IDLE+DRAIN:
-8<--
NodeName=str957-bl0-01 Arch=x86_64 CoresPerSocket=6
CPUAlloc=0 CPUTot=24 CPULoad=0.39
AvailableFeatures=ib,blade,intel,avx
ActiveFeatures=ib,blade,intel,avx
Gres=(null)
NodeAddr=str957-bl0-01 NodeHostName=str957-bl0-01 Version=20.11.4
OS=Linux 5.10.0-8-amd64 #1 SMP Debian 5.10.46-5 (2021-09-23)
RealMemory=64000 AllocMem=0 FreeMem=63518 Sockets=2 Boards=1
MemSpecLimit=2048
State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=0 Weight=2 Owner=N/A
MCS_label=N/A
Partitions=b1
BootTime=2021-10-01T09:35:42 SlurmdStartTime=2021-10-01T09:36:15
CfgTRES=cpu=24,mem=62.50G,billing=182
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [root@2021-10-01T08:08:18]
Comment=(null)
-8<--
I already reduced the RealMemory value in slurm.conf and restarted both
slurmctld and slurmd (in case "scontrol reconfigure" was not enough; it's
not really clear from the docs).
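For reference, this is roughly what I ran (systemd unit names as shipped by
the Debian packages; adjust if yours differ):
-8<--
# on the controller node
systemctl restart slurmctld
# on each compute node
systemctl restart slurmd
# or, to just have the running daemons re-read slurm.conf:
scontrol reconfigure
-8<--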

The relevant lines in slurm.conf are:
-8<--
NodeName=DEFAULT Sockets=2 ThreadsPerCore=2
State=UNKNOWN MemSpecLimit=2048
NodeName=str957-bl0-0[1-2] CoresPerSocket=6
RealMemory=64000 Weight=2 Feature=ib,blade,intel,avx
-8<--

And the node says:
-8<--
root@str957-bl0-01:~# slurmd -C
NodeName=str957-bl0-01 CPUs=24 Boards=1 SocketsPerBoard=2
CoresPerSocket=6 ThreadsPerCore=2 RealMemory=64378
UpTime=0-00:37:17
-8<--

I also tried lowering the RealMemory setting to 60000, in case MemSpecLimit
interfered, but the result remains the same.
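To verify the controller actually picks up the new value, I'm checking it
with something like:
-8<--
scontrol show node str957-bl0-01 | grep -E 'RealMemory|State|Reason'
-8<--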

Any ideas?

TIA!

--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786

Brian Andrus

Oct 1, 2021, 1:48:02 PM
to slurm...@lists.schedmd.com
Not unusual. You should set RealMemory a bit below what slurmd reports.

Kernel modules that get upgraded may use a little more memory, causing
exactly this situation. There are other causes as well, but by giving the
kernel/system some wiggle room you prevent any issues.

It also helps with OOM-killer situations.
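For example (the number is purely illustrative), if slurmd -C reports
RealMemory=64378, you could leave some headroom in slurm.conf with
something like:
-8<--
NodeName=str957-bl0-0[1-2] CoresPerSocket=6 RealMemory=63800 Weight=2 Feature=ib,blade,intel,avx
-8<--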

Brian Andrus

Paul Brunk

Oct 1, 2021, 3:33:14 PM
to Slurm User Community List
Hi:

If you mean "why are the nodes still Drained, now that I fixed the
slurm.conf and restarted (never mind whether the RealMem parameter is
correct)?", try 'scontrol update nodename=str957-bl0-0[1-2] State=RESUME'.

--
Paul Brunk, system administrator
Georgia Advanced Computing Resource Center
Enterprise IT Svcs, the University of Georgia

-----Original Message-----
From: slurm-users <slurm-use...@lists.schedmd.com> On Behalf Of Diego Zuccato
Sent: Friday, October 1, 2021 04:23
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: [slurm-users] "Low RealMem" after upgrade

Diego Zuccato

Oct 5, 2021, 2:07:26 AM
to Slurm User Community List, Paul Brunk
Hi.

I already tried multiple times, both RESUME and IDLE, and it didn't
work: it just returned to "IDLE+DRAIN" with 'Reason="low realmem"'. :(
I just tried again (after an unplanned shutdown of the frontend) and it
worked with IDLE (RESUME gives "Invalid node state specified").
SLURM 20.11.4.
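In case it's useful to others, the variant that worked here was along the
lines of:
-8<--
scontrol update NodeName=str957-bl0-0[1-2] State=IDLE
-8<--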

Tks.
Diego

Il 01/10/2021 21:32, Paul Brunk ha scritto:
> Hi:
>
> If you mean "why are the nodes still Drained, now that I fixed the
> slurm.conf and restarted (never mind whether the RealMem parameter is
> correct)?", try 'scontrol update nodename=str957-bl0-0[1-2] State=RESUME'.
>

--

Ole Holm Nielsen

Oct 5, 2021, 3:22:44 AM
to slurm...@lists.schedmd.com
On 10/5/21 8:05 AM, Diego Zuccato wrote:
> I already tried multiple times, both RESUME and IDLE, and it didn't work:
> it just returned to "IDLE+DRAIN" with 'Reason="low realmem"'. :(
> I just tried again (after an unplanned shutdown of the frontend) and it

What is a "frontend"? Do you mean the slurmctld server?

> worked with IDLE (RESUME gives "Invalid node state specified").

So "scontrol update node=... state=idle" gives the node a correct idle
state, whereas "state=resume" doesn't? Did you restart the slurmd on the
compute nodes?

> SLURM 20.11.4.

You wrote that you use Slurm 21.08 from Debian 11. How did 20.11 get into
the picture? The slurmdbd and slurmctld servers must have versions >=
that of slurmd; see the links in
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
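For example, a quick check of what is actually running on each host:
-8<--
slurmctld -V   # on the slurmctld server
slurmdbd -V    # on the slurmdbd server
slurmd -V      # on each compute node
-8<--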

> Il 01/10/2021 21:32, Paul Brunk ha scritto:
>> If you mean "why are the nodes still Drained, now that I fixed the
>> slurm.conf and restarted (never mind whether the RealMem parameter is
>> correct)?", try 'scontrol update nodename=str957-bl0-0[1-2] State=RESUME'.

/Ole


Diego Zuccato

Oct 5, 2021, 4:07:11 AM
to Slurm User Community List, Ole Holm Nielsen
Il 05/10/2021 09:22, Ole Holm Nielsen ha scritto:

> What is a "frontend"?  Do you mean the slurmctld server?
Yes, sorry. "Frontend" is what we call the node(s) users log in to and
submit jobs from, where slurmctld and slurmdbd run. We'll probably move
slurmdbd and slurmctld to a dedicated VM in a future upgrade (mainly, I
have to be sure it doesn't need IB or access to the GlusterFS volume
that's only available over IB).
Does sbatch give slurmctld just a path to the job script, or the whole
script?

>> worked with IDLE (RESUME gives "Invalid node state specified").
> So "scontrol update node=... state=idle" gives the node a correct idle
> state, whereas "state=resume" doesn't?  Did you restart the slurmd on
> the compute nodes?
Yes. Complete node reboots, actually. Multiple times. When desperate,
try rebooting.

>> SLURM 20.11.4.
> You wrote that you use Slurm 21.08 from Debian 11.  How did 20.11 get
> into the picture?
Good question. I copy-pasted 21.08 from a node after the upgrade, but
now all nodes say 20.11.4. Really confused :-? Just to add to the
confusion, packages.debian.org gives 20.11.7+really20.11.4-2 as the
slurmctld version for bullseye. No mention of 21.08 anywhere, not even
in sid (20.11.8). ARGH! Did I dream it? And if so, how could I copy-paste it????
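For the record, I'm now checking the installed versions directly (package
names as in the Debian repos):
-8<--
dpkg -l 'slurm*' | grep '^ii'
apt policy slurmd slurmctld
-8<--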

>  The slurmdbd and slurmctld servers must have versions
> >= that of slurmd, see some links in
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
Yup. That's why I upgraded the whole cluster at once.

Tks for the help.