[slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value


Chin,David

Mar 15, 2021, 1:53:15 PM
to Slurm-Users List
Hi, all:

I'm trying to understand why a job exited with an error condition. I think it was actually terminated by Slurm: the job was a Matlab script, and its output was incomplete.

Here's sacct output:

               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     ReqMem     MaxRSS  MaxVMSize                        AllocTRES AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- ---------- -------------------------------- --------
               83387 ProdEmisI+      foob        def         node001   03:34:26 OUT_OF_ME+    0:125      128Gn                               billing=16,cpu=16,node=1
         83387.batch      batch                              node001   03:34:26 OUT_OF_ME+    0:125      128Gn   1617705K   7880672K              cpu=16,mem=0,node=1
        83387.extern     extern                              node001   03:34:26  COMPLETED      0:0      128Gn       460K    153196K         billing=16,cpu=16,node=1
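(For reference, output like the above can be produced with a format string along these lines; the field list is reconstructed from the columns shown:)

```shell
# Query job 83387 with the fields from the table above.
sacct -j 83387 \
    --format=JobID%20,JobName,User,Partition,NodeList%15,Elapsed,State,ExitCode,ReqMem,MaxRSS,MaxVMSize,AllocTRES%32,AllocGRE
```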

Thanks in advance,
    Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu                     215.571.4335 (o)
For URCF support: urcf-s...@drexel.edu
github:prehensilecode


Drexel Internal Data

Paul Edmon

Mar 15, 2021, 2:03:13 PM
to slurm...@lists.schedmd.com

One should keep in mind that sacct memory-usage figures are not accurate for Out Of Memory (OoM) jobs. The job is typically terminated before the next sacct polling period, and before it reaches its full memory allocation, so I wouldn't trust any of the memory-usage results when a job is killed by OoM. sacct just can't pick up a sudden memory spike like that, and even if it could, it would not record the true peak because the job was terminated before reaching it.
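(The polling period in question is controlled by JobAcctGatherFrequency in slurm.conf; a quick way to check what your cluster uses:)

```shell
# Show the accounting-sampling interval (default is 30 seconds).
scontrol show config | grep -i JobAcctGatherFrequency
```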


-Paul Edmon-

Renfro, Michael

Mar 15, 2021, 2:05:06 PM
to Slurm User Community List
Just a starting guess, but are you certain the MATLAB script didn’t try to allocate enormous amounts of memory for variables? That’d be about 16e9 floating point values, if I did the units correctly.
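(Back-of-the-envelope, assuming 8-byte double-precision values:)

```shell
# 128 GB request divided by 8 bytes per double-precision float
echo $((128 * 1000**3 / 8))   # 16000000000, i.e. ~16e9 values
```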

Chin,David

Mar 15, 2021, 2:16:05 PM
to Slurm User Community List
Here's seff output, if it makes any difference. In any case, the exact same job was run by the user on their laptop with 16 GB RAM with no problem.

Job ID: 83387
Cluster: picotte
User/Group: foob/foob
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 06:50:30
CPU Efficiency: 11.96% of 2-09:10:56 core-walltime
Job Wall-clock time: 03:34:26
Memory Utilized: 1.54 GB
Memory Efficiency: 1.21% of 128.00 GB


--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu                     215.571.4335 (o)
For URCF support: urcf-s...@drexel.edu
github:prehensilecode


From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Paul Edmon <ped...@cfa.harvard.edu>
Sent: Monday, March 15, 2021 14:02
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
 


Chin,David

Mar 15, 2021, 2:23:54 PM
to Slurm User Community List
Hi Michael:

I looked at the Matlab script: it's loading an xlsx file which is 2.9 kB.

There are some "static" arrays allocated with ones() or zeros(), but those use small subsets (< 10 columns) of the loaded data, and the outputs are 6x10 arrays. There are certainly not 16e9 rows in the original file.

Saved output .mat file is only 1.8kB.

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu                     215.571.4335 (o)
For URCF support: urcf-s...@drexel.edu
github:prehensilecode



From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Renfro, Michael <Ren...@tntech.edu>
Sent: Monday, March 15, 2021 14:04
To: Slurm User Community List <slurm...@lists.schedmd.com>

Subject: Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
 


Chin,David

Mar 15, 2021, 2:49:12 PM
to Slurm User Community List
One possible datapoint: on the node where the job ran, there were two slurmstepd processes running, both at 100% CPU even after the job had ended.


--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu                     215.571.4335 (o)
For URCF support: urcf-s...@drexel.edu
github:prehensilecode



Chad DeWitt

Mar 15, 2021, 3:09:39 PM
to Slurm User Community List
Hi Dave,

Hope you're doing well.

(...very possible you have already done these things...)

Maybe the logs on the compute node (system and slurmd.log) would yield more info? 

Rolling the dice, it may also be worth looking for runaway processes or jobs on that compute node, as well as confirming the node is healthy (no hardware issues, etc.).
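For example, if the kernel's OOM killer fired, it usually leaves a trace; log paths are site-dependent, and /var/log/slurmd.log below is just a common default:

```shell
# kernel messages from the OOM killer, with human-readable timestamps
dmesg -T | grep -i -E 'oom|out of memory'
# slurmd's own log on the compute node
grep -i oom /var/log/slurmd.log
```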

Cheers,
Chad

------------------------------------------------------------

Chad DeWitt, CISSP | University Research Computing

UNC Charlotte | Office of OneIT

ccde...@uncc.edu https://oneit.uncc.edu

------------------------------------------------------------





Sean Crosby

Mar 15, 2021, 3:22:59 PM
to Slurm User Community List
What are your Slurm settings? What are the values of

ProctrackType
JobAcctGatherType
JobAcctGatherParams

and what's the contents of cgroup.conf? Also, what version of Slurm are you using?

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia





Chin,David

Mar 15, 2021, 3:34:38 PM
to Slurm User Community List
Hi, Sean:

Slurm version 20.02.6 (via Bright Cluster Manager)

  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/linux
  JobAcctGatherParams=UsePss,NoShared


I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this job appeared to have left two slurmstepd zombie processes running at 100% CPU each, and changed the settings to:

  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/cgroup
  JobAcctGatherParams=UsePss,NoShared,NoOverMemoryKill

Have asked the user to re-run the job, but that has not happened, yet.

cgroup.conf:

  CgroupMountpoint="/sys/fs/cgroup"
  CgroupAutomount=yes
  TaskAffinity=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes
  ConstrainSwapSpace=no
  ConstrainDevices=yes
  ConstrainKmemSpace=yes
  AllowedRamSpace=100.00
  AllowedSwapSpace=0.00
  MinKmemSpace=200
  MaxKmemPercent=100.00
  MemorySwappiness=100
  MaxRAMPercent=100.00
  MaxSwapPercent=100.00
  MinRAMSpace=200


Cheers,
    Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu                     215.571.4335 (o)
For URCF support: urcf-s...@drexel.edu
github:prehensilecode




Sean Crosby

Mar 16, 2021, 6:04:28 AM
to Slurm User Community List
Hi David,


On Tue, 16 Mar 2021 at 06:34, Chin,David <dw...@drexel.edu> wrote:


Hi, Sean:

Slurm version 20.02.6 (via Bright Cluster Manager)

  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/linux
  JobAcctGatherParams=UsePss,NoShared


I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this job appeared to have left two slurmstepd zombie processes running at 100%CPU each, and changed to:

  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/cgroup
  JobAcctGatherParams=UsePss,NoShared,NoOverMemoryKill

You definitely want the NoOverMemoryKill option for JobAcctGatherParams. This allows cgroups to kill the job, instead of Slurm accounting.
 


Have asked the user to re-run the job, but that has not happened, yet.

cgroup.conf:

  CgroupMountpoint="/sys/fs/cgroup"
  CgroupAutomount=yes
  TaskAffinity=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes
  ConstrainSwapSpace=no
  ConstrainDevices=yes
  ConstrainKmemSpace=yes
  AllowedRamSpace=100.00
  AllowedSwapSpace=0.00
  MinKmemSpace=200
  MaxKmemPercent=100.00
  MemorySwappiness=100
  MaxRAMPercent=100.00
  MaxSwapPercent=100.00
  MinRAMSpace=200

This looks good too. Our site does not restrict kmem space, but at least now you'll see why the cgroup kills the job (on the compute node, the cgroup will show the memory used at the time the job was killed), so you can tell whether it is kmem-related.
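For example, with cgroup v1 the per-job memory cgroup exposes peak-usage and limit-hit counters; the uid/job path below is illustrative and varies by site and Slurm version:

```shell
# illustrative path; substitute the real uid and job id
JOBDIR=/sys/fs/cgroup/memory/slurm/uid_1001/job_83387
cat "$JOBDIR/memory.max_usage_in_bytes"       # peak memory usage seen by the cgroup
cat "$JOBDIR/memory.failcnt"                  # how many times the limit was hit
cat "$JOBDIR/memory.kmem.max_usage_in_bytes"  # peak kernel memory (relevant with ConstrainKmemSpace=yes)
```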

Sean