[slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value


Chin,David

Mar 15, 2021, 1:53:15 PM
to Slurm-Users List
Hi, all:

I'm trying to understand why a job exited with an error condition. I think it was actually terminated by Slurm: the job was a Matlab script, and its output was incomplete.

Here's sacct output:

               JobID    JobName      User  Partition        NodeList    Elapsed      State ExitCode     ReqMem     MaxRSS  MaxVMSize                        AllocTRES AllocGRE
-------------------- ---------- --------- ---------- --------------- ---------- ---------- -------- ---------- ---------- ---------- -------------------------------- --------
               83387 ProdEmisI+      foob        def         node001   03:34:26 OUT_OF_ME+    0:125      128Gn                               billing=16,cpu=16,node=1
         83387.batch      batch                              node001   03:34:26 OUT_OF_ME+    0:125      128Gn   1617705K   7880672K              cpu=16,mem=0,node=1
        83387.extern     extern                              node001   03:34:26  COMPLETED      0:0      128Gn       460K    153196K         billing=16,cpu=16,node=1
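(For reference, output like the above can be produced with a format string along these lines; the field list is reconstructed from the columns shown:)

```shell
# Query job 83387 with the fields from the table above.
sacct -j 83387 \
    --format=JobID%20,JobName,User,Partition,NodeList%15,Elapsed,State,ExitCode,ReqMem,MaxRSS,MaxVMSize,AllocTRES%32,AllocGRE
```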

Thanks in advance,
    Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu                     215.571.4335 (o)
For URCF support: urcf-s...@drexel.edu
github:prehensilecode


Drexel Internal Data

Paul Edmon

Mar 15, 2021, 2:03:13 PM
to slurm...@lists.schedmd.com

One should keep in mind that sacct memory-usage figures are not accurate for Out Of Memory (OoM) jobs. The job is typically terminated before the next sacct polling period, and before it reaches its full memory allocation, so I wouldn't trust any of the memory-usage results when a job is killed by OoM. sacct just can't pick up a sudden memory spike like that, and even if it could, it would not record the true peak because the job was terminated before reaching it.
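(The polling period in question is controlled by JobAcctGatherFrequency in slurm.conf; a quick way to check what your cluster uses:)

```shell
# Show the accounting-sampling interval (default is 30 seconds).
scontrol show config | grep -i JobAcctGatherFrequency
```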


-Paul Edmon-

Renfro, Michael

Mar 15, 2021, 2:05:06 PM
to Slurm User Community List
Just a starting guess, but are you certain the MATLAB script didn’t try to allocate enormous amounts of memory for variables? That’d be about 16e9 floating point values, if I did the units correctly.
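(Back-of-the-envelope, assuming 8-byte double-precision values:)

```shell
# 128 GB request divided by 8 bytes per double-precision float
echo $((128 * 1000**3 / 8))   # 16000000000, i.e. ~16e9 values
```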

Chin,David

Mar 15, 2021, 2:16:05 PM
to Slurm User Community List
Here's seff output, if it makes any difference. In any case, the exact same job was run by the user on their laptop with 16 GB RAM with no problem.

Job ID: 83387
Cluster: picotte
User/Group: foob/foob
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 06:50:30
CPU Efficiency: 11.96% of 2-09:10:56 core-walltime
Job Wall-clock time: 03:34:26
Memory Utilized: 1.54 GB
Memory Efficiency: 1.21% of 128.00 GB


--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu                     215.571.4335 (o)
For URCF support: urcf-s...@drexel.edu
github:prehensilecode


From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Paul Edmon <ped...@cfa.harvard.edu>
Sent: Monday, March 15, 2021 14:02
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
 


Chin,David

Mar 15, 2021, 2:23:54 PM
to Slurm User Community List
Hi Michael:

I looked at the Matlab script: it's loading an xlsx file which is 2.9 kB.

There are some "static" arrays allocated with ones() or zeros(), but those use small subsets (< 10 columns) of the loaded data, and the outputs are 6x10 arrays. There are certainly not 16e9 rows in the original file.

Saved output .mat file is only 1.8kB.

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu                     215.571.4335 (o)
For URCF support: urcf-s...@drexel.edu
github:prehensilecode



From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Renfro, Michael <Ren...@tntech.edu>
Sent: Monday, March 15, 2021 14:04
To: Slurm User Community List <slurm...@lists.schedmd.com>

Subject: Re: [slurm-users] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
 


Chin,David

Mar 15, 2021, 2:49:12 PM
to Slurm User Community List
One possible datapoint: on the node where the job ran, there were two slurmstepd processes running, both at 100% CPU even after the job had ended.


--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu                     215.571.4335 (o)
For URCF support: urcf-s...@drexel.edu
github:prehensilecode



Chad DeWitt

Mar 15, 2021, 3:09:39 PM
to Slurm User Community List
Hi Dave,

Hope you're doing well.

(...very possible you have already done these things...)

Maybe the logs on the compute node (system and slurmd.log) would yield more info? 

Rolling the dice, it may also be worth looking for runaway processes or jobs on that compute node, as well as confirming the node is healthy (no hardware issues, etc.).
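For example, if the kernel's OOM killer fired, it usually leaves a trace; log paths are site-dependent, and /var/log/slurmd.log below is just a common default:

```shell
# kernel messages from the OOM killer, with human-readable timestamps
dmesg -T | grep -i -E 'oom|out of memory'
# slurmd's own log on the compute node
grep -i oom /var/log/slurmd.log
```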

Cheers,
Chad

------------------------------------------------------------

Chad DeWitt, CISSP | University Research Computing

UNC Charlotte | Office of OneIT

ccde...@uncc.edu https://oneit.uncc.edu

------------------------------------------------------------





Sean Crosby

Mar 15, 2021, 3:22:59 PM
to Slurm User Community List
What are your Slurm settings? What are the values of

ProctrackType
JobAcctGatherType
JobAcctGatherParams

and what's the contents of cgroup.conf? Also, what version of Slurm are you using?

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia





Chin,David

Mar 15, 2021, 3:34:38 PM
to Slurm User Community List
Hi, Sean:

Slurm version 20.02.6 (via Bright Cluster Manager)

  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/linux
  JobAcctGatherParams=UsePss,NoShared


I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this job appeared to have left two slurmstepd zombie processes running at 100% CPU each, and changed the settings to:

  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/cgroup
  JobAcctGatherParams=UsePss,NoShared,NoOverMemoryKill

Have asked the user to re-run the job, but that has not happened, yet.

cgroup.conf:

  CgroupMountpoint="/sys/fs/cgroup"
  CgroupAutomount=yes
  TaskAffinity=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes
  ConstrainSwapSpace=no
  ConstrainDevices=yes
  ConstrainKmemSpace=yes
  AllowedRamSpace=100.00
  AllowedSwapSpace=0.00
  MinKmemSpace=200
  MaxKmemPercent=100.00
  MemorySwappiness=100
  MaxRAMPercent=100.00
  MaxSwapPercent=100.00
  MinRAMSpace=200


Cheers,
    Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dw...@drexel.edu                     215.571.4335 (o)
For URCF support: urcf-s...@drexel.edu
github:prehensilecode




Sean Crosby

Mar 16, 2021, 6:04:28 AM
to Slurm User Community List
Hi David,


On Tue, 16 Mar 2021 at 06:34, Chin,David <dw...@drexel.edu> wrote:


Hi, Sean:

Slurm version 20.02.6 (via Bright Cluster Manager)

  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/linux
  JobAcctGatherParams=UsePss,NoShared


I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this job appeared to have left two slurmstepd zombie processes running at 100%CPU each, and changed to:

  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/cgroup
  JobAcctGatherParams=UsePss,NoShared,NoOverMemoryKill

You definitely want the NoOverMemoryKill option for JobAcctGatherParams. This allows cgroups to kill the job, instead of Slurm accounting.
 


Have asked the user to re-run the job, but that has not happened, yet.

cgroup.conf:

  CgroupMountpoint="/sys/fs/cgroup"
  CgroupAutomount=yes
  TaskAffinity=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes
  ConstrainSwapSpace=no
  ConstrainDevices=yes
  ConstrainKmemSpace=yes
  AllowedRamSpace=100.00
  AllowedSwapSpace=0.00
  MinKmemSpace=200
  MaxKmemPercent=100.00
  MemorySwappiness=100
  MaxRAMPercent=100.00
  MaxSwapPercent=100.00
  MinRAMSpace=200

This looks good too. Our site does not restrict kmem space, but at least now you'll see why the cgroup kills the job (on the compute node, the cgroup will show the memory used at the time the job was killed), so you can tell whether it is kmem-related.
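For example, with cgroup v1 the per-job memory cgroup exposes peak-usage and limit-hit counters; the uid/job path below is illustrative and varies by site and Slurm version:

```shell
# illustrative path; substitute the real uid and job id
JOBDIR=/sys/fs/cgroup/memory/slurm/uid_1001/job_83387
cat "$JOBDIR/memory.max_usage_in_bytes"       # peak memory usage seen by the cgroup
cat "$JOBDIR/memory.failcnt"                  # how many times the limit was hit
cat "$JOBDIR/memory.kmem.max_usage_in_bytes"  # peak kernel memory (relevant with ConstrainKmemSpace=yes)
```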

Sean