Hello Cristóbal,
I think you may have a slight misunderstanding of how Slurm works, which would explain the difference between what you expected and what you observed.
MaxMemPerNode exists so the scheduler can plan job placement according to available resources. It does not enforce any limit during job execution; it only governs placement, under the assumption that the job will not use more than the resources it requested.
One option for limiting a job during execution is cgroups; another is JobAcctGatherParams=OverMemoryKill. I suspect cgroups would be the better option for your use case. From the slurm.conf man page, regarding OverMemoryKill:
Kill processes that are being detected to use more memory than requested by steps every time accounting information is gathered by the JobAcctGather plugin. This parameter should be used with caution because a job exceeding its memory allocation may affect other processes and/or machine health.
NOTE: If available, it is recommended to limit memory by enabling task/cgroup as a TaskPlugin and making use of ConstrainRAMSpace=yes in the cgroup.conf instead of using this JobAcctGather mechanism for memory enforcement. Using JobAcctGather is polling based and there is a delay before a job is killed, which could lead to system Out of Memory events.
NOTE: When using OverMemoryKill, if the combined memory used by all the processes in a step exceeds the memory limit, the entire step will be killed/cancelled by the JobAcctGather plugin. This differs from the behavior when using ConstrainRAMSpace, where processes in the step will be killed, but the step will be left active, possibly with other processes left running.
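To make the recommended cgroup approach concrete, a minimal configuration sketch might look like the following. The parameter names come from the Slurm documentation; treat this as illustrative only, since the exact plugin set depends on your site setup and Slurm version:

```
# slurm.conf -- enable cgroup-based task management
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

# cgroup.conf -- enforce each job's memory request at runtime
ConstrainRAMSpace=yes
```

With ConstrainRAMSpace=yes, the kernel enforces the memory limit continuously rather than Slurm polling for it, which avoids the delay (and potential node OOM events) mentioned in the note above.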
--
Regards,
Daniel Letai
+972 (0)505 870 456