[slurm-users] sbatch: error: memory allocation failure


Yap, Mike

Jun 7, 2021, 7:46:37 PM
to slurm...@lists.schedmd.com

Hi All

Can anyone advise on the possible causes of the error message below when submitting a job?

sbatch: error: memory allocation failure

The same script worked perfectly fine until I included #SBATCH --nodelist=(compute[015-046]); once that directive is removed, it works as it should.
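
For reference, a stripped-down sketch of the script (the job name, task count, and application line are placeholders; the --nodelist line is exactly as used):

#!/bin/bash
# Placeholder directives; only the --nodelist line below is as reported
#SBATCH --job-name=myjob
#SBATCH --ntasks=32
#SBATCH --nodelist=(compute[015-046])

srun ./my_app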

The issues:

  1. For the current setup, I have specific resources available for each compute node:
    a. (NodeName=compute[007-014] Procs=36 CoresPerSocket=18 RealMemory=384000 ThreadsPerCore=1 Boards=1 SocketsPerBoard=2) – newer model
    b. (NodeName=compute[001-006] Procs=16 CoresPerSocket=18 RealMemory=128000 ThreadsPerCore=1 Boards=1 SocketsPerBoard=2)
  2. The same resources are shared between multiple queues (working fine); a sketch of an assumed slurm.conf layout follows this list.
  3. When running a parallel job, the exact same job runs when assigned to a single node category (i.e. exclusively 1a or 1b nodes).
  4. When the exact same job is assigned across both 1a and 1b, it runs on the 1b nodes but there is no activity on the 1a nodes.
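
Here is that sketch (the NodeName lines are as configured above; the PartitionName lines, queue names, and time limits are assumptions for illustration only):

NodeName=compute[001-006] Procs=16 CoresPerSocket=18 RealMemory=128000 ThreadsPerCore=1 Boards=1 SocketsPerBoard=2
NodeName=compute[007-014] Procs=36 CoresPerSocket=18 RealMemory=384000 ThreadsPerCore=1 Boards=1 SocketsPerBoard=2
# Hypothetical queues sharing the same nodes (point 2)
PartitionName=short Nodes=compute[001-014] Default=YES MaxTime=04:00:00 State=UP
PartitionName=long Nodes=compute[001-014] MaxTime=7-00:00:00 State=UP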

Any suggestions?

Thanks

Mike

Prentice Bisbal

Jun 17, 2021, 3:45:28 PM
to slurm...@lists.schedmd.com

Mike,

You didn't include your entire sbatch script, so it's hard to say what's going wrong with only a single line to work with. Based on what you've told us, my guess is that you are specifying a per-node memory requirement greater than 128000 MB. When you specify a nodelist, Slurm assigns your job to all of those nodes, not to a subset that matches your other job specifications (--mem, --mem-per-cpu, --ntasks, etc.). From the sbatch man page:

-w, --nodelist=<node name list>
Request a specific list of hosts. The job will contain all of these hosts and possibly additional hosts as needed to satisfy resource requirements.
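
As an illustration (the numbers here are assumptions, since the actual script wasn't posted), a combination like

#SBATCH --mem=200000
#SBATCH --nodelist=compute[001-014]

can never be satisfied, because --nodelist forces the job to include the compute[001-006] nodes, which only have 128000 MB each. Restricting the nodelist to the larger nodes, or keeping the per-node request within what the smallest listed node offers, removes the conflict:

#SBATCH --mem=120000
#SBATCH --nodelist=compute[001-014]

Alternatively, tagging the larger nodes with a Features= entry in slurm.conf and selecting them with --constraint lets Slurm choose a matching subset instead of pinning exact hosts.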

Prentice 