[slurm-users] Job submitted to multiple partitions not running when any partition is full


Paul Raines via slurm-users

Jul 9, 2024, 9:26:32 AM
to slurm-users

I have a job 4650727 submitted to multiple partitions (rtx6000,rtx8000,pubgpu):

JOBID    PARTITION  PENDING  PRIORITY    TRES_ALLOC|REASON
4650727  rtx6000    47970    0.00367972  cpu=5,mem=400G,node=1,gpu=1|Priority
4650727  rtx8000    47970    0.00367972  cpu=5,mem=400G,node=1,gpu=1|Priority
4650727  pubgpu     47970    0.00367972  cpu=5,mem=400G,node=1,gpu=1|Priority
4646926  rtx6000    487048   0.00121987  cpu=10,mem=32G,node=1,gpu=1|Priority,Resources
4650186  rtx8000    56979    0.00000000  cpu=4,mem=10G,node=1,gpu=1|Priority,Resources

The two partitions rtx6000 and rtx8000 are full, and two other jobs
are at the top of the queue waiting to run on them. But partition
pubgpu is NOT full, and you can see here a node, leo, with resources
to run job 4650727:

HOST  PARTITION  CORES  MEMORY         GPUS
leo   pubgpu     48/64  12288/1030994  0/1
leo   pubcpu     48/64  12288/1030994  0/1

The node leo is NOT part of the rtx6000 or rtx8000 partitions, and
there are no other pending jobs waiting on either the pubgpu or
pubcpu partitions that leo is part of.

So why is 4650727 not running on the pubgpu partition?

---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA




Timony, Mick via slurm-users

Jul 9, 2024, 2:41:04 PM
to slurm-users, Raines, Paul E.
Hi Paul,

There could be multiple reasons why the job isn't running, from the user's QOS to your cluster hitting MaxJobCount. This page might help:

https://slurm.schedmd.com/high_throughput.html

The output of the following command might help:

scontrol show job 4650727

Regards
-- 
Mick Timony
Senior DevOps Engineer
Harvard Medical School

Paul Raines via slurm-users

Jul 9, 2024, 3:11:11 PM
to Timony, Mick, slurm-users

Thanks. I traced it to a MaxMemPerCPU=16384 setting on the pubgpu
partition.
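
For anyone hitting the same thing, here is a rough sketch of the arithmetic behind that limit. It assumes mem=400G is counted as 400 * 1024 MB and that MaxMemPerCPU is in megabytes, as the slurm.conf documentation describes. The function names are made up for illustration and are not part of Slurm. As far as I understand the documentation, when a job's memory per CPU exceeds MaxMemPerCPU, Slurm tries to raise the job's CPU count to compensate, which may leave the job unable to fit on the partition.

```python
import math

# Values taken from the thread: the job's request and the partition limit.
JOB_CPUS = 5
JOB_MEM_MB = 400 * 1024        # mem=400G, assuming 1G is counted as 1024 MB
MAX_MEM_PER_CPU_MB = 16384     # MaxMemPerCPU on the pubgpu partition


def mem_per_cpu_mb(mem_mb: int, cpus: int) -> float:
    """Memory the job asks for per allocated CPU, in MB (hypothetical helper)."""
    return mem_mb / cpus


def cpus_needed_under_limit(mem_mb: int, max_mem_per_cpu_mb: int) -> int:
    """Smallest CPU count that keeps the job at or under the per-CPU limit."""
    return math.ceil(mem_mb / max_mem_per_cpu_mb)


requested = mem_per_cpu_mb(JOB_MEM_MB, JOB_CPUS)
needed = cpus_needed_under_limit(JOB_MEM_MB, MAX_MEM_PER_CPU_MB)

print(f"requested per CPU: {requested:.0f} MB (limit {MAX_MEM_PER_CPU_MB} MB)")
print(f"exceeds limit: {requested > MAX_MEM_PER_CPU_MB}")
print(f"CPUs needed to stay under the limit: {needed}")
```

So the job asks for 81920 MB per CPU against a 16384 MB limit, and would need 25 CPUs rather than 5 to satisfy it on pubgpu.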

-- Paul Raines (http://help.nmr.mgh.harvard.edu)
