[slurm-users] slurm jobs are pending but resources are available

Marius.C...@sony.com

Apr 16, 2018, 6:36:21 AM
to slurm...@lists.schedmd.com
Hi,

I'm having some trouble with resource allocation: based on how I understood the documentation
and applied it to the config file, I expect behavior that does not actually happen.

Here is the relevant excerpt from the config file:

SchedulerType=sched/backfill
SchedulerParameters=bf_continue,bf_interval=45,bf_resolution=90,max_array_tasks=1000
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
FastSchedule=1
...
NodeName=cn_burebista Sockets=2 CoresPerSocket=14 ThreadsPerCore=2 RealMemory=256000 State=UNKNOWN
PartitionName=main_compute Nodes=cn_burebista Shared=YES Default=YES MaxTime=76:00:00 State=UP

According to the above, I have the backfill scheduler enabled, with CPUs and memory configured as
consumable resources. I have 56 CPUs and 256 GB of RAM in my resource pool. I would expect the backfill
scheduler to allocate resources so that as many of the cores as possible are filled whenever
multiple jobs ask for more resources than are available. In my case I have the following queue:

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2361 main_comp training mcetatea PD 0:00 1 (Resources)
2356 main_comp skrf_ori jhanca R 58:41 1 cn_burebista
2357 main_comp skrf_ori jhanca R 44:13 1 cn_burebista

Jobs 2356 and 2357 are asking for 16 CPUs each and job 2361 is asking for 20 CPUs, i.e. 52 CPUs in total.
As seen above, job 2361 (which was started by a different user) is marked as pending due to lack of resources, although there are plenty of CPUs and plenty of memory available. "scontrol show nodes cn_burebista" gives me the following:

NodeName=cn_burebista Arch=x86_64 CoresPerSocket=14
CPUAlloc=32 CPUErr=0 CPUTot=56 CPULoad=21.65
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=cn_burebista NodeHostName=cn_burebista Version=16.05
OS=Linux RealMemory=256000 AllocMem=64000 FreeMem=178166 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
BootTime=2018-03-09T12:04:52 SlurmdStartTime=2018-03-20T10:35:50
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
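
A compact way to cross-check those numbers (allocated/idle/other/total CPUs per node, plus the CPU count
and state of each job) is something like the two one-liners below; the format strings use standard
sinfo/squeue fields, and only the partition name is taken from my config:

# CPUs per node reported as Allocated/Idle/Other/Total
sinfo -N -p main_compute -o "%N %C"
# Job id, user, state, requested CPUs, and reason/nodelist
squeue -p main_compute -o "%.8i %.10u %.2t %.5C %R"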

I'm going through the documentation again and again but I cannot figure out what I am doing wrong.
Why do I have the above situation? What should I change in my config to make this work?
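
For reference, the pending job was submitted with a batch script shaped roughly like the sketch below;
only the job name, partition and CPU count are taken from the job record that follows, everything
else is illustrative:

#!/bin/bash
#SBATCH --job-name=training_carlib
#SBATCH --partition=main_compute
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20        # matches NumCPUs=20 / CPUs/Task=20 below
#SBATCH --time=76:00:00           # shows up as TimeLimit=3-04:00:00
# ... training command goes here ...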

scontrol show -dd job <jobid> shows me the following:

JobId=2361 JobName=training_carlib
UserId=mcetateanu(1000) GroupId=mcetateanu(1001) MCS_label=N/A
Priority=4294901726 Nice=0 Account=(null) QOS=(null)
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=3-04:00:00 TimeMin=N/A
SubmitTime=2018-03-27T10:30:38 EligibleTime=2018-03-27T10:30:38
StartTime=2018-03-28T10:27:36 EndTime=2018-03-31T14:27:36 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=main_compute AllocNode:Sid=zalmoxis:23690
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=cn_burebista
NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=20 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/train_classifier.sh
WorkDir=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier
StdErr=/home/mcetateanu/workspace/CarLib/src/_outputs/linux-xeon_e5v4-icc17.0/bin/classifier/training_job_2383.out
StdIn=/dev/null
StdOut=/home/mcetateanu/workspace/CarLib/src/_out

I also changed my config to specify the number of CPUs explicitly instead of letting Slurm compute the CPU count
from Sockets, CoresPerSocket, and ThreadsPerCore. The two jobs that I am trying to run then give the following
output from "scontrol show -dd job <jobid>", but the one asking for 20 CPUs is still pending due to lack of resources:

NumNodes=1 NumCPUs=16 NumTasks=1 CPUs/Task=16 ReqB:S:C:T=0:0:*:* TRES=cpu=16,mem=32000M,node=1 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* Nodes=cn_burebista CPU_IDs=0-15 Mem=32000 MinCPUsNode=16 MinMemoryCPU=2000M MinTmpDiskNode=0

NumNodes=1 NumCPUs=20 NumTasks=1 CPUs/Task=20 ReqB:S:C:T=0:0:*:* TRES=cpu=20,node=1 Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
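
The node definition after that change looks roughly like this (a sketch: the exact line is not reproduced
here, so treat the values as an assumption based on the hardware described above):

NodeName=cn_burebista CPUs=56 RealMemory=256000 State=UNKNOWN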

Thank you

-------------------------------------------------------------------------------------------
Marius Cetateanu
Senior Embedded Software Engineer
Engineering Department 1, Driver & Embedded
Sony Depthsensing Solutions
Tel: +32 (0)28992171
email: Marius.C...@sony.com

Sony Depthsensing Solutions
11 Boulevard de la Plaine, 1050 Brussels, Belgium


________________________________________
From: slurm-users [slurm-use...@lists.schedmd.com] on behalf of slurm-use...@lists.schedmd.com [slurm-use...@lists.schedmd.com]
Sent: Sunday, April 15, 2018 9:02 PM
To: slurm...@lists.schedmd.com
Subject: slurm-users Digest, Vol 6, Issue 21

Send slurm-users mailing list submissions to
slurm...@lists.schedmd.com

To subscribe or unsubscribe via the World Wide Web, visit
https://lists.schedmd.com/cgi-bin/mailman/listinfo/slurm-users
or, via email, send a message with subject or body 'help' to
slurm-use...@lists.schedmd.com

You can reach the person managing the list at
slurm-us...@lists.schedmd.com

When replying, please edit your Subject line so it is more specific
than "Re: Contents of slurm-users digest..."


Today's Topics:

1. Re: ulimit in sbatch script (Mahmood Naderan)
2. Re: ulimit in sbatch script (Bill Barth)
3. Re: ulimit in sbatch script (Mahmood Naderan)
4. Re: ulimit in sbatch script (Mahmood Naderan)
5. Re: ulimit in sbatch script (Bill Barth)


----------------------------------------------------------------------

Message: 1
Date: Sun, 15 Apr 2018 22:56:01 +0430
From: Mahmood Naderan <mahmo...@gmail.com>
To: Ole.H....@fysik.dtu.dk, Slurm User Community List
<slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] ulimit in sbatch script
Message-ID:
<CADa2P2XsyW0tBVGjuBi_yRpD...@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

I actually have disabled the swap partition (!) since the system goes
really bad and based on my experience I have to enter the room and
reset the affected machine (!). Otherwise I have to wait for long
times to see it get back to normal.

When I ssh to the node with root user, the ulimit -a says unlimited
virtual memory. So, it seems that the root have unlimited value while
users have limited value.

Regards,
Mahmood


On Sun, Apr 15, 2018 at 10:26 PM, Ole Holm Nielsen
<Ole.H....@fysik.dtu.dk> wrote:
> Hi Mahmood,
>
> It seems your compute node is configured with this limit:
>
> virtual memory (kbytes, -v) 72089600
>
> So when the batch job tries to set a higher limit (ulimit -v 82089600) than
> permitted by the system (72089600), this must surely get rejected, as you
> have discovered!
>
> You may want to reconfigure your compute nodes' limits, for example by
> setting the virtual memory limit to "unlimited" in your configuration. If
> the nodes has a very small RAM memory + swap space size, you might encounter
> Out Of Memory errors...
>
> /Ole
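
(The rejection Ole describes is easy to reproduce in a plain shell; a minimal sketch, assuming a
non-root user whose hard virtual-memory limit is 72089600 kbytes:)

ulimit -H -v          # show the hard limit, e.g. 72089600
ulimit -v 82089600    # raising it above the hard limit fails:
                      #   "cannot modify limit: Operation not permitted"
ulimit -v 60000000    # staying at or below the hard limit is allowed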

------------------------------

Message: 2
Date: Sun, 15 Apr 2018 18:31:08 +0000
From: Bill Barth <bba...@tacc.utexas.edu>
To: Slurm User Community List <slurm...@lists.schedmd.com>,
"Ole.H....@fysik.dtu.dk" <Ole.H....@fysik.dtu.dk>
Subject: Re: [slurm-users] ulimit in sbatch script
Message-ID: <6218364A-07C8-4A75...@tacc.utexas.edu>
Content-Type: text/plain; charset="utf-8"

Are you using pam_limits.so in any of your /etc/pam.d/ configuration files? That would be enforcing /etc/security/limits.conf for all users which are usually unlimited for root. Root’s almost always allowed to do stuff bad enough to crash the machine or run it out of resources. If the /etc/pam.d/sshd file has pam_limits.so in it, that’s probably where the unlimited setting for root is coming from.

Best,
Bill.

--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445

On 4/15/18, 1:26 PM, "slurm-users on behalf of Mahmood Naderan" <slurm-use...@lists.schedmd.com on behalf of mahmo...@gmail.com> wrote:

I actually have disabled the swap partition (!) since the system goes
really bad and based on my experience I have to enter the room and
reset the affected machine (!). Otherwise I have to wait for long
times to see it get back to normal.

When I ssh to the node with root user, the ulimit -a says unlimited
virtual memory. So, it seems that the root have unlimited value while
users have limited value.

Regards,
Mahmood


On Sun, Apr 15, 2018 at 10:26 PM, Ole Holm Nielsen
<Ole.H....@fysik.dtu.dk> wrote:
> Hi Mahmood,
>
> It seems your compute node is configured with this limit:
>
> virtual memory (kbytes, -v) 72089600
>
> So when the batch job tries to set a higher limit (ulimit -v 82089600) than
> permitted by the system (72089600), this must surely get rejected, as you
> have discovered!
>
> You may want to reconfigure your compute nodes' limits, for example by
> setting the virtual memory limit to "unlimited" in your configuration. If
> the nodes has a very small RAM memory + swap space size, you might encounter
> Out Of Memory errors...
>
> /Ole


------------------------------

Message: 3
Date: Sun, 15 Apr 2018 23:01:32 +0430
From: Mahmood Naderan <mahmo...@gmail.com>
To: Ole.H....@fysik.dtu.dk, Slurm User Community List
<slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] ulimit in sbatch script
Message-ID:
<CADa2P2U-9Pxm0oPT-DkmjzBDa66uk2z=tr-69X=p5WOa...@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

BTW, the memory size of the node is 64GB.
Regards,
Mahmood


On Sun, Apr 15, 2018 at 10:56 PM, Mahmood Naderan <mahmo...@gmail.com> wrote:
> I actually have disabled the swap partition (!) since the system goes
> really bad and based on my experience I have to enter the room and
> reset the affected machine (!). Otherwise I have to wait for long
> times to see it get back to normal.
>
> When I ssh to the node with root user, the ulimit -a says unlimited
> virtual memory. So, it seems that the root have unlimited value while
> users have limited value.
>
> Regards,
> Mahmood
>
>
>
>
> On Sun, Apr 15, 2018 at 10:26 PM, Ole Holm Nielsen
> <Ole.H....@fysik.dtu.dk> wrote:
>> Hi Mahmood,
>>
>> It seems your compute node is configured with this limit:
>>
>> virtual memory (kbytes, -v) 72089600
>>
>> So when the batch job tries to set a higher limit (ulimit -v 82089600) than
>> permitted by the system (72089600), this must surely get rejected, as you
>> have discovered!
>>
>> You may want to reconfigure your compute nodes' limits, for example by
>> setting the virtual memory limit to "unlimited" in your configuration. If
>> the nodes has a very small RAM memory + swap space size, you might encounter
>> Out Of Memory errors...
>>
>> /Ole

------------------------------

Message: 4
Date: Sun, 15 Apr 2018 23:11:20 +0430
From: Mahmood Naderan <mahmo...@gmail.com>
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] ulimit in sbatch script
Message-ID:
<CADa2P2XTFSztdtW2_drBtXkK...@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"

Excuse me... I think the problem is not pam.d.
How do you interpret the following output?


[hamid@rocks7 case1_source2]$ sbatch slurm_script.sh
Submitted batch job 53
[hamid@rocks7 case1_source2]$ tail -f hvacSteadyFoam.log
max memory size (kbytes, -m) 65536000
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) 72089600
file locks (-x) unlimited
^C
[hamid@rocks7 case1_source2]$ squeue
JOBID PARTITION NAME USER ST TIME NODES
NODELIST(REASON)
53 CLUSTER hvacStea hamid R 0:27 1 compute-0-3
[hamid@rocks7 case1_source2]$ ssh compute-0-3
Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Last login: Sun Apr 15 23:03:29 2018 from rocks7.local
Rocks Compute Node
Rocks 7.0 (Manzanita)
Profile built 19:21 11-Apr-2018

Kickstarted 19:37 11-Apr-2018
[hamid@compute-0-3 ~]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256712
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
[hamid@compute-0-3 ~]$

As you can see, the log file where I put "ulimit -a" before the main
command says limited virtual memory. However, when I login to the
node, it says unlimited!

Regards,
Mahmood


On Sun, Apr 15, 2018 at 11:01 PM, Bill Barth <bba...@tacc.utexas.edu> wrote:
> Are you using pam_limits.so in any of your /etc/pam.d/ configuration files? That would be enforcing /etc/security/limits.conf for all users which are usually unlimited for root. Root’s almost always allowed to do stuff bad enough to crash the machine or run it out of resources. If the /etc/pam.d/sshd file has pam_limits.so in it, that’s probably where the unlimited setting for root is coming from.
>
> Best,
> Bill.

------------------------------

Message: 5
Date: Sun, 15 Apr 2018 19:02:48 +0000
From: Bill Barth <bba...@tacc.utexas.edu>
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] ulimit in sbatch script
Message-ID: <9A10D099-77FD-4892...@tacc.utexas.edu>
Content-Type: text/plain; charset="utf-8"

Mahmood, sorry to presume. I meant to address the root user and your ssh to the node in your example.

At our site, we use UsePAM=1 in our slurm.conf, and our /etc/pam.d/slurm and slurm.pam files both contain pam_limits.so, so it could be that way for you, too. I.e. Slurm could be setting the limits for jobscripts for your users, but for root SSHes, where that’s being set by PAM through another config file. Also, root’s limits are potentially differently set by PAM (in /etc/security/limits.conf) or the kernel at boot time.

Finally, users should be careful using ulimit in their job scripts b/c that can only change the limits for that shell script process and not across nodes. That jobscript appears to only apply to one node, but if they want different limits for jobs that span nodes, they may need to use other features of SLURM to get them across all the nodes their job wants (cgroups, perhaps?).
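
(A rough sketch of the pieces involved, for anyone following along; the paths are common defaults and
the limit value is only illustrative, so check your own site's files:)

# slurm.conf: run job steps through PAM
#     UsePAM=1
# /etc/pam.d/slurm (or slurm.pam): apply /etc/security/limits.conf to those steps
#     session    required    pam_limits.so
# /etc/security/limits.conf: e.g. a hard address-space cap for ordinary users (kbytes)
#     *    hard    as    72089600
# Quick checks on a compute node:
grep -i '^UsePAM' /etc/slurm/slurm.conf
grep pam_limits /etc/pam.d/slurm /etc/pam.d/sshd 2>/dev/null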

Best,
Bill.

--
Bill Barth, Ph.D., Director, HPC
bba...@tacc.utexas.edu | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445

On 4/15/18, 1:41 PM, "slurm-users on behalf of Mahmood Naderan" <slurm-use...@lists.schedmd.com on behalf of mahmo...@gmail.com> wrote:

Excuse me... I think the problem is not pam.d.
How do you interpret the following output?


[hamid@rocks7 case1_source2]$ sbatch slurm_script.sh
Submitted batch job 53
[hamid@rocks7 case1_source2]$ tail -f hvacSteadyFoam.log
max memory size (kbytes, -m) 65536000
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) 72089600
file locks (-x) unlimited
^C
[hamid@rocks7 case1_source2]$ squeue
JOBID PARTITION NAME USER ST TIME NODES
NODELIST(REASON)
53 CLUSTER hvacStea hamid R 0:27 1 compute-0-3
[hamid@rocks7 case1_source2]$ ssh compute-0-3
Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Last login: Sun Apr 15 23:03:29 2018 from rocks7.local
Rocks Compute Node
Rocks 7.0 (Manzanita)
Profile built 19:21 11-Apr-2018

Kickstarted 19:37 11-Apr-2018
[hamid@compute-0-3 ~]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 256712
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
[hamid@compute-0-3 ~]$

As you can see, the log file where I put "ulimit -a" before the main
command says limited virtual memory. However, when I login to the
node, it says unlimited!

Regards,
Mahmood


On Sun, Apr 15, 2018 at 11:01 PM, Bill Barth <bba...@tacc.utexas.edu> wrote:
> Are you using pam_limits.so in any of your /etc/pam.d/ configuration files? That would be enforcing /etc/security/limits.conf for all users which are usually unlimited for root. Root’s almost always allowed to do stuff bad enough to crash the machine or run it out of resources. If the /etc/pam.d/sshd file has pam_limits.so in it, that’s probably where the unlimited setting for root is coming from.
>
> Best,
> Bill.


End of slurm-users Digest, Vol 6, Issue 21
******************************************

Michael Di Domenico

Apr 16, 2018, 12:51:48 PM
to Slurm User Community List
On Mon, Apr 16, 2018 at 6:35 AM, <Marius.C...@sony.com> wrote:
>
> According to the above I have the backfill scheduler enabled with CPUs and Memory configured as
> resources. I have 56 CPUs and 256GB of RAM in my resource pool. I would expect that he backfill
>scheduler attempts to allocate the resources in order to fill as much of the cores as possible if there
> are multiple processes asking for more resources than available. In my case I have the following queue:
>
perhaps i missed something in the email, but it sounds like you have
56 cores, you have two running jobs that consume 52 cores, leaving you
four free. then a third job came along and requested 20 cores (based
on the show job output). slurm doesn't overcommit resources, so a
20 cpu job will not fit if there are only four cpus free

Benjamin Redling

Apr 17, 2018, 6:51:38 AM
to slurm...@lists.schedmd.com
Hello,
No. From the original mail:
<--- %< --->
"scontrol show nodes cn_burebista" gives me the following:

NodeName=cn_burebista Arch=x86_64 CoresPerSocket=14
CPUAlloc=32
<--- %< --->

Jobs 2356 and 2357 use 32 CPUs in total, assuming the original poster gave
the right numbers.
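
A quick way to double-check that from squeue (job IDs taken from the thread; the format string just
selects job id, requested CPUs, and state):

# 16 + 16 = 32 CPUs allocated to the two running jobs, matching CPUAlloc=32
squeue -j 2356,2357 -o "%.8i %.5C %.10T"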

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323

Marius.C...@sony.com

Apr 19, 2018, 4:34:41 AM
to slurm...@lists.schedmd.com
> Date: Mon, 16 Apr 2018 12:50:57 -0400
> From: Michael Di Domenico <mdidom...@gmail.com>

> To: Slurm User Community List <slurm...@lists.schedmd.com>
> Subject: Re: [slurm-users] slurm jobs are pending but resources are
> available
> Message-ID:
> <CABOsP2OzEZBKHuMtRv8QTQ6Qq4DMedXMGCPFhU=QqGyG...@mail.gmail.com>
> Content-Type: text/plain; charset="UTF-8"

> perhaps i missed something in the email, but it sounds like you have
> 56 cores, you have two running jobs that consume 52 cores, leaving you

> four free. then a third job came along and requested 20 cores (based
> on the show job output). slurm doesn't overcommit resources, so a
> 20 cpu job will not fit if there are only four cpus free

I think you might have missed something in the email: I do have 56 cores, and I "request"
52 in total, two jobs asking for 16 cores each and one asking for 20, as can
be clearly seen from the logs I have "attached" in my message.

The jobs requesting 16 cores come from the same user and are properly allocated, but
the one requesting 20 cores comes from another user and is put in the pending state, although
there are still 24 cores available.
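
(For completeness, the free-CPU arithmetic can be read straight off the node record; a one-liner such
as the following, with the subtraction done by eye, is enough:)

# CPUTot=56, CPUAlloc=32  ->  56 - 32 = 24 CPUs free, which is more than the 20 requested
scontrol show node cn_burebista | grep -Eo 'CPUAlloc=[0-9]+|CPUTot=[0-9]+'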
