[slurm-users] Resource Limits


Hoot Thompson

Apr 19, 2023, 6:16:34 PM
to slurm...@lists.schedmd.com
Is there a ‘how to’ or recipe document for setting up and enforcing resource limits? I can establish accounts, users, and set limits but 'current value' is not incrementing after running jobs.

Thanks in advance

Ole Holm Nielsen

Apr 20, 2023, 2:10:46 AM
to slurm...@lists.schedmd.com
Hi Hoot,

On 4/20/23 00:15, Hoot Thompson wrote:
> Is there a ‘how to’ or recipe document for setting up and enforcing resource limits? I can establish accounts, users, and set limits but 'current value' is not incrementing after running jobs.

I have written about resource limits in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#partition-limits

IHTH,
Ole

Hoot Thompson

Apr 20, 2023, 11:01:16 AM
to Slurm User Community List
Thank you for this. I’ll give it a read but no promises that I won’t be back with more questions!

Hoot

Hoot Thompson

Apr 20, 2023, 12:33:46 PM
to Slurm User Community List
Ole,

Earlier I found your Slurm_tools posting and found it very useful. My problem remains: ‘current value’ is not incrementing, even after making the needed changes to slurm.conf.


./showuserlimits -u ubuntu
scontrol -o show assoc_mgr users=ubuntu account=testing flags=Assoc

Association (Parent account):
   ClusterName = dev-uid-testing
       Account = testing
      UserName =
     Partition =
      Priority = 0
            ID = 6
SharesRaw/Norm/Level/Factor = 1/18446744073709551616.00/1/0.00
UsageRaw/Norm/Efctv = 0.00/1.00/0.00
 ParentAccount = root, current value = 1
           Lft = 2
      DefAssoc = No
       GrpJobs =
 GrpJobsAccrue =
 GrpSubmitJobs =
       GrpWall =
       GrpTRES =
           cpu: Limit = 1500, current value = 0
   GrpTRESMins =
GrpTRESRunMins =
       MaxJobs =
 MaxJobsAccrue =
 MaxSubmitJobs =
     MaxWallPJ =
     MaxTRESPJ =
     MaxTRESPN =
 MaxTRESMinsPJ =
 MinPrioThresh =

Association (User):
   ClusterName = dev-uid-testing
       Account = testing
      UserName = ubuntu, UID=1000
     Partition =
      Priority = 0
            ID = 9
SharesRaw/Norm/Level/Factor = 1/18446744073709551616.00/1/0.00
UsageRaw/Norm/Efctv = 0.00/1.00/0.00
 ParentAccount =
           Lft = 3
      DefAssoc = Yes
       GrpJobs =
 GrpJobsAccrue =
 GrpSubmitJobs =
       GrpWall =
       GrpTRES =
           cpu: Limit = 1500, current value = 0
   GrpTRESMins =
           cpu: Limit = 1000, current value = 0
GrpTRESRunMins =
       MaxJobs =
 MaxJobsAccrue =
 MaxSubmitJobs =
     MaxWallPJ =
     MaxTRESPJ =
     MaxTRESPN =
 MaxTRESMinsPJ =
 MinPrioThresh =

Slurm share information:
Account                    User  RawShares  NormShares    RawUsage  EffectvUsage  FairShare
-------------------- ---------- ---------- ----------- ----------- ------------- ----------
testing                  ubuntu                                  0      0.000000   0.000000
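
For anyone who wants to track these numbers over time rather than read them by eye, here is a minimal Python sketch that pulls the "Limit" and "current value" pairs out of text in the format shown above. The function name `parse_tres_limits` and the `SAMPLE` text are made up for illustration; the parser assumes the layout printed by showuserlimits in this post and has not been checked against other Slurm versions.

```python
import re

# A TRES line such as "cpu: Limit = 1500, current value = 0"
LIMIT_LINE = re.compile(r"^\s*(\w+): Limit = (\d+), current value = (\d+)\s*$")
# A field line such as "GrpTRES =" or "Priority = 0"
FIELD_LINE = re.compile(r"^\s*([\w/]+)\s*=")


def parse_tres_limits(text):
    """Return (field, tres, limit, current) tuples found in the text."""
    results = []
    field = None
    for line in text.splitlines():
        limit = LIMIT_LINE.match(line)
        if limit:
            if field is not None:
                results.append(
                    (field, limit.group(1), int(limit.group(2)), int(limit.group(3)))
                )
            continue
        name = FIELD_LINE.match(line)
        if name:
            # Remember which field the following TRES lines belong to.
            field = name.group(1)
    return results


# Hypothetical sample, copied from the layout in this thread.
SAMPLE = """
       GrpTRES =
           cpu: Limit = 1500, current value = 0
   GrpTRESMins =
           cpu: Limit = 1000, current value = 0
GrpTRESRunMins =
"""

if __name__ == "__main__":
    for field, tres, limit, current in parse_tres_limits(SAMPLE):
        print(f"{field} {tres}: {current} of {limit}")
```

Blank lines between fields are ignored, so the same function should also cope with double-spaced copies of the output.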



Clearly I’m still missing something or I don’t understand how it’s supposed to work.

Hoot




Ole Holm Nielsen

Apr 20, 2023, 1:01:49 PM
to slurm...@lists.schedmd.com
On 20-04-2023 18:23, Hoot Thompson wrote:
> Ole,
>
> Earlier I found your Slurm_tools posting and found it very useful. This
> remains my problem, ‘current value’ not incrementing even after making
> needed changes to slurm.conf.

The ‘current value’ refers to those jobs that are currently running.
Does that answer your question?

/Ole

Hoot Thompson

Apr 20, 2023, 1:05:26 PM
to Ole.H....@fysik.dtu.dk, Slurm User Community List
Ahhhhh, I thought that was the aggregate of past and current jobs.

Hoot Thompson

Apr 20, 2023, 1:08:09 PM
to Ole.H....@fysik.dtu.dk, Slurm User Community List
And it indeed does show current value for a running job!! Do I feel stupid :-)

Hoot Thompson

Apr 20, 2023, 1:27:57 PM
to Ole.H....@fysik.dtu.dk, Slurm User Community List
So, an update: GrpTRES registers a value while a job is running, but GrpTRESMins does not. So I still have something wrong. In the docs, GrpTRESMins reads as if it is in fact an aggregate number.
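
To make the distinction concrete, here is a small Python sketch of the two kinds of numbers being discussed: a count of CPUs held by running jobs (what the GrpTRES "current value" turned out to be) versus CPU-minutes accumulated across all jobs (how GrpTRESMins reads in the docs). This is a simplified model, not Slurm's implementation: the `Job` class and both function names are hypothetical, and the sketch ignores anything Slurm may apply on top, such as usage decay or accounting enforcement settings.

```python
from dataclasses import dataclass


@dataclass
class Job:
    cpus: int               # CPUs allocated to the job
    elapsed_minutes: float  # wall time used so far
    running: bool           # True while the job is still running


def running_cpus(jobs):
    """CPUs held right now: only running jobs contribute."""
    return sum(job.cpus for job in jobs if job.running)


def cpu_minutes(jobs):
    """Aggregate CPU-minutes: finished and running jobs both contribute."""
    return sum(job.cpus * job.elapsed_minutes for job in jobs)


# Hypothetical workload: one finished 4-CPU job, one running 2-CPU job.
jobs = [
    Job(cpus=4, elapsed_minutes=30, running=False),
    Job(cpus=2, elapsed_minutes=10, running=True),
]

if __name__ == "__main__":
    print("CPUs in use now:", running_cpus(jobs))  # 2
    print("CPU-minutes used:", cpu_minutes(jobs))  # 4*30 + 2*10 = 140
```

In this model the first number drops back to 0 as soon as the last job ends, while the second keeps growing, which matches the behaviour expected from a running-jobs counter versus an aggregate.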

Jason Simms

Apr 20, 2023, 2:11:55 PM
to Slurm User Community List
Hello Ole and Hoot,

First, Hoot, thank you for your question. I've managed Slurm for a few years now and still feel like I don't have a great understanding about managing or limiting resources.

Ole, thanks for your continued support of the user community with your documentation. I do wish not only that more of your information were contained within the official docs, but also that there were even clearer discussions around certain topics.

As an example, you write that "It is important to configure slurm.conf so that the locked memory limit isn’t propagated to the batch jobs" by setting PropagateResourceLimitsExcept=MEMLOCK. It's unclear to me whether you are suggesting that literally everyone should have that set, or whether it only applies to certain configurations. We don't have it set, for instance, but we've not run into trouble with jobs failing due to locked memory errors.

Then, in the official docs, to which you link, it says that "it may also be desirable to lock the slurmd daemon's memory to help ensure that it keeps responding if memory swapping begins" by creating /etc/sysconfig/slurm containing the line SLURMD_OPTIONS="-M". Would there ever be a reason *not* to include that? That is, I can't think it would ever be desirable for slurmd to stop responding. So is that another "universal" recommendation, I wonder?

It may be me talking as a new-ish user, but I would find it helpful to have a concise document laying out common or useful configuration options for setting up or reconfiguring Slurm. I'm certain I have inefficient or missing options that I should have.

Warmest regards,
Jason
--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research Computing
Swarthmore College
Information Technology Services
Schedule a meeting: https://calendly.com/jlsimms

Ole Holm Nielsen

Apr 21, 2023, 4:44:21 AM
to slurm...@lists.schedmd.com
Hi Jason,

On 4/20/23 20:11, Jason Simms wrote:
> Hello Ole and Hoot,
>
> First, Hoot, thank you for your question. I've managed Slurm for a few
> years now and still feel like I don't have a great understanding about
> managing or limiting resources.
>
> Ole, thanks for your continued support of the user community with your
> documentation. I do wish not only that more of your information were
> contained within the official docs, but also that there were even clearer
> discussions around certain topics.
>
> As an example, you write that "It is important to configure slurm.conf so
> that the locked memory limit isn’t propagated to the batch jobs" by
> setting PropagateResourceLimitsExcept=MEMLOCK. It's unclear to me whether
> you are suggesting that literally everyone should have that set, or
> whether it only applies to certain configurations. We don't have it set,
> for instance, but we've not run into trouble with jobs failing due to
> locked memory errors.

The link mentioned in the page hopefully explains it:
https://slurm.schedmd.com/faq.html#memlock
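
One way to judge whether this matters at a given site is to compare the locked-memory limit seen in a login shell with the one seen inside a job. Below is a minimal Python sketch using the standard `resource` module; the function name `describe_memlock_limit` is made up for illustration, and the sketch assumes a Linux system where `RLIMIT_MEMLOCK` is available.

```python
import resource


def describe_memlock_limit():
    """Report the soft and hard locked-memory limits of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def fmt(value):
        # RLIM_INFINITY means no limit is imposed.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value} bytes"

    return f"memlock soft={fmt(soft)} hard={fmt(hard)}"


if __name__ == "__main__":
    print(describe_memlock_limit())
```

Running the script once on a login node and once through the scheduler (for example under srun) shows whether the two environments end up with different limits.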

> Then, in the official docs, to which you link, it says that "it may also
> be desirable to lock the slurmd daemon's memory to help ensure that it
> keeps responding if memory swapping begins" by creating
> /etc/sysconfig/slurm containing the line SLURMD_OPTIONS="-M". Would there
> ever be a reason *not* to include that? That is, I can't think it would
> ever be desirable for slurmd to stop responding. So is that another
> "universal" recommendation, I wonder?

I'm not an expert on locking slurmd pages! The -M option is documented in
the slurmd manual page, and I probably read a thread about it long ago on
the slurm-users mailing list. You could try it out in
your environment and see if all is well.

> It may be me talking as a new-ish user, but I would find a concise
> document laying out common or useful configuration options to be presented
> when setting up or reconfiguring Slurm. I'm certain I have inefficient or
> missing options that I should have.

IMHO, most sites have their own requirements and preferences, so I don't
think there is a one-size-fits-all Slurm installation solution.

Since requirements can be so different, and because Slurm is fantastic
software that can be configured for many different scenarios, IMHO a
support contract with SchedMD is the best way to get consulting services,
get general help, and report bugs. We have had excellent experiences with
SchedMD support (https://www.schedmd.com/support.php).

Best regards,
Ole


Hoot Thompson

Apr 21, 2023, 10:42:12 AM
to Slurm User Community List
After assistance from an AWS colleague, GrpTRESMins seems to be working.

Hoot