I suspect your setting for "MaxJobCount" is too low.
MaxJobCount
    The maximum number of jobs SLURM can have in its active database at one time. Set the values of MaxJobCount and MinJobAge to ensure the slurmctld daemon does not exhaust its memory or other resources. Once this limit is reached, requests to submit additional jobs will fail. The default value is 5000 jobs. This value may not be reset via "scontrol reconfig". It only takes effect upon restart of the slurmctld daemon. May not exceed 65533.
so if you already have (by default) 5000 jobs being considered, the remaining aren't even looked at.
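A minimal slurm.conf sketch of what raising that limit might look like (the values below are illustrative, not recommendations, and should be tuned to your site):

```
# slurm.conf -- illustrative values only
MaxJobCount=50000   # active-job database limit (default 5000, may not exceed 65533)
MinJobAge=300       # seconds a completed job's record is kept before purging
```

Remember that MaxJobCount only takes effect after restarting slurmctld; "scontrol reconfig" will not apply it.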
Brian Andrus
Have you looked at the High Throughput Computing Administration Guide: https://slurm.schedmd.com/high_throughput.html
In particular, for this problem you may want to look at SchedulerParameters. I believe the scheduler defaults are quite conservative, and it will stop looking for jobs to run pretty quickly.
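As a sketch, some of the SchedulerParameters options discussed in that guide control how deep into the queue the schedulers look; the values here are illustrative, not recommendations:

```
# slurm.conf -- illustrative values; tune for your workload
SchedulerParameters=default_queue_depth=1000,partition_job_depth=500,bf_continue,bf_max_job_test=1000
```

Here default_queue_depth raises how many jobs the main scheduling loop examines per cycle (the default is 100, which can leave most of a deep queue unexamined), partition_job_depth limits how many jobs are considered per partition, and the bf_* options affect how many jobs the backfill scheduler tests.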
Mike Robbert
Cyberinfrastructure Specialist, Cyberinfrastructure and Advanced Research Computing
Information and Technology Solutions (ITS)
303-273-3786 | mrob...@mines.edu
Our values: Trust | Integrity | Respect | Responsibility
From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of David Henkemeyer <david.he...@gmail.com>
Date: Thursday, May 12, 2022 at 12:34
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: [External] Re: [slurm-users] Question about having 2 partitions that are mutually exclusive, but have unexpected interactions