[slurm-users] Queue size, slow/unresponsive head node


Colas Rivière

Jan 11, 2018, 4:40:47 PM
to slurm...@schedmd.com
Hello,

I'm managing a small cluster (one head node, 24 workers, 1160 total
worker threads). The head node has two E5-2680 v3 CPUs (hyper-threaded),
~100 GB of memory and spinning disks.
The head node occasionally becomes less responsive when there are more
than 10k jobs in the queue, and becomes really unmanageable when the
queue reaches 100k jobs, with error messages such as:
> sbatch: error: Slurm temporarily unable to accept job, sleeping and
> retrying.
or
> Running: slurm_load_jobs error: Socket timed out on send/recv operation
Is it normal to experience slowdowns when the queue reaches a few tens
of thousands of jobs? What limit should I expect? Would adding an SSD
drive for SlurmdSpoolDir help? What can be done to push this limit?

The cluster runs Slurm 17.02.4 on CentOS 6 and the config is attached
(from `scontrol show config`).
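
For reference, the backlog can be watched with something like this
(a rough sketch; flags and job states may need adjusting):

# count jobs currently in the queue (pending + running)
squeue -h -t pending,running | wc -l

# scheduler and RPC statistics from slurmctld, including server thread count
sdiag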

Thanks,
Colas
slurm.cfg

Nicholas C Santucci

Jan 12, 2018, 12:26:59 AM
to Slurm User Community List, slurm...@schedmd.com
Why do you have `SchedulerParameters = (null)`?
--
Nick Santucci

John DeSantis

Jan 12, 2018, 9:09:22 AM
to slurm...@schedmd.com
Colas,

We had a similar experience a long time ago, and we solved it by adding
the following SchedulerParameters:

max_rpc_cnt=150,defer
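
For reference, that goes on the SchedulerParameters line in slurm.conf
on the controller, roughly like this (a sketch; the config path is
assumed and may differ on your install):

# in /etc/slurm/slurm.conf (path assumed) on the slurmctld host
SchedulerParameters=max_rpc_cnt=150,defer

# pick up the change without restarting the daemons
scontrol reconfigure

# confirm it took effect
scontrol show config | grep SchedulerParameters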

HTH,
John DeSantis

Colas Rivière

Jan 12, 2018, 2:01:19 PM
to slurm...@lists.schedmd.com
Nicholas,


> Why do you have `SchedulerParameters = (null)`?
I did not set these parameters, so I assume "(null)" means all the default values are used.

John,

Thanks, I'll try that and look into the SchedulerParameters options more.

Cheers,
Colas