[slurm-users] Rate-limiting sbatch and srun


gphipps

Jul 18, 2022, 6:46:50 PM
to slurm...@lists.schedmd.com

Hi

Every so often one of our users accidentally writes a “fork-bomb” that submits thousands of sbatch and srun requests per second. It is effectively a giant DDoS attack on our scheduler. Is there a way of rate-limiting these requests before they reach the daemon? I could imagine writing a shim in front of sbatch/srun, but I was hoping there was an official way to do this.

 

Cheers

Geoff

Ole Holm Nielsen

Jul 19, 2022, 2:16:29 AM
to slurm...@lists.schedmd.com
Perhaps setting MaxSubmitJobs and MaxJobs on associations and QOSes would
do the trick?

You may also want to increase the default MaxJobCount in slurm.conf.

See my Wiki page for the details:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#maxjobcount-limit
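As a rough illustration of the association limits mentioned above, here is a small Python sketch that only builds (and does not run) a sacctmgr command line. The helper name, the user name and the numbers are made up for the example. Note that these limits cap how many jobs a user can have queued or running, not the rate of submission, so they bound the damage rather than throttle the RPCs.

```python
import shlex


def submit_limit_command(user: str, max_submit: int, max_running: int) -> list[str]:
    """Build a sacctmgr command that caps a user's queued and running jobs.

    Hypothetical helper: it returns the argument list and never executes it.
    """
    if max_running > max_submit:
        raise ValueError("MaxJobs should not exceed MaxSubmitJobs")
    return [
        "sacctmgr", "modify", "user", "where", f"name={user}",
        "set", f"MaxSubmitJobs={max_submit}", f"MaxJobs={max_running}",
    ]


if __name__ == "__main__":
    # Print the command so an admin can review it before running it by hand.
    print(shlex.join(submit_limit_command("alice", 500, 100)))
```

Once the limit is reached, further submissions from that user are refused until some of their jobs finish.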

/Ole

Ole Holm Nielsen

Jul 19, 2022, 4:49:57 AM
to slurm...@lists.schedmd.com
Another possibility would be to write a Job submit Lua plugin to reject
jobs before they get submitted. Of course, you would have to be able to
define some logic which somehow detects the "fork-bomb" situation, which
may not be so easy to do? See
https://slurm.schedmd.com/job_submit_plugins.html

I have some additional pointers to job submit plugins at
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#job-submit-plugins
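One possible shape for the detection logic is a per-user sliding window over recent submission times. The sketch below is in Python purely to show the idea; a real job submit plugin would be written in Lua, and the class name, thresholds and uid values here are assumptions for illustration only.

```python
from collections import defaultdict, deque


class BurstDetector:
    """Allow at most max_submits submissions per user in any sliding window."""

    def __init__(self, max_submits: int, window_seconds: float):
        self.max_submits = max_submits
        self.window_seconds = window_seconds
        # uid -> timestamps of that user's recently accepted submissions
        self._recent = defaultdict(deque)

    def allow(self, uid: int, now: float) -> bool:
        """Return True to accept the submission, False to reject it."""
        recent = self._recent[uid]
        # Forget submissions that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_submits:
            return False
        recent.append(now)
        return True


if __name__ == "__main__":
    detector = BurstDetector(max_submits=3, window_seconds=1.0)
    for t in (0.0, 0.1, 0.2, 0.3):
        print(t, detector.allow(1000, t))
```

Only accepted submissions are recorded, so a runaway script is held to the configured rate instead of being locked out for as long as it keeps retrying. Choosing thresholds that do not also catch legitimate bulk submissions is the hard part.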

/Ole

Christopher Samuel

Jul 20, 2022, 12:39:53 AM
to slurm...@lists.schedmd.com
On 7/18/22 3:45 pm, gphipps wrote:

> Every so often one of our users accidentally writes a “fork-bomb”
> that submits thousands of sbatch and srun requests per second. It is
> effectively a giant DDoS attack on our scheduler. Is there a way of
> rate-limiting these requests before they reach the daemon?

Yes, there is: you can use the Slurm cli_filter plugin interface to do this.

https://slurm.schedmd.com/cli_filter_plugins.html

If you use the Lua plugin you can write whatever logic you need in it,
though it would need careful thought. You would need somewhere on the
node to store state (writable by users), a way of counting the frequency
of each user's RPCs, a way of introducing increasing delays (up to a
point) when the rate is out of control, and a way of decaying that delay
back down when the RPCs from that user cease or decrease.
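To make the delay-and-decay idea concrete, here is a minimal sketch in Python. It is not a cli_filter plugin (that would be Lua); it only illustrates the bookkeeping. All names, constants and the JSON state file layout are assumptions made up for the example, and there is no file locking, so concurrent submissions would race on the state file.

```python
import json
import time

BASE_DELAY = 0.1   # first delay, in seconds, once submissions come too fast
MAX_DELAY = 30.0   # the "up to a point" cap on the delay
BURST_GAP = 1.0    # submissions closer together than this count as rapid
HALF_LIFE = 5.0    # quiet seconds needed to halve the current delay


def next_delay(prev_delay: float, gap: float) -> float:
    """Return the delay to impose, given the previous delay and the
    time in seconds since this user's previous submission."""
    if gap < BURST_GAP:
        # Rapid submissions: double the delay, up to the cap.
        return min(MAX_DELAY, max(BASE_DELAY, prev_delay * 2))
    # Quiet period: decay exponentially, dropping to zero once negligible.
    decayed = prev_delay * 0.5 ** (gap / HALF_LIFE)
    return decayed if decayed >= BASE_DELAY else 0.0


def throttle(state_path: str, now=None, sleep=time.sleep) -> float:
    """Update the per-user state file, sleep if needed, return the delay."""
    now = time.time() if now is None else now
    try:
        with open(state_path) as f:
            state = json.load(f)
    except (OSError, ValueError):
        state = {}  # missing or corrupt state: treat as a first submission
    if not isinstance(state, dict):
        state = {}
    last = state.get("last")
    gap = float("inf") if last is None else max(0.0, now - last)
    delay = next_delay(float(state.get("delay", 0.0)), gap)
    with open(state_path, "w") as f:
        json.dump({"last": now, "delay": delay}, f)
    if delay > 0:
        sleep(delay)
    return delay
```

Because the state file is writable by the user, this only protects against accidents; anyone who wants to bypass it can simply delete the file.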

All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

