[slurm-users] Using oversubscribe to hammer a node


Groner, Rob

Jan 19, 2023, 11:24:24 AM
to slurm...@lists.schedmd.com
I'm trying to set up a specific partition where users can fight with the OS for dominance. The oversubscribe property sounds like what I want, as it says "More than one job can execute simultaneously on the same compute resource." That's exactly what I want. I've set up a node with 48 CPUs and oversubscribe set to force:4. I then execute a job that requests 48 CPUs, and that starts running. I execute another job asking for 48 cores, and it gets assigned to the node...but it is not running, it's suspended. I can execute 2 more jobs, and they'll all go on the node (so, 4x), but 3 will be suspended at any time. I see the time slicing going on, but that isn't what I thought it would be...I thought all 4 tasks per CPU would be running at the same time. Basically, I want the CPU/OS to work out the sharing of resources. Otherwise, if one of the tasks that is running is just sitting there doing nothing, it's going to do that for its 30 seconds while other tasks are suspended, right?

What I want to see is 4x the node's CPUs in tasks all running at the same time, not time slicing, just for jobs using this partition. Is that a thing?
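
As a rough illustration of the difference, here is a toy Python model (not Slurm code) of the two behaviours described above. The 4 jobs and the 30-second slice come from the description; the function names and the simple round-robin order are assumptions made only for this sketch.

```python
# Toy model, not Slurm code. It contrasts the observed behaviour
# (one job running per time slice, the others suspended) with the
# wanted behaviour (all jobs running, the OS shares the CPUs).
# Function names and the round-robin order are illustrative assumptions.

def time_sliced_states(n_jobs, slice_seconds, t):
    """Job states at time t when exactly one job runs per time slice."""
    running = (t // slice_seconds) % n_jobs
    return ["RUNNING" if j == running else "SUSPENDED" for j in range(n_jobs)]


def concurrent_states(n_jobs):
    """Job states when every job runs and the OS scheduler shares the CPUs."""
    return ["RUNNING"] * n_jobs


if __name__ == "__main__":
    for t in (0, 30, 60, 90, 120):
        print(t, time_sliced_states(4, 30, t))
    print("wanted:", concurrent_states(4))
```

In the first model three of the four jobs are suspended at every instant, which matches what I see. In the second, nothing is ever suspended and an idle task simply leaves its CPU time to the others.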

Thanks.

Loris Bennett

Jan 20, 2023, 1:48:36 AM
to Slurm User Community List
Hi Rob,

"Groner, Rob" <rug...@psu.edu> writes:

> I'm trying to set up a specific partition where users can fight with the OS for dominance. The oversubscribe property sounds like what I want, as it says
> "More than one job can execute simultaneously on the same compute resource." That's exactly what I want. I've set up a node with 48 CPUs and
> oversubscribe set to force:4. I then execute a job that requests 48 CPUs, and that starts running. I execute another job asking for 48 cores, and it gets
> assigned to the node...but it is not running, it's suspended. I can execute 2 more jobs, and they'll all go on the node (so, 4x), but 3 will be suspended at
> any time. I see the time slicing going on, but that isn't what I thought it would be...I thought all 4 tasks per CPU would be running at the same time.
> Basically, I want the CPU/OS to work out the sharing of resources. Otherwise, if one of the tasks that is running is just sitting there doing nothing, it's
> going to do that for its 30 seconds while other tasks are suspended, right?

Is --oversubscribe set for the jobs?

> What I want to see is 4x the node's CPUs in tasks all running at the same time, not time slicing, just for jobs using this partition. Is that a thing?

It might be a thing. I'm not sure it is a very sensible thing. Time
slicing and context switching are still going to take place, with each
process getting a quarter of a core on average. It is not clear that
you will actually increase throughput this way. I would probably first
turn on hyperthreading to deal with jobs which have intermittent
CPU usage.
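
The "quarter of a core" figure can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes every task is CPU-bound and that the OS shares the cores evenly, and the function name is made up for the example.

```python
# Back-of-the-envelope sketch: average core share per task when several
# jobs run at once on the same node. Assumes all tasks are CPU-bound and
# the OS shares cores evenly; the function name is illustrative only.

def average_core_share(cpus, jobs, tasks_per_job):
    """Average fraction of one core available to each task."""
    total_tasks = jobs * tasks_per_job
    return min(1.0, cpus / total_tasks)


share = average_core_share(cpus=48, jobs=4, tasks_per_job=48)
aggregate = share * 4 * 48  # total cores' worth of work being done

print(share)      # 0.25
print(aggregate)  # 48.0
```

The aggregate stays at 48 cores' worth of work, the same as for a single 48-task job, so under these assumptions the oversubscription changes who gets the cores rather than how much gets done.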

Still, since Slurm offers the possibility of oversubscription, I assume
there must be a use-case.

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin

Groner, Rob

Jan 20, 2023, 9:37:13 AM
to Slurm User Community List
Don't worry, I'm well past the "is this a sensible thing" stage. Let's just call it an experiment.

I have oversubscribe=FORCE:4 set on the partition, and nothing set on the sbatch command itself. With that setting, I can submit a job that requires all of the node's cores four times, and all four jobs are placed on that node. When I submit a 5th job, it goes pending for resources. But in the meantime, only one of the jobs is running at any given time; the rest are suspended. That's just not what I would have expected from "more than one job can execute simultaneously on the same compute resources." I don't consider them to be executing simultaneously if they're suspended.

Rob

