[slurm-users] Custom Plugin Integration


Bhaskar Chakraborty via slurm-users

Jul 9, 2024, 4:17:56 AM
to slurm...@schedmd.com

Hello,

We wish to build a scheduling integration with Slurm. Our application has a backend system which decides the placement of jobs across hosts & CPU cores.
The backend takes some time to come back with a placement (possibly a few seconds) & we expect Slurm to update it regularly about any change in the current state of available resources.

For this we believe we have 3 options broadly:

  a. We use the cons_tres select plugin & modify it to let it query our backend system for job placements.
  b. We write our own select plugin, avoiding any other select plugin.
  c. We use an existing select plugin & also register our own plugin. The idea is that our plugin will cater to 'our' jobs (a specific partition, say) while all other jobs would be handled by the default plugin.

The problem with a> is that it leads to modifying existing plugin code & calling (our) library code from inside the select plugin library.

With b> the issue is that unless we have the full Slurm cluster to ourselves this isn't viable. Any insight on how to proceed with this? Where would our select plugin, assuming we need to write one, fit into the Slurm integration?

We are not sure whether c> is allowed in Slurm.

We went through the existing select plugins linear & cons_tres. However, we were not able to figure out how to use them or write something along similar lines to suit our purpose.
Any help in this regard is appreciated.

Apologies if this question (or a very similar one) has already been answered; if so, please point me to the relevant thread.

Thanks in advance for any pointers.

 

Regards,

Bhaskar.

Daniel Letai via slurm-users

Jul 12, 2024, 1:50:47 PM
to slurm...@lists.schedmd.com
I'm not sure I understand why your app must decide the placement, rather
than tell Slurm about the requirements (this sounds suspiciously like
Not Invented Here syndrome), but Slurm does have the '-w' flag to
salloc, sbatch and srun.
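
For instance, something along these lines (node names and the job script
are just placeholders):

    # pin a job to specific nodes with -w/--nodelist; core selection
    # within those nodes is still made by Slurm
    srun -w h1,h2 -n 4 ./app
    sbatch --nodelist=h1,h2 --ntasks=4 job.sh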


I also don't understand: if you don't have an entire cluster to
yourselves, how can you do a>, not to mention b> or c>? Any change to the
Slurm select mechanism is always site-wide.
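
The select plugin is a single cluster-wide setting in slurm.conf, e.g.:

    # slurm.conf - one SelectType for the whole cluster, no per-partition choice
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory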


I might be going out on a limb here, but I think Slurm would probably make
better placement choices than your self-developed app, if you can
communicate the requirements well enough.

How does your app choose placement and cores? Why can't it communicate
those requirements to Slurm instead of making the decision itself?

I can guess at some reasons, and there can be many, including but not
limited to: topology, heterogeneous hardware with different parts of the
app having different hardware requirements, some results placed on certain
nodes requiring follow-up jobs to run on the same nodes, NUMA
considerations for accelerator cards (including custom, mostly FPGA,
cards), etc.

If you describe the placement algorithm (in broad strokes), perhaps we
can find a Slurm solution that doesn't require breaking existing sites.
If that is the case, how much would it cost to 'degrade' your app to
communicate those requirements to Slurm instead of making the placement
decisions itself?


It's possible that you would be better off investing in developing a
monitoring solution that would cover the 'update it regularly about any
change in the current state of available resources' part.
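
The standard query tools already expose most of that state, for example:

    sinfo -N -o "%N %C %t"          # per-node CPUs as allocated/idle/other/total, plus state
    scontrol show node h1           # detailed state of a single node (h1 is a placeholder)
    squeue -t PD -o "%i %P %D %C"   # pending jobs with requested node/CPU counts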

Again, that is also ruled out if you use a site without total ownership
- no site will allow you to place jobs without first allocating you the
resources, no matter the scheduling solution, which brings us back to
using `salloc -w`.

That said, --nodelist has the downside of requesting nodes that might
not be available, causing your jobs to starve while resources are available.


Imagine the following scenario:


1. Your app gets resource availability from Slurm.

2. Your app starts calculating the placement.

3. Meanwhile Slurm allocates those resources.

4. The plugin communicates the need to recalculate placement.

5. Your app restarts its calculation.

6. Meanwhile Slurm allocates the resources your app was going to use
now, since it was never told to reserve anything for you.

...


On highly active clusters, with pending queues in the millions, such a
starvation scenario is not that far-fetched.


Best,

--Dani_L.


On 09/07/2024 11:15:51, Bhaskar Chakraborty via slurm-users wrote:
> [snip - original message quoted in full above]
--
Regards,

Daniel Letai
+972 (0)505 870 456


jubhaskar--- via slurm-users

Jul 15, 2024, 12:28:18 PM
to slurm...@lists.schedmd.com
Hi Daniel,
Thanks for picking up this query. Let me try to briefly describe my problem.

As you rightly guessed, we have some hardware on the backend which would be used to run our
jobs. The app which manages the h/w has its own set of resource placement/remapping
rules for placing a job.
So, for example, if only 3 hosts h1, h2, h3 (with 2 cores available each) are available at some point for a
4-core job, then only a few combinations of cores from these hosts can be allowed for
the job. There is also a preference order over the placements, decided by our app.
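
Just to make the shape of such a request concrete: if the backend picked h1 & h2 with
2 cores each, the equivalent Slurm submission (using standard flags; partition and
script names are placeholders) would look roughly like:

    sbatch -p ourpart -w h1,h2 -N 2 --ntasks=4 --ntasks-per-node=2 job.sh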

It's in this respect that we want our backend app to come up with the placement for the job.
Slurm would then dispatch the job accordingly, honoring the exact resource distribution
asked for. In case preemption is needed, our backend would likewise decide the placement,
which in turn determines which preemptable candidate jobs to preempt.

So, how should we proceed then?
We may not have the whole site/cluster to ourselves. There may be other jobs which we don't
care about, and those should go through the usual route via whichever select plugin is in place (linear, cons_tres, etc.).

Is there scope for a separate partition which encompasses our resources only & triggers our
plugin only for our jobs?
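
For the node-segregation part at least, I imagine a dedicated partition in slurm.conf,
something like the sketch below (node names and counts are placeholders), though as I
understand it a partition by itself doesn't change which select plugin runs:

    NodeName=h[1-3] CPUs=2 State=UNKNOWN
    PartitionName=ourpart Nodes=h[1-3] Default=NO MaxTime=INFINITE State=UP
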
How do the options a>, b>, c> (as described in my 1st message) stand now that I have described our requirement?

A 4th option which comes to mind: is there some API interface in Slurm which could inform
a separate process P (say) about resource availability on a real-time basis?
P would talk to our backend app, obtain a placement & then ask Slurm to place our job.
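
Roughly what I have in mind (our_backend_place is our own, hypothetical tool; other
names are placeholders):

    # sketch of process P
    avail=$(sinfo -p ourpart -N -h -o "%N %C %t")             # current per-node availability
    placement=$(our_backend_place --avail "$avail" job.spec)  # e.g. returns "h1,h3"
    sbatch -p ourpart -w "$placement" --ntasks=4 job.sh       # submit pinned to that placement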

Your concern about ever-changing resources (being allocated before our backend responds) doesn't apply here,
as the hosts are segregated as far as our system is concerned. Our hosts will run only our jobs & other Slurm
jobs would run on different hosts.

Hope that makes things a little clearer! Any help would be appreciated.

(Note: We already have a working solution with LSF! LSF provides an option for custom scheduler plugins
that lets one hook into the decision-making loop during scheduling. This led us to believe Slurm would
also have similar possibilities.)

Regards,
Bhaskar.