[slurm-users] Convergence of Kube and Slurm?

Dan Healy via slurm-users

May 4, 2024, 5:07:32 PM
to slurm...@lists.schedmd.com
Bright Cluster Manager states on its marketing site that it can manage a cluster running both Kubernetes and Slurm. Maybe I misunderstood it. Nevertheless, I am encountering more and more groups that want to run a stack of containers that needs private container networking.

What’s the current state of using the same HPC cluster for both Slurm and Kube? 

Note: I’m aware that I can run Kube on a single node, but we need more resources. So ultimately we need a way for Slurm and Kube to coexist in the same cluster, both sharing the full pool of resources and both fully aware of the cluster's resource usage.

Thanks,

Daniel Healy

Daniel Letai via slurm-users

May 6, 2024, 6:56:00 AM
to slurm...@lists.schedmd.com

There is a Kubeflow offering that might be of interest:

https://www.dkube.io/post/mlops-on-hpc-slurm-with-kubeflow


I have not tried it myself, so I have no idea how well it works.


Regards,

--Dani_L.

Tim Wickberg via slurm-users

May 6, 2024, 8:50:37 PM
to slurm...@lists.schedmd.com
> Note: I’m aware that I can run Kube on a single node, but we need more
> resources. So ultimately we need a way to have Slurm and Kube exist in
> the same cluster, both sharing the full amount of resources and both
> being fully aware of resource usage.

This is something that we (SchedMD) are working on, although it's a bit
earlier than I was planning to publicly announce anything...

This is a very high-level view, and I have to apologize for stalling a
bit, but: we've hired a team to build out a collection of tools that
we're calling "Slinky" [1]. These provide canonical ways of running
Slurm within Kubernetes, ways of maintaining and managing the cluster
state, and scheduling integration that allows compute nodes to be
available to both the Kubernetes and Slurm environments while
coordinating their status.
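
(To make the idea a bit more concrete, here is a minimal, purely
illustrative sketch of the kind of status coordination I mean, written
against the kubernetes Python client and scontrol. It assumes Kubernetes
and Slurm node names match; it is not how Slinky is actually implemented.)

    # Illustrative sketch only -- not Slinky. Assumes Kubernetes node names
    # match Slurm node names, and that scontrol is available on this host.
    import subprocess
    from kubernetes import client, config

    config.load_kube_config()   # use load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        name = node.metadata.name
        # Which pods has Kubernetes bound to this node (ignoring system pods)?
        pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={name}")
        busy = any(p.metadata.namespace != "kube-system" for p in pods.items)
        if busy:
            # Tell Slurm not to start new jobs here while Kubernetes is using it.
            subprocess.run(["scontrol", "update", f"NodeName={name}",
                            "State=DRAIN", "Reason=in-use-by-kubernetes"])
        else:
            # Hand the node back to Slurm once Kubernetes is done with it.
            subprocess.run(["scontrol", "update", f"NodeName={name}",
                            "State=RESUME"])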

We'll be talking about it in more detail at the Slurm User Group
Meeting in Oslo [3], then at KubeCon North America in Salt Lake City, and
at SC'24 in Atlanta. We'll have the (open-source, Apache 2.0-licensed)
code for our first development phase available by SC'24, if not sooner.

There's a placeholder documentation page [4] that points to some of the
presentations I've given previously on approaches to tackling this
converged-computing model, but I'll caution that they're a bit dated,
and the Slinky-specific presentations we've been working on internally
aren't publicly available yet.

If you're a SchedMD support customer with specific use cases, please
feel free to ping your account manager if you'd like to chat at some
point in the next few months.

- Tim

[1] Slinky is not an acronym (neither is Slurm [2]), but loosely stands
for "Slurm in Kubernetes".

[2] https://slurm.schedmd.com/faq.html#acronym

[3] https://www.schedmd.com/about-schedmd/events/

[4] https://slurm.schedmd.com/slinky.html

--
Tim Wickberg
Chief Technology Officer, SchedMD LLC
Commercial Slurm Development and Support

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Bjørn-Helge Mevik via slurm-users

May 7, 2024, 2:28:15 AM
to slurm...@schedmd.com
Tim Wickberg via slurm-users <slurm...@lists.schedmd.com> writes:

> [1] Slinky is not an acronym (neither is Slurm [2]), but loosely
> stands for "Slurm in Kubernetes".

And not at all inspired by Slinky Dog in Toy Story, I guess. :D

--
Cheers,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

wdennis--- via slurm-users

Jul 29, 2024, 2:44:08 PM
to slurm...@lists.schedmd.com
Can I ask whether this replaces the previously announced "SUNK" work? (That was never released as open source on GitHub as planned; it looks like it is only available on CoreWeave Cloud.)