[slurm-users] Running Containerized Slurmctld and Slurmdbd in Production?


Hanby, Mike

Feb 15, 2023, 1:51:12 PM
to slurm...@lists.schedmd.com

Howdy,


Just wondering if any sites are running containerized Slurmctld and Slurmdbd in production?

We are in the process of planning a migration from a single host running slurmctld, slurmdbd, and MySQL (plus other HPC services) to separate OpenStack VMs. Our site averages fewer than 1,000 running/pending jobs at any given time. Like many HPC sites, our jobs are a mix of long running, large arrays, very short…

I ran across the GitHub project “Slurm Docker Cluster” (https://github.com/giovtorres/slurm-docker-cluster), which got me thinking that this method might be great for simpler upgrades, ease of reproducing the cluster in development, etc…
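
Roughly, the idea (ignoring munge keys, persistent state volumes, and the compute side for the moment) would be a handful of containers on a shared network. The image name and config mount below are placeholders I made up for illustration, not something from that project:

# placeholder image and paths; each daemon runs in the foreground with -D
docker network create slurm
docker run -d --name mysql --network slurm -e MYSQL_ROOT_PASSWORD=changeme mysql:8
docker run -d --name slurmdbd --network slurm -v /etc/slurm:/etc/slurm example/slurm:23.02 slurmdbd -D
docker run -d --name slurmctld --network slurm -v /etc/slurm:/etc/slurm example/slurm:23.02 slurmctld -D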

How about it, anyone running containerized Slurm server processes in production?


Thanks, Mike

Hanby, Mike

Mar 15, 2023, 12:49:04 PM
to Slurm User Community List

FYI, after more internet sleuthing (searching for “juju slurm”), I came across this outstanding-looking project: Omnivector Slurm Distribution (OSD): https://omnivector-solutions.github.io/osd-documentation/master/index.html

The project uses Juju (a Canonical project) to deploy, configure, and manage a Slurm cluster along with a variety of other components, such as the Slurm REST API, Prometheus integration, log forwarding via Fluent Bit to Graylog, and others.

Deployment targets include clouds (AWS/OpenStack), local LXD, and MAAS for bare metal…
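
For instance, once a cloud and its credentials are registered with Juju, the same workflow just bootstraps a controller there instead of on local LXD (cloud names as reported by “juju clouds”):

juju bootstrap aws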


I’ve only started to play with OSD, but it looks like a great framework for deploying Slurm clusters.

Quick install on an Ubuntu 22.04 LTS host:


# Install Juju and LXD from snaps
sudo snap install juju --classic
sudo snap install lxd

# Initialize LXD with defaults and disable IPv6 on the default bridge
lxd init --auto
lxc network set lxdbr0 ipv6.address none

# Allow the LXD API port through the host firewall
sudo ufw allow 8443/tcp

# Bootstrap a Juju controller on the local LXD cloud
juju bootstrap --show-log localhost
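
(I'm glossing over deploying the charms themselves, which the OSD docs cover; as a sketch, with charm names assumed from the component list above, the next step would look something like:)

juju deploy slurmctld
juju deploy slurmdbd
juju deploy slurmd
juju relate slurmctld slurmd
juju relate slurmctld slurmdbd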


Followed by a quick test of sinfo:


juju run --unit slurmctld/0 "sinfo"


PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST

osd-slurmd    up   infinite      1  down* juju-65df3d-2


juju run --unit slurmctld/0 "sinfo -R"


REASON               USER      TIMESTAMP           NODELIST

New node             slurm     2023-03-15T01:21:21 juju-65df3d-2
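
(Side note: a freshly enlisted node sitting in down* with reason “New node” can usually be returned to service, once its slurmd is actually reachable, with an scontrol resume, e.g.:)

juju run --unit slurmctld/0 "scontrol update nodename=juju-65df3d-2 state=resume"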


Mike

Rigoberto Corujo

Mar 15, 2023, 1:58:32 PM
to slurm...@lists.schedmd.com
Hi Mike,

If you run the Slurm daemons in a container but the Slurm commands from the host, you need to make sure the host commands and the containerized daemons are running protocol-compatible versions of Slurm; otherwise, the commands may not be able to communicate with the daemons after a protocol change. We ran the Slurm daemons in a container and hit exactly this problem: the host's Slurm was updated while the container was still running an older version of the daemons, and the commands could no longer talk to them.
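
A cheap guard is to compare the two versions before upgrading either side, e.g. (the container name here is just an example):

sinfo --version                     # version of the host's client commands
docker exec slurmctld slurmctld -V  # version of the daemon inside the container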

Rigoberto
