regards
Steven
--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com
On both a compute node and the controller, check which Slurm packages are installed:
# rpm -qi slurm-slurmctld
# rpm -qi slurm-slurmd
Then check what the auth type is. For example, we still use munge, which in my build is also the default auth type:
# strings `which slurmd` | egrep -i munge
DEFAULT_AUTH_TYPE "auth/munge"
DEFAULT_CRED_TYPE "cred/munge"
# scontrol show config | egrep -i auth
AuthAltTypes = (null)
AuthAltParameters = (null)
AuthInfo = (null)
AuthType = auth/munge
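To compare the setting between the controller and a compute node, the check above could be wrapped in a small helper. This is only a sketch: `auth_type_from_config` is a hypothetical name, and it assumes the `Key = value` layout shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read `scontrol show config` output on stdin and
# print only the AuthType value (for example auth/munge).
auth_type_from_config() {
    awk -F'=' '$1 ~ /^[ \t]*AuthType[ \t]*$/ {
        # Trim the whitespace around the value before printing it.
        gsub(/^[ \t]+|[ \t]+$/, "", $2)
        print $2
    }'
}

# Usage on a live system (run on both hosts and compare the results):
#   scontrol show config | auth_type_from_config
```

If the two hosts print different values, that mismatch is worth fixing before looking any further.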
First I’d verify munge functionality in the updated environment:
https://github.com/dun/munge/wiki/Installation-Guide#troubleshooting
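As a rough illustration of the basic checks that guide describes (encode and decode a credential locally, then decode one on a remote host), something like the following might be used. The wrapper name `check_munge` and the hostname in the usage comment are hypothetical; only `munge -n`, `unmunge`, and `ssh` are real commands here.

```shell
#!/bin/sh
# Sketch of the basic MUNGE sanity checks. check_munge is a hypothetical
# wrapper; pass an optional hostname to also test remote decoding.
check_munge() {
    # Bail out early if the client tools are not installed.
    command -v munge >/dev/null 2>&1 || { echo "munge not found" >&2; return 127; }
    command -v unmunge >/dev/null 2>&1 || { echo "unmunge not found" >&2; return 127; }

    # Encode a credential and decode it on the same host.
    munge -n | unmunge || return 1

    # If a host was given, decode a locally generated credential there.
    # This only succeeds when both hosts share the same munge.key and
    # their clocks are reasonably in sync.
    if [ -n "$1" ]; then
        munge -n | ssh "$1" unmunge || return 1
    fi
}

# Usage:
#   check_munge            # local check only
#   check_munge node01     # node01 is a placeholder hostname
```

Running it from the controller against each compute node, and then in the other direction, covers both paths Slurm depends on.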
Late to the party here, but depending on how much time you have invested, how much you can tolerate reformats or other more destructive work, etc., you might consider OpenHPC and its install guide ([1] for RHEL 8 variants, [2] or [3] for RHEL 9 variants, depending on which version of Warewulf you prefer). I’ve also got some workshop materials on building login nodes, GPU drivers, stateful provisioning, etc. for OpenHPC 3 and Warewulf 3 at [4].
At least in an isolated VirtualBox environment with no outside IdP or other dependencies, my student workers have usually been able to get their first batch job running within a day.
From: Steven Jones via slurm-users <slurm...@lists.schedmd.com>
Date: Sunday, February 2, 2025 at 5:48 PM
To: slurm...@lists.schedmd.com <slurm...@lists.schedmd.com>, Chris Samuel <ch...@csamuel.org>
Subject: [slurm-users] Re: Fw: Re: RHEL8.10 V slurmctld
We only do isolated on the students’ VirtualBox setups because it’s simpler for them to get started with. Our production HPC with OpenHPC is definitely integrated with our Active Directory (directly via sssd, not with an intermediate product), etc. Not everyone does it that way, but our scale is small enough that we’ve never had a load or other performance issue with our AD.
regards
Steven Jones
B.Eng (Hons)
Technical Specialist - Linux RHCE
Victoria University, Digital Solutions,
Level 8 Rankin Brown Building,
Wellington, NZ
6012
0064 4 463 6272
Steven,
Looks like you may have had a secondary controller that took over and changed your StateSave files.
IF you don't need the job info AND no jobs are running, you can just rename/delete your StateSaveLocation directory and things will be recreated. Job numbers will start over (unless you set FirstJobId, which you should if you want to keep your sacct data).
It also looks like slurmctld does not have permission to write its log file. Change SlurmctldLogFile to be something like /var/log/slurm/slurmctld.log and set the owner of /var/log/slurm to the slurm user.
Ensure all slurmctld daemons are down, then start the first. Once it is up (you can check with 'scontrol show config'), start the second. Run 'scontrol show config' again and you should see both daemons listed as 'up' at the end of the output.
-Brian Andrus
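For anyone following along, the steps Brian describes might be sketched roughly as below. This is not his script: the function names, the `slurmctld` systemd unit name, the `slurm` user, and the example paths are all assumptions, and by default it only prints the commands. The slurm.conf changes (SlurmctldLogFile, and FirstJobId if you want to keep sacct data consistent) still have to be made by hand.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set explicitly.
# Unit name, user name, and paths are assumptions; match them to your site.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reset_slurmctld_state() {
    state_dir="$1"   # value of StateSaveLocation in slurm.conf
    log_dir="$2"     # directory that will hold slurmctld.log

    # Stop the controller (do this on every controller host first).
    run systemctl stop slurmctld
    # Move the old state aside instead of deleting it outright.
    run mv "$state_dir" "$state_dir.old"
    # Recreate the directories and hand them to the slurm user.
    run mkdir -p "$state_dir" "$log_dir"
    run chown slurm "$state_dir" "$log_dir"
    # Start the primary; start any backup controller only after it is up.
    run systemctl start slurmctld
    run scontrol show config
}

# Usage (placeholder paths):
#   reset_slurmctld_state /var/spool/slurmctld /var/log/slurm
```

Only do this when no jobs are running and the existing job state is not needed, as noted above.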