[slurm-users] sinfo not listing any partitions


Kent L. Hanson via slurm-users

Nov 27, 2024, 9:58:40 AM
to slurm...@lists.schedmd.com

I am doing a new install of Slurm 24.05.3. I have all the packages built and installed on the head node and compute nodes with the same munge.key, slurm.conf, and gres.conf files. I was able to run the munge and unmunge commands to test munge successfully, and time is synced with chronyd. I can’t seem to find any useful errors in the logs, but when I run sinfo no nodes are listed; I just see the headers for each column. Has anyone seen this, or does anyone know what a next troubleshooting step would be? I’m new to this and not sure where to go from here. Thanks for any and all help!
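
For reference, the munge test I ran was along these lines (a minimal sketch; k001 stands in for any compute node):

munge -n | unmunge              # credential round trip on the head node
munge -n | ssh k001 unmunge     # encode here, decode on a compute node

Both decoded with STATUS: Success (0).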

 

The odd output I am seeing:

[username@headnode ~] sinfo

PARTITION AVAIL    TIMELIMIT NODES   STATE   NODELIST

 

(Nothing is output showing the status of partitions or nodes.)
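
For comparison, a healthy cluster would print a row under those headers, something like this (partition name and states here are only illustrative):

PARTITION AVAIL    TIMELIMIT NODES   STATE   NODELIST
debug*       up     infinite   448    idle   k[001-448]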

 

 

slurm.conf

 

ClusterName=slurmkvasir

SlurmctldHost=kadmin2

MpiDefault=none

ProctrackType=proctrack/cgroup

PrologFlags=contain

ReturnToService=2

SlurmctldPidFile=/var/run/slurm/slurmctld.pid

SlurmctldPort=6817

SlurmdPidFile=/var/run/slurm/slurmd.pid

SlurmdPort=6818

SlurmdSpoolDir=/var/spool/slurmd

SlurmUser=slurm

StateSaveLocation=/var/spool/slurmctld

TaskPlugin=task/cgroup

MinJobAge=600

SchedulerType=sched/backfill

SelectType=select/cons_tres

PriorityType=priority/multifactor

AccountingStorageHost=localhost

AccountingStoragePass=/var/run/munge/munge.socket.2

AccountingStorageType=accounting_storage/slurmdbd

AccountingStorageTRES=gres/gpu,cpu,node

JobCompType=jobcomp/none

JobAcctGatherFrequency=30

JobAcctGatherType=jobacct_gather/cgroup

SlurmctldDebug=info

SlurmctldLogFile=/var/log/slurm/slurmctld.log

SlurmdDebug=info

SlurmdLogFile=/var/log/slurm/slurmd.log

nodeName=k[001-448]

PartitionName=default Nodes=k[001-448] Default=YES MaxTime=INFINITE State=up

 

slurmctld.log

 

error: Configured MailProg is invalid

slurmctld version 24.05.3 started on cluster slurmkvasir

accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817

error: read_slurm_conf: default partition not set.

Recovered state of 448 nodes

Down nodes: k[002-448]

Recovered information about 0 jobs

Recovered state of 0 reservations

read_slurm_conf: backup_controller not specified

select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure

Running as primary controller

 

slurmd.log

 

error: Node configuration differs from hardware: CPUs=1:40(hw) Boards=1:1(hw) SocketsPerBoard=1:2(hw) CoresPerSocket=1:20(hw) ThreadsPerCore=1:1(hw)

CPU frequency setting not configured for this node

slurmd version 24.05.3 started

slurmd started on Wed, 27 Nov 2024 06:51:03 -0700

CPUs=1 Boards=1 Cores=1 Threads=1 Memory=192030 TmpDisk=95201 Uptime=166740 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

error: _forward_thread: failed to k019 (10.142.0.119:6818): Connection timed out

(Above line repeated 20 or so times for different nodes.)

 

Thanks,

Kent Hanson

Ole Holm Nielsen via slurm-users

Nov 27, 2024, 10:48:51 AM
to slurm...@lists.schedmd.com
Hi Kent,

This problem could perhaps be due to your firewall setup. What is your
OS, and did you install Slurm from RPM packages or by some other method?

Does sinfo work on your SlurmctldHost=kadmin2? Is the "headnode" a
different host? Try stopping the firewalld service.

You can see some advice on firewalls in the Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#configure-firewall-for-slurm-daemons
There is information about Slurm installation and configuration in the
Wiki pages in https://wiki.fysik.dtu.dk/Niflheim_system/
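
For example, you could verify the firewall state and that the Slurm ports
are reachable in both directions (nc is just one way to test this,
assuming it is installed):

firewall-cmd --state
firewall-cmd --list-all
nc -zv k001 6818      # controller -> slurmd on a compute node
nc -zv kadmin2 6817   # from a compute node -> slurmctld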

IHTH,
Ole

On 11/27/24 15:56, Kent L. Hanson via slurm-users wrote:
> I am doing a new install of slurm 24.05.3 I have all the packages built
> and installed on headnode and compute node with the same munge.key,
> slurm.conf, and gres.conf file. I was able to run munge and unmunge
> commands to test munge successfully. Time is synced with chronyd. I can’t
> seem to find any useful errors in the logs. For some reason when I run
> sinfo no nodes are listed. I just see the headers for each column. Has
> anyone seen this or know what a next step of troubleshooting would be? I’m
> new to this and not sure where to go from here. Thanks for any and all help!
>
> The odd output I am seeing
>
> [username@headnode ~] sinfo
>
> PARTITION AVAIL    TIMELIMIT NODES   STATE   NODELIST
>
> (Nothing is output showing status of partition or nodes)
> error: _forward_thread: failed to k019 (10.142.0.119:6818): Connection
> timed out
>
> (Above line repeated 20 or so times for different nodes.)


Kent L. Hanson via slurm-users

Nov 27, 2024, 11:18:05 AM
to Ole.H....@fysik.dtu.dk, slurm...@lists.schedmd.com
Hello Ole,

I have no firewall on the compute nodes, and the internal interfaces on kadmin2 (opa and eth) are in the trusted zone of the firewall, so it should allow everything through. I'm using RHEL 9.4. I built the RPM packages from source using the admin guide: https://slurm.schedmd.com/quickstart_admin.html

"Kadmin2" and "headnode" are the one and same. This system is on an air gapped network and I had to hand jam everything. Sorry for the confusion.

No luck stopping the firewall service. Still the same issue.

I'll continue to read the documentation that you have sent me and see if I missed anything.

Thanks,

Kent

Ryan Novosielski via slurm-users

Nov 27, 2024, 11:34:41 AM
to Kent L. Hanson, slurm...@lists.schedmd.com
If you’re sure you’ve restarted everything after the config change, are you also sure that the partition isn’t hidden from your current user? You can try sinfo -a to rule that out, or run it as root.
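
For example (PrivateData in slurm.conf is one setting that can hide partition information from regular users):

sinfo -a
sudo sinfo -a
scontrol show partition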

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

Kent L. Hanson via slurm-users

Nov 27, 2024, 11:42:12 AM
to novo...@rutgers.edu, slurm...@lists.schedmd.com

Hey Ryan,

 

I have restarted the slurmctld and slurmd services several times, and I checksummed the slurm.conf files on both nodes; they are identical. I ran “sinfo -a” as root with the same result.

 

Thanks,

Kent

Christopher Samuel via slurm-users

Nov 27, 2024, 11:43:59 AM
to slurm...@lists.schedmd.com
On 11/27/24 11:38 am, Kent L. Hanson via slurm-users wrote:

> I have restarted the slurmctld and slurmd services several times. I
> hashed the slurm.conf files. They are the same. I ran “sinfo -a” as root
> with the same result.

Are your nodes in the `FUTURE` state perhaps? What does this show?

sinfo -aFho "%N %T"
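
(-F reports nodes defined in the FUTURE state, which sinfo normally hides,
-h drops the header, and -o "%N %T" prints just the node list and state.)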

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Ryan Novosielski via slurm-users

Nov 27, 2024, 11:48:43 AM
to Kent L. Hanson, slurm...@lists.schedmd.com
At this point, I’d probably crank up the logging some and see what it’s saying in slurmctld.log.
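
For example, you can raise the slurmctld log level at runtime and watch the log (scontrol setdebug info puts it back when you're done):

scontrol setdebug debug2
tail -f /var/log/slurm/slurmctld.log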

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

Patrick Begou via slurm-users

Nov 28, 2024, 9:55:05 AM
to slurm...@lists.schedmd.com
Hi Kent,

on your management node could you run:
systemctl status slurmctld

and check your 'NodeName=...' and 'PartitionName=...' lines in /etc/slurm.conf? In my slurm.conf I have a more detailed node description, and the NodeName keyword starts with an upper-case letter (I don't know if slurm.conf is case sensitive):

NodeName=kareline-0-[0-3]  Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=47900

It looks like your node description is not being understood by Slurm.
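
You can also ask slurmd on a compute node to print the hardware it detects, and paste that line into slurm.conf:

slurmd -C

On your nodes that should print something like the following (values guessed from your slurmd.log, so do check them):

NodeName=k001 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=192030

That would also clear the "Node configuration differs from hardware" error in your slurmd.log.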

Patrick

Brian Andrus via slurm-users

Dec 2, 2024, 1:17:39 PM
to slurm...@lists.schedmd.com
You only have one partition, named 'default'.
You are not allowed to name it that. Name it something else and you should be good.
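
In slurm.conf, a record with PartitionName=DEFAULT is reserved for setting default partition values, which is presumably why a partition literally named 'default' never gets created. For example:

PartitionName=compute Nodes=k[001-448] Default=YES MaxTime=INFINITE State=UP

Then restart slurmctld (or run scontrol reconfigure).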

Brian Andrus

Kent L. Hanson via slurm-users

Dec 2, 2024, 1:50:14 PM
to toom...@gmail.com, slurm...@lists.schedmd.com

Thank you, Brian! That was it. I named it compute and it started working.

 

Thanks for everyone’s help!

 

Kent

 

From: Brian Andrus via slurm-users <slurm...@lists.schedmd.com>

Sent: Monday, December 2, 2024 11:15 AM
To: slurm...@lists.schedmd.com
