[slurm-users] slurmctld error


Ioannis Botsis

Apr 5, 2021, 4:04:09 AM
to slurm...@lists.schedmd.com

Hello everyone,

 

I installed Slurm 19.05.5 from the Ubuntu repo, for the first time, on a cluster with 44 identical nodes, but I have a problem with slurmctld.service.

 

When I try to activate slurmctld I get the following message…

 

fatal: You are running with a database but for some reason we have no TRES from it.  This should only happen if the database is down and you don't have any state files

 

  • Ubuntu 20.04.2 runs on the server and all nodes, in exactly the same version.
  • munge 0.5.13, installed from the Ubuntu repo, is running on the server and nodes.
  • mysql Ver 8.0.23-0ubuntu0.20.04.1 for Linux on x86_64 (Ubuntu), installed from the Ubuntu repo, is running on the server.

 

slurm.conf is the same on all nodes and on the server.

 

slurmd.service is active and running on all nodes without problems.

 

mysql.service is active and running on server.

slurmdbd.service is active and running on server (slurm_acct_db created).

 

Find attached slurm.conf, slurmdbd.conf, and the detailed output of the slurmctld -Dvvvv command.

 

Any hint?

 

Thanks in advance

 

jb

 

 

 

slurmdctl -Dvvvv.txt
slurm.conf
slurmdbd.conf

Sean Crosby

Apr 5, 2021, 4:46:58 AM
to Slurm User Community List
Hi Jb,

You have set AccountingStoragePort to 3306 in slurm.conf, which is the MySQL port running on the DBD host.

AccountingStoragePort is the port for the Slurmdbd service, and not for MySQL.

Change AccountingStoragePort to 6819 and it should fix your issues.

I also think you should comment out the lines

AccountingStorageUser=slurm
AccountingStoragePass=/run/munge/munge.socket.2

You shouldn't need those lines
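
For reference, the accounting section of slurm.conf would then look roughly like this (assuming slurmdbd runs on se01; adjust the host to your setup):

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=se01
AccountingStoragePort=6819
#AccountingStorageUser=slurm
#AccountingStoragePass=/run/munge/munge.socket.2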

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia





Ioannis Botsis

Apr 5, 2021, 7:01:13 AM
to Slurm User Community List

Hi Sean,

 

Thank you for your prompt response. I made the changes you suggested, but slurmctld still refuses to run… find attached a new slurmctld -Dvvvv output.

 

jb

slurmdctl -Dvvvv new.txt

Sean Crosby

Apr 5, 2021, 7:53:01 AM
to Slurm User Community List
The error shows

slurmctld: debug2: Error connecting slurm stream socket at 10.0.0.100:6819: Connection refused
slurmctld: error: slurm_persist_conn_open_without_init: failed to open persistent connection to se01:6819: Connection refused

Is 10.0.0.100 the IP address of the host running slurmdbd?

If so, check the iptables firewall running on that host, and make sure the ctld server can access port 6819 on the dbd host.
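
For example, on the dbd host something like the following will show whether a firewall rule is in the way (this assumes iptables or ufw; adjust to whatever firewall you actually run):

sudo iptables -nL | grep 6819
sudo ufw status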

You can check this by running the following from the ctld host (requires the package nmap-ncat installed)

nc -nz 10.0.0.100 6819 || echo Connection not working

This will try connecting to port 6819 on the host 10.0.0.100. It outputs nothing if the connection works, and prints "Connection not working" otherwise.

I would also test this on the DBD server itself
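
e.g. on the dbd host itself, something like:

nc -nz 127.0.0.1 6819 || echo Connection not working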
 
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


ibo...@isc.tuc.gr

Apr 5, 2021, 3:00:47 PM
to Slurm User Community List

Hi Sean,

 

10.0.0.100 is the dbd and ctld host, with hostname se01. The firewall is inactive…

 

nc -nz 10.0.0.100 6819 || echo Connection not working

 

gives me back…  Connection not working

Prentice Bisbal

Apr 5, 2021, 4:02:50 PM
to slurm...@lists.schedmd.com

It looks like slurm can't connect to the DB. Try connecting to the MySQL/MariaDB database the same way the slurm user would. You might not have your DB configured correctly to give Slurm access.
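
For example, something along these lines, using the credentials from your slurmdbd.conf (slurm_acct_db is the database name mentioned earlier in the thread):

mysql -u slurm -p -h localhost slurm_acct_db -e 'SHOW TABLES;'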

Prentice

Sean Crosby

Apr 5, 2021, 5:50:22 PM
to Slurm User Community List
What's the output of

ss -lntp | grep $(pidof slurmdbd)

on your dbd host?

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


ibo...@isc.tuc.gr

Apr 6, 2021, 12:03:38 AM
to Slurm User Community List

Hi Sean

 

ss -lntp | grep $(pidof slurmdbd)     returns nothing…

 

systemctl status slurmdbd.service

 

● slurmdbd.service - Slurm DBD accounting daemon
     Loaded: loaded (/lib/systemd/system/slurmdbd.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2021-04-05 13:52:35 EEST; 16h ago
       Docs: man:slurmdbd(8)
    Process: 1453365 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
   Main PID: 1453375 (slurmdbd)
      Tasks: 1
     Memory: 5.0M
     CGroup: /system.slice/slurmdbd.service
             └─1453375 /usr/sbin/slurmdbd

Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Starting Slurm DBD accounting daemon...
Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: slurmdbd.service: Can't open PID file /run/slurmdbd.pid (yet?) after start: Operation not permitted
Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Started Slurm DBD accounting daemon.

 

The file /run/slurmdbd.pid exists and contains the pidof slurmdbd value…

Sean Crosby

Apr 6, 2021, 12:32:14 AM
to Slurm User Community List
Interesting. It looks like slurmdbd is not opening the 6819 port

What does

ss -lntp | grep 6819

show? Is something else using that port?

You can also stop the slurmdbd service and run it in debug mode using

slurmdbd -D -vvv

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


Sean Crosby

Apr 6, 2021, 12:55:00 AM
to Slurm User Community List
The other thing I notice for my slurmdbd.conf is that I have

DbdAddr=localhost
DbdHost=localhost

You can try changing your slurmdbd.conf to set those 2 values as well to see if that gets slurmdbd to listen on port 6819
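
A minimal sketch of the relevant slurmdbd.conf lines, plus a quick check after restarting (this assumes the dbd only needs to be reachable locally):

DbdAddr=localhost
DbdHost=localhost

sudo systemctl restart slurmdbd
ss -lntp | grep 6819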

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


Ioannis Botsis

Apr 6, 2021, 1:27:59 AM
to Slurm User Community List

I set DbdAddr and DbdHost to localhost and now slurmctld is active and running…

 

Thanks

Ioannis Botsis

Apr 6, 2021, 1:57:25 AM
to Slurm User Community List

Hi Sean,

 

slurmctld is active and running, but on system reboot it doesn't start automatically… I have to start it manually.

Ole Holm Nielsen

Apr 6, 2021, 2:23:02 AM
to slurm...@lists.schedmd.com
Hi Ioannis,

On 06-04-2021 07:56, Ioannis Botsis wrote:
> slurmctld is active and running, but on system reboot it doesn't start
> automatically… I have to start it manually.

Maybe you will find my Slurm Wiki pages of use for setting up your Slurm
system: https://wiki.fysik.dtu.dk/niflheim/SLURM

For example, enabling the Slurm system services is described in the
section
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#installing-rpms
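
In practice that mostly comes down to enabling the units so systemd starts them at boot, along these lines (assuming the distro-packaged unit names):

sudo systemctl enable slurmctld    # on the controller
sudo systemctl enable slurmdbd     # on the dbd host
sudo systemctl enable slurmd       # on each compute node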

Best regards,
Ole

Ioannis Botsis

Apr 6, 2021, 5:19:55 AM
to Slurm User Community List

Hi Sean,

 

I am trying to submit a simple job but it freezes.

 

srun -n44 -l /bin/hostname
srun: Required node not available (down, drained or reserved)
srun: job 15 queued and waiting for resources
^Csrun: Job allocation 15 has been revoked
srun: Force Terminated job 15

 

 

The daemons are active and running on the server and all nodes.

 

The node definitions in slurm.conf are…

 

DefMemPerNode=3934
NodeName=wn0[01-44] CPUs=2 RealMemory=3934 Sockets=2 CoresPerSocket=2 State=UNKNOWN
PartitionName=TUC Nodes=ALL Default=YES MaxTime=INFINITE State=UP

 

tail -10 /var/log/slurmdbd.log

 

[2021-04-06T12:09:16.481] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.481] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
[2021-04-06T12:09:16.482] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.482] error: It looks like the storage has gone away trying to reconnect
[2021-04-06T12:09:16.483] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.483] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
[2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.484] error: It looks like the storage has gone away trying to reconnect
[2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.485] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port

 

tail -10 /var/log/slurmctld.log

 

[2021-04-06T12:09:35.701] debug:  backfill: no jobs to backfill
[2021-04-06T12:09:42.001] debug:  slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:00.042] debug:  slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:05.701] debug:  backfill: beginning
[2021-04-06T12:10:05.701] debug:  backfill: no jobs to backfill
[2021-04-06T12:10:05.989] debug:  sched: Running job scheduler
[2021-04-06T12:10:19.001] debug:  slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:35.702] debug:  backfill: beginning
[2021-04-06T12:10:35.702] debug:  backfill: no jobs to backfill
[2021-04-06T12:10:37.001] debug:  slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)

 

Attached sinfo -R 

 

Any hint?

Sean Crosby

Apr 6, 2021, 5:47:42 AM
to Slurm User Community List
It looks like your attachment of sinfo -R didn't come through

It also looks like your dbd isn't set up correctly

Can you also show the output of

sacctmgr list cluster

and

scontrol show config | grep ClusterName

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


ibo...@isc.tuc.gr

Apr 6, 2021, 6:18:07 AM
to Slurm User Community List

sacctmgr list cluster

   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
       tuc                            0     0         1                                                                                           normal

 

 

scontrol show config | grep ClusterName
ClusterName             = tuc

ibo...@isc.tuc.gr

Apr 6, 2021, 6:20:21 AM
to Slurm User Community List

sinfo -N -o "%N %T %C %m %P %a"

NODELIST STATE CPUS(A/I/O/T) MEMORY PARTITION AVAIL
wn001 drained 0/0/2/2 3934 TUC* up
wn002 drained 0/0/2/2 3934 TUC* up
wn003 drained 0/0/2/2 3934 TUC* up
wn004 drained 0/0/2/2 3934 TUC* up
wn005 drained 0/0/2/2 3934 TUC* up
wn006 drained 0/0/2/2 3934 TUC* up
wn007 drained 0/0/2/2 3934 TUC* up
wn008 drained 0/0/2/2 3934 TUC* up
wn009 drained 0/0/2/2 3934 TUC* up
wn010 drained 0/0/2/2 3934 TUC* up
wn011 drained 0/0/2/2 3934 TUC* up
wn012 drained 0/0/2/2 3934 TUC* up
wn013 drained 0/0/2/2 3934 TUC* up
wn014 drained 0/0/2/2 3934 TUC* up
wn015 drained 0/0/2/2 3934 TUC* up
wn016 drained 0/0/2/2 3934 TUC* up
wn017 drained 0/0/2/2 3934 TUC* up
wn018 drained 0/0/2/2 3934 TUC* up
wn019 drained 0/0/2/2 3934 TUC* up
wn020 drained 0/0/2/2 3934 TUC* up
wn021 drained 0/0/2/2 3934 TUC* up
wn022 drained 0/0/2/2 3934 TUC* up
wn023 drained 0/0/2/2 3934 TUC* up
wn024 drained 0/0/2/2 3934 TUC* up
wn025 drained 0/0/2/2 3934 TUC* up
wn026 drained 0/0/2/2 3934 TUC* up
wn027 drained 0/0/2/2 3934 TUC* up
wn028 drained 0/0/2/2 3934 TUC* up
wn029 drained 0/0/2/2 3934 TUC* up
wn030 drained 0/0/2/2 3934 TUC* up
wn031 drained 0/0/2/2 3934 TUC* up
wn032 drained 0/0/2/2 3934 TUC* up
wn033 drained 0/0/2/2 3934 TUC* up
wn034 drained 0/0/2/2 3934 TUC* up
wn035 drained 0/0/2/2 3934 TUC* up
wn036 drained 0/0/2/2 3934 TUC* up
wn037 drained 0/0/2/2 3934 TUC* up
wn038 drained 0/0/2/2 3934 TUC* up
wn039 drained 0/0/2/2 3934 TUC* up
wn040 drained 0/0/2/2 3934 TUC* up
wn041 drained 0/0/2/2 3934 TUC* up
wn042 drained 0/0/2/2 3934 TUC* up
wn043 drained 0/0/2/2 3934 TUC* up
wn044 drained 0/0/2/2 3934 TUC* up

Sean Crosby

Apr 6, 2021, 6:38:36 AM
to Slurm User Community List
It looks like your ctl isn't contacting the slurmdbd properly. The control host, control port, etc. are all blank.

The first thing I would do is change the ClusterName in your slurm.conf from upper case TUC to lower case tuc. You'll then need to restart your ctld. Then recheck sacctmgr show cluster

If that doesn't work, try changing AccountingStorageHost in slurm.conf to localhost as well
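
Roughly, the relevant slurm.conf lines would then be the following, followed by a restart of slurmctld (localhost only works here because the ctld and the dbd share the host se01):

ClusterName=tuc
AccountingStorageHost=localhost

sudo systemctl restart slurmctld
sacctmgr list cluster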

For your worker nodes, your nodes are all in drain state.

Show the output of

scontrol show node wn001

It will give you the reason why the node is drained.
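
To see the reason for all 44 nodes at once you can also run:

sinfo -R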

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


Sean Crosby

Apr 6, 2021, 6:53:39 AM
to Slurm User Community List
I think I've worked out a problem

I see in your slurm.conf you have this

SlurmdSpoolDir=/var/spool/slurm/d

It should be

SlurmdSpoolDir=/var/spool/slurmd

You'll need to restart slurmd on all the nodes after you make that change

I would also double check the permissions on that directory on all your nodes. It needs to be owned by user slurm

ls -lad /var/spool/slurmd
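
If the directory is missing or has the wrong owner on a node, something along these lines should sort it out (adjust the owner to whatever user slurmd runs as on your install):

sudo mkdir -p /var/spool/slurmd
sudo chown slurm:slurm /var/spool/slurmd
sudo chmod 755 /var/spool/slurmd
sudo systemctl restart slurmd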

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


Sean Crosby

Apr 6, 2021, 7:12:01 AM
to Slurm User Community List
I just checked my cluster and my spool dir is

SlurmdSpoolDir=/var/spool/slurm

(i.e. without the d at the end)

It doesn't really matter, as long as the directory exists and has the correct permissions on all nodes
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


Ioannis Botsis

Apr 8, 2021, 2:38:35 AM
to Slurm User Community List

Hi Sean

 

I made all the changes you recommended but the problem remains.

 

Attached you will find the dbd & ctld log files and a slurmd log file from one node, wn001, as well as the Slurm configuration.

 

scontrol show node wn001

NodeName=wn001 Arch=x86_64 CoresPerSocket=2
   CPUAlloc=0 CPUTot=2 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=wn001 NodeHostName=wn001 Version=19.05.5
   OS=Linux 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
   RealMemory=3934 AllocMem=0 FreeMem=3101 Sockets=2 Boards=1
   State=DOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=aTUC
   BootTime=2021-04-01T13:26:24 SlurmdStartTime=2021-04-07T10:53:20
   CfgTRES=cpu=2,mem=3934M,billing=2
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2021-04-

slurmdbd.log
slurmctld.log

Ioannis Botsis

Apr 8, 2021, 2:39:04 AM
to Slurm User Community List

sacctmgr list cluster

   Cluster     ControlHost  ControlPort   RPC     Share GrpJobs       GrpTRES GrpSubmit MaxJobs       MaxTRES MaxSubmit     MaxWall                  QOS   Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
       tuc       127.0.0.1         6817  8704         1

slurmd.log
show_config

Sean Crosby

Apr 8, 2021, 3:19:34 AM
to Slurm User Community List
The reason why your nodes are drained is "Low RealMemory"

This happens when the RealMemory you have configured in slurm.conf is higher than the memory that slurmd actually detects on the node.

You have told Slurm that the amount of RAM on wn001 is 3934MB

What does

free -m

show on wn001?

The DBD looks good now!

Can you also double-check that you can resolve the worker node names from the Slurm controller and between the nodes?

e.g.

on ctl - ping wn001
on wn001:
  ping wn002
  ping se01
on wn002:
  ping wn001
  ping se01

Also double check that each node can contact the slurmd port on every other node

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


Ioannis Botsis

Apr 8, 2021, 3:53:51 AM
to Slurm User Community List

Total memory on each node is 3940 MB, and free memory ranges from 3353 to 3378 MB. Which value should I give to RealMemory?

 

Do I have to create a different entry in slurm.conf for each node?

 

How can I check that each node can contact the slurmd port on every other node?

Sean Crosby

Apr 8, 2021, 4:14:34 AM
to Slurm User Community List
The memory you tell Slurm about (using the RealMemory value) is the memory jobs can use. So if your nodes have a minimum free RAM of 3353 MB, I would set RealMemory for your nodes to 3300 MB. You always have to leave some RAM for the OS, caching, etc. The way we calculate Slurm RAM is (physical RAM - GPFS cache - 5GB), but our nodes have 768GB of RAM...

Once you change RealMemory, you'll have to restart slurmd on the nodes and slurmctld on the controller.
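
For example, the node definition might become something like this, and once slurmd reports a consistent memory value the drained nodes can be resumed:

NodeName=wn0[01-44] CPUs=2 RealMemory=3300 Sockets=2 CoresPerSocket=2 State=UNKNOWN

scontrol update nodename=wn0[01-44] state=resume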

The way you can test if they can connect is

on wn001:

  nc -z wn002 6818 || echo Cannot connect
  nc -z wn003 6818 || echo Cannot connect

on wn002:

  nc -z wn001 6818 || echo Cannot connect
  nc -z wn003 6818 || echo Cannot connect

on wn003:

  nc -z wn001 6818 || echo Cannot connect
  nc -z wn002 6818 || echo Cannot connect

Make sure you test all of the nodes (or ensure they have a consistent firewall configuration).

You also have to make sure name resolution works. You have set the names in Slurm to be wn001-wn044, so every node has to be able to resolve those names. Hence the check using ping
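
A quick way to check resolution on each host is something like:

getent hosts wn001 se01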

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

