Hello everyone,
I have installed Slurm 19.05.5 from the Ubuntu repository, for the first time, on a cluster with 44 identical nodes, but I have a problem with slurmctld.service.
When I try to start slurmctld I get the following message:
fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files
slurm.conf is the same on all nodes and on the server.
slurmd.service is active and running on all nodes without problem.
mysql.service is active and running on server.
slurmdbd.service is active and running on server (slurm_acct_db created).
Attached are slurm.conf, slurmdbd.conf, and the detailed output of the slurmctld -Dvvvv command.
Any hint?
Thanks in advance
jb
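As a quick sanity check for the "no TRES" fatal above, one can query the accounting daemon directly from the controller host (this assumes AccountingStorageType=accounting_storage/slurmdbd is set in slurm.conf):

sacctmgr show cluster
sacctmgr show tres

If either command hangs or reports a connection error, slurmctld is facing the same problem the fatal message describes: it cannot get TRES data from slurmdbd.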
Hi Sean,
10.0.0.100 is the dbd and ctld host, with hostname se01. The firewall is inactive.
nc -nz 10.0.0.100 6819 || echo Connection not working
returns: Connection not working
It looks like slurm can't connect to the DB. Try connecting to
the MySQL/MariaDB database the same way the slurm user would. You
might not have your DB configured correctly to give Slurm access.
Prentice
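One way to do that test, assuming the DB user is called slurm, the database is slurm_acct_db, and MySQL/MariaDB runs on the same host as slurmdbd:

mysql -u slurm -p -h localhost -e "USE slurm_acct_db; SHOW TABLES;"

If access is denied, grants along these lines are typically what is missing (password and host are placeholders, adjust to your setup; run them as the MySQL root user):

CREATE USER IF NOT EXISTS 'slurm'@'localhost' IDENTIFIED BY 'your_password';
GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';
FLUSH PRIVILEGES;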
Hi Sean
ss -lntp | grep $(pidof slurmdbd) returns nothing.
systemctl status slurmdbd.service
● slurmdbd.service - Slurm DBD accounting daemon
Loaded: loaded (/lib/systemd/system/slurmdbd.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2021-04-05 13:52:35 EEST; 16h ago
Docs: man:slurmdbd(8)
Process: 1453365 ExecStart=/usr/sbin/slurmdbd $SLURMDBD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 1453375 (slurmdbd)
Tasks: 1
Memory: 5.0M
CGroup: /system.slice/slurmdbd.service
└─1453375 /usr/sbin/slurmdbd
Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Starting Slurm DBD accounting daemon...
Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: slurmdbd.service: Can't open PID file /run/slurmdbd.pid (yet?) after start: Operation not permitted
Apr 05 13:52:35 se01.grid.tuc.gr systemd[1]: Started Slurm DBD accounting daemon.
The file /run/slurmdbd.pid exists and contains the PID of slurmdbd.
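The "Can't open PID file ... Operation not permitted" warning from systemd is often just a mismatch between the PidFile set in slurmdbd.conf and the PIDFile declared in the unit file. A quick comparison (the paths below are the Ubuntu package defaults and may differ on your system):

grep -i pidfile /etc/slurm-llnl/slurmdbd.conf
grep -i PIDFile /lib/systemd/system/slurmdbd.service

If the two point at different locations, making them agree usually silences the warning; it is generally cosmetic as long as the daemon itself stays up.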
I changed DbdAddr and DbdHost to localhost and now slurmctld is active and running.
Thanks
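For reference, with slurmctld, slurmdbd, and MySQL all on the same machine, the relevant lines typically end up looking like this (a sketch based on this thread, not your exact files):

# slurmdbd.conf
DbdAddr=localhost
DbdHost=localhost
StorageHost=localhost

# slurm.conf
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost

Keep in mind that anything which has to reach slurmdbd from another machine (sacct or sacctmgr run on a compute node, for instance) would then need a resolvable hostname instead of localhost.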
Hi Sean,
slurmctld is active and running, but it does not start automatically on system reboot; I have to start it manually.
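The usual checklist for this, assuming the stock Ubuntu unit names: make sure the unit is enabled, and if it is simply racing the database at boot, add an ordering drop-in.

systemctl enable slurmctld.service
systemctl edit slurmctld.service

In the drop-in:

[Unit]
After=network-online.target slurmdbd.service
Wants=network-online.target

Then run systemctl daemon-reload and reboot to test. If the unit is already enabled, journalctl -b -u slurmctld should show why the first start attempt failed.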
Hi Sean,
I am trying to submit a simple job but it hangs:
srun -n44 -l /bin/hostname
srun: Required node not available (down, drained or reserved)
srun: job 15 queued and waiting for resources
^Csrun: Job allocation 15 has been revoked
srun: Force Terminated job 15
The daemons are active and running on the server and on all nodes.
The node definitions in slurm.conf are:
DefMemPerNode=3934
NodeName=wn0[01-44] CPUs=2 RealMemory=3934 Sockets=2 CoresPerSocket=2 State=UNKNOWN
PartitionName=TUC Nodes=ALL Default=YES MaxTime=INFINITE State=UP
tail -10 /var/log/slurmdbd.log
[2021-04-06T12:09:16.481] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.481] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
[2021-04-06T12:09:16.482] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.482] error: It looks like the storage has gone away trying to reconnect
[2021-04-06T12:09:16.483] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.483] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
[2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.484] error: It looks like the storage has gone away trying to reconnect
[2021-04-06T12:09:16.484] error: We should have gotten a new id: Table 'slurm_acct_db.tuc_job_table' doesn't exist
[2021-04-06T12:09:16.485] error: _add_registered_cluster: trying to register a cluster (tuc) with no remote port
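The repeated "Table 'slurm_acct_db.tuc_job_table' doesn't exist" errors usually mean the per-cluster tables were never created, either because the cluster was never added through sacctmgr or because the slurm DB user lacks the privileges to create tables. Two quick checks (reusing the user and database names from earlier in the thread):

mysql -u slurm -p -e "SHOW GRANTS FOR 'slurm'@'localhost'; USE slurm_acct_db; SHOW TABLES LIKE 'tuc_%';"
sacctmgr show cluster

If the cluster is missing from sacctmgr, adding it with "sacctmgr add cluster tuc" should create the tables; watch slurmdbd.log while it runs.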
tail -10 /var/log/slurmctld.log
[2021-04-06T12:09:35.701] debug: backfill: no jobs to backfill
[2021-04-06T12:09:42.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:00.042] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:05.701] debug: backfill: beginning
[2021-04-06T12:10:05.701] debug: backfill: no jobs to backfill
[2021-04-06T12:10:05.989] debug: sched: Running job scheduler
[2021-04-06T12:10:19.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
[2021-04-06T12:10:35.702] debug: backfill: beginning
[2021-04-06T12:10:35.702] debug: backfill: no jobs to backfill
[2021-04-06T12:10:37.001] debug: slurmdbd: PERSIST_RC is -1 from DBD_FLUSH_JOBS(1408): (null)
Attached is the output of sinfo -R.
Any hint?
sacctmgr list cluster
Cluster ControlHost ControlPort RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall QOS Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
tuc 0 0 1 normal
scontrol show config | grep ClusterName
ClusterName = tuc
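In the sacctmgr output above, ControlHost is empty and ControlPort is 0, which means slurmctld has not registered itself with slurmdbd for cluster tuc yet. A quick way to see what slurmctld is actually using (a check only, not a fix):

scontrol show config | grep -Ei 'ClusterName|AccountingStorage'

ClusterName has to match the name stored in the accounting database exactly, and AccountingStorageHost has to point at the machine where slurmdbd is listening.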
sinfo -N -o "%N %T %C %m %P %a"
NODELIST STATE CPUS(A/I/O/T) MEMORY PARTITION AVAIL
wn001 drained 0/0/2/2 3934 TUC* up
wn002 drained 0/0/2/2 3934 TUC* up
wn003 drained 0/0/2/2 3934 TUC* up
wn004 drained 0/0/2/2 3934 TUC* up
wn005 drained 0/0/2/2 3934 TUC* up
wn006 drained 0/0/2/2 3934 TUC* up
wn007 drained 0/0/2/2 3934 TUC* up
wn008 drained 0/0/2/2 3934 TUC* up
wn009 drained 0/0/2/2 3934 TUC* up
wn010 drained 0/0/2/2 3934 TUC* up
wn011 drained 0/0/2/2 3934 TUC* up
wn012 drained 0/0/2/2 3934 TUC* up
wn013 drained 0/0/2/2 3934 TUC* up
wn014 drained 0/0/2/2 3934 TUC* up
wn015 drained 0/0/2/2 3934 TUC* up
wn016 drained 0/0/2/2 3934 TUC* up
wn017 drained 0/0/2/2 3934 TUC* up
wn018 drained 0/0/2/2 3934 TUC* up
wn019 drained 0/0/2/2 3934 TUC* up
wn020 drained 0/0/2/2 3934 TUC* up
wn021 drained 0/0/2/2 3934 TUC* up
wn022 drained 0/0/2/2 3934 TUC* up
wn023 drained 0/0/2/2 3934 TUC* up
wn024 drained 0/0/2/2 3934 TUC* up
wn025 drained 0/0/2/2 3934 TUC* up
wn026 drained 0/0/2/2 3934 TUC* up
wn027 drained 0/0/2/2 3934 TUC* up
wn028 drained 0/0/2/2 3934 TUC* up
wn029 drained 0/0/2/2 3934 TUC* up
wn030 drained 0/0/2/2 3934 TUC* up
wn031 drained 0/0/2/2 3934 TUC* up
wn032 drained 0/0/2/2 3934 TUC* up
wn033 drained 0/0/2/2 3934 TUC* up
wn034 drained 0/0/2/2 3934 TUC* up
wn035 drained 0/0/2/2 3934 TUC* up
wn036 drained 0/0/2/2 3934 TUC* up
wn037 drained 0/0/2/2 3934 TUC* up
wn038 drained 0/0/2/2 3934 TUC* up
wn039 drained 0/0/2/2 3934 TUC* up
wn040 drained 0/0/2/2 3934 TUC* up
wn041 drained 0/0/2/2 3934 TUC* up
wn042 drained 0/0/2/2 3934 TUC* up
wn043 drained 0/0/2/2 3934 TUC* up
wn044 drained 0/0/2/2 3934 TUC* up
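Once whatever caused the drain has been addressed, the whole range can be returned to service in one go (a sketch using the node range from slurm.conf):

scontrol update NodeName=wn0[01-44] State=RESUME

If the underlying problem is still present, as turns out to be the case below with RealMemory, slurmd will simply drain the nodes again, so it is worth checking sinfo -R first.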
Hi Sean
I made all the changes you recommended but the problem remains.
Attached you will find the dbd and ctld log files and a slurmd log file from one node (wn001), as well as the Slurm configuration.
scontrol show node wn001
NodeName=wn001 Arch=x86_64 CoresPerSocket=2
CPUAlloc=0 CPUTot=2 CPULoad=0.01
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=(null)
NodeAddr=wn001 NodeHostName=wn001 Version=19.05.5
OS=Linux 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021
RealMemory=3934 AllocMem=0 FreeMem=3101 Sockets=2 Boards=1
State=DOWN+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=TUC
BootTime=2021-04-01T13:26:24 SlurmdStartTime=2021-04-07T10:53:20
CfgTRES=cpu=2,mem=3934M,billing=2
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Reason=Low RealMemory [root@2021-04-
sacctmgr list cluster
Cluster ControlHost ControlPort RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall QOS Def QOS
---------- --------------- ------------ ----- --------- ------- ------------- --------- ------- ------------- --------- ----------- -------------------- ---------
tuc 127.0.0.1 6817 8704 1
The total memory on each node is 3940 MB and free memory ranges from 3353 to 3378 MB. Which value should I give to RealMemory?
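A common approach, hedged because it depends on what slurmd actually detects on the hardware: run slurmd -C on one of the nodes; it prints the node definition slurmd itself would report, including RealMemory. Setting RealMemory in slurm.conf to that reported value, or slightly below it, avoids the "Low RealMemory" drain reason seen above.

slurmd -C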
Do I have to create a separate entry for each node in slurm.conf?
How can I check that each node can contact the slurmd port on every other node?
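A minimal sketch, assuming the default SlurmdPort of 6818 (confirm with scontrol show config | grep SlurmdPort) and that nc is available; run it on each node, for example over ssh or pdsh:

for n in wn0{01..44}; do nc -z -w 2 $n 6818 || echo "$n: slurmd port unreachable"; done

Bash brace expansion turns wn0{01..44} into wn001 through wn044; any node the loop prints cannot be reached on the slurmd port from the host where the loop was run.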