[slurm-users] Slurmctld Problems

stth via slurm-users

Jun 25, 2024, 6:22:40 AM
to slurm...@lists.schedmd.com
Dear slurm users,

This is my first time setting up Slurm, and I am looking for a solution to the errors below. Has anyone here already encountered this problem? I would really appreciate the help. mariadb, slurmdbd and slurmd are active.

× slurmctld.service - Slurm controller daemon
     Loaded: loaded (/usr/lib/systemd/system/slurmctld.service; enabled; preset: enabled)
     Active: failed (Result: exit-code) since Tue 2024-06-25 10:06:39 UTC; 2min 42s ago
   Duration: 584ms
       Docs: man:slurmctld(8)
    Process: 63738 ExecStart=/usr/sbin/slurmctld --systemd $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 63738 (code=exited, status=1/FAILURE)
        CPU: 25ms

Jun 25 10:06:39 server systemd[1]: Starting slurmctld.service - Slurm controller daemon...
Jun 25 10:06:39 server (lurmctld)[63738]: slurmctld.service: Referenced but unset environment variable evaluates to an empty string: SLURMCTLD_OPTIONS
Jun 25 10:06:39 server slurmctld[63738]: slurmctld: slurmctld version 23.11.4 started on servercluster
Jun 25 10:06:39 server systemd[1]: Started slurmctld.service - Slurm controller daemon.
Jun 25 10:06:39 server slurmctld[63738]: slurmctld: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Jun 25 10:06:39 server slurmctld[63738]: slurmctld: priority/multifactor: _read_last_decay_ran: No last decay (/var/spool/slurm/state/priority_last_decay_ran) to recover
Jun 25 10:06:39 server slurmctld[63738]: slurmctld: No memory enforcing mechanism configured.
Jun 25 10:06:39 server slurmctld[63738]: slurmctld: fatal: Can not recover last_conf_lite, incompatible version, (9472 not between 9728 and 10240), start with '-i' to ignore this. Warning: using -i will lose the data that can't be recovered.
Jun 25 10:06:39 server systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Jun 25 10:06:39 server systemd[1]: slurmctld.service: Failed with result 'exit-code'.

daijiangkuicgo--- via slurm-users

Jun 25, 2024, 9:49:17 AM
to slurm...@lists.schedmd.com
What about the message “Referenced but unset environment variable evaluates to an empty string: SLURMCTLD_OPTIONS”? Meanwhile, you can check slurmctld.log and journalctl -u slurmctld --no-pager.


stth via slurm-users

Jun 25, 2024, 10:33:27 AM
to daijian...@gmail.com, slurm...@lists.schedmd.com
Hello,
slurmctld.log and journalctl -u slurmctld --no-pager give the same information I have already provided.
The message “Referenced but unset environment variable evaluates to an empty string: SLURMCTLD_OPTIONS” comes from the files in /etc/default (slurmdbd/slurmctld/slurmd), where there is a line: SLURMDBD_OPTIONS="".

But it does not have anything to do with the fact that the daemon is not active.

Lorenzo Bosio via slurm-users

Jun 25, 2024, 10:53:06 AM
to stth, daijian...@gmail.com, slurm...@lists.schedmd.com

Hello,

I suppose the actual error is:

slurmctld: fatal: Can not recover last_conf_lite, incompatible version, (9472 not between 9728 and 10240), start with '-i' to ignore this. Warning: using -i will lose the data that can't be recovered.

Did you upgrade from Slurm 21.08 (9472) to your current version 23.11 (10240)? See here for the version numbers: https://github.com/SchedMD/slurm/blob/40058e4df5fa243f4c340db9622ed559ce771778/src/common/slurm_protocol_common.h#L63

You have to stay within a two-release window for upgrades to work.

Best regards,
Lorenzo
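
The numbers in the fatal error can be decoded with a little arithmetic. What follows is a minimal sketch, assuming the protocol version is packed as (release index << 8) | minor, as the linked header suggests. The function names decode_protocol_version and release_name are made up for this illustration. The pairs 9472 = 21.08 and 10240 = 23.11 come from the message above; the entries for 22.05 and 23.02 are my reading of the header and should be checked against it.

```shell
#!/bin/sh
# Split a Slurm protocol version number into its high and low byte.
# The high byte identifies the release; the low byte is a minor revision.
decode_protocol_version() {
    version="$1"
    echo "$((version >> 8)).$((version & 255))"
}

# Map the high byte to a release name.
# 37 -> 21.08 and 40 -> 23.11 are the pairs quoted in the thread;
# 38 and 39 are assumptions taken from slurm_protocol_common.h.
release_name() {
    case "$(( $1 >> 8 ))" in
        37) echo "21.08" ;;
        38) echo "22.05" ;;
        39) echo "23.02" ;;
        40) echo "23.11" ;;
        *)  echo "unknown" ;;
    esac
}

# The three numbers that appear in the fatal error message.
for v in 9472 9728 10240; do
    echo "$v = $(decode_protocol_version "$v") (Slurm $(release_name "$v"))"
done
```

Read this way, the saved state was written by a 21.08 controller (9472), while a 23.11 controller only accepts state between 9728 and 10240, which matches the "not between 9728 and 10240" wording of the error.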

stth via slurm-users

Jun 25, 2024, 10:54:12 AM
to Lorenzo Bosio, daijian...@gmail.com, slurm...@lists.schedmd.com
Hello Lorenzo,

Thank you for your reply. Yes, I have version 23.11.4.

Timo Rothenpieler via slurm-users

Jun 25, 2024, 11:25:48 AM
to slurm...@lists.schedmd.com
On 25/06/2024 12:20, stth via slurm-users wrote:
> Jun 25 10:06:39 server slurmctld[63738]: slurmctld: fatal: Can not
> recover last_conf_lite, incompatible version, (9472 not between 9728 and
> 10240), start with '-i' to ignore this. Warning: using -i will lose the
> data that can't be recovered.

Seems like it's not the first time Slurm has been set up on that machine, just the first time in a long while.
If there is no important data in that old db, just do what the error
says as a one-off.

stth via slurm-users

Jun 25, 2024, 11:56:20 AM
to Timo Rothenpieler, slurm...@lists.schedmd.com
Hi Timo,

Thanks. The old data wasn’t important, so I did that. I changed the following line in /usr/lib/systemd/system/slurmctld.service:
  
ExecStart=/usr/sbin/slurmctld --systemd -i $SLURMCTLD_OPTIONS

slurmctld is now active.

Timo Rothenpieler via slurm-users

Jun 25, 2024, 1:13:33 PM
to slurm...@lists.schedmd.com
On 25.06.2024 17:54, stth via slurm-users wrote:
> Hi Timo,
>
> Thanks, The old data wasn’t important so I did that. I changed the line
> as follows in the
> /usr/lib/systemd/system/slurmctld.service :
> ExecStart=/usr/sbin/slurmctld --systemd -i $SLURMCTLD_OPTIONS

You should be able to remove it again immediately.
I'd probably have just launched slurmctld manually from the command line with -i once.
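
The one-off manual launch described above could look roughly like this. It is only a sketch, assuming root privileges, the unit name slurmctld from the status output earlier in the thread, and that the manual launch uses slurmctld's -D option to stay in the foreground. DRY_RUN and run are hypothetical helpers introduced here; by default the script only prints the commands instead of executing them.

```shell
#!/bin/sh
# DRY_RUN and run() are illustrative helpers, not part of Slurm or systemd.
# With DRY_RUN=1 (the default) each command is printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Make sure systemd is not trying to run the controller at the same time.
run systemctl stop slurmctld

# One-off manual start: -i ignores state files that cannot be recovered,
# -D keeps the daemon in the foreground so it can be stopped with Ctrl-C.
run slurmctld -D -i
```

After the controller has started cleanly, stop it, take -i back out of ExecStart if it was added there, run systemctl daemon-reload, and start the service normally with systemctl start slurmctld. That way a later restart cannot silently discard state again.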