[slurm-dev] An issue about slurm on CentOS 7.3


Huijun HJ1 Ni

Aug 25, 2017, 7:38:01 AM
to slurm-dev

Hi,

I installed Slurm on my cluster, which runs CentOS 7.3.

After completing the configuration, I found that ‘systemctl start slurm’ hangs on the compute nodes (but works on the control node, where slurmctld runs).

However, ‘systemctl start slurmd’ works fine on the compute nodes.

So is this a defect in Slurm, or is there a problem in my configuration? Can you help me?

My configuration is attached.

Thanks.

 

Best regards,

 

HuiJun Ni

Solution Python Developer

DC PG System Tools Dev

+8618116117580

ni...@lenovo.com

 

slurm.conf

Hadrian Djohari

Aug 25, 2017, 8:02:12 AM
to slurm-dev
On CentOS 7, Slurm 17.x runs slurmd on the compute nodes and slurmctld/slurmdbd on the head nodes.
"slurm" was the name of the service on compute nodes under RHEL 6.
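
As a rough illustration of that mapping, here is a minimal shell sketch that picks the unit name by node role. The function name and the role labels (compute, controller, database) are made up for this example; only the unit names slurmd, slurmctld and slurmdbd come from the message above.

```shell
#!/bin/sh
# unit_for_role ROLE
# Prints the systemd unit to start for a given node role.
# The role labels are hypothetical; the unit names are the ones named above.
unit_for_role() {
    case $1 in
        compute)    echo slurmd ;;
        controller) echo slurmctld ;;
        database)   echo slurmdbd ;;
        *)          echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Example, run as root on a compute node:
#   systemctl start "$(unit_for_role compute)"
```

On a compute node this resolves to slurmd, which matches the command that worked for the original poster.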

Hadrian
--
Hadrian Djohari
HPCC Manager, [U]Tech
Case Western Reserve University
(W): 216-368-0395
(M): 216-798-7490

Ole Holm Nielsen

Aug 25, 2017, 8:08:32 AM
to slurm-dev

On 08/25/2017 01:37 PM, Huijun HJ1 Ni wrote:
> I installed slurm on my cluster whose OS are CentOS7.3.
>
> After I completed the configuration, I found that it would be
> hung while executing ‘systemctl start slurm’ on compute nodes(but is ok
> on control node where slurmctld runs).
>
> But if I used the command ‘systemctl start slurmd’ on compute
> nodes, that were ok.
>
> So is that a defeat for slurm or any problems in my
> configurations? Can you help me?
>
> Attachment is my configurations.

Please see my HowTo Wiki about Slurm on CentOS/RHEL 7:
https://wiki.fysik.dtu.dk/niflheim/SLURM

Documentation about starting services:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration

/Ole

Nicholas McCollum

Aug 25, 2017, 12:19:32 PM
to slurm-dev
I like your documentation but I would add a few things:

I highly recommend not having the slurmctld start automatically upon
reboot. If for some reason the slurm spool directory isn't available
(on a shared folder) it will cause all the jobs to die across the
cluster. I always like to triple check to make sure that the directory
is available before starting the slurmctld.
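
One possible way to script that check is sketched below. The helper names are hypothetical, and the directory in the usage comment is only an assumed example; it should be whatever directory slurmctld keeps its state in on your site.

```shell
#!/bin/sh
# state_dir_ready DIR
# Succeeds only if DIR is non-empty, exists as a directory, and is writable.
state_dir_ready() {
    dir=$1
    [ -n "$dir" ] && [ -d "$dir" ] && [ -w "$dir" ]
}

# start_if_ready DIR CMD [ARGS...]
# Runs CMD only when DIR passes the check above; otherwise prints a
# message on stderr and returns non-zero without running anything.
start_if_ready() {
    dir=$1
    shift
    if state_dir_ready "$dir"; then
        "$@"
    else
        echo "directory '$dir' is not available; not running: $*" >&2
        return 1
    fi
}

# Example, run as root on the controller (the path is an assumption):
#   start_if_ready /var/spool/slurmctld systemctl start slurmctld
```

This only makes sense if slurmctld is not started automatically at boot, for example after running "systemctl disable slurmctld". Note that the writability test always passes for root on a local directory, so it mainly catches a missing or unmounted path.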

I also find it helpful, especially in instances like this, to run the
daemon in foreground mode.

# slurmctld -Dvvvv
# slurmd -Dvvvv

This will print out any errors directly on the terminal and you can see
right away why the daemon has crashed or failed to start.
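
If you also want to keep a copy of that output for later, a small wrapper along these lines might help. The function name and the log path are hypothetical; the daemon command is the same one shown above.

```shell
#!/bin/sh
# run_and_log LOGFILE CMD [ARGS...]
# Runs CMD in the foreground, showing its stdout and stderr on the
# terminal while also saving a copy to LOGFILE.
run_and_log() {
    logfile=$1
    shift
    "$@" 2>&1 | tee "$logfile"
}

# Example, run as root on a compute node after stopping the service:
#   systemctl stop slurmd
#   run_and_log /tmp/slurmd-debug.log slurmd -Dvvvv
```

The daemon stays in the foreground until interrupted with Ctrl-C, and the saved log can then be attached to a mailing-list post.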


--
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

Ole Holm Nielsen

Aug 28, 2017, 3:37:41 AM
to slurm-dev

On 08/25/2017 06:19 PM, Nicholas McCollum wrote:
> I like your documentation but I would add a few things:
>
> I highly recommend not having the slurmctld start automatically upon
> reboot. If for some reason the slurm spool directory isn't available
> (on a shared folder) it will cause all the jobs to die across the
> cluster. I always like to triple check to make sure that the directory
> is available before starting the slurmctld.
>
> I also find it helpful, especially in instances like this, to run the
> daemon in foreground mode.
>
> # slurmctld -Dvvvv
> # slurmd -Dvvvv
>
> This will print out any errors directly on the terminal and you can see
> right away while the daemon has crashed or failed to start.

Thanks for your nice comments. I added a section about manual daemon
startup to cover the scenario you describe:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#manual-startup-of-services

It's difficult to foresee every kind of problem which may occur, but
it's good to have common scenarios in the documentation.

Our Slurm master server only has local storage, but I suppose that you
need shared remote storage for Slurm HA controllers?

/Ole