[slurm-users] Integrating Slurm with WekaIO


Jeffrey Layton via slurm-users

Apr 19, 2024, 12:59:27 PM
to Slurm User Community List
Good afternoon,

I'm working on a cluster of NVIDIA DGX A100s that is using BCM 10 (Base Command Manager, which is based on Bright Cluster Manager). I ran into an error and only just learned that Slurm and Weka don't get along out of the box (presumably because Weka pins its client threads to cores). I read through their documentation here: https://docs.weka.io/best-practice-guides/weka-and-slurm-integration#heading-h.4d34og8

I thought I had set everything correctly, but when I try to restart slurmd I get the following:

Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: fetch_config: DNS SRV lookup failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: _establish_configuration: failed to load configs
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: slurmd: error: slurmd initialization failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: fetch_config: DNS SRV lookup failed
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: _establish_configuration: failed to load configs
Apr 19 05:29:39 bcm10-headnode slurmd[3992058]: error: slurmd initialization failed
Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Main process exited, code=exited, status=1/FAILURE
Apr 19 05:29:39 bcm10-headnode systemd[1]: slurmd.service: Failed with result 'exit-code'.

Has anyone encountered this?

I read that this is usually associated with configless Slurm, but I don't know how Slurm is built in BCM. slurm.conf is located in /cm/shared/apps/slurm/var/etc/slurm, and that is the copy I edited. The environment variables for Slurm are set correctly, so they point to this slurm.conf file.

One thing I did not do was tell Slurm which cores Weka is using. I can't seem to figure out the syntax for this. Can someone share the changes they made to slurm.conf?
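
For reference, the usual slurm.conf mechanism for this is core specialization. A minimal sketch, assuming Weka's client threads are pinned to cores 0 and 1 on each node; the node names and core IDs here are illustrative, not taken from the Weka guide:

# slurm.conf sketch: append CpuSpecList to the existing NodeName definition so
# Slurm treats the Weka-pinned cores as specialized (system) cores.
# CpuSpecList takes Slurm abstract CPU IDs; adjust to the cores Weka actually uses.
NodeName=dgx[01-04] CpuSpecList=0,1
# Or reserve a per-node count of cores instead of explicit IDs:
# NodeName=dgx[01-04] CoreSpecCount=2
# Enforcement of specialized cores requires the cgroup task plugin:
TaskPlugin=task/cgroup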

Thanks!

Jeff

Brian Andrus via slurm-users

Apr 19, 2024, 1:10:54 PM
to slurm...@lists.schedmd.com

This is because you have no slurm.conf in /etc/slurm, so slurmd falls back to 'configless' operation, which queries DNS to find out where to fetch the config. It is failing because you do not have DNS configured to tell the nodes where to ask for the config.
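
For reference, in configless mode slurmd looks up an SRV record named _slurmctld._tcp in the node's DNS search domain. A minimal sketch of the zone entry it expects, with a hypothetical domain and hostname (6817 is slurmctld's default port):

; BIND zone entry (hypothetical names)
_slurmctld._tcp.cluster.example.com. 3600 IN SRV 10 0 6817 headnode.cluster.example.com.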

Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).

Brian Andrus

Robert Kudyba via slurm-users

Apr 19, 2024, 1:16:35 PM
to Brian Andrus, slurm...@lists.schedmd.com

Simple solution: put a copy of slurm.conf in /etc/slurm/ on the node(s).

For Bright, slurm.conf is in /cm/shared/apps/slurm/var/etc/slurm, including on all nodes. Make sure that on the compute nodes $SLURM_CONF resolves to the correct path.

 

Jeffrey Layton via slurm-users

Apr 19, 2024, 1:17:55 PM
to Brian Andrus, slurm...@lists.schedmd.com
I like it; however, it was working before without a slurm.conf in /etc/slurm.

Plus, the environment variable SLURM_CONF is pointing to the correct slurm.conf file (the one in /cm/...). Wouldn't Slurm pick up that one?

Thanks!

Jeff


Brian Andrus via slurm-users

Apr 19, 2024, 1:50:28 PM
to Jeffrey Layton, slurm...@lists.schedmd.com

I would double-check where you are setting SLURM_CONF, then. It is acting as if it is not set (a typo, maybe?).

It should be in /etc/default/slurmd (but could be /etc/sysconfig/slurmd, depending on the distribution).

Also check what the final, actual command used to start it is. If anyone has changed the .service file or added an override file, that will affect things.
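
A quick way to see the final unit with any drop-in overrides folded in (these are standard systemctl commands):

systemctl cat slurmd
systemctl show slurmd --property=Environment,ExecStart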

Brian Andrus

Robert Kudyba via slurm-users

Apr 19, 2024, 2:15:42 PM
to Brian Andrus, Jeffrey Layton, slurm...@lists.schedmd.com
On Bright it's set in a few places:
grep -r -i SLURM_CONF /etc
/etc/systemd/system/slurmctld.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/systemd/system/slurmdbd.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/systemd/system/slurmd.service.d/99-cmd.conf:Environment=SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf
/etc/logrotate.d/slurmdbd.rpmsave:  SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
/etc/logrotate.d/slurm.rpmsave:  SLURM_CONF=/cm/shared/apps/slurm/var/etc/slurm/slurm.conf /cm/shared/apps/slurm/current/bin/scontrol reconfig > /dev/null
/etc/pull.pl:$ENV{'SLURM_CONF'} = '/cm/shared/apps/slurm/var/etc/slurm/slurm.conf';

It'd still be good to check what echo $SLURM_CONF returns on a compute node.
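
A quick sketch of that check, including the environment the running slurmd actually sees (standard tools; the pgrep pattern is illustrative):

# on a compute node
echo $SLURM_CONF
# environment of the running slurmd process, if one is up
tr '\0' '\n' < /proc/$(pgrep -o slurmd)/environ | grep SLURM_CONF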