[slurm-users] bug when using SlurmctldParameters=cloud_reg_addrs ? error: get_name_info: getnameinfo() failed: Name or service not known

Pablo Escobar Lopez

Oct 25, 2021, 12:57:05 PM
to slurm...@schedmd.com
Hi,

I have configured Slurm cloud scheduling for OpenStack. I am using CentOS 7 with Slurm version 20.11.8, installed from the EPEL RPMs. It is working fine, but I am getting some strange errors in the Slurm master logs which I think are a bug.

I am using these options in slurm.conf:
SlurmctldParameters=enable_configless,cloud_reg_addrs,idle_on_node_suspend

With these options the cloud nodes work in "configless" mode, and the IP address of each cloud node is automatically updated on the Slurm master when the node contacts it, as described in the docs.

When the cloud nodes are shut down, I get this info from scontrol:

$> scontrol show node demo-slurm-compute-05 | grep -i NodeAddr
NodeAddr=demo-slurm-compute-05 NodeHostName=demo-slurm-compute-05 Version=20.11.8

When the cloud node boots and contacts the master, the IP address is properly updated, so the "cloud_reg_addrs" option seems to work fine. This is the output of scontrol after a cloud node boots:

$> scontrol show node demo-slurm-compute-dynamic-05 |grep NodeAddr
NodeAddr=192.168.105.128 NodeHostName=192.168.105.128 Version=20.11.8

However, every time a new cloud node boots and contacts the Slurm master, I still get these errors in the Slurm master log "slurmctld.log":

error: get_name_info: getnameinfo() failed: Name or service not known
error: slurm_auth_get_host: Lookup failed for 192.168.105.128

It seems that even though the node IP address is updated on the master, slurmctld still tries to resolve a hostname for that address (a reverse lookup), and this triggers the error. Despite the error, the node joins the cluster and can execute jobs.
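
To check whether the failure comes from name resolution on the master rather than from Slurm itself, the same kind of reverse lookup can be tried outside slurmctld. This is only a minimal Python sketch: the helper name `reverse_lookup` is made up for illustration, and it assumes that slurmctld requests a name-required lookup (the `NI_NAMEREQD` flag of `getnameinfo()`), which I have not confirmed.

```python
import socket
from typing import Optional


def reverse_lookup(ip: str) -> Optional[str]:
    """Return the hostname the resolver gives for `ip`, or None if there is none.

    Hypothetical helper, not part of Slurm. It only mirrors the kind of
    getnameinfo() call that the error message in slurmctld.log points at.
    """
    try:
        # NI_NAMEREQD makes getnameinfo() fail when no name can be found,
        # instead of silently returning the numeric address.
        host, _service = socket.getnameinfo((ip, 0), socket.NI_NAMEREQD)
        return host
    except socket.gaierror:
        # Raised for "Name or service not known" and for malformed addresses.
        return None
```

If `reverse_lookup("192.168.105.128")` returns `None` when run on the master, the resolver there has no PTR record or /etc/hosts entry for the cloud node's address, which would be consistent with the two log lines above.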

Has anyone experienced this problem? Is this a bug or am I doing something wrong with my config?

Best regards,
Pablo.

