Error after deployment from MarketPlace


tech-msp

May 30, 2022, 1:20:43 AM
to google-cloud-slurm-discuss
To whom it may concern,

I received the following error directly after deploying Schedmd-Slurm-GCP from MarketPlace and running the "sinfo" command.

"sinfo: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sinfo: error: fetch_config: DNS SRV lookup failed
sinfo: error: _establish_config_source: failed to fetch config
sinfo: fatal: Could not establish a configuration source"

Please advise how to resolve this. Thanks.

Regards, Wayne

Olivier Martin

Jun 2, 2022, 10:58:57 PM
to google-cloud-slurm-discuss
Hi Wayne,

I deployed it and was able to reproduce your issue in an environment with tight organization policies. I got it working properly with the following:

  1. Have a network in place that permits inter-node communication on the same subnet.
  2. After enabling the Compute API (probably best done before deploying the Marketplace image), make sure the default compute service account (something like <projectid>-com...@developer.gserviceaccount.com) is granted the following roles: roles/logging.logWriter, roles/monitoring.metricWriter, and roles/compute.admin. The compute.admin role could perhaps be restricted further, but the account has to be able to deploy compute nodes, list compute types, and perhaps a few more things; I don't know exactly which calls the various scripts make.
Let us know if this helps in your scenario.
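
A minimal sketch of the role grants in step 2, assuming they are applied with `gcloud projects add-iam-policy-binding`. The script only prints the commands so they can be reviewed before running; the function name, the project ID, and the service account value are placeholders and are not taken from the thread.

```python
import shlex
from typing import List

# The three roles named in step 2 above.
ROLES = [
    "roles/logging.logWriter",
    "roles/monitoring.metricWriter",
    "roles/compute.admin",
]


def iam_binding_commands(project_id: str, service_account: str) -> List[str]:
    """Return one `gcloud projects add-iam-policy-binding` command per role."""
    member = f"serviceAccount:{service_account}"
    return [
        shlex.join([
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member={member}",
            f"--role={role}",
        ])
        for role in ROLES
    ]


if __name__ == "__main__":
    # Placeholder values (assumptions): substitute your own project ID and
    # the default compute service account listed on the project's IAM page.
    for cmd in iam_binding_commands("my-project", "SERVICE_ACCOUNT_EMAIL"):
        print(cmd)
```

Printing rather than executing keeps the sketch safe to run anywhere; paste the output into a shell once the values look right.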

Cheers,
Olivier

tech-msp

Jun 2, 2022, 11:59:52 PM
to google-cloud-slurm-discuss
Hi Olivier,

I gathered that it was either a permissions issue or a network issue. I tried a few things and made some progress: I added firewall rules for ICMP, port 22, port 3389, and internal addresses. That got the Slurm configuration set up.

After that, however, I ran into another problem: when I ran an "sbatch" command, the compute nodes ended up "DOWN".

I ran "scontrol show node <node_id>" and got the following:
Details: "[{message: "The resource projects/<project_id>/regions/asia-southeast1/subnetworks/default was not found", domain: global, reason: notFound}]"> [slurm@2022-06-03T03:47:54]
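
A small sketch for pulling the project, region, and subnetwork out of a "not found" reason string like the one above, so it can be compared with the subnetwork chosen at deployment. The function name and the regular expression are illustrative assumptions; they rely only on the `projects/.../regions/.../subnetworks/...` path shape visible in the message.

```python
import re
from typing import Dict, Optional

# Matches the resource path embedded in the node's failure reason.
_SUBNET_RE = re.compile(
    r"projects/(?P<project>[^/\s]+)"
    r"/regions/(?P<region>[^/\s]+)"
    r"/subnetworks/(?P<subnetwork>[^/\s\"]+)"
)


def missing_subnetwork(reason: str) -> Optional[Dict[str, str]]:
    """Return project, region and subnetwork from the reason text, or None."""
    match = _SUBNET_RE.search(reason)
    return match.groupdict() if match else None


if __name__ == "__main__":
    reason = (
        'The resource projects/<project_id>/regions/asia-southeast1'
        '/subnetworks/default was not found'
    )
    print(missing_subnetwork(reason))
```

The extracted subnetwork name can then be checked against the output of `gcloud compute networks subnets list` for the project.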

When I deployed through Marketplace, I selected my existing subnetwork, which is not named "default". However, "default" appears to be hardcoded somewhere in the configuration, regardless of the subnetwork I selected.

Please advise. Thanks.

Regards, Wayne

Olivier Martin

Jun 3, 2022, 6:54:06 AM
to tech-msp, google-cloud-slurm-discuss
Hi Wayne,

Did you enable full internal communication on the network? This is required for mounting NFS and for Slurm communication between the nodes (ports 22 and 3389 won't help there).
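
One way to open internal traffic is an ingress firewall rule that allows TCP, UDP, and ICMP from the subnet's own address range. The sketch below only builds and prints a `gcloud compute firewall-rules create` command; the rule name, network name, and CIDR range are hypothetical placeholders and must be replaced with the values for your network.

```python
import shlex


def internal_firewall_command(rule_name: str, network: str, source_range: str) -> str:
    """Build a gcloud command allowing all TCP/UDP/ICMP ingress from source_range."""
    return shlex.join([
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        f"--network={network}",
        "--direction=INGRESS",
        "--allow=tcp,udp,icmp",
        f"--source-ranges={source_range}",
    ])


if __name__ == "__main__":
    # Placeholder values (assumptions): use your own network name and the
    # primary IP range of the subnet the Slurm nodes are deployed in.
    print(internal_firewall_command("allow-internal", "my-network", "10.0.0.0/24"))
```

Restricting the source range to the subnet keeps the rule scoped to node-to-node traffic rather than opening the ports to the internet.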

As for the network/subnet: the deployment asks which network you are deploying to. I didn't try deploying on a specific network; I tested with a network called 'default' with automatic subnets, and it works well (with the network fully open between hosts on the network).

Hope this helps,
Olivier

--

Olivier Martin

martin...@google.com

HPC Customer Engineer

(514) 670-8562
