Slurm on VM instance doesn't work with different MPI implementations

Michael Martin

Oct 24, 2022, 8:51:31 AM
to google-cloud-slurm-discuss

Hello,

I have created a Compute Engine VM instance built with Slurm (using one of the standard blueprints, "hpc-cluster-small.yaml"). On the VM I have a code base that uses a variety of packages installed with the Spack package manager, including MPICH. The issue occurs when I run the code using srun:

============================================================
"The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM support. This usually happens
when OMPI was not configured --with-slurm and we weren't able
to discover a SLURM installation in the usual places."
============================================================

I've tried a number of things, including building the MPICH library with Spack against the existing Slurm installation that ships with this instance. I've also tried installing MPICH with a Spack-installed Slurm, but whenever I load that module it seems to break Slurm altogether, and I get errors like:

============================================================
sinfo: error: resolve_ctls_from_dns_srv: res_nsearch error: Unknown host
sinfo: error: fetch_config: DNS SRV lookup failed
sinfo: error: _establish_config_source: failed to fetch config
sinfo: fatal: Could not establish a configuration source
============================================================

Is there a way to easily reconfigure the VM instance so that it recognizes different MPI implementations or different Slurm installations?

Thank you

Tom Downes

Oct 24, 2022, 10:27:31 AM
to Michael Martin, google-cloud-slurm-discuss
Hi Michael-

I believe you're looking at a blueprint in the HPC Toolkit, which uses the Slurm solution but is also a bit of a different beast. Please file a GitHub issue there:

https://github.com/GoogleCloudPlatform/hpc-toolkit/issues
Part of the answer is going to be how to help Spack recognize the copy of Slurm installed by SchedMD. One way is to register it as a Spack external package (see the sketch below). It needs to version-match the copy of Slurm from the image.


In the blueprint you're using, the hidden default Slurm version is 21-08-8.
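
For example, registering the image's copy of Slurm as a Spack external might look like the packages.yaml snippet below. This is a sketch, not taken from the blueprint: the prefix and exact version string are assumptions, so verify them against your instance (e.g. with "which sinfo" and "sinfo --version") first.

============================================================
packages:
  slurm:
    externals:
    - spec: slurm@21-08-8   # must match the version baked into the image
      prefix: /usr/local    # assumed install prefix on slurm-gcp images
    buildable: false        # never let Spack build its own Slurm
============================================================

Setting buildable: false forces Spack to always use the system Slurm rather than building its own, which is the Spack-installed Slurm that broke sinfo for you.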

I mention these to give you a bit of help right away, but do please file the issue.


Tom Downes
Software Engineer, High Performance Computing
+1-331-625-1145
210 N Carpenter St, Chicago, IL 60607





Tom Downes

Oct 24, 2022, 10:29:13 AM
to Michael Martin, google-cloud-slurm-discuss
Sorry, the other important line is this:

https://github.com/GoogleCloudPlatform/hpc-toolkit/blob/main/community/examples/AMD/hpc-cluster-amd-slurmv5.yaml#L95


This tells openmpi to build in support for Slurm specifically, but you'll likely want the other lines too to ensure it builds against the correct Spack library files.
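
For reference, the spec on that line amounts to something along these lines (the version pin and exact key names are assumptions; the linked blueprint line is authoritative):

============================================================
packages:
- openmpi +pmi +legacylaunchers schedulers=slurm ^slurm@21-08-8
============================================================

Here schedulers=slurm builds Open MPI's Slurm launch support (addressing the "OMPI was not built with SLURM support" error), +pmi links against Slurm's PMI library so srun can direct-launch ranks, and +legacylaunchers keeps mpirun/mpiexec installed even when Slurm is the scheduler.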


Tom Downes
Software Engineer, High Performance Computing
+1-331-625-1145
210 N Carpenter St, Chicago, IL 60607

