Starting lustre VMs before computeVMs


Anush Elangovan

Jun 18, 2020, 12:13:40 AM
to google-cloud-slurm-discuss
Hi,
   What is the best way to start a few VMs when spinning up the first compute node? I would like to spin up my Lustre VMs on demand before the first compute node.

Also, could we mount the Lustre FS once the VMs are up?

Thanks
Anush

Joseph Schoonover

Jun 18, 2020, 10:00:05 AM
to google-cloud-slurm-discuss
Anush,
I think what you're asking is still an open-ended design and engineering question. If you want Lustre to scale automatically with the compute nodes of your cluster, I would monitor the Slurm database, keep track of the number of nodes or independent tasks, and scale the number of object storage servers (OSS) based on some expected IO bandwidth requirement per task. From this, you could write a service that creates/deletes OSS instances on the fly to meet compute resource demands. The next question is: where is the most appropriate place to run such a service? You could run it on the controller, or set up triggers when jobs are submitted to hit Pub/Sub and then execute Cloud Functions. This is really an open-ended question right now.
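As a sketch of what the sizing logic of such a service could look like (the bandwidth figures and the ceiling-division rule below are assumptions for illustration, not numbers from this thread):

```shell
#!/bin/bash
# Hypothetical OSS sizing rule: run enough object storage servers to cover the
# aggregate IO demand of the allocated compute nodes. Both figures below are
# assumptions you would tune for your workload.
GB_PER_NODE=1   # assumed IO bandwidth need per compute node (GB/s)
GB_PER_OSS=2    # assumed deliverable bandwidth per OSS instance (GB/s)

# Ceiling division: desired OSS count for a given number of allocated nodes.
desired_oss_count() {
  nodes=$1
  echo $(( (nodes * GB_PER_NODE + GB_PER_OSS - 1) / GB_PER_OSS ))
}

# A real service would poll Slurm and reconcile, e.g.:
#   nodes=$(sinfo -h -t alloc -o "%D" | awk '{s+=$1} END {print s+0}')
#   target=$(desired_oss_count "$nodes")
#   ...then create/delete OSS instances with gcloud to match the target.
```

The reconcile loop itself could live on the controller or behind Pub/Sub and Cloud Functions, per the options above.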

To mount Lustre to compute nodes, you'll likely want to place the appropriate mount calls in the compute startup-script (under /apps/slurm/scripts/).
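For example, the startup-script addition could look like this sketch (the MDS address and filesystem name are placeholders, not values from this thread):

```shell
# Sketch of a Lustre client mount for the compute startup-script.
# MDS_HOST and FSNAME are hypothetical; replace with your own values.
MDS_HOST="10.0.0.2"        # hypothetical Lustre MDS address
FSNAME="lustre"            # hypothetical Lustre filesystem name
MOUNT_POINT="/mnt/lustre"

# Build the Lustre mount source string, e.g. 10.0.0.2@tcp:/lustre
lustre_source() {
  echo "$1@tcp:/$2"
}

# On a real compute node (with the Lustre client installed) you would then run:
#   mkdir -p "$MOUNT_POINT"
#   mount -t lustre "$(lustre_source "$MDS_HOST" "$FSNAME")" "$MOUNT_POINT"
```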

If you use the marketplace solution, you would add the following to /apps/cls/etc/cluster-config.yaml

mounts:
  - group: gid
    mount_directory: /mnt/lustre
    mount_options: defaults,_netdev
    owner: username
    permission: 755
    protocol: lustre
    server_directory: LUSTRE_MDS_IP:EXPORT

where you would fill in gid, username, the IP address of the Lustre MDS, and the local path to export. When compute nodes come online after this modification, they will mount your Lustre filesystem.
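For instance, with purely hypothetical values filled in (gid 1000, user jane, and a made-up MDS address), the block would read:

```yaml
mounts:
  - group: 1000                       # hypothetical gid
    mount_directory: /mnt/lustre
    mount_options: defaults,_netdev
    owner: jane                       # hypothetical username
    permission: 755
    protocol: lustre
    server_directory: 10.0.0.2:/lustre  # hypothetical MDS IP and export path
```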

Anush Elangovan

Jun 18, 2020, 10:20:11 AM
to Joseph Schoonover, google-cloud-slurm-discuss
Thanks Joe. 

I wasn't thinking of auto-scaling Lustre, but that would be really cool. I just wanted to start up the statically pre-allocated number of Lustre nodes before the first compute node comes online via Slurm.

Best,
Anush

--
You received this message because you are subscribed to the Google Groups "google-cloud-slurm-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-slurm-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-cloud-slurm-discuss/12049c34-5d18-40b2-835b-7a1c7afe3c5ao%40googlegroups.com.

Joseph Schoonover

Jun 18, 2020, 10:32:01 AM
to google-cloud-slurm-discuss
The only Lustre solutions I'm aware of that are available publicly are



The latter you may be able to run from the controller instance, provided the appropriate auth scopes are enabled on the controller and the default service account has the appropriate permissions.
If you want it to come online when the first few compute nodes are created, and you want that to happen when you submit the job, I'd have to think about how to identify the event associated with the first compute node coming online.

Alternatively, you can manually launch your Lustre cluster outside of your slurm-gcp instances. 

Anush Elangovan

Jun 18, 2020, 10:38:31 AM
to Joseph Schoonover, google-cloud-slurm-discuss
On Thu, Jun 18, 2020 at 7:32 AM Joseph Schoonover <j...@fluidnumerics.com> wrote:
The only Lustre solutions I'm aware of that are available publicly are

Yup. I am using this. 
 

The latter you may be able to run from the controller instance, provided the appropriate auth scopes are enabled on the controller and the default service account has the appropriate permissions.
If you want it to come online when the first few compute nodes are created, and you want that to happen when you submit the job, I'd have to think about how to identify the event associated with the first compute node coming online.

Yeah, this is what I was trying to get at: can we bring up, on demand, any compute node dependencies (like Lustre or something else)?
 

Alternatively, you can manually launch your Lustre cluster outside of your slurm-gcp instances. 

Yup. I have this working now. 

Thanks for your help
 

Wyatt Gorman

Jun 18, 2020, 10:39:42 AM
to Joseph Schoonover, google-cloud-slurm-discuss
Hi Anush,

We recommend the DDN EXAScaler Marketplace offering as the premier, supported Lustre offering. Google is the only cloud to offer DDN's Enterprise EXAScaler software. You can find it here: https://pantheon.corp.google.com/marketplace/details/ddnstorage/exascaler-cloud

You can set up the DDN cluster on a network like your default network or a new network, and then create your slurm cluster in that network using the Slurm YAML's VPC fields. Alternatively, you can create the networks separately and ensure the IP ranges aren't conflicting, and then later use VPC Peering to connect the two networks.

If you're looking for a programmatic way to launch the DDN EXAScaler software, we'll need to work together with DDN to enable that. Otherwise the Open Source lustre deployment is an option, and includes the possibility of transferring data to/from Google Cloud Storage.


Wyatt Gorman

HPC Solutions Manager

https://cloud.google.com/hpc





Anush Elangovan

Jun 18, 2020, 12:30:41 PM
to Wyatt Gorman, Joseph Schoonover, google-cloud-slurm-discuss
Hi Wyatt,

On Thu, Jun 18, 2020 at 7:39 AM 'Wyatt Gorman' via google-cloud-slurm-discuss <google-cloud-...@googlegroups.com> wrote:
Hi Anush,

We recommend the DDN EXAScaler Marketplace offering as the premier, supported Lustre offering. Google is the only cloud to offer DDN's Enterprise EXAScaler software. You can find it here: https://pantheon.corp.google.com/marketplace/details/ddnstorage/exascaler-cloud

I think that link is Google-internal. In case anyone wants it, I think this is the external one: https://console.cloud.google.com/marketplace/details/ddnstorage/exascaler-cloud
 

You can set up the DDN cluster on a network like your default network or a new network, and then create your slurm cluster in that network using the Slurm YAML's VPC fields. Alternatively, you can create the networks separately and ensure the IP ranges aren't conflicting, and then later use VPC Peering to connect the two networks.

Yeah, deploying the open-source Lustre + slurm-gcp leads to IP address conflicts. I managed to set them up as two different subnets and now they work fine. I have an outstanding PR for the Lustre version in slurm-gcp here: https://github.com/SchedMD/slurm-gcp/pull/8, since the version it refers to has changed.

As a side note: I tried to deploy slurm-gcp with CentOS 8 images for our workload, but unfortunately RHEL 8/CentOS 8 doesn't have pdsh and a bunch of its dependencies. :(

 

If you're looking for a programmatic way to launch the DDN EXAScaler software, we'll need to work together with DDN to enable that. Otherwise the Open Source lustre deployment is an option, and includes the possibility of transferring data to/from Google Cloud Storage.


Yeah, we have to stick with the open-source version for the initial exploration, and I will reach out when/if it goes to production with our customers.

Thanks

 

Joseph Schoonover

Jul 1, 2020, 11:58:34 PM
to google-cloud-slurm-discuss
Hey Anush,
Have you thought about installing pdsh from source on CentOS 8?

https://github.com/chaos/pdsh

Joseph Schoonover

Jul 2, 2020, 12:07:07 AM
to google-cloud-slurm-discuss
On your original question, you might consider modifying /apps/slurm/scripts/resume.py or creating a suitable prolog script that creates the Lustre MDS and OSS nodes.
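A prolog along those lines might look like this sketch (the instance names, zone, and the use of gcloud here are my assumptions, not something tested against slurm-gcp):

```shell
#!/bin/bash
# Sketch of a Slurm prolog that starts pre-allocated Lustre VMs on demand.
# LUSTRE_VMS and ZONE are hypothetical; adjust to your deployment.
LUSTRE_VMS="lustre-mds-0 lustre-oss-0 lustre-oss-1"
ZONE="us-central1-a"

# A VM needs a start unless it is already RUNNING.
needs_start() {
  [ "$1" != "RUNNING" ]
}

start_lustre_vms() {
  for vm in $LUSTRE_VMS; do
    status=$(gcloud compute instances describe "$vm" --zone "$ZONE" \
               --format='value(status)' 2>/dev/null)
    if needs_start "$status"; then
      gcloud compute instances start "$vm" --zone "$ZONE"
    fi
  done
}

# Uncomment to run from the prolog (requires gcloud auth on the node):
# start_lustre_vms
```

Note that a job prolog runs on the allocated compute nodes, so you'd want it to be idempotent, as above, when several nodes run it at once.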

PM me if you need Lustre terraform scripts. We're getting ready to release a Lustre module July 6.

Anush Elangovan

Jul 2, 2020, 12:15:40 AM
to Joseph Schoonover, google-cloud-slurm-discuss
Hi Joe, 
   Yes, I tried to do that, but there are a bunch of deps like "whatsup" that are also missing. I filed an issue with the Fedora EPEL upstream, but then they needed more info on all the deps; I haven't gotten to it and context-switched out.

Thanks


Anush Elangovan

Jul 2, 2020, 12:16:32 AM
to Joseph Schoonover, google-cloud-slurm-discuss
Ok, I will give it a look. For now I'm just turning them on manually.

Thanks

On Wed, Jul 1, 2020 at 9:07 PM Joseph Schoonover <j...@fluidnumerics.com> wrote:
On your original question, you might consider modifying /apps/slurm/scripts/resume.py or creating a suitable prolog script that creates the Lustre MDS and OSS nodes.

PM me if you need Lustre terraform scripts. We're getting ready to release a Lustre module July 6.


Joseph Schoonover

Jul 2, 2020, 12:17:28 AM
to google-cloud-slurm-discuss
Here's some documentation from SchedMD on prolog and epilog scripts

https://slurm.schedmd.com/prolog_epilog.html

Joseph Schoonover

Jul 2, 2020, 12:22:38 AM
to google-cloud-slurm-discuss
Another route to installing pdsh and all of its dependencies from source is to install Spack

https://spack.io

and run

spack install pdsh

To use pdsh, you would then run

spack load pdsh

to bring it onto your path.
I know it's not an ideal solution, but it could get you up and running with your apps on CentOS 8.
