lustre with slurm-gcp


RiverShah

Nov 17, 2020, 10:25:12 AM
to google-cloud-slurm-discuss
I have brought up a Slurm cluster using Terraform and the examples here:

Now I need to scale my storage and would like to use Lustre. I brought up the Lustre FS using the example here:

I am having trouble with VPC peering and getting the Lustre FS mounted on the Slurm cluster using strictly internal IP addresses and networks.

May I please get detailed instructions on how to do this? I see the NFS mount sections in the Slurm Terraform configuration, but I assume putting a public IP address from the Lustre FS there is a bad idea. Could I please get guidance on best practices here?
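For reference, I assume the mount section in question is the network_storage block in the Terraform variables. A rough sketch of what a Lustre entry over an internal IP might look like, assuming the SchedMD example tfvars layout (the IP, filesystem name, and mount point below are placeholders):

# Sketch only: a network_storage entry in the slurm-gcp tfvars (placeholder values)
network_storage = [{
  server_ip     = "10.20.0.5"      # internal IP (or hostname) of the Lustre MDS/MGS
  remote_mount  = "/lustre"        # Lustre filesystem name exported by the MGS
  local_mount   = "/mnt/lustre"    # mount point on the cluster nodes
  fs_type       = "lustre"
  mount_options = ""
}]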
Lastly, is there a terraform module for lustre-gcp?

Thanks!

Joseph Schoonover

Nov 17, 2020, 11:18:20 AM
to RiverShah, google-cloud-slurm-discuss
Check out the fluid-hpc_terraform repository example for launching a slurm cluster with Lustre - https://github.com/FluidNumerics/fluid-hpc_terraform/tree/master/examples/complete-with-lustre

If you deploy straight from that example, it will use the fluid-slurm-gcp solution for the cluster and create a new VPC network that both the cluster and a Lustre filesystem are deployed in. Be sure to read over the pricing section on the Marketplace page and the EULA.




Dr. Joseph Schoonover

Chief Executive Officer

Senior Research Software Engineer

j...@fluidnumerics.com









Alex Chekholko

Nov 17, 2020, 12:11:08 PM
to RiverShah, google-cloud-slurm-discuss
Hi,

Are you sure you need Lustre and a separate storage cluster? You can attach up to 64 TB of persistent disk to the head node with no additional setup.

If that's not enough, it looks like the maximum persistent disk capacity per instance was recently raised from 64 TB to 257 TB:

If you need more capacity than that, it may be simpler to spin up multiple SLURM clusters rather than a separate Lustre cluster.
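If you go that route, the knobs are roughly the controller machine type and disk settings in the tfvars. A sketch assuming the SchedMD example variable names (the machine type and sizes are placeholders; larger machine types generally get higher persistent disk throughput and IOPS caps):

# Sketch only: controller sizing in the slurm-gcp tfvars (placeholder values)
controller_machine_type = "n1-standard-32"  # larger machine types raise the persistent disk performance caps
controller_disk_type    = "pd-ssd"
controller_disk_size_gb = 16384             # ~16 TB; the controller's disk backs the NFS-exported /home and /apps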

Regards,
Alex

RiverShah

Nov 17, 2020, 12:44:14 PM
to google-cloud-slurm-discuss
@Dr. Schoonover: I will take a look at the example you posted. Thanks!
@Alex Good point regarding just using persistent disks with the controller node. A couple of questions: How do I ensure I don't lose any data I have transferred to the persistent disks when I add or remove partitions, make other changes to the cluster, and potentially have to tear the cluster down? How will I add provisioned capacity as my needs grow? What size of controller instance do I need to pick to maximize disk performance? Thanks for hinting towards a simpler deployment (that may be scalable enough).

Joseph Schoonover

Nov 17, 2020, 12:46:35 PM
to RiverShah, google-cloud-slurm-discuss
Filestore is another good option if all you need is storage capacity that can change over time. You can NFS-mount Filestore on your cluster.
Lustre is a great solution if you have performance-critical parallel I/O and need I/O bandwidth beyond what a single NFS endpoint can provide.
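A sketch of what a Filestore mount might look like through the same network_storage block, assuming a Filestore instance on the cluster's VPC and the SchedMD tfvars layout (the internal IP, share name, and mount options are placeholders):

# Sketch only: mounting a Filestore share via the slurm-gcp network_storage block (placeholder values)
network_storage = [{
  server_ip     = "10.30.0.2"        # Filestore instance's internal IP
  remote_mount  = "/vol1"            # file share name chosen when the Filestore instance was created
  local_mount   = "/mnt/filestore"
  fs_type       = "nfs"
  mount_options = "defaults,_netdev"
}]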




Dr. Joseph Schoonover

Chief Executive Officer

Senior Research Software Engineer

j...@fluidnumerics.com







Alex Chekholko

Nov 17, 2020, 12:59:36 PM
to RiverShah, google-cloud-slurm-discuss
RE: " I add or remove partitions or make other changes to the cluster " I think these scripts are designed to make it really easy to set up and tear down the cluster but it's not so easy to modify an existing cluster.  So the operational tasks are left as an exercise to the reader. Check out the archives of this mailing list for tips for how to do things like change the number of compute nodes or change the specs of the compute nodes or change the shared filesystem mounts on the compute nodes or add a piece of software to the node image.

It's definitely designed around pre-defining everything, then deploying, using, and tearing down. It may be quicker and easier to tear down the existing cluster and make a new one with the change you want. You'll have to experiment with all of that yourself and fit it to your use case.

For example, in my case I found it easy to launch a cluster per user project, with the instance types based on the job script specs the user comes to me with. A completely separate cluster for each user and dataset.

Regards,
Alex

Joseph Schoonover

Nov 17, 2020, 1:09:22 PM
to Alex Chekholko, RiverShah, google-cloud-slurm-discuss
If you want to easily modify compute partitions after deployment, check out the fluid-slurm-gcp solution. It installs a CLI (cluster-services) that can be used to reconfigure compute partitions at any point in your cluster's lifespan.

Here are a few resources that can give you an idea of how it works:



Dr. Joseph Schoonover

Chief Executive Officer

Senior Research Software Engineer

j...@fluidnumerics.com







Wyatt Gorman

Nov 17, 2020, 2:08:38 PM
to Joseph Schoonover, Alex Chekholko, RiverShah, google-cloud-slurm-discuss
Hi RiverShah,

Your original problem likely lies in configuring your Slurm cluster and your Lustre cluster to use the same VPC network. First, create the Lustre cluster either on a new network (by commenting out the VPC fields in the YAML) or on an existing network (like "default") by specifying it in the VPC fields. Then, in the Slurm YAML, specify the VPC network and subnet names at the top level (around line 30) and also in the partition configuration(s). Once both clusters are on the same network, they will be able to ping each other, and you will be able to mount Lustre.
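If you are deploying with Terraform instead of the YAML, the equivalent (assuming the SchedMD slurm-gcp example tfvars variable names; the network and subnet names below are placeholders) is roughly:

# Sketch only: pointing the Slurm cluster at the Lustre deployment's network (placeholder names)
network_name    = "lustre-vpc"      # the VPC the Lustre filesystem was deployed into
subnetwork_name = "lustre-subnet"

partitions = [{
  name       = "partition-1"
  # ... other partition fields as in the example tfvars ...
  vpc_subnet = "lustre-subnet"      # keep each partition on the same subnet
}]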

The FluidNumerics Marketplace Slurm solutions are great options that offer easier management and integration for situations like this.


Wyatt Gorman

HPC Solutions Manager

https://cloud.google.com/hpc




S. Ansar

Nov 18, 2020, 3:56:24 AM
to google-cloud-slurm-discuss
Thanks for the input. I have Lustre and the compute cluster in the same VPC now, fixed the IP conflicts, and can mount the Lustre filesystem on the compute nodes.