How to update Slurm after applying new Terraform partition update


tech-msp

Oct 14, 2022, 3:39:15 AM
to google-cloud-slurm-discuss
Hi,

I have previously deployed Slurm using Terraform on Google Cloud Platform successfully.

Now I wish to add/update the partitions in my cluster. After updating the Terraform script and applying it successfully, I logged in to my controller/login node, but the Slurm configuration (as shown by the 'sinfo' command) was not updated.

How can I get Slurm to "detect" the changes and apply them, so that 'sinfo' shows the new partitions? Thanks in advance.

Regards, Wayne

Olivier Martin

Oct 14, 2022, 8:23:44 PM
to tech-msp, google-cloud-slurm-discuss
Hi,

Typically, if you use the latest version of Slurm GCP (https://github.com/SchedMD/slurm-gcp), there is a flag in the tfvars file called enable_reconfigure (for example, here: https://github.com/SchedMD/slurm-gcp/blob/master/terraform/slurm_cluster/examples/slurm_cluster/cloud/full/example.tfvars). If you set it to true, the controller VM subscribes to a Pub/Sub topic, which lets it learn when you push a change using Terraform, and the changes are then applied to your cluster.
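As a rough illustration of the change described above, the tfvars fragment might look like the following. Only the enable_reconfigure flag comes from this thread; the rest of the file is assumed to stay as in the linked example.

```hcl
# Fragment of the cluster's .tfvars file (all other settings unchanged).
# With this flag enabled, the controller is notified through Pub/Sub
# when `terraform apply` pushes a configuration change.
enable_reconfigure = true
```

After editing the file, re-run terraform apply so the change takes effect.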

Hope this helps!


--

Olivier Martin

martin...@google.com

HPC Customer Engineer

(514) 670-8562

tech-msp

Oct 16, 2022, 9:48:03 PM
to google-cloud-slurm-discuss
Hi Martin,

Thanks for the response. I set enable_reconfigure to true, and after running terraform apply, the following error appeared:

│ Error: Error creating Schema: googleapi: Error 403: User not authorized to perform this action.

│   with module.slurm_cluster.module.slurm_controller_instance[0].google_pubsub_schema.this[0]
│   on ../../../../../slurm_cluster/modules/slurm_controller_instance/main.tf line 277, in resource "google_pubsub_schema" "this":
│  277: resource "google_pubsub_schema" "this" {

This seems to be a permission/role issue. Do you happen to know which permissions are required for this to succeed? Thanks.

Regards, Wayne

Olivier Martin

Oct 17, 2022, 9:53:24 AM
to tech-msp, google-cloud-slurm-discuss
I'm not sure exactly which permission is missing, but whichever user Terraform runs as needs some permissions on Pub/Sub. Pub/Sub Admin will work, though you may be able to get away with fewer permissions. This guide gives you a bit more info.

You also need the service account (SA) of the controller to be able to read from these topics. Again, Pub/Sub Admin will do, but it's more than you need, and you should try to give it as few permissions as possible. The permissions I see actually being used by the controller's SA are:
[image.png: screenshot listing the permissions used by the controller SA; not reproduced here]

The other permissions of the Pub/Sub Admin role (there are 33 others) aren't used.
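A minimal sketch of granting the controller's service account the Pub/Sub Admin role with Terraform, assuming the role is bound at the project level. The variable names project_id and controller_sa_email are hypothetical placeholders, and, as noted above, this role is broader than what the controller actually uses.

```hcl
# Hypothetical inputs; replace with your own project and service account.
variable "project_id" {
  type = string
}

variable "controller_sa_email" {
  type = string
}

# Binds Pub/Sub Admin to the controller's service account at project level.
# This is broader than strictly needed; a narrower role is preferable
# once the exact set of required permissions is known.
resource "google_project_iam_member" "controller_pubsub" {
  project = var.project_id
  role    = "roles/pubsub.admin"
  member  = "serviceAccount:${var.controller_sa_email}"
}
```

The identity running Terraform needs its own Pub/Sub permissions as well (the 403 above came from creating the schema), and a project administrator would usually have to grant those separately.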

Hope this helps,
Olivier


