SchedMD Slurm - Controller Slurm install fails

Gerhard Uwe Bartsch

May 4, 2022, 12:14:04 PM
to google-cloud-slurm-discuss
I'm not sure if anyone has run into this, but I used the basic template from SchedMD to start prototyping a Terraform-based Slurm install.

I've run into two issues:

1) The fluentd install errors out on all three Slurm instances in GCP (compute, controller, login) while the startup-script runs SchedMD's setup.py:

startup-script: Loaded plugins: fastestmirror
startup-script: Loading mirror speeds from cached hostfile
startup-script:  * base: ftpmirror.your.org
startup-script:  * epel: ord.mirror.rackspace.com
startup-script:  * extras: mirror.team-cymru.com
startup-script:  * updates: mirror.team-cymru.com
startup-script: Retrieving key from https://packages.cloud.google.com/yum/doc/yum-key.gpg
startup-script: Importing GPG key 0x0B5FC9E2:
startup-script:  Userid     : "Rapture Automatic Signing Key (cloud-rapture-signing-key-2022-03
startup-script:  Fingerprint: 9c5a 47a0 dedd 6927 121f 9095 daff b062 0b5f c9e2
startup-script:  From       : https://packages.cloud.google.com/yum/doc/yum-key.gpg
startup-script: Retrieving key from https://packages.cloud.google.com/yum/doc/rpm-package-key.g
startup-script: https://packages.cloud.google.com/yum/repos/google-cloud-logging-el7-x86_64/rep
startup-script: Trying other mirror.
startup-script:
startup-script:
startup-script:  One of the configured repositories failed (Google Cloud Logging Agent Reposito
startup-script:  and yum doesn't have enough cached data to continue. At this point the only
startup-script:  safe thing yum can do is fail. There are a few ways to work "fix" this:
startup-script:
startup-script:      1. Contact the upstream for the repository and get them to fix the problem
startup-script:
startup-script:      2. Reconfigure the baseurl/etc. for the repository, to point to a working
startup-script:         upstream. This is most often useful if you are using a newer
startup-script:         distribution release than is supported by the repository (and the
startup-script:         packages for the previous distribution release still work).
startup-script:
startup-script:      3. Run the command with the repository temporarily disabled
startup-script:             yum --disablerepo=google-cloud-logging ...
startup-script:
startup-script:      4. Disable the repository permanently, so yum won't use it by default. Yum
startup-script:         will then just ignore the repository until you permanently enable it
startup-script:         again or use --enablerepo for temporary usage:
startup-script:
startup-script:             yum-config-manager --disable google-cloud-logging
startup-script:         or
startup-script:             subscription-manager repos --disable=google-cloud-logging
startup-script:
startup-script:      5. Configure the failing repository to be skipped, if it is unavailable.
startup-script:         Note that yum will try to contact the repo. when it runs most commands,
startup-script:         so will have to try and fail each time (and thus. yum will be be much
startup-script:         slower). If it is a very temporary problem though, this is often a nice
startup-script:         compromise:
startup-script:
startup-script:             yum-config-manager --save --setopt=google-cloud-logging.skip_if_una
startup-script:
startup-script: failure: repodata/repomd.xml from google-cloud-logging: [Errno 256] No more mir
startup-script: https://packages.cloud.google.com/yum/repos/google-cloud-logging-el7-x86_64/rep
startup-script: https://packages.cloud.google.com/yum/repos/google-cloud-logging-el7-x86_64/rep
startup-script: Trying other mirror.


2) The controller's setup.py script appears to be failing to install Slurm, and I'm not sure what this error set actually means... but it appears that if Slurm doesn't install, the other systems can't connect to /apps and they error out:

startup-script: Waiting for /home to be mounted
startup-script: run: mount /home
startup-script: run: mount -a
startup-script: run: systemctl start munge
startup-script: Traceback (most recent call last):
startup-script:   File "/tmp/setup.py", line 1126, in <module>
startup-script:     main()
startup-script:   File "/tmp/setup.py", line 1037, in main
startup-script:     install_slurm()
startup-script:   File "/tmp/setup.py", line 509, in install_slurm
startup-script:     urllib.request.urlretrieve(slurm_url, src_path/file)
startup-script:   File "/usr/lib64/python3.6/urllib/request.py", line 248, in urlretrieve
startup-script:     with contextlib.closing(urlopen(url, data)) as fp:
startup-script:   File "/usr/lib64/python3.6/urllib/request.py", line 223, in urlopen
startup-script:     return opener.open(url, data, timeout)
startup-script:   File "/usr/lib64/python3.6/urllib/request.py", line 532, in open
startup-script:     response = meth(req, response)
startup-script:   File "/usr/lib64/python3.6/urllib/request.py", line 642, in http_response
startup-script:     'http', request, response, code, msg, hdrs)
startup-script:   File "/usr/lib64/python3.6/urllib/request.py", line 570, in error
startup-script:     return self._call_chain(*args)
startup-script:   File "/usr/lib64/python3.6/urllib/request.py", line 504, in _call_chain
startup-script:     result = func(*args)
startup-script:   File "/usr/lib64/python3.6/urllib/request.py", line 650, in http_error_default
startup-script:     raise HTTPError(req.full_url, code, msg, hdrs, fp)
startup-script: urllib.error.HTTPError: HTTP Error 404: Not Found
startup-script exit status 1
Finished running startup scripts.
Started Google Compute Engine Startup Scripts.
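
A quick way to see what setup.py is choking on is to test the tarball URL by hand. The exact URL is assembled inside setup.py, but SchedMD's downloads follow a predictable pattern, so a check along these lines (with <version> substituted for whatever setup.py is requesting) shows whether the file is still being served:

curl -sI https://download.schedmd.com/slurm/slurm-<version>.tar.bz2 | head -n 1

An "HTTP/1.1 404 Not Found" status line there matches the traceback above.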

Our basic.tfvars file for Terraform looks like this:

cluster_name = "test-slurm-1"
project      = "our-test-project-name"
zone         = "us-central1-a"

network_name            = "our-vpc"
subnetwork_name         = "test-subnet"
shared_vpc_host_project = "our-infrastructure-vpc"

disable_controller_public_ips = true
disable_login_public_ips      = true
disable_compute_public_ips    = true

# ompi_version  = null # e.g. v3.1.x
# slurm_version = "19.05-latest"
suspend_time  = 60

# controller_machine_type = "n1-standard-2"
# controller_disk_type    = "pd-standard"
# controller_disk_size_gb = 50
# controller_labels = {
#   key1 = "val1"
#   key2 = "val2"
# }
controller_service_account = "our-compute instances service account"
controller_scopes          = ["https://www.googleapis.com/auth/cloud-platform"]
# cloudsql = {
#   server_ip = "<cloudsql ip>"
#   user      = "slurm"
#   password  = "verysecure"
#   db_name   = "slurm_accounting"
# }
# controller_secondary_disk      = false
# controller_secondary_disk_size = 100
# controller_secondary_disk_type = "pd-ssd"

# login_machine_type = "n1-standard-2"
# login_disk_type    = "pd-standard"
# login_disk_size_gb = 20
# login_labels = {
#   key1 = "val1"
#   key2 = "val2"
# }
login_node_count = 1
login_node_service_account = "our-compute instances service account"
login_node_scopes          = [
]

# Optional network storage fields
# network_storage is mounted on all instances
# login_network_storage is mounted on controller and login instances
network_storage = [{
  server_ip     = "10.0.0.10"
  remote_mount  = "/test_test"
  local_mount   = "/home"
  fs_type       = "nfs"
  mount_options = "nfsvers=3"
}]
#
login_network_storage = [{
  server_ip     = "10.0.0.11"
  remote_mount  = "/shared"
  local_mount   = "/mnt/shared"
  fs_type       = "nfs"
  mount_options = "nfsvers=3"
}]

# compute_image_machine_type = "n1-standard-2"
# compute_image_disk_type    = "pd-standard"
# compute_image_disk_size_gb = 20
# compute_image_labels = {
#   key1 = "val1"
#   key2 = "val2"
# }

# compute_node_service_account = "default"
# compute_node_scopes          = [
# ]

partitions = [
  { name                 = "debug"
    machine_type         = "n1-standard-2"
    static_node_count    = 0
    max_node_count       = 10
    zone                 = "us-central1-a"
    compute_disk_type    = "pd-standard"
    compute_disk_size_gb = 20
    compute_labels       = {}
    cpu_platform         = null
    gpu_count            = 0
    gpu_type             = null
    network_storage      = []
    preemptible_bursting = true
    vpc_subnet           = null
  },
#  { name                 = "partition2"
#    machine_type         = "n1-standard-16"
#    static_node_count    = 0
#    max_node_count       = 20
#    zone                 = "us-central1-a"
#    compute_disk_type    = "pd-ssd"
#    compute_disk_size_gb = 20
#    compute_labels       = {
#      key1 = "val1"
#      key2 = "val2"
#    }
#    cpu_platform         = "Intel Skylake"
#    gpu_count            = 8
#    gpu_type             = "nvidia-tesla-v100"
#    network_storage      = [{
#      server_ip     = "none"
#      remote_mount  = "<gcs bucket name>"
#      local_mount   = "/data"
#      fs_type       = "gcsfuse"
#      mount_options = "file_mode=664,dir_mode=775,allow_other"
#    }]
#    preemptible_bursting = true
#    vpc_subnet           = null
#  },
]

We've managed to find a manual solution to the fluentd install issue ( https://cloud.google.com/logging/docs/agent/logging/installation ), but the failure in the controller's setup.py script is perplexing. We had it all working just running the basic.tfvars as originally provided, but it needed network configuration... and now it's broken. (The systems can get out to the internet using Cloud NAT, but do not have public IPs.)
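For anyone else hitting it, the workaround from that doc amounts to fetching Google's repo script and running it on each node (these are the documented agent-install commands, nothing SchedMD-specific):

curl -sSO https://dl.google.com/cloudagents/add-logging-agent-repo.sh
sudo bash add-logging-agent-repo.sh --also-install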

Help!

Gerhard Uwe Bartsch

May 4, 2022, 9:56:47 PM
to google-cloud-slurm-discuss
So it appears that part of the problem is that SchedMD removed all of their older versions (e.g. 19.x) from the download site, and probably pulled them from use in GCP.

" Due to a security vulnerability (CVE-2022-29500), all versions of Slurm prior to 21.08.8 or 20.11.9 are no longer available for download."

I suspect the Terraform code was trying to pull down a v19.x release as the default.
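
If so, pinning a release that's still downloadable in basic.tfvars should sidestep the bad default, assuming the module still honors the slurm_version variable (it's commented out in our file above), e.g.:

slurm_version = "20.11.9"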

Wyatt Gorman

May 5, 2022, 7:39:04 PM
to Gerhard Uwe Bartsch, google-cloud-slurm-discuss
Hi Gerhard,

Thanks for reaching out. Can you confirm that you're using the latest Terraform code from the SchedMD repository? The attempt to download a version of Slurm from 2019 makes me suspect that you're working from an old copy of the Terraform code that is out of date.

Thanks,


Wyatt Gorman

HPC Solutions Manager

https://cloud.google.com/hpc


Gerhard Uwe Bartsch

May 9, 2022, 7:41:01 AM
to google-cloud-slurm-discuss
Thanks for getting back to me!!

I used the following to download the SchedMD git repo (which includes the sample basic.tfvars and the rest of the setup scripts):
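(That is, presumably SchedMD's slurm-gcp repository; the clone command would be along the lines of:)

git clone https://github.com/SchedMD/slurm-gcp.git   # assumed: the standard slurm-gcp repo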


I found that the setup scripts appear to be trying to download an older version of Slurm.

When I inserted a specific version from the SchedMD downloads site ( https://www.schedmd.com/downloads.php ), there were still issues that prevented it from installing.

Specifically, the attempt to have the CentOS 7 core image download fluentd from Google fails 100% of the time, which clobbers the installation scripts for the login, controller, and compute node builds. (It's a hard failure at that point.)

If I then log on to each node and issue the following command, I can get the login and controller to complete, but the compute node never finishes... (Compute seems to be waiting to connect to /apps, which, for some unknown reason, isn't working... even though they are on the same subnet.)

sudo bash add-logging-agent-repo.sh --also-install
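
To chase the /apps hang on the compute node, standard NFS tooling can show whether the export is even visible from there (assuming /apps is served by the controller, as in the default slurm-gcp setup; substitute the controller's hostname):

showmount -e <controller-hostname>                      # list exports visible from the compute node
sudo mount -v -t nfs <controller-hostname>:/apps /apps  # verbose mount attempt, to surface the actual error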

FWIW: I've been checking the startup scripts and their progress using this command:

sudo journalctl -o cat -u google-startup-scripts

If anyone has any thoughts as to why the basic test to install Slurm using their sample code is failing (with the shared VPC filled in and the correct network configured), I'd be really grateful.

Gerhard
