slurm-gcp with Debian Deep Learning image on all compute nodes


tk...@ucdavis.edu

Feb 8, 2021, 6:14:36 PM
to google-cloud-slurm-discuss
Hi,
Is it possible to use a Debian Deep Learning image on all compute nodes?
It comes with a number of packages that work quite well for DL (TF2, xgboost, PyTorch).

Let's say I want to install "Deep Learning Image: Base m63 CUDA11.0",
a Debian 10 based Linux image with CUDA 11.0 preinstalled.

I can use gcloud to find the specific image name:
>gcloud compute images list --project deeplearning-platform-release | grep common
common-cu110-v20201231-debian-9     deeplearning-platform-release   common-cu110-debian-9     
common-cu110-v20210203              deeplearning-platform-release   common-cu110              
common-cu110-v20210203-debian-10    deeplearning-platform-release   common-cu110-debian-10    
common-cu110-v20210203-ubuntu-1804  deeplearning-platform-release   common-cu110-ubuntu-1804

What is the next step? Is it simply adding the image name to the cluster_yaml file?
compute_image_family   : common-cu110-v20210203-debian-10
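
A side note on the listing above: gcloud's default table output puts NAME, PROJECT and FAMILY in the first three columns. The value proposed for compute_image_family therefore matches the NAME column, while the FAMILY column for that image reads common-cu110-debian-10. The Python sketch below splits the pasted rows so the two can be told apart. The helper name parse_image_rows is hypothetical, and the three-column layout is an assumption based on the pasted output. It says nothing about whether slurm-gcp accepts either value.

```python
# Sketch: split pasted `gcloud compute images list` rows into columns.
# Assumption: each row holds NAME, PROJECT, FAMILY in that order.

LISTING = """\
common-cu110-v20201231-debian-9     deeplearning-platform-release   common-cu110-debian-9
common-cu110-v20210203              deeplearning-platform-release   common-cu110
common-cu110-v20210203-debian-10    deeplearning-platform-release   common-cu110-debian-10
"""


def parse_image_rows(text):
    """Return one dict per usable row with name, project and family."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or truncated lines
        name, project, family = parts[:3]
        rows.append({"name": name, "project": project, "family": family})
    return rows


images = parse_image_rows(LISTING)

# Keep only the rows whose image family targets Debian 10.
debian10 = [row for row in images if row["family"].endswith("debian-10")]

for row in debian10:
    print(f"name={row['name']} family={row['family']} project={row['project']}")
```

The commented-out compute_image_family_project key in the partition config further down suggests the image's project may need to be given as well, which for these images is the PROJECT column.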

Thank you!
Tobias

tk...@ucdavis.edu

Feb 8, 2021, 8:15:15 PM
to google-cloud-slurm-discuss
Ok,
I think that does not work. The slurm-gcp cluster deploys fine with CentOS head and login
nodes, but the compute nodes using the Debian image do not start.

>cluster_yaml
partitions :
      - name              : work
        machine_type      : n1-highcpu-8
        max_node_count    : 4
        zone              : us-central1-c
        compute_image_family : common-cu110-v20210203-debian-10
        # compute_image_family_project : custom-image's project

>sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
work*        up   infinite      1   mix# cluster-compute-0-1
work*        up   infinite      1  down~ cluster-compute-0-0
work*        up   infinite      2  idle~ cluster-compute-0-[2-3]

>[cluster-login0 ~]$ srun --pty $SHELL 
srun: error: Node failure on cluster-compute-0-1
srun: Force Terminated job 4
srun: error: Job allocation 4 has been revoked
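
For readers unfamiliar with the trailing characters in the STATE column of the sinfo output above: according to the sinfo man page, `~` marks a node that is powered down (power saving), `#` a node that is being powered up or configured, and `*` a node that is not responding. The Python sketch below decodes them. The names decode_state and summarize are hypothetical, and only these three flags are covered.

```python
# Sketch: decode the one-character flags that sinfo appends to node states.
# Only three flags are handled; sinfo defines more.
STATE_FLAGS = {
    "~": "powered down (power saving)",
    "#": "being powered up or configured",
    "*": "not responding",
}

SINFO = """\
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
work*        up   infinite      1   mix# cluster-compute-0-1
work*        up   infinite      1  down~ cluster-compute-0-0
work*        up   infinite      2  idle~ cluster-compute-0-[2-3]
"""


def decode_state(state):
    """Split a state such as 'mix#' into (base_state, flag_meaning or None)."""
    if state and state[-1] in STATE_FLAGS:
        return state[:-1], STATE_FLAGS[state[-1]]
    return state, None


def summarize(text):
    """Map each NODELIST entry to its decoded state, skipping the header row."""
    result = {}
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) != 6:
            continue  # not a regular six-column sinfo row
        state, nodelist = fields[4], fields[5]
        result[nodelist] = decode_state(state)
    return result


summary = summarize(SINFO)
for nodelist, (base, flag) in summary.items():
    print(f"{nodelist}: {base} ({flag or 'no flag'})")
```

Read this way, cluster-compute-0-1 was still being powered up or configured when the srun attempt failed, which is consistent with the compute nodes never finishing startup.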

Cheers
Tobias

Alex Chekholko

Feb 8, 2021, 8:38:38 PM
to tk...@ucdavis.edu, google-cloud-slurm-discuss
Hi Tobias, 

Right, all the provisioning scripts are specific to that distro and version.  If your software does not easily run on CentOS 7, it's going to be tough.

There are several similar cluster provisioning scripts out there that may get you closer to your desired cluster state.

Regards,
Alex


Wyatt Gorman

Feb 9, 2021, 8:35:47 AM
to Alex Chekholko, tk...@ucdavis.edu, google-cloud-slurm-discuss
Hi Tobias, support for Debian and for images like the dlvm image is coming in the next major release of the slurm scripts, which will be released in the next few months. I will reach out directly about possibly getting you an early development branch that supports the dlvm image.

Wyatt Gorman
HPC Solutions Manager
Google Cloud

tk...@ucdavis.edu

Feb 9, 2021, 7:27:42 PM
to google-cloud-slurm-discuss
Hi Wyatt,
that's great. Alex pointed me to the Google Cloud Marketplace solutions,
so I am currently running the Ubuntu-based fluid-slurm-gcp engine, which works pretty well.
I think CentOS in general is fine, but especially when porting solutions from a local cluster,
it's just easier to stay with the same operating system.

Plus, the available Google Deep Learning images already have a wide range of deep learning
tools (PyTorch, xgboost, TF and others) installed. That makes life much easier for people
who want to jump directly into work instead of installing endless packages and dependencies.
The same goes for different CUDA versions: sometimes software cannot be upgraded easily, so
having different versions already installed really helps.

Best
Tobias

Joseph Schoonover

Feb 9, 2021, 7:44:51 PM
to tk...@ucdavis.edu, google-cloud-slurm-discuss
Hey Tobias,
Glad to hear the Ubuntu marketplace solution is working out for you. Feel free to reach out if you need any assistance using that solution. In general, we do our best to maintain documentation at https://help.fluidnumerics.com/slurm-gcp and support can be reached at fluid-s...@fluidnumerics.com


Dr. Joseph Schoonover
Chief Executive Officer
Senior Research Software Engineer
j...@fluidnumerics.com







