Cloud Architecture Center tutorial doesn't seem to work (jobs stuck in BeginTime)

David Huggins-Daines

Jul 28, 2022, 3:13:31 PM
to google-cloud-slurm-discuss
Hi,

I'm just starting out with Compute Engine and was very interested to find this tutorial about setting up a Slurm cluster:

https://cloud.google.com/architecture/deploying-slurm-cluster-compute-engine

But it doesn't seem to work.  Once I deploy the cluster and try to launch a test job, nothing happens (i.e. there is no output), and the jobs seem to get stuck with reason "BeginTime":

[dhdaines_gmail_com@full-login-8kjqahky-001 ~]$ sbatch -N2 --wrap="srun hostname"
Submitted batch job 1
[dhdaines_gmail_com@full-login-8kjqahky-001 ~]$ ls
[dhdaines_gmail_com@full-login-8kjqahky-001 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      2   mix# full-debug-test-[0-1]
debug*       up   infinite     18  idle~ full-debug-test-[2-19]
debug2       up   infinite     10  idle~ full-debug2-test-[0-9]
[dhdaines_gmail_com@full-login-8kjqahky-001 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1     debug     wrap dhdaines CF       0:06      2 full-debug-test-[0-1]
[dhdaines_gmail_com@full-login-8kjqahky-001 ~]$ ls
[dhdaines_gmail_com@full-login-8kjqahky-001 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1     debug     wrap dhdaines CF       0:14      2 full-debug-test-[0-1]
[dhdaines_gmail_com@full-login-8kjqahky-001 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1     debug     wrap dhdaines PD       0:00      2 (None)
[dhdaines_gmail_com@full-login-8kjqahky-001 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1     debug     wrap dhdaines PD       0:00      2 (BeginTime)


I'm not familiar with Slurm, having only used SGE and UGE in the past, so I'm sorry if I'm missing something obvious here.  I tried holding and releasing the job, and submitting another job, with the same result.

Is the tutorial out of date?  Is my account perhaps not capable of deploying a cluster?  (I'm still in the free trial)

Any hints would be greatly appreciated.

David Huggins-Daines

Jul 28, 2022, 3:17:50 PM
to google-cloud-slurm-discuss
Oh, never mind.  It just takes a long time for the cluster to "spin up", even after the point where the tutorial claims the cluster is "ready".  There is no indication of what is happening in the meantime, and the tutorial doesn't make this clear.
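
For anyone else who hits this: the suffixes in the STATE column of the sinfo output earlier in this thread are the clue. In Slurm, a trailing "~" marks a node that is powered down (power saving) and a trailing "#" marks one that is powering up or being configured, so "idle~" and "mix#" suggest the compute nodes were not ready yet. Below is a minimal sketch that annotates those two suffixes. The name explain_node_states is made up for illustration (it is not a Slurm command), and it assumes the default sinfo column layout shown above.

```shell
# explain_node_states: hypothetical helper, not part of Slurm.
# Reads default-format `sinfo` output on stdin and annotates the
# power-state suffix of each line's STATE column (5th field).
explain_node_states() {
  awk 'NR > 1 {
    state = $5
    last = substr(state, length(state))
    note = "no power-state suffix"
    if (last == "~") note = "powered down (power saving)"
    else if (last == "#") note = "powering up or being configured"
    printf "%s %s nodes %s: %s\n", $1, $4, state, note
  }'
}

# Usage on a login node (assumes sinfo is on the PATH):
#   sinfo | explain_node_states
```

Only "~" and "#" are handled here; sinfo has other suffixes, which this sketch reports as having no power-state suffix.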

You (whoever writes this documentation, if you read this forum) should probably make this clear in the documentation, unless it is somehow considered to be obvious to everyone except me.

David Huggins-Daines

Jul 28, 2022, 5:15:57 PM
to google-cloud-slurm-discuss
It seems that this codelab is vastly more informative, but is a bit out of date with respect to the current slurm-gcp configuration:


Had I read this instead, I would have had a better idea what was going on the first time I submitted a job to the cluster...
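
While the nodes boot, a job can move between the CF (configuring) and PD (pending) states, as in the squeue output earlier in this thread. A rough sketch of a polling loop that shows this as it happens follows. The names watch_job and POLL_SECONDS are made up for illustration, and the 15 second default is arbitrary. It relies on squeue's -h (no header), -j (job id) and -o (output format) options, with %T for the long state name and %r for the reason.

```shell
# watch_job: hypothetical helper, not part of Slurm.
# Polls squeue until the job is neither PENDING nor CONFIGURING.
watch_job() {
  job_id=$1
  while :; do
    # Prints e.g. "PENDING BeginTime"; empty if squeue no longer knows the job.
    status=$(squeue -h -j "$job_id" -o "%T %r" 2>/dev/null)
    echo "job $job_id: ${status:-no longer in queue}"
    case $status in
      PENDING*|CONFIGURING*) sleep "${POLL_SECONDS:-15}" ;;
      *) break ;;
    esac
  done
}

# Usage on a login node, for job 1:
#   watch_job 1
```

The loop stops at the first state other than PENDING or CONFIGURING, so it also exits if the job fails or is cancelled.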

The Cloud Architecture Center tutorial should also *really* make clear that /home and /apps are NFS mounts from the controller.  I was deeply confused about this until I did a bunch of searching in this group: there's that "NFS Storage" box in the architecture diagram that isn't connected to anything, and the text just says "The NFS server provides a common shared space for files" without telling you, like, *where this space is*.
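
One way to check where the shared space lives is to look at the mount table on a login or compute node. Below is a minimal sketch, assuming a Linux node with a readable /proc/mounts. The name show_nfs_mounts is made up for illustration, and what it prints will depend on the deployment.

```shell
# show_nfs_mounts: hypothetical helper, not part of Slurm or slurm-gcp.
# Reads a mount table in /proc/mounts format (device, mount point,
# filesystem type, ...) and lists every NFS mount with its source.
show_nfs_mounts() {
  awk '$3 ~ /^nfs/ { printf "%s is served from %s (%s)\n", $2, $1, $3 }' "${1:-/proc/mounts}"
}

# Usage on a cluster node:
#   show_nfs_mounts
```

If /home and /apps are NFS mounts from the controller, they should show up in this list with the controller's host name as the source.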

In short, this tutorial (https://cloud.google.com/architecture/deploying-slurm-cluster-compute-engine) should be Considered Harmful, and anyone left scratching their head after reading it and trying it out should probably look at the codelab above.

Thanks.