2 types of nodes


JA

Jan 2, 2019, 10:09:09 AM
to google-cloud-slurm-discuss
Hi, Happy New Year everyone!
Is it possible to set up a Google Cloud Slurm cluster with two types of nodes in two partitions (one CPU-only, one with GPU nodes), using two different custom VM images, that scale up and down independently? The use case: my pipeline has a short task that runs on GPUs, followed by quite a lot of time spent on CPUs.
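For reference, in plain Slurm this layout is simply two partitions in slurm.conf, along these lines (node and partition names here are hypothetical; the open question is whether slurm-gcp can back each partition with its own VM image and scale them independently):

```
# CPU-only partition
NodeName=cpu-node[000-099] CPUs=16 State=CLOUD
PartitionName=cpu Nodes=cpu-node[000-099] Default=YES MaxTime=INFINITE State=UP

# GPU partition, built from a different image (gres.conf would also be needed)
NodeName=gpu-node[000-009] CPUs=8 Gres=gpu:1 State=CLOUD
PartitionName=gpu Nodes=gpu-node[000-009] MaxTime=INFINITE State=UP
```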

Thank you in advance



Keith Binder

Jan 4, 2019, 11:40:08 AM
to JA, google-cloud-slurm-discuss

In the current implementation, you can only specify one compute instance type per cluster. If you need multiple instance types, you can always create multiple clusters within your project.





--
You received this message because you are subscribed to the Google Groups "google-cloud-slurm-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-cloud-slurm-...@googlegroups.com.
To post to this group, send email to google-cloud-...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-cloud-slurm-discuss/a08c5a65-8ce1-41cf-8366-df40cfc59e4a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Joseph Schoonover

Apr 3, 2019, 3:09:20 PM
to google-cloud-slurm-discuss
Hey JA, I'm currently working with SchedMD on a patch for multiple partitions over the next couple of weeks. 
Stay tuned for a feature branch to monitor...

Joseph Schoonover

Apr 25, 2019, 12:09:59 AM
to google-cloud-slurm-discuss
Hey JA and Keith, I have a fork of SchedMD's repo on our GitHub with a multi-partition feature, which allows multiple instance types in one deployment: https://github.com/FluidNumerics/slurm-gcp/tree/multi-partition

Let me know if you need some help getting started

Robert Moulton

Jun 21, 2019, 1:36:47 PM
to google-cloud-slurm-discuss
Hi Joseph - we're hoping to use your multi-partition version and could use some guidance. Our test deployment seems to be working in general, but at least one thing has us puzzled: jobs often fail with the following error:

slurmstepd: error: *** JOB 104 ON g2-compute03003 CANCELLED AT 2019-06-21T17:23:01 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS ***

Any idea what might be wrong? It seems to happen invariably if a job includes a 'sleep' element, for example.

-r

Joseph Schoonover

Jun 22, 2019, 11:07:02 AM
to google-cloud-slurm-discuss
Robert, what does

cat /apps/slurm/log/slurmctld.log

show?

Robert Moulton

Jun 24, 2019, 12:56:03 PM
to Joseph Schoonover, google-cloud-slurm-discuss
Joseph - Here's an example: a simple job submission and its corresponding log output on a newly-deployed cluster. Note that with 'sleep 10' commented out, the job completes successfully.

-r

$ cat sleep10.sh
#!/bin/bash

date
sleep 10
date

$ sbatch sleep10.sh

$ sudo cat /apps/slurm/log/slurmctld.log
[...]
[2019-06-24T16:21:45.292] _slurm_rpc_submit_batch_job: JobId=4 InitPrio=4294901757 usec=1828
[2019-06-24T16:21:45.376] sched: Allocate JobId=4 NodeList=tm1-compute00000 #CPUs=2 Partition=tm-nc4-mem16
[2019-06-24T16:22:49.123] Node tm1-compute00000 now responding
[2019-06-24T16:23:01.407] update_node: node tm1-compute00000 reason set to: Instance stopped/deleted
[2019-06-24T16:23:01.407] requeue job JobId=4 due to failure of node tm1-compute00000
[2019-06-24T16:23:01.407] Requeuing JobId=4
[2019-06-24T16:23:01.407] update_node: node tm1-compute00000 state set to DOWN
[2019-06-24T16:23:01.423] node_did_resp: node tm1-compute00000 returned to service
[2019-06-24T16:24:01.814] update_node: node tm1-compute00000 reason set to: Instance stopped/deleted
[2019-06-24T16:24:01.814] update_node: node tm1-compute00000 state set to DOWN
[2019-06-24T16:25:02.726] sched: Allocate JobId=4 NodeList=tm1-compute00001 #CPUs=2 Partition=tm-nc4-mem16
[2019-06-24T16:25:38.951] Node tm1-compute00001 now responding
[2019-06-24T16:26:01.608] update_node: node tm1-compute00001 reason set to: Instance stopped/deleted
[2019-06-24T16:26:01.608] requeue job JobId=4 due to failure of node tm1-compute00001
[2019-06-24T16:26:01.608] Requeuing JobId=4
[2019-06-24T16:26:01.608] update_node: node tm1-compute00001 state set to DOWN
[2019-06-24T16:26:01.624] node_did_resp: node tm1-compute00001 returned to service
[2019-06-24T16:26:46.922] node_did_resp: node tm1-compute00000 returned to service
[2019-06-24T16:27:02.010] update_node: node tm1-compute00000 reason set to: Instance stopped/deleted
[2019-06-24T16:27:02.010] update_node: node tm1-compute00000 state set to DOWN
[2019-06-24T16:27:02.010] update_node: node tm1-compute00001 reason set to: Instance stopped/deleted
[2019-06-24T16:27:02.010] update_node: node tm1-compute00001 state set to DOWN
[2019-06-24T16:28:02.205] sched: Allocate JobId=4 NodeList=tm1-compute00002 #CPUs=2 Partition=tm-nc4-mem16
[2019-06-24T16:28:41.013] Node tm1-compute00002 now responding
[2019-06-24T16:29:01.832] update_node: node tm1-compute00002 reason set to: Instance stopped/deleted
[2019-06-24T16:29:01.832] requeue job JobId=4 due to failure of node tm1-compute00002
[2019-06-24T16:29:01.832] Requeuing JobId=4
[2019-06-24T16:29:01.832] update_node: node tm1-compute00002 state set to DOWN
[2019-06-24T16:29:01.847] node_did_resp: node tm1-compute00002 returned to service
[2019-06-24T16:30:02.231] update_node: node tm1-compute00002 reason set to: Instance stopped/deleted
[2019-06-24T16:30:02.231] update_node: node tm1-compute00002 state set to DOWN
[2019-06-24T16:31:02.499] sched: Allocate JobId=4 NodeList=tm1-compute00003 #CPUs=2 Partition=tm-nc4-mem16
[2019-06-24T16:31:38.717] Node tm1-compute00003 now responding
[2019-06-24T16:32:01.995] update_node: node tm1-compute00003 reason set to: Instance stopped/deleted
[2019-06-24T16:32:01.996] requeue job JobId=4 due to failure of node tm1-compute00003
[2019-06-24T16:32:01.996] Requeuing JobId=4
[2019-06-24T16:32:01.996] update_node: node tm1-compute00003 state set to DOWN
[2019-06-24T16:32:02.012] node_did_resp: node tm1-compute00003 returned to service
[2019-06-24T16:33:01.378] update_node: node tm1-compute00003 reason set to: Instance stopped/deleted
[2019-06-24T16:33:01.379] update_node: node tm1-compute00003 state set to DOWN
[2019-06-24T16:34:19.572] backfill: Started JobId=4 in tm-nc4-mem16 on tm1-compute00004
[2019-06-24T16:34:58.373] Node tm1-compute00004 now responding
[2019-06-24T16:35:02.171] update_node: node tm1-compute00004 reason set to: Instance stopped/deleted
[2019-06-24T16:35:02.171] requeue job JobId=4 due to failure of node tm1-compute00004
[2019-06-24T16:35:02.171] Requeuing JobId=4
[2019-06-24T16:35:02.171] update_node: node tm1-compute00004 state set to DOWN
[2019-06-24T16:35:02.187] node_did_resp: node tm1-compute00004 returned to service
[2019-06-24T16:36:01.559] update_node: node tm1-compute00004 reason set to: Instance stopped/deleted
[2019-06-24T16:36:01.559] update_node: node tm1-compute00004 state set to DOWN
[2019-06-24T16:37:19.574] backfill: Started JobId=4 in tm-nc4-mem16 on tm1-compute00005
[2019-06-24T16:37:58.411] Node tm1-compute00005 now responding
[2019-06-24T16:38:01.354] update_node: node tm1-compute00005 reason set to: Instance stopped/deleted
[2019-06-24T16:38:01.354] requeue job JobId=4 due to failure of node tm1-compute00005
[2019-06-24T16:38:01.354] Requeuing JobId=4
[2019-06-24T16:38:01.354] update_node: node tm1-compute00005 state set to DOWN
[2019-06-24T16:38:01.370] node_did_resp: node tm1-compute00005 returned to service
[2019-06-24T16:39:01.742] update_node: node tm1-compute00005 reason set to: Instance stopped/deleted
[2019-06-24T16:39:01.742] update_node: node tm1-compute00005 state set to DOWN
[2019-06-24T16:39:08.205] _slurm_rpc_kill_job: REQUEST_KILL_JOB JobId=4 uid 326316723

$ sinfo -l
Mon Jun 24 16:43:47 2019
PARTITION     AVAIL TIMELIMIT JOB_SIZE   ROOT OVERSUBS GROUPS NODES STATE NODELIST
tm-nc4-mem16*    up  infinite 1-infinite   no       NO    all     4 down% tm1-compute[00002-00005]
tm-nc4-mem16*    up  infinite 1-infinite   no       NO    all    16 idle~ tm1-compute[00000-00001,00006-00019]
tm-nc16-mem32    up  infinite 1-infinite   no       NO    all    20 idle~ tm1-compute[01000-01019]


Joseph Schoonover

Jun 24, 2019, 12:59:29 PM
to google-cloud-slurm-discuss
It looks like the nodes are being deleted. It could be the suspend.py script is not working properly.
Try logging onto the controller and removing the suspend.py execution from the crontab. I'll spin something up to see if I can reproduce what you're seeing too. 

Joseph Schoonover

Jun 24, 2019, 1:03:13 PM
to google-cloud-slurm-discuss
Correction -- slurm_gcp_sync.py, not suspend.py
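For anyone following along: a sync script of this kind typically diffs the nodes Slurm believes are up against the instances the GCE API reports as running, and marks the missing ones down. The sketch below illustrates that idea only; it is not the actual slurm_gcp_sync.py code, and all names in it are illustrative.

```python
def nodes_to_mark_down(slurm_nodes, running_instances):
    """Return Slurm node names with no matching running GCE instance.

    slurm_nodes: node names Slurm currently considers up or allocating.
    running_instances: instance names returned by the GCE instances API.
    If that API query misses a zone or a partition, healthy nodes are
    wrongly reported as "Instance stopped/deleted" and set DOWN.
    """
    running = set(running_instances)
    return sorted(name for name in slurm_nodes if name not in running)

# Example: if only one partition's zone is queried, nodes in the other
# partition look "missing" even though their instances are running.
slurm = ["tm1-compute00000", "tm1-compute01000"]
gce = ["tm1-compute00000"]  # second partition's instances not listed
print(nodes_to_mark_down(slurm, gce))  # prints ['tm1-compute01000']
```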

Robert Moulton

Jun 24, 2019, 1:24:03 PM
to Joseph Schoonover, google-cloud-slurm-discuss
Indeed, with the slurm_gcp_sync.py cron job disabled, such test jobs are completing successfully. For example, output from a 'sleep 60' job:

$ cat slurm-8.out
Mon Jun 24 17:13:35 UTC 2019
Mon Jun 24 17:14:35 UTC 2019


Joseph Schoonover

Jun 24, 2019, 1:27:26 PM
to Robert Moulton, google-cloud-slurm-discuss
Got it. Looks like there are a couple of things to sort out for the slurm-gcp-sync.py script. Thanks for testing things out. Definitely let me know if there are other issues ( https://github.com/FluidNumerics/slurm-gcp/issues ).



Dr. Joseph Schoonover

Chief Executive Officer

HPC Specialist

j...@fluidnumerics.com






Robert Moulton

Jun 24, 2019, 1:48:27 PM
to Joseph Schoonover, google-cloud-slurm-discuss
Will do, thanks.

Unrelated question: What is the proper way to add new partitions to an existing cluster deployment? (I managed to do it, but I'm not confident I did it in an appropriate way.)

Joseph Schoonover

Jun 24, 2019, 2:07:52 PM
to google-cloud-slurm-discuss
In general, what you will want to do is:
* Cancel all jobs
* Update /apps/slurm/current/etc/slurm.conf to add the additional nodes
* Update the partitions JSON in each of the /apps/slurm/scripts/*.py scripts
* Restart slurmctld
* Recreate the image nodes

We're currently working on a service to handle this for a Marketplace deployment on GCP.
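For the slurm.conf step above, the additions would look roughly like this (node range, CPU count, and partition name here are hypothetical; use values matching your deployment):

```
NodeName=g1-compute2[0000-0019] CPUs=16 State=CLOUD
PartitionName=new-part Nodes=g1-compute2[0000-0019] MaxTime=INFINITE State=UP
```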

Christoph

Jul 3, 2019, 8:00:09 PM
to google-cloud-slurm-discuss
I have just tried to use the preemptible partitions, to see if I can reproduce some of the problems.

Before starting jobs, there are two queues, each with 1000 nodes, as expected:

$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
us-west1-b-16*    up   infinite   1000  idle~ g1-compute[00000-00999]
us-west1-c-16     up   infinite   1000  idle~ g1-compute[10000-10999]


After starting two 5-node jobs, one in each partition, at first all looks fine:

$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
us-west1-b-16*    up   infinite      5 alloc# g1-compute[00000-00004]
us-west1-b-16*    up   infinite    995  idle~ g1-compute[00005-00999]
us-west1-c-16     up   infinite      5 alloc# g1-compute[10000-10004]
us-west1-c-16     up   infinite    995  idle~ g1-compute[10005-10999]

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 2 us-west1-    t-1.1 cgorgull CF       2:15      5 g1-compute[00000-00004]
                 3 us-west1- t-1000.1 cgorgull CF       1:25      5 g1-compute[10000-10004]


But then things soon go wrong:

$ sinfo
PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
us-west1-b-16*    up   infinite      2  idle# g1-compute[00002-00003]
us-west1-b-16*    up   infinite      3  down# g1-compute[00000-00001,00004]
us-west1-b-16*    up   infinite    995  idle~ g1-compute[00005-00999]
us-west1-c-16     up   infinite      5 alloc# g1-compute[10000-10004]
us-west1-c-16     up   infinite    995  idle~ g1-compute[10005-10999]

$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3 us-west1- t-1000.1 cgorgull CF       5:03      5 g1-compute[10000-10004]
                 2 us-west1-    t-1.1 cgorgull PD       0:00      5 (BeginTime)


Best,
Christoph

Joseph Schoonover

Jul 4, 2019, 12:12:29 PM
to google-cloud-slurm-discuss
Christoph,

I have pulled in updates from SchedMD's repo, updated the default Slurm version to 19.05-latest, and provided a patch for slurm-gcp-sync.py that I think gets us in the right direction.

Can you test out the latest version at https://github.com/FluidNumerics/slurm-gcp ?

I've tested with preemptibles and non-preemptibles on my end and nodes are coming online and being spun down as needed.

Christoph

Jul 4, 2019, 12:49:44 PM
to google-cloud-slurm-discuss
Hi Joseph,

That sounds like great progress. 

I have copied over the old config I used with the previous version of FluidNumerics/Slurm-GCP.

Currently, the first problem is with mounting the NFS apps server. Here is the end of /var/log/messages:

Jul  4 16:23:10 g1-login1 rpc.statd[5275]: Initializing NSM state
Jul  4 16:23:10 g1-login1 systemd: Started NFS status monitor for NFSv2/3 locking..
Jul  4 16:25:16 g1-login1 startup-script: INFO startup-script: mount.nfs: Connection refused
Jul  4 16:25:19 g1-login1 startup-script: INFO startup-script: mount.nfs: access denied by server while mounting 10.182.38.170:@NFS_APPS_DIR@


The problem is the same for

  nfs_apps_server            : 10.182.38.170
  #nfs_apps_dir              : /apps

and for

  nfs_apps_server            : 10.182.38.170
  nfs_apps_dir               : /apps

(i.e. with nfs_apps_dir commented out in the first case).

I can manually mount the Filestore share on the login node, and it is in the same network, so it should be working.

Many thanks,
Christoph

Joseph Schoonover

Jul 4, 2019, 1:42:51 PM
to google-cloud-slurm-discuss
Another patch has been pushed up that passes nfs_apps_dir and nfs_home_dir to the startup script at deployment. You should be able to mount your Filestore instance via the deployment scripts.
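For context, the earlier failure (`mount.nfs: access denied by server while mounting 10.182.38.170:@NFS_APPS_DIR@`) shows the startup script's `@NFS_APPS_DIR@` placeholder reaching the node unsubstituted. Presumably the fix fills such placeholders from the deployment config before the script runs; a toy sketch of that kind of substitution (the mechanism and function name are assumptions drawn only from the log, not the actual slurm-gcp code):

```python
def render_startup_script(template: str, values: dict) -> str:
    """Replace @KEY@ placeholders in a startup-script template."""
    for key, val in values.items():
        template = template.replace("@" + key + "@", val)
    return template

# Before the fix, a value like NFS_APPS_DIR never got injected, so the
# literal @NFS_APPS_DIR@ string was handed to mount.nfs on the node.
template = "mount -t nfs 10.182.38.170:@NFS_APPS_DIR@ /apps"
rendered = render_startup_script(template, {"NFS_APPS_DIR": "/apps"})
print(rendered)  # prints: mount -t nfs 10.182.38.170:/apps /apps
```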