Hi all,
We have various NFS servers that hold the data our researchers want to process. These are mounted on our Slurm clusters at well-known paths, and the nodes also have fast local scratch disk at another well-known path. We do not have any distributed file systems in use. (Our Slurm clusters are basically just collections of heterogeneous nodes, not a traditional HPC setup by any means.)
In most cases the researchers can process the data directly off the NFS mounts without any issues, but in some cases this slows the computation down unacceptably. They could copy the data to the local drive by hand, using an interactive allocation and srun commands, but I am wondering whether there is a way to do this with sbatch.
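For context, the by-hand version would be something along these lines (untested sketch; /nfs/data and /scratch are made-up stand-ins for our real well-known paths):

salloc --partition=gpu --gres=gpu:1 --cpus-per-task=8 --mem=64G
# inside the resulting shell, the steps all run on the allocated node
srun cp -r /nfs/data/project_x /scratch/$USER/
srun ./process_data --input /scratch/$USER/project_x
srun rm -rf /scratch/$USER/project_x
exit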
With sbatch, I tried chaining the steps with --dependency:
wdennis@submit01 ~> sbatch transfer.sbatch
Submitted batch job 329572
wdennis@submit01 ~> sbatch --dependency=afterok:329572 test_job.sbatch
Submitted batch job 329573
wdennis@submit01 ~> sbatch --dependency=afterok:329573 rm_data.sbatch
Submitted batch job 329574
wdennis@submit01 ~>
wdennis@submit01 ~> squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
329573 gpu wdennis_ wdennis PD 0:00 1 (Dependency)
329574 gpu wdennis_ wdennis PD 0:00 1 (Dependency)
329572 gpu wdennis_ wdennis R 0:23 1 compute-gpu02
But the jobs submitted with --dependency do not seem to be allocated the same node as the first job:
JobID|JobName|User|Partition|NodeList|AllocCPUS|ReqMem|CPUTime|QOS|State|ExitCode|AllocTRES|
329572|wdennis_data_transfer|wdennis|gpu|compute-gpu02|1|2Gc|00:02:01|normal|COMPLETED|0:0|cpu=1,mem=2G,node=1|
329573|wdennis_compute_job|wdennis|gpu|compute-gpu05|1|128Gn|00:03:00|normal|COMPLETED|0:0|cpu=1,mem=128G,node=1,gres/gpu=1|
329574|wdennis_data_removal|wdennis|gpu|compute-gpu02|1|2Gc|00:00:01|normal|COMPLETED|0:0|cpu=1,mem=2G,node=1|
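(For reference, transfer.sbatch is nothing special; it is roughly the following, with made-up paths standing in for our real NFS and scratch mounts, and rm_data.sbatch is the matching rm -rf:)

#!/bin/bash
#SBATCH --job-name=wdennis_data_transfer
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G

# copy the input set from the NFS mount to the node's local scratch path
cp -r /nfs/data/project_x /scratch/$USER/project_x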
What is the best way to do something like “stage the data on a local path / run computation using the local copy / remove the locally staged data when complete”?
Thanks!
Will
What I mean by “scratch” space is indeed local persistent storage in our case; sorry if “scratch space” already has a well-known meaning in Slurm that I’m missing, or suggests something like /tmp… That’s why my desired workflow is “copy data locally / use the data from the copy / remove the local copy” in separate steps.
From:
slurm-users <slurm-use...@lists.schedmd.com> on behalf of Fulcomer, Samuel <samuel_...@brown.edu>
Date: Saturday, April 3, 2021 at 4:00 PM
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Staging data on the nodes one will be processing on via sbatch
[…]
[…]
Sorry, obvs wasn’t ready to send that last message yet…
Our issue is that the shared storage is via NFS, and the “fast storage in limited supply” is only the local disk on each node. Hence the need to copy data over from NFS (and then remove it when finished with it).
I also wanted the copy and the removal to be separate jobs, because the main processing job usually requires GPU gres, which is a time-limited resource on the partition. I don’t want to tie up the GPU allocation while the data is being staged (or removed), and if the data copy fails, I don’t want to even proceed to the compute job (so, roughly, copy_data_locally && process_data).
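One workaround I have been looking at, assuming the transfer job has already started so its node is visible, is to read that node out of squeue and pin the later submissions to it with -w/--nodelist (untested sketch, using the same script names as above):

xfer_id=$(sbatch --parsable transfer.sbatch)
# only works once the transfer job is running; %N is empty while it is still pending
node=$(squeue -j "$xfer_id" -h -o %N)
comp_id=$(sbatch --parsable --dependency=afterok:$xfer_id -w "$node" test_job.sbatch)
sbatch --dependency=afterok:$comp_id -w "$node" rm_data.sbatch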
I think this is exactly the type of use case that heterogeneous job support is for, and it has been available since Slurm 17.11:
Slurm version 17.11 and later supports the ability to submit and manage heterogeneous jobs, in which each component has virtually all job options available including partition, account and QOS (Quality Of Service). For example, part of a job might require four cores and 4 GB for each of 128 tasks while another part of the job would require 16 GB of memory and one CPU.
Using this, you should be able to use a single core for the transfer from NFS, use all the cores/GPUs you need for the computation, and then use a single core to transfer back to NFS:
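A rough, untested sketch of what that batch script could look like (the resource sizes and the /nfs and /scratch paths are placeholders; older releases use the keywords “packjob” and --pack-group instead of “hetjob” and --het-group):

#!/bin/bash
# component 0: one core and a little memory for moving data
#SBATCH --ntasks=1 --cpus-per-task=1 --mem=2G
#SBATCH hetjob
# component 1: the CPUs/GPUs needed for the actual computation
#SBATCH --ntasks=1 --cpus-per-task=8 --gres=gpu:1 --mem=64G

# stage the input from NFS to local scratch (runs on component 0's node)
srun --het-group=0 cp -r /nfs/data/project_x /scratch/project_x

# run the computation against the local copy on the GPU component
srun --het-group=1 ./process_data --input /scratch/project_x

# copy results back to NFS, then clean up the local copy
srun --het-group=0 bash -c 'cp -r /scratch/project_x/results /nfs/data/ && rm -rf /scratch/project_x'

Note that the scheduler may place the two components on different nodes unless you constrain them, so the local scratch paths would have to line up accordingly.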
Disclaimer: I've never used this feature myself.
Prentice