[slurm-users] Staging data on the nodes one will be processing on via sbatch


Will Dennis

Apr 3, 2021, 3:42:44 PM4/3/21
to slurm...@lists.schedmd.com

Hi all,

 

We have various NFS servers that contain the data that our researchers want to process. These are mounted on our Slurm clusters at well-known paths. The nodes also have fast local scratch disk at another well-known path. We do not have any distributed file systems in use (our Slurm clusters are basically just collections of heterogeneous nodes, not a traditional HPC setup by any means.)

 

In most cases, the researchers can process the data directly off the NFS mounts without any issues, but in some cases this slows down the computation unacceptably. They could manually copy the data to the local drive using an allocation and srun commands, but I am wondering if there is a way to do this via sbatch?

 

I tried this method:

 

wdennis@submit01 ~> sbatch transfer.sbatch

Submitted batch job 329572

wdennis@submit01 ~> sbatch --dependency=afterok:329572 test_job.sbatch

Submitted batch job 329573

wdennis@submit01 ~>  sbatch --dependency=afterok:329573 rm_data.sbatch

Submitted batch job 329574

wdennis@submit01 ~>

wdennis@submit01 ~> squeue

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

            329573       gpu wdennis_  wdennis PD       0:00      1 (Dependency)

            329574       gpu wdennis_  wdennis PD       0:00      1 (Dependency)

            329572       gpu wdennis_  wdennis  R       0:23      1 compute-gpu02
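
For context, the three batch scripts are roughly of the following shape; their actual contents aren't shown in this thread, so the paths and commands below are illustrative only (the job names and resources come from the sacct output further down):

--- transfer.sbatch (stage the dataset from NFS to node-local scratch) ---
#!/bin/bash
#SBATCH --job-name=wdennis_data_transfer
#SBATCH --partition=gpu
#SBATCH --mem-per-cpu=2G
mkdir -p /local_scratch/wdennis
cp -a /nfs/projects/dataset /local_scratch/wdennis/

--- test_job.sbatch (process the locally staged copy) ---
#!/bin/bash
#SBATCH --job-name=wdennis_compute_job
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=128G
srun python process.py /local_scratch/wdennis/dataset

--- rm_data.sbatch (remove the locally staged copy) ---
#!/bin/bash
#SBATCH --job-name=wdennis_data_removal
#SBATCH --partition=gpu
#SBATCH --mem-per-cpu=2G
rm -rf /local_scratch/wdennis/dataset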

 

But it seems the --dependency jobs do not get allocated the same node as the first job:

 

JobID|JobName|User|Partition|NodeList|AllocCPUS|ReqMem|CPUTime|QOS|State|ExitCode|AllocTRES|

329572|wdennis_data_transfer|wdennis|gpu|compute-gpu02|1|2Gc|00:02:01|normal|COMPLETED|0:0|cpu=1,mem=2G,node=1|

329573|wdennis_compute_job|wdennis|gpu|compute-gpu05|1|128Gn|00:03:00|normal|COMPLETED|0:0|cpu=1,mem=128G,node=1,gres/gpu=1|

329574|wdennis_data_removal|wdennis|gpu|compute-gpu02|1|2Gc|00:00:01|normal|COMPLETED|0:0|cpu=1,mem=2G,node=1|

 

What is the best way to do something like “stage the data on a local path / run computation using the local copy / remove the locally staged data when complete”?

 

Thanks!

Will

Fulcomer, Samuel

Apr 3, 2021, 4:00:00 PM4/3/21
to Slurm User Community List

Unfortunately this is not a good workflow.

You would submit a staging job with a dependency for the compute job; however, in the meantime, the scheduler might launch higher-priority jobs that would want the scratch space, and cause it to be scrubbed.

In a rational process, the scratch space would be scrubbed for the higher-priority jobs. I'm now thinking of a way that the scheduler could consider data turds left by previous jobs, but that's not currently a scheduling feature in SLURM multi-factor or any other scheduler I know.

The best current workflow is to stage data into fast local persistent storage and then schedule jobs, or to schedule a single job that does it synchronously (TimeLimit=Stage+Compute). The latter is pretty unsocial and wastes cycles.
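
For illustration, a minimal sketch of that synchronous variant (untested, with placeholder paths and program names): one job that stages, computes, and cleans up on whatever node it lands on.

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --mem=128G
#SBATCH --time=04:00:00                  # time limit has to budget for stage + compute + cleanup

STAGE=/local_scratch/$USER/$SLURM_JOB_ID
mkdir -p "$STAGE"
trap 'rm -rf "$STAGE"' EXIT              # scrub the staged copy even if the compute step fails

cp -a /nfs/data/dataset "$STAGE"/        # stage in from NFS (the GPU sits idle during this)
srun python process.py "$STAGE"/dataset  # compute against the local copy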

Will Dennis

Apr 3, 2021, 4:11:16 PM4/3/21
to Slurm User Community List

What I mean by “scratch” space is indeed local persistent storage in our case; sorry if “scratch space” is a generally-known Slurm concept that I’m misusing, or refers to something like /tmp… That’s why my desired workflow is “copy data locally / use data from the copy / remove the local copy”, in separate steps.

 

 

From: slurm-users <slurm-use...@lists.schedmd.com> on behalf of Fulcomer, Samuel <samuel_...@brown.edu>
Date: Saturday, April 3, 2021 at 4:00 PM
To: Slurm User Community List <slurm...@lists.schedmd.com>
Subject: Re: [slurm-users] Staging data on the nodes one will be processing on via sbatch

[…]

[…]

Fulcomer, Samuel

Apr 3, 2021, 4:26:48 PM4/3/21
to Slurm User Community List
Hi,

"scratch space" is generally considered ephemeral storage that only exists for the duration of the job (It's eligible for deletion in an epilog or next-job prolog) .

If you've got other fast storage in limited supply that can be used for staging data, then by all means use it, but consider whether you want batch CPU cores tied up for the wall time of transferring the data. This could easily be done on a time-shared frontend login node, from which the users could then submit (via script) jobs after the data was staged. Most of the transfer wallclock is spent in network wait, so don't waste dedicated cores on it.
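
A sketch of that pattern, assuming the staging area is reachable from the login node; script names and paths here are placeholders:

#!/bin/bash
# run on the login node, not under sbatch
set -e
STAGE=/fast_storage/$USER/dataset
mkdir -p "$STAGE"
rsync -a /nfs/data/dataset/ "$STAGE"/                        # network wait happens here, on the login node

jid=$(sbatch --parsable compute.sbatch "$STAGE")             # compute job reads the staged copy (path passed as $1)
sbatch --dependency=afterany:$jid --wrap="rm -rf '$STAGE'"   # clean up once the compute job finishes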

Will Dennis

Apr 3, 2021, 4:33:33 PM4/3/21
to Slurm User Community List

Will Dennis

Apr 3, 2021, 4:49:08 PM4/3/21
to Slurm User Community List

Sorry, obvs wasn’t ready to send that last message yet…

 

Our issue is that the shared storage is via NFS, and the “fast storage in limited supply” is only local to each node. Hence the need to copy the data over from NFS (and then remove it when finished with it).

I also wanted the copy & remove to be different jobs, because the main processing job usually requires GPU gres, which is a time-limited resource on the partition. I don’t want to tie up the allocation of GPUs while the data is staged (and removed), and if the data copy fails, I don’t want to even progress to the compute job (so, in effect, copy_data_locally && process_data).
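
For illustration, that ordering could be expressed with the scripts from the first message like this; using afterany on the removal step, so the staged data gets scrubbed even if the compute job fails, is an assumption on my part:

xfer=$(sbatch --parsable transfer.sbatch)                               # no GPU gres requested for the copy
comp=$(sbatch --parsable --dependency=afterok:$xfer test_job.sbatch)    # runs only if the copy succeeded
sbatch --dependency=afterany:$comp rm_data.sbatch                       # remove the staged data either way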

Fulcomer, Samuel

Apr 3, 2021, 5:32:00 PM4/3/21
to Slurm User Community List
inline below...

On Sat, Apr 3, 2021 at 4:50 PM Will Dennis <wde...@nec-labs.com> wrote:

Sorry, obvs wasn’t ready to send that last message yet…

 

Our issue is that the shared storage is via NFS, and the “fast storage in limited supply” is only local to each node. Hence the need to copy the data over from NFS (and then remove it when finished with it).

I also wanted the copy & remove to be different jobs, because the main processing job usually requires GPU gres, which is a time-limited resource on the partition. I don’t want to tie up the allocation of GPUs while the data is staged (and removed), and if the data copy fails, I don’t want to even progress to the compute job (so, in effect, copy_data_locally && process_data).


...yup... this is the problem. We've invested in GPFS and an NVMe Excelero pool (for initial placement); however, we still have the problem of having users pull down data from community repositories before running useful computation.

Your question has gotten me thinking about this more. In our case all of our nodes are diskless (though we do have fast GPFS), so this wouldn't really work for us. But if your fast storage is only local to your nodes, the subsequent compute jobs will need to request those specific nodes, so you'll need a mechanism to increase the SLURM scheduling "weight" of those nodes after staging, so that the scheduler won't select them over nodes with a lower weight. That could be done in a job epilog.
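
A rough sketch of that idea (untested; the weight value, node name, and privilege assumptions are mine, not from the thread):

# epilog fragment: after a staging job completes, raise the node's weight so the
# scheduler prefers other (lower-weight) nodes for new, unrelated work; assumes
# the epilog runs with privileges that allow "scontrol update"
scontrol update NodeName="$SLURMD_NODENAME" Weight=100
# (and presumably set it back to the default after the cleanup job has run)

# the dependent compute job then requests that specific node explicitly, e.g.:
sbatch --dependency=afterok:329572 -w compute-gpu02 --gres=gpu:1 test_job.sbatch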

Prentice Bisbal

Apr 5, 2021, 3:22:01 PM4/5/21
to slurm...@lists.schedmd.com

I think this is exactly the type of use case that heterogeneous job support is for; it has been supported since Slurm 17.11:

Slurm version 17.11 and later supports the ability to submit and manage heterogeneous jobs, in which each component has virtually all job options available including partition, account and QOS (Quality Of Service). For example, part of a job might require four cores and 4 GB for each of 128 tasks while another part of the job would require 16 GB of memory and one CPU.

https://slurm.schedmd.com/heterogeneous_jobs.html

Using this, you should be able to use a single core for the transfer from NFS, use all the cores/GPUs you need for the computation, and then use a single core to transfer back to NFS:
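
A rough sketch of what that could look like (untested; older releases use packjob / --pack-group instead of hetjob / --het-group, the paths are placeholders, and it assumes both components can reach the same staging path, e.g. by landing on the same node or using shared fast storage):

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --ntasks=1 --cpus-per-task=1 --mem=2G                  # component 0: single core for transfers
#SBATCH hetjob
#SBATCH --ntasks=1 --cpus-per-task=8 --mem=128G --gres=gpu:1   # component 1: the compute resources

STAGE=/local_scratch/$USER/$SLURM_JOB_ID
srun --het-group=0 mkdir -p "$STAGE"
srun --het-group=0 cp -a /nfs/data/dataset "$STAGE"/            # stage in on the single core
srun --het-group=1 python process.py "$STAGE"/dataset           # compute on the GPU component
srun --het-group=0 cp -a "$STAGE"/output /nfs/data/output/      # transfer results back on the single core
srun --het-group=0 rm -rf "$STAGE"                              # clean up local scratch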

Disclaimer: I've never used this feature myself.

Prentice