Hi all,
I would like to minimize I/O overhead by running on the local scratch directory while using multiple GPU nodes with the LSF scheduler.
I saw an example for the SLURM scheduler and I was wondering if anybody has something similar for the LSF scheduler.
Thanks a lot,
Best,
Paola
Hi Paola,
As far as I know, we only have example SLURM scripts for submitting WESTPA jobs (in the westpa/user_submitted_scripts GitHub repository), and I have not personally used LSF before. I will try to locate something that will be of help to you.
Does anyone else in the WESTPA community have experience using the LSF scheduler? Or does anybody know of any example scripts to be used with WESTPA?
Anthony
--
You received this message because you are subscribed to the Google Groups "westpa-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to westpa-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/westpa-users/AB0EF0A4-4672-41B8-AA4F-4BFF7996850B%40contoso.com.
Hi Paola,
I don't have access to LSF, but I can show you the logic of how to set this up, and you should be able to get it working. This resource may also help:
https://hpc.llnl.gov/banks-jobs/running-jobs/batch-system-commands
First, please see my comment on June 5,
https://groups.google.com/forum/#!msg/westpa-users/18mts9s_rxI/AIQpX9wEBAAJ
The SLURM script is runwe_bridges.sh. You will need to replace the SLURM directives (lines beginning with #SBATCH at the top) with the appropriate LSF directives. You will also need to replace the following with the appropriate LSF environment variables:
SLURM_SUBMIT_DIR -- This variable should point to the directory where the LSF job submission is located
SLURM_JOBID -- This should be the job ID assigned by LSF
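As a rough sketch of what those replacements might look like (I can't test this without LSF access, and the directive and GPU-request syntax varies between LSF versions and sites, so treat every line below as an assumption to check against your cluster's documentation):

```shell
#!/bin/bash
# Hypothetical LSF header replacing the #SBATCH lines at the top of
# runwe_bridges.sh -- queue names, walltime format, and GPU request
# syntax all vary by site and LSF version:
#BSUB -J westpa_gpu        # job name
#BSUB -n 2                 # slots/nodes requested (site-dependent meaning)
#BSUB -W 48:00             # walltime
#BSUB -gpu "num=7"         # GPU request (newer LSF; older versions differ)

# Rough LSF counterparts of the SLURM environment variables used
# in the script:
#   SLURM_SUBMIT_DIR -> LS_SUBCWD   (directory bsub was invoked from)
#   SLURM_JOBID      -> LSB_JOBID   (job ID assigned by LSF)
cd "$LS_SUBCWD"
```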
The following line (a SLURM command, used in two places in the script)

scontrol show hostname $SLURM_NODELIST

essentially generates a list of the unique hostnames of the nodes assigned by the job scheduler. You will need to find the corresponding LSF variable, or a command that generates such a list. For example, the output of the above command on my cluster is
gpu-stage05
gpu-stage06
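Under LSF, one way to get the same kind of list might be the following (an untested sketch: I believe LSF sets LSB_HOSTS with one entry per allocated slot, so repeated hostnames have to be collapsed; the demo value is only there so the snippet runs outside LSF):

```shell
# LSB_HOSTS is set by LSF (one hostname per allocated slot, so nodes
# repeat); the demo default below just lets the snippet run outside LSF.
: "${LSB_HOSTS:=gpu-stage05 gpu-stage05 gpu-stage06 gpu-stage06}"

# Collapse to one line per unique node, mimicking
# `scontrol show hostname $SLURM_NODELIST`:
nodes=$(printf '%s\n' $LSB_HOSTS | sort -u)
echo "$nodes"
```

I believe LSF also writes a hostfile whose path is in LSB_DJOB_HOSTFILE, in which case something like `sort -u "$LSB_DJOB_HOSTFILE"` may work as well.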
The next block of code in the job submission script sshes into each assigned compute node and runs the node.sh script with the required variables. You will need to make sure that LSF or your cluster environment sets up CUDA_VISIBLE_DEVICES appropriately. If your LSF assigns whole nodes (for example, if no two users can share a compute node) and each node has 7 GPUs, then you can safely insert this line into the script

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6"

This means that you have access to all 7 devices on the compute node. On that same line, you will need to update --n-workers=7. Update both values to reflect the number of GPUs you actually have access to. Summary of what's being done: WESTPA sshes to each compute node and launches the number of workers given by --n-workers; each of these workers runs dynamics on a unique GPU device, which is why the value of this option must match the number of GPUs.
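Putting the pieces together, the ssh loop might look roughly like this. This is a hypothetical sketch, not the actual runwe_bridges.sh code: the exact arguments node.sh expects depend on your setup, the LSF variable names are my best guesses, and $SSH defaults to echo here so the loop dry-runs outside a cluster.

```shell
# Dry-run by default: set SSH=ssh on the real cluster.
SSH=${SSH:-echo}
# Demo defaults so the sketch runs outside LSF; LSF sets these for real jobs.
: "${LSB_HOSTS:=gpu-stage05 gpu-stage05 gpu-stage06 gpu-stage06}"
: "${LSB_JOBID:=12345}"
: "${LS_SUBCWD:=$PWD}"

# One line per unique node assigned by the scheduler.
nodes=$(printf '%s\n' $LSB_HOSTS | sort -u)

for node in $nodes; do
    # node.sh comes from the runwe_bridges.sh example; the arguments
    # here are placeholders -- pass whatever your node.sh expects.
    $SSH "$node" "$LS_SUBCWD/node.sh" "$LS_SUBCWD" "$LSB_JOBID" "$node" &
done
wait   # block until node.sh has finished on every node
```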
I believe runwe_bridges.sh is all that you need to modify. We would appreciate it if you would contribute this multi-GPU LSF example back to the community once it's working.
Thanks. -Kim
Thanks a lot Kim,
I was trying to adapt the script to LSF but ran into several errors. I apologize, I should have shared that too when I posted my original message. I ran into some errors because some of the variables were not declared properly, and I just realized that I forgot to change the ‘scontrol’ command according to LSF.
I will debug more carefully and keep you posted with more specific issues. Once I get the script to work, I’ll share it with the WESTPA community 😉.
Best,
Paola