Running on multiple GPU nodes with LSF scheduler


Bisignano, Paola

Jun 25, 2020, 4:39:08 PM
to westpa...@googlegroups.com

Hi all,

I would like to minimize the I/O overhead by running on the local scratch directory, and using multiple GPU nodes with the LSF scheduler.

I saw an example for the SLURM scheduler and I was wondering if anybody has something similar for the LSF scheduler.

Thanks a lot,

Best,

Paola

Anthony Bogetti

Jun 29, 2020, 4:19:34 PM
to westpa...@googlegroups.com

Hi Paola,

As far as I know, we only have example SLURM scripts for submitting WESTPA jobs (in the westpa/user_submitted_scripts GitHub repository), and I have not personally used LSF before.  I will try to locate something that will be of help to you.

Does anyone else in the WESTPA community have experience with the LSF scheduler?  Or does anybody know of any example scripts to be used with WESTPA?

Anthony


Kim F. Wong

Jun 30, 2020, 12:17:34 PM
to westpa...@googlegroups.com

Hi Paola,

I don't have access to LSF, but I can show you the logic of how to set this up and you should be able to get it working.  Also, this resource may help,

    https://hpc.llnl.gov/banks-jobs/running-jobs/batch-system-commands

First, please see my comment on June 5,

    https://groups.google.com/forum/#!msg/westpa-users/18mts9s_rxI/AIQpX9wEBAAJ

The SLURM script is runwe_bridges.sh.  You will need to replace the SLURM directives (lines beginning with #SBATCH at the top) with the appropriate LSF directives.  You will also need to replace the following with the appropriate LSF environment variables:

    SLURM_SUBMIT_DIR        -- This variable should point to the directory from which the LSF job was submitted

    SLURM_JOBID             -- This should be the job id assigned by LSF
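As a starting point, the replacement header and variable mapping might look something like the sketch below.  This is only a guess at a typical LSF setup: the job name, slot count, walltime, and output file names are assumptions you will need to adapt to your cluster, and you should check the directives against your site's bsub documentation.

```shell
#!/bin/bash
# Hypothetical LSF directives replacing the #SBATCH lines in runwe_bridges.sh.
#BSUB -J westpa_gpu           # job name (assumed)
#BSUB -n 2                    # slots/nodes requested (meaning is site-dependent)
#BSUB -W 24:00                # walltime, hh:mm
#BSUB -o westpa.%J.out        # stdout; %J expands to the LSF job id
#BSUB -e westpa.%J.err        # stderr

# LSF equivalents of the SLURM variables used in the script:
cd "${LS_SUBCWD:-$PWD}"            # submission directory (replaces SLURM_SUBMIT_DIR)
WEST_JOBID="${LSB_JOBID:-local}"   # job id assigned by LSF (replaces SLURM_JOBID)
```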

The line (it's a SLURM command and is used in two places)

    scontrol show hostname $SLURM_NODELIST

essentially generates a list of unique hostnames of the nodes assigned by the job scheduler.  You will need to find the corresponding LSF variable or the command to generate such a list.  For example, the output of the above command on my cluster is

    gpu-stage05
    gpu-stage06
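On LSF, the list of assigned hosts is commonly exposed in $LSB_HOSTS, with one entry per allocated slot (so node names repeat).  A minimal sketch of a de-duplication helper, assuming your LSF sets that variable:

```shell
# Hypothetical LSF stand-in for `scontrol show hostname $SLURM_NODELIST`.
# LSF repeats each host once per allocated slot in $LSB_HOSTS, so we
# de-duplicate to get one line per unique node.
unique_nodes() {
    echo "${LSB_HOSTS}" | tr ' ' '\n' | sort -u | sed '/^$/d'
}
```

On a two-node allocation this would print something like the gpu-stage05/gpu-stage06 list above.  Some sites instead provide a host file via $LSB_DJOB_HOSTFILE, in which case `sort -u "$LSB_DJOB_HOSTFILE"` does the same job.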

The next block of code in the job submission script sshes into each assigned compute node and runs the node.sh script with the required variables.  You will need to make sure that LSF or your cluster environment sets up CUDA_VISIBLE_DEVICES appropriately.  If your LSF assigns whole nodes (for example, no two users can share a compute node) and each node has 7 GPUs, then you can safely insert this line into the script

    export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6"

This means that you have access to all 7 devices on the compute node.  On that same line, you will also need to update --n-workers=7; both values must reflect the number of GPUs you actually have access to.  Summary of what's being done: WESTPA sshes to each compute node and launches --n-workers worker processes; each of these workers performs dynamics on a unique GPU device, which is why the value of this variable reflects the number of GPUs.
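The per-node launch step could be sketched as the shell function below.  This is not the actual code from runwe_bridges.sh: it assumes LSF provides a host file in $LSB_DJOB_HOSTFILE, that node.sh accepts an --n-workers flag, and that whole nodes are assigned; paths, the GPU count, and the node.sh arguments are all placeholders to adapt.

```shell
# Sketch: ssh into each unique node and start node.sh with one worker per GPU.
launch_workers() {
    num_gpus=$1
    for node in $(sort -u "$LSB_DJOB_HOSTFILE"); do
        # Exposing all 7 devices assumes whole-node allocation; keep the
        # device list consistent with num_gpus on your cluster.
        ssh "$node" "export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6; \
            cd $LS_SUBCWD && ./node.sh --n-workers=$num_gpus" &
    done
    wait    # block until node.sh exits on every node
}
```

Backgrounding each ssh with `&` and then calling `wait` lets all nodes run dynamics concurrently while the job script stays alive until every node.sh finishes.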

I believe runwe_bridges.sh is all that you need to modify.  We would appreciate it if you would contribute this multi-GPU LSF example back to the community once it's working.

Thanks.  -Kim

Bisignano, Paola

Jun 30, 2020, 1:23:21 PM
to westpa...@googlegroups.com

Thanks a lot Kim,

I was trying to adapt the script to LSF but ran into several errors. I apologize, I should have shared those when I posted my original message. Some of the variables were not declared properly, and I just realized that I forgot to change the 'scontrol' command to its LSF equivalent.

I will debug more carefully and keep you posted with more specific issues. Once I get the script to work, I'll share it with the WESTPA community 😉.

Best,

Paola

vipin sachdeva

May 20, 2021, 11:01:38 AM
to westpa-users
Hi Paola/Kim,

Would you be willing to share your LSF scripts? Thanks.

Lillian Chong

Jun 15, 2021, 10:17:59 AM
to westpa...@googlegroups.com
Dear WESTPA users,
Thanks to Tanner and Vipin, here is an LSF script for running WESTPA:
All the best,
Lillian



--
Lillian T. Chong           
Associate Professor    
Department of Chemistry
University of Pittsburgh
219 Parkman Avenue
Pittsburgh, PA 15260
(412) 624-6026
