Hi Moutoshi,
We've been working on a multi-GPU example for the
westpa_tutorials. Can you test it? Here are the steps:
1. git clone https://github.com/burntyellow/westpa_tutorials.git -b multi-GPU
2. The example is basic_nacl_amber_multi-GPU.
3. The relevant files that you need to update/inspect are:
(a) For env.sh, you need to expose the paths to your Amber installation (a short sketch follows).
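As a rough illustration only -- the paths below are placeholders for wherever Amber lives on your system, and a module load works just as well:

  # hypothetical env.sh additions; adjust to your Amber install
  export AMBERHOME=/opt/amber20        # placeholder path
  source $AMBERHOME/amber.sh           # puts pmemd.cuda etc. on your PATH
  # or, on clusters with environment modules:
  # module load amber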
(b) You don't need to make any changes to node.sh, but take note of the CUDA_VISIBLE_DEVICES_ALLOCATED line.
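For orientation (this is a guess at the form, not a quote of the file -- check node.sh itself): since the ssh line in (c) passes $CUDA_VISIBLE_DEVICES as the fourth argument after node.sh, the line in question presumably captures that list, roughly

  export CUDA_VISIBLE_DEVICES_ALLOCATED=$4   # device list forwarded from runwe_bridges.sh

and runseg.sh then picks one device out of it, as described in (d).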
(c) runwe_bridges.sh is the SLURM job submission script, so depending on how compute is set up at your institution, you will need to update it accordingly. Focus on the SLURM directives at the top and also on the line

  ssh -o StrictHostKeyChecking=no $node $PWD/node.sh \
      $SLURM_SUBMIT_DIR $SLURM_JOBID $node $CUDA_VISIBLE_DEVICES \
      --work-manager=zmq --n-workers=4 \
      --zmq-mode=client --zmq-read-host-info=$SERVER_INFO \
      --zmq-comm-mode=tcp &

towards the bottom of the file. The assumption here is that 1 GPU will run 1 instance of Amber. Since I requested 4 GPUs on each node with the SLURM directive

  #SBATCH --gres=gpu:4

I want the number of workers to equal the number of Amber instances on each node (which in this case also equals the number of GPUs). A sketch of how these pieces fit together follows.
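Roughly, runwe_bridges.sh is laid out along these lines -- the partition name and walltime are placeholders for your cluster, the server-side w_run call is only indicated as a comment, and the script in the tutorial branch is the reference:

  #!/bin/bash
  #SBATCH --nodes=1              # raise this for multi-node runs (see below)
  #SBATCH --gres=gpu:4           # 4 GPUs -> 4 Amber instances per node
  #SBATCH -p GPU                 # placeholder partition
  #SBATCH -t 48:00:00            # placeholder walltime

  # ... source env.sh and start the WESTPA zmq server here; $SERVER_INFO
  # is the file the server writes its host/port info to ...

  # launch one zmq client per allocated node; each client runs 4 workers
  for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
      ssh -o StrictHostKeyChecking=no $node $PWD/node.sh \
          $SLURM_SUBMIT_DIR $SLURM_JOBID $node $CUDA_VISIBLE_DEVICES \
          --work-manager=zmq --n-workers=4 \
          --zmq-mode=client --zmq-read-host-info=$SERVER_INFO \
          --zmq-comm-mode=tcp &
  done
  wait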
(d) westpa_scripts/runseg.sh
The key parts here are the lines

  export CUDA_DEVICES=(`echo $CUDA_VISIBLE_DEVICES_ALLOCATED | tr , ' '`)
  export CUDA_VISIBLE_DEVICES=${CUDA_DEVICES[$WM_PROCESS_INDEX]}

The first line takes all the SLURM-allocated CUDA devices and puts them into a temporary array, CUDA_DEVICES. The second line does what Josh suggested, which is to expose a single GPU device to the Amber execution line that follows. The device that gets exposed is selected by $WM_PROCESS_INDEX, which is tied to the --n-workers value requested in runwe_bridges.sh.
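As a hypothetical worked example (the device numbers are made up): with 4 allocated GPUs and --n-workers=4, $WM_PROCESS_INDEX takes a value from 0 to 3 on each worker, so worker 2 ends up pinned to device 2:

  CUDA_VISIBLE_DEVICES_ALLOCATED=0,1,2,3     # as exported by node.sh
  WM_PROCESS_INDEX=2                         # this worker's index
  CUDA_DEVICES=(`echo $CUDA_VISIBLE_DEVICES_ALLOCATED | tr , ' '`)
  echo ${CUDA_DEVICES[$WM_PROCESS_INDEX]}    # prints 2 -> the only GPU Amber sees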
This setup will also work for multi-node/multi-GPU runs by changing #SBATCH --nodes=<desired_number_nodes> in runwe_bridges.sh.
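For instance (keeping the same 4 GPUs per node as above), requesting

  #SBATCH --nodes=2
  #SBATCH --gres=gpu:4

should give 2 nodes x 4 GPUs = 8 Amber instances running concurrently, with one zmq client and 4 workers on each node.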
-Kim