Parallelizing WESTPA + OpenMM on SLURM


anando...@gmail.com
Mar 24, 2025, 2:31:19 PM
to westpa-users
Hello WESTPA community,

I am working on setting up WESTPA simulations with OpenMM on a SLURM-based cluster, and I am looking to parallelize segment propagation across 4 GPUs on a single node. I am using the processes work manager and running OpenMM on one GPU, but I would like to scale up to take advantage of all 4 GPUs.

1. Could anyone with experience in GPU-parallel WESTPA help answer the following? How can I best assign segment propagation to multiple GPUs (e.g., all 4) using the processes or zmq work manager? Do I need to modify runseg.sh to set CUDA_VISIBLE_DEVICES per segment (see the sketch after this list)? Is there a way to ensure that segments do not all end up on GPU 0?

2. Is w_run --work-manager zmq better suited for this than processes, even on a single node with multiple GPUs? I understand the ZMQ server/client setup, but I am unclear on how to map each client (worker) to a specific GPU when they all run on the same node.

3. Have others encountered significant overhead from OpenMM context creation? If so, are there known optimizations (e.g., sharing a context between segments in runseg.sh) that reduce launch time? Example scripts would be greatly appreciated, especially SLURM job scripts (submit.sh) and WESTPA configs (run.sh, runseg.sh) that handle multi-GPU execution cleanly.
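
For concreteness, the per-segment mapping I have in mind for runseg.sh looks roughly like the excerpt below. This is only a sketch: it assumes the work manager exports a per-worker index in WM_PROCESS_INDEX (as in the WESTPA GPU examples) and that all 4 GPUs are visible to the job.

# runseg.sh (excerpt) -- sketch of per-segment GPU assignment.
# Assumes WM_PROCESS_INDEX (0..n_workers-1) is set by the work manager
# and that SLURM exposes the full allocation, e.g. CUDA_VISIBLE_DEVICES=0,1,2,3.

# Remember the full allocation once, then pin this worker to one device.
export CUDA_VISIBLE_DEVICES_ALLOCATED=${CUDA_VISIBLE_DEVICES_ALLOCATED:-$CUDA_VISIBLE_DEVICES}
IFS=',' read -ra GPUS <<< "$CUDA_VISIBLE_DEVICES_ALLOCATED"
export CUDA_VISIBLE_DEVICES=${GPUS[$(( WM_PROCESS_INDEX % ${#GPUS[@]} ))]}

# Optional: log the assignment to confirm segments spread across GPUs.
echo "seg $WEST_CURRENT_SEG_ID -> GPU $CUDA_VISIBLE_DEVICES (worker $WM_PROCESS_INDEX)" >> $WEST_SIM_ROOT/gpu_assignment.log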

I am happy to share my full scripts if that would help clarify; a rough sketch of the SLURM wrapper I am aiming for is below. I really appreciate any help you can provide.
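
The partition, module, and environment names in this sketch are placeholders, not our actual cluster settings:

#!/bin/bash
# submit.sh -- sketch: single node, 4 GPUs, one WESTPA run.
#SBATCH --job-name=westpa_openmm
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00
#SBATCH --partition=gpu        # placeholder partition name

module load cuda               # placeholder; depends on the cluster
source activate westpa-env     # placeholder conda environment

cd $SLURM_SUBMIT_DIR
./run.sh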

Leung, Jeremy
Mar 24, 2025, 4:00:59 PM
to westpa...@googlegroups.com
Hi Anand,


2. ZMQ works in both single-node and multi-node settings. I personally prefer using processes on a single node, even with multiple GPUs (see the sketch below), because you no longer have to go through tcp/ip connections (i.e., ssh), which can be slow or lossy in HPC settings. One less hoop to jump through, one less way for jobs to fail. See the links in 1. for more explanation of ZMQ.
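
A minimal sketch of that single-node processes setup, with one worker per GPU (the w_run flags are the standard work-manager options; env.sh is just the naming convention from the WESTPA examples):

#!/bin/bash
# run.sh -- sketch: processes work manager, one worker per GPU.
# Each worker pins itself to a single device inside runseg.sh
# (e.g., using WM_PROCESS_INDEX as in the GPU examples).

source env.sh    # placeholder: set WEST_SIM_ROOT, activate WESTPA/OpenMM env

w_run --work-manager processes --n-workers 4 &> west-$SLURM_JOB_ID.log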

3. Startup overhead for OpenMM tends to be short compared to the propagation time, unless you have a really short tau. You could implement a shared context (say, using NVIDIA MPS), and that has demonstrated speedups, but it is not really needed unless you are stretching system limits.
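
If you do want to experiment with MPS, starting the daemon in the job script before launching WESTPA is usually all it takes. A sketch (the pipe/log directories here are arbitrary choices):

# Enable NVIDIA MPS for the lifetime of the job so multiple OpenMM
# contexts can share each GPU with less per-context overhead.
export CUDA_MPS_PIPE_DIRECTORY=$SLURM_SUBMIT_DIR/mps_pipe
export CUDA_MPS_LOG_DIRECTORY=$SLURM_SUBMIT_DIR/mps_log
mkdir -p $CUDA_MPS_PIPE_DIRECTORY $CUDA_MPS_LOG_DIRECTORY
nvidia-cuda-mps-control -d            # start the MPS control daemon

./run.sh                              # launch WESTPA as usual

echo quit | nvidia-cuda-mps-control   # stop MPS at the end of the job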

Last thing I'd like to add: in the WESTPA context, running one segment per GPU (and running more segments per bin) is probably more beneficial/faster than trying to get a single segment to run across multiple GPUs, especially with NVIDIA SLI discontinued.

Best,

Jeremy L.

---
Jeremy M. G. Leung, PhD
Postdoctoral Associate, Chemistry (Chong Lab)
University of Pittsburgh | 219 Parkman Avenue, Pittsburgh, PA 15260
jml...@pitt.edu | [He, Him, His]


Anupam Anand Ojha
Mar 24, 2025, 7:32:23 PM
to westpa...@googlegroups.com
Thank you, Jeremy, for the reply. Things are much clearer now.



--
Best regards,

A. Anand Ojha
Flatiron Research Fellow
Center for Computational Biology & Center for Computational Mathematics
Flatiron Institute, New York