Hi there, I hope you don't mind me adding to this thread to share my experience running WESTPA simulations on high-end GPUs. It seemed like an appropriate place to post this.
I have access to NVIDIA SuperPOD nodes, each with 8x A100 GPUs. I have been running WESTPA simulations of all-atom protein-protein interactions in explicit solvent (~300k atoms) using GROMACS 2023, which is highly optimized for GPUs. I've found that with too short a τ value (20 ps in my case) while running 128 simulations in parallel per node (1 simulation per CPU and 16 per GPU with MPS), the entire calculation randomly crashes after some time due to kernel panics. NVIDIA support narrowed the cause of the kernel panics down to an issue with the Lustre filesystem. It seems Lustre was overloaded by I/O operations. WESTPA was writing files too fast for the filesystem to keep up!
To alleviate this, I first cut all unnecessary file generation out of my simulations. This helped somewhat, but the crashes continued. I also combined my many auxiliary datasets into a single multi-dimensional dataset that was passed to WESTPA. This didn't seem to make much difference.
In the end, I had to substantially increase the τ value so that most of the time was spent on the simulations rather than on the calculation of pcoords/aux data (which in my case is CPU only). I/O operations now happen less often, which increases overall simulation throughput but reduces the enhanced-sampling advantage, since resampling now occurs less frequently.
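To put rough numbers on the τ trade-off described above, here is a minimal back-of-the-envelope sketch. The 20 ps τ and the 128 parallel segments come from the setup above; the 100 ps comparison value and the count of 2 files handed to WESTPA per segment are purely illustrative assumptions, and the function names are made up for this example. It counts handoffs per nanosecond of simulated time per walker, not per wall-clock second, since the actual wall-clock rate depends on GROMACS throughput.

```python
def resampling_rounds_per_ns(tau_ps: float) -> float:
    """Resampling rounds (and hence I/O bursts) per ns of simulated time per walker."""
    if tau_ps <= 0:
        raise ValueError("tau_ps must be positive")
    return 1000.0 / tau_ps


def file_handoffs_per_ns(tau_ps: float, n_parallel_segments: int,
                         files_per_segment: int) -> float:
    """Files handed to WESTPA across all parallel segments while each
    segment advances by 1 ns of simulated time."""
    return resampling_rounds_per_ns(tau_ps) * n_parallel_segments * files_per_segment


if __name__ == "__main__":
    # 128 parallel segments is from the post; 2 files per segment and
    # tau = 100 ps are hypothetical values chosen only for illustration.
    for tau in (20.0, 100.0):
        handoffs = file_handoffs_per_ns(tau, n_parallel_segments=128,
                                        files_per_segment=2)
        print(f"tau = {tau:5.1f} ps: {handoffs:.0f} file handoffs per simulated ns")
```

The count scales as 1/τ, so a 5x longer τ means 5x fewer file handoffs for the same amount of simulated time, at the cost of 5x fewer resampling rounds.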
If I could make one recommendation to the WESTPA developers to help address this issue, it would be to create an API that passes data directly from the calculation output into WESTPA's memory (and from there to the HDF5 file), rather than requiring the user to write calculation outputs (such as pcoords and aux datasets) to temp files, which WESTPA then loads into memory and writes to HDF5. I'm sure there are technical reasons that make this difficult to achieve, but I figured I would at least add my two cents!
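To make the suggestion a bit more concrete, here is a toy sketch contrasting the two data paths. It is not WESTPA's API: `handoff_via_tempfile` and `handoff_in_memory` are hypothetical names, and plain NumPy text files stand in for whatever format the temp files actually use. The point is only that both paths deliver the same array, while the first one costs a file create, a write, a read, and a delete on the shared filesystem for every segment.

```python
import os
import tempfile

import numpy as np


def handoff_via_tempfile(values: np.ndarray) -> np.ndarray:
    """File-based path: write the result to a temp file, then read it back.

    Each call touches the filesystem four times (create, write, read, delete).
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        np.savetxt(path, values)
        return np.loadtxt(path, ndmin=2)
    finally:
        os.remove(path)


def handoff_in_memory(values: np.ndarray) -> np.ndarray:
    """Hypothetical direct path: hand the array over without touching disk."""
    return np.array(values, dtype=float, ndmin=2)


if __name__ == "__main__":
    # Stand-in for one segment's combined pcoord/aux data: 5 frames x 3 values.
    data = np.arange(15, dtype=float).reshape(5, 3)
    from_file = handoff_via_tempfile(data)
    from_memory = handoff_in_memory(data)
    print("same data either way:", np.allclose(from_file, from_memory))
```

In a real multi-process setup the calculation and WESTPA run in different processes, so an in-memory path would need some form of inter-process transport, which is presumably part of what makes this hard.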
Best,
Hayden