Question about the usage of GPUs


Ibrahim Mohamed

Dec 21, 2023, 12:18:03 PM
to westpa-users
In the WESTPA 2.0 paper, the section on GPU usage includes this paragraph:

If your WE simulation has extremely frequent starting up of simulation segments, your simulation may overheat gaming GPUs and potentially damage the hardware. For example, folding simulations of the NTL9 protein in implicit solvent with a τ value of 15 ps resulted in such issues on gaming GPUs (i.e. NVIDIA GTX 1080Ti GPUs) while the same simulations have no such issues on professional graphics-programming GPUs.

I have a workstation with an NVIDIA Quadro RTX 4000. If I run a simulation on this GPU, will it have this issue?

Thanks

Jeremy Leung

Dec 21, 2023, 12:37:45 PM
to westpa-users
Hi Ibrahim,

This potential issue also depends on the size of your system, how often frames are written, the τ value you choose, and how many concurrent segments you run, all of which dictate the frequency of reads/writes. The NTL9 example was especially stressful because it was in implicit solvent (low number of atoms), used a short τ value, and was run on a cluster where many GPUs and segments were running at the same time.

Since the RTX 4000 is a professional GPU, I'd assume it'll probably fare better than a GTX 1080 Ti. Also, since this is a workstation with just one (or a handful of) GPU(s), I'd think it'd fare better than in a cluster environment, where there's potentially more collective heat generation that could overheat the GPUs. But if you're worried about potential damage, you can benchmark by watching the temperature with `nvidia-smi`, starting with a larger τ (~100 ps for proteins) and slowly cranking it down.
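As a rough illustration of the benchmarking idea above, here is a minimal Python sketch that polls `nvidia-smi` for GPU temperatures while a test run is in progress. The function names, the 80 °C limit, and the polling interval are assumptions made for this example, not values from WESTPA or NVIDIA; check the vendor's specification for your card before choosing a threshold.

```python
import subprocess
import time


def parse_temperatures(csv_text):
    """Parse nvidia-smi CSV output: one temperature (deg C) per line, one line per GPU."""
    temps = []
    for line in csv_text.splitlines():
        value = line.strip()
        if value.isdigit():  # skip blank lines and non-numeric entries
            temps.append(int(value))
    return temps


def over_limit(temps, limit_c):
    """Return the indices of GPUs whose temperature is at or above limit_c."""
    return [i for i, t in enumerate(temps) if t >= limit_c]


def read_gpu_temperatures():
    """Query the current temperature of every visible GPU via nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)


def monitor(limit_c=80, interval_s=5.0, n_samples=12):
    """Print temperatures every interval_s seconds; limit_c is a hypothetical threshold."""
    for _ in range(n_samples):
        temps = read_gpu_temperatures()
        hot = over_limit(temps, limit_c)
        status = "over limit on GPU(s) %s" % hot if hot else "ok"
        print(temps, status)
        time.sleep(interval_s)
```

Calling `monitor()` in a separate terminal while a short WESTPA test run is going, once per trial τ value, gives a simple record of how the temperature responds as τ is reduced.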

Disclaimer: This is just my personal opinion/experience and we're not responsible for any sort of damage as stated in the MIT License for WESTPA.

Best,

Jeremy L.

Ibrahim Mohamed

Dec 21, 2023, 1:09:25 PM
to westpa-users
Hi Jeremy,

Okay, thank you for your suggestion. Actually, the system I am using has slightly more than 100,000 atoms, and I was using τ = 50 ps (this was with CPU only). I will try to use nvidia-smi to check the system temperature.

Hayden Scheiber

Jan 12, 2024, 4:11:20 PM
to westpa-users
Hi there, I hope you don't mind me adding to this thread to write about my experience with the usage of high-end GPUs running WESTPA simulations. This thread seemed like an appropriate place to post this.

I have access to NVIDIA Superpod nodes which have 8x A100 GPUs. I have been running WESTPA simulations of all-atom protein-protein interactions in explicit solvent (~300k atoms) using GROMACS 2023, which is highly optimized for GPUs. I've found that with too short a τ value (20 ps in my case) while running 128 simulations in parallel per node (1 simulation per CPU and 16 per GPU with MPS), the entire calculation will randomly crash after some time due to kernel panics. NVIDIA support narrowed down the cause of this kernel panic to some issue with the Lustre filesystem. It seems Lustre was overloaded by I/O operations. WESTPA was writing files too fast for the filesystem to keep up! To help alleviate this issue, I first tried to cut out all unnecessary file generation from my simulations. This helped some, but the crashes continued. I also combined my many auxiliary datasets into a single multi-dimensional dataset that was passed to WESTPA. This didn't seem to make much difference. In the end, I had to substantially increase the τ value so that most of the time was spent on the simulations rather than on the calculation of pcoords/aux data (which in my case is CPU only). This means that I/O operations are performed less frequently, which increases overall simulation throughput but reduces the enhanced sampling advantage, as resampling now occurs less often.
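To make the τ versus I/O trade-off described above concrete, here is a back-of-the-envelope Python sketch of how often segments finish (and therefore write their output files) on one node. The function name is made up for this example, and the 50 ns/day per-simulation speed is a hypothetical figure, not a measurement from this thread. The estimate also ignores pcoord/aux calculation time and the synchronization between WE iterations, so treat it as an order-of-magnitude guide only.

```python
def segment_completions_per_second(n_parallel, tau_ps, ns_per_day):
    """Estimate how many segments finish per second on one node.

    n_parallel : number of segments running at the same time
    tau_ps     : segment length (tau) in picoseconds
    ns_per_day : assumed MD speed of a single simulation, in ns/day
    """
    # Convert ns/day to ps/s: 1 ns = 1000 ps, 1 day = 86400 s.
    ps_per_second = ns_per_day * 1000.0 / 86400.0
    # Wall-clock time needed to propagate one segment of length tau.
    seconds_per_segment = tau_ps / ps_per_second
    return n_parallel / seconds_per_segment


# 128 parallel segments at a hypothetical 50 ns/day each.
short_tau = segment_completions_per_second(128, 20, 50)
long_tau = segment_completions_per_second(128, 100, 50)
print(round(short_tau, 2), round(long_tau, 2))
```

Under these assumptions the completion rate scales as 1/τ, so going from τ = 20 ps to τ = 100 ps cuts the rate of segment completions, and the file writes that come with each one, by a factor of five.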

If I could make a recommendation to WESTPA developers to help address this issue, it would be to create some API to pass data directly from calculation output into WESTPA-associated memory (and then to HDF5 files) rather than requiring the user to pass calculation outputs (such as pcoords and aux datasets) to temp files, which are then loaded by WESTPA into memory and passed to HDF5 files. I'm sure there are technical reasons that make this difficult to achieve, but I figured I would at least add my two cents!

Best,
Hayden

Daniel Zuckerman

Jan 12, 2024, 8:22:54 PM
to westpa-users
Hayden, thanks.  You make a good point.  --Dan


From: westpa...@googlegroups.com <westpa...@googlegroups.com> on behalf of Hayden Scheiber <hayden...@gmail.com>
Sent: Friday, January 12, 2024 1:11 PM
To: westpa-users <westpa...@googlegroups.com>
Subject: [EXTERNAL] [westpa-users] Re: Question about the usage of GPUs
 
--
You received this message because you are subscribed to the Google Groups "westpa-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to westpa-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/westpa-users/374b585d-a1a0-4185-96c8-e05a61aba69dn%40googlegroups.com.

