GPU memory bandwidth relationship to performance


Jared Sabre

Feb 3, 2026, 4:48:19 PM
to beast-users
Hi all. I work in IT at a university and I'm attempting to help one of our labs speed up their BEAST runs. I believe the postdocs in the lab are following a sort of "tribal knowledge" script to run BEAST on our somewhat dated HPC cluster, and they don't seem to know what resources they are requesting. The two data samples they gave me were reported to have taken 7 and 20 days, respectively, to run.

My question around GPU VRAM bandwidth comes from the following observation.

Taking the "7 day" XML file, running on a desktop with an RTX 4090, I am initially seeing ~5.4min/million states while NVTOP has GPU usage locked in at 100%.
Running the same "7 day" XML file on a desktop with a new RTX Pro 6000 Max-Q, using all the same settings, I am seeing ~2.6min/million states, but NVTOP shows GPU usage at ~43-48%.

In both cases I made sure double precision and CUDA (not OpenCL) were flagged for use. I also read in a paper that the likelihood kernels can be very memory-bandwidth bound, so I'd like to know: what are the limits to performance gains as memory bandwidth scales up? If kernels are waiting on memory operations, would it stand to reason that tracking down an H100 NVL, with ~4TB/s of memory bandwidth, could see an improvement over my card's 1.8TB/s?
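For context on my reasoning, here is the back-of-envelope arithmetic I'm working from. It assumes the kernels are purely bandwidth bound (so runtime scales inversely with peak bandwidth), and the bandwidth figures are published spec numbers, not anything I measured:

```python
# Back-of-envelope roofline-style check. If the likelihood kernels were
# purely memory-bandwidth bound, runtime would scale inversely with peak
# memory bandwidth. Bandwidth values below are published specs (assumed,
# not measured); this ignores clocks, caches, and occupancy entirely.

bw_gbps = {               # peak memory bandwidth in GB/s (assumed specs)
    "RTX 4090": 1008,
    "RTX Pro 6000": 1792,
    "H100 NVL": 3900,     # per-GPU figure for the NVL pair
}

min_per_mstates = {"RTX 4090": 5.4, "RTX Pro 6000": 2.6}  # measured above

def predict(baseline: str, target: str) -> float:
    """Predicted min/million states on `target` under a pure
    bandwidth-bound scaling model, anchored at `baseline`."""
    return min_per_mstates[baseline] * bw_gbps[baseline] / bw_gbps[target]

print(f"H100 NVL prediction:  {predict('RTX Pro 6000', 'H100 NVL'):.2f} min/M states")
print(f"RTX Pro from 4090:    {predict('RTX 4090', 'RTX Pro 6000'):.2f} min/M states")
```

Interestingly, the measured 2.6 min/M on the RTX Pro already beats the pure-bandwidth prediction (~3.0) anchored at the 4090, so bandwidth alone may not explain the scaling I'm seeing.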

Thank you!


Jared Sabre

Feb 9, 2026, 1:21:46 PM
to beast-users
Figured I would post this here in case the performance talk comes up again or someone else is interested. Using my "real" dataset XML, I used a cloud service to test on both an H200 (SXM model) and a B200. Both of these have at minimum double the memory bandwidth of my RTX Pro 6000s, and well over an order of magnitude (sometimes 20x) the stated FP64 double-precision throughput.

The H200 starts around 2.55 min/million states, rising to about 3 after roughly 5 minutes and holding there. The B200 starts around 3.6 min/million states and likewise settles to about 3 after roughly 5 minutes.

I don't know the extent to which double precision is used with my test data, whether different datasets vary in this respect, or even how much it's used in the various steps of the code, but I found it interesting that the additional FP64 compute performance didn't do anything for me. I also suspect from my results that the B200 being two chips presented as one actually hampers performance.

The overall conclusion I'm drawing here is that a good "matching" combination of memory bandwidth and shaders is most important. I think this is illustrated by comparing my numbers between the H200 and the RTX Pro.
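If it helps anyone reading along, here is the quick arithmetic behind that conclusion. The bandwidth figures are published peak specs (assumed, not measured), and the H200/B200 numbers use the ~3 min/million steady state reported above:

```python
# Measured steady-state throughput (min/million states, from this thread)
measured = {"RTX 4090": 5.4, "RTX Pro 6000": 2.6, "H200": 3.0, "B200": 3.0}

# Published peak memory bandwidth in TB/s (assumed spec values)
bw_tbps = {"RTX 4090": 1.0, "RTX Pro 6000": 1.8, "H200": 4.8, "B200": 8.0}

# 4090 -> RTX Pro 6000: ~2.1x speedup against a ~1.8x bandwidth ratio,
# so the two cards are at least roughly tracking bandwidth.
speedup = measured["RTX 4090"] / measured["RTX Pro 6000"]
bw_ratio = bw_tbps["RTX Pro 6000"] / bw_tbps["RTX 4090"]
print(f"speedup {speedup:.2f}x vs bandwidth ratio {bw_ratio:.2f}x")

# RTX Pro 6000 -> H200: ~2.7x the bandwidth, yet a slight slowdown,
# which is why raw bandwidth alone clearly isn't the whole story.
print(f"H200 'speedup': {measured['RTX Pro 6000'] / measured['H200']:.2f}x")
```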

I will post more numbers here when I can continue testing.

