I am working on a study region with a grid size of 538 × 904 × 10. I compiled ParFlow 3.14.1 with CUDA 12.8, OpenMPI 4.0.3, UCX 1.17.0, Umpire, and Hypre (FoundHypre with CUDA backend). I am running simulations with the MGSemi preconditioner and the FullJacobian matrix type.
With two NVIDIA H800 GPUs, I observe a speedup of only about 4x compared to a 144-core CPU run. Increasing to four H800 GPUs gives no further improvement; the speedup stays at around 4x.
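For context, here is a rough sketch of how much work each device gets in the three runs. It assumes the cells are divided evenly and that one MPI rank drives each GPU; both are assumptions for illustration, not settings confirmed for these runs, and the function name is hypothetical.

```python
# Back-of-the-envelope workload per device for the 538 x 904 x 10 grid.
# Assumption: cells are split evenly across devices (real subgrids may differ).
NX, NY, NZ = 538, 904, 10


def cells_per_device(n_devices: int) -> float:
    """Return the average number of grid cells handled by each device."""
    return NX * NY * NZ / n_devices


# Compare the 144-core CPU run with the 2-GPU and 4-GPU runs.
for label, n in [("144 CPU cores", 144), ("2 GPUs", 2), ("4 GPUs", 4)]:
    print(f"{label}: {cells_per_device(n):,.0f} cells per device")
```

Under these assumptions, going from two to four GPUs halves the per-GPU workload from roughly 2.4 million to 1.2 million cells. Whether that is too little work per GPU to offset communication cost is a guess I have not measured.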
Could you please advise whether this performance is expected for this configuration? Also, are there any recommended strategies or settings to further improve GPU acceleration in this scenario?
Thank you very much for your time and guidance.