MUMAX3 GPU Benchmarks: Scaling Comparison of Cache-Heavy Ada Lovelace (L40S) and Blackwell (RTX 5080) vs. Ampere (A100)

PeiYu Cai

Apr 23, 2026, 1:58:02 PM
to mumax2
Dear MUMAX3 developers and community members,

I am writing to share recent GPU benchmark data collected across several hardware architectures, which may serve as useful reference points for the community. The primary objective of this testing was to evaluate how different GPU architectures scale with grid size, specifically comparing high-clock, cache-heavy designs such as Ada Lovelace and Blackwell against high-bandwidth designs such as Ampere.
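Such a grid-size sweep can be scripted. The sketch below is not the official benchmark script from the MUMAX3 repository; it is a minimal illustration that writes one small .mx3 input per grid size so the same solver settings can be timed at every size. The file names, cell size, step count, and material parameters are all placeholders.

```python
# Generate one minimal MUMAX3 input file per grid size for a timing sweep.
# NOT the repository's official benchmark script; all values are placeholders.
from pathlib import Path

TEMPLATE = """\
SetGridSize({n}, {n}, {n})
SetCellSize(4e-9, 4e-9, 4e-9)
Msat = 800e3
Aex  = 13e-12
m    = RandomMag()
Steps(100)
"""

def write_sweep(outdir: str = "bench_scripts") -> list[Path]:
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    # 10^3 = 1000 cells up to 406^3 ~ 6.7e7 cells, matching the range above.
    for n in (10, 16, 32, 64, 128, 256, 406):
        p = out / f"bench_{n}.mx3"
        p.write_text(TEMPLATE.format(n=n))
        paths.append(p)
    return paths

if __name__ == "__main__":
    scripts = write_sweep()
    # On a GPU host with mumax3 installed, each file would then be run, e.g.:
    #   subprocess.run(["mumax3", str(p)]) for each p in scripts
    print(f"wrote {len(scripts)} scripts")
```

Timing each run and dividing the number of solver evaluations by wall time then yields the evaluations-per-second metric discussed below.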

The benchmarking consisted of two phases, capturing both synthetic scaling and practical performance. For the synthetic scaling, I used the official benchmark script provided in the MUMAX3 repository to measure evaluations per second across grid sizes ranging from roughly 1,000 to over 67 million cells. As a realistic counterpart to the synthetic tests, I also ran a long-duration micromagnetic simulation of a thin-film system on a 256 x 256 x 15 grid, modeling a dynamic pulse excitation with periodic boundary conditions, the Dzyaloshinskii-Moriya interaction, demagnetization, and uniaxial anisotropy.
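For readers who want to reproduce a comparable workload, a MUMAX3 input along the following lines matches the ingredient list above. The grid size and interaction terms follow the post; every numeric material parameter, the PBC repetition count, and the pulse shape are placeholders, not the values actually used in my runs.

```
// Sketch of the realistic thin-film run described above.
// All numeric parameters below are illustrative placeholders.
SetGridSize(256, 256, 15)
SetCellSize(2e-9, 2e-9, 1e-9)   // placeholder cell size
SetPBC(4, 4, 0)                 // in-plane periodic boundary conditions

Msat  = 800e3                   // saturation magnetization [A/m]
Aex   = 13e-12                  // exchange stiffness [J/m]
Dind  = 1e-3                    // interfacial DMI [J/m^2]
Ku1   = 500e3                   // uniaxial anisotropy [J/m^3]
anisU = vector(0, 0, 1)         // easy axis out of plane
// Demagnetization is enabled by default in MUMAX3.

m = Uniform(0, 0, 1)
Relax()

// Dynamic pulse excitation: a short Gaussian field pulse (placeholder shape).
B_ext = vector(0, 0, 0.01*exp(-pow((t-1e-9)/50e-12, 2)))
TableAutoSave(10e-12)
Run(5e-9)
```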

The results show a distinct divergence in scaling behavior that correlates strongly with the underlying GPU architecture. The cache-heavy cards, the L40S (Ada Lovelace) and the RTX 5080 (Blackwell), exhibited pronounced performance peaks at approximately 1 million cells. This suggests that grids of this size fit largely within their large L2 caches, allowing the high core clock speeds to dominate computational throughput. Conversely, the Ampere cards, including several A100 configurations, scaled more gradually but sustained high throughput at grid sizes of 4 million cells and beyond. This highlights the advantage of High Bandwidth Memory once the simulation state outgrows the processor's cache capacity.
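A back-of-envelope working-set estimate is consistent with this picture. The sketch below compares an estimated hot working set against spec-sheet L2 sizes (roughly 40 MB for the A100, 96 MB for the L40S, 64 MB for the RTX 5080). The count of four single-precision vector fields per evaluation is my rough guess at what the solver touches (magnetization, effective field, torque, scratch), not a measured number.

```python
# Back-of-envelope check: does a simulation's hot working set fit in L2?
# Cache sizes are approximate spec-sheet values; the vector-field count is
# a rough assumption, not a profiled figure.

L2_BYTES = {
    "A100": 40 * 2**20,
    "L40S": 96 * 2**20,
    "RTX 5080": 64 * 2**20,
}

def working_set_bytes(n_cells: int, n_vector_fields: int = 4) -> int:
    """3 components x 4 bytes (float32) per cell, per vector field."""
    return n_cells * n_vector_fields * 3 * 4

def fits_in_l2(n_cells: int, gpu: str) -> bool:
    return working_set_bytes(n_cells) <= L2_BYTES[gpu]

if __name__ == "__main__":
    for cells in (10**6, 256 * 256 * 15, 4 * 10**6):
        mib = working_set_bytes(cells) / 2**20
        verdict = {g: fits_in_l2(cells, g) for g in L2_BYTES}
        print(f"{cells:>9} cells ~ {mib:6.1f} MiB  {verdict}")
```

Under these assumptions, a ~1 M-cell grid (~46 MiB) fits in the L40S and RTX 5080 caches but not the A100's, while a 4 M-cell grid (~183 MiB) spills out of every cache and becomes bandwidth-bound, matching the observed crossover.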

In the realistic simulation scenario, which ran at approximately 983,000 cells, the high-clock architectures completed the task noticeably faster than the high-bandwidth architectures. The L40S and RTX 5080 were fastest in this regime, supporting the observation that cache-bound physics simulations benefit greatly from recent architectural advances. I have attached the anonymized raw data logs and the comparative scaling plots below for integration into the community datasets. I hope this information proves helpful for researchers matching hardware allocations to thin-film simulations versus large volumetric models.

Best Regards,

PC
gpu_combined_scaling_anonymized_v3.png
MUMAX3 GPU Benchmark Anonymized.txt
gpu_benchmarks_anonymized_v3.png