Same Code, Different GPU, Different Results


Esther Chen

Apr 3, 2026, 2:27:26 PM
to mumax2
Hi Mumax community,

I have two identical copies of a script that simulates switching of the MTJ free layer. The script starts from the same initialized state, applies the same current, then turns the current off and waits for the free layer to relax. Everything is kept the same (material parameters, geometry, run time, etc.; the script is an exact copy-paste). However, the two runs produce different results on an NVIDIA L40S versus an NVIDIA H100.
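For context, here is a stripped-down sketch of what the protocol looks like (the numbers below are placeholders, not our actual material parameters or pulse values):

    // sketch of the switching protocol; all values are placeholders
    SetGridSize(32, 32, 1)
    SetCellSize(2e-9, 2e-9, 1e-9)
    SetGeom(Circle(64e-9))

    Msat  = 1e6
    Aex   = 1e-11
    Ku1   = 8e5
    anisU = vector(0, 0, 1)
    alpha = 0.01

    m = uniform(0, 0, 1)            // same initialized state on both GPUs

    // apply the current (Slonczewski STT)
    FixedLayer = vector(0, 0, -1)
    Pol = 0.6
    J   = vector(0, 0, 5e11)
    Run(2e-9)

    // turn the current off and let the free layer relax
    J = vector(0, 0, 0)
    Run(5e-9)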

Is this purely a hardware issue? Could the Run() function use some sort of adaptive scheme under the hood that causes the difference? Is there a way to work around this so our simulation results are not hardware dependent?

I have attached the results for both jobs for review. Thank you very much!
h100.zip
l40s.zip

Josh Lauzier

Apr 7, 2026, 2:58:52 AM
to mumax2
Hi,

Running your script, I think there are a couple of issues with it. I didn't have time to fully debug it, but if you view it in the GUI you can see that the spins start oscillating like crazy. After a while the arrows don't align, and if you view the x-component (or B_exch) you get a checkerboard pattern, which is usually a sign that the solver is struggling with something and the exchange is going a bit crazy. This is with the current set to 0, so it's not being caused by STT.

mmhelp2.jpg


It is not obvious if you just look at the total magnetization, but you can see the checkerboard pretty clearly if you look at the components, or at B_exch directly. To get a clearer picture, you can also use TableAutoSave() to save at periodic intervals instead of at just the 3 spots, as in the snippet below. You can also see it in B_exch in the table, where the value fluctuates wildly even though the magnetization doesn't change much visually. Your material parameters have a very large Msat and Ku. Without calculating, my guess is that your anisotropy is very close to the shape anisotropy, and that is confusing the solver? Also, there is no symmetry breaking in the x-y plane, which probably doesn't help. It is a bit strange, as the solver seems to be taking reasonable time steps of ~1e-14 s.
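Something along these lines is what I mean (the intervals are just examples):

    // log the table periodically instead of only at a few points
    TableAdd(B_exch)
    TableAutoSave(10e-12)

    // also dump snapshots so the checkerboard shows up when you view the components
    AutoSave(m, 50e-12)
    AutoSave(B_exch, 50e-12)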

The first thing I would start with is hitting it with a relax() and a minimize() before you start to Run(). That, along with reducing MaxErr a bit, to something like 1e-9, and changing the grid size to 64, 64, 2 (mostly for the z direction: having 2 cells gives it the flexibility to adjust top/bottom) seemed to calm it down quite a bit. If it is acceptable, I would also recommend a small bias field like B_ext = vector(0.01, 0, 0) just to break up the symmetry. You might need to play around with it further. (Also, just as a very brief test, FixDt = 1e-16 also seemed to work, but that is absurdly small and I wouldn't normally recommend it outside of testing/debugging; it does help confirm there is an issue.) Normally it is not recommended to change from the defaults, but I think you will need to play around with the solver settings a bit.
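Roughly the combination I tried, on top of the rest of your script (take the exact numbers as a starting point, not a recommendation):

    SetGridSize(64, 64, 2)        // 2 cells in z gives the top/bottom some room
    // keep the physical size the same, so halve the in-plane cell size accordingly

    B_ext = vector(0.01, 0, 0)    // small bias field to break the in-plane symmetry

    relax()                       // relax first...
    minimize()                    // ...then minimize before running

    MaxErr = 1e-9                 // tighter than the default 1e-5
    // FixDt = 1e-16              // debugging only, absurdly small for production

    Run(2e-9)                     // then your actual pulse + relaxation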

If that doesn't work, there are some other things I can suggest, but I would start by tweaking the script. In general, mumax3 should not have any dependence on the hardware. When the solver goes crazy, it becomes very susceptible to things like numerical noise that isn't real.

One last thing: your script seems to be focused on STT. The STT torque goes like m x (m_p x m) and m x m_p, where m_p is the fixed-layer magnetization. For your fixed layer and initial magnetization, both of these are going to be very, very small, ~0. I don't know if that's intentional.
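For example (just as an illustration, I don't know your actual orientations): if the free layer starts exactly along +z and the fixed layer points along -z, then m = (0, 0, 1) and m_p = (0, 0, -1) give m x m_p = (0, 0, 0), and therefore m x (m_p x m) = (0, 0, 0) as well. Only noise tilting m away from the axis lets the torque grow, which also makes the result very sensitive to tiny numerical differences.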


Best,
Josh L.

Esther Chen

Apr 28, 2026, 1:42:54 PM
to mumax2
Hi Josh, 

Thank you so much for your reply!

I wanted to share an update in case it's helpful. The discrepancy disappears when I increase the grid size to (64, 64, 2) or even (64, 64, 1), but it remains for (32, 32, 2). This suggests that the dominant source is the discretization in the x-y plane: approximating the circular geometry with a 32x32 grid is likely too coarse, which makes the simulation very sensitive to small numerical differences.
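For completeness, the change that removed the discrepancy was just the discretization (the cell sizes shown are illustrative; we kept the physical dimensions fixed):

    // before: discrepancy between GPUs
    // SetGridSize(32, 32, 2)
    // SetCellSize(4e-9, 4e-9, 1e-9)

    // after: same physical size, finer in-plane discretization, no discrepancy
    SetGridSize(64, 64, 2)
    SetCellSize(2e-9, 2e-9, 1e-9)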

I also observed a consistent grouping across GPUs:
NVIDIA H100 and NVIDIA H200 (Hopper) → same results
NVIDIA A100 and NVIDIA L40S (non-Hopper) → same results
But the two groups differ from each other.

While tracing where the trajectories start to diverge and trying to understand this grouping, I found that even at initialization, before any time evolution, Edens_demag already differs by ~0.1, which is far larger than the ~1e-7 level expected from single-precision roundoff. (Everything else at initialization is identical.) Immediately after the evolution starts, the first 10 steps already show a ~1e-5 difference in <mz>. So I suspect the drastic difference we saw is not just an accumulation of floating-point noise; it may be related to how the demag field is computed on different GPUs at each step.
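In case anyone wants to reproduce the comparison, something like the following dumps the relevant quantities at t = 0, before any Run() (the output file names are arbitrary):

    m = uniform(0, 0, 1)                       // same initial state on both machines

    TableAdd(E_demag)                          // total demag energy
    TableSave()                                // one table row at t = 0

    SaveAs(B_demag, "b_demag_init")            // demag field at t = 0
    SaveAs(Edens_demag, "edens_demag_init")    // demag energy density at t = 0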

To narrow it down, I checked that the demag kernel (which is computed on the CPU) is identical between the H100 and the L40S. The discrepancy therefore likely comes from the GPU-side computation of B_demag. Since I'm not an expert in CUDA or Go, I asked ChatGPT to read through the mumax3 source and the CUDA code and pinpoint where this could become hardware dependent, and it suggested that the difference might come from architecture-dependent behavior in the GPU FFT (cuFFT) used to compute B_demag. Since an FFT involves many parallel reductions and floating-point operations (which are not associative), different kernel implementations or execution orders across architectures can lead to slightly different results.
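As a concrete illustration of the non-associativity: in single precision, (2^24 + 1) - 2^24 evaluates to 0 (because 2^24 + 1 rounds back to 2^24), while 2^24 + (1 - 2^24) evaluates to 1. So just changing the order in which an FFT or reduction accumulates terms can change the last bits of the result.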

In summary, increasing the grid size in the x-y plane (from 32x32 to 64x64) eliminated the divergence across architectures for me. Hope this makes sense and helps others running into similar issues!

Best,
Esther