I wrote a small test program that solves a non-linear equation using the RK4 solver implemented in deal.II and assembles the right-hand side with the matrix-free framework (code is attached). Since it is meant to serve as the basis for a larger program, I wanted to check its scaling behavior. I therefore ran several tests, both on my development machine (i7-6700, 8 threads) and on the high-performance machine (E5-2560 v4, 24 threads). Both machines were configured to use AVX extensions in deal.II, and the program itself was compiled in release mode.
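For context, the MPI-process/TBB-thread split in the tables below is controlled roughly along the following lines. This is only a simplified sketch, not the attached code; the thread count and variable name are placeholders. The third constructor argument of MPI_InitFinalize limits the number of TBB worker threads per MPI process.

#include <deal.II/base/mpi.h>
#include <deal.II/base/multithread_info.h>

#include <iostream>

int main(int argc, char **argv)
{
  // Placeholder value: in the runs below it is chosen so that
  // (MPI processes) x (TBB threads per process) = number of cores.
  const unsigned int tbb_threads_per_process = 4;

  // The third argument caps the number of TBB worker threads
  // that deal.II uses within each MPI process.
  dealii::Utilities::MPI::MPI_InitFinalize mpi_init(argc, argv,
                                                    tbb_threads_per_process);

  std::cout << "TBB threads in this process: "
            << dealii::MultithreadInfo::n_threads() << std::endl;

  return 0;
}

The number of MPI processes is then set on the command line, e.g. mpirun -np 4 ./program.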
Running the program on both machines, I compared the time it took to compute the first time step:
Local machine:
MPI processes   TBB threads   Time (s)
1               8             170
2               4             40
4               2             20
HPC:
MPI processes   TBB threads   Time (s)
1               24            840
2               12            887
4               6             424
8               3             41
12              2             28
24              1             14
I do not fully understand this behavior: Why is the code so much slower on the E5 than on the i7, except when using all 24 threads? Is that due to the different clock frequencies, or to the newer architecture (Broadwell vs. Skylake)? And why does going from 1 MPI process to 2 on the i7 make the code four times faster, while going from 2 to 4 processes gives only a factor of two (which is what I would expect)?
Similarly for the E5: going from 1 MPI process to 2 does not speed up the code at all, going from two to four processes halves the execution time (as expected), but going from four to eight gives a speedup of about a factor of ten. The steps after that follow the expected pattern again.
Are there any explanations for this behavior? And if not, what could I do to investigate it more deeply?
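Would it make sense, for example, to instrument the code with deal.II's TimerOutput to see which part of a time step dominates? A minimal sketch of what I have in mind (not the attached code; the section names and the empty bodies are placeholders):

#include <deal.II/base/timer.h>

#include <iostream>

void run_one_time_step(dealii::TimerOutput &timer)
{
  {
    // Placeholder section: matrix-free evaluation of the right-hand side.
    dealii::TimerOutput::Scope t(timer, "assemble RHS");
    // ... cell loop would go here ...
  }
  {
    // Placeholder section: the RK4 stage/vector updates.
    dealii::TimerOutput::Scope t(timer, "RK4 vector updates");
    // ... vector operations would go here ...
  }
}

int main()
{
  // Prints a per-section summary of wall times when the object is destroyed.
  dealii::TimerOutput timer(std::cout,
                            dealii::TimerOutput::summary,
                            dealii::TimerOutput::wall_times);

  run_one_time_step(timer);

  return 0;
}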
Thanks!