Donna Calhoun <donna....@gmail.com> writes:
> Thanks for sharing this - very interesting.
>
> What exactly is meant by “memory bottlenecks”?
Presumably memory bandwidth (these CPUs can do about 80 vectorized flops
per double-precision value loaded from DRAM) and limited cache reuse when
accessing the non-contiguous strips (cf. Fig 6.1) before calling rpn2.
In 2D patch-based AMR, the patches are often small enough that it's
merely non-contiguous access rather than cache misses, and I think
that's fine. In 3D, it's easy for patches to exceed cache (e.g., 3D
Euler will exceed L2 cache with a patch between 20^3 and 32^3 depending
on architecture; maybe sooner when including required scratch arrays).
FWIW, the omp simd vectorization strategy here is equivalent to what we
use for CPU in libCEED, pTatin, and other finite element solvers
(implemented in C), and (in libCEED) we reuse the same kernels with
just-in-time compilation to target GPUs.