Donna Calhoun <donna....@gmail.com> writes:
> Thanks for sharing this - very interesting.
>
> What exactly is meant by “memory bottlenecks”?
Presumably memory bandwidth (these CPUs can do about 80 vectorized flops
per double-precision value loaded from DRAM) and limited cache reuse when
accessing the non-contiguous strips (cf. Fig 6.1) before calling rpn2.
In 2D patch-based AMR, the patches are often small enough that it's
merely non-contiguous access rather than cache misses, and I think
that's fine. In 3D, it's easy for patches to exceed cache (e.g., 3D
Euler will exceed L2 cache with a patch between 20^3 and 32^3 depending
on architecture; maybe sooner when including required scratch arrays).
FWIW, the omp simd vectorization strategy here is equivalent to what we
use for CPU in libCEED, pTatin, and other finite element solvers
(implemented in C), and (in libCEED) we reuse the same kernels with
just-in-time compilation to target GPUs.