I have been comparing MAGMA performance on AMD and NVIDIA GPUs and am seeing a large difference for a method that relies on dense diagonalization (it is much faster on Summit than on Spock). I'd like to track down the cause and would appreciate any advice or help from the MAGMA developers.
The codes I am working with are
I'm using Nsight Systems/Compute for NVIDIA and rocprof for AMD. Both highlight several kernels, so talking to someone about what these kernels do and how they behave might be a good place to start.