Hi Solal,

On 04/02/2020 08:39, Solal Amouyal wrote:
> From the information provided by microway:
>
> * 9x Intel 6540 = 11.25 TFlops (CPU taken at median flops)
> * 2x V100 = 14-16 TFlops.
>
> So theoretically, the 2 GPUs should offer better performance, but not as
> much as I've experienced. The issue lies somewhere else.
>
> I'll start profiling and see if the MPI isn't an issue (shouldn't be
> with only 18 ranks). I'll also benchmark my BLAS to see how it performs
> with respect to other measurements found online. From what I understand,
> as PyFR is written in Python, it heavily relies on BLAS for compute
> performance.

So, a few things to check. First is the compiler: I have sometimes had
better results with ICC than with GCC (but always be sure to use the
latest version of the compiler). Second, I think that this case (where
anti-aliasing is disabled) is limited not by FLOP/s but by memory
bandwidth, so a comparison of peak FLOP/s may not tell you much. It
also means PyFR will probably be using GiMMiK rather than vendor BLAS
on both platforms. On CPUs, one thing you can do to improve performance
is to make libxsmm available on the shared library path. If it is
available, PyFR will call into it for sparse (and dense) BLAS, and it
tends to outperform everything else.
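
As a rough way to check whether libxsmm is visible on the shared
library path, something like the following sketch can be used. It
relies only on the Python standard library (ctypes), and the helper
names find_shared_library and can_load are made up for illustration.
It assumes the library stem is "xsmm" (as in libxsmm.so); how PyFR
itself locates the library is not shown here and may differ.

```python
import ctypes
import ctypes.util


def find_shared_library(name):
    """Return what the system loader resolves for `name`, or None."""
    return ctypes.util.find_library(name)


def can_load(name):
    """Return True if the library can be located and opened."""
    found = find_shared_library(name)
    if found is None:
        return False
    try:
        ctypes.CDLL(found)
    except OSError:
        return False
    return True


if __name__ == "__main__":
    # "xsmm" is assumed to be the stem of the libxsmm shared object.
    print("libxsmm found:", find_shared_library("xsmm"))
    print("libxsmm loadable:", can_load("xsmm"))
```

If this reports None, the library is probably not on the search path
used by the dynamic loader; a positive result only shows that the
loader can see it, not that PyFR is actually using it.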
Another thing to check is that the OpenMP threads are not all getting
pinned to the same core. This can happen with some combinations of
OpenMP runtimes and MPI libraries. One thing you might want to try here
is running one MPI rank per core (with OMP_NUM_THREADS=1) and seeing if
this makes a difference.
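
One way to get a first look at the pinning is to print the CPU mask
each process inherits from the MPI launcher. The sketch below is an
assumption-laden illustration: allowed_cpus and pinning_report are
hypothetical helpers, os.sched_getaffinity is only available on Linux,
and it sees the process-level mask rather than any per-thread binding
applied later by the OpenMP runtime.

```python
import os


def allowed_cpus():
    """Return the set of CPUs this process may run on, or None."""
    if hasattr(os, "sched_getaffinity"):  # Linux only
        return set(os.sched_getaffinity(0))
    return None


def pinning_report(num_threads, cpus):
    """Say whether `num_threads` threads fit within the `cpus` mask."""
    if cpus is None:
        return "affinity unavailable on this platform"
    if len(cpus) < num_threads:
        return f"oversubscribed: {num_threads} threads on {len(cpus)} CPU(s)"
    return f"ok: {num_threads} threads on {len(cpus)} CPU(s)"


if __name__ == "__main__":
    # If OMP_NUM_THREADS is unset the OpenMP runtime picks its own
    # default; 1 is used here only as a placeholder. Only the first
    # entry of a comma-separated (nested) value is considered.
    raw = os.environ.get("OMP_NUM_THREADS", "1")
    threads = int(raw.split(",")[0])
    cpus = allowed_cpus()
    shown = sorted(cpus) if cpus else cpus
    print(os.getpid(), shown, pinning_report(threads, cpus))
```

Launching this with the same MPI launcher and options as the real job
prints one line per rank. A rank whose mask holds a single CPU while
OMP_NUM_THREADS is larger than one would be consistent with all of its
threads sharing one core.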

Regards, Freddie.