Strong scaling issue on a deal.II based framework running on Skylake nodes.


David Montiel Taboada

unread,
Jan 21, 2021, 11:52:27 PM
to deal.II User Group
Hello, 

I am using the PRISMS-PF framework (which is based on deal.II) on the Skylake (skx) nodes (with 48 processors each) of the Stampede2 cluster. 

I recently ran a series of strong scaling tests and noticed that the intra-node performance (i.e. 1 node, 1-48 processors) scales poorly, specifically the solver part. However, once I get past one node, the scaling is closer to ideal (taking 1 node as a reference).

Here is the behavior I got (solver part only; in every case I used as many MPI ranks as cores):

Processors, Nodes, Solver time (s)
1, 1, 821
2, 1, 608
4, 1, 525
8, 1, 482
24, 1, 435
48, 1, 427
96, 2, 211
192, 4, 109

Does anyone know what may be the problem?

The code uses the matrix-free method and requires only the p4est and MPI libraries, which I included as dependencies when I ran cmake to install deal.II.

Here is the line I used:

cmake -DDEAL_II_WITH_MPI=ON -DDEAL_II_WITH_P4EST=ON \
      -DCMAKE_INSTALL_PREFIX=$WORK/dealii_install $WORK/dealii-9.2.0

Am I perhaps missing a flag?

By the way, the home (login) nodes, which I used to install deal.II and compile my code, are also Skylake, so I would expect my code to perform well.

I do not observe the same issue elsewhere (e.g., on my local machine or on the KNL nodes of Cori).

Any suggestions that might help me figure out this issue are appreciated.

Best,

David 




Martin Kronbichler

unread,
Jan 22, 2021, 6:34:02 AM
to dea...@googlegroups.com

Dear David,

Without knowing the exact components of deal.II you are using, the first place I would look is whether you are using multi-threaded BLAS or multithreading within deal.II. So you could try to do

export DEAL_II_NUM_THREADS=1
export OMP_NUM_THREADS=1

or disable multithreading when compiling deal.II (and use serial BLAS/LAPACK libraries) and check again. The behavior you describe looks like a combination of parts of the solver that speed up well and other parts that speed up very little or not at all.
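A minimal sketch of both routes Martin suggests, for a run script on Stampede2 (the `DEAL_II_WITH_THREADS` flag name is taken from deal.II 9.2's CMake options; the install paths are placeholders matching the earlier cmake line):

```shell
# Route 1: force single-threaded execution at run time,
# before launching the solver.
export DEAL_II_NUM_THREADS=1
export OMP_NUM_THREADS=1

# Route 2: rebuild deal.II with threading disabled at compile time
# (flag name as in deal.II 9.2; paths are placeholders):
# cmake -DDEAL_II_WITH_MPI=ON -DDEAL_II_WITH_P4EST=ON \
#       -DDEAL_II_WITH_THREADS=OFF \
#       -DCMAKE_INSTALL_PREFIX=$WORK/dealii_install $WORK/dealii-9.2.0

# Quick sanity check that the runtime settings took effect.
echo "$DEAL_II_NUM_THREADS $OMP_NUM_THREADS"
```

If route 1 already restores good intra-node scaling, the rebuild in route 2 makes the fix permanent without relying on environment variables.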

The second suspect would be memory bandwidth limitations within the node. But even if you are fully memory bound, you should see a speedup factor of ~10-12 when going from 1 to 48 cores on a node (or somewhat less if the processor has full turbo frequency turned on and thus clocks higher with 1 core loaded than with all 24 cores per socket loaded), while you observe much less than that.
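To put the ~10-12x figure in numbers: the ceiling for a memory-bound code is the ratio of full-node to single-core memory bandwidth. The bandwidth values below are typical published STREAM-like figures for a two-socket Skylake node, used here as assumptions rather than measurements:

```shell
# A single Skylake core sustains roughly 13 GB/s from main memory,
# while a full 2-socket node sustains roughly 150 GB/s (assumed
# ballpark figures). The best possible memory-bound speedup on one
# node is then the ratio of the two:
echo $(( 150 / 13 ))   # ~11x
```

The observed 821 s -> 427 s (about 1.9x from 1 to 48 cores) is far below even this pessimistic ceiling, which is why the threading explanation is the more likely one.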

Best,
Martin


David Montiel Taboada

unread,
Jan 22, 2021, 12:30:49 PM
to dea...@googlegroups.com
Thank you, Martin

I will try your suggestions!

Best, 

David
