Stephen,
> For some background, I'm doing matrix-free calculations on distributed
> triangulations similar to step-48. Naively, when I started with deal.II I
> assumed that I would want to use MPI parallelism between nodes, but that
> threads would be better within a node. However, it's been clear that the
> fastest approach on a single node is to use as many MPI processes as there are
> physical cores and then leave the default threading behavior when
> calling mpi_initialization. It's a little slower (~10%) if I limit each
> process to a single thread. Conversely if I use a single MPI process and the
> default threading behavior, it's much slower (~50%) than the other two ways.
> My understanding is that this behavior can be explained by the contents of the
> shared-memory module entry
> <
https://www.dealii.org/9.0.0/doxygen/deal.II/group__threads.html>, that not
> everything in deal.II works well for task-based parallelization.
Yes, that is correct.
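For reference, the configurations you compared differ only in the last argument to Utilities::MPI::MPI_InitFinalize (the object step-48 creates at the top of main()). A minimal sketch, assuming a typical deal.II main program, would look like this:

```cpp
#include <deal.II/base/mpi.h>

int main(int argc, char **argv)
{
  using namespace dealii;

  // Default: let TBB use as many threads as there are cores
  // visible to this MPI process (your fastest variant, when
  // combined with one process per physical core):
  Utilities::MPI::MPI_InitFinalize mpi_initialization(
    argc, argv, numbers::invalid_unsigned_int);

  // To limit each process to a single thread instead (the
  // ~10% slower variant), one would write:
  //   Utilities::MPI::MPI_InitFinalize mpi_initialization(argc, argv, 1);

  // ...set up the distributed triangulation, MatrixFree
  // objects, etc., as in step-48...
}
```

This fragment will of course only compile against an installed deal.II; it is meant to show where the thread limit is set, not as a complete program.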
> With all of that said, am I right in thinking that the role of TBB in the
> context of distributed calculations is to farm out some work to the virtual
> threads on each core? The virtual threads aren't useful for MPI, but there's
> some excess capacity (~10%) that TBB can take advantage of.
That's probably a good mental model. From a practical perspective, a program
cannot tell the difference between threads and cores: they all look the same.
It's not as if there were a "physical" core/thread and a "virtual" core/thread.
But in practice, I think what you state is how one should see things.
> Thanks! (And thank you for taking care of the nitty-gritty parallelization
> details so that I don't have to!)
Thanks. Thread parallelization is surprisingly difficult in practice. I tend
to think of it as more difficult than MPI in many cases.
Best
W.
--
------------------------------------------------------------------------
Wolfgang Bangerth  email: bang...@colostate.edu
                   www:   http://www.math.colostate.edu/~bangerth/