Stephen,
> For some background, I'm doing matrix-free calculations on distributed
> triangulations similar to step-48. Naively, when I started with deal.II I
> assumed that I would want to use MPI parallelism between nodes, but that
> threads would be better within a node. However, it's been clear that the
> fastest approach on a single node is to use as many MPI processes as there are
> physical cores and then leave the default threading behavior when
> calling mpi_initialization. It's a little slower (~10%) if I limit each
> process to a single thread. Conversely if I use a single MPI process and the
> default threading behavior, it's much slower (~50%) than the other two ways.
> My understanding is that this behavior can be explained by the contents of the
> shared-memory module entry
> <
https://www.dealii.org/9.0.0/doxygen/deal.II/group__threads.html>, that not
> everything in deal.II works well for task-based parallelization.
Yes, that is correct.
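For reference, the configurations you compared differ only in the last argument to Utilities::MPI::MPI_InitFinalize (the object step-48 creates at the top of main()). A minimal sketch, assuming a typical deal.II main program, would look like this:

```cpp
#include <deal.II/base/mpi.h>

int main(int argc, char **argv)
{
  using namespace dealii;

  // Default: let TBB use as many threads as there are cores
  // visible to this MPI process (your fastest variant, when
  // combined with one process per physical core):
  Utilities::MPI::MPI_InitFinalize mpi_initialization(
    argc, argv, numbers::invalid_unsigned_int);

  // To limit each process to a single thread instead (the
  // ~10% slower variant), one would write:
  //   Utilities::MPI::MPI_InitFinalize mpi_initialization(argc, argv, 1);

  // ...set up the distributed triangulation, MatrixFree
  // objects, etc., as in step-48...
}
```

This fragment will of course only compile against an installed deal.II; it is meant to show where the thread limit is set, not as a complete program.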
> With all of that said, am I right in thinking that the role of TBB in the
> context of distributed calculations is to farm out some work to the virtual
> threads on each core? The virtual threads aren't useful for MPI, but there's
> some excess capacity (~10%) that TBB can take advantage of.
That's probably a good mental model. From a practical perspective, a program
cannot tell the difference between threads and cores: they all look the same.
It's not as if there were a "physical" core/thread and a "virtual" core/thread.
But in practice, I think what you state is how one should see things.
> Thanks! (And thank you for taking care of the nitty-gritty parallelization
> details so that I don't have to!)
Thanks. Thread parallelization is surprisingly difficult in practice. I tend
to think of it as more difficult than MPI in many cases.
Best
W.
--
------------------------------------------------------------------------
Wolfgang Bangerth  email: bang...@colostate.edu
                   www:   http://www.math.colostate.edu/~bangerth/