Advice on parallelism

Victor Eijkhout

May 8, 2026, 1:15:48 PM
to deal.II User Group
I'm running with

OMP_PROC_BIND=true   DEAL_II_NUM_THREADS=100 OMP_NUM_THREADS=100 ./foo

and

MultithreadInfo::n_threads();

tells me that there are indeed 100 threads.

However! htop seems to show only 1 or 2 cores active. And of course speed of processing is abysmal.

What am I missing?

-- Victor.

Wolfgang Bangerth

May 9, 2026, 1:12:39 AM
to dea...@googlegroups.com
On 5/8/26 10:15, Victor Eijkhout wrote:
Victor,
you're not saying what program/functionality you're running. In general,
deal.II doesn't use OpenMP. It does parallelize some functionality via
Taskflow, but not everything is parallelized, and without knowing what it is
that ./foo actually does, it's hard to tell whether you *should* be able to
expect that things run in parallel.

Best
W.

Victor Eijkhout

May 10, 2026, 11:22:12 AM
to deal.II User Group
Fair comment. I'm running your step-35 tutorial with minimal modifications.

Is there documentation on how deal.II does parallelism? I've come across mention of OMP_NUM_THREADS as a limit; it is clearly using hwloc to discover hardware parallelism; and reading the source I see OMP_DEAL_II_THREADS. I'm just not sure how the whole caboodle fits together.

Victor.

Victor Eijkhout

May 10, 2026, 11:23:45 AM
to deal.II User Group
PS: In one attempt to run in parallel, I got a warning from Kokkos that I was oversubscribing cores.

Wolfgang Bangerth

May 13, 2026, 4:11:01 PM
to dea...@googlegroups.com
Victor,
most of the parallelism would have to happen in the application program
(step-35 in your case). The key loop of that program looks like this:

loop:
interpolate_velocity();
diffusion_step(...);
projection_step(...);
update_pressure(...);

The second and third of these do expensive things like assembling linear
systems and solving them. For assembling the linear system, the program uses
WorkStream, which we know scales reasonably well to perhaps a dozen cores (or
maybe two dozen, depending on what the workload actually is). For solving
linear systems, there isn't much parallelism to be had: The matrices are very
small (31k and 4k rows) so matrix-vector products do not scale well to more
than at most a handful of cores, and the preconditioner used (ILU) has no
parallelism at all -- not for lack of implementation, but because
forward/backward substitution is inherently sequential.
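To illustrate (a minimal standalone sketch, not deal.II code): in a forward substitution for a lower-triangular system L*x = b, each unknown depends on all previously computed ones, so there is nothing independent to hand to a second core.

```cpp
#include <cstddef>
#include <vector>

// Forward substitution for a lower-triangular system L*x = b.
// Note the loop-carried dependence: computing x[i] reads x[0..i-1],
// so the outer iterations cannot run concurrently.
std::vector<double>
forward_substitution(const std::vector<std::vector<double>> &L,
                     const std::vector<double>               &b)
{
  const std::size_t   n = b.size();
  std::vector<double> x(n);
  for (std::size_t i = 0; i < n; ++i)
    {
      double s = b[i];
      for (std::size_t j = 0; j < i; ++j)
        s -= L[i][j] * x[j]; // depends on every earlier x[j]
      x[i] = s / L[i][i];
    }
  return x;
}
```

Backward substitution has the same dependence structure, just running from the last row to the first.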

So I'm not surprised you don't get much speed-up. You'd need (i) a much larger
problem to give the operations that *are* parallelized enough work to scale,
and (ii) different algorithms than the ones this program uses to solve the
linear systems. Both of these are of course possible; it's just not what this
program does, nor what its intent is: the tutorials are meant to *teach* how
to write finite element codes; they're not *intended* to be HPC-ready
applications.

Best
W.