Advice on parallelism

128 views
Skip to first unread message

Victor Eijkhout

unread,
May 8, 2026, 1:15:48 PMMay 8
to deal.II User Group
I'm running with

OMP_PROC_BIND=true   DEAL_II_NUM_THREADS=100 OMP_NUM_THREADS=100 ./foo

and

MultithreadInfo::n_threads();

tells me that there are indeed 100 threads.

However! htop seems to show only 1 or 2 cores active. And of course speed of processing is abysmal.

What am I missing?

-- Victor.

Wolfgang Bangerth

unread,
May 9, 2026, 1:12:39 AMMay 9
to dea...@googlegroups.com
On 5/8/26 10:15, Victor Eijkhout wrote:
> **
Victor,
you're not saying what program/functionality you're running. In general,
deal.II doesn't use OpenMP. It does parallelize some functionality via
Taskflow, but not everything is parallelized, and without knowing what it is
that ./foo actually does, it's hard to tell whether you *should* be able to
expect that things run in parallel.

Best
W.

Victor Eijkhout

unread,
May 10, 2026, 11:22:12 AMMay 10
to deal.II User Group
Fair comment. I'm running your step35 tutorials with minimal modifications.

Is there documentation on how dealII does parallelism? I've come across mention of OMP_NUM_THREADS as a limit, and clearly it is using hwloc to discover hardware parallelism, and reading the source I see OMP_DEAL_II_THREADS, but I'm not sure how the whole caboodle fits together.

Victor.

Victor Eijkhout

unread,
May 10, 2026, 11:23:45 AMMay 10
to deal.II User Group
PS and in some attempt to run in parallel I got a warning message from kokkos that I was oversubscribing cores. 

Wolfgang Bangerth

unread,
May 13, 2026, 4:11:01 PMMay 13
to dea...@googlegroups.com
Victor,
most of the parallelism would have to happen in the application program
(step-35 in your case). The key loop of that program looks like this:

loop:
interpolate_velocity();
diffusion_step(...);
projection_step(...);
update_pressure(...);

The second and third of these do expensive things like assembling linear
systems and solving them. For assembling the linear system, the program uses
WorkStream, which we know scales reasonably well to perhaps a dozen cores (or
maybe two dozen, depending on what the workload actually is). For solving
linear systems, there isn't much parallelism to be had: The matrices are very
small (31k and 4k rows) so matrix-vector products do not scale well to more
than at most a handful of cores, and the preconditioners used (ILU) has no
parallelism at all -- not because it isn't implemented, but because one can't
parallelize forward/backward substitution at all.

So I'm not surprised you don't get much speed-up. You'd need (i) a much larger
program to make those operations that are parallelized work well, and (ii) use
different algorithms than the ones used in this program to solve linear
problems. Both of these are of course possible, it's just not what this
program does. Nor what its intent is: the tutorials are meant to *teach* how
to write finite element codes; they're not *intended* to be HPC-ready
applications.

Best
W.

Victor Eijkhout

unread,
May 14, 2026, 3:31:27 PMMay 14
to deal.II User Group
I appreciate that they are tutorials, but given that I need a projection-step Navier-Stokes solver and that I lack the FEM background I thought I'd take that code.

Size of the problem: my main modification is reading in my own mesh. So I can have way more variables.

ILU: I'm ashamed to admit that that obvious fact slipped by me. Ok, I've switched to a Block Jacobi. That does increase the thread activity. Yay.
However, the pressure solve does not converge. Do I understand that BlockJacobi uses full inverses of the blocks? That ought to be pretty good.

"This costs some additional memory - for DG methods about 1/3 (for double inverses) or 1/6 (for float inverses) of that used for the matrix - but it makes the preconditioning much faster."

I don't get that. Is DG so different from regular FEM? The diagonal block of a FEM matrix is itself a FEM matrix on a subdomain, so the inverse (even if you mean that you "invert" by doing a full LU) takes a ton more storage.

My problem doesn't get awfully large, and I have lots of memory. Should I use a direct solver for both pressure & velocity solve? Can I assume that UMFpack is suitably multi-threaded?

Your thoughts as always appreciated.

V.

Wolfgang Bangerth

unread,
May 14, 2026, 11:28:21 PMMay 14
to dea...@googlegroups.com

Hi Victor:

> I appreciate that they are tutorials, but given that I need a projection-step
> Navier-Stokes solver and that I lack the FEM background I thought I'd take
> that code.

That's not unreasonable, but perhaps not what we intend these tutorials to be
for if speed is a major consideration.


> Size of the problem: my main modification is reading in my own mesh. So I can
> have way more variables.
>
> ILU: I'm ashamed to admit that that obvious fact slipped by me. Ok, I've
> switched to a Block Jacobi. That does increase the thread activity. Yay.
> However, the pressure solve does not converge. Do I understand that
> BlockJacobi uses full inverses of the blocks? That ought to be pretty good.

You mean deal.II's BlockJacobi class? I had to look it up (it's a rarely used
class that is >25 years old). My understanding is that the class takes the
diagonal blocks, inverts them, and multiples the vector to be preconditioned
by these inverse blocks. If you take block_size=1, then that's just a regular
Jacobi preconditioner. If you took block_size=n_dofs, you'd get an exact
inverse, but that's of course impractical. Whether it's a good preconditioner
then depends on how big the block size is that you choose.

I looked into the implementation as well. It doesn't use threads, though at
least there's no *fundamental* obstacle in the algorithm to it: Everything
(computing the inverses, applying the inverses) should parallelize trivially.
It's just that the class was written before we started to think about
threading. I don't think it would be terribly difficult to make the class use
threads -- it's just that someone needs to feel strongly enough about it to
offer the time.


> "This costs some additional memory - for DG methods about 1/3 (for double
> inverses) or 1/6 (for float inverses) of that used for the matrix - but it
> makes the preconditioning much faster."
>
> I don't get that. Is DG so different from regular FEM? The diagonal block of a
> FEM matrix is itself a FEM matrix on a subdomain, so the inverse (even if you
> mean that you "invert" by doing a full LU) takes a ton more storage.

I think what it is trying to say (without trying to do the math about it) is
that if, for example, you had a Laplace problem and you choose 4x4 diagonal
blocks, then what this class needs to store are the inverses of these 4x4
blocks. That costs as much memory as the original 4x4 diagonal blocks. But the
matrix itself of course also has many entries *outside* these 4x4 blocks, and
so storing the matrix is much more expensive than only storing the (inverses
of the) diagonal blocks. The person who wrote that comment apparently did the
math and figured out that the matrix in totally has 3 times as many entries as
just the diagonal blocks.

For DG methods, the matrix really does consists of 4x4 blocks (in 2d) because
that's the number of unknowns per cell, and these are all coupled to each
other. They couple to neighboring 4x4 blocks via the DG flux terms. For CG,
the matrix does not decompose into 4x4 blocks because degrees of freedom live
on multiple cells.

(I'm dubious about the claim with the 1/3 because I would have expected the
exact value that depends on the block size and the dimension we're in, but the
comment doesn't say any thing about that. We *ought* to fix this, but again
that requires someone taking the time to think this through.)


> My problem doesn't get awfully large, and I have lots of memory. Should I use
> a direct solver for both pressure & velocity solve? Can I assume that UMFpack
> is suitably multi-threaded?

For small enough problems, say <100,000 unknowns, direct solvers are often
faster than iterative ones. They're also robust -- no need to play with many
different options, they just work.

As for UMFPACK: No, it's not multithreaded. No widely used open source direct
solver is, to the best of my knowledge. PARDISO is, but it costs money and we
don't have interfaces to it. We also have these two issues that may be of
interest in this context:
https://github.com/dealii/dealii/issues/10414
https://github.com/dealii/dealii/pull/16602

Best
W.

Victor Eijkhout

unread,
May 15, 2026, 12:04:08 PMMay 15
to deal.II User Group
1. I thought multifrontal solvers are designed for parallelism. But anyway, since this is external to Deal, it wouldn't use your TaskFlow threads.
2. So BlockJacobi doesn't use parallel tasks either? I thought I was seeing thread activity in my run.

Ok, so how do I get a parallel preconditioner of higher quality than Jacobi?
Or since this is a time-stepping method, any parallel solver.

Should I use the Trilinos or PETSc wrappers and implicitly use MPI, even though I'm on a single node?

Nils Schween

unread,
May 18, 2026, 3:56:50 AMMay 18
to Victor Eijkhout, deal.II User Group
Hi Victor,

if you decide to use a direct solver and you go into the direction Prof.
Banghert hinted, i.e. you decide to use PARDISO: You can access it
indirectly to through the PETSC interface of deal.ii. You have to set
your -ksp_type to none and your -pc_type to lu and then you can choose
Pardiso with -pc_factor_mat_solver_type mkl_pardiso

See https://petsc.org/main/manualpages/Mat/MATSOLVERMKL_PARDISO/ for details.

Keep in mind that you need to instruct PARDISO how many cores to use. On
many clusters you can do this with exporting the variable OMP_NUM_THREADS

If you do not like to use a shared memory PARDISO, you can also use a
distributed memory variant called cluster PARDISO
https://petsc.org/release/manualpages/Mat/MATSOLVERMKL_CPARDISO/

A sidenote: If you like to use these you to need to compile PETSC such
that the solvers are known to it.

Best,
Nils
--
Nils Schween

Phone: +49 6221 516 557
Mail: nils.s...@mpi-hd.mpg.de
PGP-Key: 4DD3DCC0532EE96DB0C1F8B5368DBFA14CB81849

Max Planck Institute for Nuclear Physics
Astrophysical Plasma Theory (APT)
Saupfercheckweg 1, D-69117 Heidelberg
https://www.mpi-hd.mpg.de/mpi/en/research/scientific-divisions-and-groups/independent-research-groups/apt

Victor Eijkhout

unread,
May 18, 2026, 12:11:44 PMMay 18
to deal.II User Group
Nils,

I may try Pardiso.

However, can you advise me what iterative solvers D2 has natively that scale to large (or even modest) single-node core counts?

Also: I can not find information on how to measure performance. Do you have a petsc-style log view that tells me how much time was spent in each library routine & the corresrponding flop rate?

V.

Wolfgang Bangerth

unread,
May 18, 2026, 1:46:22 PMMay 18
to dea...@googlegroups.com
On 5/15/26 09:04, Victor Eijkhout wrote:
> **
>
> 1. I thought multifrontal solvers are designed for parallelism. But anyway,
> since this is external to Deal, it wouldn't use your TaskFlow threads.

*In principle*, multifrontal methods could use threads. In practice, it isn't
all that easy. In essence, each front eats its way through the domain until it
hits another front. At the end, the process stops when no front can move any
further into as-yet unprocessed parts of the domain, and then you have to
solve a dense linear system on those nodes that are parts of all of those
stalled fronts. The more fronts you have, the bigger the dense system will
typically be, and that limits the degree of parallelism that's efficient to
use. The dense linear system is also difficult to solve in parallel. With
large linear systems (say, 10^8 or more), you can probably scale this to 1000
parallel threads/processes as MUMPS does with MPI, for example. But for
smaller systems that you'd typically have on a single machine, say 10^6
unknowns, I would be surprised if you could scale this to more than a dozen or
two threads.

Either way, UMFPACK does not support parallelism. Since you're in Texas, I
imagine you know Tim Davis and could petition him to implement threading in
UMFPACK :-)


> 2. So BlockJacobi doesn't use parallel tasks either? I thought I was seeing
> thread activity in my run.

Not that I saw. I don't think it would actually be difficult to implement.


> Ok, so how do I get a parallel preconditioner of higher quality than Jacobi?
> Or since this is a time-stepping method, any parallel solver.
>
> Should I use the Trilinos or PETSc wrappers and implicitly use MPI, even
> though I'm on a single node?

If you really care about scaling in parallel, the only efficient solvers use
MPI. That's probably more due to how humans reason about writing their
software than for fundamental reasons, but all truly scalable PDE solvers use
MPI. On the other hand, that *really* gives you scalability: Programs such as
step-40 or step-32 have been run with very good efficiency to 10,000s of MPI
processes if you have sufficiently many unknowns (say, 10^5 times the number
of MPI ranks). The preconditioners used in such contexts are either algebraic
or geometric multigrid, both of which perform well to very large problem sizes
(>10^9).

Best
W>

Nils Schween

unread,
May 19, 2026, 3:33:58 AMMay 19
to Victor Eijkhout, deal.II User Group
Hi Victor,

> However, can you advise me what iterative solvers D2 has natively that
> scale to large (or even modest) single-node core counts?

I think Wolfgang's answer is very good. If you need a solver that
scales, MPI is usually used (see PETSc or Trilinos) and that case you do
not care much if use one node or more nodes.


> Also: I can not find information on how to measure performance. Do you
> have a petsc-style log view that tells me how much time was spent in
> each library routine & the corresrponding flop rate?

There is an additional petsc command line option that outputs the
internal PARDISO perfomance analysis. It is

-mat_mkl_pardiso_68 1

Here is an example output ( a million degrees of freedom on 20 Threads):

=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON


Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.543464 s
Time spent in reordering of the initial matrix (reorder) : 0.578133 s
Time spent in symbolic factorization (symbfct) : 1.705050 s
Time spent in data preparations for factorization (parlist) : 0.010526 s
Time spent in allocation of internal data structures (malloc) : 0.035932 s
Time spent in matching/scaling : 1.293615 s
Time spent in additional calculations : 1.807056 s
Total time spent : 5.973776 s

Statistics:
===========
Parallel Direct Factorization is running on 20 OpenMP

< Linear system Ax = b >
number of equations: 925696
number of non-zeros in A: 73808896
number of non-zeros in A (%): 0.008613

number of right-hand sides: 1

< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
< Preprocessing with state of the art partitioning metis>
number of supernodes: 46229
size of largest supernode: 6081
number of non-zeros in L: 402294842
number of non-zeros in U: 375995776
number of non-zeros in L+U: 778290618
=== PARDISO is running in In-Core mode, because iparam(60)=0 ===

Percentage of computed non-zeros for LL^T factorization
1 % 2 % 3 % 4 % 5 % 7 % 8 % 9 % 10 % 11 % 12 % 13 % 15 % 16 % 17 % 18 % 19 % 20 % 21 % 22 % 23 % 24 % 25 % 26 % 27 % 28 % 29 % 30 % 31 % 32 % 33 % 34 % 35 % 36 % 37 % 38 % 39 % 40 % 41 % 42 % 43 % 44 % 45 % 46 % 47 % 48 % 49 % 50 % 51 % 52 % 53 % 54 % 55 % 56 % 57 % 58 % 59 % 60 % 61 % 62 % 63 % 64 % 65 % 66 % 67 % 68 % 69 % 70 % 71 % 72 % 73 % 74 % 75 % 76 % 77 % 79 % 81 % 82 % 84 % 85 % 86 % 87 % 88 % 90 % 92 % 93 % 95 % 97 % 98 % 99 % 100 %


Best,
Nils

Victor Eijkhout <eijk...@tacc.utexas.edu> writes:

Victor Eijkhout

unread,
May 19, 2026, 8:25:40 AMMay 19
to deal.II User Group
But that's stats on Pardiso only. You don't do stats on native D2 operations?

Ok, moving on, I'm trying to replace SparseMatrix by Petsc::MPI::SparseMatrix but I'm getting lots of errors on undefined operations.
The two are not based on some virtual base class? Is there a conversion operation, and yes I realize that will not be efficient, just to get my code running in parallel?

V.

Wolfgang Bangerth

unread,
May 19, 2026, 7:10:42 PMMay 19
to dea...@googlegroups.com
On 5/19/26 05:25, Victor Eijkhout wrote:
>
> Ok, moving on, I'm trying to replace SparseMatrix by Petsc::MPI::SparseMatrix
> but I'm getting lots of errors on undefined operations.
> The two are not based on some virtual base class?

No, but in the spirit of generic programming they have compatible interfaces.
If you change the matrix type, you also have to change the vector type.

Best
W.

Reply all
Reply to author
Forward
0 new messages