question about hybrid MPI and TBB thread parallelism


timo Hyvärinen

Sep 15, 2023, 5:17:39 AM
to dea...@googlegroups.com
Dear deal.II community and developers,

I have been using the deal.II framework (9.3.x) for a while on an HPC machine. My project involves solving a vector-valued nonlinear PDE with nine components.
Currently, I've implemented a damped Newton iteration with a GMRES+AMG preconditioner, using MPI on a distributed-memory architecture.

A simple timing tells me that assembly of the system matrix takes 99% of the total running time in every Newton iteration. I suspect there is
a lot of idle CPU time during assembly because I don't take advantage of thread parallelism yet.

So here is my question: which tutorial steps demonstrate how to implement MPI+thread hybrid parallelism? I've found that step-48 talks about this, but
I wonder whether there are other tutorial programs to look at. I'd also appreciate any suggestions about MPI+thread parallelism within the
deal.II framework.

Sincerely,
Timo Hyvarinen 

Wolfgang Bangerth

Sep 15, 2023, 8:51:35 AM
to dea...@googlegroups.com
On 9/15/23 03:17, timo Hyvärinen wrote:
>
> So here is my question: which tutorial steps demonstrate how to implement
> MPI+thread hybrid parallelism? I've found that step-48 talks about this, but
> I wonder whether there are other tutorial programs to look at. I'd also
> appreciate any suggestions about MPI+thread parallelism within the
> deal.II framework.

Timo:
If your code already works, then the usual suggestion would be to use one MPI
process per core, and to not use thread-parallelism at all. In that case, you
should be using all cores equally.

There is an often-repeated observation that one cannot optimize a code
without first profiling it. From your question, it sounds like you are not
entirely sure why assembly is taking so long. I think it is likely that whatever
solution you try will not be successful unless you first find out why it
really is taking this long. I would run the code on one MPI process first and
see whether assembly takes 99% there as well. If that's the case, use
valgrind's callgrind to figure out why. If it's not taking 99% on one MPI
process, then you've got another riddle to solve.
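A serial profiling run of this sort might look like the following sketch, where `./my_solver` stands in for your application's executable (a placeholder, not a name from this thread):

```shell
# Run the release-with-debug-info binary under callgrind on a single process;
# "./my_solver" is a placeholder for your application's executable.
valgrind --tool=callgrind --callgrind-out-file=callgrind.out ./my_solver

# Annotate the profile: functions are listed by cost, highest first,
# so the assembly hot spots should appear near the top.
callgrind_annotate callgrind.out | head -n 40
```

If a GUI is available, `kcachegrind callgrind.out` gives the same data with call-graph navigation.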

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/


timo Hyvärinen

Sep 15, 2023, 9:56:19 AM
to dea...@googlegroups.com, bang...@colostate.edu
Hi, Wolfgang,
Thank you for your reply and suggestion.

You're right, I haven't conducted profiling yet. My usual test setup is 1 node (128 cores) with 128 tasks on it; the 99% assembly time came from this type of run.
It's a bit surprising to be advised not to use threads, but I believe that advice reflects significant experience. I will do the profiling first and report back in this thread.

Sincerely,
Timo 


--
The deal.II project is located at http://www.dealii.org/
For mailing list/forum options, see https://groups.google.com/d/forum/dealii?hl=en
---
You received this message because you are subscribed to the Google Groups "deal.II User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dealii+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dealii/140169bb-1eb5-dcfd-44ef-14081aa713c5%40colostate.edu.

Marc Fehling

Sep 15, 2023, 1:22:22 PM
to deal.II User Group
Hello Tim,

> A simple timing tells me that assembly of the system matrix takes 99% of the total running time in every Newton iteration.

Just to make sure: did you compile the deal.II library and your code in Optimized mode/Release mode?

Best,
Marc

timo Hyvärinen

Sep 15, 2023, 1:53:59 PM
to dea...@googlegroups.com, mafe...@gmail.com
Hi, Marc,

Thank you for the reply.

I compiled the library in debug mode and didn't try the optimized version.
I didn't think this could be a significant issue, but from your question I infer that the optimized library could improve performance a lot.

Sincerely,
Timo



Bruno Turcksin

Sep 15, 2023, 4:04:16 PM
to deal.II User Group
Timo,

You will get vastly different results in debug and release mode, for two reasons. First, the compiler generates much faster code in release mode than in debug mode. Second, there are a lot of checks inside deal.II that are only enabled in debug mode. This is great when you develop your code because it helps you catch bugs early, but it makes your code much slower. In general, you want to develop your code in debug mode, but your production runs should be done in release mode.

Best,

Bruno

timo Hyvärinen

Sep 16, 2023, 3:47:22 AM
to dea...@googlegroups.com, bruno.t...@gmail.com
Hi Bruno,

Thank you for your explanations.

It seems I should compile an optimized library and then do the profiling.

Sincerely,
Timo

Bruno Turcksin

Sep 17, 2023, 8:47:12 PM
to timo Hyvärinen, dea...@googlegroups.com
Timo,

Yes, you want to profile the optimized library, but you also want the debug info; without it, the information given by the profiler usually makes little sense. So compile in release mode, but add the following option when configuring deal.II: -DCMAKE_CXX_FLAGS="-g"
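Assuming placeholder source and install paths (not taken from this thread), the configure step might look like:

```shell
# Configure deal.II in release mode but keep debug symbols for the profiler;
# both paths below are placeholders for your own locations.
cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CXX_FLAGS="-g" \
      -DCMAKE_INSTALL_PREFIX=/path/to/install \
      /path/to/dealii-source
make -j8 install
```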

Best,

Bruno

timo Hyvärinen

Sep 30, 2023, 4:02:05 AM
to Bruno Turcksin, mafe...@gmail.com, bang...@colostate.edu, dea...@googlegroups.com
Hi all, I'm back to this thread and the discussion.

I recompiled 9.3.3 as Release with the debug flag "-g". For a 3D system with linear finite elements (degree = 1), with about 9.3*10^4 DoFs, a batch job with --ntasks-per-node=128 --cpus-per-task=1 is about 10+ times faster.

When I use degree = 2 finite elements (uniform grid), the number of DoFs increases to 6.5*10^5, and a batch run with the same task/CPU setup gains about a 5x speedup (as expected). However, the program crashes after two Newton iterations with the error message:
"
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=2795730.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: cXXXX: task 40: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=2795730.0
slurmstepd: error: *** STEP 2795730.0 ON cXXXX CANCELLED AT 2023-09-XXTXX:XX:XX ***
slurmstepd: error:  mpi/pmix_v3: _errhandler: cXXXX [0]: pmixp_client_v2.c:212: Error handler invoked: status = -25, source = [slurm.pmix.2795730.0:40]
"
where cXXXX is the node index.

My first intuition was a memory leak, so I tried to run Valgrind, and sadly noticed that the Valgrind on the cluster was compiled with gcc 8.5, while deal.II was built with gcc 11.2 (gcc 8.5 has since been removed).

So my questions are: (i) has this issue ever happened for other deal.II applications, and how can it be solved other than by increasing the number of nodes or the memory request; (ii) what profiling/debugging tools do today's deal.II experts use to address memory issues? Should I build Valgrind myself? Does Valgrind only support MPI 2? My OpenMPI is v3.

Tim,
Sincerely

Tim hyvärinen

Sep 30, 2023, 5:58:58 PM
to deal.II User Group

Hi, dear community and developers,

Here is an update from my side about the questions and issue I raised earlier:

About profiling/debugging tools, I found this thread in the mailing list: https://groups.google.com/g/dealii/c/7_JJvipz0wY/m/aFU4pTuvAQAJ?hl=en.

About the out-of-memory error, my current workaround is undersubscribing the node by giving each task 2 CPUs, so each task has 4 GB of memory available. This certainly cuts performance in half, but it saves the program from crashing.
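The undersubscription described above can be sketched as an srun invocation (the binary path is a placeholder; memory-per-core follows from the cluster's own cgroup limits, an assumption on my part):

```shell
# Halve the task count on a 128-core node: each task owns 2 cores and
# therefore two cores' share of the node memory (here ~4 GB per task).
srun --ntasks-per-node=64 --cpus-per-task=2 /my/project/path/binary
```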

I'd appreciate any suggestions or comments.

Tim,
Sincerely

Wolfgang Bangerth

Oct 1, 2023, 6:40:40 PM
to timo Hyvärinen, Bruno Turcksin, mafe...@gmail.com, dea...@googlegroups.com
On 9/30/23 02:01, timo Hyvärinen wrote:
>
> So my questions are: (i) has this issue ever happened for other deal.II
> applications, and how can it be solved other than by increasing the number
> of nodes or the memory request; (ii) what profiling/debugging tools do
> today's deal.II experts use to address memory issues? Should I build
> Valgrind myself? Does Valgrind only support MPI 2? My OpenMPI is v3.

Valgrind doesn't care.

6.5*10^5 unknowns with a quadratic element in 3d can probably be expected to
take in the range of 2-5 GB. That should fit into most machines. But at the
same time, this is a small enough problem that you can run it under valgrind's
memory profilers on any workstation or laptop you have access to. You could
also talk to the system administrators of the cluster you work on to see
whether they are willing to give you a more up-to-date version of valgrind.
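One way to do that memory profiling on a workstation is valgrind's heap profiler, massif; a sketch with a placeholder binary name (`./my_solver` is not from this thread):

```shell
# Heap-profile a small serial run with valgrind's massif tool;
# "./my_solver" is a placeholder for the application's executable.
valgrind --tool=massif --massif-out-file=massif.out ./my_solver

# Print the recorded heap snapshots; the peak snapshot shows which
# allocation sites account for most of the memory.
ms_print massif.out
```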

kendrick zoom

Oct 1, 2023, 7:10:31 PM
to dea...@googlegroups.com
Hello Wolfgang,
How are you doing today?

Well, your question is not quite clear.
So what exactly do you want to know?


timo Hyvärinen

Oct 2, 2023, 2:19:55 AM
to Wolfgang Bangerth, Bruno Turcksin, mafe...@gmail.com, dea...@googlegroups.com
Hi, Wolfgang,

Thank you for your reply.

I did manage to run valgrind memcheck by launching the job (--ntasks-per-node=64, --cpus-per-task=2) as
"
srun valgrind --tool=memcheck --leak-check=full --track-origins=yes --suppressions=openmpi-valgrind.supp /my/project/path/binary > log.log
"
and this is typical memcheck output for one task:
"
==1024886== HEAP SUMMARY:
==1024886==     in use at exit: 10,109,942 bytes in 20,666 blocks
==1024886==   total heap usage: 8,070,458 allocs, 8,049,792 frees, 10,241,574,631 bytes allocated
==1024886==
==1024855==
==1024906== HEAP SUMMARY:
==1024906==     in use at exit: 10,129,660 bytes in 20,721 blocks
==1024906==   total heap usage: 6,674,001 allocs, 6,653,280 frees, 50,454,219,932 bytes allocated
==1024906==
==1024910== HEAP SUMMARY:
==1024910==     in use at exit: 10,110,344 bytes in 20,671 blocks
==1024910==   total heap usage: 7,738,278 allocs, 7,717,607 frees, 9,563,207,155 bytes allocated
"
I'm not sure whether the OpenMPI valgrind suppression file ever works, but there are about 10 MB of leaks per task.

Tim,
Sincerely

timo Hyvärinen

Oct 2, 2023, 2:47:07 AM
to dea...@googlegroups.com, ken42...@gmail.com, Wolfgang Bangerth
Hi, Kendrick and Wolfgang,

Thank you for your reply.

I have two questions at hand:
(1) How can I make the program run faster through Valgrind profiling? I know I should use cachegrind and callgrind, but I don't know what to pay attention to in the cachegrind/callgrind reports, or which properties have a significant impact on speed;
(2) GPU acceleration (this may be off-topic for this thread, but I really want to ask). I know deal.II has CUDA wrappers and leverages Kokkos, but I don't know how to use them to speed up my matrix-based Newton iteration code. A straightforward idea would be to use the GPU for system-matrix assembly, but I didn't see this in the only CUDA tutorial, i.e., step-64. So I wonder what the common way is in deal.II to use GPUs in matrix-based code.

Tim,
Sincerely

Bruno Turcksin

Oct 2, 2023, 9:20:25 AM
to dea...@googlegroups.com, ken42...@gmail.com, Wolfgang Bangerth
Tim,

Valgrind is great, but it is always slow. Instead, you can use AddressSanitizer from clang or gcc. It's much faster than Valgrind, though I find its output harder to read.
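A minimal sketch of such a build, with placeholder file names (the sanitizer flag is needed at both compile and link time):

```shell
# Build with AddressSanitizer enabled; "my_solver.cc" / "my_solver" are
# placeholder names, and -fno-omit-frame-pointer improves the stack traces.
g++ -O1 -g -fsanitize=address -fno-omit-frame-pointer -o my_solver my_solver.cc

# Leak checking is on by default with ASan; just run the program and
# read the report printed at exit.
./my_solver
```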

Best,

Bruno


Wolfgang Bangerth

Oct 2, 2023, 12:01:24 PM
to timo Hyvärinen, dea...@googlegroups.com, ken42...@gmail.com
On 10/2/23 00:46, timo Hyvärinen wrote:
> (1) How can I make the program run faster through Valgrind profiling? I know
> I should use cachegrind and callgrind, but I don't know what to pay
> attention to in the cachegrind/callgrind reports, or which properties have
> a significant impact on speed.

Bruno already answered this, but I wanted to point you at the introduction of
step-22 for an example.

As for your other email: 10 MB of leaks is small potatoes compared to the 10
GB you allocate. You will have to figure out where all that memory is
allocated. My recommendation would be to debug these sorts of issues on a
local machine, rather than a cluster. You could set up a smaller test case
that runs faster.

timo Hyvärinen

unread,
Oct 2, 2023, 2:22:11 PM10/2/23
to dea...@googlegroups.com, Bruno Turcksin
Hi, Bruno,

Thank you for your reply!

Valgrind is indeed slow, so I reduced the grid size to 1/10 of the full size. At least I found it easy to use on the cluster.

I don't know AddressSanitizer; I certainly need to duckduckgo it first.

Tim,
best

timo Hyvärinen

Oct 2, 2023, 2:28:14 PM
to Wolfgang Bangerth, dea...@googlegroups.com
Hi, Wolfgang,

thank you for your reply. 

>10 MB of leaks is small potatoes compared to the 10
>GB you allocate. You will have to figure out where all that memory is
>allocated. 

I must say this is certainly the right thing to do.

Tim,
Sincerely