profiling and parallel performance of deal.II user codes


richard....@gmail.com

unread,
Aug 3, 2021, 9:58:07 AM8/3/21
to deal.II User Group
Dear all,
I have spent quite some time on our in-house CFD and FSI solvers, which are matrix-based and use deal.II, MPI, and the AMG packages of Trilinos and PETSc, all of which are wonderfully accessible even for engineers like me. My computations have so far focused on problems with a relatively small DoF count (say, at most 10 million), and the number of MPI ranks was eye-balled, staying below 20. At this stage, I would like to know:

a) Which (free) profiling tools can you recommend? I watched Wolfgang's video lecture on that topic, but am looking for more opinions. I want to see which parts of the code take time beyond what the (already detailed) TimerOutput shows.

b) If I simply use "mpirun -n 4 mycode" on a machine with 8 physical cores, why do both PETSc and Trilinos use all 8 cores during the AMG setup and solve? I observed this with the htop command, even when running an off-the-shelf "step-40.release" as included in the library. Does anyone else see this? It looks like this during the AMG setup and solve for "mpirun -n 8 step-40":
screenshot_trilinos_step40_mpirun_n_8.png
It might be linked to the installation on the server, where I used candi. On my local machine, however, this does not happen.

Any hints are very much welcome, thanks for reading and any tips!

Best regards & greetings from Graz
Richard

Wolfgang Bangerth

unread,
Aug 3, 2021, 11:56:47 AM8/3/21
to dea...@googlegroups.com

Richard,

> a) which (free) profiling tools can you recommend? I watched the video lecture
> of Wolfgang about that topic, but was looking for more opinions! I want to see
> which parts of the code take time apart from the (already detailed) TimerOutput.

Use valgrind's callgrind tool. The introduction to step-22 shows an example of
how this looks in practice.

valgrind is single-threaded, but you can run it for every process via
mpirun -n 8 valgrind --tool=callgrind ./step-40
for example, and it will simply profile all 8 instances.
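To spell out the whole workflow as a sketch (the output file names are illustrative; callgrind writes one output file per process, with the process id in the name):

```shell
# Profile every MPI rank; callgrind writes one
# callgrind.out.<pid> file per process:
mpirun -n 8 valgrind --tool=callgrind ./step-40

# Inspect a single rank's profile on the terminal ...
callgrind_annotate callgrind.out.<pid>

# ... or graphically with kcachegrind:
kcachegrind callgrind.out.<pid>
```

Note that programs run an order of magnitude (or more) slower under callgrind, so a smaller test case is usually advisable.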

There is also Intel's VTune tool, which is useful to profile communication
issues, but I have not used it in many years and cannot say much about it.


> b) If I use simply "mpirun -n 4 mycode" on a machine with 8 physical cores,
> why do both PETSc and Trilinos use 8 cores during the AMG setup and solve? I
> observed that using the htop command, even when using an off-the-shelf
> "step-40.release" as included in the library. Does anyone else see that? It
> looks something like this during the AMG setup and solve for "mpirun -n 8
> step-40":
> screenshot_trilinos_step40_mpirun_n_8.png
> It might be linked to the installation on the server, where I used candi. On
> my local machine, however, this does not happen.

It may be that the AMG is using OpenMP under the hood. You will want to set
the number of threads available to OpenMP to one. There is an environment
variable for that (OMP_NUM_THREADS) that you need to set either in your .bashrc
or, if you just want to do it once, on the command line before running the
program.
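For example (a sketch; substitute your own executable name):

```shell
# Restrict OpenMP to a single thread per MPI rank. Put the export
# in ~/.bashrc to make it permanent, or prefix a single run, e.g.:
#   OMP_NUM_THREADS=1 mpirun -n 8 ./step-40
export OMP_NUM_THREADS=1
echo "$OMP_NUM_THREADS"
```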

Best
W.


--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Bruno Turcksin

unread,
Aug 3, 2021, 6:07:41 PM8/3/21
to deal.II User Group
Richard,

On Tuesday, August 3, 2021 at 9:58:07 AM UTC-4 richard....@gmail.com wrote:
a) which (free) profiling tools can you recommend? I watched the video lecture of Wolfgang about that topic, but was looking for more opinions! I want to see which parts of the code take time apart from the (already detailed) TimerOutput.
I use two tools: Caliper (https://software.llnl.gov/Caliper/) and HPCToolkit (http://hpctoolkit.org/). These tools have different goals.

Caliper is basically a much more powerful version of TimerOutput. You add Caliper annotations to your code and get access to different profiling measures and connectors to other profiling tools. The advantage of Caliper is that you can use it in your code and then monitor the performance over time. It is very easy to use, and the Caliper annotations in your code can also be used by VTune and NVIDIA Nsight.

HPCToolkit, on the other hand, is a more traditional profiling tool. You use it to find the bottlenecks in your code. I really like that it is non-invasive: unlike Caliper, you don't need to change anything in your code. I've used it to profile code on small clusters, but it also works on some of the largest supercomputers at the DOE.
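To illustrate what such an annotation looks like, here is a minimal sketch (assuming Caliper is installed and linked; the function and region names are made up):

```cpp
#include <caliper/cali.h>

void assemble_system()
{
  // Mark the whole function as a Caliper region:
  CALI_CXX_MARK_FUNCTION;

  // Or mark an individual code section by hand:
  CALI_MARK_BEGIN("assembly_loop");
  // ... assembly loop ...
  CALI_MARK_END("assembly_loop");
}
```

With recent Caliper versions, running the program with a built-in configuration such as `CALI_CONFIG=runtime-report ./mycode` then prints a timing report for the annotated regions.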

Best,

Bruno
 

blais...@gmail.com

unread,
Aug 3, 2021, 10:44:33 PM8/3/21
to deal.II User Group
Dear Richard,
I have used valgrind's callgrind tool extensively in the past and it works quite well.
It seems Intel VTune is free now. I have used it a lot in the last two months to optimize our code and have found it to work very, very well. We managed to improve some functions tremendously because of it.
It is also significantly faster than callgrind, enabling you to profile on larger cases.
Best
Bruno

simon...@gmail.com

unread,
Aug 4, 2021, 1:19:47 AM8/4/21
to deal.II User Group
Hi,

I've used the Scalasca tool suite for profiling:

https://www.scalasca.org/scalasca/about/about.html

which uses Score-P for measurement and the Cube GUI for visualizing the results:

https://www.vi-hps.org/projects/score-p
https://www.scalasca.org/scalasca/software/cube-4.x/download.html

I think it is really good. However, it is somewhat difficult to get started with. If you are interested, this user guide is a good place to start:

https://www.scalasca.org/scalasca/software/scalasca-2.x/documentation.html
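To give a rough idea of the workflow (a sketch; compiler flags, program name, and the experiment directory name are illustrative): you first recompile with the Score-P compiler wrapper, then run the instrumented binary under Scalasca:

```shell
# Instrument at compile time via the Score-P compiler wrapper:
scorep mpicxx -O2 -o mycode mycode.cc

# Run and collect a measurement (writes a scorep_* experiment directory):
scalasca -analyze mpirun -n 8 ./mycode

# Examine the result, e.g. in the Cube GUI:
scalasca -examine scorep_mycode_8_sum
```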

Best,
Simon

Jean-Paul Pelteret

unread,
Aug 4, 2021, 2:13:07 AM8/4/21
to dea...@googlegroups.com
To add to this great list of tools (which we should document on our Wiki), there’s also the LIKWID performance monitoring and benchmarking suite, which can provide some very low-level metrics and tools to help with benchmarking.
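For instance, a hardware-counter measurement with LIKWID might look like this (a sketch; the set of available performance groups such as FLOPS_DP depends on your CPU):

```shell
# Pin the program to cores 0-7 and count double-precision FLOPs:
likwid-perfctr -C 0-7 -g FLOPS_DP ./mycode
```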

Best,
Jean-Paul

--
The deal.II project is located at http://www.dealii.org/
For mailing list/forum options, see https://groups.google.com/d/forum/dealii?hl=en
---
You received this message because you are subscribed to the Google Groups "deal.II User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dealii+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dealii/e8ebed6a-4bd9-418c-9fb9-751f9b2babc7n%40googlegroups.com.

heena patel

unread,
Aug 4, 2021, 3:33:40 AM8/4/21
to dea...@googlegroups.com
Dear Richard,
I recently attended a summer school where I learned about Paraver, a visual performance analysis tool from the Barcelona Supercomputing Center. It comes with a pre-processor and a post-processor as well. Check the link below.




Regards,
Heena



richard....@gmail.com

unread,
Aug 4, 2021, 4:17:30 AM8/4/21
to deal.II User Group
Hi all,

thanks for the many tips and suggestions, I really appreciate you spending your time and effort helping me out!
I set up valgrind and kcachegrind, which I found exceptionally easy, and can get started now - perfect!
In case anyone reads this in the future: I had to use "mpirun -n 4 valgrind --tool=callgrind ./my_prog", i.e., with an equals sign in --tool=callgrind.

And regarding the other issue, with more cores being used during the AMG setup and solve with both Trilinos and PETSc, Wolfgang was right:
I set the environment variable as suggested, with "export OMP_NUM_THREADS=1" before program execution.
Then I see the expected behavior and no additional cores are recruited - buonissimo!

Thanks again everyone for the help!

Best regards
Richard