Measuring CPU and wall time for an assembly routine


Simon

Oct 19, 2022, 7:51:55 AM
to deal.II User Group
Dear all,

I implemented two different versions to compute a stress for a given strain and want to compare the associated computation times in release mode.

version 1: stress = fun1(strain)    CPU time:  4.52 s    wall time:   4.53 s
version 2: stress = fun2(strain)    CPU time: 32.5  s    wall time: 167.5 s

fun1 and fun2 are each invoked for all quadrature points (1,286,144 in the above example) defined on the triangulation. My program is not parallelized.
In fun2, I call find_active_cell_around_point twice for two different points on two different (helper) triangulations and initialize two FEValues objects
with the points 'ref_point_vol' and 'ref_point_dev':
FEValues<1> fe_vol(dof_handler_vol.get_fe(),
                   Quadrature<1>(ref_point_vol),
                   update_gradients | update_values);
FEValues<1> fe_values_energy_dev(this->dof_handler_dev.get_fe(),
                                 Quadrature<1>(ref_point_dev),
                                 update_gradients | update_values);

I figured out that the initialization of the two FEValues objects accounts for the biggest portion of the above-mentioned times. In particular, if I comment the initialization out, I get
CPU time: 6.54 s     wall time: 6.55 s.

The triangulations associated with dof_handler_vol and dof_handler_dev are both 1d and store only 4 and 16 elements, respectively. Given that, I am wondering why the initialization takes so long (roughly 100 seconds wall time in total) and why it causes a gap between the CPU and wall time.
Unfortunately, I have to construct them anew whenever fun2 is called, because the point 'ref_point_vol' (see Quadrature<1>(ref_point_vol)) is different in each call to fun2.

Best
Simon



Bruno Turcksin

Oct 19, 2022, 9:08:35 AM
to deal.II User Group
Simon,

The best way to profile a code is to use a profiler. It can give a lot more information than simple timers can. You say that your code is not parallelized, but by default deal.II is multithreaded. Did you set DEAL_II_NUM_THREADS=1? That could explain why CPU and wall time are different. Finally, if I understand correctly, you are calling the constructor of FEValues about 2.5 million times. That means that one call to the FEValues constructor takes 100/2.5e6 seconds, i.e., about 40 microseconds. That doesn't seem too slow.

Best,

Bruno

Simon Wiesheier

Oct 19, 2022, 9:33:18 AM
to dea...@googlegroups.com
Thank you for your answer!

" Did you set DEAL_II_NUM_THREADS=1?"

How can I double-check that?
ccmake .
only shows me the variables CMAKE_BUILD_TYPE and deal.II_DIR.
But I do not know if this is the right place to look.

" That could explain why CPU and Wall time are different. Finally, if I understand correctly, you are calling the constructor of FEValues about 2.5 million times. That means that the call to one FEValues constructor is 100/2.5e6 seconds about 40 microseconds. That doesn't seem too slow. "

There was a typo in my post. It should be 160/2.5e6 seconds about 64 microsecends.

Best,
Simon


Bruno Turcksin

Oct 19, 2022, 10:17:51 AM
to dea...@googlegroups.com
Simon,

On Wed, Oct 19, 2022 at 09:33, Simon Wiesheier <simon.w...@gmail.com> wrote:
Thank you for your answer!

" Did you set DEAL_II_NUM_THREADS=1?"

How can I double-check that?
ccmake .
only shows me the variables CMAKE_BUILD_TYPE and deal.II_DIR.
But I do not know if this is the right place to look.
It's an environment variable. If you are using bash, you can do

export DEAL_II_NUM_THREADS=1
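
As an aside, the same limit can also be set from within the program; roughly like this (a sketch, to be executed before any deal.II objects spawn tasks):

#include <deal.II/base/multithread_info.h>

int main()
{
  // Restrict deal.II's task-based parallelism to a single thread;
  // this has the same effect as setting DEAL_II_NUM_THREADS=1.
  dealii::MultithreadInfo::set_thread_limit(1);

  // ... rest of the program ...
}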
 

" That could explain why CPU and Wall time are different. Finally, if I understand correctly, you are calling the constructor of FEValues about 2.5 million times. That means that the call to one FEValues constructor is 100/2.5e6 seconds about 40 microseconds. That doesn't seem too slow. "

There was a typo in my post. It should be 160/2.5e6 seconds about 64 microsecends.
My point is the constructor should not be called millions of times. You are not going to be able to get that function 100 times faster. It's best to find a way to call it less often.

Best,

Bruno

Simon Wiesheier

Oct 19, 2022, 10:45:15 AM
to dea...@googlegroups.com
" It's an environment variable. "

I did
$DEAL_II_NUM_THREADS
and the variable is not set.
But if it were set to one, why would this explain the gap between cpu and wall time?

" My point is the constructor should not be called millions of times. You are not going to be able to get that function 100 times faster. It's best to find a way to call it less often. "

What I want to do boils down to the following:
Given the reference coordinates of a point 'p', along with the cell on which 'p' lives,
give me the value and gradient of a finite element function evaluated at 'p'.

My idea was to create a quadrature object with 'p' being the only quadrature point, pass this
quadrature object to the FEValues object, and finally do the .reinit(cell) call (then, of course, get_function_values()...).
'p' is different for all (2.5 million) quadrature points, which is why I create the FEValues object so many times.
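
For reference, the pattern looks roughly like this (a sketch with illustrative names; dim = 1 in my case):

// 'p' is given in the reference coordinates of 'cell'.
const Quadrature<1> quadrature(p); // single-point quadrature
FEValues<1> fe_values(dof_handler.get_fe(),
                      quadrature,
                      update_values | update_gradients);
fe_values.reinit(cell);

std::vector<double>      values(1);
std::vector<Tensor<1,1>> gradients(1);
fe_values.get_function_values(solution, values);       // value of the FE field at 'p'
fe_values.get_function_gradients(solution, gradients); // gradient at 'p'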

Do you have a different suggestion to solve my problem, i.e., to evaluate the finite element field and its derivatives at 'p'?

Best,
Simon



Martin Kronbichler

Oct 19, 2022, 12:21:24 PM
to dea...@googlegroups.com
Dear Simon,

You seem to be looking for FEPointEvaluation. That class is shown in step-19 and provides, for simple FiniteElement types, a much faster way to evaluate solutions at arbitrary points within a cell. Do you want to give it a try? The issue you are facing is that the FEValues class you are using goes through a very abstract entry point that does precomputations which only pay off if the unit points are used many times. And even in the case of the same unit points it is not really fast; it is a general-purpose baseline that I would not recommend for high-performance purposes.

As a final note, I would mention that FEPointEvaluation falls back to FEValues for complicated FiniteElement types, so it might be that you do not get speedups in those cases. But we could work on it if you need it; today we know much better what to do than a few years ago.

Best,
Martin

Wolfgang Bangerth

Oct 19, 2022, 4:34:44 PM
to dea...@googlegroups.com
On 10/19/22 08:45, Simon Wiesheier wrote:
>
> What I want to do boils down to the following:
> Given the reference coordinates of a point 'p', along with the cell on
> which 'p' lives,
> give me the value and gradient of a finite element function evaluated at
> 'p'.
>
> My idea was to create a quadrature object with 'p' being the only
> quadrature point and pass this
> quadrature object to the FEValues object and finally do the
> .reinit(cell) call (then, of course, get_function_values()...)
> 'p' is different for all (2.5 million) quadrature points, which is why I
> create the FEValues object so many times.

It's worth pointing out that this is exactly what VectorTools::point_values() does.
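
For a scalar field and a single point, the closely related point_value() and point_gradient() functions can be used roughly like this (a sketch; dim is the space dimension, 'p' is given in real-space coordinates, and the cell search happens internally):

#include <deal.II/numerics/vector_tools.h>

const double         value    = VectorTools::point_value(dof_handler, solution, p);
const Tensor<1, dim> gradient = VectorTools::point_gradient(dof_handler, solution, p);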

(As others have already mentioned, if you want to do that many many
times over, this is too expensive and you should be using
FEPointEvaluation instead.)

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Simon Wiesheier

Oct 20, 2022, 5:55:56 AM
to dea...@googlegroups.com
Dear Martin and Wolfgang,

" You seem to be looking for FEPointEvaluation. That class is shown in step-19 and provides, for simple FiniteElement types, a much faster way to evaluate solutions at arbitrary points within a cell. Do you want to give it a try? "

I implemented the FEPointEvaluation approach like this:

FEPointEvaluation<1,1> fe_eval(mapping,
                               FE_Q<1>(1),
                               update_gradients | update_values);
fe_eval.reinit(cell, make_array_view(std::vector<Point<1>>{ref_point_energy_vol}));
Vector<double> p_dofs(2);
cell->get_dof_values(solution_global, p_dofs);
fe_eval.evaluate(make_array_view(p_dofs),
                 EvaluationFlags::values | EvaluationFlags::gradients);
double val = fe_eval.get_value(0);
Tensor<1,1> grad = fe_eval.get_gradient(0);

I am using FE_Q elements of degree one and a MappingQ object also of degree one.

Frankly, I do not really understand the measured computation times.
My program has several loadsteps with nested Newton iterations:
Loadstep 1:
Assembly 1: cpu time 12.8 sec  wall time 268.7 sec
Assembly 2: cpu time 17.7 sec  wall time 275.2 sec
Assembly 3: cpu time 22.3 sec  wall time 272.6 sec
Assembly 4: cpu time 23.8 sec  wall time 271.3 sec
Loadstep 2:
Assembly 1: cpu time 14.3 sec  wall time 260.0 sec
Assembly 2: cpu time 16.9 sec  wall time 262.1 sec
Assembly 3: cpu time 18.5 sec  wall time 270.6 sec
Assembly 4: cpu time 17.1 sec  wall time 262.2 sec
...

Using FEValues instead of FEPointEvaluation, the results are:
Loadstep 1:
Assembly 1: cpu time 23.9 sec  wall time 171.0 sec
Assembly 2: cpu time 32.5 sec  wall time 168.9 sec
Assembly 3: cpu time 33.2 sec  wall time 168.0 sec
Assembly 4: cpu time 32.7 sec  wall time 166.9 sec
Loadstep 2:
Assembly 1: cpu time 24.9 sec  wall time 168.0 sec
Assembly 2: cpu time 34.7 sec  wall time 167.3 sec
Assembly 3: cpu time 33.9 sec  wall time 167.8 sec
Assembly 4: cpu time 34.3 sec  wall time 167.7 sec
...

Clearly, the fluctuations using FEValues are smaller than in the case of FEPointEvaluation.
Anyway, using FEPointEvaluation the CPU time is smaller but the wall time is substantially bigger.
If I am not mistaken, the values CPU time 34.3 sec and wall time 167.7 sec mean that
the CPU needs 34.3 sec to execute my assembly routine and has to wait for the
remaining 167.7 - 34.3 seconds.
This huge gap between CPU and wall time has to be related to what I do with FEValues or FEPointEvaluation,
as CPU and wall time are nearly balanced if I use neither of them.
What might be the problem?

Best
Simon






Simon Wiesheier

Oct 20, 2022, 10:47:17 AM
to dea...@googlegroups.com
Update:

I profiled my program with valgrind --tool=callgrind and could figure out that
FEPointEvaluation creates an FEValues object along with a quadrature object under the hood.
Closer inspection revealed that all constructors, destructors, ... associated with FEPointEvaluation
need roughly 5000 more instructions (per call!).
That said, FEValues is indeed the faster approach, at least for FE_Q elements.

export DEAL_II_NUM_THREADS=1
eliminated the gap between CPU and wall time.
Using FEValues directly, I get a CPU time of 19.8 seconds,
and in the case of FEPointEvaluation a CPU time of 21.9 seconds;
wall times are in the same ballpark.
Out of curiosity, why does multi-threading produce such high wall times (200 seconds) in my case?

These times are far too big given that the solution of the linear system takes only about 13 seconds.
But based on what all of you have said, there is probably no other way to implement my problem.

Best
Simon

Peter Munch

Oct 20, 2022, 10:53:47 AM
to deal.II User Group
> FEPointEvaluation creates an FEValues object along with a quadrature object under the hood.
> Closer inspection revealed that all constructors, destructors, ... associated with FEPointEvaluation
> need roughly 5000 more instructions (per call!).
> That said, FEValues is indeed the faster approach, at least for FE_Q elements.

What type of Mapping are you using? If you take a look at https://github.com/dealii/dealii/blob/ad13824e599601ee170cb2fd1c7c3099d3d5b0f7/source/matrix_free/fe_point_evaluation.cc#L40-L95 you can see when the fast path of FEPointEvaluation is taken. Indeed, the slow path is FEValues. One question: are you running in release or debug mode?

Hope this brings us closer to the issue,
Peter

Simon Wiesheier

Oct 20, 2022, 11:00:53 AM
to dea...@googlegroups.com
" What type of Mapping are you using? If you take a look at https://github.com/dealii/dealii/blob/ad13824e599601ee170cb2fd1c7c3099d3d5b0f7/source/matrix_free/fe_point_evaluation.cc#L40-L95 you can see when the fast path of FEPointEvaluation is taken. Indeed, the slow path is (FEValues). One question: are you running in release or debug mode? "

I use FE_Q<1>(1) with a MappingQ<1>(1) and
FE_Q<2>(1) with a MappingQ<2>(1).

I am running in release mode.

Best,
Simon

Martin Kronbichler

Oct 20, 2022, 11:02:29 AM
to dea...@googlegroups.com

Dear Simon,

When you use FEPointEvaluation, you should construct it only once and re-use the same object for different points. Furthermore, you should also avoid creating "p_dofs" and the "std::vector" anew for every point; I was not clear about this in my original message. Anyway, the problem is the FEValues object that gets used. I am confused by your other message saying that you use FE_Q together with MappingQ - that combination should be supported, and if it is not, we should take a look at a (reduced) code from you.
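
Schematically, something like this (a sketch adapted from your snippet; names are illustrative):

// Set up once, e.g., as class members:
MappingQ<1>            mapping(1);
FE_Q<1>                fe(1);
FEPointEvaluation<1,1> fe_eval(mapping, fe, update_values | update_gradients);
Vector<double>         p_dofs(fe.dofs_per_cell);
std::vector<Point<1>>  unit_point(1);

// Per point / per call to fun2, only this remains:
unit_point[0] = ref_point_energy_vol;
fe_eval.reinit(cell, make_array_view(unit_point));
cell->get_dof_values(solution_global, p_dofs);
fe_eval.evaluate(make_array_view(p_dofs),
                 EvaluationFlags::values | EvaluationFlags::gradients);
const double      val  = fe_eval.get_value(0);
const Tensor<1,1> grad = fe_eval.get_gradient(0);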

Regarding the high timings: There is some parallelization by tasks that gets done inside the constructor of FEValues. This is well-intentioned for the case where we are in 3D and have a reasonable amount of work to do. However, you are in 1D (if I read your code correctly), and there it has adverse effects. The reason is that the constructor of FEValues is very likely completely dominated by memory allocation. When we have 1 thread, everything is fine, but when we have multiple threads working they will start to interfere with each other when they request memory through malloc(), which has to be coordinated by the operating system (and thus gets slower). In fact, the big gap between compute time and wall time shows that there is a lot of time wasted as "system time" that does not do actual work on the cores.

I guess the library could have a better measure of when to spawn tasks in FEValues and similar contexts, but it is a lot of work to get this right. (This is why I keep avoiding it in critical functions.)

Best,
Martin

Simon Wiesheier

Oct 20, 2022, 12:11:09 PM
to dea...@googlegroups.com
" When you use FEPointEvaluation, you should construct it only once and re-use the same object for different points. Furthermore, you should also avoid to create "p_dofs" and the "std::vector" near the  I was not clear with my original message. Anyway, the problem is the FEValues object that gets used. I am confused by your other message that you use FE_Q together with MappingQ - that combination should be supported and if it is not, we should take a look at a (reduced) code from you. "

I added a snippet of my code (see appendix) in which I describe the logic of what I am doing with FEPointEvaluation.
In fact, constructing FEPointEvaluation (and the vector p_dofs) only once and re-using them brings only minor changes, as the overall cost is dominated by the call to reinit().
But, of course, it helps at least.

I am surprised too that the fast path is not used. Maybe you can identify a problem in my code.
Thank you!

Best,
Simon

mwe.cpp

Simon Wiesheier

Oct 21, 2022, 4:59:35 AM
to dea...@googlegroups.com
I revised the appendix from my last message a little bit and now attach a minimal working example (just 140 lines) along with a CMakeLists.txt.
After checking the profiling results from valgrind, the combination of MappingQ with FE_Q does *not* take the fast path.

For info: I use deal.II version 9.3.2.

Best,
Simon
CMakeLists.txt
minimal_working_example.cc

Peter Munch

Oct 22, 2022, 6:57:21 AM
to deal.II User Group
You are right. Release 9.3 uses the slow path for MappingQ. The reason is that here https://github.com/dealii/dealii/blob/ccfaddc2bab172d9d139dabc044d028f65bb480a/include/deal.II/matrix_free/fe_point_evaluation.h#L708-L711 we check for MappingQGeneric. At that time, MappingQ and MappingQGeneric were different classes. In the meantime, we have merged the classes, so that in release 9.4 and on master this is not an issue anymore. Is there a chance that you could update deal.II? Alternatively, you could use MappingQGeneric instead of MappingQ.
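
In code, the workaround is roughly (a sketch):

// deal.II 9.3: construct a MappingQGeneric so that FEPointEvaluation
// takes its fast path; MappingQ only passes this check from 9.4 on.
MappingQGeneric<1> mapping(1); // instead of MappingQ<1>(1)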

Hope this resolves this issue!

Peter

Simon Wiesheier

Oct 22, 2022, 10:46:16 AM
to dea...@googlegroups.com
Yes, the issue is resolved and the computation time decreased significantly.

Thank you all!

-Simon

Peter Munch

Oct 22, 2022, 11:08:57 AM
to deal.II User Group
Happy about that! May I ask you to post the results here? I am curious, since I never actually compared timings (and only blindly trusted Martin).

Thanks,
Peter

Simon Wiesheier

Oct 23, 2022, 4:33:38 AM
to dea...@googlegroups.com
Certainly.
When using the slow path, i.e., MappingQ in version 9.3.2, the CPU time is about 6.3 seconds.
In the case of the fast path, i.e., MappingQGeneric in version 9.3.2, the CPU time is about 18.7 seconds.
Roughly, the reinit() function associated with the FEPointEvaluation objects is called about 1.2 million times.

Best,
Simon

Peter Munch

Oct 23, 2022, 4:37:18 AM
to deal.II User Group
Now, I am lost. The fast one is 3 times slower!?

Peter

Simon Wiesheier

Oct 23, 2022, 4:52:10 AM
to dea...@googlegroups.com
Sorry, I was wrong. Of course, it is the other way round.
The fast one is 3 times faster.

-Simon
