Measuring CPU and wall time for an assembly routine


Simon

Oct 19, 2022, 7:51:55 AM
to deal.II User Group
Dear all,

I implemented two different versions to compute a stress for a given strain and want to compare the associated computation times in release mode.

version 1: stress = fun1(strain)    CPU time:  4.52 s    wall time:   4.53 s
version 2: stress = fun2(strain)    CPU time: 32.5  s    wall time: 167.5 s

fun1 and fun2 are each invoked for all quadrature points (1,286,144 in the above example) defined on the triangulation. My program is not parallelized.
In fun2, I call find_active_cell_around_point twice for two different points on two different (helper) triangulations and initialize two FEValues objects
with the points 'ref_point_vol' and 'ref_point_dev':
FEValues<1> fe_vol(dof_handler_vol.get_fe(),
                   Quadrature<1>(ref_point_vol),
                   update_gradients | update_values);
FEValues<1> fe_values_energy_dev(this->dof_handler_dev.get_fe(),
                                 Quadrature<1>(ref_point_dev),
                                 update_gradients | update_values);

I figured out that the initialization of the two FEValues objects accounts for the biggest portion of the above-mentioned times. In particular, if I comment the initialization out, I get
CPU time: 6.54 s     wall time: 6.55 s.

The triangulations associated with dof_handler_vol and dof_handler_dev are both 1d and store only 4 and 16 elements, respectively. Given that, I am wondering why the initialization takes so long (roughly 100 seconds wall time in total) and why it causes a gap between the CPU and wall time.
Unfortunately, I have to construct them anew whenever fun2 is called, because the point 'ref_point_vol' (see Quadrature<1>(ref_point_vol)) is different in each call to fun2.

Best
Simon



Bruno Turcksin

Oct 19, 2022, 9:08:35 AM
to deal.II User Group
Simon,

The best way to profile a code is to use a profiler. It can give a lot more information than simple timers can. You say that your code is not parallelized, but by default deal.II is multithreaded. Did you set DEAL_II_NUM_THREADS=1? That could explain why CPU and wall time are different. Finally, if I understand correctly, you are calling the constructor of FEValues about 2.5 million times. That means that one call to the FEValues constructor takes 100/2.5e6 seconds, i.e., about 40 microseconds. That doesn't seem too slow.

Best,

Bruno

Simon Wiesheier

Oct 19, 2022, 9:33:18 AM
to dea...@googlegroups.com
Thank you for your answer!

" Did you set DEAL_II_NUM_THREADS=1?"

How can I double-check that?
ccmake .
only shows me the variables CMAKE_BUILD_TYPE and deal.II_DIR.
But I do not know if this is the right place to look.

" That could explain why CPU and Wall time are different. Finally, if I understand correctly, you are calling the constructor of FEValues about 2.5 million times. That means that the call to one FEValues constructor is 100/2.5e6 seconds about 40 microseconds. That doesn't seem too slow. "

There was a typo in my post. It should be 160/2.5e6 seconds about 64 microsecends.

Best,
Simon


Bruno Turcksin

Oct 19, 2022, 10:17:51 AM
to dea...@googlegroups.com
Simon,

On Wed, Oct 19, 2022 at 09:33, Simon Wiesheier <simon.w...@gmail.com> wrote:
Thank you for your answer!

" Did you set DEAL_II_NUM_THREADS=1?"

How can I double-check that?
ccmake .
only shows me the variables CMAKE_BUILD_TYPE and deal.II_DIR.
But I do not know if this is the right place to look.
It's an environment variable. If you are using bash, you can do

export DEAL_II_NUM_THREADS=1
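
As an aside, the same limit can also be set from within the program; roughly like this (a sketch, to be executed before any deal.II objects spawn tasks):

#include <deal.II/base/multithread_info.h>

int main()
{
  // Restrict deal.II's task-based parallelism to a single thread;
  // this has the same effect as setting DEAL_II_NUM_THREADS=1.
  dealii::MultithreadInfo::set_thread_limit(1);

  // ... rest of the program ...
}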
 

" That could explain why CPU and Wall time are different. Finally, if I understand correctly, you are calling the constructor of FEValues about 2.5 million times. That means that the call to one FEValues constructor is 100/2.5e6 seconds about 40 microseconds. That doesn't seem too slow. "

There was a typo in my post. It should be 160/2.5e6 seconds about 64 microsecends.
My point is the constructor should not be called millions of times. You are not going to be able to get that function 100 times faster. It's best to find a way to call it less often.

Best,

Bruno

Simon Wiesheier

Oct 19, 2022, 10:45:15 AM
to dea...@googlegroups.com
" It's an environment variable. "

I did
$DEAL_II_NUM_THREADS
and the variable is not set.
But if it were set to one, why would this explain the gap between cpu and wall time?

" My point is the constructor should not be called millions of times. You are not going to be able to get that function 100 times faster. It's best to find a way to call it less often. "

What I want to do boils down to the following:
Given the reference coordinates of a point 'p', along with the cell on which 'p' lives,
give me the value and gradient of a finite element function evaluated at 'p'.

My idea was to create a quadrature object with 'p' being the only quadrature point, pass this
quadrature object to the FEValues object, and finally do the .reinit(cell) call (then, of course, get_function_values()...).
'p' is different for all (2.5 million) quadrature points, which is why I create the FEValues object so many times.
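
For reference, the pattern looks roughly like this (a sketch with illustrative names; dim = 1 in my case):

// 'p' is given in the reference coordinates of 'cell'.
const Quadrature<1> quadrature(p); // single-point quadrature
FEValues<1> fe_values(dof_handler.get_fe(),
                      quadrature,
                      update_values | update_gradients);
fe_values.reinit(cell);

std::vector<double>      values(1);
std::vector<Tensor<1,1>> gradients(1);
fe_values.get_function_values(solution, values);       // value of the FE field at 'p'
fe_values.get_function_gradients(solution, gradients); // gradient at 'p'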

Do you have a different suggestion to solve my problem, i.e., to evaluate the finite element field and its derivatives at 'p'?

Best,
Simon



Martin Kronbichler

Oct 19, 2022, 12:21:24 PM
to dea...@googlegroups.com
Dear Simon,

You seem to be looking for FEPointEvaluation. That class is shown in step-19 and provides, for simple FiniteElement types, a much faster way to evaluate solutions at arbitrary points within a cell. Do you want to give it a try? The issue you are facing is that the FEValues class you are using goes through a very abstract entry point that does precomputations which only pay off if the unit points are used many times. And even in the case of the same unit points it is not really fast; it is a general-purpose baseline that I would not recommend for high-performance purposes.

As a final note, I would mention that FEPointEvaluation falls back to FEValues for complicated FiniteElement types, so it might be that you do not get speedups in those cases. But we could work on it if you need it; today we know much better what to do than a few years ago.

Best,
Martin

Wolfgang Bangerth

Oct 19, 2022, 4:34:44 PM
to dea...@googlegroups.com
On 10/19/22 08:45, Simon Wiesheier wrote:
>
> What I want to do boils down to the following:
> Given the reference coordinates of a point 'p', along with the cell on
> which 'p' lives,
> give me the value and gradient of a finite element function evaluated at
> 'p'.
>
> My idea was to create a quadrature object with 'p' being the only
> quadrature point and pass this
> quadrature object to the FEValues object and finally do the
> .reinit(cell) call (then, of course, get_function_values()...)
> 'p' is different for all (2.5 million) quadrature points, which is why I
> create the FEValues object so many times.

It's worth pointing out that this is exactly what VectorTools::point_values() does.
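
For a scalar field and a single point, the closely related point_value() and point_gradient() functions can be used roughly like this (a sketch; dim is the space dimension, 'p' is given in real-space coordinates, and the cell search happens internally):

#include <deal.II/numerics/vector_tools.h>

const double         value    = VectorTools::point_value(dof_handler, solution, p);
const Tensor<1, dim> gradient = VectorTools::point_gradient(dof_handler, solution, p);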

(As others have already mentioned, if you want to do that many many
times over, this is too expensive and you should be using
FEPointEvaluation instead.)

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Simon Wiesheier

Oct 20, 2022, 5:55:56 AM
to dea...@googlegroups.com
Dear Martin and Wolfgang,

" You seem to be looking for FEPointEvaluation. That class is shown in step-19 and provides, for simple FiniteElement types, a much faster way to evaluate solutions at arbitrary points within a cell. Do you want to give it a try? "

I implemented the FEPointEvaluation approach like this:

FEPointEvaluation<1,1> fe_eval(mapping,
                               FE_Q<1>(1),
                               update_gradients | update_values);
fe_eval.reinit(cell, make_array_view(std::vector<Point<1>>{ref_point_energy_vol}));
Vector<double> p_dofs(2);
cell->get_dof_values(solution_global, p_dofs);
fe_eval.evaluate(make_array_view(p_dofs),
                 EvaluationFlags::values | EvaluationFlags::gradients);
double val = fe_eval.get_value(0);
Tensor<1,1> grad = fe_eval.get_gradient(0);

I am using FE_Q elements of degree one and a MappingQ object also of degree one.

Frankly, I do not really understand the measured computation times.
My program has several loadsteps with nested Newton iterations:
Loadstep 1:
Assembly 1: cpu time 12.8 sec  wall time 268.7 sec
Assembly 2: cpu time 17.7 sec  wall time 275.2 sec
Assembly 3: cpu time 22.3 sec  wall time 272.6 sec
Assembly 4: cpu time 23.8 sec  wall time 271.3 sec
Loadstep 2:
Assembly 1: cpu time 14.3 sec  wall time 260.0 sec
Assembly 2: cpu time 16.9 sec  wall time 262.1 sec
Assembly 3: cpu time 18.5 sec  wall time 270.6 sec
Assembly 4: cpu time 17.1 sec  wall time 262.2 sec
...

Using FEValues instead of FEPointEvaluation, the results are:
Loadstep 1:
Assembly 1: cpu time 23.9 sec  wall time 171.0 sec
Assembly 2: cpu time 32.5 sec  wall time 168.9 sec
Assembly 3: cpu time 33.2 sec  wall time 168.0 sec
Assembly 4: cpu time 32.7 sec  wall time 166.9 sec
Loadstep 2:
Assembly 1: cpu time 24.9 sec  wall time 168.0 sec
Assembly 2: cpu time 34.7 sec  wall time 167.3 sec
Assembly 3: cpu time 33.9 sec  wall time 167.8 sec
Assembly 4: cpu time 34.3 sec  wall time 167.7 sec
...

Clearly, the fluctuations using FEValues are smaller than in the case of FEPointEvaluation.
Anyway, using FEPointEvaluation the CPU time is smaller but the wall time is substantially bigger.
If I am not mistaken, the values CPU time 34.3 sec and wall time 167.7 sec mean that
the CPU needs 34.3 sec to execute my assembly routine and has to wait for the
remaining 167.7 - 34.3 seconds.
This huge gap between CPU and wall time has to be related to what I do with FEValues or FEPointEvaluation,
as CPU and wall time are nearly balanced if I use neither of them.
What might be the problem?

Best
Simon






Simon Wiesheier

Oct 20, 2022, 10:47:17 AM
to dea...@googlegroups.com
Update:

I profiled my program with valgrind --tool=callgrind and could figure out that
FEPointEvaluation creates an FEValues object along with a quadrature object under the hood.
Closer inspection revealed that all constructors, destructors, ... associated with FEPointEvaluation
need roughly 5000 more instructions (per call!).
That said, FEValues is indeed the faster approach, at least for FE_Q elements.

export DEAL_II_NUM_THREADS=1
eliminated the gap between CPU and wall time.
Using FEValues directly, I get a CPU time of 19.8 seconds,
and in the case of FEPointEvaluation a CPU time of 21.9 seconds;
wall times are in the same ballpark.
Out of curiosity, why does multi-threading produce such high wall times (200 seconds) in my case?

These times are far too big given that the solution of the linear system takes only about 13 seconds.
But based on what all of you have said, there is probably no other way to implement my problem.

Best
Simon

Peter Munch

Oct 20, 2022, 10:53:47 AM
to deal.II User Group
> FEPointEvaluation creates an FEValues object along with a quadrature object under the hood.
> Closer inspection revealed that all constructors, destructors, ... associated with FEPointEvaluation
> need roughly 5000 more instructions (per call!).
> That said, FEValues is indeed the faster approach, at least for FE_Q elements.

What type of Mapping are you using? If you take a look at https://github.com/dealii/dealii/blob/ad13824e599601ee170cb2fd1c7c3099d3d5b0f7/source/matrix_free/fe_point_evaluation.cc#L40-L95 you can see when the fast path of FEPointEvaluation is taken. Indeed, the slow path is FEValues. One question: are you running in release or debug mode?

Hope this brings us closer to the issue,
Peter

Simon Wiesheier

Oct 20, 2022, 11:00:53 AM
to dea...@googlegroups.com
" What type of Mapping are you using? If you take a look at https://github.com/dealii/dealii/blob/ad13824e599601ee170cb2fd1c7c3099d3d5b0f7/source/matrix_free/fe_point_evaluation.cc#L40-L95 you can see when the fast path of FEPointEvaluation is taken. Indeed, the slow path is (FEValues). One question: are you running in release or debug mode? "

I use FE_Q<1>(1) with a MappingQ<1>(1) and
FE_Q<2>(1) with a MappingQ<2>(1).

I am running in release mode.

Best,
Simon

Martin Kronbichler

Oct 20, 2022, 11:02:29 AM
to dea...@googlegroups.com

Dear Simon,

When you use FEPointEvaluation, you should construct it only once and re-use the same object for different points. Furthermore, you should also avoid creating "p_dofs" and the "std::vector" anew for every point; I was not clear about this in my original message. Anyway, the problem is the FEValues object that gets used. I am confused by your other message saying that you use FE_Q together with MappingQ - that combination should be supported, and if it is not, we should take a look at a (reduced) code from you.
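
Schematically, something like this (a sketch adapted from your snippet; names are illustrative):

// Set up once, e.g., as class members:
MappingQ<1>            mapping(1);
FE_Q<1>                fe(1);
FEPointEvaluation<1,1> fe_eval(mapping, fe, update_values | update_gradients);
Vector<double>         p_dofs(fe.dofs_per_cell);
std::vector<Point<1>>  unit_point(1);

// Per point / per call to fun2, only this remains:
unit_point[0] = ref_point_energy_vol;
fe_eval.reinit(cell, make_array_view(unit_point));
cell->get_dof_values(solution_global, p_dofs);
fe_eval.evaluate(make_array_view(p_dofs),
                 EvaluationFlags::values | EvaluationFlags::gradients);
const double      val  = fe_eval.get_value(0);
const Tensor<1,1> grad = fe_eval.get_gradient(0);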

Regarding the high timings: There is some parallelization by tasks that gets done inside the constructor of FEValues. This is well-intentioned for the case where we are in 3D and have a reasonable amount of work to do. However, you are in 1D (if I read your code correctly), and there it has adverse effects. The reason is that the constructor of FEValues is very likely completely dominated by memory allocation. When we have 1 thread, everything is fine, but when we have multiple threads working they will start to interfere with each other when they request memory through malloc(), which has to be coordinated by the operating system (and thus gets slower). In fact, the big gap between compute time and wall time shows that there is a lot of time wasted as "system time" that does not do actual work on the cores.

I guess the library could have a better measure of when to spawn tasks in FEValues and similar contexts, but it is a lot of work to get this right. (This is why I keep avoiding it in critical functions.)

Best,
Martin

Simon Wiesheier

Oct 20, 2022, 12:11:09 PM
to dea...@googlegroups.com
" When you use FEPointEvaluation, you should construct it only once and re-use the same object for different points. Furthermore, you should also avoid to create "p_dofs" and the "std::vector" near the  I was not clear with my original message. Anyway, the problem is the FEValues object that gets used. I am confused by your other message that you use FE_Q together with MappingQ - that combination should be supported and if it is not, we should take a look at a (reduced) code from you. "

I added a snippet of my code (see appendix) in which I describe the logic of what I am doing with FEPointEvaluation.
In fact, constructing FEPointEvaluation (and the vector p_dofs) only once and re-using them brings only minor changes, as the overall cost is dominated by the call to reinit().
But, of course, it helps at least.

I am surprised too that the fast path is not used. Maybe you can identify a problem in my code.
Thank you!

Best,
Simon

mwe.cpp

Simon Wiesheier

Oct 21, 2022, 4:59:35 AM
to dea...@googlegroups.com
I revised the appendix from my last message a little bit and now attach a minimal working example (just 140 lines) along with a CMakeLists.txt.
After checking the profiling results from valgrind, the combination of MappingQ with FE_Q does *not* take the fast path.

For info: I use deal.II version 9.3.2.

Best,
Simon
CMakeLists.txt
minimal_working_example.cc

Peter Munch

Oct 22, 2022, 6:57:21 AM
to deal.II User Group
You are right. Release 9.3 uses the slow path for MappingQ. The reason is that here https://github.com/dealii/dealii/blob/ccfaddc2bab172d9d139dabc044d028f65bb480a/include/deal.II/matrix_free/fe_point_evaluation.h#L708-L711 we check for MappingQGeneric. At that time, MappingQ and MappingQGeneric were different classes. In the meantime, we have merged the classes, so that in release 9.4 and on master this is not an issue anymore. Is there a chance that you could update deal.II? Alternatively, you could use MappingQGeneric instead of MappingQ.
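
In code, the workaround is roughly (a sketch):

// deal.II 9.3: construct a MappingQGeneric so that FEPointEvaluation
// takes its fast path; MappingQ only passes this check from 9.4 on.
MappingQGeneric<1> mapping(1); // instead of MappingQ<1>(1)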

Hope this resolves this issue!

Peter

Simon Wiesheier

Oct 22, 2022, 10:46:16 AM
to dea...@googlegroups.com
Yes, the issue is resolved and the computation time decreased significantly.

Thank you all!

-Simon

Peter Munch

Oct 22, 2022, 11:08:57 AM
to deal.II User Group
Happy about that! May I ask you to post the results here? I am curious, since I never actually compared timings (and only blindly trusted Martin).

Thanks,
Peter

Simon Wiesheier

Oct 23, 2022, 4:33:38 AM
to dea...@googlegroups.com
Certainly.
When using the slow path, i.e., MappingQ in version 9.3.2, the CPU time is about 6.3 seconds.
In the case of the fast path, i.e., MappingQGeneric in version 9.3.2, the CPU time is about 18.7 seconds.
Roughly, the reinit() function associated with the FEPointEvaluation objects is called about 1.2 million times.

Best,
Simon

Peter Munch

Oct 23, 2022, 4:37:18 AM
to deal.II User Group
Now, I am lost. The fast one is 3 times slower!?

Peter

Simon Wiesheier

Oct 23, 2022, 4:52:10 AM
to dea...@googlegroups.com
Sorry, I was wrong. Of course, it is the other way round.
The fast one is 3 times faster.

-Simon
