What's the best strategy to speed up assembly?


Kev Se

unread,
Dec 8, 2022, 11:10:51 AM12/8/22
to deal.II User Group
Hej, 

I've written here before and hope I'm not misusing the mailing list. I've looked through the documentation and this forum for a while but haven't found a conclusive answer.

I am aiming to solve a nonlinear hyperbolic transport equation. For the sake of the argument, let's say it reads

mu \cdot \nabla f(x) = - f(x)^2 - 2*b(x)*f(x) - a(x)

this is, of course, a Riccati equation (up to signs, possibly). In my case, f is a complex function but this is of little relevance here. Since it's a nonlinear problem I need to construct both the Jacobian and the residual. For starters, I do that in each step.

I've managed to implement this and even get a PETSc-parallelised version to work, and am very happy. (I love deal.II, by the way - very impressive.) It doesn't scale "optimally" on my small laptop, but it's still a fine speedup when using MPI. So far so good.

However, I want to solve my problem for many different directions vF, and then extract all the solutions and do something with them. As such, my problem is less that I need a very large number of DOFs / huge meshes - my typical mesh will be on the order of 10000 unknowns, maybe 100k but not millions. Rather, I want the individual solves to be as fast as possible since I need to do on the order of 100-10000 of them, depending on the problem at hand.

I've done some layman's benchmarking of the individual "steps" (setup, assembly, solve, ...) in my current version of the code. It looks as if the assembly takes several orders of magnitude (~100 at least) longer than the solving part.

My question is now: What is the best strategy to speed up assembly, is there any experience with this? I've read different approaches and am confused what's promising for small-scale problems. So far I'm considering:

1) Using a matrix-free approach rather than PETSc - this seems to be a win in most cases, but would require rewriting large parts of the code, and I am not sure if I will gain a lot given my small system size.

2) Only assembling the Jacobian every few steps, but the residual in every step. This is probably easier to implement. I know from experience with my problem that I pretty quickly land in a situation where I need only one or two Newton steps to find the solution to my nonlinear equation, so the saving will be small at best.

Is there anything else one can do? 
So far I've been using MeshWorker, which is fine and understandable to me, but e.g. the boundary term as used in step-12 queries the scalar product of \mu and the edge normal in each boundary element, which seems like a possible slowdown - in addition to generating jumps and averages on inner cell edges.

 Any help is much appreciated. Sorry for the long text!
/Kev

blais...@gmail.com

unread,
Dec 8, 2022, 2:13:08 PM12/8/22
to deal.II User Group
I am a bit surprised that your assembly is really 100x more expensive than your linear solver.
Maybe your assembly code is not optimized? 
For example, I try to do as little work as possible inside the double loop over DOFs, which is the innermost loop. Sometimes pre-calculating things in the outer loop really speeds up the calculation. This also depends on the polynomials you are using for interpolation. If you are using high-order polynomials, I think this is where you will reap the benefits of matrix-free significantly.
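To illustrate the hoisting idea with a library-agnostic sketch (plain arrays instead of deal.II's FEValues; `shape_value` and the toy sizes are made up for the example, not deal.II API):

```cpp
#include <array>
#include <cmath>
#include <vector>

// Toy "element": n_dofs shape functions evaluated at n_q quadrature points.
constexpr int n_dofs = 4;
constexpr int n_q = 4;

// Hypothetical shape value phi_i(x_q); stands in for fe_values.shape_value(i, q).
double shape_value(int i, int q) { return std::cos(0.3 * i + 0.7 * q); }

// Naive mass-matrix assembly: shape_value is re-evaluated inside the j-loop.
std::array<std::array<double, n_dofs>, n_dofs>
assemble_naive(const std::vector<double> &JxW)
{
  std::array<std::array<double, n_dofs>, n_dofs> cell_matrix{};
  for (int q = 0; q < n_q; ++q)
    for (int i = 0; i < n_dofs; ++i)
      for (int j = 0; j < n_dofs; ++j)
        cell_matrix[i][j] += shape_value(i, q) * shape_value(j, q) * JxW[q];
  return cell_matrix;
}

// Optimized variant: cache phi_i(x_q) once per (i, q) and fold in JxW before
// entering the j-loop, so the innermost loop is a single multiply-add.
std::array<std::array<double, n_dofs>, n_dofs>
assemble_cached(const std::vector<double> &JxW)
{
  double phi[n_q][n_dofs];
  for (int q = 0; q < n_q; ++q)
    for (int i = 0; i < n_dofs; ++i)
      phi[q][i] = shape_value(i, q);

  std::array<std::array<double, n_dofs>, n_dofs> cell_matrix{};
  for (int q = 0; q < n_q; ++q)
    for (int i = 0; i < n_dofs; ++i)
      {
        const double phi_i_w = phi[q][i] * JxW[q]; // hoisted out of the j-loop
        for (int j = 0; j < n_dofs; ++j)
          cell_matrix[i][j] += phi_i_w * phi[q][j];
      }
  return cell_matrix;
}
```

Both variants produce the same cell matrix; the second simply does O(n_q * n_dofs) shape evaluations instead of O(n_q * n_dofs^2).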

Feel free to ask more questions :)!
Best
Bruno

Wolfgang Bangerth

unread,
Dec 8, 2022, 3:30:33 PM12/8/22
to dea...@googlegroups.com
On 12/8/22 12:13, blais...@gmail.com wrote:
> For example, I try to do as little work as possible inside the
> double loop over DOFs, which is the innermost loop. Sometimes
> pre-calculating things in the outer loop really speeds up the
> calculation. This also depends on the polynomials you are using for
> interpolation. If you are using high-order polynomials, I think this is
> where you will reap the benefits of matrix-free significantly.

This is extensively discussed in step-22, as a point of reference.
Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Wolfgang Bangerth

unread,
Dec 8, 2022, 4:35:22 PM12/8/22
to dea...@googlegroups.com

Kev,

> I've done some layman's benchmarking of the individual "steps" (setup,
> assembly, solve, ...) in my current version of the code. It looks as if the
> assembly takes several orders of magnitude (~100 at least) longer than the
> solving part.
>
> My question is now: What is the best strategy to speed up assembly, is there
> any experience with this? I've read different approaches and am confused
> what's promising for small-scale problems. So far I'm considering:
>
> 1) Using a matrix-free approach rather than PETSc - this seems to be a win in
> most cases, would however consider  rewriting large parts of the code and I am
> not sure if I will gain a lot given my small system size.
>
> 2) Only assemble the Jacobian every few steps, but the residual in every step.
> This is probably easier to implement. I know from experience with my problem
> that I pretty quickly land in a situation where I need only one or two Newton
> steps to find the solution to my nonlinear equation, so there saving will be
> small at best.

Like Bruno, it seems rather unlikely to me that assembly would take 100 times
longer than other things, and I would suggest you use something like
TimerOutput to time individual sections to narrow down where the issue lies.
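In deal.II one would wrap each phase (setup, assembly, solve) in a TimerOutput section; a minimal stand-alone analogue built on std::chrono (class and member names here are illustrative, not the deal.II API) could look like:

```cpp
#include <chrono>
#include <iostream>
#include <map>
#include <string>

// Accumulates wall-clock time per named section, in the spirit of
// deal.II's TimerOutput.
class SectionTimer
{
public:
  // RAII scope: starts the clock on construction, stops and accumulates
  // into the named section on destruction.
  class Scope
  {
  public:
    Scope(SectionTimer &timer, const std::string &name)
      : timer(timer), name(name), start(std::chrono::steady_clock::now())
    {}
    ~Scope()
    {
      const auto end = std::chrono::steady_clock::now();
      timer.seconds[name] += std::chrono::duration<double>(end - start).count();
    }
  private:
    SectionTimer &timer;
    const std::string name;
    const std::chrono::steady_clock::time_point start;
  };

  void print_summary() const
  {
    for (const auto &entry : seconds)
      std::cout << entry.first << ": " << entry.second << " s\n";
  }

  std::map<std::string, double> seconds;
};
```

Usage is a scope per phase, e.g. `{ SectionTimer::Scope t(timer, "assembly"); assemble_system(); }`, followed by `timer.print_summary()` at the end of the run.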

Beyond this, updating the Jacobian only occasionally is indeed a fairly common strategy. One
can of course implement this by hand, but it's quite cumbersome to implement
optimal algorithms to determine when updating is necessary, and I would
encourage you to take a look at step-77 to see how one can solve nonlinear
problems efficiently through interfaces to advanced libraries such as SUNDIALS:
https://dealii.org/developer/doxygen/deal.II/step_77.html

I don't remember if KINSOL has ways to solve a sequence of nonlinear systems,
re-using the Jacobian between systems. But if it does not, you can always just
store a pointer to the last Jacobian matrix and whenever you start solving a
linear system for the first time, copy the stored matrix over.
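The reuse idea can be sketched on a scalar toy problem; the fixed refresh interval below is a deliberately crude placeholder for the smarter update heuristics a library like KINSOL applies:

```cpp
#include <cmath>
#include <functional>

// Modified Newton: reuse the "Jacobian" (here just f'(x)) for
// refresh_interval steps before recomputing it. Returns the approximate root.
double modified_newton(const std::function<double(double)> &f,
                       const std::function<double(double)> &df,
                       double x,
                       int refresh_interval = 3,
                       double tol = 1e-12,
                       int max_iter = 100)
{
  double jacobian = df(x); // initial Jacobian
  for (int iter = 0; iter < max_iter; ++iter)
    {
      const double residual = f(x);
      if (std::abs(residual) < tol)
        break;
      if (iter % refresh_interval == 0)
        jacobian = df(x); // periodic refresh; otherwise keep the stale one
      x -= residual / jacobian; // Newton update with (possibly stale) Jacobian
    }
  return x;
}
```

With a stale Jacobian the convergence degrades from quadratic to linear, which is exactly the trade-off: cheaper steps, but possibly more of them.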

In any case, I think the first step should be to look at (i) whether assembly
really takes that long for you, and (ii) why it takes that long. If
step-77 is any indication, then assembling the Jacobian should really not take
that much longer than actually solving linear systems.

Marc Fehling

unread,
Dec 9, 2022, 4:07:31 PM12/9/22
to deal.II User Group
Hello Kev,

just to make sure: did you compile the deal.II library and your code in Optimized mode/Release mode?

Marc

Martin Kronbichler

unread,
Dec 11, 2022, 4:24:27 PM12/11/22
to dea...@googlegroups.com

Dear Kev,

To give an additional perspective on this topic: The assembly functions of deal.II (when using FEValues and the approach shown in step-1 to step-30, at least) are optimized not for performance but rather for readability, even when compiled in release mode. On an average case with linear or quadratic polynomials, I would expect a factor of 10 in performance gain from one of the following approaches:

- You could express the three nested loops of a typical matrix assembly routine as a dense matrix-matrix multiplication. This requires you to rearrange some computations, especially when you work with systems of equations, but the gain comes from the fact that matrix-matrix multiplication is typically mapped to BLAS functions, which are among the most optimized algorithms in the world (LINPACK uses them as the dominating algorithm). The step-65 tutorial program explains this for a symmetric problem, but it is not hard to do for non-symmetric problems.

- You could express the code that computes the local matrix such that FEEvaluation (from the matrix-free framework) can compute the system matrix by applying the local operator to all unit vectors. You could also go completely matrix-free in the sense of not computing a matrix at all, but using more optimized strategies to compute the local matrices and residuals could be a middle ground when you don't want to fiddle with iterative solvers beyond PETSc's (great) default choices. The matrix-free framework uses computer hardware slightly less efficiently than BLAS-3, but it can leverage the structure in shape functions and hence lower complexity, often quite significantly.

Obviously, both the above options require some work, and lead to not-so-nice code. It depends on your use case if that is worth it. Of course, all the other suggestions, like not assembling the matrix in every step (possibly via SUNDIALS) or gathering precise timings to take the right action, should also be in your list of considerations.
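To make the first of Martin's options concrete: with B_{qi} = phi_i(x_q) and a diagonal weight matrix W holding the JxW values, the cell mass matrix is A = B^T (W B). A self-contained sketch (hand-written loops standing in for what would be a dgemm call in production) that can be checked against the classic triple loop:

```cpp
#include <array>
#include <cmath>

constexpr int n_q = 4;
constexpr int n_dofs = 3;

using CellMatrix = std::array<std::array<double, n_dofs>, n_dofs>;

// A_{ij} = sum_q phi_i(x_q) * phi_j(x_q) * w_q, written as A = B^T (W B).
// B is the n_q x n_dofs matrix of shape values, w the quadrature weights
// (times Jacobian determinants).
CellMatrix assemble_as_matmul(const std::array<std::array<double, n_dofs>, n_q> &B,
                              const std::array<double, n_q> &w)
{
  // WB_{qj} = w_q * B_{qj}: scale each row of B by its quadrature weight.
  std::array<std::array<double, n_dofs>, n_q> WB{};
  for (int q = 0; q < n_q; ++q)
    for (int j = 0; j < n_dofs; ++j)
      WB[q][j] = w[q] * B[q][j];

  // A = B^T * WB -- this is the product one would hand to BLAS (dgemm).
  CellMatrix A{};
  for (int i = 0; i < n_dofs; ++i)
    for (int j = 0; j < n_dofs; ++j)
      for (int q = 0; q < n_q; ++q)
        A[i][j] += B[q][i] * WB[q][j];
  return A;
}

// Reference: the classic triple loop over q, i, j as in the tutorial programs.
CellMatrix assemble_triple_loop(const std::array<std::array<double, n_dofs>, n_q> &B,
                                const std::array<double, n_q> &w)
{
  CellMatrix A{};
  for (int q = 0; q < n_q; ++q)
    for (int i = 0; i < n_dofs; ++i)
      for (int j = 0; j < n_dofs; ++j)
        A[i][j] += B[q][i] * B[q][j] * w[q];
  return A;
}
```

The arithmetic is identical; the win comes from replacing the hand-written product with an optimized BLAS routine once the data is laid out as dense matrices.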

Best,
Martin


Kev Se

unread,
Dec 11, 2022, 6:08:21 PM12/11/22
to deal.II User Group
Hello everyone,

first of all, thank you for all the detailed and kind answers! I will try to answer and probably miss a few points.

I knew about the debug and release modes and thought I was still in release mode. I had apparently switched back to debug mode for benchmarking purposes without remembering doing that. That is of course silly and a large part of the discrepancy. Now the factor is down from 100-200 to around 4-5, which is of course a great improvement that I should have found myself. Oh well. 

I'm still very appreciative of all the comments since I'm currently in some form of early alpha of my programme and trying to get the basics "right" so that the layer surrounding the FEM solution step can be built on top. [In case that is of interest: I'm working on simulating superconductors; in my case the FEM solution gets fed into some form of self-consistency loop for an energy landscape / the coefficients of the equation, which happens outside of the FEM method.]

As for step-22, I hadn't read the part about assembly and gladly did so. Step-77 I had indeed thought about when asking about not recalculating the Jacobian in each step; it's a very interesting thing to add, but I might postpone SUNDIALS to a later stage. I worked with it a few years ago and remember it being a beast to tame, so to speak, but probably worth the effort.

Martin, thanks for all the background and suggestions regarding some matrix-free / matrix-product options. I should definitely look into that once I understand it better :-) 
-----------------------------
Here the general answer stops, and a particular question follows - feel free to skip reading on.
-----------------------------
One "bottleneck" in my current matrix-based approaches is in the "interface worker" that calculates fluxes between neighbouring cells.  I have tried to avoid "re-computations" and shift as much outside of the inner loops as possible, which probably helped by a factor of 2 or so from my earliest approaches. One thing particularly I am still confused about is the following:

I want to calculate an upwind value at interior interfaces. In the step-12 tutorial, this is done via something of the form:

fe_iv.shape_value((beta_dot_n > 0), j, qpoint) // phi_j^{upwind}
 
which is very clear. Now, for my nonlinear problem I need to do something like this with my "old" solution guess in the residual and Jacobian. Since I have an FESystem with at least two components, for the Jacobian I end up using

fe_iv[component_1].value((beta_dot_n > 0), j, qpoint)

which is exactly the same thing, to my limited understanding. Now, for the residual I need this term for my "last solution" vector, but I couldn't find a corresponding function for that problem. What I did find was
 
fe_iv[component_1].get_function_values((vF_dot_n > 0), last_solution_ghost, tmp_array);

which, however, does not operate point-wise but rather determines the value vector for the entire cell(?), and then I have to query tmp_array at a given point q_point. Since I need to recalculate beta_dot_n at every point, this means a lot of additional copy operations. I did not fully understand if get_function_values_from_local_dof_values() is the way to go? My current solution is a "workaround" of sorts: pre-calculating the averages and jumps outside the three loops over qpoints and DOFs i and j, and inside the loop using

last_solution_average*beta_dot_n + 0.5*abs(beta_dot_n)*last_solution_jump

which does the job but still requires the calculation of all these averages and jumps rather than picking a certain value based on a simple if-else query. Replacing this type of construct in the Jacobian (with the above-mentioned function) saved some 10% or so, although this is arguably more often looped over than for the residual. 
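As a sanity check, the average/jump expression is algebraically identical to the if/else upwind pick, assuming the convention jump = u_here - u_there (deal.II's actual sign convention should be verified against the FEInterfaceValues documentation):

```cpp
#include <cmath>

// Upwind flux via the explicit if/else: pick the value from the side
// the transport direction beta comes from.
double upwind_pick(double beta_dot_n, double u_here, double u_there)
{
  return beta_dot_n * (beta_dot_n > 0 ? u_here : u_there);
}

// The same flux via average and jump, as in the workaround above.
// Assumed sign convention: jump = u_here - u_there.
double upwind_avg_jump(double beta_dot_n, double u_here, double u_there)
{
  const double average = 0.5 * (u_here + u_there);
  const double jump    = u_here - u_there;
  return average * beta_dot_n + 0.5 * std::abs(beta_dot_n) * jump;
}
```

If beta_dot_n > 0, the second form collapses to beta_dot_n * u_here, otherwise to beta_dot_n * u_there, so the two are interchangeable up to the cost of computing the averages and jumps.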

-------------------- question ends------------
Thanks for any hints so far, and I'm appreciative for any further help - I sincerely appreciate the top-notch documentation and apologize for my lack of understanding all these fine details (and obvious mishaps such as the release flag...).
Best,
Kev

Wolfgang Bangerth

unread,
Dec 13, 2022, 12:20:49 AM12/13/22
to dea...@googlegroups.com
On 12/11/22 16:08, Kev Se wrote:
> Now, for the residual I need this term for my "last solution" vector, but I
> couldn't find a corresponding function for that problem. What I did find was
>
> fe_iv[component_1].get_function_values((vF_dot_n > 0), last_solution_ghost,
> tmp_array);
>
> which, however, does not operate point-wise but rather determines the value
> vector for the entire cell(?),

Well, for the entire *face*, but the issue is clear anyway. I think it would
probably be useful to introduce a function that, instead of a single
here_or_there argument, takes a vector of n_quadrature_points such bool values
-- in your case, the values of the expression vf[q]*n[q]>0. As is always the
case, we would of course welcome such a patch.

Looking at the implementation, this shouldn't actually be very difficult.
You'd create a copy of the existing
get_function_values_from_local_dof_values() that takes such a vector<bool> as
argument, and call that from the new get_function_values() function that takes
such a bool vector.



> and then I have to query tmp_array at a given
> point q_point. Since I need to recalculate beta_dot_n at every point, this
> means a lot of additional copy operations. I did not fully understand
> if get_function_values_from_local_dof_values() is the way to go? My current
> solution is to use a "workaround" of sorts, by pre-calculating the averages
> and jumps outside the "three" loop over qpoints and DOFs i and j, and inside
> the loop the use
>
> last_solution_average*beta_dot_n + 0.5*abs(beta_dot_n)*last_solution_jump

This avoids the copies of vectors, but of course is not much cheaper either.