Suggestions for "cheap" distributed preconditioner for transient systems?


blais...@gmail.com
Dec 6, 2020, 2:27:08 PM
to deal.II User Group
Hello all,
I hope you are all well. This is not strictly a deal.II question, but I thought I could benefit from the huge amount of expertise present on this board.
I am currently launching large (at least from my point of view) simulations on our HPC cluster. The tasks generally contain on the order of 50M to 100M DOFs and are transient DNS problems.

Right now we are solving the Navier-Stokes equations in a GLS formulation. For steady-state problems, it is reasonably pertinent to use expensive preconditioners (like AMG with an ILU smoother/coarsener, or straight ILU(1) or ILU(2)) because we do not do a large number of iterations. However, in the present case we wish to use a very small time step (even though we are using a high-order, L-stable SDIRK33 time integrator).

In this case, even using ILU(1) is extremely expensive (e.g. it takes 2X more time than assembling the matrix and it really doubles the overall iteration time). Consequently, we are using ILU(0), which in the present case is performing relatively well (say between 40-100 iterations per Newton step). I was wondering if there were any other suggestions in terms of preconditioners that we could use?

Our system : GLS stabilized Navier-Stokes, monolithic matrix formulation (single matrix for u,v,w,p). The matrix is non-symmetric.
Our solver : Either GMRES or BiCGStab. In our case it seems GMRES is a bit faster (around 20%) because the iterations are cheaper. I have tried all of the solvers in the TrilinosWrappers except FGMRES actually.
Our current preconditioner : ILU(0)
Library used : Trilinos through the deal.II wrappers
Element order : Q2-Q2, Q3-Q3 or Q4-Q4. 

Any suggestions would be appreciated. I am doing all I can right now to speed up matrix assembly, but in general we only need to assemble the matrix every 3-5 time steps, so I am just trying to find a good compromise: a preconditioner that is good but not crazy expensive. Would I be better off accepting a poorer Jacobian matrix, using a more expensive preconditioner, and keeping it for a longer time (say 10-20 time steps)?
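
For reference, the linear solve currently looks roughly like this (a simplified sketch, not our exact code; function and variable names are just placeholders):

  #include <deal.II/lac/solver_control.h>
  #include <deal.II/lac/trilinos_precondition.h>
  #include <deal.II/lac/trilinos_solver.h>
  #include <deal.II/lac/trilinos_sparse_matrix.h>
  #include <deal.II/lac/trilinos_vector.h>

  using namespace dealii;

  void solve_newton_update(const TrilinosWrappers::SparseMatrix &system_matrix,
                           TrilinosWrappers::MPI::Vector        &newton_update,
                           const TrilinosWrappers::MPI::Vector  &system_rhs)
  {
    // ILU(0) applied to the whole monolithic (u,v,w,p) matrix
    TrilinosWrappers::PreconditionILU preconditioner;
    preconditioner.initialize(
      system_matrix,
      TrilinosWrappers::PreconditionILU::AdditionalData(/*ilu_fill=*/0));

    // Relative stopping criterion on the linear residual
    SolverControl solver_control(1000, 1e-10 * system_rhs.l2_norm());
    TrilinosWrappers::SolverGMRES solver(solver_control);
    solver.solve(system_matrix, newton_update, system_rhs, preconditioner);
  }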

Wolfgang Bangerth
Dec 6, 2020, 8:00:09 PM
to dea...@googlegroups.com
On 12/6/20 12:27 PM, blais...@gmail.com wrote:
>
> In this case, even using ILU(1) is extremely expensive (e.g. it takes 2X more
> time than assembling the matrix and it really doubles the overall iteration
> time). Consequently, we are using ILU(0), which in the present case is
> performing relatively well (say between 40-100 iterations per Newton step). I
> was wondering if there were any other suggestions in terms of preconditioners
> that we could use?
>
> Our system : GLS stabilized Navier-Stokes, monolithic matrix formulation
> (single matrix for u,v,w,p). The matrix is non-symmetric.
> Our solver : Either GMRES or BiCGStab. In our case it seems GMRES is a bit
> faster (around 20%) because the iterations are cheaper. I have tried all of
> the solvers in the TrilinosWrappers except FGMRES actually.
> Our current preconditioner : ILU(0)
> Library used : Trilinos through the deal.II wrappers
> Element order : Q2-Q2, Q3-Q3 or Q4-Q4.
>
> Any suggestions would be appreciated. I am doing all I can right now to
> speed up matrix assembly, but in general we only need to assemble the matrix
> every 3-5 time steps, so I am just trying to find a good compromise: a
> preconditioner that is good but not crazy expensive. Would I be better off
> accepting a poorer Jacobian matrix, using a more expensive preconditioner,
> and keeping it for a longer time (say 10-20 time steps)?

Have you played with the trade-off of having a more expensive preconditioner
but building it less often? For example, using ILU(1) instead of ILU(0), but
only building it every few time steps?

How come you get away with only assembling the matrix every 3-5 time steps?
Are you treating advection explicitly?

I'm also curious about what the matrix you use looks like. It must be
something like
A B
B^T C
What terms are in A,B,C?

Best
Wolfgang

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Timo Heister
Dec 6, 2020, 9:04:24 PM
to dea...@googlegroups.com
> The tasks generally contain on the order of 50M to 100M DOFs and are transient DNS problems.

With this you are likely at a point where factorizations (even ILU) do
not work very well. Even if you can afford a block ILU, you likely run
on 100+ cores, which means the quality of the preconditioner will
deteriorate quite a lot just based on the number of processors. You
likely need something that puts more effort into the parallel aspect.

> Our solver : Either GMRES or BiCGStab. In our case it seems GMRES is a bit faster (around 20%) because the iterations are cheaper. I have tried all of the solvers in the TrilinosWrappers except FGMRES actually.

My general advice is to work on the preconditioner to have a method in
the 10-40 iterations range. Then, it doesn't really matter which
Krylov method you use. Just use FGMRES (if you need flexibility), CG
(for SPD), MINRES (for symmetric), or GMRES (for everything else).
Switching between different Krylov methods is unlikely to make a big
difference unless you are in the 100k+ core range or need too many
iterations.

> Our current preconditioner : ILU(0)
> Element order : Q2-Q2, Q3-Q3 or Q4-Q4.

Are you just applying ILU to the whole system? My general advice is to
exploit the block structure of your PDE first and then apply ILU or
AMG to individual blocks. Time dependent Navier-Stokes with small time
steps is a relatively easy case. Let me know if you want more details.


--
Timo Heister
http://www.math.clemson.edu/~heister/

blais...@gmail.com
Dec 7, 2020, 8:19:02 AM
to deal.II User Group
Dear Wolfgang, thank you for the answers.

We get away with assembling the matrix only every N time steps for two main reasons:
- We use SDIRK, which has a constant diagonal for all stages, hence this part of the matrix does not change significantly.
- We are willing to sacrifice a non-linear iteration here and there, at the cost of less assembly. The idea is that we use an exact Jacobian when we assemble the matrix, but then we let it become inaccurate as we continue iterating, while just updating the RHS. This is not a good approach when the time step is large, but when the time step is small (say CFL<1), it does not significantly affect the number of Newton iterations. Consequently, we can update the Jacobian matrix every 3 to 5 time steps or so. This is just an indication; in practice we update it as soon as a non-linear step has not led to a decrease of the residual by more than 10x, meaning that our Newton method is becoming inefficient.
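
In pseudocode, the reassembly policy looks roughly like this (just a sketch of the logic, not our actual implementation; the callbacks stand in for whatever the application provides):

  #include <functional>

  // Sketch of the lazy-Jacobian Newton loop described above. residual_norm()
  // is assumed to reassemble the right-hand side and return its norm, and
  // solve_and_update() to solve the linear system (GMRES + ILU(0) on the
  // possibly stale matrix) and apply the Newton update. jacobian_is_fresh
  // persists across time steps so the matrix and its ILU can be reused.
  void newton_with_lazy_jacobian(
    const std::function<void()>   &assemble_jacobian_and_preconditioner,
    const std::function<void()>   &solve_and_update,
    const std::function<double()> &residual_norm,
    const double                   tolerance,
    bool                          &jacobian_is_fresh)
  {
    double previous_residual = residual_norm();

    while (previous_residual > tolerance)
      {
        if (!jacobian_is_fresh)
          {
            // Expensive: the exact Jacobian and its ILU are rebuilt together
            assemble_jacobian_and_preconditioner();
            jacobian_is_fresh = true;
          }

        solve_and_update();

        const double residual = residual_norm();
        // If the step did not reduce the residual by at least 10x, the stale
        // Jacobian has become too inaccurate: rebuild it on the next iteration.
        if (residual > 0.1 * previous_residual)
          jacobian_is_fresh = false;
        previous_residual = residual;
      }
  }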


 

> I'm also curious about what the matrix you use looks like. It must be
> something like
> A B
> B^T C
> What terms are in A,B,C?
It has exactly this shape. The full weak form and some results are detailed in: https://www.sciencedirect.com/science/article/pii/S2352711020302922
A is momentum + SUPG.
C is the PSPG block. In essence it looks very much like a stiffness matrix weighted by the stabilization parameter. Hence, I really don't think it is "well conditioned".


Thank you for your time. As always, I really appreciate this community :)!

blais...@gmail.com
Dec 7, 2020, 8:26:30 AM
to deal.II User Group
Dear Timo,
thank you for your answer and your time :)

On Sunday, December 6, 2020 at 9:04:24 p.m. UTC-5 Timo Heister wrote:
> The tasks generally contain on the order of 50M to 100M DOFs and are transient DNS problems.

> With this you are likely at a point where factorizations (even ILU) do
> not work very well. Even if you can afford a block ILU, you likely run
> on 100+ cores, which means the quality of the preconditioner will
> deteriorate quite a lot just based on the number of processors. You
> likely need something that puts more effort into the parallel aspect.

I fully agree. This is the part where my knowledge is extremely limited. I can clearly see that my strategy becomes poorer and poorer as the number of cores and the size of my system increase. To be honest, this is a first venture for me into problems of this size. In the past in my career I was always in the 1-2M cell range, whereas here we are talking about 50M cells, and traditional approaches quickly become inefficient.

 

> Our solver : Either GMRES or BiCGStab. In our case it seems GMRES is a bit faster (around 20%) because the iterations are cheaper. I have tried all of the solvers in the TrilinosWrappers except FGMRES actually.

> My general advice is to work on the preconditioner to have a method in
> the 10-40 iterations range. Then, it doesn't really matter which
> Krylov method you use. Just use FGMRES (if you need flexibility), CG
> (for SPD), MINRES (for symmetric), or GMRES (for everything else).
> Switching between different Krylov methods is unlikely to make a big
> difference unless you are in the 100k+ core range or need too many
> iterations.

Understood. Let's say that right now we are more in the 40-80 iteration range, which makes me think that we could be more efficient.
 

> Our current preconditioner : ILU(0)
> Element order : Q2-Q2, Q3-Q3 or Q4-Q4.

> Are you just applying ILU to the whole system? My general advice is to
> exploit the block structure of your PDE first and then apply ILU or
> AMG to individual blocks. Time dependent Navier-Stokes with small time
> steps is a relatively easy case. Let me know if you want more details.

Yes, I am applying ILU to the whole system, without anything else really. I think this is relatively traditional in GLS and VMS approaches (although I could not say whether that is a good or a bad thing).

If I understand correctly, you mean formulating the problem using a BlockMatrix and a BlockVector, then preconditioning each block of the matrix independently?
I guess this would be best achieved using the linear operators within deal.II right?
What would be the added benefit of that? I would presume that since each block has more "coherent units", this would at least lead to better conditioning of the final system, but is there anything else there?
Is there any literature or other material on this? I feel like the linear solver is such a critical element of a good Navier-Stokes solver, but I struggle to find easy/accessible literature on the topic. This is where my Chemical Engineering background makes me feel like I am biting off more than I can chew :).

Thank you very much for your time. I really appreciate it.

Martin Kronbichler
Dec 7, 2020, 9:22:36 AM
to dea...@googlegroups.com

Dear Bruno,

 

> Yes, I am applying ILU to the whole system, without anything else really. I think this is relatively traditional in GLS and VMS approaches (although I could not say whether that is a good or a bad thing).
>
> If I understand correctly, you mean formulating the problem using a BlockMatrix and a BlockVector, then preconditioning each block of the matrix independently?
> I guess this would be best achieved using the linear operators within deal.II right?

Yes, you would use block vectors and block matrices. Concerning the linear operators, that would be one option, but it is not really a necessity. For example, you can have a look at step-31 and the class BlockSchurPreconditioner in particular; see its implementation here: https://github.com/dealii/dealii/blob/2a6983945af78ee20b10757ab010040b9f2a2f89/examples/step-31/step-31.cc#L407-L417


> What would be the added benefit of that? I would presume that since each block has more "coherent units", this would at least lead to better conditioning of the final system, but is there anything else there?
> Is there any literature or other material on this?

The benefit is that you can tailor the components to the problem at hand. If you have a Navier-Stokes problem with small time steps, at some point you will need a Poisson-type operator in the pressure variable to reflect the incompressibility constraint. To resolve that global coupling, you need some form of multigrid (or another strong preconditioner). However, that effect is really only needed for the pressure, so it might be inefficient to also apply multigrid to the velocity components; instead, you can get away much more cheaply with an ILU there, or even just the diagonal. For small time steps, the diagonal should be a good approximation, even though it might not be the most efficient choice.
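
To make this concrete, a block-triangular preconditioner along these lines could look roughly as follows in deal.II with the Trilinos wrappers. This is only a sketch under the assumption that you assemble the system in 2x2 block form; the matrix passed in as "schur_approximation" (a pressure Poisson/stabilization matrix, a scaled pressure mass matrix, ...) is deliberately left open, and none of this is taken verbatim from step-31 or your code:

  #include <deal.II/lac/trilinos_block_sparse_matrix.h>
  #include <deal.II/lac/trilinos_parallel_block_vector.h>
  #include <deal.II/lac/trilinos_precondition.h>
  #include <deal.II/lac/trilinos_sparse_matrix.h>

  using namespace dealii;

  class BlockTriangularPreconditioner
  {
  public:
    BlockTriangularPreconditioner(
      const TrilinosWrappers::BlockSparseMatrix &system,
      const TrilinosWrappers::SparseMatrix      &schur_approximation)
      : system_matrix(system)
    {
      // Cheap preconditioner for the velocity block A
      velocity_preconditioner.initialize(
        system.block(0, 0), TrilinosWrappers::PreconditionILU::AdditionalData());
      // Strong (AMG) preconditioner for the approximate pressure Schur complement
      schur_preconditioner.initialize(
        schur_approximation, TrilinosWrappers::PreconditionAMG::AdditionalData());
    }

    void vmult(TrilinosWrappers::MPI::BlockVector       &dst,
               const TrilinosWrappers::MPI::BlockVector &src) const
    {
      // Pressure: dst_p = S_approx^{-1} src_p (one AMG V-cycle)
      schur_preconditioner.vmult(dst.block(1), src.block(1));

      // Velocity: dst_u = A_approx^{-1} (src_u - B dst_p)
      TrilinosWrappers::MPI::Vector tmp(src.block(0));
      system_matrix.block(0, 1).vmult(tmp, dst.block(1)); // tmp = B dst_p
      tmp.sadd(-1.0, src.block(0));                       // tmp = src_u - B dst_p
      velocity_preconditioner.vmult(dst.block(0), tmp);
    }

  private:
    const TrilinosWrappers::BlockSparseMatrix &system_matrix;
    TrilinosWrappers::PreconditionILU          velocity_preconditioner;
    TrilinosWrappers::PreconditionAMG          schur_preconditioner;
  };

Such an object can then be handed to a Krylov solver for block vectors, e.g. SolverFGMRES<TrilinosWrappers::MPI::BlockVector>, exactly like any other preconditioner.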

Regarding literature: I would recommend the book by Elman, Silvester and Wathen, https://doi.org/10.1093/acprof:oso/9780199678792.001.0001 , or the review paper by Benzi, Golub and Liesen, https://dx.doi.org/10.1017/S0962492904000212

In my own experience, an ILU on the velocity block and a Cahouet-Chabard approach for the pressure Schur complement work well for medium-scale simulations, i.e., polynomial degrees 2 and 3 and up to a few billion unknowns and maybe 10k processor cores, see also here: https://doi.org/10.1177/1094342016671790 . For very large sizes, ILU is not really that good any more, as Timo said, so one can reduce cost by just using the mass matrix. And geometric multigrid works better at that scale than the algebraic multigrid used in that paper. But if you want efficiency at large scales and with higher Reynolds numbers, it is hard to beat splitting-type solvers.
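
In formulas (my notation, for the Stokes part only, i.e. neglecting convection; assuming an implicit Euler step of size dt, viscosity nu, and velocity mass and stiffness matrices M_u and K_u), the Cahouet-Chabard idea is to approximate the inverse of the pressure Schur complement

  S = B (M_u/dt + nu K_u)^{-1} B^T

by

  S^{-1} ≈ nu M_p^{-1} + (1/dt) L_p^{-1},

where M_p is the pressure mass matrix and L_p a pressure Laplacian with the appropriate boundary conditions. Both inverses only need cheap approximations in practice, e.g. a Chebyshev/Jacobi iteration for M_p and a single multigrid V-cycle for L_p.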

Best,
Martin

Timo Heister
Dec 7, 2020, 9:27:09 AM
to dea...@googlegroups.com
> If I understand correctly, you mean formulating the problem using a BlockMatrix and a BlockVector, then preconditioning each block of the matrix independently?

Correct. You perform a block LU decomposition by hand and use that as
your preconditioner (sketched in formulas right after the list below).
Then approximate the blocks as best as you can.
The advantage is that you can exploit PDE specifics:
- lumping of mass matrices
- AMG/GMG for laplace-like operators
- ILU/Jacobi for mass matrix-like operators
- spectrally-equivalent matrices to approximate Schur complements
- and more
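
In formulas (my notation), with the 2x2 block form discussed above,

  M = [ A    B ]
      [ B^T  C ]

the exact block LU factorization is

  M = [ I           0 ] [ A  B ]
      [ B^T A^{-1}  I ] [ 0  S ],    S = C - B^T A^{-1} B.

Using only the upper-triangular factor, with approximations of A^{-1} and S^{-1} in place of the exact inverses, gives the usual block-triangular preconditioner; with exact inverses, GMRES would converge in at most two iterations.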

We have many tutorial steps that teach this (but not precisely for
your problem).

> I guess this would be best achieved using the linear operators within deal.II right?

Not necessarily. See step-55 and step-57 for example. The linear
operator framework helps, especially to define implicit operators
(step-20), but is not required.

> What would be the added benefit of that? I would presume that since each block has more "coherent units", this would at least lead to better conditioning of the final system, but is there anything else there?

I doubt you can get optimal preconditioners (in the sense of cost
independent of h and other problem parameters) without exploiting
PDE-specifics.

blais...@gmail.com
Dec 7, 2020, 11:52:39 AM
to deal.II User Group
Dear Timo and Martin,
Thank you very much for the help and advice.
I'll definitely take the time to read Martin's references and to look into what you are suggesting. Clearly, I need to exploit the structure of the matrix much more to extract more performance out of the solver. Thank you very much for your time :). Now it's back to reading! The complexity of these equations never ceases to amaze me.
Take care
Bruno

Wolfgang Bangerth
Dec 7, 2020, 11:52:51 AM
to dea...@googlegroups.com
On 12/7/20 6:19 AM, blais...@gmail.com wrote:
>
> How come you get away with only assembling the matrix every 3-5 time steps?
> Are you treating advection explicitly?
>
> We get away with assembling the matrix only every N time steps for two main reasons:
> - We use SDIRK, which has a constant diagonal for all stages, hence this part
> of the matrix does not change significantly.
> - We are willing to sacrifice a non-linear iteration here and there, at the
> cost of less assembly. The idea is that we use an exact Jacobian when we
> assemble the matrix, but then we let it become inaccurate as we continue
> iterating, while just updating the RHS. This is not a good approach when the
> time step is large, but when the time step is small (say CFL<1), it does
> not significantly affect the number of Newton iterations. Consequently,
> we can update the Jacobian matrix every 3 to 5 time steps or so. This is just
> an indication; in practice we update it as soon as a non-linear step has not
> led to a decrease of the residual by more than 10x, meaning that our Newton
> method is becoming inefficient.

Nice approach!


> I'm also curious about what the matrix you use looks like. It must be
> something like
> A B
> B^T C
> What terms are in A,B,C?
>
> It has exactly this shape. The full weak form and some results are detailed in
> : https://www.sciencedirect.com/science/article/pii/S2352711020302922
> A is momentum + SUPG
> C is the PSPG block. In essence it looks very much like a stiffness matrix
> weighted by the stabilization parameter.  Hence, I really don't think it is
> "well conditioned".

Timo and Martin have already answered the question that underlay why I asked
about the block structure. In practice, you can't get good preconditioners if
you don't exploit what a matrix corresponds to. ILU doesn't work well for this
reason, but if you can decompose a matrix into its blocks, you can get optimal
preconditioners for each block (say, the top left A block, given what it
represents -- if you resolve the length scales of advection, i.e., if your
cell Peclet number is less than 1, then it's essentially an elliptic operator
for which you can apply multigrid methods). Then you build your overall
preconditioner from preconditioners for each block separately.

I second the suggestion of the book by Elman, Silvester, and Wathen. I would
also point you to step-22. The implementation in the actual program is more
for illustrative purposes of what one can do, but I would suggest you
specifically read the section in the "Possibilities for extensions" that
discusses the Silvester/Wathen preconditioner which is probably close to
optimal. (It uses a mass matrix for the Schur complement, and that can be done
better using the BTFB method, but I think it points you in the right direction).

Completely separately, one of the things Timo and I played around with in
ASPECT a long time ago is whether to make some terms in the equation explicit
or implicit. We had originally made advection explicit, because that then
leads to a symmetric system matrix and we thought that that would be a more
efficient way to solve the linear system. And that's true, symmetric systems
are faster to solve. But it required us to have a substantially smaller time
step, and in the end we realized that the overall time spent is (time per time
step) * (number of time steps), and that by doing everything implicitly, we could
increase the time step size by 10x whereas the cost for solving the linear
systems only increased by 2x, so that made the whole simulation cheaper by 5x.

I think that from your description you're already doing everything implicitly,
but it's worth thinking about these kinds of trade-offs anyway if you
currently have anything that limits your time step size.

Best
W.

blais...@gmail.com
Dec 7, 2020, 12:18:16 PM
to deal.II User Group
Thank you all for the help :).
This is always a bit of a lesson in humility where I realise there is just so much I don't know, and how much there still is to learn out there :)!