Large iteration counts switching from Q2-Q1 to Q1-P0 in step-22 and step-56 --- advice for nonlinear incompressible elasticity?

Simon

Sep 4, 2025, 12:20:51 PM
to deal.II User Group
Dear All,

I recently modified step-22 and step-56 (the Stokes problem) by switching the usual Q2-Q1 element to a cheaper Q1-P0 element, i.e., linear Lagrange polynomials for displacement and element-wise constant pressure.

What I noticed is that using Q1-P0 significantly increases the cost of the linear solve. Here are some example iteration counts from step-56:

Q2-Q1:

  • Cycle 2: 21 (Outer), 320 Inner (A), 22 Inner (S)
  • Cycle 3: 21 (Outer), 592 Inner (A), 22 Inner (S)

Q1-P0:

  • Cycle 2: 258 (Outer), 2313 Inner (A), 258 Inner (S)
  • Cycle 3: 811 (Outer), 12600 Inner (A), 811 Inner (S)

I looked at a paper co-authored by Wolfgang (link here: https://www.math.colostate.edu/~bangerth/publications/2021-q1q1.pdf), and Fig. 8 suggests the FGMRES iteration count shouldn’t blow up this much. But clearly, the true bottleneck is the inner iterations for A.

My understanding is that using the upper block-triangular preconditioner with ILU for both the mass matrix and the A-matrix should still be reasonable when lowering the polynomial degree.
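
Concretely, I use the structure from step-56: an outer FGMRES iteration, preconditioned by something roughly like the following (untested sketch; the names are placeholders rather than the actual step-56 variables, and both inner solves are CG preconditioned with ILU):

#include <deal.II/lac/block_sparse_matrix.h>
#include <deal.II/lac/block_vector.h>
#include <deal.II/lac/solver_cg.h>
#include <deal.II/lac/solver_control.h>
#include <deal.II/lac/sparse_ilu.h>

using namespace dealii;

// Upper block-triangular preconditioner for [A, B^T; B, 0]:
//   P = [ A  B^T ]   applied as   p = -S^{-1} r_p,
//       [ 0  -S  ]                u =  A^{-1} (r_u - B^T p),
// where schur_approx stands for whatever approximation of S is used
// (the pressure mass matrix in step-56).
struct BlockTriangularPreconditioner
{
  const BlockSparseMatrix<double> &system_matrix; // [A, B^T; B, 0]
  const SparseMatrix<double>      &schur_approx;
  const SparseILU<double>         &ilu_A;
  const SparseILU<double>         &ilu_S;

  void vmult(BlockVector<double> &dst, const BlockVector<double> &src) const
  {
    // Pressure block: dst_p = -S^{-1} src_p (inner CG, ILU-preconditioned).
    {
      SolverControl            control(1000, 1e-6 * src.block(1).l2_norm());
      SolverCG<Vector<double>> cg(control);
      dst.block(1) = 0;
      cg.solve(schur_approx, dst.block(1), src.block(1), ilu_S);
      dst.block(1) *= -1.0;
    }
    // Velocity block: dst_u = A^{-1} (src_u - B^T dst_p).
    {
      Vector<double> rhs(src.block(0).size());
      system_matrix.block(0, 1).vmult(rhs, dst.block(1)); // rhs = B^T dst_p
      rhs.sadd(-1.0, src.block(0));                       // rhs = src_u - B^T dst_p

      SolverControl            control(1000, 1e-6 * rhs.l2_norm());
      SolverCG<Vector<double>> cg(control);
      dst.block(0) = 0;
      cg.solve(system_matrix.block(0, 0), dst.block(0), rhs, ilu_A);
    }
  }
};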

Does anyone (maybe the authors of the tutorials?) have an idea why the iteration counts increase so dramatically with the Q1-P0 element?

My goal is actually not to solve Stokes, but to develop a preconditioner for incompressible nonlinear elasticity, solving:

  • div(P(F(u))) = 0 with P the stress and F the deformation gradient,
  • det(F) - 1 = 0.

The discretized system looks structurally similar to Stokes, especially since the A-matrix (stiffness matrix) is symmetric and the B-matrix depends similarly on div(u). So the same upper block-triangular preconditioner should be applicable, maybe with tweaks to the inner preconditioners.
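
Schematically, each Newton step then yields a linear system with the familiar saddle point structure

  [ A   B^T ] [ du ]     [ r_u ]
  [ B    0  ] [ dp ]  = -[ r_p ],

with A the (symmetric) linearized stiffness matrix and B stemming from the linearization of the constraint det(F) - 1 = 0.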

I tried switching inner preconditioners for the A-matrix (AMG with one cycle, CG with ILU like in step-56), but the huge number of inner iterations for A persists --- even for the more stable Q2-P1 element and at small loadings. AMG didn’t notably help reduce iterations either.

I can still solve moderate-sized problems with a direct solver, so invertibility is not the issue. I also double-checked that the matrices A and B are on similar scales by comparing their Frobenius norms, so scaling should not be the main problem either.

Any recommendations or insights on what might need to be adapted in the preconditioner design when moving from Stokes to incompressible nonlinear elasticity?

Thanks in advance!

Best,
Simon

Wolfgang Bangerth

Sep 7, 2025, 6:14:57 PM
to dea...@googlegroups.com

Simon:

> I recently modified step-22 and step-56 (the Stokes problem) by switching the
> usual Q2-Q1 element to a cheaper Q1-P0 element, i.e., linear Lagrange
> polynomials for displacement and element-wise constant pressure.
>
> What I noticed is that using Q1-P0 significantly increases the cost of the
> linear solve. Here are some example iteration counts from step-56:
>
> *Q2-Q1:*
>
> * Cycle 2: 21 (Outer), 320 Inner (A), 22 Inner (S)
> * Cycle 3: 21 (Outer), 592 Inner (A), 22 Inner (S)
>
> *Q1-P0:*
>
> * Cycle 2: 258 (Outer), 2313 Inner (A), 258 Inner (S)
> * Cycle 3: 811 (Outer), 12600 Inner (A), 811 Inner (S)
>
> I looked at a paper co-authored by Wolfgang (link here
> <https://www.math.colostate.edu/~bangerth/publications/2021-q1q1.pdf>),
> and Fig. 8 suggests the FGMRES iteration count
> shouldn’t blow up this much. But clearly, the true bottleneck is the inner
> iterations for A.
>
> My understanding is that using the upper block-triangular preconditioner with
> ILU for both the mass matrix and the A-matrix should still be reasonable when
> lowering the polynomial degree.
>
> Does anyone (maybe the authors of the tutorials?) have an idea why the
> iteration counts increase so dramatically with the Q1-P0 element?

I interpret the numbers you show slightly differently. For both elements, you
need about 10-30 inner A iterations for each outer iteration. That's actually
quite a good number. It means that the top left block can be inverted easily
in either case, and that's of course also what I would expect. In particular,
you need more inner A iterations when using the Q2 element than for the Q1
element -- which I'd expect to be the case.
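For example, in cycle 3 that is 592/21, or about 28, inner A iterations per
outer iteration for Q2-Q1, versus 12600/811, or about 16, for Q1-P0.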

The difference is in the outer iterations. That, too, is what I would expect:
The Q1-P0 element is unstable, so there is one (or a few) eigenvalues that go
to zero even if the rest stays the same under mesh refinement. In other words,
the condition number grows substantially more than what you'd see just from
mesh refinement, and that reflects in the number of outer iterations. I can't
say why that isn't reflected in the paper (different test case?) but what you
see does not surprise me.
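
(If I remember the standard estimate correctly: with the pressure mass matrix as
Schur complement approximation, the eigenvalues of the preconditioned Schur
complement are bounded below by the square of the discrete inf-sup constant, and
for Q1-P0 that constant is not bounded away from zero under refinement -- think
of the checkerboard pressure modes.)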

In short, I think the message should be: Don't use unstable elements.


> My goal is actually not to solve Stokes, but to develop a *preconditioner*
> for *incompressible nonlinear elasticity*, solving:
>
> * div(P(F(u))) = 0 with P the stress and F the deformation gradient,
> * det(F) - 1      = 0.
>
> The discretized system looks structurally similar to Stokes, especially since
> the A-matrix (stiffness matrix) is symmetric and the B-matrix depends
> similarly on div(u). So the same upper block-triangular preconditioner should
> be applicable, maybe with tweaks to the inner preconditioners.

Is a Q1-P0 discretization of this system stable? I would expect the same issue
as with Stokes here.

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Simon

Sep 9, 2025, 4:07:20 PM
to deal.II User Group
" I interpret the numbers you show slightly differently. For both elements, you
need about 10-30 inner A iterations for each outer iteration. That's actually
quite a good number. It means that the top left block can be inverted easily
in either case, and that's of course also what I would expect. In particular,
you need more inner A iterations when using the Q2 element than for the Q1
element -- which I'd expect to be the case.

The difference is in the outer iterations. That, too, is what I would expect:
The Q1-P0 element is unstable, so there is one (or a few) eigenvalues that go
to zero even if the rest stays the same under mesh refinement. In other words,
the condition number grows substantially more than what you'd see just from
mesh refinement, and that reflects in the number of outer iterations. I can't
say why that isn't reflected in the paper (different test case?) but what you
see does not surprise me.

In short, I think the message should be: Don't use unstable elements. "

I am aware that choosing an unstable element is generally not recommended.
However, despite not being LBB stable, the Q1-P0 element is still commonly used 
in solid mechanics and included in commercial codes. In practice, it performs 
quite well for my benchmark problems and the computational savings for not 
choosing quadratic displacement elements are significant.

Besides the condition number, I am wondering whether the increased outer iterations I presented
could stem from a poor approximation of the Schur complement, S = B*A^-1*B^T?
Following one of the approaches PETSc uses for saddle point systems (see here:
https://petsc.org/main/manualpages/PC/PCFieldSplitSetSchurPre/),
I tried approximating the Schur complement as
S = B*diag(A)^-1*B^T 
and obtained the following results:
  • Cycle 2: 41 (Outer), 434 Inner (A), 3379 Inner (S)
  • Cycle 3: 82 (Outer), 1669 Inner (A), 13308 Inner (S)
The outer FGMRES iterations are quite reasonable, but the number of inner iterations for solving
with the Schur complement clearly increased significantly.
Loosely speaking, did I simply shift the conditioning issue from the outer solve to the inner Schur solve?
Using AMG for the Schur solve helped to reduce the inner iteration counts for S to
530 (Cycle 2) and 2746 (Cycle 3). However, the total CPU time remained similar to that of using SparseILU.

Given all this, do you have any recommendations for a more effective Schur complement approximation
and/or preconditioner? Ideally, this can be generalized to incompressible elasticity as well.

Thank you,
Simon

Wolfgang Bangerth

Sep 9, 2025, 8:05:55 PM
to dea...@googlegroups.com
On 9/9/25 14:07, Simon wrote:
>
> I am aware that choosing an unstable element is generally not recommended.
> However, despite not being LBB stable, the Q1-P0 element is still commonly used
> in solid mechanics and included in commercial codes. In practice, it performs
> quite well for my benchmark problems and the computational savings for not
> choosing quadratic displacement elements are significant.

Yes, that's true. As mentioned in the paper you referenced, the element is
also widely used in the geosciences. Of course, the issue with the small
eigenvalues is a problem for solvers everywhere else too.


> Besides the condition number, I am wondering whether the increased outer
> iterations I presented
> could stem from a poor approximation of the Schur complement, S = B*A^-1*B^T?
> Following one of the approaches PETSc uses for saddle point systems (see here
> <https://petsc.org/main/manualpages/PC/PCFieldSplitSetSchurPre/>),
> I tried approximating the Schur complement as
> S = B*diag(A)^-1*B^T
> and obtained the following results:
>
> * Cycle 2: 41 (Outer), 434 Inner (A), 3379 Inner (S)
> * Cycle 3: 82 (Outer), 1669 Inner (A), 13308 Inner (S)

That's much better than what you had before. What option specifically did you try?


> The outer FGMRES iterations are quite reasonable, but the number of inner
> iterations for solving with the Schur complement clearly increased
> significantly.
> Loosely speaking, did I simply shift the conditioning issue from the outer
> solve to the inner Schur solve?
> Using AMG for the Schur solve helped to reduce the inner iteration counts
> for S to 530 (Cycle 2) and 2746 (Cycle 3). However, the total CPU time
> remained similar to that of using SparseILU.
>
> Given all this, do you have any recommendations for a more effective Schur
> complement approximation
> and/or preconditioner? Ideally, this can be generalized to incompressible
> elasticity as well.

I don't have any good suggestions. You probably already saw that, but it's
worth reading through the "Possibilities for extensions" section of step-22.

I had a graduate student who spent 3 years working on finding preconditioners
for a block system, with only moderate success. I'm currently working on a
problem where I'm also having great trouble finding decent preconditioners,
despite trying the best ideas I have on that topic. Preconditioning block
systems is hard :-( That's why people still write papers about it, and why the
issue is mentioned in many tutorials.

Simon

Sep 10, 2025, 5:41:31 AM
to deal.II User Group
" That's much better than what you had before. What option specifically did you try?"

In the original tutorial code (step-56), the Schur complement was approximated by the mass matrix on the pressure space.
However, this is likely a poor approximation if the pressure is element-wise constant.
PETSc offers several options for approximating the Schur complement.
One such option is "selfp", where the approximation reads
S = B*diag(A)^-1*B^T.
This is clearly not a bad choice, considering the exact Schur complement is
S = B*A^-1*B^T.
The dealii::SparseMatrix class provides an mmult function to implement this matrix product.
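
As an untested sketch, this is roughly what I do (the variable and function
names are mine, not from the tutorial):

#include <deal.II/lac/block_sparse_matrix.h>
#include <deal.II/lac/sparse_ilu.h>
#include <deal.II/lac/sparsity_pattern.h>
#include <deal.II/lac/vector.h>

using namespace dealii;

// system_matrix is the usual block matrix [A, B^T; B, 0], so
// block(1,0) = B and block(0,1) = B^T.
void build_schur_approximation(const BlockSparseMatrix<double> &system_matrix,
                               SparsityPattern                 &schur_sparsity,
                               SparseMatrix<double>            &schur_approx,
                               SparseILU<double>               &ilu_S)
{
  // diag(A)^{-1}, stored as a vector.
  Vector<double> inv_diag_A(system_matrix.block(0, 0).m());
  for (unsigned int i = 0; i < inv_diag_A.size(); ++i)
    inv_diag_A(i) = 1.0 / system_matrix.block(0, 0).diag_element(i);

  // S = B * diag(A)^{-1} * B^T: mmult(C, X, V) computes C = this * diag(V) * X
  // and by default rebuilds the sparsity pattern attached to C.
  schur_approx.reinit(schur_sparsity);
  system_matrix.block(1, 0).mmult(schur_approx,
                                  system_matrix.block(0, 1),
                                  inv_diag_A);

  // ILU of the assembled approximation, used inside the inner CG solve for S.
  ilu_S.initialize(schur_approx);
}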

To answer your question, using this "selfp" approximation, the number of outer iterations
was reduced to 41 (Cycle 2) and 82 (Cycle 3), but the inner iterations for solving with S
increased disproportionately to 3379 (Cycle 2) and 13308 (Cycle 3).
I continued using SolverCG with SparseILU, since S = B*A^-1*B^T is symmetric and positive definite. 
Reducing the inner iteration counts for S would make this approach competitive with 
the performance observed for the stable Q2-Q1 element. 
But SolverCG + SparseILU (or AMG) is likely a reasonable choice for solving S*y=z with 
S = B*diag(A)^-1*B^T, right?

PETSc also offers the "self" option, which uses the Schur complement itself,
preconditioned with a Least Squares Commutator (PCLSC, see here:
https://petsc.org/release/manualpages/PC/PCLSC/).
I am not yet familiar with the math behind it, so I cannot assess whether
this approach would yield further improvement.

Best,
Simon

Wolfgang Bangerth

Sep 10, 2025, 9:33:55 AM
to dea...@googlegroups.com
On 9/10/25 03:41, Simon wrote:
>
> To answer your question, using this "selfp" approximation, the number of
> outer iterations was reduced to 41 (Cycle 2) and 82 (Cycle 3), but the inner
> iterations for solving with S increased disproportionately to 3379 (Cycle 2)
> and 13308 (Cycle 3).
> I continued using SolverCG with SparseILU, since S = B*A^-1*B^T is symmetric
> and positive definite.
> Reducing the inner iteration counts for S would make this approach competitive
> with
> the performance observed for the stable Q2-Q1 element.
> But SolverCG + SparseILU (or AMG) is likely a reasonable choice for solving
> S*y=z with
> S = B*diag(A)^-1*B^T, right?

Yes. I don't think you can do better than CG + AMG for this matrix. And for
Q1-P0, multiplying out B*diag(A)^-1*B^T isn't even going to lead to too
dense a matrix.
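
Untested, but assuming you have the assembled matrix at hand and your deal.II
is configured with Trilinos, setting that up should look something like the
following sketch:

#include <deal.II/lac/sparse_matrix.h>
#include <deal.II/lac/trilinos_precondition.h>

using namespace dealii;

// schur_approx: the assembled B*diag(A)^-1*B^T; amg_S is then handed to the
// inner CG solver as its preconditioner.
void setup_schur_amg(const SparseMatrix<double>        &schur_approx,
                     TrilinosWrappers::PreconditionAMG &amg_S)
{
  TrilinosWrappers::PreconditionAMG::AdditionalData amg_data;
  amg_data.elliptic = true; // the matrix is Laplacian-like
  amg_S.initialize(schur_approx, amg_data);
}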


> PETSc also offers the "self" option, which uses the Schur complement itself,
> preconditioned with a Least Squares Commutator (PCLSC, see here:
> https://petsc.org/release/manualpages/PC/PCLSC/).
> I am not yet familiar with the math behind it, so I cannot assess whether
> this approach would yield further improvement.

That's the state of the art for problems with constant viscosity. For problems
with variable coefficients, you may want to look at
https://epubs.siam.org/doi/10.1137/16M108450X, but I think that in your case
the difficulty isn't due to spatially variable coefficients, but due to the
one run-away small eigenvalue.
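
For reference, the LSC idea is, roughly,
  inv(S) ~ inv(B*B^T) * (B*A*B^T) * inv(B*B^T),
usually with an additional diagonal scaling (e.g., by the diagonal of the
velocity mass matrix), so each application costs two solves with the
Poisson-like matrix B*B^T plus a few matrix-vector products.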

Best
W.