Reason for SolverGMRES being slower in parallel?


Feimi Yu

Mar 22, 2018, 1:31:44 PM
to deal.II User Group
Hi,

I'm solving a Schur complement in my preconditioner for a SUPG-stabilized fluid
solver and will be using an ILU(0) decomposition.
Since BlockJacobi did not work well, I wrote a wrapper for Pilut, a parallel ILUT
decomposition package in Hypre that is wrapped by PETSc. I tested my Pilut
preconditioner with tutorial step-17 (the MPI elasticity problem) and everything
looks fine. However, when I apply it to my own code, running with 2 ranks takes
more time than with 1, specifically in solving the Schur complement where Pilut
is applied.
I tried outputting the total iteration count of the solver: with more ranks,
fewer iterations are actually used. I don't understand why fewer iterations take
more time. Is there any potential reason for that?
Something that came to mind, but I'm not sure about: my Schur complement is
actually not a matrix, but a derived class that only defines vmult. For that
reason, my solver is a dealii::SolverGMRES instead of PETScWrappers::SolverGMRES.
However, my tests with step-17 did not show significant differences between the
two.
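
In case it helps, the operator is roughly of the following shape (a simplified
sketch with made-up names, not my actual code); all that dealii::SolverGMRES
needs from it is the vmult():

#include <deal.II/base/index_set.h>
#include <deal.II/lac/petsc_block_sparse_matrix.h>  // petsc_parallel_*.h on older releases
#include <deal.II/lac/petsc_precondition.h>
#include <deal.II/lac/petsc_vector.h>
#include <deal.II/lac/solver_control.h>
#include <deal.II/lac/solver_gmres.h>

using namespace dealii;

// Matrix-free Schur complement S = C A^{-1} B of a block system
// [[A, B], [C, 0]]; only vmult() is provided.
class SchurComplement
{
public:
  SchurComplement(const PETScWrappers::MPI::BlockSparseMatrix &m,
                  const PETScWrappers::PreconditionBase       &A_preconditioner,
                  const IndexSet &owned_dofs_block_0,
                  const MPI_Comm  mpi_communicator)
    : matrix(m)
    , A_preconditioner(A_preconditioner)
    , tmp1(owned_dofs_block_0, mpi_communicator)
    , tmp2(owned_dofs_block_0, mpi_communicator)
  {}

  void vmult(PETScWrappers::MPI::Vector       &dst,
             const PETScWrappers::MPI::Vector &src) const
  {
    matrix.block(0, 1).vmult(tmp1, src);              // tmp1 = B x

    // inner solve: tmp2 = A^{-1} tmp1
    SolverControl inner_control(1000, 1e-8 * tmp1.l2_norm());
    SolverGMRES<PETScWrappers::MPI::Vector> inner_gmres(inner_control);
    inner_gmres.solve(matrix.block(0, 0), tmp2, tmp1, A_preconditioner);

    matrix.block(1, 0).vmult(dst, tmp2);              // dst = C A^{-1} B x
  }

private:
  const PETScWrappers::MPI::BlockSparseMatrix &matrix;
  const PETScWrappers::PreconditionBase       &A_preconditioner;
  mutable PETScWrappers::MPI::Vector           tmp1, tmp2;
};

The Pilut preconditioner is then what I pass to the outer dealii::SolverGMRES
that solves with this operator.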

Thanks!
Feimi

Wolfgang Bangerth

Mar 27, 2018, 1:45:41 PM
to dea...@googlegroups.com

> I'm solving a Schur complement in my preconditioner for a SUPG-stabilized
> fluid solver and will be using an ILU(0) decomposition.

That's generally not a good preconditioner -- it will lead to large
numbers of iterations in your linear solver if the problem becomes
large. But that's tangential to your point...


> Since BlockJacobi did not work well, I wrote a wrapper for Pilut, a parallel
> ILUT decomposition package in Hypre that is wrapped by PETSc. I tested my
> Pilut preconditioner with tutorial step-17 (the MPI elasticity problem) and
> everything looks fine.

Interesting. Would you be willing to contribute this wrapper?


> However, when I apply it to my own code, running with 2 ranks takes more
> time than with 1, specifically in solving the Schur complement where Pilut
> is applied.
> I tried outputting the total iteration count of the solver: with more ranks,
> fewer iterations are actually used. I don't understand why fewer iterations
> take more time. Is there any potential reason for that?

It often comes down to how much communication you do. If your problem is
small, then there is not very much work to do for each process, and much
time is spent on communication between processes. A typical rule of
thumb is that your problem needs to have at least 50,000 to 100,000
unknowns per process for computations to offset communication.
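
(A quick way to see where you stand relative to this rule of thumb is to have
every process print how many unknowns it owns, along these lines -- names are
of course placeholders:)

#include <deal.II/base/mpi.h>
#include <deal.II/dofs/dof_handler.h>
#include <iostream>

template <int dim>
void print_dofs_per_process(const dealii::DoFHandler<dim> &dof_handler,
                            const MPI_Comm                 mpi_communicator)
{
  // Each process reports the number of locally owned unknowns.
  std::cout << "rank "
            << dealii::Utilities::MPI::this_mpi_process(mpi_communicator)
            << " owns " << dof_handler.n_locally_owned_dofs()
            << " of "   << dof_handler.n_dofs() << " DoFs" << std::endl;
}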


> Something that came to mind, but I'm not sure about: my Schur complement is
> actually not a matrix, but a derived class that only defines vmult. For that
> reason, my solver is a dealii::SolverGMRES instead of
> PETScWrappers::SolverGMRES.
> However, my tests with step-17 did not show significant differences between
> the two.

I don't think that is the problem -- the two GMRES implementations are
comparable in performance.

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Feimi Yu

Apr 3, 2018, 9:53:20 PM
to deal.II User Group
Hi Wolfgang,

Thanks for your reply!

Of course, I will make a pull request after the preconditioner is fully commented & tested,
although it was a really small addition to the code...

I'm sorry I was not clear in explaining my problem. I have a block matrix to
solve, and the ILU is used when inverting the Schur complement (the ILU of an
approximation of the Schur complement serves as the preconditioner for that
solve) in my DIY preconditioner.
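
The rough shape of the preconditioner is something like this (a heavily
simplified sketch with placeholder names; my actual code differs in the
details, and the off-diagonal coupling term is omitted here for brevity):

#include <deal.II/lac/petsc_block_vector.h>
#include <deal.II/lac/petsc_precondition.h>
#include <deal.II/lac/petsc_vector.h>
#include <deal.II/lac/solver_control.h>
#include <deal.II/lac/solver_gmres.h>

using namespace dealii;

// Block preconditioner: an approximate A^{-1} on the first block and, on the
// second block, a GMRES solve with the Schur complement operator S,
// preconditioned by the ILU of an approximation of S (e.g. Pilut).
template <class SchurOperator>
class BlockSchurPreconditioner
{
public:
  BlockSchurPreconditioner(const PETScWrappers::PreconditionBase &A_inverse_approx,
                           const SchurOperator                   &S,
                           const PETScWrappers::PreconditionBase &ilu_of_approx_S)
    : A_inverse_approx(A_inverse_approx), S(S), ilu_of_approx_S(ilu_of_approx_S)
  {}

  void vmult(PETScWrappers::MPI::BlockVector       &dst,
             const PETScWrappers::MPI::BlockVector &src) const
  {
    // Schur complement block: solve S y = src_1, preconditioned with the
    // ILU(T) factorization of an approximation of S.
    SolverControl control(1000, 1e-6 * src.block(1).l2_norm());
    SolverGMRES<PETScWrappers::MPI::Vector> gmres(control);
    gmres.solve(S, dst.block(1), src.block(1), ilu_of_approx_S);

    // First block: apply the approximate inverse of A.
    A_inverse_approx.vmult(dst.block(0), src.block(0));
  }

private:
  const PETScWrappers::PreconditionBase &A_inverse_approx;
  const SchurOperator                   &S;
  const PETScWrappers::PreconditionBase &ilu_of_approx_S;
};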

I tried different numbers of DoFs as you suggested, and the results are very
weird: for the same case with different numbers of global refinements, only at
about 26k DoFs is the run with 2 cores faster than with 1 core; both coarser
and finer meshes make the solver slower in parallel (I also tried BlockJacobi
and the results are almost the same). Could it be that my communication has a
problem, or that the matrix is not actually distributed? I don't understand
it....

Thanks!
Feimi

Wolfgang Bangerth

Apr 3, 2018, 11:29:53 PM
to dea...@googlegroups.com, Feimi Yu

Feimi,

> Of course, I will make a pull request after the preconditioner is fully
> commented & tested, although it was a really small addition to the code...

Excellent -- we are always glad for any kind of contribution, large or small.


> I'm sorry I was not clear in explaining my problem. I have a block matrix to
> solve, and the ILU is used when inverting the Schur complement (the ILU of an
> approximation of the Schur complement serves as the preconditioner for that
> solve) in my DIY preconditioner.
>
> I tried different numbers of DoFs as you suggested, and the results are very
> weird: for the same case with different numbers of global refinements, only
> at about 26k DoFs is the run with 2 cores faster than with 1 core; both
> coarser and finer meshes make the solver slower in parallel (I also tried
> BlockJacobi and the results are almost the same). Could it be that my
> communication has a problem, or that the matrix is not actually distributed?
> I don't understand it....

That's such a small problem. The fastest way to solve problems that are so
small is to use a direct solver. I'll point to this again:

> It often comes down to how much communication you do. If your problem is
> small, then there is not very much work to do for each process, and much
> time is spent on communication between processes. A typical rule of thumb
> is that your problem needs to have at least 50,000 to 100,000 unknowns per
> process for computations to offset communication.
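
(If you want to try that at your current sizes: assuming the system is
assembled as a PETSc matrix, deal.II's MUMPS interface makes this a few lines.
A sketch; header names are those of recent releases:)

#include <deal.II/lac/petsc_solver.h>
#include <deal.II/lac/petsc_sparse_matrix.h>
#include <deal.II/lac/petsc_vector.h>
#include <deal.II/lac/solver_control.h>

void solve_directly(const dealii::PETScWrappers::MPI::SparseMatrix &A,
                    dealii::PETScWrappers::MPI::Vector             &x,
                    const dealii::PETScWrappers::MPI::Vector       &b)
{
  dealii::SolverControl control;   // tolerances are essentially irrelevant here
  dealii::PETScWrappers::SparseDirectMUMPS solver(control);
  solver.solve(A, x, b);           // parallel sparse direct solve via MUMPS
}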


As to the concrete slowdown you see, I would suggest also looking at the number
of GMRES iterations you need. If the number of iterations depends significantly
on the number of cores, then you're in trouble. Good, scalable methods result
in an iteration count that does not depend on the number of cores in a
significant way.
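
(The iteration count is easy to record, since SolverControl remembers it after
each solve -- a sketch with placeholder names:)

#include <deal.II/lac/petsc_vector.h>
#include <deal.II/lac/solver_control.h>
#include <deal.II/lac/solver_gmres.h>

// Returns the number of GMRES iterations the Schur complement solve needed.
template <class Operator, class Preconditioner>
unsigned int solve_schur(const Operator                           &S,
                         dealii::PETScWrappers::MPI::Vector       &solution,
                         const dealii::PETScWrappers::MPI::Vector &rhs,
                         const Preconditioner                     &pilut)
{
  dealii::SolverControl control(1000, 1e-8 * rhs.l2_norm());
  dealii::SolverGMRES<dealii::PETScWrappers::MPI::Vector> gmres(control);
  gmres.solve(S, solution, rhs, pilut);
  return control.last_step();   // iterations actually used
}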

Feimi Yu

Apr 4, 2018, 10:38:00 AM
to deal.II User Group
I wish I only had to deal with small problems, but this is just a test problem
and we will need this code for much larger 3-D cases. Right now it is too slow
for me to test anything larger than 400k DoFs; at 400k DoFs, 2 cores are still
slower than 1 core.

The weirdest thing is that the number of iterations does not change much. With
BlockJacobi it changes, but not significantly. Sometimes with Pilut the number
of iterations even decreases, yet the solve takes more time.

Wolfgang Bangerth

Apr 5, 2018, 7:42:17 PM
to dea...@googlegroups.com
On 04/04/2018 08:38 AM, Feimi Yu wrote:
> I wish I only had to deal with small problems, but this is just a test
> problem and we will need this code for much larger 3-D cases. Right now it
> is too slow for me to test anything larger than 400k DoFs; at 400k DoFs,
> 2 cores are still slower than 1 core.
>
> The weirdest thing is that the number of iterations does not change much.
> With BlockJacobi it changes, but not significantly. Sometimes with Pilut the
> number of iterations even decreases, yet the solve takes more time.

It's hard to tell what exactly the problem is without seeing the code and/or
more details. How many outer iterations do you need? How many iterations do
you need to solve for the Schur complement?

I would say that a good preconditioner for a problem with constant
coefficient should generally not require more than, say, 50-100 outer
iterations, and at most a similar number of inner iterations.

Maybe explain what your matrix block structure looks like, and what your block
solver/block preconditioner looks like?

Weixiong Zheng

Apr 7, 2018, 1:15:58 AM
to deal.II User Group
I would guess this is due to the degradation of ILUT in parallel rather than to
the implementation.

If I am not mistaken, ILUT in parallel is no longer an ILU-like preconditioner,
right? I remember the inter-process part is implemented in a block-Jacobi
fashion.
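
(For what it's worth, that is also what PETSc itself does by default in
parallel: unless told otherwise it uses block Jacobi across processes with
ILU(0) on each process-local block, i.e. effectively

  -pc_type bjacobi -sub_pc_type ilu

so a "parallel ILU" obtained that way is a genuinely different preconditioner
from serial ILU.)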

I've never used ILUT, but I have used ILU a lot. In my experience, it is quite
common for certain problems that serial ILU outperforms parallel ILU. I ran
into this in problems where elliptic equations in subdomains are connected by
hyperbolic-type (upwinding) interface conditions.

On Thursday, April 5, 2018 at 4:42:17 PM UTC-7, Wolfgang Bangerth wrote:

Feimi Yu

Apr 8, 2018, 5:01:58 PM
to deal.II User Group
Hi Weixiong,

I did consider this issue, which is why I wanted to avoid a fake ILU like
BlockJacobi. In my experiments, BlockJacobi causes more iterations and more
computation time. However, PILUT is different: it should produce the same
matrices in parallel. The algorithm used for PILUT comes from this article:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.35.4684&rep=rep1&type=pdf

Their paper reports an attractive speedup in parallel. In my case, the number
of iterations did not change much either (sometimes it even decreased).
One thing I'm curious about is that my parallel performance depends strongly on
the time step size I choose, or more specifically on the properties (maybe the
condition number) of the matrix. When a small time step size is chosen, the
speedup improves noticeably. So I would agree with your opinion: maybe it is
something inside the preconditioner, not my implementation.
By the way, I use PILUT to solve the Schur complement, which is supposed to be
very stiff.
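
For reference, at the PETSc level the wrapper does little more than select
hypre's Pilut, roughly like this (a sketch, not the exact code):

#include <petscksp.h>

// Attach hypre's Pilut (parallel threshold ILU) to a PETSc KSP.
PetscErrorCode use_pilut(KSP ksp)
{
  PetscErrorCode ierr;
  PC             pc;
  ierr = KSPGetPC(ksp, &pc);          CHKERRQ(ierr);
  ierr = PCSetType(pc, PCHYPRE);      CHKERRQ(ierr);  // delegate to hypre
  ierr = PCHYPRESetType(pc, "pilut"); CHKERRQ(ierr);  // choose the Pilut variant
  // Fill and drop tolerance can then be adjusted from the command line, e.g.
  //   -pc_hypre_pilut_maxiter 20 -pc_hypre_pilut_tol 1e-3
  return 0;
}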

Thanks!
Feimi