Is parallel direct solver extremely slow?


Pai Liu

Aug 29, 2018, 11:55:19 PM
to deal.II User Group
Hi all,

I learned from Prof. Wolfgang's lectures that a parallel direct solver can be really competitive compared with a parallel iterative solver.
So I tried to change the solver in step-17 into a parallel direct solver, and I found it extremely slow.

The following are the details of how I modified step-17, and the timings I got:

Modifications (the cpp file is attached):

1. added timing for the solving process;

2. meshed a 3D case with 40*40*40 cells, and only ran cycle 0 (i.e. did not refine the mesh);

3. replaced the default CG solve in step-17 with PETScWrappers::SparseDirectMUMPS (a sketch of this change is shown below).
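
For reference, here is a minimal sketch of what the modified solve() looks like. The member names follow step-17; the actual code is in the attached file, so treat this as an illustration rather than as the attachment itself:

  template <int dim>
  unsigned int ElasticProblem<dim>::solve ()
  {
    TimerOutput::Scope t(computing_timer, "solve");

    SolverControl solver_control (solution.size(),
                                  1e-8 * system_rhs.l2_norm());

    // MUMPS, driven through PETSc, factorizes and solves in parallel:
    PETScWrappers::SparseDirectMUMPS solver (solver_control, mpi_communicator);
    solver.set_symmetric_mode (true);   // omitted in Case 3 below
    solver.solve (system_matrix, solution, system_rhs);

    Vector<double> localized_solution (solution);
    hanging_node_constraints.distribute (localized_solution);
    solution = localized_solution;

    return solver_control.last_step();
  }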

Timing (I ran all of the following cases with: mpirun -np 6 ./step-17):

Case 1. With step-17's default CG solver, the solving process takes about 1.13 s.

Case 2. With the PETScWrappers::SparseDirectMUMPS solver and solver.set_symmetric_mode(true), the solving process takes about 157 s.

Case 3. With the PETScWrappers::SparseDirectMUMPS solver and WITHOUT solver.set_symmetric_mode(true), the solving process takes about 328 s!




I expected the parallel direct solver to be somewhat slower than the CG solver, but since it takes so much more time in my tests, I suspect that something is not set up correctly.

Can anyone please help me with this? I really need a parallel direct solver, because I am studying cases that share the same boundary conditions and system_matrix but have several different system_rhs vectors.

Best,
Pai
step-17.cc

Wolfgang Bangerth

Aug 30, 2018, 12:37:58 PM
to dea...@googlegroups.com
On 08/29/2018 09:55 PM, Pai Liu wrote:
>
> I learned from Prof. Wolfgang's lectures that a parallel direct solver can
> be really competitive compared with a parallel iterative solver.
> So I tried to change the solver in step-17 into a parallel direct solver,
> and I found it *extremely slow*.

Direct solvers are only really competitive for relatively small problems
-- my rule of thumb is "less than 100,000 unknowns" in 2d. I don't know
how large your problem is, but your 40x40x40 mesh should have about
200,000 unknowns, so I'm not terribly surprised that the direct solver
is slower than the CG solver. Furthermore, 3d problems make life much
harder for direct solvers than 2d problems.

I am surprised that it is *that* much slower. My reference point is
UMFPACK, so it is possible that MUMPS is significantly slower than
UMFPACK. But, MUMPS can at least in principle run in parallel, so you
could use MPI to make things faster.

Interesting experiments would be:
* Try what happens if you run this in parallel
* Try how the run time of both the direct and iterative solvers
  changes as you increase the number of unknowns. (E.g., start with a
  10x10x10 mesh, then try a 20x20x20, ... mesh. One way to generate
  such a sequence of meshes is sketched below.)
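
One way to generate such a sequence of meshes is to replace step-17's
hyper_cube() call plus global refinement with
GridGenerator::subdivided_hyper_cube(). A minimal sketch; the function
and variable names are illustrative, not taken from the attached file:

  template <int dim>
  void make_mesh (Triangulation<dim> &triangulation,
                  const unsigned int  n_cells_per_direction)
  {
    triangulation.clear ();
    // An n x n x n subdivision of the cube [-1,1]^dim, matching step-17's domain:
    GridGenerator::subdivided_hyper_cube (triangulation,
                                          n_cells_per_direction,
                                          -1., 1.);
  }

  // e.g. make_mesh(triangulation, 10); then 20, 30, ... in successive runs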

Best
W.


--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Pai Liu

Aug 30, 2018, 9:41:50 PM
to deal.II User Group
Hi Wolfgang,

> Interesting experiments would be:
> * Try what happens if you run this in parallel
The timing in my question was already obtained with the command  "mpirun -np 6 ./step-17" (as mentioned in my question).


> * Try how the run time of both the direct and iterative solvers
>   changes as you increase the number of unknowns. (E.g., start with a
>   10x10x10 mesh, then try a 20x20x20, ... mesh.)

As suggested, I have added more tests, which are all executed with MPI in parallel:
(in the following, I only show the timing for solution)

Case 1. 10*10*10 cells:

Case 1.1 mpirun -np 2 ./xxxx
cg: 0.00621s      MUMPS (with symmetric setting): 0.0932s       MUMPS (without symmetric setting): 0.122s

Case 1.2 mpirun -np 6 ./xxxx
cg: 0.00479s      MUMPS (with symmetric setting): 0.0774s       MUMPS (without symmetric setting): 0.169s 

Case 2. 20*20*20 cells:

Case 2.1 mpirun -np 2 ./xxxx
cg: 0.0884s       MUMPS (with symmetric setting): 3.71s           MUMPS (without symmetric setting): 6.44s

Case 2.2 mpirun -np 6 ./xxxx
cg: 0.087s         MUMPS (with symmetric setting): 2.16s           MUMPS (without symmetric setting): 4.29s

Case 3. 30*30*30 cells:

Case 3.1 mpirun -np 2 ./xxxx
cg: 0.39s          MUMPS (with symmetric setting): 26.2s           MUMPS (without symmetric setting): 50.8s

Case 3.2 mpirun -np 6 ./xxxx
cg: 0.372s        MUMPS (with symmetric setting): 23.4s           MUMPS (without symmetric setting): 43.8s


So, in all of these tests, the parallel direct solver (MUMPS here) is still much slower than the parallel iterative solver (CG here).
I am not familiar with UMFPACK; maybe I will try that.


I have another question:
For a problem with millions of unknowns, the same Dirichlet boundary conditions, and several different right-hand sides (e.g. rhs1, rhs2, ..., rhs8), how can I speed up the solution process with (maybe) an iterative solver?
I think for a small number of unknowns I could use a parallel direct solver, which can reuse the factorization of the system matrix for rhs2-rhs8 after solving with rhs1 (a sketch of that reuse pattern is shown below).
But for a problem with millions of unknowns, I probably have to use an iterative solver for efficiency. So which solver or technique should I use to speed up the solution of such a multiple-load-case problem?
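
For concreteness, the factorization-reuse pattern I have in mind looks roughly like the following sketch with deal.II's serial SparseDirectUMFPACK. This is only an illustration (not MPI-parallel), and the function and variable names are made up:

  #include <deal.II/lac/sparse_direct.h>
  #include <deal.II/lac/sparse_matrix.h>
  #include <deal.II/lac/vector.h>
  #include <vector>

  using namespace dealii;

  void solve_all_rhs (const SparseMatrix<double>        &system_matrix,
                      const std::vector<Vector<double>> &rhs,
                      std::vector<Vector<double>>       &solutions)
  {
    SparseDirectUMFPACK direct;
    direct.initialize (system_matrix);        // the factorization happens once, here

    solutions.resize (rhs.size());
    for (unsigned int i = 0; i < rhs.size(); ++i)
      {
        solutions[i].reinit (rhs[i].size());
        direct.vmult (solutions[i], rhs[i]);  // each solve is only forward/backward substitution
      }
  }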

Best,
Pai

Bruno Turcksin

Aug 31, 2018, 8:56:04 AM
to deal.II User Group
Pai,


On Thursday, August 30, 2018 at 9:41:50 PM UTC-4, Pai Liu wrote:
> I have another question:
> For a problem with millions of unknowns, the same Dirichlet boundary
> conditions, and several different right-hand sides (e.g. rhs1, rhs2, ...,
> rhs8), how can I speed up the solution process with (maybe) an iterative
> solver?
> I think for a small number of unknowns I could use a parallel direct solver,
> which can reuse the factorization of the system matrix for rhs2-rhs8 after
> solving with rhs1.
> But for a problem with millions of unknowns, I probably have to use an
> iterative solver for efficiency. So which solver or technique should I use to
> speed up the solution of such a multiple-load-case problem?
With iterative solvers, what matters the most is to find an efficient preconditioner. Algebraic multigrid might be good for your application. The setup phase only requires the matrix of the system and thus, you can reuse the same preconditioner with different right-hand sides.
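
A minimal sketch of that reuse pattern with deal.II's PETSc wrappers. This assumes PETSc was built with Hypre so that PreconditionBoomerAMG is available; the include paths follow recent deal.II releases, and the function and variable names are illustrative:

  #include <deal.II/lac/petsc_precondition.h>
  #include <deal.II/lac/petsc_solver.h>
  #include <deal.II/lac/petsc_sparse_matrix.h>
  #include <deal.II/lac/petsc_vector.h>
  #include <deal.II/lac/solver_control.h>
  #include <vector>

  using namespace dealii;

  void solve_all_load_cases (const PETScWrappers::MPI::SparseMatrix        &system_matrix,
                             const std::vector<PETScWrappers::MPI::Vector> &rhs,
                             std::vector<PETScWrappers::MPI::Vector>       &solutions,
                             MPI_Comm                                       mpi_communicator)
  {
    // Build the AMG hierarchy once; it only depends on the matrix.
    PETScWrappers::PreconditionBoomerAMG::AdditionalData data;
    data.symmetric_operator = true;   // the elasticity operator is symmetric
    PETScWrappers::PreconditionBoomerAMG preconditioner (system_matrix, data);

    // solutions[i] are assumed to already be reinit()ed with the matrix' row partitioning.
    for (unsigned int i = 0; i < rhs.size(); ++i)
      {
        SolverControl solver_control (system_matrix.m(),
                                      1e-8 * rhs[i].l2_norm());
        PETScWrappers::SolverCG cg (solver_control, mpi_communicator);
        cg.solve (system_matrix, solutions[i], rhs[i], preconditioner);  // only this part repeats
      }
  }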

Best,

Bruno

Wolfgang Bangerth

Aug 31, 2018, 10:18:14 AM
to dea...@googlegroups.com

> * Try how the run time of both the direct and iterative solvers
>    change as you increase the number of unknowns. (E.g., start with a
>    10x10x10 mesh, then try a 20x20x20, ... mesh.)
>
> As suggested, I have added more tests, which are all executed with MPI in
> parallel:
> (*in the following, I only show the timing for solution*)
>
> *Case 1*. 10*10*10 cells:
>
> Case 1.1 mpirun -np 2 ./xxxx
> cg: 0.00621s      MUMPS (with symmetric setting): 0.0932s       MUMPS (without
> symmetric setting): 0.122s
>
> Case 1.2 mpirun -np 6 ./xxxx
> cg: 0.00479s      MUMPS (with symmetric setting): 0.0774s       MUMPS (without
> symmetric setting): 0.169s
>
> *Case 2.* 20*20*20 cells:
>
> Case 2.1 mpirun -np 2 ./xxxx
> cg: 0.0884s       MUMPS (with symmetric setting): 3.71s           MUMPS
> (without symmetric setting): 6.44s
>
> Case 2.2 mpirun -np 6 ./xxxx
> cg: 0.087s         MUMPS (with symmetric setting): 2.16s           MUMPS
> (without symmetric setting): 4.29s
>
> *Case 3*. 30*30*30 cells:
>
> Case 3.1 mpirun -np 2 ./xxxx
> cg: 0.39s          MUMPS (with symmetric setting): 26.2s           MUMPS
> (without symmetric setting): 50.8s
>
> Case 3.2 mpirun -np 6 ./xxxx
> cg: 0.372s        MUMPS (with symmetric setting): 23.4s           MUMPS
> (without symmetric setting): 43.8s

I have to admit that I find the CG times too small to be credible. The last
case should have about 200,000 unknowns. It seems implausible to me that you
can solve that in 0.4 seconds on 2 processors. What preconditioner do you use,
and do you include the time to build the preconditioner in this time?

My rule of thumb has always been that to solve a problem with 100,000 unknowns
on one processor, it takes about a minute. If you have a fast processor, then
maybe you can get that done in 20 or 30 seconds, so the times you quote for
MUMPS seem not out of the ordinary to me.


> I have another question:
> For a problem with millions of unknowns, the same Dirichlet boundary
> conditions, and several different right-hand sides (e.g. rhs1, rhs2, ...,
> rhs8), how can I speed up the solution process with (maybe) an iterative
> solver?
> I think for a small number of unknowns I could use a parallel direct solver,
> which can reuse the factorization of the system matrix for rhs2-rhs8 after
> solving with rhs1.
> But for a problem with millions of unknowns, I probably have to use an
> iterative solver for efficiency. So which solver or technique should I use to
> speed up the solution of such a multiple-load-case problem?

Bruno already gave the correct answer: Build an expensive preconditioner
because you only need to build it once. Of course, the best preconditioner is
an LU decomposition of the matrix, which is what a direct solver computes.

But you will need to expect that fundamentally, solving N problems with an
iterative solver requires N times as many operations as solving one (once you
have built the preconditioner). There are "block" variants of solvers such as
GMRES or CG that can be more efficient because they group these operations in
a more efficient way through vectorization or grouping communication, but they
fundamentally still have to do N times as many operations.

Pai Liu

Aug 31, 2018, 8:15:55 PM
to deal.II User Group
Hi Wolfgang,


> I have to admit that I find the CG times too small to be credible. The last
> case should have about 200,000 unknowns. It seems implausible to me that you
> can solve that in 0.4 seconds on 2 processors. What preconditioner do you use,
> and do you include the time to build the preconditioner in this time?

The number of unknowns in the third case is 89373 (31*31*31*3). I ran it on my PC with an i7-7700K CPU. The CG case uses PETSc's CG and PETSc's BlockJacobi preconditioner (the default setting in step-17). The timing includes both the time to build the preconditioner and the time to solve. I re-checked the timing, and it really is just about 0.4 s (to time CG, I only added the timing and changed the meshing; I made no other modifications to step-17).


> Bruno already gave the correct answer: Build an expensive preconditioner
> because you only need to build it once. Of course, the best preconditioner is
> an LU decomposition of the matrix, which is what a direct solver computes.
>
> But you will need to expect that fundamentally, solving N problems with an
> iterative solver requires N times as many operations as solving one (once you
> have built the preconditioner). There are "block" variants of solvers such as
> GMRES or CG that can be more efficient because they group these operations in
> a more efficient way through vectorization or grouping communication, but they
> fundamentally still have to do N times as many operations.

I understand the situation now. I am interested in the "block" variants of CG you mentioned; do TrilinosWrappers or PETScWrappers have that kind of CG variant?

Best,
Pai

Pai Liu

Aug 31, 2018, 8:21:03 PM
to deal.II User Group
Hi Bruno,

Your explanation makes things clear to me. Thanks.
In TrilinosWrappers I found two AMG preconditioners: PreconditionAMGMueLu and PreconditionAMG. What is the general difference between these two preconditioners? For a linear elasticity problem, which one should I use?

Best,
Pai

Wolfgang Bangerth

Aug 31, 2018, 10:00:51 PM
to dea...@googlegroups.com, Pai Liu

> The number of unknowns in the third case is 89373 (31*31*31*3). I ran it on
> my PC with an i7-7700K CPU. The CG case uses PETSc's CG and PETSc's
> BlockJacobi preconditioner (the default setting in step-17). The timing
> includes both the time to build the preconditioner and the time to solve. I
> re-checked the timing, and it really is just about 0.4 s (to time CG, I only
> added the timing and changed the meshing; I made no other modifications to
> step-17).

BlockJacobi builds an LU decomposition of that part of the matrix that is
stored locally. So it's a really expensive preconditioner to build (which I
gather you don't include in the time?) but then solves the problem in only a
few iterations. If you want a fair comparison, you need to include the time to
build the preconditioner.


> I understand the situation now. I am interested in the "block" variants of
> CG you mentioned; do TrilinosWrappers or PETScWrappers have that kind of CG
> variant?

There are no wrappers. I believe that Trilinos has them, and PETSc does not.
So you would have to interface with Trilinos directly.

Wolfgang Bangerth

Aug 31, 2018, 10:01:59 PM
to dea...@googlegroups.com, Pai Liu
On 08/31/2018 06:21 PM, Pai Liu wrote:
> In TrilinosWrappers I found two AMG preconditioners: PreconditionAMGMueLu
> and PreconditionAMG. What is the general difference between these two
> preconditioners? For a linear elasticity problem, which one should I use?

Pai -- PreconditionAMG uses the Trilinos package ML, whereas
PreconditionAMGMueLu uses the Trilinos package MueLu. MueLu is a more modern
implementation of the same ideas. I don't think they are vastly different in
performance.
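
For a vector-valued elasticity problem, the main thing worth knowing is that the AMG setup benefits from being told the constant modes (the near-null space of the operator). A minimal sketch with PreconditionAMG; the parameter values, function and variable names here are illustrative, not prescriptive:

  #include <deal.II/dofs/dof_handler.h>
  #include <deal.II/dofs/dof_tools.h>
  #include <deal.II/fe/component_mask.h>
  #include <deal.II/lac/trilinos_precondition.h>
  #include <deal.II/lac/trilinos_sparse_matrix.h>
  #include <vector>

  using namespace dealii;

  template <int dim>
  void setup_elasticity_amg (const DoFHandler<dim>                &dof_handler,
                             const TrilinosWrappers::SparseMatrix &system_matrix,
                             TrilinosWrappers::PreconditionAMG    &amg)
  {
    // Constant modes of the displacement field (one per vector component):
    std::vector<std::vector<bool>> constant_modes;
    DoFTools::extract_constant_modes (dof_handler,
                                      ComponentMask(),   // all components
                                      constant_modes);

    TrilinosWrappers::PreconditionAMG::AdditionalData data;
    data.constant_modes        = constant_modes;
    data.elliptic              = true;    // elasticity is an elliptic operator
    data.higher_order_elements = false;   // set to true for Q2 or higher elements
    data.smoother_sweeps       = 2;
    data.aggregation_threshold = 0.02;

    amg.initialize (system_matrix, data);  // done once, then reused for every solve
  }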

Pai Liu

Aug 31, 2018, 10:56:32 PM
to deal.II User Group
Hi Wolfgang,

Thank you so much for all your detailed explanation. Now I have a general idea of what all these things are and what I should do for my problem (a multiple load case problem). I really appreciate your kind help.

> BlockJacobi builds an LU decomposition of that part of the matrix that is
> stored locally. So it's a really expensive preconditioner to build (which I
> gather you don't include in the time?) but then solves the problem in only a
> few iterations. If you want a fair comparison, you need to include the time to
> build the preconditioner.
However, I would really like to figure out the timing question you mentioned. Here I attach the step-17.cc I modified.
I only added timing and modified the meshing code (to generate 30*30*30 cells), and nothing else.
I added timing to the member function solve() like the following:

  template <int dim>
  unsigned int ElasticProblem<dim>::solve ()
  {
    TimerOutput::Scope t(computing_timer, "solve");

    SolverControl           solver_control (solution.size(),
                                            1e-8*system_rhs.l2_norm());
    PETScWrappers::SolverCG cg (solver_control,
                                mpi_communicator);

    PETScWrappers::PreconditionBlockJacobi preconditioner(system_matrix);

    cg.solve (system_matrix, solution, system_rhs,
              preconditioner);

    Vector<double> localized_solution (solution);

    hanging_node_constraints.distribute (localized_solution);

    solution = localized_solution;

    return solver_control.last_step();
  }

Thus I think the timing includes both the time to build the preconditioner and the time to solve Ax = b.

And when I run this file with mpirun -np 2 ./step-17, it really takes only about 0.4 s to solve a problem with 30*30*30 cells (with all the boundary conditions unchanged from the original step-17 example).

Best,
Pai
step-17.cc

Wolfgang Bangerth

Sep 3, 2018, 10:30:26 PM
to dea...@googlegroups.com

> I only added timing and modified the meshing code (to generate 30*30*30
> cells), and nothing else.
> I added timing to the member function solve() like the following:
>
> template <int dim>
>   unsigned int ElasticProblem<dim>::solve ()
>   {
> *TimerOutput::Scope t(computing_timer, "solve");*
>
>     SolverControl           solver_control (solution.size(),
>                                             1e-8*system_rhs.l2_norm());
>     PETScWrappers::SolverCG cg (solver_control,
>                                 mpi_communicator);
>
>     PETScWrappers::PreconditionBlockJacobi preconditioner(system_matrix);
>
>     cg.solve (system_matrix, solution, system_rhs,
>               preconditioner);
>
>     Vector<double> localized_solution (solution);
>
>     hanging_node_constraints.distribute (localized_solution);
>
>     solution = localized_solution;
>
>     return solver_control.last_step();
>   }
>
> Thus I think the timing includes both the time to build the preconditioner
> and the time to solve Ax = b.

Yes, that looks correct. I'm impressed it goes this fast.

In your TimerOutput object, do you output wall time or CPU time? Do you
initialize it with the MPI communicator object?

Pai Liu

Sep 4, 2018, 2:30:03 AM
to deal.II User Group
Hi Wolfgang,

> In your TimerOutput object, do you output wall time or CPU time? Do you
> initialize it with the MPI communicator object?

I output wall time.
And I initialized it with "computing_timer(mpi_communicator, pcout, TimerOutput::summary, TimerOutput::wall_times)".
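
For completeness, a minimal self-contained sketch of that setup. The surrounding main() scaffolding here is illustrative; the TimerOutput constructor arguments are the ones quoted above:

  #include <deal.II/base/conditional_ostream.h>
  #include <deal.II/base/mpi.h>
  #include <deal.II/base/timer.h>

  #include <iostream>

  int main (int argc, char **argv)
  {
    using namespace dealii;

    Utilities::MPI::MPI_InitFinalize mpi_initialization (argc, argv, 1);
    MPI_Comm mpi_communicator = MPI_COMM_WORLD;

    ConditionalOStream pcout (std::cout,
                              Utilities::MPI::this_mpi_process(mpi_communicator) == 0);
    TimerOutput computing_timer (mpi_communicator, pcout,
                                 TimerOutput::summary,
                                 TimerOutput::wall_times);

    {
      TimerOutput::Scope t (computing_timer, "solve");
      // ... the work to be timed goes here ...
    }
  }   // the wall-time summary is printed on rank 0 when computing_timer is destroyed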

Best,
Pai

Wolfgang Bangerth

Sep 4, 2018, 5:27:43 PM
to dea...@googlegroups.com
On 09/04/2018 12:30 AM, Pai Liu wrote:
>
> In your TimerOutput object, do you output wall time or CPU time? Do you
> initialize it with the MPI communicator object?
>
>
> I output *wall time*.
> And *I initialized it* with "computing_timer(mpi_communicator, pcout,
> TimerOutput::summary, TimerOutput::wall_times)".

This looks correct. I guess the PETSc BlockJacobi solver is miraculously fast
in this case!

Uwe Köcher

Sep 6, 2018, 3:42:53 AM
to deal.II User Group
Dear Pai,

> > In TrilinosWrappers I found two AMG preconditioners: PreconditionAMGMueLu
> > and PreconditionAMG. What is the general difference between these two
> > preconditioners? For a linear elasticity problem, which one should I use?
>
> Pai -- PreconditionAMG uses the Trilinos package ML, whereas
> PreconditionAMGMueLu uses the Trilinos package MueLu. MueLu is a more modern
> implementation of the same ideas. I don't think they are vastly different in
> performance.
>
> Best
>   W.

I recently switched parts of my codes from PreconditionAMG to MueLu for a linear elasticity computation.
The performance difference is slight, but MueLu offers more opportunities to tune your preconditioner.
If you are starting an implementation nowadays, I would prefer MueLu.

Best
  Uwe

Pai Liu

Sep 10, 2018, 2:46:06 AM
to deal.II User Group
Hi Uwe,

Thank you very much for your information!

Best,
Pai

David F

Mar 26, 2019, 10:48:50 AM
to deal.II User Group
Dear Pai,

I'm very interested in solving a problem with characteristics very similar to yours. Consequently, I ran your modified step-17.cc for 30*30*30 cells, and for me it takes 6.43 s with CG and -np 2 instead of 0.39 s. Do you have any idea where this huge speed-up might come from? Is it maybe due to some optimized libraries that you are using? Right now I'm using the default ones that I can find in the OS repositories. I would really appreciate it if you could give me some hint about this, or tell me which strategy you found most effective for solving the same elastic system many times with different right-hand sides.

Best regards,
David.