SLEPc in Parallel Slower than Serial


Jonathan Russ

Jun 10, 2016, 11:30:27 AM
to deal.II User Group
Hello -

Does anyone have any experience with getting SLEPc to work efficiently in parallel? For some reason, SLEPc in parallel takes roughly 140 times longer to find one eigenvalue and eigenvector than the same solution in serial (i.e. with the same matrices). For example, I have a problem with 4,905 degrees of freedom. The serial solve (assembly and solution of one eigenpair) takes 0.16667 seconds, while the parallel solution (2 processors) takes > 24 seconds. For larger problems I haven't even bothered to wait for the parallel solution to finish to see how long it would actually take (the longest I waited was 45 minutes for a problem that took 1.2 minutes in serial).

For the parallel solution I am using the AMG preconditioner, but maybe this is a poor choice for this problem? I'm really not sure. The eigenvalue problem I am solving is:
K x = w M x, where K is the elastic stiffness matrix, M is the mass matrix, w is an eigenvalue (the square of an angular eigen-frequency of the system), and x is the eigenvector. Typically the shift-invert method works very well (as it certainly does in serial), but I can't quite figure out why SLEPc is so hopelessly slow in parallel. If anyone has any experience with this and would be willing to give me some advice, I would really appreciate it!
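For reference, the shift I hand to the spectral transformation below is sigma = (2 * pi * f_shift)^2 (the eigen_shift in the code), and, as far as I understand it, shift-invert then works with the transformed problem

  (K - sigma M)^{-1} M x = theta x,   theta = 1 / (w - sigma),

so the eigenvalues w closest to sigma become the dominant theta, which is why the method normally converges so quickly to the targeted eigenpair.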

Here is the parallel version of the "solve" function.

template <int dim>
unsigned int EigenvalueProblem<dim>::solve ()
{
  TimerOutput::Scope t (computing_timer, "solve");
  pcout << "   Number of eigenvectors requested: "
        << eigenvectors.size () << std::endl;

  // AMG-preconditioned CG as the inner solver for the shift-invert transformation:
  PETScWrappers::PreconditionBoomerAMG::AdditionalData data;
  data.symmetric_operator = true;
  PETScWrappers::PreconditionBoomerAMG preconditioner (mpi_communicator, data);
  SolverControl linear_solver_control (dof_handler.n_dofs (), 1.0e-12, false, false);
  PETScWrappers::SolverCG linear_solver (linear_solver_control, mpi_communicator);
  linear_solver.initialize (preconditioner);

  // Outer eigensolver:
  SolverControl solver_control (2000, 1e-9, false, false);
  SLEPcWrappers::SolverLanczos eigensolver (solver_control, mpi_communicator);

  // Shift-invert about the squared angular shift frequency:
  double shift_freq = parameters.get_double ("Shift frequency");
  double eigen_shift = std::pow (2.0 * PI * shift_freq, 2.0);
  SLEPcWrappers::TransformationShiftInvert::AdditionalData additional_data (eigen_shift);
  SLEPcWrappers::TransformationShiftInvert shift (mpi_communicator, additional_data);
  shift.set_solver (linear_solver);
  eigensolver.set_transformation (shift);
  eigensolver.set_which_eigenpairs (EPS_SMALLEST_REAL);
  eigensolver.set_problem_type (EPS_GHEP);

  eigensolver.solve (stiffness_matrix, mass_matrix, eigenvalues, eigenvectors,
                     eigenvectors.size ());

  for (unsigned int i = 0; i < eigenvectors.size (); ++i)
    eigenvectors[i] /= eigenvectors[i].linfty_norm ();

  // Finally return the number of iterations it took to converge:
  return solver_control.last_step ();
}

 And here is the serial version of the solve function:
template <int dim>
unsigned int EigenvalueProblem<dim>::solve ()
{
  std::cout << "   Number of eigenvectors requested: "
            << eigenvectors.size () << std::endl;

  SolverControl solver_control (2000, 1e-9);
  SLEPcWrappers::SolverKrylovSchur eigensolver (solver_control);
  eigensolver.set_which_eigenpairs (EPS_SMALLEST_REAL);
  eigensolver.set_problem_type (EPS_GHEP);

  // Shift-invert about the squared angular shift frequency:
  double eigen_shift = std::pow (2.0 * PI * parameters.get_double ("Shift frequency"), 2.0);
  SLEPcWrappers::TransformationShiftInvert::AdditionalData additional_data (eigen_shift);
  SLEPcWrappers::TransformationShiftInvert shift (MPI_COMM_WORLD, additional_data);
  eigensolver.set_transformation (shift);

  eigensolver.solve (stiffness_matrix, mass_matrix, eigenvalues, eigenvectors,
                     eigenvectors.size ());

  for (unsigned int i = 0; i < eigenvectors.size (); ++i)
    eigenvectors[i] /= eigenvectors[i].linfty_norm ();

  // Finally return the number of iterations it took to converge:
  return solver_control.last_step ();
}

Thank you again for taking a look at my problem,
Jonathan

Tobi Young

Jun 10, 2016, 3:44:53 PM
to dealii
Jonathan,

Interesting result. :-( It seems to me you are comparing apples to pears.

A few observations that make the situation very unclear for me. (i)
You are using two different solvers in two different ways. Your serial
code uses KrylovSchur and your parallel code uses Lanczos with some
kind of preconditioner. Why are you comparing these two solvers in
this way? (ii) Why do you have two different sets of code to solve the
same problem? The MPI code runs in serial as well.

Have you tried running your `parallel' code in serial (e.g. mpirun -np 1 ./application)?
Why are you starting out with a preconditioner that you do not know is useful for the problem? I do not know what your equation set is, so it is not possible for me to comment on this.

I can tell you from experience that, in general, SLEPc scales *very* well to larger numbers of MPI processes - almost linearly if you do things right - however, there is a caveat. SLEPc is designed for LARGE problems, and 4096 is a very small number of degrees of freedom. I do not even bother to run such a problem in parallel except to check that the code returns the same result; such a small system is not useful for benchmarking timings.

I suggest the following. Get rid of the serial code (parallel works in serial too). Revert your `parallel' code to use the KrylovSchur solver and don't use a preconditioner - that's the next step.
Refine your grid to, say, one million degrees of freedom and compare the timings for 1, 2, 4, etc. MPI processes. That is the first step to benchmarking, according to me. :-)

Then do the same with Lanczos, and then try the preconditioner thing (a sketch of what I mean by `reverted' is below).
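Just to be explicit (this is only a sketch, pieced together from the two versions you posted and untested on my end): the reverted parallel solve is essentially your serial function handed the MPI communicator, with the inner CG solver and the AMG preconditioner removed:

template <int dim>
unsigned int EigenvalueProblem<dim>::solve ()
{
  // Outer eigensolver only - no explicit inner solver, no preconditioner:
  SolverControl solver_control (2000, 1e-9, false, false);
  SLEPcWrappers::SolverKrylovSchur eigensolver (solver_control, mpi_communicator);

  const double eigen_shift
    = std::pow (2.0 * PI * parameters.get_double ("Shift frequency"), 2.0);
  SLEPcWrappers::TransformationShiftInvert::AdditionalData additional_data (eigen_shift);
  SLEPcWrappers::TransformationShiftInvert shift (mpi_communicator, additional_data);

  eigensolver.set_transformation (shift);
  eigensolver.set_which_eigenpairs (EPS_SMALLEST_REAL);
  eigensolver.set_problem_type (EPS_GHEP);

  eigensolver.solve (stiffness_matrix, mass_matrix, eigenvalues, eigenvectors,
                     eigenvectors.size ());

  for (unsigned int i = 0; i < eigenvectors.size (); ++i)
    eigenvectors[i] /= eigenvectors[i].linfty_norm ();

  return solver_control.last_step ();
}

If that scales from 1 to 2 to 4 processes, add Lanczos back, then the preconditioned inner solver, and you will see which step destroys the scaling.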

See how that works for you and then let me know what results you get.

One more thing to consider, is your K matrix sparse?

I hope that gives you a clue on finding out what the problem is.

Best,
Tobi

Jonathan Russ

Jun 10, 2016, 4:53:28 PM
to deal.II User Group
Hi Toby -

I have tried the KrylovSchur and Lanczos solvers in both the parallel and serial codes. The parallel code, when run with only one processor, takes roughly twice as long as with two processors (over 200 times slower than the actual serial version). I have the serial version because I wanted to see if it would be faster for "smaller" problems... It is. Much faster. Over 100 times faster for the problems that I have "tested" it with (up to 180,000 degrees of freedom; I have a small computer).

I am using a preconditioner because without it the parallel version of my code does not converge (regardless of how many processors I run it on).

Both matrices are sparse. I ultimately would like to solve the weak form of the linear momentum equation: density * u'' + div ( stress tensor ) = f, but I would like to do this by first solving the generalized eigenvalue problem, K x = w M x ... The stiffness matrix K is assembled precisely as in linear elasticity, and the mass matrix M is very simple. Here is the relevant piece of my "assemble_system" function showing exactly how K and M are constructed. There are no hanging nodes in my models.

for (unsigned int q_point = 0; q_point < n_q_points; ++q_point)
  for (unsigned int i = 0; i < dofs_per_cell; ++i)
    {
      const SymmetricTensor<2, dim> phi_i_symmgrad
        = fe_values[displacements].symmetric_gradient (i, q_point);
      const double phi_i_div
        = fe_values[displacements].divergence (i, q_point);

      for (unsigned int j = 0; j < dofs_per_cell; ++j)
        {
          const SymmetricTensor<2, dim> phi_j_symmgrad
            = fe_values[displacements].symmetric_gradient (j, q_point);
          const double phi_j_div
            = fe_values[displacements].divergence (j, q_point);

          // Linear elasticity bilinear form:
          //   lambda (div phi_i)(div phi_j) + 2 mu eps(phi_i) : eps(phi_j)
          cell_stiffness_matrix (i, j)
            += (phi_i_div * phi_j_div * lambda_value
                +
                2 * (phi_i_symmgrad * phi_j_symmgrad) * mu_value)
               * fe_values.JxW (q_point);

          // Mass matrix: density (phi_i . phi_j)
          cell_mass_matrix (i, j)
            += (fe_values[displacements].value (i, q_point) *
                fe_values[displacements].value (j, q_point) *
                density_value)
               * fe_values.JxW (q_point);
        }
    }

I think I will try a few more preconditioners to see if I am simply selecting the wrong one (to be honest, I have no idea which preconditioner is appropriate for this problem). If none of them seem to help, then I'll just use the serial version of my code for small problems and wait until there is a wrapper for a Trilinos eigensolver or a parallel ARPACK solver before trying the parallel version again. Thanks for trying to help me!
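For example, if I read the wrappers correctly, trying a plain Jacobi preconditioner should only mean replacing the BoomerAMG lines in my parallel solve() with something like the following (just a sketch; I'm assuming here that PETScWrappers::PreconditionJacobi offers the same communicator-based constructor that BoomerAMG does):

// Swap BoomerAMG for point Jacobi in the inner CG solver:
PETScWrappers::PreconditionJacobi::AdditionalData data;
PETScWrappers::PreconditionJacobi preconditioner (mpi_communicator, data);

SolverControl linear_solver_control (dof_handler.n_dofs (), 1.0e-12, false, false);
PETScWrappers::SolverCG linear_solver (linear_solver_control, mpi_communicator);
linear_solver.initialize (preconditioner);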

Jonathan

Jonathan Russ

Jun 10, 2016, 5:44:44 PM
to deal.II User Group
Using the Jacobi preconditioner, the solution of a single eigenpair required 17 minutes on 2 processors for a problem with 120,000 degrees of freedom. The same eigenpair was calculated by the serial version of my code in 1.2 minutes. There aren't any other preconditioners available for use with SLEPc that I am aware of, so for now I will just stick to the serial version of my code.

Tobi Young

Jun 10, 2016, 5:52:20 PM
to dealii
Hi Jonathan,

> I have tried the KrylovSchur and Lanczos solvers in both the parallel and
> serial codes. The parallel code, when run with only one processor takes
> roughly twice as long as with two processors (over 200 times slower than the
> actual serial version). I have the serial version because I wanted to see if
> it would be faster for "smaller" problems... It is. Much faster. Over 100
> times faster for the problems that I have "tested" it with (up to 180,000
> degrees of freedom, I have a small computer).

This looks simple: your parallel code is not doing what you think it is doing. I believe that your application is faulty.
This should not happen under normal circumstances, because the underlying parallel and serial PETSc objects are basically the same thing.

> I am using a preconditioner because without it the parallel version of my
> code does not converge (regardless of how many processors I run it on).

This again tells me you have a bug in your parallel code. Why would you expect a problem to magically solve itself by changing the number of MPI processes? If your parallel version does not converge with one MPI process but the serial version does, you have a bug in the parallel version. You need to deal with that. :-)

Do some debugging. Calm down and grab yourself a cup of tea or a pizza or some high-sugar snacks, or whatever makes you happy, and debug your code. I eat ice cream when I debug, normally vanilla flavour. Check what happens with the memory on your machine when you run on multiple MPI processes. Are you calling a procedure that is eating up your memory when run in parallel? Is that why it is so slow? Do you have valgrind installed on your machine? If yes, use it; if not, install it and then use it. Simplify your problem, simplify your solver, and check the residual norm of your solution - it should be converging to something if all is ok (a sketch of one way to check this is below).
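A concrete, low-effort check along those lines, reusing the SolverControl objects from your own solve() (only a sketch; I am assuming the PETSc wrappers record the last residual in the control object, which I believe they do):

// Let the inner CG control log its convergence history and report it afterwards:
SolverControl linear_solver_control (dof_handler.n_dofs (), 1.0e-12,
                                     /*log_history=*/true, /*log_result=*/true);
// ... set up the preconditioner, CG solver, shift-invert and eigensolver as before ...

// After eigensolver.solve(...):
pcout << "   Outer (eigensolver) iterations: " << solver_control.last_step () << std::endl;
pcout << "   Last inner CG solve: " << linear_solver_control.last_step ()
      << " iterations, last residual " << linear_solver_control.last_value () << std::endl;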

Do you get the same result when you run in parallel with fancy
preconditioners and in serial with a simple KrylovSchur call?

Another thing you can do is experiment with step-40. Do you experience the same behaviour there, i.e. that the parallel version does not scale? SLEPc solvers are very good in parallel for large sparse eigenvalue problems - and they use PETSc objects as their basis. Experiment with PETSc usage first.
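For instance, you could time a single plain linear solve with the CG solver and AMG preconditioner you already construct, completely outside of SLEPc (a sketch; `u' and `f' here are hypothetical vectors with the same parallel layout as your eigenvectors, with any right-hand side you like):

{
  // Time one K u = f solve with the CG + BoomerAMG objects from solve():
  TimerOutput::Scope t (computing_timer, "plain CG solve");
  linear_solver.solve (stiffness_matrix, u, f, preconditioner);
  pcout << "   CG iterations: " << linear_solver_control.last_step () << std::endl;
}

If that one solve scales from 1 to 2 to 4 processes, the trouble is on the SLEPc/transformation side; if it does not, the problem is already in your PETSc setup.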

Sorry for my terseness, I am trying to help, but I am also rather busy
these days.

Best,
Toby

Tobi Young

Jun 10, 2016, 5:55:27 PM
to dealii
> Using the Jacobi preconditioner the solution of a single eigenpair required
> 17 minutes on 2 processors for a problem with 120,000 degrees of freedom.
> The same eigenpair was calculated using the serial version of my code in 1.2
> minutes. There aren't any other preconditioners that are available for use
> with SLEPc that I am aware of so for now I will just stick to the serial
> version of my code.

Don't lose hope and don't give up. Explore! :-)

Best,
Toby

Jonathan Russ

Jun 10, 2016, 9:22:43 PM
to deal.II User Group
Thanks for the suggestions! And thanks again for taking a look!

Jonathan