Assemble function, long time


Hermes Sampedro

Mar 3, 2022, 4:31:04 AM
to deal.II User Group
Dear all, 

I am experiencing long assembly times and I would like to ask whether this is common or whether something is wrong with my implementation.

My model is built in a similar way to step-29 and step-40 (using complex values and solving with a direct solver in a distributed parallel implementation).
Now I am running larger systems with about 3.5 million DoFs, and the assembly took 16 hours, while the solver function took much less time.

Below is the structure of my assemble_system() function; is there something that can be done in order to speed up the process?

void Problem<dim>::assemble_system()
{
  for (unsigned int i = 0; i < dofs_per_cell; ++i)
    {
      for (unsigned int j = 0; j < dofs_per_cell; ++j)
        {
          for (unsigned int q_point = 0; q_point < n_q_points; ++q_point)
            {
              if (fe.system_to_component_index(i).first == fe.system_to_component_index(j).first)
                {
                  cell_matrix(i, j) += ....
                }
              if (fe.system_to_component_index(i).first != fe.system_to_component_index(j).first)
                {
                  cell_matrix(i, j) += ....
                }
            }

          // Boundaries
          if (fe.system_to_component_index(i).first == fe.system_to_component_index(j).first)
            {
              for (unsigned int face_no : GeometryInfo<dim>::face_indices())
                if (cell->face(face_no)->at_boundary() && (cell->face(face_no)->boundary_id() == 0))
                  {
                    fe_face_values.reinit(cell, face_no);
                    for (unsigned int q_point = 0; q_point < n_face_q_points; ++q_point)
                      cell_matrix(i, j) += ....
                  }
            }
          if (fe.system_to_component_index(i).first != fe.system_to_component_index(j).first)
            {
              for (unsigned int face_no : GeometryInfo<dim>::face_indices())
                {
                  if (cell->face(face_no)->at_boundary() && (cell->face(face_no)->boundary_id() == 0))
                    {
                      fe_face_values.reinit(cell, face_no);
                      for (unsigned int q_point = 0; q_point < n_face_q_points; ++q_point)
                        cell_matrix(i, j) += ....
                    }
                }
            }
        }
    }


Thank you very much.
Regards,
Hermes

Bruno Turcksin

Mar 3, 2022, 9:13:30 AM
to deal.II User Group
Hermes,

There are a couple of things that you could do, but they probably won't give you a significant speedup. Are you sure that you are running in Release mode and not in Debug? Do you evaluate complicated functions in the assembly?
A couple of changes that could help (see the sketch below):
- don't call fe.system_to_component_index(i).first and fe.system_to_component_index(j).first everywhere. Just define const unsigned int k = ... and const unsigned int m = ... and use k and m. That might help the compiler with some optimizations.
- move the two ifs for the cell assembly outside the loop over quadrature points, similar to what you did for the boundaries. This could help quite a bit if the CPU often gets the branch prediction wrong.
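
A rough sketch of what I mean (schematic only; the .... terms and the boundary part stay as in your code):

for (unsigned int i = 0; i < dofs_per_cell; ++i)
  {
    const unsigned int k = fe.system_to_component_index(i).first;
    for (unsigned int j = 0; j < dofs_per_cell; ++j)
      {
        const unsigned int m = fe.system_to_component_index(j).first;

        // Branch once per (i, j) pair instead of once per quadrature point.
        if (k == m)
          {
            for (unsigned int q_point = 0; q_point < n_q_points; ++q_point)
              cell_matrix(i, j) += ....
          }
        else
          {
            for (unsigned int q_point = 0; q_point < n_q_points; ++q_point)
              cell_matrix(i, j) += ....
          }

        // Boundary-face terms as before, also reusing k and m instead of
        // calling fe.system_to_component_index() again.
      }
  }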

Best,

Bruno

Hermes Sampedro

Mar 6, 2022, 7:19:29 AM
to deal.II User Group
Dear Bruno, 

Thank you very much for the comments. The problem was that I was running in Debug mode without knowing it. Now, after changing to Release, the assembly time is considerably reduced.

Moreover, I am experiencing another issue that I would like to ask about. My mesh is made with hyper_cube() in 3D with 5 refinements, and the number of DoFs is around 3 million. When running, I always get a memory error and the program stops. I tracked the problem down to the line that executes solver.solve(system_matrix, completely_distributed_solution, system_rhs);
I am using SparseMatrix and I do not fully understand where the problem could come from. The matrices are initialized beforehand; what do you think could be producing a memory issue in the solver?

Below is the full solver function:

template <int dim>
void LaplaceProblem<dim>::solve()
{
  PETScWrappers::MPI::Vector completely_distributed_solution(locally_owned_dofs, mpi_communicator);
  SolverControl cn;
  PETScWrappers::SparseDirectMUMPS solver(cn, mpi_communicator);

  solver.solve(system_matrix, completely_distributed_solution, system_rhs);

  constraints.distribute(completely_distributed_solution);
  locally_relevant_solution = completely_distributed_solution;
}


Thank you again for your help
Regards
H.

Bruno Turcksin

Mar 7, 2022, 9:04:19 AM
to dea...@googlegroups.com
Hermes,

The problem is that you are using a direct solver. Direct solvers
require a lot of memory because the inverse of a sparse matrix is
generally not sparse. If you use an LU decomposition, which I think
MUMPS does, you need a dense matrix to store the LU decomposition.
That's a lot of memory! You will need to use a Krylov solver to
solve a problem of this size.

Best,

Bruno

Hermes Sampedro

Mar 10, 2022, 2:51:01 AM
to deal.II User Group

Dear Bruno,

Thank you again for your answer. 

I managed to solve a system of 3.5 million DoFs using the same solver as I posted above, SparseDirectMUMPS. Now, in Release mode, the assembly takes a few minutes instead of hours; however, the solver function takes approximately 1.5 hours (per frequency iteration) using 40 processes in parallel (similar to step-40).

I was expecting faster performance when running in parallel with 40 processes, especially because I need to run for several frequencies. Would you also have expected it to be faster? Would that be solved by using the solver that you suggested (Krylov)?


Thank you

Regards, 

H

Bruno Turcksin

Mar 10, 2022, 8:54:13 AM
to dea...@googlegroups.com
Hermes,

For large systems, Krylov solvers are faster and require less memory
than direct solvers. Direct solvers scale poorly, in terms of memory
and performance, with the number of unknowns. The only problem with
Krylov solvers is that you need to use a good preconditioner. The
choice of the preconditioner depends on the system that you want to
solve.

Best,

Bruno


Hermes Sampedro

Mar 10, 2022, 9:00:39 AM
to deal.II User Group
Hi Bruno, 

Yes, for now, I have to use a direct solver due to the preconditioner. 
I am experiencing long computational times with the solver function. I am trying to use DoFRenumbering::Cuthill_McKee(dof_handler) and DoFRenumbering::boost::Cuthill_McKee(dof_handler,false,false),
but I get even higher computational times. Am I doing something wrong?

In the setup_system() function I do:
dof_handler.distribute_dofs(fe);
DoFRenumbering::Cuthill_McKee(dof_handler);

Then the solver is:
void LaplaceProblem<dim>::solve()
{
  PETScWrappers::MPI::Vector completely_distributed_solution(locally_owned_dofs, mpi_communicator);
  SolverControl cn;
  PETScWrappers::SparseDirectMUMPS solver(cn, mpi_communicator);

  solver.solve(system_matrix, completely_distributed_solution, system_rhs);

  constraints.distribute(completely_distributed_solution);
  locally_relevant_solution = completely_distributed_solution;
}

Thank you
Regards,
H

Bruno Turcksin

Mar 10, 2022, 9:12:19 AM
to dea...@googlegroups.com
Hermes,

I think Cuthill-McKee only works on symmetric matrices; is your matrix
symmetric? Also, the goal of Cuthill-McKee is to reduce the fill-in of
the matrix. There is no guarantee that it helps with performance. If
you don't know which preconditioner to use, you can use ILU (Incomplete
LU decomposition). Basically, you use a direct solver but you drop all
the "small" entries in the matrix. It's not the best preconditioner,
but you can control how much time you spend in the "direct solver".
The problem with direct solvers is that there is not much you can do
to speed them up. In practice, everybody uses Krylov solvers because
of the problems you are encountering now.

Best,

Bruno


Hermes Sampedro

unread,
Mar 10, 2022, 9:18:46 AM3/10/22
to deal.II User Group
Thank you for your suggestions. Could you please suggest which functions would work well for a Krylov solver? I cannot see examples.
My actual code is implemented using PETSc (for the sparse matrix, solver, etc.). I can see that SLEPcWrappers::SolverKrylovSchur accepts PETSc matrices.


Thank you again

Bruno Turcksin

Mar 10, 2022, 9:29:59 AM
to dea...@googlegroups.com
If your matrix is symmetric positive definite, you use CG. Otherwise, you use GMRES. Here is the page for ILU

Bruno

Hermes Sampedro

Mar 10, 2022, 10:19:40 AM
to deal.II User Group
Dear Bruno, 

Thank you very much, I will try this.
The last question, if it is not too much to ask, is about the matrix argument of PreconditionILU:

PETScWrappers::PreconditionILU::PreconditionILU(const MatrixBase &matrix, const AdditionalData &additional_data = AdditionalData())

Is it correct in this case to use my system_matrix? Is there any example in the tutorials about this? I cannot find any.

Thank you very much
Regards, 
H.

Bruno Turcksin

Mar 10, 2022, 10:29:44 AM
to dea...@googlegroups.com
Yes, you should use your system_matrix. AdditionalData can be used to modify the parameters used by ILU. The interface of PreconditionILU should work very similarly to BlockJacobi, see https://dealii.org/current/doxygen/deal.II/step_17.html#ElasticProblemsolve There are several tutorials that use PETSc: steps 17, 18, 40, 50, and 55. We also have tests that use PreconditionILU: https://github.com/dealii/dealii/blob/master/tests/petsc/solver_03_precondition_ilu.cc
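
For concreteness, a minimal sketch of that pattern, assuming the step-40-style names from your earlier posts (system_matrix, system_rhs, locally_owned_dofs, constraints, mpi_communicator); whether PETSc's built-in ILU is available for your parallel matrix type depends on how PETSc was configured:

PETScWrappers::MPI::Vector solution(locally_owned_dofs, mpi_communicator);
SolverControl solver_control(system_matrix.m(), 1e-8 * system_rhs.l2_norm());
PETScWrappers::SolverGMRES solver(solver_control, mpi_communicator);

// Build the ILU preconditioner from the already assembled system matrix.
PETScWrappers::PreconditionILU preconditioner(system_matrix);

// Solve, then apply the constraints to the distributed solution.
solver.solve(system_matrix, solution, system_rhs, preconditioner);
constraints.distribute(solution);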

Bruno

Wolfgang Bangerth

Mar 10, 2022, 11:17:47 AM
to dea...@googlegroups.com
On 3/10/22 07:00, Hermes Sampedro wrote:
> I am experiencing long computational times with the solver function.  I am
> trying to use DoFRenumbering::Cuthill_McKee(dof_handler),
> DoFRenumbering::boost::Cuthill_McKee(dof_handler,false,false)
> but I get even higher computational times. Am I doing something wrong?

Renumbering makes an enormous difference for sparse direct solvers, and as a
consequence all such solvers I know of do it internally (though they use
variations of the "minimum degree" renumbering rather than Cuthill-McKee). As
a result, renumbering yourself likely makes no difference.

But, as you discovered and as Bruno already pointed out, even with optimal
ordering, direct solvers do not scale well, both in terms of overall time and
in parallelism. You may want to take a look at the several video lectures on
solvers and preconditioners to see what you can do about your case.

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Hermes Sampedro

Mar 11, 2022, 4:47:37 AM
to deal.II User Group
Dear Bruno and Wolfgang, 

Thank you very much for your comments and help; they are very helpful. I think that is exactly what I am experiencing. When running my current direct solver on a system with 15 elements per direction (4th polynomial order, 0.5 million DoFs), the solver takes 50 seconds. However, increasing to 30 elements per direction (3.5 million DoFs), the solver takes 1.5 hours. I think this shows that it does not scale well in terms of time, as you mentioned. I will definitely try the iterative solver.

Thank you again
Regards, 
H.

Hermes Sampedro

Mar 14, 2022, 5:42:44 AM
to deal.II User Group

Dear Bruno,

I have been reading the examples and documents you pointed out. I tried to use SolverGMRES with PreconditionILU. However, I get a runtime error that I cannot really understand when calling PETScWrappers::PreconditionILU preconditioner(system_matrix). It does seem to work with PETScWrappers::PreconditionBlockJacobi preconditioner(system_matrix). The solver function and the error that I get are the following:

void LaplaceProblem<dim>::solve()
{
  PETScWrappers::MPI::Vector completely_distributed_solution(locally_owned_dofs, mpi_communicator);
  SolverControl cn(completely_distributed_solution.size(), 1e-8 * system_rhs.l2_norm());
  PETScWrappers::SolverGMRES solver(cn, mpi_communicator);
  PETScWrappers::PreconditionILU preconditioner(system_matrix);

  solver.solve(system_matrix, completely_distributed_solution, system_rhs, preconditioner);

  constraints.distribute(completely_distributed_solution);
  locally_relevant_solution = completely_distributed_solution;
}


[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/linearsolvertable.html for possible LU and Cholesky solvers
[0]PETSC ERROR: Could not locate a solver package for factorization type ILU and matrix type mpiaij.
[0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.13.1, May 02, 2020
[0]PETSC ERROR: ./waveLaplaceSolver on a  named gbarlogin1 by hsllo Fri Mar 11 11:05:23 2022
[0]PETSC ERROR: Configure options --prefix=/zhome/32/9/115503/dealii-candi/petsc-3.13.1 --with-debugging=0 --with-shared-libraries=1 --with-mpi=1 --with-x=0 --with-64-bit-indices=0 --download-hypre=1 CC=mpicc CXX=mpicxx FC=mpif90 --with-blaslapack-dir=/appl/OpenBLAS/0.3.17/XeonE5-2660v3/gcc-11.2.0/lib --with-parmetis-dir=/zhome/32/9/115503/dealii-candi/parmetis-4.0.3 --with-metis-dir=/zhome/32/9/115503/dealii-candi/parmetis-4.0.3 --download-scalapack=1 --download-mumps=1
[0]PETSC ERROR: #1 MatGetFactor() line 4492 in /zhome/32/9/115503/dealii-candi/tmp/build/petsc-lite-3.13.1/src/mat/interface/matrix.c
[0]PETSC ERROR: #2 PCSetUp_ILU() line 133 in /zhome/32/9/115503/dealii-candi/tmp/build/petsc-lite-3.13.1/src/ksp/pc/impls/factor/ilu/ilu.c
[0]PETSC ERROR: #3 PCSetUp() line 894 in /zhome/32/9/115503/dealii-candi/tmp/build/petsc-lite-3.13.1/src/ksp/pc/interface/precon.c
----------------------------------------------------
Exception on processing:
--------------------------------------------------------
An error occurred in line <431> of file </zhome/32/9/115503/dealii-candi/tmp/unpack/deal.II-v9.3.1/source/lac/petsc_precondition.cc> in function
    void dealii::PETScWrappers::PreconditionILU::initialize(const dealii::PETScWrappers::MatrixBase&, const dealii::PETScWrappers::PreconditionILU::AdditionalData&)
The violated condition was:
    ierr == 0
Additional information:
    deal.II encountered an error while calling a PETSc function.
    The description of the error provided by PETSc is "See
    https://www.mcs.anl.gov/petsc/documentation/linearsolvertable.html for
    possible LU and Cholesky solvers".
    The numerical value of the original error code is 92.

Could you please help me to understand what is happening?

Thank you again for your help

Regards, 
H

Bruno Turcksin

Mar 14, 2022, 9:24:45 AM
to dea...@googlegroups.com
Hermes,

Sorry, I don't use PETSc. Maybe someone else can help you.

Best,

Bruno
