Errors when using MUMPS/PETSc LU


Lucas Campos

Nov 17, 2017, 8:12:18 AM
to deal.II User Group
Dear all,

First of all, a bit of context:
I am trying to debug an error in my application where NaNs appear at random. The probability of this increases with the number of MPI processes I use, so it looks like a data race of some sort. Any advice on the best way to find the error?

My current approach is to use the MUST project [1] to help me find the issues. When I ran MUST with the debug version of my code on the local cluster, it reported errors related to the MPI internals of deal.II/PETSc(/MUMPS?). Example output can be seen in errors.txt. The fact that the output stops at "Solving... " suggests that the error lies between the following lines of my code:

PetscPrintf(mpi_communicator, "Solving... \n");
computing_timer.enter_section("solve");

SolverControl cn;
PETScWrappers::SparseDirectMUMPS solver(cn, mpi_communicator);
solver.set_symmetric_mode(false);
solver.solve(system_matrix, distributed_dU, system_rhs); 

computing_timer.exit_section("solve");
PetscPrintf(mpi_communicator, "Solved! \n");


Indeed, when I comment out the line "solver.solve(system_matrix, distributed_dU, system_rhs);", it runs with no errors at all.

Could this be the source of my issues? Also, how can I solve this specific issue?
errors.txt

Lucas Campos

Nov 17, 2017, 8:13:05 AM
to deal.II User Group
Sorry, I forgot to include the link to MUST. Here it is: https://doc.itc.rwth-aachen.de/display/CCP/Project+MUST

Timo Heister

Nov 17, 2017, 9:15:07 AM
to dea...@googlegroups.com
Lucas,

Those kinds of bugs are hard to find. Honestly, the bug could still be
in your code, inside deal.II, inside PETSc, inside MUMPS, or related
to the software/hardware you are running on.

I know this won't be of much help, but I would suggest you try a
different solver to see if MUMPS is the problematic part here. Maybe
it is doing some invalid operations (maybe one processor has no
DoFs?). Try to simplify your test problem as much as possible. If the
problem is small enough, test on a different machine
(workstation/laptop), run under valgrind, etc.
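As a concrete sketch of such a swap (hypothetical tolerances; it assumes the same `system_matrix`, `distributed_dU`, `system_rhs`, and `mpi_communicator` as in your snippet), an iterative Krylov solver could stand in for MUMPS:

```cpp
// Sketch: replace SparseDirectMUMPS with GMRES plus a simple
// preconditioner, to check whether MUMPS is the problematic part.
#include <deal.II/lac/petsc_solver.h>
#include <deal.II/lac/petsc_precondition.h>

SolverControl cn(1000, 1e-10 * system_rhs.l2_norm());
PETScWrappers::SolverGMRES solver(cn, mpi_communicator);
PETScWrappers::PreconditionBlockJacobi preconditioner(system_matrix);
solver.solve(system_matrix, distributed_dU, system_rhs, preconditioner);
```

If the NaNs disappear with this change, that points at MUMPS; if they persist, the problem is more likely upstream, in the assembled system.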

Best,
Timo





--
Timo Heister
http://www.math.clemson.edu/~heister/

Lucas Campos

Nov 17, 2017, 9:37:57 AM
to dea...@googlegroups.com
Dear Timo,

Thanks for your advice. I am running the program on three different computers -- my notebook, my research group's server, and the local cluster. On all of them there is the same (small) chance of suddenly finding the NaNs.

According to MUST, there is clearly a problem inside deal.II, MUMPS or PETSc, as can be seen in the file I sent with a previous message. Namely,

Invalid MPI_Op, error stack:
MPI_Op_free(111): MPI_Op_free(op=0x7ffcf6298dac) failed
MPI_Op_free(75).: Null Op pointer

Whether this error leads to the issues I am having is up for discussion.

I tried using PETSc's SolverPreOnly:

        SolverControl solver_control;
        PETScWrappers::SolverPreOnly solver(solver_control, mpi_communicator);
        PETScWrappers::PreconditionLU preconditioner(system_matrix);
        solver.solve(system_matrix, distributed_dU, system_rhs, preconditioner);

It shows the same issues, as expected. I will follow your advice and try to use a different solver. 

Still, would it be possible for you to comment a bit more on those MPI_Op_free errors?

Cheers,
Lucas



Lucas Campos

Nov 17, 2017, 9:39:32 AM
to dea...@googlegroups.com
Dear Timo, 

Also, do you have recommendations for Valgrind flags other than --track-origins=yes --leak-check=full?

Lucas

Wolfgang Bangerth

Nov 17, 2017, 11:07:51 AM
to dea...@googlegroups.com
On 11/17/2017 07:37 AM, Lucas Campos wrote:
> Invalid MPI_Op, error stack:
> MPI_Op_free(111): MPI_Op_free(op=0x7ffcf6298dac) failed
> MPI_Op_free(75).: Null Op pointer
>
> [...]
>
> Still, would it be possible for you to comment a bit more on those
> MPI_Op_free errors?

It means that an MPI_Op object is being freed (like calling 'free' on
memory), but the object being freed doesn't actually exist (it is a
NULL pointer).

There clearly is a bug here, but it's impossible to tell without a
backtrace where that might be.

Best
W.


--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Timo Heister

Nov 17, 2017, 1:50:50 PM
to dea...@googlegroups.com
> It shows the same issues, as expected. I will follow your advice and try to use a different solver.

Then you have to simplify your problem as much as possible until we
can reproduce it.

> Still, would it be possible for you to comment a bit more on those MPI_Op_free errors?

This can happen for many different reasons. It might be a bug inside
one of the libraries, or it might happen if you overwrite memory (the
reason I suggested valgrind), or if you are trying to delete a PETSc
object more than once, etc.

Are you using the debug build of PETSc?

> Also, do you have recommendations for Valgrind flags other than --track-origins=yes --leak-check=full?

That should be enough to find out whether anybody is overwriting
memory. You can also use clang's address sanitizer.
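For reference, typical invocations look like the following (a sketch; the program name, process count, and build settings are placeholders):

```shell
# Run every MPI rank under valgrind with the flags discussed above.
mpirun -np 4 valgrind --track-origins=yes --leak-check=full ./my_program

# Alternatively, rebuild with clang's AddressSanitizer and run as usual;
# ASan reports memory overwrites at the moment they happen, with a backtrace.
cmake -DCMAKE_CXX_COMPILER=clang++ \
      -DCMAKE_CXX_FLAGS="-fsanitize=address -g" .
make
mpirun -np 4 ./my_program
```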

Lucas Campos

Nov 30, 2017, 8:27:13 AM
to dea...@googlegroups.com
Dear all,

Sorry for the late reply. I had some urgent matters to deal with and had to put this investigation on hold for a while.

> Then you have to simplify your problem as much as possible until we
> can reproduce it.

I am not sure of the best way to do this, as I have a rather large program and I need it to build the system matrix. In your experience, would saving the matrices to files and then just loading them be a good path?


> Are you using the debug build of PETSc?

I think so. I built it with candi, and I am compiling my program in debug mode, as set up by deal.II's autopilot.


> That should be enough to find out whether anybody is overwriting
> memory. 

Please see the attachment.

> You can also use clang's address sanitizer.

I could not compile with clang, due to an unknown flag, -fopenmp-simd. Do you happen to know how to disable it?

Best,
Lucas


log_valgrind_debug.txt

Timo Heister

Nov 30, 2017, 10:28:33 AM
to dea...@googlegroups.com
>> Then you have to simplify your problem as much as possible until we
>> can reproduce it.
>
> I am not sure of the best way to do it, as a have a rather large program and
> I need it to build the system matrix. In your experience, saving the
> matrices
> to files and then just load them would be a good path?

While not ideal, this would be a start. Unfortunately, I am not sure
if we have a good way to serialize parallel vectors/matrices you can
use. First try to decrease the size of the linear system and the
number of processors required to see the problem.
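If you do go the save-to-file route, PETSc's binary viewer can dump and reload a matrix; a sketch at the raw PETSc level (the `Mat` handle `A` and the file name are placeholders; deal.II's PETScWrappers matrices convert to the underlying `Mat`):

```cpp
// Sketch: write the assembled matrix to disk with PETSc's binary viewer,
// so a small standalone reproducer can reload it and call the solver.
PetscViewer viewer;
PetscViewerBinaryOpen(PETSC_COMM_WORLD, "system_matrix.dat",
                      FILE_MODE_WRITE, &viewer);
MatView(A, viewer);
PetscViewerDestroy(&viewer);

// Reloading in the reproducer:
Mat B;
MatCreate(PETSC_COMM_WORLD, &B);
MatSetFromOptions(B);
PetscViewerBinaryOpen(PETSC_COMM_WORLD, "system_matrix.dat",
                      FILE_MODE_READ, &viewer);
MatLoad(B, viewer);
PetscViewerDestroy(&viewer);
```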

>> That should be enough to find out whether anybody is overwriting
>> memory.
>
> Please, see the attachment.

I don't know what these close() warnings are about, but everything
else looks good.

>> You can also use clang's address sanitizer.
>
> I could not compile with clang, due to an unknown flag, -fopenmp-simd. Do
> you happen to know to to disable this one?

I think you would also need to compile deal.II using clang.