Internal instability of the GMRES Solver / Trilinos


Pascal Kraft

Mar 14, 2017, 2:29:09 PM
to deal.II User Group
Dear list members,

I am facing a really weird problem that I have been struggling with for a while now. I have written a problem class which, based on other objects, generates a system matrix, a right-hand side and a solution vector. The data structures are distributed Trilinos block types. The first time I do this, everything works perfectly. However, the class is part of an optimization scheme, and usually the second time the object is used (occasionally also later, but that has only happened once or twice) the solver does not start. I check with MPI barriers whether all processes arrive at GMRES::solve, and they do, but somehow not even the vmult method of my own preconditioner gets called anymore. The objects (the two vectors and the system matrix) are exactly the same as in the previous step (slightly different numbers, but the same vectors of IndexSets for the partitioning among processors).
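
Roughly, the solve call looks like this (a stripped-down outline only, not my actual code; the names match what shows up in the call stack below):

// inside Waveguide::solve() -- simplified sketch
SolverControl control(3000, 1e-8);                    // actual values differ
SolverGMRES<TrilinosWrappers::MPI::BlockVector> solver(control);

MPI_Barrier(mpi_communicator);                        // all processes arrive here ...
solver.solve(system_matrix,                           // TrilinosWrappers::BlockSparseMatrix
             solution,                                // TrilinosWrappers::MPI::BlockVector
             system_rhs,                              // TrilinosWrappers::MPI::BlockVector
             sweep);                                  // my PreconditionerSweeping (has vmult())
// ... but on the second run sweep.vmult() is never called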

I have debugged this code segment with Eclipse and the parallel debugger, but I don't know what to make of the call stack:
18 ompi_request_default_wait_all()  7fffddd54b15
17 ompi_coll_tuned_barrier_intra_recursivedoubling()  7fffcf9abb5d
16 PMPI_Barrier()  7fffddd68a9c
15 Epetra_MpiDistributor::DoPosts()  7fffe4088b4f
14 Epetra_MpiDistributor::Do()  7fffe4089773
13 Epetra_DistObject::DoTransfer()  7fffe400a96a
12 Epetra_DistObject::Export()  7fffe400b7b7
11 int Epetra_FEVector::GlobalAssemble<int>()  7fffe4023d7f
10 Epetra_FEVector::GlobalAssemble()  7fffe40228e3
9 dealii::TrilinosWrappers::MPI::Vector::reinit() trilinos_vector.cc:261 7ffff52c937e
8 dealii::TrilinosWrappers::MPI::BlockVector::reinit() trilinos_block_vector.cc:191 7ffff4e43bd9
7 dealii::internal::SolverGMRES::TmpVectors<dealii::TrilinosWrappers::MPI::BlockVector>::operator() solver_gmres.h:535 4a847d
6 dealii::SolverGMRES<dealii::TrilinosWrappers::MPI::BlockVector>::solve<dealii::TrilinosWrappers::BlockSparseMatrix, PreconditionerSweeping>() solver_gmres.h:813 4d654a
5 Waveguide::solve() Waveguide.cpp:1279 48f150

The last line (5) here is a function I wrote which calls SolverGMRES<dealii::TrilinosWrappers::MPI::BlockVector>::solve with my preconditioner (which worked perfectly fine during the previous run). I found some information online about MPI_Barrier being unstable sometimes, but I don't know enough about the inner workings of Trilinos (Epetra) and deal.II to make a judgment call here. If no one can help I will try to provide a code fragment, but I doubt that will be possible (if it really is a race condition and I strip away the rather large amount of code surrounding this segment, it is unlikely to be reproducible).

Originally I had used two MPI communicators that differed only in the numbering of the processes (one for the primal, one for the dual problem) and created two independent objects of my problem class which each used their respective communicator. In that case the solver only worked when the numbering of processes was either equal to that of MPI_COMM_WORLD or exactly the opposite, but not for, say, 1-2-3-4 -> 1-3-2-4; it got stuck in exactly the same way. I had thought it might be some internal use of MPI_COMM_WORLD that was blocking somehow, but it also happens now that I only use one communicator (MPI_COMM_WORLD).

Thank you in advance for your time,
Pascal Kraft

Pascal Kraft

Mar 14, 2017, 2:43:41 PM
to deal.II User Group
By the way: After some time I see the additional function opal_progress() on top of the stack.
Also here is what I use:
gcc (GCC) 6.3.1 20170306
openmpi 1.10.6-1
trilinos-12.6.1
dealii-8.4.1
and my test cases consist of 4 MPI processes.

Timo Heister

Mar 14, 2017, 4:49:13 PM
to dea...@googlegroups.com
Pascal,

I have no idea why this is happening. I think you have to try to make
a minimal example that hangs so we can find out what the problem is. I
assume we incorrectly allocate/deallocate temporary vectors somewhere.

Are all processors stuck inside
> 9 dealii::TrilinosWrappers::MPI::Vector::reinit() trilinos_vector.cc:261 7ffff52c937e
?



--
Timo Heister
http://www.math.clemson.edu/~heister/

Pascal Kraft

Mar 15, 2017, 9:10:29 AM
to deal.II User Group
Dear Timo,

I have done some more digging and found out the following. The problem seems to happen in trilinos_vector.cc between lines 240 and 270.
What I see on the call stacks is that one process reaches line 261 ( ierr = vector->GlobalAssemble (last_action); ) and then waits inside this call at an MPI_Barrier, with the following stack:
20 <symbol is not available> 7fffd4d18f56
19 opal_progress()  7fffdc56dfca
18 ompi_request_default_wait_all()  7fffddd54b15
17 ompi_coll_tuned_barrier_intra_recursivedoubling()  7fffcf9abb5d
16 PMPI_Barrier()  7fffddd68a9c
15 Epetra_MpiDistributor::DoPosts()  7fffe4088b4f
14 Epetra_MpiDistributor::Do()  7fffe4089773
13 Epetra_DistObject::DoTransfer()  7fffe400a96a
12 Epetra_DistObject::Export()  7fffe400b7b7
11 int Epetra_FEVector::GlobalAssemble<int>()  7fffe4023d7f
10 Epetra_FEVector::GlobalAssemble()  7fffe40228e3

The other processes (in my case three) are stuck at the head of the if/else-if statement leading up to this point, namely in the line
if (vector->Map().SameAs(v.vector->Map()) == false)
inside the call to SameAs(...), with stacks like

15 opal_progress()  7fffdc56dfbc
14 ompi_request_default_wait_all()  7fffddd54b15
13 ompi_coll_tuned_allreduce_intra_recursivedoubling()  7fffcf9a4913
12 PMPI_Allreduce()  7fffddd6587f
11 Epetra_MpiComm::MinAll()  7fffe408739e
10 Epetra_BlockMap::SameAs()  7fffe3fb9d74
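
If I read these stacks correctly, they amount to mismatched collectives: one rank sits in an MPI_Barrier (via GlobalAssemble) while the other three sit in an MPI_Allreduce (via SameAs/MinAll). A tiny standalone program of that shape (my own reduction, not deal.II or Trilinos code) hangs in exactly the same way when run on 4 processes:

#include <mpi.h>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  int flag = 1, result = 0;
  if (rank == 0)
    // the GlobalAssemble() path: Epetra_MpiDistributor ends in MPI_Barrier
    MPI_Barrier(MPI_COMM_WORLD);
  else
    // the SameAs() path: Epetra_MpiComm::MinAll() ends in MPI_Allreduce
    MPI_Allreduce(&flag, &result, 1, MPI_INT, MPI_MIN, MPI_COMM_WORLD);

  MPI_Finalize();   // never reached: the collectives above do not match
  return 0;
}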

Maybe this helps. Producing a smaller example will likely not be possible in the coming two weeks, but if there is no solution by then I can try.

Greetings,
Pascal

Martin Kronbichler

Mar 15, 2017, 12:26:23 PM
to dea...@googlegroups.com

Dear Pascal,

This problem seems related to one we recently worked around in https://github.com/dealii/dealii/pull/4043

Can you check what happens if you call

GrowingVectorMemory<TrilinosWrappers::MPI::Vector>::release_unused_memory()

between your optimization steps? If a communicator gets stuck in those places, it is likely that a stale object is lying around somewhere that we fail to work around for some reason.
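
For example (just a sketch -- the loop and the names in it stand in for whatever your optimization code actually does; only the release call is the deal.II API):

#include <deal.II/lac/vector_memory.h>
#include <deal.II/lac/trilinos_vector.h>

for (unsigned int step = 0; step < n_optimization_steps; ++step)
  {
    waveguide.run();   // assemble and solve for this step

    // empty the pool of temporary vectors so that the next GMRES run cannot
    // pick up a vector that still carries the Epetra map of the previous step
    GrowingVectorMemory<TrilinosWrappers::MPI::Vector>::release_unused_memory();
  }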

Best,
Martin


Pascal Kraft

Mar 15, 2017, 8:28:14 PM
to deal.II User Group
Hi Martin,

that didn't solve my problem. What I have done in the meantime is replace the check in line 247 of trilinos_vector.cc with true. I don't know if this causes memory leaks or anything else, but my code seems to be working fine with that change.
To your suggestion: would I also have had to call the templated version for BlockVectors, or only for Vectors? I only tried the latter. Would I also have had to apply some patch to my deal.II library for it to work, or is the patch you mentioned simply that you call GrowingVectorMemory<TrilinosWrappers::MPI::Vector>::release_unused_memory() in some places?
I have also been wanting to try MPICH instead of OpenMPI, because of a post about an internal error in OpenMPI where one of the functions appearing in the call stacks sometimes does not block properly.

Thank you for your time and your fast responses - the whole library and the people developing it and making it available are simply awesome ;)

Pascal

Martin Kronbichler

Mar 16, 2017, 3:58:53 AM
to dea...@googlegroups.com

Dear Pascal,

You are right, in your case one needs to call
GrowingVectorMemory<TrilinosWrappers::MPI::BlockVector>::release_unused_memory()
rather than the Vector variant. Can you try that as well?

The problem appears to be that the call to SameAs returns different results on different processors, which should not happen, and which is why I suspect there might be a stale communicator object around. Another indication for that assumption is that you get stuck in the initialization of the temporary vectors of the GMRES solver, which is exactly this kind of situation.
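
To illustrate what I mean by a stale object (schematic only; the actual pool bookkeeping inside deal.II is more involved):

GrowingVectorMemory<TrilinosWrappers::MPI::BlockVector> mem;

// first optimization step: GMRES takes its temporary vectors from this pool
// and reinit()s them against your solution vector
TrilinosWrappers::MPI::BlockVector *tmp = mem.alloc();
tmp->reinit(solution);   // now tied to the Epetra maps of step 1
mem.free(tmp);           // returned to the pool, but not cleared

// second step: the solver may receive the very same object back; its reinit()
// then compares the stale map with the new one, and that comparison is where
// your processes apparently disagree. release_unused_memory() empties the
// pool so this reuse cannot happen.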

As to the particular patch I referred to: it does release some memory that might hold stale information, but it also changes some of the call structures slightly. Could you try changing the following:

if (vector->Map().SameAs(v.vector->Map()) == false)

to

if (v.vector->Map().SameAs(vector->Map()) == false)

Best, Martin

Pascal Kraft

Mar 16, 2017, 7:35:10 AM
to deal.II User Group
Dear Martin,

my local machine is tied up with a Valgrind run at the moment, but as soon as that has finished one step I will put these changes in right away and post the results here (<6 hrs).
From what I make of the call stacks, one process somehow gets out of the SameAs() call without being blocked by MPI, and the others are then forced to wait during the Allreduce call. How or where that happens I will try to figure out later today. SDM is now working well in my Eclipse setup, and I hope to be able to track down the problem.

Best,
Pascal

Pascal Kraft

Mar 16, 2017, 10:21:24 AM
to deal.II User Group
Hi Martin,

I have tried a version with GrowingVectorMemory<TrilinosWrappers::MPI::BlockVector>::release_unused_memory() at the end of each step and removed my change to trilinos_vector.cc line 247 (back to the version from the deal.II sources), and it seems to work fine. I have not tried the other solution you proposed; should I? Would the result help you?

Thank you a lot for your support! This had been driving me crazy :)

Best,
Pascal

Martin Kronbichler

Mar 16, 2017, 10:22:30 AM
to dea...@googlegroups.com

Dear Pascal,

No, you do not need to try the other solution. I'm glad I could help. (This confirms that we need to be careful with the vector pool between different solver calls.)

Best,
Martin
