hp::DoFHandler with FESystem and FE_Nothing - crash in parallel when compressing Trilinos sparsity pattern

Mathias Anselmann

May 22, 2019, 11:12:37 AM
to deal.II User Group
Hello,
I want to use the deal.II hp::DoFHandler in a way that is conceptually similar to step-46, i.e. for a coupled problem. I'm at the beginning of the implementation and want to use deal.II with Trilinos so that the program can run on distributed machines.
Unfortunately, I have a problem when it comes to creating the TrilinosWrappers::BlockSparsityPattern: the code runs serially (mpirun -np 1 ./problem_fe_nothing) but crashes with np >= 2.
I could isolate the error a little and have created a "minimal working example" (more or less an adapted step-40, attached to this post).

To explain my problem in a nutshell:
I need to create an FESystem that has one FE_Nothing block:
FESystem<dim> fe_problem(FE_Q<dim>(2)^dim, FE_Nothing<dim>()^1);
I add this FESystem to an hp::FECollection<dim> and distribute the DoFs via the hp::DoFHandler<dim>.
Afterwards, I renumber the DoFs component-wise and extract the locally owned and locally relevant DoFs.
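For reference, here is a rough sketch of this setup (not the exact code of the attached example; the function and variable names are placeholders, and the get_view-based block split is only one possible way to build the per-block index sets):

#include <deal.II/base/index_set.h>
#include <deal.II/dofs/dof_renumbering.h>
#include <deal.II/dofs/dof_tools.h>
#include <deal.II/fe/fe_nothing.h>
#include <deal.II/fe/fe_q.h>
#include <deal.II/fe/fe_system.h>
#include <deal.II/hp/dof_handler.h>
#include <deal.II/hp/fe_collection.h>

#include <vector>

template <int dim>
void setup_dofs(dealii::hp::DoFHandler<dim> &  dof_handler,
                std::vector<dealii::IndexSet> &owned_partitioning,
                std::vector<dealii::IndexSet> &relevant_partitioning)
{
  using namespace dealii;

  // One vector-valued FE_Q block plus one FE_Nothing component.
  FESystem<dim> fe_problem(FE_Q<dim>(2) ^ dim, FE_Nothing<dim>() ^ 1);

  hp::FECollection<dim> fe_collection;
  fe_collection.push_back(fe_problem);

  dof_handler.distribute_dofs(fe_collection);

  // Group the dim FE_Q components into block 0 and the FE_Nothing
  // component into block 1, then renumber component-wise.
  std::vector<unsigned int> block_component(dim + 1, 0);
  block_component[dim] = 1;
  DoFRenumbering::component_wise(dof_handler, block_component);

  std::vector<types::global_dof_index> dofs_per_block(2);
  DoFTools::count_dofs_per_block(dof_handler, dofs_per_block, block_component);
  const types::global_dof_index n0 = dofs_per_block[0];
  const types::global_dof_index n1 = dofs_per_block[1]; // 0, FE_Nothing has no DoFs

  const IndexSet &locally_owned_dofs = dof_handler.locally_owned_dofs();
  IndexSet        locally_relevant_dofs;
  DoFTools::extract_locally_relevant_dofs(dof_handler, locally_relevant_dofs);

  // One index set per block, as the block sparsity pattern expects.
  owned_partitioning    = {locally_owned_dofs.get_view(0, n0),
                           locally_owned_dofs.get_view(n0, n0 + n1)};
  relevant_partitioning = {locally_relevant_dofs.get_view(0, n0),
                           locally_relevant_dofs.get_view(n0, n0 + n1)};
}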
I then reinit my TrilinosWrappers::BlockSparsityPattern:
sp.reinit(locally_owned_dofs,
          locally_owned_dofs,
          locally_relevant_dofs,
          mpi_communicator);
Afterwards I create the sparsity pattern with DoFTools::make_sparsity_pattern() and then call
sp.compress();
and this is where the program crashes (in parallel) with an uncaught exception in main() - so there is no exception thrown in an underlying deal.II or Trilinos routine (using debug mode).
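In the sketch started above, this step would look roughly as follows (the per-block index sets owned_partitioning / relevant_partitioning play the role of the locally_owned_dofs / locally_relevant_dofs vectors used above; constraints and mpi_communicator are assumed to exist, and the exact make_sparsity_pattern arguments in the attached example may differ):

#include <deal.II/lac/block_sparsity_pattern.h>

// ... inside setup_system(), continuing the previous sketch
// (and assuming "using namespace dealii;" as in step-40):
TrilinosWrappers::BlockSparsityPattern sp;
sp.reinit(owned_partitioning,    // row index sets, one per block
          owned_partitioning,    // column index sets, one per block
          relevant_partitioning, // rows that may additionally be written into
          mpi_communicator);

DoFTools::make_sparsity_pattern(dof_handler,
                                sp,
                                constraints,
                                /*keep_constrained_dofs=*/false,
                                Utilities::MPI::this_mpi_process(mpi_communicator));

sp.compress(); // <- crashes with mpirun -np 2, runs through with -np 1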


I have found that the program does not crash if I don't use FE_Nothing, so if I use:
FESystem<dim> fe_works(FE_Q<dim>(2)^dim, FE_Q<dim>(1)^1);
instead of fe_problem the code runs fine - even in parallel.
Also, if I use fe_problem in conjunction with:
sp.reinit(locally_owned_dofs,
          locally_owned_dofs,
          mpi_communicator);
the code runs fine (though, according to the documentation, the resulting matrix will then not be writable from multiple threads).

I can debug the program in serial with gdb, but unfortunately this doesn't help me here since the problem only occurs in parallel.
As proposed in "debug mpi" I tried to debug it with:
mpirun -np 2 urxvt -e gdb ./problem_fe_nothing
After the crash, "backtrace" just reports "no stack" and is of no help either (everything compiled in debug mode).
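A common alternative I have seen suggested is to let one rank wait at startup and attach gdb to it from the outside via "gdb -p <pid>"; a rough sketch (the rank number and variable names are arbitrary):

#include <deal.II/base/mpi.h>

#include <unistd.h>

#include <iostream>

// ... at the beginning of main(), right after
// Utilities::MPI::MPI_InitFinalize mpi_init(argc, argv);
{
  const unsigned int rank =
    dealii::Utilities::MPI::this_mpi_process(MPI_COMM_WORLD);
  if (rank == 1) // the rank one wants to inspect
    {
      volatile int wait_for_debugger = 1;
      std::cerr << "Rank " << rank << " has PID " << getpid()
                << "; attach with 'gdb -p <pid>', then"
                << " 'set var wait_for_debugger = 0' and 'continue'."
                << std::endl;
      while (wait_for_debugger == 1)
        sleep(1);
    }
}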

For the sake of completeness:
I use the brand new deal.II 9.1.0 with Trilinos from the current master branch, gcc 8.3 and mpich 3.3.

It would be great if anybody could help me find the problem here or debug it further.

Thanks and greetings,

Mathias
problem_fe_nothing.zip

Daniel Arndt

May 22, 2019, 3:52:01 PM
to deal.II User Group
Mathias,

some observations:
- it seems that the call to nonlocal_graph->FillComplete (https://github.com/dealii/dealii/blob/master/source/lac/trilinos_sparsity_pattern.cc#L820) is failing for block (0,1).
- the code also fails for me when using a DoFHandler object instead of a hp::DoFHandler object.

Best,
Daniel

Mathias Anselmann

May 22, 2019, 10:06:33 PM
to deal.II User Group
 Hello Daniel,

thanks for the observations. It's interesting that the problem doesn't depend on the hp framework.
I managed to debug the problem a bit further with gdb attached to the second process (when starting with -np 2) and got the following backtrace directly before the crash:
>>> bt
#0  Epetra_CrsGraph::MakeIndicesLocal (this=0x559811539590, domainMap=..., rangeMap=...) at /apps/spack/var/spack/stage/trilinos-master/packages/epetra/src/Epetra_CrsGraph.cpp:1823
#1  0x00007f3bdc4a48a2 in Epetra_CrsGraph::FillComplete (this=0x559811539590, domainMap=..., rangeMap=...) at /apps/spack/var/spack/stage/trilinos-master/packages/epetra/src/Epetra_CrsGraph.cpp:979
#2  0x00007f3bf4bcf089 in dealii::TrilinosWrappers::SparsityPattern::compress (this=0x559811566ee0) at /apps/spack/var/spack/stage/dealii-9.1.0/source/lac/trilinos_sparsity_pattern.cc:820
#3  0x00007f3bf45a4983 in dealii::BlockSparsityPatternBase<dealii::TrilinosWrappers::SparsityPattern>::compress (this=0x7ffcb4c37918) at /apps/spack/var/spack/stage/dealii-9.1.0/source/lac/block_sparsity_pattern.cc:172
#4  0x000055980fe1b3f8 in Problem_FeNothing::Demonstration<2>::setup_system (this=0x7ffcb4c36690) at ~/Programming/problem_fe_nothing/problem_fe_nothing.cc:146
#5  0x000055980fe1953b in Problem_FeNothing::Demonstration<2>::run (this=0x7ffcb4c36690) at ~/Programming/problem_fe_nothing/problem_fe_nothing.cc:163
#6  0x000055980fe12539 in main (argc=1, argv=0x7ffcb4c37c48) at ~/Programming/problem_fe_nothing/problem_fe_nothing.cc:183

The relevant point in the source code at crash time is in Epetra_CrsGraph.cpp around line 1823:
─── Source ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1818       int GID = ColIndices[j];
1819       int LID = colmap.LID(GID);
1820       if(LID != -1)
1821         ColIndices[j] = LID;
1822       else
>>1823         throw ReportError("Internal error in FillComplete ",-1);
1824       }
1825     }
1826   }
1827   else if(RowMap().GlobalIndicesLongLong())
1828   {

The value of LID is indeed "-1", i.e. the global column index GID is not contained in the column map, and therefore we land in line 1823 and the error is thrown.
Since I'm really no expert in the underlying data structures of the deal.II TrilinosWrappers and Trilinos itself: is there any way to get further help in fixing this problem?

Thanks,

Mathias

Daniel Arndt

May 22, 2019, 11:13:21 PM
to deal.II User Group
Mathias,

this patch seems to work:

diff --git a/source/lac/trilinos_sparsity_pattern.cc b/source/lac/trilinos_sparsity_pattern.cc
index 84403175c6..c448279505 100644
--- a/source/lac/trilinos_sparsity_pattern.cc
+++ b/source/lac/trilinos_sparsity_pattern.cc
@@ -791,7 +791,8 @@ namespace TrilinosWrappers
     if (nonlocal_graph.get() != nullptr)
       {
         if (nonlocal_graph->IndicesAreGlobal() == false &&
-            nonlocal_graph->RowMap().NumMyElements() > 0)
+            nonlocal_graph->RowMap().NumMyElements() > 0 &&
+            column_space_map->NumMyElements() > 0)
           {
             // Insert dummy element at (row, column) that corresponds to row 0
             // in local index counting.
@@ -813,6 +814,7 @@ namespace TrilinosWrappers
             AssertThrow(ierr == 0, ExcTrilinosError(ierr));
           }
         Assert(nonlocal_graph->RowMap().NumMyElements() == 0 ||
+                 column_space_map->NumMyElements() == 0 ||
                  nonlocal_graph->IndicesAreGlobal() == true,
                ExcInternalError());
 
It still needs some testing, though.
Can you give it a shot?

Best,
Daniel

Wolfgang Bangerth

May 22, 2019, 11:23:23 PM
to dea...@googlegroups.com
On 5/22/19 9:13 PM, Daniel Arndt wrote:
>
> this patch seems to work:

Is that because one processor has no elements in (column space of) the
sparsity pattern?

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Daniel Arndt

May 23, 2019, 12:09:47 AM
to deal.II User Group

> this patch seems to work:

This is now a pull request: https://github.com/dealii/dealii/pull/8277/

> Is that because one processor has no elements in (column space of) the
> sparsity pattern?

In fact, the column space for block (0,1) is empty on all processors,
since it corresponds to the degrees of freedom of the FE_Nothing element.
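As a rough illustration (reusing the hypothetical owned_partitioning index sets from the sketches earlier in this thread, not the exact variable names of the attached example):

// FE_Nothing has no DoFs, so block 1 owns nothing on any rank:
for (unsigned int b = 0; b < owned_partitioning.size(); ++b)
  std::cout << "rank "
            << dealii::Utilities::MPI::this_mpi_process(mpi_communicator)
            << ", block " << b << ": "
            << owned_partitioning[b].n_elements()
            << " locally owned DoFs" << std::endl;
// Block 1 prints 0 on every rank, i.e. the column space of block (0,1) is empty.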

Best,
Daniel

Mathias Anselmann

May 23, 2019, 3:26:13 AM
to deal.II User Group
Daniel, this patch works great for me!
Both the little example that I uploaded here and the problem I'm working on at the moment run flawlessly with your patch and multiple processes.

Thank you so much, this is actually the second time that you've helped me with a Trilinos sparsity pattern problem (and again incredibly fast). I really appreciate it; this helps me a lot.

Greetings

Mathias