p::d::SolutionTransfer::interpolate fails with LA::distributed::Vector in a specific case


Denis Davydov

Jan 2, 2017, 11:50:32 AM
to deal.II User Group
Hi all,

I came across a very weird problem today. When using p::d::SolutionTransfer during refinement, I get the following error during the interpolate step in the second refinement cycle:

61: An error occurred in line <686> of file </Users/davydden/spack/var/spack/stage/dealii-develop-7douosfpwd62254laf4et7n6x64b62fh/dealii/include/deal.II/lac/la_parallel_vector.templates.h> in function
61:     void dealii::LinearAlgebra::distributed::Vector<double>::compress_finish(::dealii::VectorOperation::values)
61: The violated condition was:
61:     *read_position == Number() || std::abs(local_element(j) - *read_position) <= std::abs(local_element(j)) * 1000. * std::numeric_limits<real_type>::epsilon()
61: Additional information:
61:     Called compress(VectorOperation::insert), but the element received from a remote processor, value 2.059635156599626e-06, does not match with the value 2.059635156600494e-06 on the owner processor 2

I am quite certain that SolutionTransfer gets parallel vectors with (i) constraints distributed and (ii) ghost values updated. That is accomplished by

constraints.distribute(solution_vectors[i]);
solution_vectors[i].update_ghost_values();


and a minor trick (prior to those lines) to change the ghost index set when combining matrix-free and Kelly, as discussed here: https://groups.google.com/d/msg/dealii/-FMHGdn18fE/w6YotFXRAAAJ

Between this point and the usage of solution transfer, only the Kelly error estimator, data output and cell marking are involved.
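Schematically, the full cycle looks roughly like this (a simplified sketch only; dim, dof_handler, triangulation, constraints, solution_vectors and setup_system are placeholders for the corresponding pieces of the application):

using VectorType = LinearAlgebra::distributed::Vector<double>;

// Prepare the ghosted input vectors for the transfer (sketch, not the actual code).
std::vector<const VectorType *> in_ptrs;
for (VectorType &v : solution_vectors)
  {
    constraints.distribute(v);
    v.update_ghost_values();
    in_ptrs.push_back(&v);
  }

// ... Kelly error estimation, data output and cell marking happen here ...

parallel::distributed::SolutionTransfer<dim, VectorType> soltrans(dof_handler);
soltrans.prepare_for_coarsening_and_refinement(in_ptrs);
triangulation.execute_coarsening_and_refinement();

setup_system(); // redistributes DoFs and reinitializes solution_vectors

std::vector<VectorType *> out_ptrs;
for (VectorType &v : solution_vectors)
  out_ptrs.push_back(&v);
soltrans.interpolate(out_ptrs); // <-- this is where the assertion fires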


The issue appears for a very specific problem and does not always happen.
So far it looks like a hidden Heisenbug that is triggered only in certain cases.
Not knowing the details of the p::d::SolutionTransfer class, I am not quite sure where to dig next.
Given that ghost values should be consistent among MPI processes, I do not see how the interpolation during solution transfer could lead to such discrepancies.

Regards,
Denis

Denis Davydov

Jan 3, 2017, 3:19:55 AM
to deal.II User Group
The relevant backtrace is:

    frame #4: 0x00000001085b8c86 libdeal_II.g.8.5.0-pre.dylib`void dealii::deal_II_exceptions::internals::issue_error<dealii::LinearAlgebra::distributed::Vector<double>::ExcNonMatchingElements>(handling=abort_on_exception, file="/Users/davydden/libs-sources/deal.ii/davydden/include/deal.II/lac/la_parallel_vector.templates.h", line=686, function="void dealii::LinearAlgebra::distributed::Vector<double>::compress_finish(::dealii::VectorOperation::values)", cond="*read_position == Number() || std::abs(local_element(j) - *read_position) <= std::abs(local_element(j)) * 1000. * std::numeric_limits<real_type>::epsilon()", exc_name="ExcNonMatchingElements(*read_position, local_element(j), part.this_mpi_process())", e=ExcNonMatchingElements @ 0x00007fff5fbfa5e0) + 134 at exceptions.h:285
    frame #5: 0x00000001085b6d41 libdeal_II.g.8.5.0-pre.dylib`dealii::LinearAlgebra::distributed::Vector<double>::compress_finish(this=0x00000001261087c8, operation=insert) + 2401 at la_parallel_vector.templates.h:681
    frame #6: 0x00000001085b516f libdeal_II.g.8.5.0-pre.dylib`dealii::LinearAlgebra::distributed::Vector<double>::compress(this=0x00000001261087c8, operation=insert) + 47 at la_parallel_vector.templates.h:494
    frame #7: 0x000000010a150a73 libdeal_II.g.8.5.0-pre.dylib`dealii::parallel::distributed::SolutionTransfer<3, dealii::LinearAlgebra::distributed::Vector<double>, dealii::DoFHandler<3, 3> >::interpolate(this=0x00007fff5fbfb5b8, all_out=size=120) + 2451 at solution_transfer.cc:183

but that's not really helpful.

Daniel Arndt

Jan 3, 2017, 9:09:28 AM
to deal.II User Group
Denis,

Do you have a minimal working example that shows this problem?

Some ideas:
Can you confirm that the output vector you are using is "clean" in the sense that there are no spurious ghost values before handing it to interpolate?
In other words: Does output.zero_out_ghosts() or output=0. help? Do you get the same error if you are using a Trilinos or PETSc vector as output?

Given that we test SolutionTransfer in the testsuite, I would suspect that this is related to your setup of the output vector.

Best,
Daniel

Denis Davydov

Jan 3, 2017, 10:03:02 AM
to dea...@googlegroups.com
Hi Daniel,

> On 3 Jan 2017, at 15:09, Daniel Arndt <d.a...@math.uni-goettingen.de> wrote:
>
> Do you have a minimal working example that shows this problem?

I don't have an MWE for this, not yet at least. The tricky part is that this happens in some usage cases,
but the same code runs fine in other, more or less equally complicated, cases.

>
> Some ideas:
> Can you confirm that the output vector you are using is "clean" in the sense that there are no spurious ghost values before handing it to interpolate?

I call interpolate() right after calling setup_system() in my application, which in turn does NOT assign any values to those vectors.
Within the setup_system() the initialization of output vectors is done via

MatrixFree<dim,double>::initialize_dof_vector(solution_vectors[i]);

I added

solution_vectors[i] = 0.;

right after initialization but this does not change anything, neither does

solution_vectors[i].zero_out_ghosts()

> In other words: Does output.zero_out_ghosts() or output=0. help? Do you get the same error if you are using a Trilinos or PETSc vector as output?

Luckily I have a version of the main class which uses Trilinos (same p::d::Tria, 4 MPI cores with 2 threads each); when using it I do not see this error. BUT that does not mean the same thing is not happening, maybe Trilinos parallel vectors simply do not have this assert at all?

>
> Given that we test SolutionTransfer in the testsuite, I would suspect that this is related to your setup of the output vector.

So far I don’t see any indication that this is the case.
I also tried running with a single thread, but this does not influence the results.

p.s. The nature of the application is such that I cannot simply switch from deal.II to Trilinos vectors.
So the only way to get an MWE would be to serialize the p::d::Tria and write out the vectors in ASCII or something.

Regards,
Denis.

Martin Kronbichler

Jan 3, 2017, 11:49:03 AM
to dea...@googlegroups.com

Hi Denis,


> > In other words: Does output.zero_out_ghosts() or output=0. help? Do you get the same error if you are using a Trilinos or PETSc vector as output?
>
> Luckily I have a version of the main class which uses Trilinos (same p::d::Tria, 4 MPI cores with 2 threads each); when using it I do not see this error. BUT that does not mean the same thing is not happening, maybe Trilinos parallel vectors simply do not have this assert at all?
Indeed, Trilinos vectors do not have this assertion at all and happily overwrite the values: the contribution that happens to come last in whatever communication pattern Trilinos chooses takes precedence.

The question here is whether the original check might be too rigid, in the sense that different roundoff behavior on the two sides of an interpolation operation produces slightly different contributions. Looking at the entries:

> value 2.059635156599626e-06, does not match with the value 2.059635156600494e-06 on the owner processor 2

the difference appears in the 13th digit. I guess that is within the expected accuracy for this use of the vector, and we might use 1e4 rather than 1e3 as the admissible deviation. Feel free to propose a patch that changes the check.
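Roughly something like the following (just a sketch, based only on the condition and exception quoted in the error message above, with the factor 1000. replaced by 10000.):

Assert(*read_position == Number() ||
         std::abs(local_element(j) - *read_position) <=
           std::abs(local_element(j)) * 10000. *
             std::numeric_limits<real_type>::epsilon(),
       ExcNonMatchingElements(*read_position, local_element(j),
                              part.this_mpi_process()));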

Best,
Martin

Denis Davydov

Jan 3, 2017, 12:19:31 PM
to dea...@googlegroups.com
Hi Martin,

On 3 Jan 2017, at 17:49, Martin Kronbichler <kronbichl...@gmail.com> wrote:
> The question here is whether the original check might be too rigid, in the sense that different roundoff behavior on the two sides of an interpolation operation produces slightly different contributions.

Can you please elaborate?

My naïve understanding would be:
1. before doing interpolation, all ghost values are synchronized between the MPI processes
2. solution transfer would, roughly, take the local values on each element, apply the transformation matrices corresponding to each child and ship, if needed, the interpolated values to the new MPI owners of the cells.
3. each MPI process would go through its cells and set the values of the vectors at locally active DoFs, which, of course, means that at the interface between two partitions different MPI processes may be trying to write to the same global DoF entry in the vector.

In the above, I fail to see what could influence the order of operations and result in roundoff differences between different MPI cores.
Maybe Timo, being an author of p::d::SolutionTransfer, could comment on this?


> the difference appears in the 13th digit. I guess that is within the expected accuracy for this use of the vector, and we might use 1e4 rather than 1e3 as the admissible deviation. Feel free to propose a patch that changes the check.

Thanks, will do.

p.s. I would be happy to conclude that this is not an issue at all; I just wanted some peace of mind ;-)

Regards,
Denis.

Martin Kronbichler

Jan 3, 2017, 4:31:28 PM
to dea...@googlegroups.com
Hi Denis,


> Can you please elaborate?
>
> My naïve understanding would be:
> 1. before doing interpolation, all ghost values are synchronized between the MPI processes
> 2. solution transfer would, roughly, take the local values on each element, apply the transformation matrices corresponding to each child and ship, if needed, the interpolated values to the new MPI owners of the cells.
> 3. each MPI process would go through its cells and set the values of the vectors at locally active DoFs, which, of course, means that at the interface between two partitions different MPI processes may be trying to write to the same global DoF entry in the vector.
>
> In the above, I fail to see what could influence the order of operations and result in roundoff differences between different MPI cores.

The description of the three steps above is correct. The problem appears in step 2, where the solution gets interpolated separately on the two cells sharing a face. For an interpolatory finite element, the values on a shared face should coincide, and we know mathematically that they indeed interpolate to the same values. However, the (different) transfer matrices on the two sides might perform the additions in a different order. Thus, roundoff can accumulate differently, in particular if there is some cancellation. Does that make sense?

This problem would of course also appear in serial, but there we do not check that the value set into a vector is the same as the value that was there before.
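As a toy illustration (plain IEEE doubles, nothing deal.II specific): adding the same three numbers in two different orders already changes the last digit, which is exactly the kind of discrepancy the compress(insert) check complains about:

#include <cstdio>

int main()
{
  // The same three numbers summed in two different orders: with IEEE
  // doubles the two results differ in the last digit.
  const double one_order   = (0.1 + 0.2) + 0.3; // 0.6000000000000001
  const double other_order = 0.1 + (0.2 + 0.3); // 0.6
  std::printf("%.16g vs %.16g\n", one_order, other_order);
  return 0;
}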

Best,
Martin

Denis Davydov

Jan 3, 2017, 4:55:35 PM
to dea...@googlegroups.com
Hi Martin,


> On 3 Jan 2017, at 22:31, Martin Kronbichler <kronbichl...@gmail.com> wrote:
> The description of the three steps above is correct. The problem appears in step 2, where the solution gets interpolated separately on the two cells sharing a face. For an interpolatory finite element, the values on a shared face should coincide, and we know mathematically that they indeed interpolate to the same values. However, the (different) transfer matrices on the two sides might perform the additions in a different order. Thus, roundoff can accumulate differently, in particular if there is some cancellation. Does that make sense?

Thanks for the clarification, that makes perfect sense indeed.

Regards,
Denis
