writing data to a file changes program output although this file is never used

101 views
Skip to first unread message

Simon

unread,
Mar 24, 2023, 1:04:49 PM3/24/23
to deal.II User Group
Dear all,

my deal.ii program is called roughly 300 times from a matlab script for the purpose of a parameter identification. In essence, given some parameters, the matlab script calls deal.ii to compute a displacement field. I exchange the data transfer between deal.ii and matlab by reading from and writing into files. In particular, I store the parameters in a file with 13 digits after the decimal point since there are finite-difference calculations involved.
I double-checked that my simulations are deterministic, i.e. the parameters at all 300 calls to deal.ii are identical if I re-run the matlab script anew. By identical, I mean that all 13 digits coincide.

My deal.ii program has, besides the standard data structures (Triangulation, DoFHandler,...), a pointer to a class as member variable.
This class has a member function, which is called during assembly of the linear system to calculate a stress tensor. Within this member function, I created a
std::ofstream out_data("/a/path/file.txt")
object and wrote the calculated stress into this file:
out_data<<stress<<std::endl
The point is that I really just write the stress to the file.
Neither my deal.ii program nor the matlab script read from this file.
I just created it to have a look at the stress values.

However, after adding
out_data<<stress<<std::endl
to the above member function, I observed that the parameters up to the 24th call to deal.ii are identical compared to the parameters without having this line of code,
but in the 25th call, some parameters are different at the last three digits, i.e.
the first 10 digits are still identical and the last three differ.
This error sums up in the following calls...
These results are also deterministic: Re-running the matlab script anew gives the same false parameters in the 25th call...
If I comment really only the line
//out_data<<stress<<std::endl 
I obtain again the correct parameters.

Having all that said, is there any reasonable argument as to what could cause the program output to be different by just adding ...<<... statement?
Admittedly, I do not know how to debug this further, because there is in theory
nothing to debug if I can trace the differences back to the ...<<... statement.

Thank you and best regards
Simon

Wolfgang Bangerth

unread,
Mar 24, 2023, 5:38:49 PM3/24/23
to dea...@googlegroups.com
On 3/24/23 11:04, Simon wrote:
>
> Having all that said, is there any reasonable argument as to what could
> cause the program output to be different by just adding ...<<... statement?

There is no reasonable argument why that should be the case. The
function you are calling looks like this:

template <int rank_, int dim, typename Number>
inline std::ostream &
operator<<(std::ostream &out, const Tensor<rank_, dim, Number> &p)
{
for (unsigned int i = 0; i < dim; ++i)
{
out << p[i];
if (i != dim - 1)
out << ' ';
}

return out;
}

It *should* have no side effects. If it does, there is something funny
going on in your program.

That leaves a philosophical question:

> Admittedly, I do not know how to debug this further, because there is in
> theory
> nothing to debug if I can trace the differences back to the ...<<...
> statement.

Approach A is to say "Well, let me just remove the output statement
then; I know that my program works in that case". Approach B is to say
"This situation illustrates that I don't understand something
fundamental about my program. Let me use the occasion to investigate".
Everyone who has been following this mailing list for a while will
suspect that I lean quite heavily towards Approach B. If the state of
the program changes in ways I do not understand, and that make no sense
to me, it is not unreasonable to suspect that there is a conceptual bug
somewhere, however unpleasant it may be to search for it, and that not
investigating the cause for the issue is really just closing one's eyes
towards a problem that is clearly there.

In your case, I would start by outputting vastly more intermediate
results (up to machine precision) to see where the first time is that
something is different between the program with and the one without the
<< statement. Maybe you can identify which variable it is that first
changes, and if you know that, it may be easier to also see *why* it
changes. You could wrap the operator<< statement in some kind of
if-condition that you can control from outside the program, and then run
your program twice every time, once with and without the output
statement. Pipe the output of each run into two files that you can then
run 'diff' on.

From experience, this is not likely going to be a bug that is fun to
chase down. But it might make for a good learning opportunity. It would
be quite interesting to hear back here if you find out what caused the
issue!

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

Simon Wiesheier

unread,
Mar 29, 2023, 9:02:52 AM3/29/23
to dea...@googlegroups.com
Of course, Approach B is the right way to go if I want to produce reliable results.

After plenty of hours of debugging, I found a possible source for the differences:
The member function called by my assembly routine has the following signature
void fun(const Tensor<2,3 & F,
               const double JxW,
               const unsigned int global_index,
               std::vector<Vector<double>> & sensitivity_matrix)
{
    //use F and JxW to write some values into sensitivity_matrix at global_index
}

To debug my problem, I wrapped the << statement into an if-condition that I controlled via my parameter file:
{
    // use F and JxW to write some values into sensitivity_matrix, which is passed by reference since it is a large matrix
    if(print_or_not==true)  std::cout<<"print anything here..."<<std::endl
}
Interestingly, setting print_or_not==false or just out-commenting the line changes some results of my program.
In other words, the output obtained by
{
    // use F and JxW to write some values into sensitivity_matrix, which is passed by reference since it is a large matrix
    if(false)  std::cout<<"print anything here..."<<std::endl
}
is different from the output obtained by
{
    // use F and JxW to write some values into sensitivity_matrix, which is passed by reference since it is a large matrix
    // if(false)  std::cout<<"print anything here..."<<std::endl
}

Aside, adding more << statements inside fun changed the output again. 
By output, I refer to sensitivity_matrix after assembly has finished. 
Differences can be seen only for some entries of sensitivity_matrix and, as for them, only the 13th or higher digits after the decimal point. 

Then, I changed the signature of fun to
void fun(const Tensor<2,3 & F,
               const double & JxW,
               const unsigned int & global_index
               std::vector<Vector<double>> & sensitivity_matrix)
Adding the reference symbol & to JxW and global_index, 
the program output is identical, regardless of the number of print statements -- as it should be the case.
Within the assembly function (which calls fun()), JxW and global_index are defined as always in a deal.ii program:
const double JxW=fe_values.JxW(k);
std::vector<types::global_dof_index> local_dof_indices(...); -> cell->get_dof_indices(local_dof_indices); 
gobal_index = local_dof_indices[i];

Could the missing reference symbol & be an explanation for why I observed the undefined behavior of my program?
It clearly changes something, but I do not really understand why that might be the case.

Best 
Simon







--
The deal.II project is located at http://www.dealii.org/
For mailing list/forum options, see https://groups.google.com/d/forum/dealii?hl=en
---
You received this message because you are subscribed to the Google Groups "deal.II User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dealii+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/dealii/da62c7c6-6328-e385-7029-afe5edf43ecd%40colostate.edu.


--
Simon Wiesheier, M.Sc., M.Sc.

Doctoral Researcher

Institute of Applied Mechanics
Friedrich-Alexander-Universität
Erlangen-Nürnberg
Paul-Gordan-Str. 3
D-91052 Erlangen

Room 00.043

Tel: +49 (0) 9131 85 - 64 410
Fax: +49 (0) 9131 85 - 64 413

Simon

unread,
Mar 29, 2023, 12:41:24 PM3/29/23
to deal.II User Group
Too early pleased -- the issue still persists.

As described above, adding the reference symbol & to JxW and global_index makes a difference in my program, but is not the solution.

I ran my program in debug mode using
valgrind --tool=memcheck --leak-check=full ./my_program

The valgrind output associated with
void fun(const Tensor<2,3 & F,
               const double JxW,
               const unsigned int global_index,
               std::vector<Vector<double>> & sensitivity_matrix)
{
    // use F and JxW to write some values into sensitivity_matrix, which is passed by reference since it is a large matrix
    if(print_or_not==true)  std::cout<<"print anything here..."<<std::endl
}
is
memcheck_NotOutC.png

The valgrind output associated with
void fun(const Tensor<2,3 & F,
               const double JxW,
               const unsigned int global_index,
               std::vector<Vector<double>> & sensitivity_matrix)
{
    // use F and JxW to write some values into sensitivity_matrix, which is passed by reference since it is a large matrix
    // if(print_or_not==true)  std::cout<<"print anything here..."<<std::endl
}
memcheck_outC.png

The differences are "only" possibly lost messages, but the backtraces are not really clear to me.
Do you see any problems with my code by inspecting the memory check?

Based on your experience, I would appreciate your feedback regarding how to debug my problem further.
Adding print statements clearly is not really helpful if my program shows undefined behavior and
if the print statement itself is what causes differences.

Thank you!

Bruno Turcksin

unread,
Mar 30, 2023, 2:18:31 PM3/30/23
to deal.II User Group
Hello,

Usually when I have this kind of bug, there are two possibilities:
 1. I am using an un-initialized value
 2. I am writing out of bound

What I do is using valgrind with my code in Debug mode and without TBB enabled otherwise you get difficult to understand backtrace like here. However, what I've also encountered is that in Debug mode, the compiler would initialize the variable but it wouldn't do it in Release... So make sure that all the variables are initialized and when using a std::vector, use at() instead of operator[]. Only use at() in Debug mode because it is much slower than operator[]. Other than that rerun valgrind without TBB enabled, you should get something more meaningful.

Best,

Bruno

Daniel Jodlbauer

unread,
Mar 30, 2023, 4:37:56 PM3/30/23
to deal.II User Group
I encountered similar funny bugs in the past. It was usually one or more of
- read/write out of bounds
- race condition or other multithread effects, e.g. interaction between TBB/OpenMP/Threads/Blas/...)
- dangling reference (mainly with clang)
- compiler bug
(somewhat in decreasing order of likelihood).

I would try to
- disable parallelization as much as possible
- run in release mode instead of debug: may give some hint if it is a race condition / timing issue
- use asan/tsan: I've found the sanitizers more reliable than valgrind, in particular for multithreaded code, although I haven't used those with deal.II (requires recompilation)
- debug both variants in parallel and check for differences

Some more desparate attempts:
- try different compiler
- set variables to some fixed values here and there
- output to stringstream instead of file
- insert sleep/wait commands
(although probably none of these really helps to identify the problem)

You can also define _GLIBCXX_ASSERTIONS to enable bound checks for operator[] in libstdc++.

Good luck!

Simon Wiesheier

unread,
Mar 31, 2023, 8:58:33 AM3/31/23
to dea...@googlegroups.com
Hi Bruno, 

I do not see my bug in debug mode. So there the print statement has no effect. 

Given that it works fine in debug mode, there are likely no out-of-bound violations. I just use std::vector and Vector<double> as data structures which would be detected in debug mode.  

How can I run my program without TBB enabled?

Best
Simon 

Daniel Arndt

unread,
Mar 31, 2023, 9:02:58 AM3/31/23
to dea...@googlegroups.com
Simon,

what happens if you build deal.II without support for TBB? Does valgrind still complain?

Best,
Daniel

Bruno Turcksin

unread,
Mar 31, 2023, 9:06:47 AM3/31/23
to dea...@googlegroups.com
Simon,

Le ven. 31 mars 2023 à 08:58, Simon Wiesheier <simon.w...@gmail.com> a écrit :


Given that it works fine in debug mode, there are likely no out-of-bound violations. I just use std::vector and Vector<double> as data structures which would be detected in debug mode.  

That's optimistic :) It's not because you are in debug mode that std::vector checks the bounds. The bounds are only checked when using at()
 
How can I run my program without TBB enabled?

Reconfigure and recompile deal.II with -D DEAL_II_WITH_TBB=OFF

Best,

Bruno
 

Simon Wiesheier

unread,
Mar 31, 2023, 10:45:26 AM3/31/23
to dea...@googlegroups.com
Hello to all,

unfortunately, I have no admin rights and can not re-compile the library with -DDEAL_II_WITH_TBB=OFF.

But I did
export DEAL_II_NUM_THREADS=1
to make sure that nothing like a race condition is involved.
However, my program is not parallelized explicitedly by myself.
If there is multi-threading, then just because I call
dealii functions which are multi-threaded per default.

Anyway, with DEAL_II_NUM_THREADS=1, valgrind does not produce any error messages anymore (see valgrind.txt), but
my issue is still there. To give you a better insight, I attached the output of my program once
with the print statement in my assembly routine (output_print_NOT_outcommented.txt) and
without the print statement ( output_print_outcommented.txt).
As you can see, the RHS of my linear system after convergence is different at some load steps.
And as I said, these differences sum up in repeating calls to my program and, clearly, should not be there -- regardless how small they are.

What I did is to run my program with the address and undefined behavior sanitizer of clang by passing some flags to cmake:
cmake -DCAMKE_CXX_FLAGS="-fsanitize=undefined" -DCMAKE_EXE_LINKER_FLAGS ="-fsanitize=undefined" -DCAMKE_CXX_FLAGS="-fsanitize=address"
-DCMAKE_EXE_LINKER_FLAGS ="-fsanitize=address" -DCAMKE_CXX_FLAGS="-fno-optimize-sibling-calls" .
At the very end of my program (probably when the destructor is called or even later), the sanitizers reported some errors (see address_sanitizer.txt),
but this may be some false positives. In particular, no error is reported during the relevant part of my program.
These errors occur with and without the print statement, and also in debug mode.

If there is a bug in my program, this bug seems to be a tough one (at least for me).

If someone of you (Wolfgang, Bruno, Daniel,...) were willed to have a look at my code,
I would reduce the issue to a minimal working example and upload all source files here or sent it do you directly.
I would highly appreciate that since I do not want to extend my code further with that bug in it.

Best
Simon


--
The deal.II project is located at http://www.dealii.org/
For mailing list/forum options, see https://groups.google.com/d/forum/dealii?hl=en
---
You received this message because you are subscribed to the Google Groups "deal.II User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dealii+un...@googlegroups.com.
valgrind.txt
output_print_NOT_outcommented.txt
output_print_outcommented.txt
address_sanitizer.txt
Reply all
Reply to author
Forward
0 new messages