MPI_InitFinalize causes runtime error when setting the build type to release

JxW

Dec 2, 2016, 6:12:14 AM
to deal.II User Group
I encountered this problem on my Linux machine. Everything works fine in debug mode; however, as soon as I switch to release mode using -DCMAKE_BUILD_TYPE=Release, the code still compiles but crashes with a segmentation violation when it runs.

The following code reproduces this problem for me
(The CMakeLists.txt simply calls the DEAL_II_INVOKE_AUTOPILOT() macro):

#include <iostream>
#include <deal.II/base/mpi.h>

int
main (int argc, char *argv[]) {
  dealii::Utilities::MPI::MPI_InitFinalize mpi_init (argc, argv);
  std::cout << "Program runs normally." << std::endl;
  return 0;
}

Initially I suspected that the mpi_init object might be optimized away by the compiler, but adding volatile to its
declaration does not fix it. I still do not know what is causing the problem. Any help or suggestion is appreciated.

P.S.: I also tried running the binary in valgrind, which reports an invalid read. The full report is as follows:

==14930== Memcheck, a memory error detector
==14930== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==14930== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==14930== Command: ./test.o
==14930== 
==14930== Invalid read of size 8
==14930==    at 0xADD6692: tbb::internal::generic_scheduler::allocate_task(unsigned long, tbb::task*, tbb::task_group_context*) (scheduler.cpp:315)
==14930==    by 0xADD675A: tbb::internal::generic_scheduler::generic_scheduler(tbb::internal::market&) (scheduler.cpp:100)
==14930==    by 0xADD8207: custom_scheduler (custom_scheduler.h:59)
==14930==    by 0xADD8207: tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::allocate_scheduler(tbb::internal::market&) (custom_scheduler.h:115)
==14930==    by 0xADD7051: allocate_scheduler (scheduler.cpp:42)
==14930==    by 0xADD7051: tbb::internal::generic_scheduler::create_master(tbb::internal::arena*) (scheduler.cpp:1032)
==14930==    by 0xADD1B53: tbb::internal::governor::init_scheduler(int, unsigned long, bool) (governor.cpp:206)
==14930==    by 0xADD1C0B: tbb::task_scheduler_init::initialize(int, unsigned long) (governor.cpp:349)
==14930==    by 0x97A96C2: dealii::MultithreadInfo::set_thread_limit(unsigned int) (in /home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/dealii-develop-e5t4qkpvvpzbbomgfymanf2z5ja2yxfr/lib/libdeal_II.so.8.5.0-pre)
==14930==    by 0x97A3D0B: dealii::Utilities::MPI::MPI_InitFinalize::MPI_InitFinalize(int&, char**&, unsigned int) (in /home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/dealii-develop-e5t4qkpvvpzbbomgfymanf2z5ja2yxfr/lib/libdeal_II.so.8.5.0-pre)
==14930==    by 0x409C77: main (in /home/xywei/release_debug/build/test.o)
==14930==  Address 0xfffffffffffffff7 is not stack'd, malloc'd or (recently) free'd
==14930== 
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run 
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See http://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.7.4, Oct, 02, 2016 
[0]PETSC ERROR: ./test.o on a arch-linux2-c-opt named WXYZG_under_Arch by xywei Fri Dec  2 19:01:12 2016
[0]PETSC ERROR: Configure options --prefix=/home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/petsc-3.7.4-gdocdh6szvdge7ybgtujtibhu3xdvhc6 --with-ssl=0 --with-mpi=1 --with-mpi-dir=/home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/openmpi-2.0.1-tqpsaspkym4fdbxzzr2g7qfbaf52qz2u --with-cpp=cpp --with-cxxcpp=cpp --with-precision=double --with-scalar-type=real --with-shared-libraries=1 --with-debugging=0 --with-blas-lapack-lib=/home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/openblas-0.2.19-443ize6rvixorrj5ftlz4brlmju2lrfq/lib/libopenblas.so --with-metis=1 --with-metis-dir=/home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/metis-5.1.0-p255v6f6j47usne5perkzwkrcqjzyllg --with-boost=1 --with-boost-dir=/home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/boost-1.62.0-fmjpoz5juo2quof5vmwkxq7l64jlvam5 --with-hdf5=1 --with-hdf5-dir=/home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/hdf5-1.10.0-patch1-viwtpeovrofsuhsll2wccgqulpnx4zhx --with-hypre=1 --with-hypre-dir=/home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/hypre-2.10.1-eyk2nctsm3lf5kzx67isilgboqtqkxgo --with-parmetis=1 --with-parmetis-dir=/home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/parmetis-4.0.3-dwetxnnqiilkpubh3rwzia3224ww4cqc --with-mumps=1 --with-mumps-dir=/home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/mumps-5.0.1-6mivsraqmu72gd6zadnx63526hqiaxqk --with-scalapack=1 --with-scalapack-dir=/home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/netlib-scalapack-2.0.2-tieocctr7jazlz47blq5ff4gkyluer73 --with-superlu_dist-include=/home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/superlu-dist-5.0.0-sxkek4cbgxo7ea6imi4bfbygqidsfhen/include --with-superlu_dist-lib=/home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/superlu-dist-5.0.0-sxkek4cbgxo7ea6imi4bfbygqidsfhen/lib/libsuperlu_dist.a --with-superlu_dist=1
[0]PETSC ERROR: #1 User provided function() line 0 in  unknown file
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
==14930== Warning: invalid file descriptor -1 in syscall close()
==14930== 
==14930== HEAP SUMMARY:
==14930==     in use at exit: 4,955,503 bytes in 18,505 blocks
==14930==   total heap usage: 28,634 allocs, 10,129 frees, 6,265,419 bytes allocated
==14930== 
==14930== LEAK SUMMARY:
==14930==    definitely lost: 844,584 bytes in 65 blocks
==14930==    indirectly lost: 0 bytes in 0 blocks
==14930==      possibly lost: 704 bytes in 2 blocks
==14930==    still reachable: 4,110,215 bytes in 18,438 blocks
==14930==                       of which reachable via heuristic:
==14930==                         newarray           : 1,560 bytes in 3 blocks
==14930==         suppressed: 0 bytes in 0 blocks
==14930== Rerun with --leak-check=full to see details of leaked memory
==14930== 
==14930== For counts of detected and suppressed errors, rerun with: -v
==14930== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)

Wolfgang Bangerth

Dec 2, 2016, 10:35:34 AM
to dea...@googlegroups.com
On 12/02/2016 04:12 AM, JxW wrote:
>
> The following code reproduces this problem for me
> (The CMakeLists.txt simply calls the DEAL_II_INVOKE_AUTOPILOT() macro):
>
> #include <iostream>
> #include <deal.II/base/mpi.h>
>
> int main (int argc, char *argv[]) {
> dealii::Utilities::MPI::MPI_InitFinalize mpi_init (argc, argv);
> std::cout << "Program runs normally." << std::endl;
> return 0;
> }
>
> Initially I suspected that the mpi_init object might be optimized away by the
> compiler. But adding volatile to its
> declaration does not fix this. Now I still do not know what is causing the
> problem. Any help or suggestion is appreciated.

Does this also happen with just a single processor? If you run this in a
debugger, can you get a backtrace that shows where the problem happens?

Best
W.

--
------------------------------------------------------------------------
Wolfgang Bangerth email: bang...@colostate.edu
www: http://www.math.colostate.edu/~bangerth/

JxW

Dec 2, 2016, 10:48:45 PM
to deal.II User Group, bang...@colostate.edu
Thanks a lot for the helpful advice.

If I pass a custom max_num_threads to the constructor of mpi_init, then I get
the following behavior:

 - Segmentation violation, when max_num_threads = 3, 4, numbers::invalid_unsigned_int

 - Normal execution,         when max_num_threads = 1, 2, 5, 6, 7, 8, 9, 10
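
For reference, the program is just the reproducer above with the thread limit passed as the
third constructor argument (a minimal sketch; only the value of max_num_threads changes
between the runs listed above):

#include <deal.II/base/mpi.h>

int
main (int argc, char *argv[]) {
  // Same reproducer as before, but with an explicit thread limit.
  // On my machine, 1 or 2 here runs normally, while 3, 4, or
  // numbers::invalid_unsigned_int (the default) segfaults.
  dealii::Utilities::MPI::MPI_InitFinalize mpi_init (argc, argv, /*max_num_threads=*/3);
  return 0;
}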


P.S.: My computer has a dual core CPU with hyper-threading, so in /proc/cpuinfo
it shows 4 processors.

When running in GDB, all segmentation violation cases above give the same message,
which points to the same position where the invalid read happens according to valgrind:

Thread 1 "test" received signal SIGSEGV, Segmentation fault.
tbb::internal::generic_scheduler::allocate_task (this=this@entry=0x7fffb8736600, 
    number_of_bytes=number_of_bytes@entry=8, parent=parent@entry=0x0, 
    context=context@entry=0x7ffff1e5c0a0 <tbb::internal::the_dummy_context>)
    at ../../src/tbb/scheduler.cpp:315
315	../../src/tbb/scheduler.cpp: No such file or directory.

Wolfgang Bangerth

Dec 3, 2016, 7:17:24 AM
to dea...@googlegroups.com
On 12/02/2016 08:48 PM, JxW wrote:
> Thanks a lot for the helpful advice.
>
> If I pass a custom max_num_threads to the constructor of mpi_init, then I get
> the following behavior:
>
> - Segmentation violation, when max_num_threads = 3, 4,
> numbers::invalid_unsigned_int
>
> - Normal execution, when max_num_threads = 1, 2, 5, 6, 7, 8, 9, 10

That is bizarre. I have never seen anything like this.

Is this using the TBB that comes with your operating system, or the one that
is bundled with deal.II? You can see this in the summary statement that is
produced at the end of the cmake run.


> P.S.: My computer has a dual core CPU with hyper-threading, so in /proc/cpuinfo
> it shows 4 processors.
>
> When running in GDB, all segmentation violation cases above give the same message,
> which points to the same position where the invalid read happens according to
> valgrind:
>
> Thread 1 "test" received signal SIGSEGV, Segmentation fault.
> tbb::internal::generic_scheduler::allocate_task (this=this@entry=0x7fffb8736600,
> number_of_bytes=number_of_bytes@entry=8, parent=parent@entry=0x0,
> context=context@entry=0x7ffff1e5c0a0 <tbb::internal::the_dummy_context>)
> at ../../src/tbb/scheduler.cpp:315
> 315 ../../src/tbb/scheduler.cpp: No such file or directory.

But what is the backtrace? Where is this function called from? You should be
able to see this in a debugger.

JxW

Dec 4, 2016, 11:26:56 PM
to deal.II User Group, bang...@colostate.edu


On Saturday, December 3, 2016 at 8:17:24 PM UTC+8, Wolfgang Bangerth wrote:
On 12/02/2016 08:48 PM, JxW wrote:
> Thanks a lot for the helpful advice.
>
> If I pass a custom max_num_threads to the constructor of mpi_init, then I get
> the following behavior:
>
>  - Segmentation violation, when max_num_threads = 3, 4,
> numbers::invalid_unsigned_int
>
>  - Normal execution,         when max_num_threads = 1, 2, 5, 6, 7, 8, 9, 10

> That is bizarre. I have never seen anything like this.
>
> Is this using the TBB that comes with your operating system, or the one that
> is bundled with deal.II? You can see this in the summary statement that is
> produced at the end of the cmake run.

I believe it is using the TBB that comes with GCC 6.2.1. Although I didn't find
it in the cmake output, the executable links to libtbb.so.2 under GCC's directory:
$: ldd test | grep tbb
libtbb.so.2 => /home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/tbb-4.4.4-c3xgursmdjopb3jywpcowiv74hap3bpp/lib/libtbb.so.2 (0x00007f6a2de29000)
libtbbmalloc.so.2 => /home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/tbb-4.4.4-c3xgursmdjopb3jywpcowiv74hap3bpp/lib/libtbbmalloc.so.2 (0x00007f69fe376000)
 
 

> P.S.: My computer has a dual core CPU with hyper-threading, so in /proc/cpuinfo
> it shows 4 processors.
>
> When running in GDB, all segmentation violation cases above give the same message,
> which points to the same position where the invalid read happens according to
> valgrind:
>
> Thread 1 "test" received signal SIGSEGV, Segmentation fault.
> tbb::internal::generic_scheduler::allocate_task (this=this@entry=0x7fffb8736600,
>     number_of_bytes=number_of_bytes@entry=8, parent=parent@entry=0x0,
>     context=context@entry=0x7ffff1e5c0a0 <tbb::internal::the_dummy_context>)
>     at ../../src/tbb/scheduler.cpp:315
> 315        ../../src/tbb/scheduler.cpp: No such file or directory.

> But what is the backtrace? Where is this function called from? You should be
> able to see this in a debugger.

The backtrace is no different from what valgrind produces:
 
#0  tbb::internal::generic_scheduler::allocate_task (this=this@entry=0x7fffb8736600, 
    number_of_bytes=number_of_bytes@entry=8, parent=parent@entry=0x0, 
    context=context@entry=0x7ffff1e5c0a0 <tbb::internal::the_dummy_context>)
    at ../../src/tbb/scheduler.cpp:315
#1  0x00007ffff1c4375b in tbb::internal::generic_scheduler::generic_scheduler (
    this=this@entry=0x7fffb8736600, m=...) at ../../src/tbb/scheduler.cpp:100
#2  0x00007ffff1c45208 in tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::custom_scheduler (m=..., this=0x7fffb8736600) at ../../src/tbb/custom_scheduler.h:59
#3  tbb::internal::custom_scheduler<tbb::internal::IntelSchedulerTraits>::allocate_scheduler (
    m=...) at ../../src/tbb/custom_scheduler.h:115
#4  0x00007ffff1c44052 in tbb::internal::allocate_scheduler (m=...)
    at ../../src/tbb/scheduler.cpp:42
#5  tbb::internal::generic_scheduler::create_master (a=0x7fffb873ed00)
    at ../../src/tbb/scheduler.cpp:1032
#6  0x00007ffff1c3eb54 in tbb::internal::governor::init_scheduler (
    num_threads=num_threads@entry=4, stack_size=stack_size@entry=0, 
    auto_init=auto_init@entry=false) at ../../src/tbb/governor.cpp:206
#7  0x00007ffff1c3ec0c in tbb::task_scheduler_init::initialize (
    this=0x7ffff7dccd88 <dealii::MultithreadInfo::set_thread_limit(unsigned int)::dummy>, 
    number_of_threads=4, thread_stack_size=0) at ../../src/tbb/governor.cpp:349
#8  0x00007ffff67cf6c3 in dealii::MultithreadInfo::set_thread_limit(unsigned int) ()
   from /home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/dealii-develop-e5t4qkpvvpzbbomgfymanf2z5ja2yxfr/lib/libdeal_II.so.8.5.0-pre
#9  0x00007ffff67c9d0c in dealii::Utilities::MPI::MPI_InitFinalize::MPI_InitFinalize(int&, char**&, unsigned int) ()
   from /home/xywei/spack/opt/spack/linux-arch-x86_64/gcc-6.2.1/dealii-develop-e5t4qkpvvpzbbomgfymanf2z5ja2yxfr/lib/libdeal_II.so.8.5.0-pre
#10 0x000000000040a164 in main ()

Timo Heister

Dec 5, 2016, 9:03:58 AM
to dea...@googlegroups.com, Wolfgang Bangerth
> I believe it is using the TBB that come with the GCC 6.2.1. Although I didn't find
> it in cmake output,

Can you do
$ grep THREADS detailed.log
in your build directory? It should show something like
# DEAL_II_WITH_THREADS set up with bundled packages
# THREADS_CXX_FLAGS = -Wno-parentheses
# THREADS_LINKER_FLAGS = -pthread
# THREADS_DEFINITIONS_DEBUG = TBB_USE_DEBUG;TBB_DO_ASSERT=1
# THREADS_USER_DEFINITIONS_DEBUG = TBB_USE_DEBUG;TBB_DO_ASSERT=1
# THREADS_BUNDLED_INCLUDE_DIRS =
/ssd/deal-git/bundled/tbb41_20130401oss/include
# THREADS_LIBRARIES = dl

--
Timo Heister
http://www.math.clemson.edu/~heister/

Xiaoyu Wei

Dec 6, 2016, 12:37:25 AM
to dea...@googlegroups.com
On Mon, Dec 5, 2016 at 10:03 PM, Timo Heister <hei...@clemson.edu> wrote:
> I believe it is using the TBB that come with the GCC 6.2.1. Although I didn't find
> it in cmake output,

> Can you do
> $ grep THREADS detailed.log

Weirdly I do not have detailed.log under the build directory.
 
> in your build directory? It should show something like
> #        DEAL_II_WITH_THREADS set up with bundled packages
> #            THREADS_CXX_FLAGS = -Wno-parentheses
> #            THREADS_LINKER_FLAGS = -pthread
> #            THREADS_DEFINITIONS_DEBUG = TBB_USE_DEBUG;TBB_DO_ASSERT=1
> #            THREADS_USER_DEFINITIONS_DEBUG = TBB_USE_DEBUG;TBB_DO_ASSERT=1
> #            THREADS_BUNDLED_INCLUDE_DIRS =
> /ssd/deal-git/bundled/tbb41_20130401oss/include
> #            THREADS_LIBRARIES = dl

I printed out all the compiler and linker commands and confirmed that it is using
TBB 4.4.4, which comes with GCC 6.2.1.

 


Timo Heister

Dec 6, 2016, 11:09:10 AM
to dea...@googlegroups.com
> I printed out all the compiler and linker commands and confirmed that it is
> using
> tbb4.4.4 that comes with gcc6.2.1.

Or there is something messed up in the configuration and two different
versions of the TBB are being mixed. That could explain weird crashes. Please
redo your deal.II configuration and check your detailed.log.

You can also try forcing deal.II to use the bundled TBB.
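
A quick way to check which TBB headers your program is actually compiled against is
something like the following (a small sketch, independent of deal.II; compare its output
with the libtbb.so.2 that ldd showed above to spot a header/library mismatch):

#include <iostream>
#include <tbb/tbb_stddef.h>   // version macros of classic TBB (e.g. 4.4)

int
main () {
  // Version of the TBB headers this file was compiled against. If it does not
  // match the libtbb that ldd reports, headers and library are being mixed.
  std::cout << "TBB headers: "
            << TBB_VERSION_MAJOR << "." << TBB_VERSION_MINOR
            << " (TBB_INTERFACE_VERSION = " << TBB_INTERFACE_VERSION << ")"
            << std::endl;
  return 0;
}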

Bruno Turcksin

Mar 6, 2017, 10:04:59 AM
to deal.II User Group, bang...@colostate.edu
On Friday, December 2, 2016 at 10:48:45 PM UTC-5, JxW wrote:
> Thanks a lot for the helpful advice.
>
> If I pass a custom max_num_threads to the constructor of mpi_init, then I get
> the following behavior:
>
>  - Segmentation violation, when max_num_threads = 3, 4, numbers::invalid_unsigned_int
>
>  - Normal execution,         when max_num_threads = 1, 2, 5, 6, 7, 8, 9, 10
>
> P.S.: My computer has a dual core CPU with hyper-threading, so in /proc/cpuinfo
> it shows 4 processors.
>
> When running in GDB, all segmentation violation cases above give the same message,
> which points to the same position where the invalid read happens according to valgrind:
>
> Thread 1 "test" received signal SIGSEGV, Segmentation fault.
> tbb::internal::generic_scheduler::allocate_task (this=this@entry=0x7fffb8736600,
>     number_of_bytes=number_of_bytes@entry=8, parent=parent@entry=0x0,
>     context=context@entry=0x7ffff1e5c0a0 <tbb::internal::the_dummy_context>)
>     at ../../src/tbb/scheduler.cpp:315
> 315	../../src/tbb/scheduler.cpp: No such file or directory.

In case someone else has the same problem, I just hit the same bug with TBB 4.4.4. Updating to TBB 2017 fixed it.

Bruno