*** stack smashing detected *** when CuDevice::Instantiate().SelectGpuId("yes")

Adam Zahran

Aug 2, 2017, 11:30:59 AM
to kaldi-help
Hi,

I'm trying to build an application that uses kaldi::CuDevice.
I built Kaldi with MKL, but I get a "stack smashing detected" error when I start the application.

I configured my kaldi with the following line:
./configure --shared --mathlib=MKL --threaded-math=yes --mkl-root=/opt/intel/mkl --mkl-threading=gomp --mkl-libdir=/opt/intel/mkl/lib/intel64 --double-precision=yes

I made a very simple application to reproduce the error.
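#include "cudamatrix/cu-device.h"  // resolved through -I../kaldi/src

int main()
{
        // Ask Kaldi to select a GPU; "yes" makes a GPU mandatory.
        kaldi::CuDevice::Instantiate().SelectGpuId("yes");
        return 0;
}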

I compiled it using the following command:
g++ -c -m64 -pipe -fPIC -g -std=c++0x -Wall -W -fPIC -DHAVE_OPENBLAS -DHAVE_CUDA -DKALDI_DOUBLEPRECISION=1 -DQT_QML_DEBUG -I../reproducingGPUbug -I. -I../kaldi/src -I../kaldi/tools/openfst/include -I/opt/intel/mkl/include -I/usr/lib/x86_64-linux-gnu/qt5/mkspecs/linux-g++-64 -o main.o ../reproducingGPUbug/main.cpp

and then:
g++ -m64 -o reproducingGPUbug main.o   -L../kaldi/tools/openfst/lib -L../kaldi/libs -L/usr/local/cuda/lib64 -L../ -Wl,--start-group -lkaldi-base -lkaldi-cudamatrix -lkaldi-matrix -lkaldi-util -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -lcublas -lcudart -lcufft -lcurand -Wl,--end-group -lfst -fopenmp -liomp5 -lpthread -lm -ldl

I used -L../ because I placed the MKL .a libs there.

In the debugger I see that the SIGABRT is raised at the very end of the function CuDevice::IsComputeExclusive().

I guess this is reproducible anywhere but please let me know if you need more debugging information.
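
A note on why the abort is reported at the very end of the function: GCC's stack protector places a guard value next to a function's local variables and only checks it just before the function returns. The report therefore names the function whose frame was overwritten, which is not necessarily the code that did the writing. The sketch below imitates that idea inside one ordinary byte array, so it is safe to run. All names in it (kLocalSize, kCanary, WriteAndCheck) are hypothetical and are not taken from Kaldi, CUDA, or GCC.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical illustration only: a "frame" laid out as [locals | guard | slack].
constexpr std::size_t kLocalSize = 16;                    // bytes reserved for locals
constexpr std::uint64_t kCanary = 0xC0FFEE1234567890ULL;  // arbitrary guard value

// Writes n bytes (n must be at most 88) starting at the locals, then performs
// the check a protected function would do right before returning.
// Returns true if the guard value is still intact.
bool WriteAndCheck(std::size_t n) {
  // The extra 64 bytes of slack keep oversized writes inside the array.
  unsigned char frame[kLocalSize + sizeof(kCanary) + 64] = {0};
  std::memcpy(frame + kLocalSize, &kCanary, sizeof(kCanary));

  std::memset(frame, 0xAB, n);  // stands in for a callee filling a buffer

  std::uint64_t seen = 0;
  std::memcpy(&seen, frame + kLocalSize, sizeof(seen));
  return seen == kCanary;       // the check happens last, as in a real epilogue
}
```

Calling WriteAndCheck(16) leaves the guard intact, while WriteAndCheck(17) already damages it. One way such an overwrite can happen in practice is a library writing more bytes into a caller-provided structure than the caller's headers reserved for it; whether that is what happened here is not established in this thread.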

Daniel Povey

Aug 2, 2017, 4:47:04 PM
to kaldi-help
I have never heard of this error. You need to give much more info.

Daniel Galvez

Aug 2, 2017, 5:06:58 PM
to kaldi-help
Provide the stack trace, at the very least.

Adam Zahran

Aug 2, 2017, 5:12:02 PM
to kaldi...@googlegroups.com
Hi,

Thanks for the reply. I will definitely post more debugging info and the stack trace tomorrow morning once I get hold of my development workstation again.

I'm just replying now to point out that this error only occurs when I link to MKL and use the -Wl,--start-group and -Wl,--end-group options.

Thank you for your support.


Adam Zahran

Aug 3, 2017, 4:34:51 AM
to kaldi-help
Hi,

Here's the output of valgrind (including a stack trace); I hope it helps.
==10385== Memcheck, a memory error detector
==10385== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==10385== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==10385== Command: ./reproducingGPUbug
==10385== 
==10385== Warning: noted but unhandled ioctl 0x30000001 with no size/direction hints.
==10385==    This could cause spurious value errors to appear.
==10385==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==10385== Warning: noted but unhandled ioctl 0x27 with no size/direction hints.
==10385==    This could cause spurious value errors to appear.
==10385==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==10385== Warning: noted but unhandled ioctl 0x7ff with no size/direction hints.
==10385==    This could cause spurious value errors to appear.
==10385==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==10385== Conditional jump or move depends on uninitialised value(s)
==10385==    at 0xD4E81CC: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD6BBE1E: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD5B3295: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD6BE1CC: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD6BE5BB: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD5B4820: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD58711C: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD5D493C: cuInit (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0x7805DD4: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==10385==    by 0x7805E30: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==10385==    by 0xBD6BA98: __pthread_once_slow (pthread_once.c:116)
==10385==    by 0x7838918: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==10385== 
==10385== Conditional jump or move depends on uninitialised value(s)
==10385==    at 0xD4E823C: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD6BBE1E: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD5B3295: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD6BE1CC: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD6BE5BB: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD5B4820: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD58711C: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD5D493C: cuInit (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0x7805DD4: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==10385==    by 0x7805E30: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==10385==    by 0xBD6BA98: __pthread_once_slow (pthread_once.c:116)
==10385==    by 0x7838918: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==10385== 
==10385== Conditional jump or move depends on uninitialised value(s)
==10385==    at 0xD4E7EEA: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD6BBE6C: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD5B3295: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD6BE1CC: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD6BE5BB: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD5B4820: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD58711C: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0xD5D493C: cuInit (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==10385==    by 0x7805DD4: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==10385==    by 0x7805E30: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==10385==    by 0xBD6BA98: __pthread_once_slow (pthread_once.c:116)
==10385==    by 0x7838918: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==10385== 
==10385== Warning: set address range perms: large range [0x1000000000, 0x2100000000) (noaccess)
==10385== Warning: set address range perms: large range [0x200000000, 0x400000000) (noaccess)
*** stack smashing detected ***: ./reproducingGPUbug terminated
==10385== 
==10385== Process terminating with default action of signal 6 (SIGABRT)
==10385==    at 0xCA54428: raise (raise.c:54)
==10385==    by 0xCA56029: abort (abort.c:89)
==10385==    by 0xCA967E9: __libc_message (libc_fatal.c:175)
==10385==    by 0xCB3811B: __fortify_fail (fortify_fail.c:37)
==10385==    by 0xCB380BF: __stack_chk_fail (stack_chk_fail.c:28)
==10385==    by 0x40ACA0: kaldi::CuDevice::IsComputeExclusive() (cu-device.cc:280)
==10385==    by 0x409EAC: kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (cu-device.cc:177)
==10385==    by 0x409247: main (main.cpp:11)
==10385== 
==10385== HEAP SUMMARY:
==10385==     in use at exit: 906,345 bytes in 5,704 blocks
==10385==   total heap usage: 21,080 allocs, 15,376 frees, 7,106,870 bytes allocated
==10385== 
==10385== LEAK SUMMARY:
==10385==    definitely lost: 0 bytes in 0 blocks
==10385==    indirectly lost: 0 bytes in 0 blocks
==10385==      possibly lost: 19,140 bytes in 350 blocks
==10385==    still reachable: 887,205 bytes in 5,354 blocks
==10385==         suppressed: 0 bytes in 0 blocks
==10385== Rerun with --leak-check=full to see details of leaked memory
==10385== 
==10385== For counts of detected and suppressed errors, rerun with: -v
==10385== Use --track-origins=yes to see where uninitialised values come from
==10385== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)
Killed

I really don't know what other debugging information I should post, so please let me know.

Daniel Povey

Aug 3, 2017, 3:12:49 PM
to kaldi-help
You seem to have a different version of Kaldi than the current master;
what line of code is cu-device.cc:280?

Adam Zahran

Aug 6, 2017, 6:37:01 AM
to kaldi-help, dpo...@gmail.com
Yes, it was a few months old. I rebuilt against the current Kaldi master, and here is the output of valgrind, including a stack trace:
==20751== Memcheck, a memory error detector
==20751== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==20751== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==20751== Command: ./reproducingGPUbug
==20751== 
==20751== Warning: noted but unhandled ioctl 0x30000001 with no size/direction hints.
==20751==    This could cause spurious value errors to appear.
==20751==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==20751== Warning: noted but unhandled ioctl 0x27 with no size/direction hints.
==20751==    This could cause spurious value errors to appear.
==20751==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==20751== Warning: noted but unhandled ioctl 0x7ff with no size/direction hints.
==20751==    This could cause spurious value errors to appear.
==20751==    See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper.
==20751== Conditional jump or move depends on uninitialised value(s)
==20751==    at 0xD4E81CC: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD6BBE1E: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD5B3295: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD6BE1CC: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD6BE5BB: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD5B4820: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD58711C: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD5D493C: cuInit (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0x7805DD4: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==20751==    by 0x7805E30: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==20751==    by 0xBD6BA98: __pthread_once_slow (pthread_once.c:116)
==20751==    by 0x7838918: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==20751== 
==20751== Conditional jump or move depends on uninitialised value(s)
==20751==    at 0xD4E823C: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD6BBE1E: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD5B3295: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD6BE1CC: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD6BE5BB: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD5B4820: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD58711C: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD5D493C: cuInit (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0x7805DD4: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==20751==    by 0x7805E30: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==20751==    by 0xBD6BA98: __pthread_once_slow (pthread_once.c:116)
==20751==    by 0x7838918: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==20751== 
==20751== Conditional jump or move depends on uninitialised value(s)
==20751==    at 0xD4E7EEA: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD6BBE6C: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD5B3295: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD6BE1CC: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD6BE5BB: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD5B4820: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD58711C: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0xD5D493C: cuInit (in /usr/lib/x86_64-linux-gnu/libcuda.so.375.66)
==20751==    by 0x7805DD4: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==20751==    by 0x7805E30: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==20751==    by 0xBD6BA98: __pthread_once_slow (pthread_once.c:116)
==20751==    by 0x7838918: ??? (in /usr/local/cuda-8.0/targets/x86_64-linux/lib/libcudart.so.8.0.44)
==20751== 
==20751== Warning: set address range perms: large range [0x1000000000, 0x2100000000) (noaccess)
==20751== Warning: set address range perms: large range [0x200000000, 0x400000000) (noaccess)
*** stack smashing detected ***: ./reproducingGPUbug terminated
==20751== 
==20751== Process terminating with default action of signal 6 (SIGABRT)
==20751==    at 0xCA54428: raise (raise.c:54)
==20751==    by 0xCA56029: abort (abort.c:89)
==20751==    by 0xCA967E9: __libc_message (libc_fatal.c:175)
==20751==    by 0xCB3811B: __fortify_fail (fortify_fail.c:37)
==20751==    by 0xCB380BF: __stack_chk_fail (stack_chk_fail.c:28)
==20751==    by 0x40AD7D: kaldi::CuDevice::IsComputeExclusive() (cu-device.cc:276)
==20751==    by 0x409FA7: kaldi::CuDevice::SelectGpuId(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) (cu-device.cc:177)
==20751==    by 0x409347: main (main.cpp:11)
==20751== 
==20751== HEAP SUMMARY:
==20751==     in use at exit: 906,505 bytes in 5,705 blocks
==20751==   total heap usage: 21,149 allocs, 15,444 frees, 7,121,674 bytes allocated
==20751== 
==20751== LEAK SUMMARY:
==20751==    definitely lost: 0 bytes in 0 blocks
==20751==    indirectly lost: 0 bytes in 0 blocks
==20751==      possibly lost: 19,124 bytes in 349 blocks
==20751==    still reachable: 887,381 bytes in 5,356 blocks
==20751==         suppressed: 0 bytes in 0 blocks
==20751== Rerun with --leak-check=full to see details of leaked memory
==20751== 
==20751== For counts of detected and suppressed errors, rerun with: -v
==20751== Use --track-origins=yes to see where uninitialised values come from
==20751== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 0 from 0)
Killed



Daniel Povey

Aug 6, 2017, 2:15:32 PM
to Adam Zahran, kaldi-help
I think gcc's stack-smashing detector is reacting to something that
the CUDA library is doing. Maybe some combination of gcc version with
CUDA library version and CUDA hardware that isn't good. Try adding
-fno-stack-protector to CXXFLAGS in kaldi.mk.
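
To check whether a stack-protector flag actually reached the compiler, one option is to look at the macros GCC predefines for each mode (__SSP__, __SSP_ALL__, __SSP_STRONG__, __SSP_EXPLICIT__). The following is a minimal sketch; StackProtectorMode is a hypothetical helper name and is not part of Kaldi.

```cpp
#include <string>

// Reports the stack-protector mode of the translation unit this function is
// compiled in, based on GCC's predefined macros.
std::string StackProtectorMode() {
#if defined(__SSP_ALL__)
  return "all";       // -fstack-protector-all
#elif defined(__SSP_STRONG__)
  return "strong";    // -fstack-protector-strong
#elif defined(__SSP_EXPLICIT__)
  return "explicit";  // -fstack-protector-explicit
#elif defined(__SSP__)
  return "basic";     // -fstack-protector
#else
  return "off";       // -fno-stack-protector, or the compiler's default is off
#endif
}
```

Compiled with the same CXXFLAGS as the rest of the build and printed from a small test program, this should return "off" once -fno-stack-protector is in effect. It only describes the file it is compiled in, so Kaldi's own object files still have to be rebuilt for the flag to apply to them.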

Adam Zahran

Aug 6, 2017, 4:32:29 PM
to kaldi-help, lorda...@gmail.com, dpo...@gmail.com
Hmm, the problem went away when I built against the new Kaldi with that flag added, but then I removed the flag and it still worked. I find that weird; it probably means I didn't link my code properly when I first pulled the latest Kaldi?

Daniel Povey

Aug 6, 2017, 4:33:41 PM
to Adam Zahran, kaldi-help
Maybe you never did 'make depend' so it was not being compiled properly.

Daniel Povey

Aug 6, 2017, 4:34:17 PM
to Adam Zahran, kaldi-help
Also any time you change the Makefile you need to do 'make clean',
otherwise the changes won't take effect, since Make does not track
dependencies on the Makefile itself.

Adam Zahran

Aug 8, 2017, 7:23:23 AM
to kaldi-help, lorda...@gmail.com, dpo...@gmail.com
For some reason the error just magically disappeared, even after I ran make clean, make depend, and make. I tried checking out different versions and it's still gone. Now I'm getting a segfault while CUDA is freeing memory, but I suppose that's an issue for another post.
Thank you. You've been really helpful. Great software indeed.