Crash during stack trace unwinding

Priyendra Deshwal

Mar 26, 2016, 4:44:29 AM
to gperftools
Hey guys,

I am looking for some help debugging a crash that I occasionally observe when running optimized (opt) binaries compiled with gcc under HEAPCHECK=strict.

Here are various details:

- This is the stack trace that I get. Since heap checker is active, it tries to invoke the NewHook. That hook tries to unwind the callstack and crashes in the process.

    @          0x210d062 GetStackTrace()
    @          0x21056b4 MallocHook_GetCallerStackTrace
    @          0x20f46a4 NewHook()
    @          0x2104f32 MallocHook::InvokeNewHookSlow()
    @          0x21127cf operator new[]()
    @     0x7ff3eb637d17 std::string::_Rep::_S_create()
    @          0x1fdc175 std::string::_S_construct<>()
    @     0x7ff3eb638d30 std::string::string()

- I am running *without libunwind* and hence the unwinding method that gperftools uses is to walk the frame pointer linked list.

- My binary is compiled with -fno-omit-frame-pointer -O3 with gcc 5.1.0

- One of the functions in the stack trace is std::string::_S_construct<>. I looked at the disassembly of that function, and this is what I found:

 1fdc140:  cmp    %rsi,%rdi
 1fdc143:  je     1fdc1b8 <_ZNSs12_S_constructIPKcEEPcT_S3_RKSaIcESt20forward_iterator_tag+0x78>
 1fdc145:  test   %rsi,%rsi
 1fdc148:  push   %r12
 1fdc14a:  push   %rbp
 1fdc14b:  mov    %rdi,%rbp
 1fdc14e:  push   %rbx
 1fdc14f:  mov    %rsi,%rbx
 1fdc152:  je     1fdc168 <_ZNSs12_S_constructIPKcEEPcT_S3_RKSaIcESt20forward_iterator_tag+0x28>
 1fdc154:  test   %rdi,%rdi
 1fdc157:  jne    1fdc168 <_ZNSs12_S_constructIPKcEEPcT_S3_RKSaIcESt20forward_iterator_tag+0x28>
 1fdc159:  mov    $0x2177460,%edi
 1fdc15e:  callq  411350 <_ZSt19__throw_logic_errorPKc@plt>
 1fdc163:  nopl   0x0(%rax,%rax,1)
 1fdc168:  sub    %rbp,%rbx
 1fdc16b:  xor    %esi,%esi
 1fdc16d:  mov    %rbx,%rdi
 1fdc170:  callq  4112d0 <_ZNSs4_Rep9_S_createEmmRKSaIcE@plt>  // next function in the call stack.

Note that the function pushes r12 and then the old frame pointer, but then, instead of the conventional mov %rsp,%rbp, it executes mov %rdi,%rbp. From that point on rbp holds an ordinary value (the first argument) rather than a frame address, so the linked list of stack frames established by the frame pointers is effectively broken at this frame.

Now if I look at the code of GetStackTrace from gperftools, I see the following (taken from stacktrace_x86_64-inl.h)

  while (sp && n < max_depth) {
    if (*(sp+1) == reinterpret_cast<void *>(0)) {
      // In 64-bit code, we often see a frame that
      // points to itself and has a return address of 0.
      break;
    }
  }

In this call stack, sp is no longer guaranteed to point into the stack. In fact, *(sp + 1) is not even guaranteed to be valid memory, hence the segfault shown above. The problem does not trigger deterministically, so I have been unable to investigate it in more detail inside gdb.
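
To make the failure mode concrete, here is a minimal sketch of a frame-pointer walk over a simulated stack. It is not gperftools code: WalkResult and WalkFrames are hypothetical names, and "addresses" are indices into a vector so that a bogus saved frame pointer can be detected instead of dereferenced.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simulated stack: each frame occupies two consecutive slots,
// [saved frame pointer, return address], as on x86-64 with frame pointers.
struct WalkResult {
  std::vector<std::uint64_t> return_addresses;
  bool left_stack = false;  // true if a saved frame pointer pointed outside
};

WalkResult WalkFrames(const std::vector<std::uint64_t>& stack,
                      std::uint64_t fp, std::size_t max_depth) {
  WalkResult out;
  while (out.return_addresses.size() < max_depth) {
    // A real unwinder would read *(sp + 1) here without this check.
    if (stack.size() < 2 || fp > stack.size() - 2) {
      out.left_stack = true;
      break;
    }
    std::uint64_t ret = stack[fp + 1];
    if (ret == 0) break;  // outermost frame: return address of 0
    out.return_addresses.push_back(ret);
    fp = stack[fp];  // follow the saved frame pointer to the caller
  }
  return out;
}
```

With a healthy chain such as {2, 0x100, 4, 0x200, 4, 0} the walk collects two return addresses and stops at the frame with a zero return address. If the saved frame pointer slot of the second frame holds an unrelated value (say 0x2177460, because rbp was used as a general-purpose register), the walk is sent outside the stack after the second frame, which is the point where the real unwinder would fault.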

- This never happens in debug mode and I have verified that the assembly for debug mode is properly setting the frame pointer.

So given all this information,

- Does my diagnosis for why the crash is happening seem reasonable?
- Is there some compiler setting that I am missing which will ensure that the frame pointer is not omitted for the function std::string::_S_construct<>? Most of my functions do have the usual frame pointer preamble.
- Any suggestions on how to prevent this segfault?

Regards,
-- Priyendra

Aliaksey Kandratsenka

Mar 26, 2016, 1:54:42 PM
to Priyendra Deshwal, gperftools
Hi.

In this code rbp is clearly not used as a frame pointer. I believe it is
most plausible that this function is not part of your executable, but
part of some library that isn't built with frame pointers.


Priyendra Deshwal

Mar 26, 2016, 2:08:26 PM
to gperftools
> Hi.
>
> In this code rbp is clearly not used as a frame pointer. I believe it is
> most plausible that this function is not part of your executable, but
> part of some library that isn't built with frame pointers.

Thanks for the response.

I can disassemble my executable with objdump and find this function in it, so this code is actually present in my executable, even though I compiled it with -fno-omit-frame-pointer.

Aliaksey Kandratsenka

Mar 26, 2016, 2:50:22 PM
to Priyendra Deshwal, gperftools
Interesting.

Given that rbp clearly isn't pointing to stack frame in this code,
this is either some mistake on your end (such as bug in makefiles) or
compiler bug.

I'd suggest double checking everything on your end and then filing the
bug with gcc.

One test worth trying is building everything with clang or with other
version of gcc.

gperftools' code for frame pointer unwinding could be made more robust
by always checking if stack pointer is valid. Current code seems to be
doing those checks only in subset of cases. But it would slow
stacktrace capturing down.



Priyendra Deshwal

Mar 26, 2016, 4:48:37 PM
to Aliaksey Kandratsenka, gperftools
Thanks for the help. I was able to get some evidence that this is a compiler bug. I tried with an older compiler version and all the functions on the call stack seemed to properly deal with rbp.

We are using gcc 5.1.0, but we configure it with _GLIBCXX_USE_CXX11_ABI=0, which is possibly a non-standard configuration and may have some bugs in it. I will try to follow up with the gcc folks.

In the meantime, what are good ways to deal with the problem?

- One solution that comes to mind is to "ignore" SIGSEGVs that happen during stack unwinding. We could have a thread local boolean which is set to true when GetStackTrace is invoked and the signal handler would ignore the SIGSEGV if that boolean is set to true. However, GetStackTrace may be invoked in a variety of contexts and it would be useful if you could offer some advice if this seems reasonable.

- Make the unwinding process more robust. There are already some pretty nice checks in there: new_sp should be at a higher address than old_sp (the stack grows downward), should not be too far away from it, etc. My understanding is that it is these checks that prevent my program from crashing very often. You mentioned that the unwinding could be made even more robust, but I do not have any immediate ideas on how to go about doing that.

- Switch to libunwind. Interestingly, we used to use libunwind initially but we ran into a bunch of issues - mostly deadlocks and we switched to frame pointer and have been happy for the past two years. Switching back is an option - but I would like to get some sense of the status of the gperftools+libunwind bugs currently. From release notes, it seems a bunch of issues were fixed - but want to get the latest on that.
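
For reference, the kind of sanity checks mentioned in the second option can be sketched as below. This is an illustration only: PlausibleNextFrame is a hypothetical helper and the 100000-byte limit is an assumed threshold, not a value taken from gperftools. On x86-64 the stack grows downward, so a caller's frame sits at a higher address than its callee's.

```cpp
#include <cstdint>

// Hypothetical helper: decide whether new_sp is a believable caller frame
// for the frame at old_sp. Thresholds are illustrative assumptions.
bool PlausibleNextFrame(std::uintptr_t old_sp, std::uintptr_t new_sp,
                        std::uintptr_t max_frame_bytes = 100000) {
  if (new_sp <= old_sp) return false;                   // caller must be higher up
  if (new_sp - old_sp > max_frame_bytes) return false;  // implausibly large frame
  if (new_sp % sizeof(void*) != 0) return false;        // must be pointer-aligned
  return true;
}
```

Checks like these reject most garbage values, but a value that happens to be aligned, slightly above old_sp and unmapped still passes them, which would explain why the crash is rare rather than impossible.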

Aliaksey Kandratsenka

Apr 9, 2016, 3:16:50 PM
to Priyendra Deshwal, gperftools
Hi. I've captured current understanding of backtracing state in wiki
page: https://github.com/gperftools/gperftools/wiki/gperftools'-stacktrace-capturing-methods-and-their-issues

Overall, we've made some fixes, but stack trace capturing from
profiling signal is still a big problem.



Priyendra Deshwal

Apr 9, 2016, 5:11:57 PM
to Aliaksey Kandratsenka, gperftools
Thanks for the great summary of the current status.

In my case, we worked around the problem with a thread-local flag that is set to true just before attempting to capture a stack trace. If capturing the stack trace causes a segfault, my signal handler is invoked; if the thread-local flag is true, the handler ignores the segfault, longjmps back to a safe place in the code, and an empty stack trace is returned. This has worked fine so far and we are not seeing spurious crashes any more.
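
A minimal sketch of that workaround, assuming POSIX sigsetjmp/siglongjmp. All names (g_in_unwind, GuardedCapture, and so on) are hypothetical and not part of gperftools, and the two capture functions are stand-ins: FaultingCapture simulates the fault with raise(SIGSEGV) rather than a real bad read.

```cpp
#include <setjmp.h>
#include <signal.h>

// Hypothetical names throughout; none of this is gperftools API.
static thread_local volatile sig_atomic_t g_in_unwind = 0;
static thread_local sigjmp_buf g_unwind_jmp;

static void SegvHandler(int sig) {
  if (g_in_unwind) {
    g_in_unwind = 0;
    siglongjmp(g_unwind_jmp, 1);  // abandon the unwind, resume in GuardedCapture
  }
  // Not our fault to swallow: restore the default action and re-raise.
  signal(sig, SIG_DFL);
  raise(sig);
}

bool InstallSegvHandler() {
  struct sigaction sa = {};
  sa.sa_handler = SegvHandler;
  sigemptyset(&sa.sa_mask);
  return sigaction(SIGSEGV, &sa, nullptr) == 0;
}

using CaptureFn = int (*)(void** result, int max_depth);

int GuardedCapture(CaptureFn capture, void** result, int max_depth) {
  // Second argument 1: save the signal mask so siglongjmp restores it.
  if (sigsetjmp(g_unwind_jmp, 1) != 0) {
    return 0;  // the capture faulted: report an empty stack trace
  }
  g_in_unwind = 1;
  int depth = capture(result, max_depth);
  g_in_unwind = 0;
  return depth;
}

// Stand-ins for the demo: one capture that "faults", one that succeeds.
int FaultingCapture(void**, int) { raise(SIGSEGV); return -1; }
int WorkingCapture(void** result, int max_depth) {
  if (max_depth < 1) return 0;
  result[0] = nullptr;
  return 1;
}
```

Two caveats apply to this sketch: jumping out of the handler skips any cleanup the interrupted code would have done (so the guarded region should not take locks), and access to thread-local variables from a signal handler is not formally guaranteed to be async-signal-safe, although it works in practice for variables defined in the main executable.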

chen...@gmail.com

Jan 4, 2017, 10:18:46 PM
to gperftools
I think we just need to obtain the top of the current thread's stack and add a bounds check so that the unwinder never reads beyond it.
I've implemented this in my private codebase, and it works well.
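
One way such a bounds check could look on Linux with glibc is sketched below. This is an assumption about the approach, not the code from that private codebase: StackBounds, GetCurrentThreadStackBounds and FrameInBounds are hypothetical names, and pthread_getattr_np is a non-portable glibc extension.

```cpp
#include <pthread.h>
#include <cstddef>
#include <cstdint>

struct StackBounds {
  std::uintptr_t low = 0;   // lowest valid stack address
  std::uintptr_t high = 0;  // one past the highest valid stack address
};

// Query the calling thread's stack range (glibc-specific).
bool GetCurrentThreadStackBounds(StackBounds* out) {
  pthread_attr_t attr;
  if (pthread_getattr_np(pthread_self(), &attr) != 0) return false;
  void* addr = nullptr;
  std::size_t size = 0;
  int rc = pthread_attr_getstack(&attr, &addr, &size);
  pthread_attr_destroy(&attr);
  if (rc != 0) return false;
  out->low = reinterpret_cast<std::uintptr_t>(addr);
  out->high = out->low + size;
  return true;
}

// True if both sp[0] (saved frame pointer) and sp[1] (return address)
// lie inside the thread's stack.
bool FrameInBounds(const StackBounds& b, const void* sp) {
  auto p = reinterpret_cast<std::uintptr_t>(sp);
  return p >= b.low && p < b.high && b.high - p >= 2 * sizeof(void*);
}
```

Because pthread_getattr_np may allocate memory and read /proc, the bounds would have to be computed once per thread and cached, outside of any malloc hook, rather than queried on every unwind.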


chen...@gmail.com

Jun 20, 2018, 6:39:22 AM
to gperftools
See https://github.com/gperftools/gperftools/pull/865
That change validates a stack address only when it falls on a different memory page than the previously validated one, which should eliminate most of the cost.

Hope this problem can be resolved ASAP.
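
The idea of validating only on a page change can be illustrated as follows. This is a sketch of the technique as described here, not the code from the pull request: PageCachedValidator is a hypothetical class, the probe is an assumed callback that reports whether a page is readable, and CountingProbe is a stand-in that only counts how often it is called.

```cpp
#include <cstdint>

// Caches the result of an expensive readability probe per memory page.
// Assumes page_size is a power of two.
class PageCachedValidator {
 public:
  using ProbeFn = bool (*)(std::uintptr_t page_start);

  explicit PageCachedValidator(ProbeFn probe, std::uintptr_t page_size = 4096)
      : probe_(probe), page_size_(page_size) {}

  bool IsReadable(std::uintptr_t addr) {
    std::uintptr_t page = addr & ~(page_size_ - 1);
    if (!have_page_ || page != last_page_) {
      // Only pay for the probe when the address moves to another page.
      last_page_ = page;
      last_ok_ = probe_(page);
      have_page_ = true;
    }
    return last_ok_;
  }

 private:
  ProbeFn probe_;
  std::uintptr_t page_size_;
  std::uintptr_t last_page_ = 0;
  bool last_ok_ = false;
  bool have_page_ = false;
};

// Stand-in probe for the demo: accepts every page and counts its calls.
static int g_probe_calls = 0;
bool CountingProbe(std::uintptr_t) {
  ++g_probe_calls;
  return true;
}
```

Consecutive frames usually sit on the same page, so most lookups hit the cache. A caller would need to check both sp and sp + 1, since the two words of a frame can straddle a page boundary.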
