sigsetjmp/siglongjmp trouble with new musl


Waldek Kozaczuk

Oct 1, 2020, 11:41:13 AM
to OSv Development
I have been trying to implement the missing support for sigsetjmp/siglongjmp on the aarch64 architecture, which (it looks) is critical to getting boost-based unit tests running. To that end, I changed both the stubs of these functions in libc/arch/aarch64/setjmp/ and the real implementations in libc/arch/x64/setjmp/ to point to the new musl 1.1.24 code, which supports aarch64 as well (again, one more reason why I did the musl upgrade).

After this change, and after also implementing the functionality in arch/aarch64/feexcept.cc (taken from newlib), the boost unit tests work on aarch64.

However, the problem is that two unit tests crash on x64:
  • tst-feexcept.cc

  • tst-sigaltstack.so

More specifically, all assertions actually pass but the test apps crash afterwards due to a zero-address page fault:

./scripts/run.py -e '/tests/tst-sigaltstack.so' -c 1
OSv v0.55.0-107-g81539853
eth0: 192.168.122.15
Booted up in 169.13 ms
Cmdline: /tests/tst-sigaltstack.so
SUMMARY: 6 tests, 0 failures
page fault outside application, addr: 0x0000000000000000
[registers]
RIP: 0x0000000040428884 <osv::application::run_main()+68>
RFL: 0x0000000000010202  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x0000000000000000  RBX: 0x0000000000000000  RCX: 0x0000000000000001  RDX: 0x0000000000000000
RSI: 0x00001000000070ac  RDI: 0x0000200000700e80  RBP: 0x0000200000700f90  R8:  0xfffffffffffff958
R9:  0x000000004091bb58  R10: 0x0000000000000050  R11: 0xffffa0007ff77400  R12: 0x0000000000000001
R13: 0x000000004091bb58  R14: 0x0000000000000050  R15: 0xffffa0007ff77400  RSP: 0x0000200000700f60
Aborted

[backtrace]
0x000000004020556e <???+1075860846>
0x000000004039268f <page_fault+143>
0x0000000040391546 <???+1077482822>
0x0000000040428a78 <???+1078102648>
0x000000004045a1c5 <???+1078305221>
0x00000000403f7a59 <thread_main_c+41>
0x00000000403924c2 <???+1077486786>

The gdb backtrace looks like this:
(gdb) bt
#0  0x0000000040398e92 in processor::cli_hlt () at arch/x64/processor.hh:247
#1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
#2  osv::halt () at arch/x64/power.cc:26
#3  0x0000000040238530 in abort (fmt=fmt@entry=0x4064451f "Aborted\n") at runtime.cc:132
#4  0x0000000040202ea6 in abort () at runtime.cc:98
#5  0x000000004020556f in mmu::vm_sigsegv (ef=0xffff800000ef9068, addr=0) at core/mmu.cc:1314
#6  mmu::vm_sigsegv (ef=0xffff800000ef9068, addr=0) at core/mmu.cc:1308
#7  mmu::vm_fault (addr=0, addr@entry=88, ef=ef@entry=0xffff800000ef9068) at core/mmu.cc:1336
#8  0x0000000040392690 in page_fault (ef=0xffff800000ef9068) at arch/x64/mmu.cc:42
#9  <signal handler called>
#10 0x0000000040428884 in osv::application::run_main (this=0x0) at /usr/include/c++/10/bits/stl_vector.h:918
#11 0x0000000040428a79 in operator() (app=<optimized out>, __closure=0x0) at core/app.cc:233
#12 _FUN () at core/app.cc:235
#13 0x000000004045a1c6 in operator() (__closure=0xffffa00000cf6800) at libc/pthread.cc:115
#14 std::__invoke_impl<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#15 std::__invoke_r<void, pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()>&> (__fn=...) at /usr/include/c++/10/bits/invoke.h:153
#16 std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const std::_Any_data &) (
    __functor=...) at /usr/include/c++/10/bits/std_function.h:291
#17 0x00000000403f7a5a in sched::thread::main (this=0xffff800000ef4040) at core/sched.cc:1210
#18 sched::thread_main_c (t=0xffff800000ef4040) at arch/x64/arch-switch.hh:321
#19 0x00000000403924c3 in thread_main () at arch/x64/entry.S:113

I think the this=0x0 in frame 10 is the culprit. Did something write over this memory? BTW the crash for tst-feexcept.cc looks pretty much identical.

Now, in the Makefile I have replaced two files from libc with their musl counterparts and added a 3rd musl file.
diff --git a/Makefile b/Makefile
index 9d2997a1..f7d740b0 100644
--- a/Makefile
+++ b/Makefile
@@ -1398,8 +1398,12 @@ musl += process/wait.o
 
 musl += setjmp/$(musl_arch)/setjmp.o
 musl += setjmp/$(musl_arch)/longjmp.o
-libc += arch/$(arch)/setjmp/siglongjmp.o
-libc += arch/$(arch)/setjmp/sigsetjmp.o
+#libc += arch/$(arch)/setjmp/siglongjmp.o
+#libc += arch/$(arch)/setjmp/sigsetjmp.o
+musl += signal/$(musl_arch)/sigsetjmp.o
+musl += signal/siglongjmp.o
+musl += signal/sigsetjmp_tail.o
+$(out)/musl/src/signal/sigsetjmp_tail.o: CFLAGS += --include libc/syscall_to_function.h
 libc += arch/$(arch)/setjmp/block.o
 ifeq ($(arch),x64)
 libc += arch/$(arch)/ucontext/getcontext.o

Now here is how the 2 files look in libc:
cat libc/arch/x64/setjmp/siglongjmp.c
#include <setjmp.h>

extern void __restore_sigs(void *set);

_Noreturn void siglongjmp(sigjmp_buf buf, int ret)
{
	if (buf->__fl) __restore_sigs(buf->__ss);
	longjmp(buf, ret);
}
cat libc/arch/x64/setjmp/sigsetjmp.s 
/* Copyright 2011-2012 Nicholas J. Kain, licensed under standard MIT license */
.global sigsetjmp
.global __sigsetjmp
.type sigsetjmp,@function
.type __sigsetjmp,@function
__sigsetjmp:
sigsetjmp:
andl %esi,%esi
movq %rsi,64(%rdi)
jz 1f
pushq %rdi
leaq 72(%rdi),%rdx
xorl %esi,%esi
movl $2,%edi
call sigprocmask
popq %rdi
1: jmp setjmp


And here is how the 2 changed files and 1 extra new file look in musl:
cat musl/src/signal/siglongjmp.c 
#include <setjmp.h>
#include <signal.h>
#include "syscall.h"
#include "pthread_impl.h"

_Noreturn void siglongjmp(sigjmp_buf buf, int ret)
{
	longjmp(buf, ret);
}
cat musl/src/signal/x86_64/sigsetjmp.s 
.global sigsetjmp
.global __sigsetjmp
.type sigsetjmp,@function
.type __sigsetjmp,@function
sigsetjmp:
__sigsetjmp:
test %esi,%esi
jz 1f

popq 64(%rdi)
mov %rbx,72+8(%rdi)
mov %rdi,%rbx

call setjmp@PLT

pushq 64(%rbx)
mov %rbx,%rdi
mov %eax,%esi
mov 72+8(%rbx),%rbx

.hidden __sigsetjmp_tail
jmp __sigsetjmp_tail

1: jmp setjmp@PLT
cat musl/src/signal/sigsetjmp_tail.c 
#include <setjmp.h>
#include <signal.h>
#include "syscall.h"

hidden int __sigsetjmp_tail(sigjmp_buf jb, int ret)
{
	void *p = jb->__ss;
	__syscall(SYS_rt_sigprocmask, SIG_SETMASK, ret?p:0, ret?0:p, _NSIG/8);
	return ret;
}

Please note that we are actually replacing __syscall() in the last file with a local function call to sigprocmask() using the --include libc/syscall_to_function.h trick.

As one can tell, the implementation of sigsetjmp has changed in musl and that is causing some issues on our side. Here is the link to the musl commit that changed it - https://git.musl-libc.org/cgit/musl/commit/?id=583e55122e767b1586286a0d9c35e2a4027998ab (redesign sigsetjmp so that signal mask is restored after longjmp).

I am not sure I fully understand the change, but it looks like the return address is stored in a different place. Maybe the fact that we use a thread-local variable for env causes some issues? Why are we using thread-local variables in those 2 tests?

Now, these two tests also fail on aarch64, but for a different reason - we do not have TLS implemented on aarch64. I guess that if we did, those tests would fail in a similar way.

Unfortunately, I do not understand signal handling on OSv and its limitations, so I am not sure whether the changes to sigsetjmp on the musl side are actually helpful to us and we should upgrade to the newest version. Or maybe it would be better to keep the x64 version as-is and only use (copy) the pre-583e55122e767b1586286a0d9c35e2a4027998ab version of the aarch64 sigsetjmp from musl instead, which would be sad.

Any ideas?

Waldek

Nadav Har'El

Oct 1, 2020, 2:56:09 PM
to Waldek Kozaczuk, OSv Development
It definitely looks bad. It's as if we took a nullptr "app" and called app->run_main().

The code that called it is probably application::start(), which calls

    auto err = pthread_create(&_thread, NULL, [](void *app) -> void* {
        ((application*)app)->main();
        return nullptr;
    }, this);

which means that application->start() was called with a nullptr application.

We have in app.cc a bunch of code that looks like
    auto app = std::make_shared<application>(command, args, new_program, env,
                                             main_function_name, post_main);
    app->start();

Could it be that make_shared failed the allocation and returned nullptr? You can add an assert in this code - or even in malloc - to fail immediately on allocation failure.

All that being said, I don't understand why this happened *after* the test ran. Are we running another program after the test?

Finally, the problem is very likely to have something to do with sigsetjmp, because this is what you said below that you changed, and
these two specific tests use it. However, I have no idea why a broken setjmp caused what you saw here.
 
Did something write over this memory? BTW the crash for tst-feexcept.cc looks pretty much identical.

Now, I have replaced two files in Makefile from libc to musl and added 3rd musl file.
diff --git a/Makefile b/Makefile
index 9d2997a1..f7d740b0 100644
--- a/Makefile
+++ b/Makefile
@@ -1398,8 +1398,12 @@ musl += process/wait.o
 
 musl += setjmp/$(musl_arch)/setjmp.o
 musl += setjmp/$(musl_arch)/longjmp.o
-libc += arch/$(arch)/setjmp/siglongjmp.o
-libc += arch/$(arch)/setjmp/sigsetjmp.o
+#libc += arch/$(arch)/setjmp/siglongjmp.o
+#libc += arch/$(arch)/setjmp/sigsetjmp.o
+musl += signal/$(musl_arch)/sigsetjmp.o

I looked at the difference between the libc and musl implementation of sigsetjmp.s, and there are extensive differences I don't understand.
One of them is a difference between a "call" and a "jmp". One of them is the use of "@PLT" in the code.
Maybe that's an ABI (stack alignment) problem? Maybe linker?
There's also a new bizarre __sigsetjmp_tail which I don't understand.

Can you try switching those different functions one by one and perhaps finding which one is causing the problem?
This commit has a very good explanation. It's just that it will take a lot of effort to understand it :-( From the few minutes I spent on it, I didn't understand :-(


I am not sure if I fully understand the change but it looks the return address is stored in a different place. Maybe because we use thread local variable for env causes some issues? Why are using thread local variables in those 2 tests?

Interesting guess, but you can easily verify it by making it non-thread-local.
I don't know why I made this variable thread-local - I think it should have been "static".
"thread-local" can allow running this test multiple times concurrently, but I don't see why anyone would want to do that...

I think you can get rid of the thread-local and also have a chance to run it on aarch64 then.


Now, these two tests also fail on aarch64 but for a different reason - we do not have TLS implemented on aarch64 but I guess if we did those tests would fail in similar way.

Unfortunately, I do not understand signal handling on OSv and its limitations so I am not sure if the changes to sigsetjmp on musl side are actually helpful to us and we should upgrade to the newest version. Or maybe it would be better to keep the x64 version of it as is and only use (copy) the pre-583e55122e767b1586286a0d9c35e2a4027998ab-commit version of aarch64 sigsetjmp from musl instead which would be sad.

Any ideas?

Waldek


Waldek Kozaczuk

Oct 5, 2020, 2:47:48 PM
to OSv Development
I have narrowed down tst-feexcept.cc to only the following two assertions (the remaining code in main is commented out) and have also changed env to be a regular variable (see below):

static sigjmp_buf env;

...

int main(int argc, char **argv)
{
    // Test that fegetexcept() does not return a negative number
    expect(fegetexcept() >= 0, true);

    // Test that *integer* division by zero generates, ironically, a SIGFPE
    expect(sig_check([] { printf("%d\n", 1 / zero_i()); }, SIGFPE), true);

    std::cout << "SUMMARY: " << tests << " tests, " << fails << " failures\n";
    return fails == 0 ? 0 : 1;
}

And the test still crashes (and the 2 assertions pass). Obviously, if I keep only the 1st assertion, it does not crash. So it is something to do with sig_check().

Now, I have discovered that if I comment out the invocation of f() in sig_check, the test does not crash, but the 2nd assertion obviously fails.

template<typename Func>
bool sig_check(Func f, int sig) {
    // save_except works around a bug in Linux glibc
    // longjmp resets the FPU exception mask. So if we want it to  survive
    // the longjmp in this test, we unfortunately need to save it ourselves
    int save_except = fegetexcept();
    if (sigsetjmp(env, 1)) {
        // got the signal (and automatically restored handler)
        fedisableexcept(FE_ALL_EXCEPT);
        feenableexcept(save_except);
        return true;
    }
    struct sigaction sa;
    sa.sa_flags = SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = [] (int) {
        siglongjmp(env, 1);
    };
    assert(sigaction(sig, &sa, NULL) == 0);
    //f();
    sa.sa_handler = SIG_DFL;
    assert(sigaction(sig, &sa, NULL) == 0);
    fedisableexcept(FE_ALL_EXCEPT);
    feenableexcept(save_except);
    return false;
}

So maybe something is wrong with the sa_handler - siglongjmp(env, 1); - which, when called for the division-by-zero exception, possibly wrote over the memory.

I wonder if we have a bug in our sigprocmask (libc/signal.cc), called by sigsetjmp(), that then somehow affects siglongjmp? Could it be that, because of some changes in sigsetjmp, some destructors are called too early and the memory is erased because of it (if that makes any sense)?
Getting rid of thread_local for env and making it static does not help on x64 - still the exact same crash (which is good news in a sense :-). Interestingly, the same test does not crash on aarch64 now (no thread_local, which is not supported yet), but some assertions fail.

Nadav Har'El

Oct 5, 2020, 4:35:49 PM
to Waldek Kozaczuk, OSv Development
On Mon, Oct 5, 2020 at 9:48 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:

On Thursday, October 1, 2020 at 2:56:09 PM UTC-4 Nadav Har'El wrote:

I looked at the difference between the libc and musl implementation of sigsetjmp.s, and there are extensive differences I don't understand.
One of them is a difference between a "call" and a "jmp". One of them is the use of "@PLT" in the code.
Maybe that's an ABI (stack alignment) problem? Maybe linker?
There's also a new bizarre __sigsetjmp_tail which I don't understand.

Can you try switching those different functions one by one and perhaps finding which one is causing the problem?
I have narrowed down the tst-feexcept.cc to have only the following two assertions on (remaining code after in main is commented out) and also have changed env to be regular variable (see below) :

static sigjmp_buf env;

I thought about it; thread_local is weird, but should have also worked (and apparently doesn't work exactly the same). The reason is that for synchronous signals (SIGFPE, SIGSEGV - things that happen because of bad instructions, not a signal from another thread), OSv runs the handler in the same thread.


...

int main(int argc, char **argv)
{
    // Test that fegetexcept() does not return a negative number
    expect(fegetexcept() >= 0, true);

I guess you can drop this from the test too, and it will fail just the same.


    // Test that *integer* division by zero generates, ironically, a SIGFPE
    expect(sig_check([] { printf("%d\n", 1 / zero_i()); }, SIGFPE), true);

    std::cout << "SUMMARY: " << tests << " tests, " << fails << " failures\n";
    return fails == 0 ? 0 : 1;
}

And the test still crashes (and 2 assertions pass). Obviously, if I keep the 1st assertion only, it does not crash. So something to do with sig_check().

Now, I have discovered that if I comment out the invocation of f() in sig_check the test does not crash

The invocation of f() is what causes siglongjmp() (in the signal handler) to be called.
You can add a printout in the signal handler before siglongjmp() and see if it was reached, and then add a printout inside the if (sigsetjmp(env, 1)) { - if that printout isn't shown, the siglongjmp didn't work.
On a wild hunch, can you please try to compile this test without optimization (-O0)? Maybe it changes something (as in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56982). But I guess it's not that, it's something else :-(

What do we actually gain from taking musl's new code rather than keeping the old code, which worked well?

I think you can get rid of the thread-local and also have a chance to run it on aarch64 then.
Getting rid of thread_local for env and making it static does not help on x64 - still exact same crash (which is good news in sense:-). Interestingly the same test does not crash on aarch64 now (no thread_local which is not supported yet) but some assertions fail. 

Yes - it shouldn't be thread_local but for synchronous signals (like SIGFPE and SIGSEGV) it doesn't matter - the handler runs in the same thread.

Waldek Kozaczuk

Oct 6, 2020, 12:22:05 PM
to OSv Development
On Monday, October 5, 2020 at 4:35:49 PM UTC-4 Nadav Har'El wrote:
On Mon, Oct 5, 2020 at 9:48 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:

On Thursday, October 1, 2020 at 2:56:09 PM UTC-4 Nadav Har'El wrote:

I looked at the difference between the libc and musl implementation of sigsetjmp.s, and there are extensive differences I don't understand.
One of them is a difference between a "call" and a "jmp". One of them is the use of "@PLT" in the code.
Maybe that's an ABI (stack alignment) problem? Maybe linker?
There's also a new bizarre __sigsetjmp_tail which I don't understand.

Can you try switching those different functions one by one and perhaps finding which one is causing the problem?
I have narrowed down the tst-feexcept.cc to have only the following two assertions on (remaining code after in main is commented out) and also have changed env to be regular variable (see below) :

static sigjmp_buf env;

I thought about it; thread_local is weird, but should have also worked (and apparently doesn't work exactly the same). The reason is that for synchronous signals (SIGFPE, SIGSEGV - things that happen because of bad instructions, not a signal from another thread), OSv runs the handler in the same thread.


...

int main(int argc, char **argv)
{
    // Test that fegetexcept() does not return a negative number
    expect(fegetexcept() >= 0, true);

I guess you can drop this from the test too, and it will fail just the same.


    // Test that *integer* division by zero generates, ironically, a SIGFPE
    expect(sig_check([] { printf("%d\n", 1 / zero_i()); }, SIGFPE), true);

    std::cout << "SUMMARY: " << tests << " tests, " << fails << " failures\n";
    return fails == 0 ? 0 : 1;
}

And the test still crashes (and 2 assertions pass). Obviously, if I keep the 1st assertion only, it does not crash. So something to do with sig_check().

Now, I have discovered that if I comment out the invocation of f() in sig_check the test does not crash

The invocation of f() is what causes siglongjmp() (in the signal handler) to be called.
You can add a printout in the signal handler before siglongjmp() and see that if it was reached, and then add a printout in if (sigsetjmp(env, 1)) { (inside the if) - if that printout isn't shown, the siglongjmp didn't work.
The printouts in this "if" do show up, so it is all very weird. It is as if siglongjmp() somehow ran one more time than necessary and re-ran the same code. Or maybe the new sigsetjmp() is broken in a weird way where it jumps to continue in sig_check(), and once main() of this unit test completes, control returns to the same original sigsetjmp() invocation (somewhere in that assembly code) where most of the variables, like the original app, are already destroyed. I do not know what to make of it.
From a strictly technical perspective, we do not know until we understand that musl commit message :-)
 
From a non-technical perspective, the main motivation is to get setjmp/longjmp, along with their signal friends, working on aarch64. The best and most logical way to accomplish that would be to point to the newest 1.1.24 code in musl for aarch64 and, at the same time, for x64. That way we would get rid of even more OSv-versioned/old-musl copies of the code under libc/. If the new musl code does not work on x64, we can keep the old musl copies of siglongjmp.c and sigsetjmp.s under libc (siglongjmp.c is actually the same for both x64 and aarch64), add a copy of the pre-583e55122e767b1586286a0d9c35e2a4027998ab sigsetjmp.s for aarch64 under libc/, and be done with it. This unfortunately means we would have a mix of both new (1.1.24) and old musl code. But how often will that code change in musl, and how significantly would it affect us?

At this moment we only use setjmp.s and longjmp.s from musl as-is for both x64 and aarch64, and we use the following signal-related code from musl:

musl += signal/killpg.o
musl += signal/siginterrupt.o
musl += signal/sigrtmin.o
musl += signal/sigrtmax.o
musl += string/strsignal.o 

What is the likelihood that, once we upgrade to a newer musl (1.2.1 for example), we would have some logical conflicts between the as-is musl code and a "frozen-in-time" implementation of sigsetjmp/siglongjmp?


I think you can get rid of the thread-local and also have a chance to run it on aarch64 then.
Getting rid of thread_local for env and making it static does not help on x64 - still exact same crash (which is good news in sense:-). Interestingly the same test does not crash on aarch64 now (no thread_local which is not supported yet) but some assertions fail. 

Yes - it shouldn't be thread_local but for synchronous signals (like SIGFPE and SIGSEGV) it doesn't matter - the handler runs in the same thread.
It looks like you think we can change both tst-feexcept.cc and tst-sigaltstack.cc to replace the thread-local env variable with a regular static one. That will help with testing on aarch64 until we get TLS working there (more about that in a separate email). I will send a separate patch.