AArch64 debug build woes

Waldek Kozaczuk

Feb 11, 2021, 12:42:23 AM
to OSv Development
Apart from the TLS issue reported here, OSv can be built in aarch64 debug mode.

Some of the tests pass as well (as in release), but some fail in a similar way, possibly because of wrongly compiled kernel code, which in turn may be due to -O0.

Here is one example:

./scripts/run.py -e '/tests/tst-bsd-tcp1-zsnd.so' -c 1

page fault outside application, addr: 0x0000000000000000

[registers]

PC: 0x0000000040111e40 <zcopy_tx+84>

X00: 0x0000000000000001 X01: 0xffffa0004100f9c0 X02: 0x0000000000000008

X03: 0x0000000000000008 X04: 0x0000000000000008 X05: 0x0000000000007001

X06: 0x0000000000000000 X07: 0x00000000b71b0000 X08: 0xffff800041782aa0

X09: 0x0000000000000000 X10: 0x0000000000000002 X11: 0x0000000000000000

X12: 0x2050435420612073 X13: 0x006567617373656d X14: 0x0000000000001af8

X15: 0x0000000000000000 X16: 0x000010000005b5d0 X17: 0x0000000040111dec

X18: 0x0000000000001120 X19: 0xffffa0004100f9c0 X20: 0x0000000000000190

X21: 0x0000000000000001 X22: 0xffff800041782db8 X23: 0x0000000000000001

X24: 0xffffa000414c4b80 X25: 0xffff800041793d98 X26: 0xffff800041793da8

X27: 0x00002000006ffb00 X28: 0x000010000005a000 X29: 0xffff800041782c10

X30: 0x0000000040111e34 SP:  0xffff800041782c10 ESR: 0x0000000096000046

PSTATE: 0x0000000060000345

Aborted


[backtrace]

0x00000000400e9e14 <abort(char const*, ...)+288>
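
The ESR value in the dump above can be decoded by hand. What follows is a minimal sketch in C++, assuming the usual ESR_EL1 field layout (EC in bits 31:26, IL in bit 25 and, for data aborts, WnR in bit 6 and DFSC in bits 5:0). The names `esr_fields` and `decode_esr` are made up for illustration and are not OSv code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of OSv: splits an ESR_EL1 value into the
// fields that matter when the exception is a data abort.
struct esr_fields {
    unsigned ec;    // exception class, bits 31:26
    bool     il;    // instruction length bit, bit 25
    bool     wnr;   // write-not-read, bit 6 (data aborts only)
    unsigned dfsc;  // data fault status code, bits 5:0
};

constexpr esr_fields decode_esr(std::uint64_t esr)
{
    return esr_fields{
        static_cast<unsigned>((esr >> 26) & 0x3f),
        ((esr >> 25) & 1) != 0,
        ((esr >> 6) & 1) != 0,
        static_cast<unsigned>(esr & 0x3f),
    };
}

// Value copied from the register dump above.
constexpr esr_fields dumped = decode_esr(0x0000000096000046ULL);
static_assert(dumped.ec == 0x25, "data abort without a change of exception level");
static_assert(dumped.wnr, "the faulting access was a write");
static_assert(dumped.dfsc == 0x06, "translation fault, level 2");
```

Under those assumptions the fault is a write to an unmapped address taken at the current exception level, which is consistent with the reported fault address of 0x0.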


After connecting with gdb and reconstructing the stacktrace, it looks like this:

#0  0x0000000040111e40 in zcopy_tx (s=5, zm=0x1) at bsd/sys/kern/uipc_syscalls.cc:1027

#1  0x0000100000037954 in test_bsd_tcp1::tcp_server (this=0x2000006ff988) at /home/wkozaczuk/projects/osv/tests/tst-bsd-tcp1-zsnd.cc:114

#2  0x0000100000037a64 in test_bsd_tcp1::run()::{lambda()#1}::operator()() const (__closure=<optimized out>) at /home/wkozaczuk/projects/osv/tests/tst-bsd-tcp1-zsnd.cc:229

#3  std::__invoke_impl<void, test_bsd_tcp1::run()::{lambda()#1}&>(std::__invoke_other, test_bsd_tcp1::run()::{lambda()#1}&) (__f=...) at /usr/include/c++/10/bits/invoke.h:60

#4  std::__invoke_r<void, test_bsd_tcp1::run()::{lambda()#1}&>(std::__is_invocable&&, (test_bsd_tcp1::run()::{lambda()#1}&)...) (__fn=...) at /usr/include/c++/10/bits/invoke.h:153

#5  std::_Function_handler<void (), test_bsd_tcp1::run()::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/10/bits/std_function.h:291

#6  0x000000004031cba8 in std::function<void ()>::operator()() const (this=0xffffa0004168d630) at /usr/include/c++/10/bits/std_function.h:622

#7  0x000000004043e1cc in sched::thread::main (this=0xffffa0004168d600) at core/sched.cc:1219

#8  0x000000004043a188 in sched::thread_main_c (t=0xffffa0004168d600) at arch/aarch64/arch-switch.hh:186

#9  0x0000000040439cf4 in sched::thread::switch_to (this=0x0) at arch/aarch64/arch-switch.hh:28

#10 0x0000000000000000 in ?? ()

Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) frame 1

#1  0x0000100000037954 in test_bsd_tcp1::tcp_server (this=0x2000006ff988) at /home/wkozaczuk/projects/osv/tests/tst-bsd-tcp1-zsnd.cc:114

114             int bytes2 = zcopy_tx(client_s, &zm);

(gdb) p client_s

$1 = 5

(gdb) p &zm

$2 = (zmsghdr *) 0xffff800041782d40


As you can see, the test app calls zcopy_tx(), which takes 2 arguments:

ssize_t zcopy_tx(int s, struct zmsghdr *zm)

The 1st one is an int with value 5 in the caller (the test app), and it is received as such in the kernel's zcopy_tx.


The second one, the address of struct zmsghdr, is problematic. On the caller's side it looks OK (0xffff800041782d40), but when received in the kernel it is wrong: 0x1.

Why?


I saw another test crashing in a similar way: the caller (another test) passed 3 arguments to a kernel function, and 2 of those (non-addresses) were passed correctly but the 3rd one, an address, was not.


Any ideas what might be going on?


Waldek

Nadav Har'El

Feb 11, 2021, 9:06:07 AM
to Waldek Kozaczuk, OSv Development
On Thu, Feb 11, 2021 at 7:42 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:

> #1  0x0000100000037954 in test_bsd_tcp1::tcp_server (this=0x2000006ff988) at /home/wkozaczuk/projects/osv/tests/tst-bsd-tcp1-zsnd.cc:114
> 114             int bytes2 = zcopy_tx(client_s, &zm);
> (gdb) p client_s
> $1 = 5
> (gdb) p &zm
> $2 = (zmsghdr *) 0xffff800041782d40
>
> As you can see the test app calls zcopy_tx() which takes 2 arguments:
> ssize_t zcopy_tx(int s, struct zmsghdr *zm)
> the 1st one is int and has value 5 in the caller - the test app - and is received as such in the kernel zcopy_tx.
>
> The second one - the address of struct zmsghdr - is problematic. On the caller's side looks OK but when received in the kernel it is wrong - 0x1.
> Why?


Not being an expert on aarch64 or its function calling conventions, all I can do is raise some wild guesses. I hope one of them is correct and you can figure out which, perhaps by reading the code or by trying to reproduce it in new tests (you could perhaps write a new test which loops calling some function f() with a bunch of parameters in multiple threads, and prints an error if f ever gets called with wrong parameters).

One possibility is that our context-switch implementation is forgetting to save some of the registers, and the register which is used to hold the third argument of a function is lost on the context switch.

Another possibility is that we lose this register in smaller asynchronous events, not just in context switches between threads. We have interrupts (e.g., the timer interrupt), exceptions, and signals, which can run complex OSv code in the middle of the user's function without the function knowing that this is happening, so when we switch to these interrupts or exceptions we mustn't forget the registers which the OSv code may clobber.
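
A test along the lines suggested above could look roughly like this. It is only a sketch: the names `f`, `slots` and `run_arg_check` are hypothetical, it uses plain std::thread rather than any OSv-specific API, and the `noinline` attribute is a GCC/Clang extension used to force a real call so that the arguments actually travel in registers.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical test scaffolding; nothing here is part of OSv.
static std::atomic<unsigned long> mismatches{0};
static int slots[64];

// The three arguments are tied together (twice == 2 * idx, addr == &slots[idx]),
// so f() can tell on its own when one of them arrives corrupted.
__attribute__((noinline))
void f(int idx, long twice, int* addr)
{
    if (twice != 2L * idx || addr != &slots[idx]) {
        mismatches.fetch_add(1, std::memory_order_relaxed);
    }
}

// Calls f() in a tight loop from several threads and returns how many
// calls observed inconsistent arguments.
unsigned long run_arg_check(unsigned nthreads, unsigned long iterations)
{
    mismatches = 0;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t) {
        threads.emplace_back([=] {
            for (unsigned long i = 0; i < iterations; ++i) {
                int idx = static_cast<int>((i + t) % 64);
                f(idx, 2L * idx, &slots[idx]);
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return mismatches.load();
}
```

On a correct system `run_arg_check(4, 100000)` returns 0. A non-zero count would point at arguments being lost between the call site and the callee, although it would not say where.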
 


--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/4a97809f-d207-48b9-88e7-06e218e5d829n%40googlegroups.com.

Waldek Kozaczuk

Feb 13, 2021, 11:24:13 AM
to Nadav Har'El, OSv Development
Hi,

I think you are right that we are "losing" a register, or overwriting the stack memory we restore registers from. It is just not clear where.

So here is what I have tried since the last email:
1) I had a theory that there might be a bug in the __elf_resolve_pltgot assembly, which is called lazily as the app executes and might be clobbering registers. So I forced the app's symbols to be resolved eagerly, and I still get the exact same crash. So back to the drawing board.
2) I made gradual changes to conf/debug.mk to find the minimal difference from release.mk that makes the crashes happen:
  • Disabling the debug messages with 'conf-logger_debug=0' still gives the exact same crashes.
  • Removing '-Wno-maybe-uninitialized' from debug.mk does not change anything either: still the same crash.
  • At this point the only differences are the optimization level and the missing '-DNDEBUG', so I tried bumping the debug optimization level to '-O1'. Interestingly, this makes the crash happen in kernel code, in the static function rofs_mount() while mounting the ROFS filesystem, and it is very repeatable. In the end it is similar to the original crashes with -O0: some register, typically one of the x2* ones (x24, x22 or x28), changes in the middle of a function after it calls another one. That register typically holds an address, and afterwards it has a bogus value, as if it had not been restored by the callee.
In the specific -O1 kernel crash above, rofs_mount() calls rofs_read_blocks(), which reads from a block device, so it certainly causes an exit to the hypervisor and an interrupt being handled, and therefore multiple threads being switched. So again, maybe something is wrong in the interrupt handler?

Now, looking at all the assembly code (which is where I think the issue is), there are not that many candidates. One is entry.S, which has the entry points of the sync (page fault) and async (interrupt) handlers. Both save and restore all the registers they should, I think, using these 2 macros:
.macro push_state_to_exception_frame
        sub     sp, sp, #48 // make space for align2, align1+ESR, PSTATE, PC, SP
        push_pair x28, x29
        push_pair x26, x27
        push_pair x24, x25
        push_pair x22, x23
        push_pair x20, x21
        push_pair x18, x19
        push_pair x16, x17
        push_pair x14, x15
        push_pair x12, x13
        push_pair x10, x11
        push_pair x8, x9
        push_pair x6, x7
        push_pair x4, x5
        push_pair x2, x3
        push_pair x0, x1
        add     x1, sp, #288         // x1 := old SP (48 + 16 * 15 = 288)
        mrs     x2, elr_el1
        mrs     x3, spsr_el1
        stp     x30, x1, [sp, #240]  // store lr, old SP
        stp     x2, x3, [sp, #256]   // store elr_el1, spsr_el1

.macro pop_state_from_exception_frame
        ldp     x21, x22, [sp, #256] // load elr_el1, spsr_el1
        pop_pair x0, x1
        pop_pair x2, x3
        pop_pair x4, x5
        pop_pair x6, x7
        pop_pair x8, x9
        msr     elr_el1, x21         // set loaded elr and spsr
        msr     spsr_el1, x22
        pop_pair x10, x11
        pop_pair x12, x13
        pop_pair x14, x15
        pop_pair x16, x17
        pop_pair x18, x19
        pop_pair x20, x21
        pop_pair x22, x23
        pop_pair x24, x25
        pop_pair x26, x27
        pop_pair x28, x29
        ldr     x30, [sp], #48
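
The offsets used by the two macros can be cross-checked with a small layout sketch. This assumes that push_pair stores its two registers with a 16-byte pre-decrement of sp (so x0 ends up at the lowest address); the struct `frame_sketch` and its field names are hypothetical and only mirror the comments in the macros.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the frame built by push_state_to_exception_frame.
// Only the sizes and offsets come from the macro; the names are illustrative.
struct frame_sketch {
    std::uint64_t regs[31];        // x0..x30, x0 at [sp, #0], x30 at [sp, #240]
    std::uint64_t sp;              // old SP, second half of the stp at #240
    std::uint64_t elr;             // elr_el1 (PC), stored at [sp, #256]
    std::uint64_t spsr;            // spsr_el1 (PSTATE), stored at [sp, #264]
    std::uint64_t esr_and_align1;  // the "align1+ESR" slot
    std::uint64_t align2;          // padding so the frame is a multiple of 16
};

// 15 push_pair invocations cover x0..x29: 15 * 16 = 240 bytes.
static_assert(offsetof(frame_sketch, regs) + 30 * sizeof(std::uint64_t) == 240,
              "x30 lands at offset 240");
static_assert(offsetof(frame_sketch, sp) == 248, "old SP follows x30");
static_assert(offsetof(frame_sketch, elr) == 256, "matches the stp/ldp at #256");
static_assert(offsetof(frame_sketch, spsr) == 264, "second half of that pair");
// The initial 'sub sp, sp, #48' plus the 240 bytes of pairs gives the
// 288 used in 'add x1, sp, #288'.
static_assert(sizeof(frame_sketch) == 48 + 16 * 15, "frame is 288 bytes");
```

With that assumption the arithmetic in the macros is self-consistent, so a wrong offset inside the frame itself looks unlikely as the cause.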

There may be an issue in the context switch functions, which look like this, but I do not see any obvious problems here:

void thread::switch_to()
{
    thread* old = current();
    asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");

    asm volatile("\n"
                 "str x29,     %0  \n"
                 "mov x2, sp       \n"
                 "adr x1, 1f       \n" /* address of label */
                 "stp x2, x1,  %1  \n"

                 "ldp x29, x0, %2  \n"
                 "ldp x2, x1,  %3  \n"

                 "mov sp, x2       \n"
                 "blr x1           \n"

                 "1:               \n" /* label */
                 :
                 : "Q"(old->_state.fp), "Ump"(old->_state.sp),
                   "Ump"(this->_state.fp), "Ump"(this->_state.sp)
                 : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
                   "x16", "x17", "x18", "x30", "memory");
}

void thread::switch_to_first()
{
    asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");

    /* check that the tls variable preempt_counter is correct */
    assert(sched::get_preempt_counter() == 1);

    s_current = this;
    current_cpu = _detached_state->_cpu;
    remote_thread_local_var(percpu_base) = _detached_state->_cpu->percpu_base;

    asm volatile("\n"
                 "ldp x29, x0, %0  \n"
                 "ldp x2, x1, %1   \n"
                 "mov sp, x2       \n"
                 "blr x1           \n"
                 :
                 : "Ump"(this->_state.fp), "Ump"(this->_state.sp)
                 : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
                   "x16", "x17", "x18", "x30", "memory");
}


Just for reference, the ARM function calling convention is this (as far as registers go):

Caller:
* Save any x0-x18 registers it still needs before calling a function (the callee may change them)
* Use registers x0-x7 for the first 8 parameters
* Push extra parameters on the stack
* Call the function
* Restore any x0-x18 registers that were saved


Callee:
* Push lr and any x19-x30 registers it uses on the stack
* Execute code
* Pop from the stack any x19-x30 registers pushed above
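
The summary above can also be written down as a small lookup, which makes it easy to check which side is responsible for a given register. This is a sketch assuming the register roles as given in the AAPCS64 document; `reg_role`, `role_of` and `callee_must_preserve` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical helper: role of general-purpose register xN under AAPCS64.
enum class reg_role {
    argument,       // x0-x7: parameters and results, caller-saved
    indirect,       // x8: indirect result location, caller-saved
    temporary,      // x9-x15: caller-saved scratch
    ip_scratch,     // x16-x17: intra-procedure-call scratch (IP0, IP1)
    platform,       // x18: platform register
    callee_saved,   // x19-x28: must be preserved by the callee
    frame_pointer,  // x29
    link_register   // x30
};

constexpr reg_role role_of(int n)
{
    return n <= 7  ? reg_role::argument
         : n == 8  ? reg_role::indirect
         : n <= 15 ? reg_role::temporary
         : n <= 17 ? reg_role::ip_scratch
         : n == 18 ? reg_role::platform
         : n <= 28 ? reg_role::callee_saved
         : n == 29 ? reg_role::frame_pointer
         :           reg_role::link_register;
}

// A called function has to hand x19-x29 back unchanged.
constexpr bool callee_must_preserve(int n)
{
    return role_of(n) == reg_role::callee_saved
        || role_of(n) == reg_role::frame_pointer;
}

// x1 carries the second argument (zm in zcopy_tx); x22, x24 and x28 are the
// registers seen with bogus values after a call in the -O1 build.
static_assert(role_of(1) == reg_role::argument, "x1 is an argument register");
static_assert(callee_must_preserve(22) && callee_must_preserve(24)
              && callee_must_preserve(28), "all three are callee-saved");
```

So the registers that go bad in the -O1 crashes are all ones the callee, or whatever runs in between, is expected to hand back unchanged.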

Waldek

Nadav Har'El

Feb 14, 2021, 2:33:16 PM
to Waldek Kozaczuk, OSv Development
I have two problems trying to understand this: first, sadly my memory of how we did all of this in x86 is very rusty.
Second, I have no idea what was done differently in aarch64, and if so why.

You seem to be pushing registers on the stack here. Where is this stack? In x86, we had separate stacks for exceptions, for nested exceptions, and interrupts.
Is this also true in the arm version?

This discussion of the stack made me think of another possible reason for losing data in functions.
The red zone.
Do we have a "red zone" on arm64 as well?
Basically, the "red zone" is 128 bytes below the stack pointer that a function can use as scratch space, and it can use it for example to store some of the parameters if it needs the registers to store something else - without wasting time on instructions to change the stack pointer. If some interrupt or exception overwrites this redzone, we lose data. 
To avoid this, we usually had separate stacks for interrupts and exceptions and nested exceptions, but where we didn't want to do this, e.g., in syscalls, we had to skip the redzone (see for example commit 499b9433ae748b6c04dedc2125ea17010ffbdaf1).

I have another wild guess below - caller-saved registers:
The "typical" problem here (I don't know if it happens in your case, but it happened in the past in various cases) is that "something" (interrupt, exception, signal, etc.) gets called in the middle of the user's function code. The function did not know it was going to call anything, so it didn't save these caller-saved registers. This is why all of that asynchronous code needs to save all those caller-saved registers. In x86, we had these problems with the FPU and had to save the FPU state in a bunch of places. Maybe in aarch64 we need to save additional registers in the same place we saved the FPU state for x86?

Waldek Kozaczuk

Feb 15, 2021, 12:43:22 AM
to OSv Development
Good question. So in our aarch64 port, unlike in x64, there is no dedicated exception or interrupt stack. There is a single stack per thread where everything happens, and this might be the issue. Somehow it works with '-O2', but maybe some of the aarch64 bugs we do not understand happen because of the single stack.

> This discussion of the stack made me think of another possible reason for losing data in functions.
> The red zone.
> Do we have a "red zone" on arm64 as well?
> Basically, the "red zone" is 128 bytes below the stack pointer that a function can use as scratch space, and it can use it for example to store some of the parameters if it needs the registers to store something else - without wasting time on instructions to change the stack pointer. If some interrupt or exception overwrites this redzone, we lose data.
> To avoid this, we usually had separate stacks for interrupts and exceptions and nested exceptions, but where we didn't want to do this, e.g., in syscalls, we had to skip the redzone (see for example commit 499b9433ae748b6c04dedc2125ea17010ffbdaf1).

The ARMv8 Procedure Call Standard - https://developer.arm.com/documentation/ihi0055/latest/ - does not mention any "red zone". However, the Windows (https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#red-zone) and Apple (https://developer.apple.com/documentation/xcode/writing_arm64_code_for_apple_platforms) extensions both mention a red zone, of 16 bytes and 128 bytes respectively. I could not find anything specific for Linux, though. For sure Linux requires a 128-byte red zone on x86_64.

Now I devised a simple experiment: I subtracted 128 bytes from sp at the beginning of the push_frame macro and added 128 bytes at the end of the pop_frame macro. That did not help in any way. I also tried 256 and 512, to the same effect.
There is arm64 FPU save/restore code in OSv, where it saves the floating point registers. But are you suggesting we save/restore extra registers in there? Why, if push_frame/pop_frame already do that for us when the interrupt is taken?

Now in arm64, which is RISC, the stack is manipulated very differently than in x64. Very often, storing to or reading from the stack does not touch the stack pointer but merely references it, and the stack pointer is adjusted at some other point (possibly before). So I wonder whether, if an interrupt arrives in the middle of that, before the stack pointer is adjusted, we might push the frame at the wrong place and overwrite some registers. But then my experiment with adding/subtracting 128 or 256 bytes, as I described above, should have helped.

BTW compiling with '-O1 -fcaller-saves' makes the crash happen in another place.

Nadav Har'El

Feb 15, 2021, 2:19:20 AM
to Waldek Kozaczuk, OSv Development
On Mon, Feb 15, 2021 at 7:43 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:


 
> Good question. So in our aarch64 port, there is no dedicated exception nor interrupt stack unlike in x64. There is a single stack per thread where everything happens. And this might be the issue.
> But somehow it works with '-O2' but maybe some bugs we have for aarch64 which we do not understand do happen because of a single stack.

As usual this is just a wild theory, but it's possible that O2 code uses fewer or more registers, or uses the stack more or less or differently.
 

> The ARMv8 Procedure Call Standard - https://developer.arm.com/documentation/ihi0055/latest/ - does not mention any "red zone". However, both Windows (https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#red-zone) and Apple (https://developer.apple.com/documentation/xcode/writing_arm64_code_for_apple_platforms) extensions both mention the red zone - 16 bytes and 128 bytes respectively. I could not find anything specific for Linux though. For sure Linux requires 128 red zone for x86_64.

I suspect that gcc does *not* use the red zone on aarch64 - it seems there is no "-mno-red-zone" option, and https://github.com/iains/gcc-darwin-arm64 suggests that unlike Apple's compiler, gcc doesn't use a red zone - and your experiment below also suggests that this is not the problem. So that's probably (hopefully) not the problem, so there is one less reason to use separate stacks.

By the way another consequence of using the user's stacks for interrupts, exceptions, etc., is that it becomes more important for all stacks to be big enough. I wonder if the problem could be that some of our thread stacks are too small. Maybe you can hack sched::thread to always use a larger minimum stack size and see if it helps?

 

> Now I devised a simple experiment and subtracted 128 bytes from sp at the beginning of push_frame and added 128 bytes at the end of the pop_frame macros that did not help in any way. I also tried 256, 512 to the same effect.



> There is the arm64 FPU save/restore code in OSv where it saves floating point registers. But are you suggesting we save/restore extra registers in there? But why if push_frame/pop_frame do that for us already when the interrupt is taken?

OSv does this FPU save/restore in *more* than just interrupts. We also have exceptions, signal handlers, and SYSCALL, all of which can wind up calling code in the middle of user code. So the first code that leaves the user's code needs to save these registers. It sounds like you're doing this correctly for interrupts, but maybe it's missing for some other things?

That being said, if this were something as "obvious" as not saving the registers, I would suspect this would have been more obvious and more frequent, and not specific to O1. So maybe that's not the problem.
 

> Now in arm64, which is RISC, the stack is manipulated very differently than in x64, and very often storing or reading from the stack does not touch the stack pointer but merely references it, and then at some point, it is adjusted accordingly (it may happen before). So I wonder if the interrupt is called in the middle of that before the stack pointer is adjusted we might be pushing the frame at the wrong place and overriding some registers.

Hmm...

> But then my experiment with adding/subtracting 128 or 256 bytes as I described above should have helped.

Yes, it sounds like it would.

> BTW compiling with 'O1 -fcaller-saves' makes the crash happen in another place.
 

Waldek Kozaczuk

Feb 15, 2021, 11:32:57 AM
to Nadav Har'El, OSv Development
On Mon, Feb 15, 2021 at 02:19 Nadav Har'El <n...@scylladb.com> wrote:
> As usual this is just a wild theory, but it's possible that O2 code uses fewer or more registers, or uses the stack more or less or differently.
Right. 
 

> By the way another consequence of using the user's stacks for interrupts, exceptions, etc., is that it becomes more important for all stacks to be big enough. I wonder if the problem could be that some of our thread stacks are too small. Maybe you can hack sched::thread to always use a larger minimum stack size and see if it helps?

So here is what I tried at the same time:
1) Enforced eager resolving of symbols (for one type of the crash).
2) Added a space of 128 bytes (tried 256, 512, even 4K) on the stack BEFORE pushing the exception frame, for both interrupts and other exceptions.
3) Added a space of 256 bytes on the stack AFTER pushing the exception frame, for both interrupts and exceptions.
4) Doubled the size of the default stack from 64K to 128K.

Same crashes as before, and still different between -O0 and -O1, just as before my changes. So none of the above helps or changes anything.

So my theory was that either the exception frame overwrites the portion of the stack where registers are restored from, or something overwrites part of the exception frame. But the experiments above suggest that this may not be the case.

Then what else? A compiler bug? I will try an older compiler version, 9.3.

Avi Kivity

Feb 15, 2021, 1:11:50 PM
to Waldek Kozaczuk, OSv Development

Can you provide a disassembly of zcopy_tx? From the start of the
function until the crash site (there should only be register saves and
other preamble, and the call to operator new, if I read it correctly).


Maybe save zm in some global before calling new, to see if operator new
is the problem.
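
The "save zm in a global" check could be sketched like this. It is a standalone illustration rather than a patch to zcopy_tx(): `zmsghdr_sketch`, `saved_zm` and `zm_survives_new` are hypothetical names, and the `new long` merely stands in for the allocation made at the start of the real function.

```cpp
#include <cassert>

// Hypothetical stand-in for the real struct; only its address matters here.
struct zmsghdr_sketch {
    void* handle;
};

// Global copy of the argument, written before operator new runs. volatile
// keeps the compiler from eliding the store or the later re-read.
static zmsghdr_sketch* volatile saved_zm = nullptr;

// Mimics the start of zcopy_tx(): remember zm, allocate, then compare the
// live argument against the copy that was parked in memory.
bool zm_survives_new(zmsghdr_sketch* zm)
{
    saved_zm = zm;
    long* p = new long(0);           // stand-in for the allocation in zcopy_tx()
    bool intact = (zm == saved_zm);  // false if zm changed across the call
    delete p;
    return intact;
}
```

In the kernel the same idea would let gdb compare `zm` with the global copy at the crash site: if they differ, the value was lost somewhere between function entry and the return from operator new.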

Waldek Kozaczuk

Feb 16, 2021, 1:01:54 AM
to Avi Kivity, OSv Development

For comparison, here is the fragment of zcopy_tx from the release loader.elf, up to the call to eventfd that follows operator new:

Dump of assembler code for function zcopy_tx(int, zmsghdr*):
   0x0000000040100da0 <+0>:     stp     x29, x30, [sp, #-144]!
   0x0000000040100da4 <+4>:     mov     x29, sp
   0x0000000040100da8 <+8>:     stp     x21, x22, [sp, #32]
   0x0000000040100dac <+12>:    mov     x22, x1
   0x0000000040100db0 <+16>:    stp     x19, x20, [sp, #16]
   0x0000000040100db4 <+20>:    mov     w20, w0
   0x0000000040100db8 <+24>:    mov     x0, #0x8                        // #8
   0x0000000040100dbc <+28>:    stp     x23, x24, [sp, #48]
   0x0000000040100dc0 <+32>:    stp     x25, x26, [sp, #64]
   0x0000000040100dc4 <+36>:    stp     x27, x28, [sp, #80]
   0x0000000040100dc8 <+40>:    stp     xzr, xzr, [sp, #104]
   0x0000000040100dcc <+44>:    stp     xzr, xzr, [sp, #120]
   0x0000000040100dd0 <+48>:    str     xzr, [sp, #136]
   0x0000000040100dd4 <+52>:    bl      0x403920e0 <_Znwm>              
   0x0000000040100dd8 <+56>:    mov     x23, x0
   0x0000000040100ddc <+60>:    str     x0, [x22, #64]
   0x0000000040100de0 <+64>:    mov     w1, #0x800                      // #2048
   0x0000000040100de4 <+68>:    mov     w0, #0x0                        // #0
   0x0000000040100de8 <+72>:    movk    w1, #0x8, lsl #16
   0x0000000040100dec <+76>:    str     xzr, [x23]
   0x0000000040100df0 <+80>:    bl      0x403520e0 <eventfd(unsigned int, int)>

Now the equivalent fragment from the debug version (the crash happens at PC: 0x0000000040111e80 <zcopy_tx+84>):

Dump of assembler code for function zcopy_tx(int, zmsghdr*):
   0x0000000040111e2c <+0>:     stp     x29, x30, [sp, #-208]!
   0x0000000040111e30 <+4>:     mov     x29, sp
   0x0000000040111e34 <+8>:     stp     x19, x20, [sp, #16]
   0x0000000040111e38 <+12>:    str     w0, [sp, #44]
   0x0000000040111e3c <+16>:    str     x1, [sp, #32]
   0x0000000040111e40 <+20>:    str     xzr, [sp, #88]
   0x0000000040111e44 <+24>:    str     xzr, [sp, #96]
   0x0000000040111e48 <+28>:    str     xzr, [sp, #104]
   0x0000000040111e4c <+32>:    str     xzr, [sp, #112]
   0x0000000040111e50 <+36>:    str     xzr, [sp, #120]
   0x0000000040111e54 <+40>:    str     xzr, [sp, #184]
   0x0000000040111e58 <+44>:    ldr     x0, [sp, #32]
   0x0000000040111e5c <+48>:    str     x0, [sp, #176]
   0x0000000040111e60 <+52>:    mov     x0, #0x8                        // #8
   0x0000000040111e64 <+56>:    bl      0x405b7e60 <_Znwm>
   0x0000000040111e68 <+60>:    mov     x19, x0
   0x0000000040111e6c <+64>:    mov     x0, x19
   0x0000000040111e70 <+68>:    bl      0x40112544 <ztx_handle::ztx_handle()>
   0x0000000040111e74 <+72>:    str     x19, [sp, #168]
   0x0000000040111e78 <+76>:    ldr     x0, [sp, #32]
   0x0000000040111e7c <+80>:    ldr     x1, [sp, #168]
   0x0000000040111e80 <+84>:    str     x1, [x0, #64]                   // <- PC reported in the stack trace
   0x0000000040111e84 <+88>:    mov     w1, #0x800                      // #2048
   0x0000000040111e88 <+92>:    movk    w1, #0x8, lsl #16
   0x0000000040111e8c <+96>:    mov     w0, #0x0                        // #0
   0x0000000040111e90 <+100>:   bl      0x40557c1c <eventfd(unsigned int, int)>

Assembly of _Znwm which I believe is the same in both cases:

Dump of assembler code for function _Znwm:
   0x00000000405b7e60 <+0>: stp x29, x30, [sp, #-32]!
   0x00000000405b7e64 <+4>: cmp x0, #0x0
   0x00000000405b7e68 <+8>: mov x29, sp
   0x00000000405b7e6c <+12>: str x19, [sp, #16]
   0x00000000405b7e70 <+16>: csinc x19, x0, xzr, ne  // ne = any
   0x00000000405b7e74 <+20>: mov x0, x19
   0x00000000405b7e78 <+24>: bl 0x40406f4c <malloc(size_t)>
   0x00000000405b7e7c <+28>: cbz x0, 0x405b7e8c <_Znwm+44>
   0x00000000405b7e80 <+32>: ldr x19, [sp, #16]
   0x00000000405b7e84 <+36>: ldp x29, x30, [sp], #32
   0x00000000405b7e88 <+40>: ret
   0x00000000405b7e8c <+44>: bl 0x405b7e50 <_ZSt15get_new_handlerv>
   0x00000000405b7e90 <+48>: cbz x0, 0x405b7e9c <_Znwm+60>
   0x00000000405b7e94 <+52>: blr x0
   0x00000000405b7e98 <+56>: b 0x405b7e74 <_Znwm+20>
   0x00000000405b7e9c <+60>: mov x0, #0x8                   // #8
   0x00000000405b7ea0 <+64>: bl 0x405b6170 <__cxa_allocate_exception>
   0x00000000405b7ea4 <+68>: adrp x3, 0x40098000
   0x00000000405b7ea8 <+72>: adrp x2, 0x40099000
   0x00000000405b7eac <+76>: adrp x1, 0x40098000
   0x00000000405b7eb0 <+80>: ldr x3, [x3, #1128]
   0x00000000405b7eb4 <+84>: ldr x2, [x2, #928]
   0x00000000405b7eb8 <+88>: add x3, x3, #0x10
   0x00000000405b7ebc <+92>: ldr x1, [x1, #2992]
   0x00000000405b7ec0 <+96>: str x3, [x0]
   0x00000000405b7ec4 <+100>: bl 0x405b76e0 <__cxa_throw>

And ztx_handle::ztx_handle() for debug:

Dump of assembler code for function ztx_handle::ztx_handle():
   0x0000000040112544 <+0>: stp x29, x30, [sp, #-32]!
   0x0000000040112548 <+4>: mov x29, sp
   0x000000004011254c <+8>: str x0, [sp, #24]
   0x0000000040112550 <+12>: ldr x0, [sp, #24]
   0x0000000040112554 <+16>: mov x1, #0x0                   // #0
   0x0000000040112558 <+20>: bl 0x4011245c <std::atomic<unsigned long>::atomic(unsigned long)>
   0x000000004011255c <+24>: nop
   0x0000000040112560 <+28>: ldp x29, x30, [sp], #32
   0x0000000040112564 <+32>: ret
End of assembler dump.

Waldek Kozaczuk

Feb 16, 2021, 4:11:56 PM2/16/21
to Avi Kivity, OSv Development
OK, I think I have narrowed it down to cpu::reschedule_from_interrupt(), which of course switches from one thread to another.

But let me tell you what I did before that. I tried saving a copy of zm in a global variable in zcopy_tx(). It did not help. I realized later that this test is multithreaded, so zcopy_tx() is called by both the client and the server thread. Interestingly, when I changed the number of ITERATIONS from 400 to 1, the test passed. With 2 it hung, and with 3 and above it crashed the same way.

Then I started thinking again that maybe the problem is with thread switching. So I ran an experiment where I compiled all sources except core/sched.cc with '-O0', and core/sched.cc with '-O2' like the release build. And voila, all the tests that were failing now pass. So then I thought that maybe the culprit is switch_to(), where we have some critical assembly. So I compiled core/sched.cc with '-O0' but switch_to() with '-O2' using the '__attribute__ ((optimize("O2")))' annotation. This did not help, but the crash looked different. So then I pinpointed just reschedule_from_interrupt() and made it compile with '-O2' (so everything else is with '-O0'), and now the tests are working.
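
For readers unfamiliar with the annotation, here is a minimal sketch of raising the optimization level of one function while the rest of the file keeps the command-line level. The `OPTIMIZE_O2` macro and `sum_to` function are hypothetical names for illustration, not OSv code. The attribute is GCC-specific, so the macro expands to nothing on other compilers.

```cpp
#include <cassert>

// GCC understands __attribute__((optimize(...))); other compilers may
// ignore or warn about it, so it is enabled for GCC only.
#if defined(__GNUC__) && !defined(__clang__)
#define OPTIMIZE_O2 __attribute__((optimize("O2")))
#else
#define OPTIMIZE_O2
#endif

// Compiled as if with -O2 even when the file is built with -O0.
OPTIMIZE_O2 long sum_to(long n)
{
    long total = 0;
    for (long i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}

// No annotation: compiled at whatever level the command line requests.
long sum_to_default(long n)
{
    long total = 0;
    for (long i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}
```

Comparing the output of `objdump -d` for the two functions in an -O0 build is a quick way to confirm the attribute took effect. The GCC manual describes this attribute as intended for debugging rather than production use, which fits an experiment like this one.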

So what might be going on? For sure this is a very critical function where we change thread and TLS and call switch_to().

Here is the disassembled working version (compiled with -O2):

Dump of assembler code for function _ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE:
   0x00000000402d75b4 <+0>: stp x29, x30, [sp, #-128]!
   0x00000000402d75b8 <+4>: mov x29, sp
   0x00000000402d75bc <+8>: stp x19, x20, [sp, #16]
   0x00000000402d75c0 <+12>: mov x19, x0
   0x00000000402d75c4 <+16>: and w0, w1, #0xff
   0x00000000402d75c8 <+20>: stp x21, x22, [sp, #32]
   0x00000000402d75cc <+24>: adrp x22, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d75d0 <+28>: add x22, x22, #0x540
   0x00000000402d75d4 <+32>: stp x23, x24, [sp, #48]
   0x00000000402d75d8 <+36>: mrs x24, tpidr_el0
   0x00000000402d75dc <+40>: stp x25, x26, [sp, #64]
   0x00000000402d75e0 <+44>: stp x27, x28, [sp, #80]
   0x00000000402d75e4 <+48>: str w0, [sp, #100]
   0x00000000402d75e8 <+52>: add x0, x22, #0x510
   0x00000000402d75ec <+56>: str x2, [sp, #104]
   0x00000000402d75f0 <+60>: bl 0x402d45d0 <(anonymous namespace)::tracepointv<21, std::tuple<>(), identity_assign<> >::operator()(void)>
   0x00000000402d75f4 <+64>: add x0, x24, #0x0, lsl #12
   0x00000000402d75f8 <+68>: add x0, x0, #0x70
   0x00000000402d75fc <+72>: ldr w0, [x0]
   0x00000000402d7600 <+76>: cmp w0, #0x1
   0x00000000402d7604 <+80>: b.hi 0x402d79a8 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1012>  // b.pmore
   0x00000000402d7608 <+84>: add x20, x24, #0x0, lsl #12
   0x00000000402d760c <+88>: add x20, x20, #0x90
   0x00000000402d7610 <+92>: mov x0, x19
   0x00000000402d7614 <+96>: strb wzr, [x20, #24]
   0x00000000402d7618 <+100>: bl 0x402d64a0 <_ZN5sched3cpu23handle_incoming_wakeupsEv>
   0x00000000402d761c <+104>: bl 0x401f2450 <_ZN3osv5clock6uptime3nowEv>
   0x00000000402d7620 <+108>: mov x23, x0
   0x00000000402d7624 <+112>: ldr x1, [x19, #5912]
   0x00000000402d7628 <+116>: str x23, [x19, #5912]
   0x00000000402d762c <+120>: ldr x20, [x20]
   0x00000000402d7630 <+124>: sub x1, x23, x1
   0x00000000402d7634 <+128>: cmp x1, #0x0
   0x00000000402d7638 <+132>: mov x0, #0x2710                 // #10000
   0x00000000402d763c <+136>: csel x1, x1, x0, gt
   0x00000000402d7640 <+140>: ldr x0, [x20, #136]
   0x00000000402d7644 <+144>: add x0, x0, #0x14
   0x00000000402d7648 <+148>: ldar w21, [x0]
   0x00000000402d764c <+152>: cmp w21, #0x6
   0x00000000402d7650 <+156>: b.eq 0x402d7988 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+980>  // b.none
   0x00000000402d7654 <+160>: ldr x2, [x20, #376]
   0x00000000402d7658 <+164>: add x26, x20, #0x78
   0x00000000402d765c <+168>: mov x0, x26
   0x00000000402d7660 <+172>: add x2, x2, x1
   0x00000000402d7664 <+176>: str x2, [x20, #376]
   0x00000000402d7668 <+180>: bl 0x402d6710 <_ZN5sched14thread_runtime7ran_forENSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE>
   0x00000000402d766c <+184>: cmp w21, #0x5
   0x00000000402d7670 <+188>: b.ne 0x402d7868 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+692>  // b.any
   0x00000000402d7674 <+192>: ldr x0, [x19, #4184]
   0x00000000402d7678 <+196>: cbz x0, 0x402d78c0 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+780>
   0x00000000402d767c <+200>: ldr w0, [sp, #100]
   0x00000000402d7680 <+204>: cbnz w0, 0x402d78e4 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+816>
   0x00000000402d7684 <+208>: ldr x21, [x19, #4200]
   0x00000000402d7688 <+212>: add x27, x19, #0x1, lsl #12
   0x00000000402d768c <+216>: ldr s0, [x26, #4]
   0x00000000402d7690 <+220>: sub x21, x21, #0xb8
   0x00000000402d7694 <+224>: ldr s1, [x21, #4]
   0x00000000402d7698 <+228>: fcmpe s1, s0
   0x00000000402d769c <+232>: b.gt 0x402d7940 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+908>
   0x00000000402d76a0 <+236>: mov x0, x26
   0x00000000402d76a4 <+240>: bl 0x402d69d0 <_ZN5sched14thread_runtime19hysteresis_run_stopEv>
   0x00000000402d76a8 <+244>: ldr x0, [x20, #136]
   0x00000000402d76ac <+248>: mov w1, #0x6                   // #6
   0x00000000402d76b0 <+252>: add x0, x0, #0x14
   0x00000000402d76b4 <+256>: stlr w1, [x0]
   0x00000000402d76b8 <+260>: mov x1, x20
   0x00000000402d76bc <+264>: mov x0, x19
   0x00000000402d76c0 <+268>: bl 0x402d4e80 <_ZN5sched3cpu7enqueueERNS_6threadE>
   0x00000000402d76c4 <+272>: add x0, x22, #0x580
   0x00000000402d76c8 <+276>: bl 0x402d4654 <(anonymous namespace)::tracepointv<17, std::tuple<>(), identity_assign<> >::operator()(void)>
   0x00000000402d76cc <+280>: add x0, x20, #0x168
   0x00000000402d76d0 <+284>: mov x1, #0x1                   // #1
   0x00000000402d76d4 <+288>: bl 0x402dbd70 <_ZN5sched6thread12stat_counter4incrEm>
   0x00000000402d76d8 <+292>: ldr x25, [x27, #104]
   0x00000000402d76dc <+296>: mov x2, #0x1058                 // #4184
   0x00000000402d76e0 <+300>: add x8, sp, #0x78
   0x00000000402d76e4 <+304>: add x0, x19, x2
   0x00000000402d76e8 <+308>: sub x21, x25, #0x130
   0x00000000402d76ec <+312>: add x1, sp, #0x70
   0x00000000402d76f0 <+316>: str x25, [sp, #112]
   0x00000000402d76f4 <+320>: bl 0x402dc690 <_ZN5boost9intrusive11bstree_implINS0_8mhtraitsIN5sched6threadENS0_15set_member_hookIJEEEXadL_ZNS4_14_runqueue_linkEEEEEvNS3_22thread_runtime_compareEmLb1ELNS0_10algo_typesE5EvE5eraseENS0_13tree_iteratorIS7_Lb1EEE>
   0x00000000402d76f8 <+324>: mov x0, x21
   0x00000000402d76fc <+328>: mov x1, x23
   0x00000000402d7700 <+332>: ldr x2, [x21, #376]
   0x00000000402d7704 <+336>: bl 0x402dbd80 <_ZN5sched6thread21cputime_estimator_setENSt6chrono10time_pointIN3osv5clock6uptimeENS1_8durationIlSt5ratioILl1ELl1000000000EEEEEES9_>
   0x00000000402d7708 <+340>: ldur x0, [x25, #-168]
   0x00000000402d770c <+344>: add x0, x0, #0x14
   0x00000000402d7710 <+348>: ldar w0, [x0]
   0x00000000402d7714 <+352>: cmp w0, #0x6
   0x00000000402d7718 <+356>: b.ne 0x402d79c8 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1044>  // b.any
   0x00000000402d771c <+360>: ldur s1, [x25, #-180]
   0x00000000402d7720 <+364>: add x0, x22, #0x5f0
   0x00000000402d7724 <+368>: ldr s0, [x26, #4]
   0x00000000402d7728 <+372>: mov x1, x21
   0x00000000402d772c <+376>: sub x28, x25, #0xb8
   0x00000000402d7730 <+380>: bl 0x402d4880 <(anonymous namespace)::tracepointv<10, std::tuple<sched::thread*, float, float>(sched::thread*, float, float), identity_assign<sched::thread*, float, float> >::operator()(sched::thread *, float, float)>
   0x00000000402d7734 <+384>: ldr w0, [sp, #100]
   0x00000000402d7738 <+388>: cbnz w0, 0x402d7890 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+732>
   0x00000000402d773c <+392>: ldr x0, [x19, #5872]
   0x00000000402d7740 <+396>: add x26, x25, #0x30
   0x00000000402d7744 <+400>: cmp x0, x21
   0x00000000402d7748 <+404>: b.eq 0x402d78ac <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+760>  // b.none
   0x00000000402d774c <+408>: cmp x0, x20
   0x00000000402d7750 <+412>: b.eq 0x402d7904 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+848>  // b.none
   0x00000000402d7754 <+416>: mov x0, x26
   0x00000000402d7758 <+420>: mov x1, #0x1                   // #1
   0x00000000402d775c <+424>: bl 0x402dbd70 <_ZN5sched6thread12stat_counter4incrEm>
   0x00000000402d7760 <+428>: ldr x1, [x19, #4184]
   0x00000000402d7764 <+432>: add x0, x22, #0x420
   0x00000000402d7768 <+436>: bl 0x402d4c30 <(anonymous namespace)::tracepointv<16, std::tuple<long unsigned int>(long unsigned int), identity_assign<long unsigned int> >::operator()(unsigned long)>
   0x00000000402d776c <+440>: ldur x0, [x25, #-168]
   0x00000000402d7770 <+444>: mov w1, #0x5                   // #5
   0x00000000402d7774 <+448>: add x0, x0, #0x14
   0x00000000402d7778 <+452>: stlr w1, [x0]
   0x00000000402d777c <+456>: mov x0, x28
   0x00000000402d7780 <+460>: bl 0x402d6954 <_ZN5sched14thread_runtime20hysteresis_run_startEv>
   0x00000000402d7784 <+464>: cmp x20, x21
   0x00000000402d7788 <+468>: b.eq 0x402d79e8 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1076>  // b.none
   0x00000000402d778c <+472>: ldr x0, [x20, #136]
   0x00000000402d7790 <+476>: add x0, x0, #0x14
   0x00000000402d7794 <+480>: ldr w0, [x0]
   0x00000000402d7798 <+484>: cmp w0, #0x6
   0x00000000402d779c <+488>: b.eq 0x402d7878 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+708>  // b.none
   0x00000000402d77a0 <+492>: mov x1, #0x16b8                 // #5816
   0x00000000402d77a4 <+496>: add x20, x19, x1
   0x00000000402d77a8 <+500>: mov x0, x20
   0x00000000402d77ac <+504>: bl 0x402d5e20 <_ZN5sched10timer_base6cancelEv>
   0x00000000402d77b0 <+508>: ldr w0, [sp, #100]
   0x00000000402d77b4 <+512>: cbnz w0, 0x402d7854 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+672>
   0x00000000402d77b8 <+516>: ldr x0, [x19, #4184]
   0x00000000402d77bc <+520>: cbz x0, 0x402d77d8 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+548>
   0x00000000402d77c0 <+524>: ldr x1, [x27, #104]
   0x00000000402d77c4 <+528>: mov x0, x28
   0x00000000402d77c8 <+532>: ldur s0, [x1, #-180]
   0x00000000402d77cc <+536>: bl 0x402d6a74 <_ZNK5sched14thread_runtime10time_untilEf>
   0x00000000402d77d0 <+540>: cmp x0, #0x0
   0x00000000402d77d4 <+544>: b.gt 0x402d7858 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+676>
   0x00000000402d77d8 <+548>: mov x0, #0x16fa                 // #5882
   0x00000000402d77dc <+552>: add x19, x19, x0
   0x00000000402d77e0 <+556>: ldrb w0, [x19]
   0x00000000402d77e4 <+560>: tst w0, #0xff
   0x00000000402d77e8 <+564>: ldrb w0, [x21, #272]
   0x00000000402d77ec <+568>: cset w1, ne  // ne = any
   0x00000000402d77f0 <+572>: cmp w1, w0
   0x00000000402d77f4 <+576>: b.eq 0x402d77fc <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+584>  // b.none
   0x00000000402d77f8 <+580>: strb w0, [x19]
   0x00000000402d77fc <+584>: add x1, x27, #0x6f9
   0x00000000402d7800 <+588>: mov w0, #0x0                   // #0
   0x00000000402d7804 <+592>: bl 0x40478b00 <__aarch64_swp1_acq_rel>
   0x00000000402d7808 <+596>: tst w0, #0xff
   0x00000000402d780c <+600>: b.ne 0x402d78b8 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+772>  // b.any
   0x00000000402d7810 <+604>: add x24, x24, #0x0, lsl #12
   0x00000000402d7814 <+608>: add x24, x24, #0x90
   0x00000000402d7818 <+612>: mov x0, x21
   0x00000000402d781c <+616>: bl 0x402d4c60 <_ZN5sched6thread9switch_toEv>
   0x00000000402d7820 <+620>: ldr x0, [x24, #8]
   0x00000000402d7824 <+624>: ldr x0, [x0, #5904]
   0x00000000402d7828 <+628>: cbz x0, 0x402d7838 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+644>
   0x00000000402d782c <+632>: bl 0x402d7a60 <_ZN5sched6thread7destroyEv>
   0x00000000402d7830 <+636>: ldr x0, [x24, #8]
   0x00000000402d7834 <+640>: str xzr, [x0, #5904]
   0x00000000402d7838 <+644>: ldp x19, x20, [sp, #16]
   0x00000000402d783c <+648>: ldp x21, x22, [sp, #32]
   0x00000000402d7840 <+652>: ldp x23, x24, [sp, #48]
   0x00000000402d7844 <+656>: ldp x25, x26, [sp, #64]
   0x00000000402d7848 <+660>: ldp x27, x28, [sp, #80]
   0x00000000402d784c <+664>: ldp x29, x30, [sp], #128
   0x00000000402d7850 <+668>: ret
   0x00000000402d7854 <+672>: ldr x0, [sp, #104]
   0x00000000402d7858 <+676>: add x1, x0, x23
   0x00000000402d785c <+680>: mov x0, x20
   0x00000000402d7860 <+684>: bl 0x402d5bf0 <_ZN5sched10timer_base3setENSt6chrono10time_pointIN3osv5clock6uptimeENS1_8durationIlSt5ratioILl1ELl1000000000EEEEEE>
   0x00000000402d7864 <+688>: b 0x402d77d8 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+548>
   0x00000000402d7868 <+692>: add x27, x19, #0x1, lsl #12
   0x00000000402d786c <+696>: mov x0, x26
   0x00000000402d7870 <+700>: bl 0x402d69d0 <_ZN5sched14thread_runtime19hysteresis_run_stopEv>
   0x00000000402d7874 <+704>: b 0x402d76d8 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+292>
   0x00000000402d7878 <+708>: ldr x0, [x19, #5872]
   0x00000000402d787c <+712>: cmp x0, x20
   0x00000000402d7880 <+716>: b.eq 0x402d77a0 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+492>  // b.none
   0x00000000402d7884 <+720>: mov x0, x28
   0x00000000402d7888 <+724>: bl 0x402d6a34 <_ZN5sched14thread_runtime26add_context_switch_penaltyEv>
   0x00000000402d788c <+728>: b 0x402d77a0 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+492>
   0x00000000402d7890 <+732>: mov x0, x19
   0x00000000402d7894 <+736>: mov x1, x20
   0x00000000402d7898 <+740>: bl 0x402d4e80 <_ZN5sched3cpu7enqueueERNS_6threadE>
   0x00000000402d789c <+744>: add x26, x25, #0x30
   0x00000000402d78a0 <+748>: ldr x0, [x19, #5872]
   0x00000000402d78a4 <+752>: cmp x0, x21
   0x00000000402d78a8 <+756>: b.ne 0x402d774c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+408>  // b.any
   0x00000000402d78ac <+760>: add x0, x22, #0x660
   0x00000000402d78b0 <+764>: bl 0x402d46e0 <(anonymous namespace)::tracepointv<8, std::tuple<>(), identity_assign<> >::operator()(void)>
   0x00000000402d78b4 <+768>: b 0x402d7754 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+416>
   0x00000000402d78b8 <+772>: bl 0x401f8c50 <_ZN3mmu15flush_tlb_localEv>
   0x00000000402d78bc <+776>: b 0x402d7810 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+604>
   0x00000000402d78c0 <+780>: mov x4, #0x16b8                 // #5816
   0x00000000402d78c4 <+784>: add x0, x19, x4
   0x00000000402d78c8 <+788>: ldp x19, x20, [sp, #16]
   0x00000000402d78cc <+792>: ldp x21, x22, [sp, #32]
   0x00000000402d78d0 <+796>: ldp x23, x24, [sp, #48]
   0x00000000402d78d4 <+800>: ldp x25, x26, [sp, #64]
   0x00000000402d78d8 <+804>: ldp x27, x28, [sp, #80]
   0x00000000402d78dc <+808>: ldp x29, x30, [sp], #128
   0x00000000402d78e0 <+812>: b 0x402d5e20 <_ZN5sched10timer_base6cancelEv>
   0x00000000402d78e4 <+816>: mov x0, x26
   0x00000000402d78e8 <+820>: bl 0x402d69d0 <_ZN5sched14thread_runtime19hysteresis_run_stopEv>
   0x00000000402d78ec <+824>: ldr x0, [x20, #136]
   0x00000000402d78f0 <+828>: mov w1, #0x6                   // #6
   0x00000000402d78f4 <+832>: add x0, x0, #0x14
   0x00000000402d78f8 <+836>: stlr w1, [x0]
   0x00000000402d78fc <+840>: add x27, x19, #0x1, lsl #12
   0x00000000402d7900 <+844>: b 0x402d76c4 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+272>
   0x00000000402d7904 <+848>: add x0, x22, #0x6d0
   0x00000000402d7908 <+852>: bl 0x402d4764 <(anonymous namespace)::tracepointv<9, std::tuple<>(), identity_assign<> >::operator()(void)>
   0x00000000402d790c <+856>: mov x0, x26
   0x00000000402d7910 <+860>: mov x1, #0x1                   // #1
   0x00000000402d7914 <+864>: bl 0x402dbd70 <_ZN5sched6thread12stat_counter4incrEm>
   0x00000000402d7918 <+868>: ldr x1, [x19, #4184]
   0x00000000402d791c <+872>: add x0, x22, #0x420
   0x00000000402d7920 <+876>: bl 0x402d4c30 <(anonymous namespace)::tracepointv<16, std::tuple<long unsigned int>(long unsigned int), identity_assign<long unsigned int> >::operator()(unsigned long)>
   0x00000000402d7924 <+880>: ldur x0, [x25, #-168]
   0x00000000402d7928 <+884>: mov w1, #0x5                   // #5
   0x00000000402d792c <+888>: add x0, x0, #0x14
   0x00000000402d7930 <+892>: stlr w1, [x0]
   0x00000000402d7934 <+896>: mov x0, x28
   0x00000000402d7938 <+900>: bl 0x402d6954 <_ZN5sched14thread_runtime20hysteresis_run_startEv>
   0x00000000402d793c <+904>: b 0x402d778c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+472>
   0x00000000402d7940 <+908>: mov x3, #0x16b8                 // #5816
   0x00000000402d7944 <+912>: add x19, x19, x3
   0x00000000402d7948 <+916>: mov x0, x19
   0x00000000402d794c <+920>: bl 0x402d5e20 <_ZN5sched10timer_base6cancelEv>
   0x00000000402d7950 <+924>: ldr s0, [x21, #4]
   0x00000000402d7954 <+928>: mov x0, x26
   0x00000000402d7958 <+932>: bl 0x402d6a74 <_ZNK5sched14thread_runtime10time_untilEf>
   0x00000000402d795c <+936>: cmp x0, #0x0
   0x00000000402d7960 <+940>: b.le 0x402d7838 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+644>
   0x00000000402d7964 <+944>: add x1, x0, x23
   0x00000000402d7968 <+948>: mov x0, x19
   0x00000000402d796c <+952>: ldp x19, x20, [sp, #16]
   0x00000000402d7970 <+956>: ldp x21, x22, [sp, #32]
   0x00000000402d7974 <+960>: ldp x23, x24, [sp, #48]
   0x00000000402d7978 <+964>: ldp x25, x26, [sp, #64]
   0x00000000402d797c <+968>: ldp x27, x28, [sp, #80]
   0x00000000402d7980 <+972>: ldp x29, x30, [sp], #128
   0x00000000402d7984 <+976>: b 0x402d5bf0 <_ZN5sched10timer_base3setENSt6chrono10time_pointIN3osv5clock6uptimeENS1_8durationIlSt5ratioILl1ELl1000000000EEEEEE>
   0x00000000402d7988 <+980>: adrp x3, 0x4055c000
   0x00000000402d798c <+984>: adrp x1, 0x4055b000
   0x00000000402d7990 <+988>: adrp x0, 0x4055c000
   0x00000000402d7994 <+992>: add x3, x3, #0x130
   0x00000000402d7998 <+996>: add x1, x1, #0xf70
   0x00000000402d799c <+1000>: add x0, x0, #0x170
   0x00000000402d79a0 <+1004>: mov w2, #0xfc                   // #252
   0x00000000402d79a4 <+1008>: bl 0x400d892c <__assert_fail(char const*, char const*, unsigned int, char const*)>
   0x00000000402d79a8 <+1012>: adrp x3, 0x4055c000
   0x00000000402d79ac <+1016>: adrp x1, 0x4055b000
   0x00000000402d79b0 <+1020>: adrp x0, 0x4055c000
   0x00000000402d79b4 <+1024>: add x3, x3, #0x130
   0x00000000402d79b8 <+1028>: add x1, x1, #0xf70
   0x00000000402d79bc <+1032>: add x0, x0, #0x150
   0x00000000402d79c0 <+1036>: mov w2, #0xec                   // #236
   0x00000000402d79c4 <+1040>: bl 0x400d892c <__assert_fail(char const*, char const*, unsigned int, char const*)>
   0x00000000402d79c8 <+1044>: adrp x3, 0x4055c000
   0x00000000402d79cc <+1048>: adrp x1, 0x4055b000
   0x00000000402d79d0 <+1052>: adrp x0, 0x4055c000
   0x00000000402d79d4 <+1056>: add x3, x3, #0x130
   0x00000000402d79d8 <+1060>: add x1, x1, #0xf70
   0x00000000402d79dc <+1064>: add x0, x0, #0x198
   0x00000000402d79e0 <+1068>: mov w2, #0x127                 // #295
   0x00000000402d79e4 <+1072>: bl 0x400d892c <__assert_fail(char const*, char const*, unsigned int, char const*)>
   0x00000000402d79e8 <+1076>: adrp x3, 0x4055c000
   0x00000000402d79ec <+1080>: adrp x1, 0x4055b000
   0x00000000402d79f0 <+1084>: adrp x0, 0x4055c000
   0x00000000402d79f4 <+1088>: add x3, x3, #0x130
   0x00000000402d79f8 <+1092>: add x1, x1, #0xf70
   0x00000000402d79fc <+1096>: add x0, x0, #0x1d0
   0x00000000402d7a00 <+1100>: mov w2, #0x13a                 // #314
   0x00000000402d7a04 <+1104>: bl 0x400d892c <__assert_fail(char const*, char const*, unsigned int, char const*)>
End of assembler dump.

Here is the disassembled non-working version (this time compiled with -O1; the -O0 output is probably even longer):

Dump of assembler code for function _ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE:
   0x00000000402d716c <+0>: stp x29, x30, [sp, #-144]!
   0x00000000402d7170 <+4>: mov x29, sp
   0x00000000402d7174 <+8>: stp x19, x20, [sp, #16]
   0x00000000402d7178 <+12>: stp x21, x22, [sp, #32]
   0x00000000402d717c <+16>: stp x23, x24, [sp, #48]
   0x00000000402d7180 <+20>: stp x25, x26, [sp, #64]
   0x00000000402d7184 <+24>: mov x19, x0
   0x00000000402d7188 <+28>: and w24, w1, #0xff
   0x00000000402d718c <+32>: mov x23, x2
   0x00000000402d7190 <+36>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d7194 <+40>: ldrb w0, [x0, #2698]
   0x00000000402d7198 <+44>: cbnz w0, 0x402d7284 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+280>
   0x00000000402d719c <+48>: mrs x0, tpidr_el0
   0x00000000402d71a0 <+52>: add x0, x0, #0x0, lsl #12
   0x00000000402d71a4 <+56>: add x0, x0, #0x70
   0x00000000402d71a8 <+60>: ldr w0, [x0]
   0x00000000402d71ac <+64>: cmp w0, #0x1
   0x00000000402d71b0 <+68>: b.hi 0x402d7304 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+408>  // b.pmore
   0x00000000402d71b4 <+72>: mrs x0, tpidr_el0
   0x00000000402d71b8 <+76>: add x0, x0, #0x0, lsl #12
   0x00000000402d71bc <+80>: add x0, x0, #0x90
   0x00000000402d71c0 <+84>: strb wzr, [x0, #24]
   0x00000000402d71c4 <+88>: mov x0, x19
   0x00000000402d71c8 <+92>: bl 0x402d60a8 <_ZN5sched3cpu23handle_incoming_wakeupsEv>
   0x00000000402d71cc <+96>: bl 0x401d4284 <_ZN5clock3getEv>
   0x00000000402d71d0 <+100>: ldr x1, [x0]
   0x00000000402d71d4 <+104>: ldr x1, [x1, #16]
   0x00000000402d71d8 <+108>: blr x1
   0x00000000402d71dc <+112>: mov x22, x0
   0x00000000402d71e0 <+116>: ldr x1, [x19, #5912]
   0x00000000402d71e4 <+120>: sub x1, x0, x1
   0x00000000402d71e8 <+124>: str x0, [x19, #5912]
   0x00000000402d71ec <+128>: cmp x1, #0x0
   0x00000000402d71f0 <+132>: mov x0, #0x2710                 // #10000
   0x00000000402d71f4 <+136>: csel x1, x1, x0, gt
   0x00000000402d71f8 <+140>: mrs x0, tpidr_el0
   0x00000000402d71fc <+144>: add x0, x0, #0x0, lsl #12
   0x00000000402d7200 <+148>: add x0, x0, #0x90
   0x00000000402d7204 <+152>: ldr x21, [x0]
   0x00000000402d7208 <+156>: ldr x0, [x21, #136]
   0x00000000402d720c <+160>: add x0, x0, #0x14
   0x00000000402d7210 <+164>: ldar w20, [x0]
   0x00000000402d7214 <+168>: cmp w20, #0x6
   0x00000000402d7218 <+172>: b.eq 0x402d732c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+448>  // b.none
   0x00000000402d721c <+176>: ldr x0, [x21, #376]
   0x00000000402d7220 <+180>: add x0, x0, x1
   0x00000000402d7224 <+184>: str x0, [x21, #376]
   0x00000000402d7228 <+188>: add x25, x21, #0x78
   0x00000000402d722c <+192>: mov x0, x25
   0x00000000402d7230 <+196>: bl 0x402d6308 <_ZN5sched14thread_runtime7ran_forENSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE>
   0x00000000402d7234 <+200>: cmp w20, #0x5
   0x00000000402d7238 <+204>: b.ne 0x402d7418 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+684>  // b.any
   0x00000000402d723c <+208>: ldr x0, [x19, #4184]
   0x00000000402d7240 <+212>: cbz x0, 0x402d7354 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+488>
   0x00000000402d7244 <+216>: cbnz w24, 0x402d78f4 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1928>
   0x00000000402d7248 <+220>: ldr x20, [x19, #4200]
   0x00000000402d724c <+224>: ldur s1, [x20, #-180]
   0x00000000402d7250 <+228>: ldr s0, [x21, #124]
   0x00000000402d7254 <+232>: fcmpe s1, s0
   0x00000000402d7258 <+236>: b.gt 0x402d7364 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+504>
   0x00000000402d725c <+240>: mov x0, x25
   0x00000000402d7260 <+244>: bl 0x402d65b8 <_ZN5sched14thread_runtime19hysteresis_run_stopEv>
   0x00000000402d7264 <+248>: ldr x0, [x21, #136]
   0x00000000402d7268 <+252>: add x0, x0, #0x14
   0x00000000402d726c <+256>: mov w1, #0x6                   // #6
   0x00000000402d7270 <+260>: stlr w1, [x0]
   0x00000000402d7274 <+264>: mov x1, x21
   0x00000000402d7278 <+268>: mov x0, x19
   0x00000000402d727c <+272>: bl 0x402d4af0 <_ZN5sched3cpu7enqueueERNS_6threadE>
   0x00000000402d7280 <+276>: b 0x402d790c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1952>
   0x00000000402d7284 <+280>: mrs x21, daif
   0x00000000402d7288 <+284>: msr daifset, #0x2
   0x00000000402d728c <+288>: isb
   0x00000000402d7290 <+292>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d7294 <+296>: ldrb w0, [x0, #2697]
   0x00000000402d7298 <+300>: cbz w0, 0x402d72ec <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+384>
   0x00000000402d729c <+304>: mov x1, #0x0                   // #0
   0x00000000402d72a0 <+308>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d72a4 <+312>: add x0, x0, #0x540
   0x00000000402d72a8 <+316>: add x0, x0, #0x510
   0x00000000402d72ac <+320>: bl 0x402dd5e8 <_ZN15tracepoint_base21allocate_trace_recordEm>
   0x00000000402d72b0 <+324>: mov x20, x0
   0x00000000402d72b4 <+328>: add x0, x0, #0x30
   0x00000000402d72b8 <+332>: str x0, [sp, #112]
   0x00000000402d72bc <+336>: ldr x0, [x20, #40]
   0x00000000402d72c0 <+340>: tbz x0, #32, 0x402d72dc <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+368>
   0x00000000402d72c4 <+344>: add x2, sp, #0x70
   0x00000000402d72c8 <+348>: mov x1, x20
   0x00000000402d72cc <+352>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d72d0 <+356>: add x0, x0, #0x540
   0x00000000402d72d4 <+360>: add x0, x0, #0x510
   0x00000000402d72d8 <+364>: bl 0x402dd56c <_ZN15tracepoint_base16do_log_backtraceEP12trace_recordRPh>
   0x00000000402d72dc <+368>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d72e0 <+372>: add x0, x0, #0x540
   0x00000000402d72e4 <+376>: add x0, x0, #0x510
   0x00000000402d72e8 <+380>: str x0, [x20]
   0x00000000402d72ec <+384>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d72f0 <+388>: add x0, x0, #0x540
   0x00000000402d72f4 <+392>: add x0, x0, #0x510
   0x00000000402d72f8 <+396>: bl 0x402dd3c0 <_ZN15tracepoint_base10run_probesEv>
   0x00000000402d72fc <+400>: msr daif, x0
   0x00000000402d7300 <+404>: b 0x402d719c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+48>
   0x00000000402d7304 <+408>: str x27, [sp, #80]
   0x00000000402d7308 <+412>: stp d8, d9, [sp, #96]
   0x00000000402d730c <+416>: adrp x3, 0x4055b000
   0x00000000402d7310 <+420>: add x3, x3, #0xc20
   0x00000000402d7314 <+424>: mov w2, #0xed                   // #237
   0x00000000402d7318 <+428>: adrp x1, 0x4055b000
   0x00000000402d731c <+432>: add x1, x1, #0xa60
   0x00000000402d7320 <+436>: adrp x0, 0x4055b000
   0x00000000402d7324 <+440>: add x0, x0, #0xc40
   0x00000000402d7328 <+444>: bl 0x400d892c <__assert_fail(char const*, char const*, unsigned int, char const*)>
   0x00000000402d732c <+448>: str x27, [sp, #80]
   0x00000000402d7330 <+452>: stp d8, d9, [sp, #96]
   0x00000000402d7334 <+456>: adrp x3, 0x4055b000
   0x00000000402d7338 <+460>: add x3, x3, #0xc20
   0x00000000402d733c <+464>: mov w2, #0xfd                   // #253
   0x00000000402d7340 <+468>: adrp x1, 0x4055b000
   0x00000000402d7344 <+472>: add x1, x1, #0xa60
   0x00000000402d7348 <+476>: adrp x0, 0x4055b000
   0x00000000402d734c <+480>: add x0, x0, #0xc60
   0x00000000402d7350 <+484>: bl 0x400d892c <__assert_fail(char const*, char const*, unsigned int, char const*)>
   0x00000000402d7354 <+488>: add x0, x19, #0x1, lsl #12
   0x00000000402d7358 <+492>: add x0, x0, #0x6b8
   0x00000000402d735c <+496>: bl 0x402d5a44 <_ZN5sched10timer_base6cancelEv>
   0x00000000402d7360 <+500>: b 0x402d78cc <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1888>
   0x00000000402d7364 <+504>: add x19, x19, #0x1, lsl #12
   0x00000000402d7368 <+508>: add x19, x19, #0x6b8
   0x00000000402d736c <+512>: mov x0, x19
   0x00000000402d7370 <+516>: bl 0x402d5a44 <_ZN5sched10timer_base6cancelEv>
   0x00000000402d7374 <+520>: ldur s0, [x20, #-180]
   0x00000000402d7378 <+524>: mov x0, x25
   0x00000000402d737c <+528>: bl 0x402d665c <_ZNK5sched14thread_runtime10time_untilEf>
   0x00000000402d7380 <+532>: cmp x0, #0x0
   0x00000000402d7384 <+536>: b.le 0x402d78cc <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1888>
   0x00000000402d7388 <+540>: add x1, x0, x22
   0x00000000402d738c <+544>: mov x0, x19
   0x00000000402d7390 <+548>: bl 0x402d5814 <_ZN5sched10timer_base3setENSt6chrono10time_pointIN3osv5clock6uptimeENS1_8durationIlSt5ratioILl1ELl1000000000EEEEEE>
   0x00000000402d7394 <+552>: b 0x402d78cc <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1888>
   0x00000000402d7398 <+556>: mrs x25, daif
   0x00000000402d739c <+560>: msr daifset, #0x2
   0x00000000402d73a0 <+564>: isb
   0x00000000402d73a4 <+568>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d73a8 <+572>: ldrb w0, [x0, #2809]
   0x00000000402d73ac <+576>: cbz w0, 0x402d7400 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+660>
   0x00000000402d73b0 <+580>: mov x1, #0x0                   // #0
   0x00000000402d73b4 <+584>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d73b8 <+588>: add x0, x0, #0x540
   0x00000000402d73bc <+592>: add x0, x0, #0x580
   0x00000000402d73c0 <+596>: bl 0x402dd5e8 <_ZN15tracepoint_base21allocate_trace_recordEm>
   0x00000000402d73c4 <+600>: mov x20, x0
   0x00000000402d73c8 <+604>: add x0, x0, #0x30
   0x00000000402d73cc <+608>: str x0, [sp, #112]
   0x00000000402d73d0 <+612>: ldr x0, [x20, #40]
   0x00000000402d73d4 <+616>: tbz x0, #32, 0x402d73f0 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+644>
   0x00000000402d73d8 <+620>: add x2, sp, #0x70
   0x00000000402d73dc <+624>: mov x1, x20
   0x00000000402d73e0 <+628>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d73e4 <+632>: add x0, x0, #0x540
   0x00000000402d73e8 <+636>: add x0, x0, #0x580
   0x00000000402d73ec <+640>: bl 0x402dd56c <_ZN15tracepoint_base16do_log_backtraceEP12trace_recordRPh>
   0x00000000402d73f0 <+644>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d73f4 <+648>: add x0, x0, #0x540
   0x00000000402d73f8 <+652>: add x0, x0, #0x580
   0x00000000402d73fc <+656>: str x0, [x20]
   0x00000000402d7400 <+660>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d7404 <+664>: add x0, x0, #0x540
   0x00000000402d7408 <+668>: add x0, x0, #0x580
   0x00000000402d740c <+672>: bl 0x402dd3c0 <_ZN15tracepoint_base10run_probesEv>
   0x00000000402d7410 <+676>: msr daif, x0
   0x00000000402d7414 <+680>: b 0x402d7918 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1964>
   0x00000000402d7418 <+684>: mov x0, x25
   0x00000000402d741c <+688>: bl 0x402d65b8 <_ZN5sched14thread_runtime19hysteresis_run_stopEv>
   0x00000000402d7420 <+692>: ldr x20, [x19, #4200]
   0x00000000402d7424 <+696>: sub x25, x20, #0x130
   0x00000000402d7428 <+700>: ldr x0, [x20, #16]
   0x00000000402d742c <+704>: cbz x0, 0x402d757c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1040>
   0x00000000402d7430 <+708>: ldr x1, [x0, #8]
   0x00000000402d7434 <+712>: cbz x1, 0x402d7440 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+724>
   0x00000000402d7438 <+716>: ldr x1, [x1, #8]
   0x00000000402d743c <+720>: cbnz x1, 0x402d7438 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+716>
   0x00000000402d7440 <+724>: ldr x0, [x20]
   0x00000000402d7444 <+728>: cbz x0, 0x402d75a4 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1080>
   0x00000000402d7448 <+732>: add x26, x19, #0x1, lsl #12
   0x00000000402d744c <+736>: add x26, x26, #0x60
   0x00000000402d7450 <+740>: add x2, sp, #0x70
   0x00000000402d7454 <+744>: mov x1, x20
   0x00000000402d7458 <+748>: mov x0, x26
   0x00000000402d745c <+752>: bl 0x401cb5fc <_ZN5boost9intrusive17bstree_algorithmsINS0_18rbtree_node_traitsIPvLb0EEEE5eraseEPNS0_11rbtree_nodeIS3_EES8_RNS0_20data_for_rebalance_tIS8_EE>
   0x00000000402d7460 <+756>: ldr x0, [sp, #128]
   0x00000000402d7464 <+760>: cmp x20, x0
   0x00000000402d7468 <+764>: b.eq 0x402d75cc <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1120>  // b.none
   0x00000000402d746c <+768>: ldr w1, [x0, #24]
   0x00000000402d7470 <+772>: ldr w2, [x20, #24]
   0x00000000402d7474 <+776>: str w2, [x0, #24]
   0x00000000402d7478 <+780>: cbnz w1, 0x402d75d4 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1128>
   0x00000000402d747c <+784>: ldr x0, [x19, #4184]
   0x00000000402d7480 <+788>: sub x0, x0, #0x1
   0x00000000402d7484 <+792>: str x0, [x19, #4184]
   0x00000000402d7488 <+796>: str xzr, [x20]
   0x00000000402d748c <+800>: str xzr, [x20, #8]
   0x00000000402d7490 <+804>: str xzr, [x20, #16]
   0x00000000402d7494 <+808>: ldr x0, [x20, #72]
   0x00000000402d7498 <+812>: asr x0, x0, #10
   0x00000000402d749c <+816>: ubfx x1, x22, #10, #32
   0x00000000402d74a0 <+820>: orr x0, x1, x0, lsl #32
   0x00000000402d74a4 <+824>: add x1, x20, #0x50
   0x00000000402d74a8 <+828>: str x0, [x1]
   0x00000000402d74ac <+832>: ldur x0, [x20, #-168]
   0x00000000402d74b0 <+836>: add x0, x0, #0x14
   0x00000000402d74b4 <+840>: ldar w0, [x0]
   0x00000000402d74b8 <+844>: cmp w0, #0x6
   0x00000000402d74bc <+848>: b.ne 0x402d75e8 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1148>  // b.any
   0x00000000402d74c0 <+852>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d74c4 <+856>: ldrb w0, [x0, #2922]
   0x00000000402d74c8 <+860>: cbnz w0, 0x402d7610 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1188>
   0x00000000402d74cc <+864>: cbnz w24, 0x402d76b8 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1356>
   0x00000000402d74d0 <+868>: ldr x0, [x19, #5872]
   0x00000000402d74d4 <+872>: cmp x0, x25
   0x00000000402d74d8 <+876>: b.eq 0x402d76c8 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1372>  // b.none
   0x00000000402d74dc <+880>: cmp x0, x21
   0x00000000402d74e0 <+884>: b.eq 0x402d775c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1520>  // b.none
   0x00000000402d74e4 <+888>: add x1, x20, #0x30
   0x00000000402d74e8 <+892>: ldr x0, [x1]
   0x00000000402d74ec <+896>: add x0, x0, #0x1
   0x00000000402d74f0 <+900>: str x0, [x1]
   0x00000000402d74f4 <+904>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d74f8 <+908>: ldrb w0, [x0, #2458]
   0x00000000402d74fc <+912>: cbnz w0, 0x402d77f0 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1668>
   0x00000000402d7500 <+916>: ldur x0, [x20, #-168]
   0x00000000402d7504 <+920>: add x0, x0, #0x14
   0x00000000402d7508 <+924>: mov w1, #0x5                   // #5
   0x00000000402d750c <+928>: stlr w1, [x0]
   0x00000000402d7510 <+932>: sub x26, x20, #0xb8
   0x00000000402d7514 <+936>: mov x0, x26
   0x00000000402d7518 <+940>: bl 0x402d6544 <_ZN5sched14thread_runtime20hysteresis_run_startEv>
   0x00000000402d751c <+944>: cmp x21, x25
   0x00000000402d7520 <+948>: b.eq 0x402d7810 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1700>  // b.none
   0x00000000402d7524 <+952>: ldr x0, [x21, #136]
   0x00000000402d7528 <+956>: add x0, x0, #0x14
   0x00000000402d752c <+960>: ldr w0, [x0]
   0x00000000402d7530 <+964>: cmp w0, #0x6
   0x00000000402d7534 <+968>: b.eq 0x402d7838 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1740>  // b.none
   0x00000000402d7538 <+972>: add x21, x19, #0x1, lsl #12
   0x00000000402d753c <+976>: add x21, x21, #0x6b8
   0x00000000402d7540 <+980>: mov x0, x21
   0x00000000402d7544 <+984>: bl 0x402d5a44 <_ZN5sched10timer_base6cancelEv>
   0x00000000402d7548 <+988>: cbnz w24, 0x402d7850 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1764>
   0x00000000402d754c <+992>: ldr x0, [x19, #4184]
   0x00000000402d7550 <+996>: cbz x0, 0x402d785c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1776>
   0x00000000402d7554 <+1000>: ldr x0, [x19, #4200]
   0x00000000402d7558 <+1004>: ldur s0, [x0, #-180]
   0x00000000402d755c <+1008>: mov x0, x26
   0x00000000402d7560 <+1012>: bl 0x402d665c <_ZNK5sched14thread_runtime10time_untilEf>
   0x00000000402d7564 <+1016>: cmp x0, #0x0
   0x00000000402d7568 <+1020>: b.le 0x402d785c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1776>
   0x00000000402d756c <+1024>: add x1, x0, x22
   0x00000000402d7570 <+1028>: mov x0, x21
   0x00000000402d7574 <+1032>: bl 0x402d5814 <_ZN5sched10timer_base3setENSt6chrono10time_pointIN3osv5clock6uptimeENS1_8durationIlSt5ratioILl1ELl1000000000EEEEEE>
   0x00000000402d7578 <+1036>: b 0x402d785c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1776>
   0x00000000402d757c <+1040>: ldr x1, [x20]
   0x00000000402d7580 <+1044>: ldr x0, [x1, #16]
   0x00000000402d7584 <+1048>: cmp x20, x0
   0x00000000402d7588 <+1052>: b.ne 0x402d7448 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+732>  // b.any
   0x00000000402d758c <+1056>: mov x0, x1
   0x00000000402d7590 <+1060>: ldr x1, [x1]
   0x00000000402d7594 <+1064>: ldr x2, [x1, #16]
   0x00000000402d7598 <+1068>: cmp x2, x0
   0x00000000402d759c <+1072>: b.eq 0x402d758c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1056>  // b.none
   0x00000000402d75a0 <+1076>: b 0x402d7448 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+732>
   0x00000000402d75a4 <+1080>: str x27, [sp, #80]
   0x00000000402d75a8 <+1084>: stp d8, d9, [sp, #96]
   0x00000000402d75ac <+1088>: adrp x3, 0x4054e000
   0x00000000402d75b0 <+1092>: add x3, x3, #0x5d8
   0x00000000402d75b4 <+1096>: mov w2, #0x591                 // #1425
   0x00000000402d75b8 <+1100>: adrp x1, 0x4054e000
   0x00000000402d75bc <+1104>: add x1, x1, #0x3e0
   0x00000000402d75c0 <+1108>: adrp x0, 0x4054e000
   0x00000000402d75c4 <+1112>: add x0, x0, #0x6e8
   0x00000000402d75c8 <+1116>: bl 0x400d892c <__assert_fail(char const*, char const*, unsigned int, char const*)>
   0x00000000402d75cc <+1120>: ldr w1, [x20, #24]
   0x00000000402d75d0 <+1124>: b 0x402d7478 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+780>
   0x00000000402d75d4 <+1128>: ldr x2, [sp, #120]
   0x00000000402d75d8 <+1132>: ldr x1, [sp, #112]
   0x00000000402d75dc <+1136>: mov x0, x26
   0x00000000402d75e0 <+1140>: bl 0x401cb8d8 <_ZN5boost9intrusive17rbtree_algorithmsINS0_18rbtree_node_traitsIPvLb0EEEE42rebalance_after_erasure_restore_invariantsEPNS0_11rbtree_nodeIS3_EES8_S8_>
   0x00000000402d75e4 <+1144>: b 0x402d747c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+784>
   0x00000000402d75e8 <+1148>: str x27, [sp, #80]
   0x00000000402d75ec <+1152>: stp d8, d9, [sp, #96]
   0x00000000402d75f0 <+1156>: adrp x3, 0x4055b000
   0x00000000402d75f4 <+1160>: add x3, x3, #0xc20
   0x00000000402d75f8 <+1164>: mov w2, #0x128                 // #296
   0x00000000402d75fc <+1168>: adrp x1, 0x4055b000
   0x00000000402d7600 <+1172>: add x1, x1, #0xa60
   0x00000000402d7604 <+1176>: adrp x0, 0x4055b000
   0x00000000402d7608 <+1180>: add x0, x0, #0xc88
   0x00000000402d760c <+1184>: bl 0x400d892c <__assert_fail(char const*, char const*, unsigned int, char const*)>
   0x00000000402d7610 <+1188>: str x27, [sp, #80]
   0x00000000402d7614 <+1192>: stp d8, d9, [sp, #96]
   0x00000000402d7618 <+1196>: ldr s9, [x21, #124]
   0x00000000402d761c <+1200>: ldur s8, [x20, #-180]
   0x00000000402d7620 <+1204>: mrs x27, daif
   0x00000000402d7624 <+1208>: msr daifset, #0x2
   0x00000000402d7628 <+1212>: isb
   0x00000000402d762c <+1216>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d7630 <+1220>: ldrb w0, [x0, #2921]
   0x00000000402d7634 <+1224>: cbz w0, 0x402d7698 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1324>
   0x00000000402d7638 <+1228>: mov x1, #0x10                   // #16
   0x00000000402d763c <+1232>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d7640 <+1236>: add x0, x0, #0x540
   0x00000000402d7644 <+1240>: add x0, x0, #0x5f0
   0x00000000402d7648 <+1244>: bl 0x402dd5e8 <_ZN15tracepoint_base21allocate_trace_recordEm>
   0x00000000402d764c <+1248>: mov x26, x0
   0x00000000402d7650 <+1252>: add x0, x0, #0x30
   0x00000000402d7654 <+1256>: str x0, [sp, #112]
   0x00000000402d7658 <+1260>: ldr x0, [x26, #40]
   0x00000000402d765c <+1264>: tbz x0, #32, 0x402d7678 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1292>
   0x00000000402d7660 <+1268>: add x2, sp, #0x70
   0x00000000402d7664 <+1272>: mov x1, x26
   0x00000000402d7668 <+1276>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d766c <+1280>: add x0, x0, #0x540
   0x00000000402d7670 <+1284>: add x0, x0, #0x5f0
   0x00000000402d7674 <+1288>: bl 0x402dd56c <_ZN15tracepoint_base16do_log_backtraceEP12trace_recordRPh>
   0x00000000402d7678 <+1292>: ldr x0, [sp, #112]
   0x00000000402d767c <+1296>: str x25, [x0]
   0x00000000402d7680 <+1300>: str s9, [x0, #8]
   0x00000000402d7684 <+1304>: str s8, [x0, #12]
   0x00000000402d7688 <+1308>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d768c <+1312>: add x0, x0, #0x540
   0x00000000402d7690 <+1316>: add x0, x0, #0x5f0
   0x00000000402d7694 <+1320>: str x0, [x26]
   0x00000000402d7698 <+1324>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d769c <+1328>: add x0, x0, #0x540
   0x00000000402d76a0 <+1332>: add x0, x0, #0x5f0
   0x00000000402d76a4 <+1336>: bl 0x402dd3c0 <_ZN15tracepoint_base10run_probesEv>
   0x00000000402d76a8 <+1340>: msr daif, x0
   0x00000000402d76ac <+1344>: ldr x27, [sp, #80]
   0x00000000402d76b0 <+1348>: ldp d8, d9, [sp, #96]
   0x00000000402d76b4 <+1352>: b 0x402d74cc <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+864>
   0x00000000402d76b8 <+1356>: mov x1, x21
   0x00000000402d76bc <+1360>: mov x0, x19
   0x00000000402d76c0 <+1364>: bl 0x402d4af0 <_ZN5sched3cpu7enqueueERNS_6threadE>
   0x00000000402d76c4 <+1368>: b 0x402d74d0 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+868>
   0x00000000402d76c8 <+1372>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d76cc <+1376>: ldrb w0, [x0, #3034]
   0x00000000402d76d0 <+1380>: cbz w0, 0x402d74e4 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+888>
   0x00000000402d76d4 <+1384>: str x27, [sp, #80]
   0x00000000402d76d8 <+1388>: mrs x27, daif
   0x00000000402d76dc <+1392>: msr daifset, #0x2
   0x00000000402d76e0 <+1396>: isb
   0x00000000402d76e4 <+1400>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d76e8 <+1404>: ldrb w0, [x0, #3033]
   0x00000000402d76ec <+1408>: cbz w0, 0x402d7740 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1492>
   0x00000000402d76f0 <+1412>: mov x1, #0x0                   // #0
   0x00000000402d76f4 <+1416>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d76f8 <+1420>: add x0, x0, #0x540
   0x00000000402d76fc <+1424>: add x0, x0, #0x660
   0x00000000402d7700 <+1428>: bl 0x402dd5e8 <_ZN15tracepoint_base21allocate_trace_recordEm>
   0x00000000402d7704 <+1432>: mov x26, x0
   0x00000000402d7708 <+1436>: add x0, x0, #0x30
   0x00000000402d770c <+1440>: str x0, [sp, #112]
   0x00000000402d7710 <+1444>: ldr x0, [x26, #40]
   0x00000000402d7714 <+1448>: tbz x0, #32, 0x402d7730 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1476>
   0x00000000402d7718 <+1452>: add x2, sp, #0x70
   0x00000000402d771c <+1456>: mov x1, x26
   0x00000000402d7720 <+1460>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d7724 <+1464>: add x0, x0, #0x540
   0x00000000402d7728 <+1468>: add x0, x0, #0x660
   0x00000000402d772c <+1472>: bl 0x402dd56c <_ZN15tracepoint_base16do_log_backtraceEP12trace_recordRPh>
   0x00000000402d7730 <+1476>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d7734 <+1480>: add x0, x0, #0x540
   0x00000000402d7738 <+1484>: add x0, x0, #0x660
   0x00000000402d773c <+1488>: str x0, [x26]
   0x00000000402d7740 <+1492>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d7744 <+1496>: add x0, x0, #0x540
   0x00000000402d7748 <+1500>: add x0, x0, #0x660
   0x00000000402d774c <+1504>: bl 0x402dd3c0 <_ZN15tracepoint_base10run_probesEv>
   0x00000000402d7750 <+1508>: msr daif, x0
   0x00000000402d7754 <+1512>: ldr x27, [sp, #80]
   0x00000000402d7758 <+1516>: b 0x402d74e4 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+888>
   0x00000000402d775c <+1520>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d7760 <+1524>: ldrb w0, [x0, #3146]
   0x00000000402d7764 <+1528>: cbz w0, 0x402d74e4 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+888>
   0x00000000402d7768 <+1532>: str x27, [sp, #80]
   0x00000000402d776c <+1536>: mrs x27, daif
   0x00000000402d7770 <+1540>: msr daifset, #0x2
   0x00000000402d7774 <+1544>: isb
   0x00000000402d7778 <+1548>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d777c <+1552>: ldrb w0, [x0, #3145]
   0x00000000402d7780 <+1556>: cbz w0, 0x402d77d4 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1640>
   0x00000000402d7784 <+1560>: mov x1, #0x0                   // #0
   0x00000000402d7788 <+1564>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d778c <+1568>: add x0, x0, #0x540
   0x00000000402d7790 <+1572>: add x0, x0, #0x6d0
   0x00000000402d7794 <+1576>: bl 0x402dd5e8 <_ZN15tracepoint_base21allocate_trace_recordEm>
   0x00000000402d7798 <+1580>: mov x26, x0
   0x00000000402d779c <+1584>: add x0, x0, #0x30
   0x00000000402d77a0 <+1588>: str x0, [sp, #112]
   0x00000000402d77a4 <+1592>: ldr x0, [x26, #40]
   0x00000000402d77a8 <+1596>: tbz x0, #32, 0x402d77c4 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1624>
   0x00000000402d77ac <+1600>: add x2, sp, #0x70
   0x00000000402d77b0 <+1604>: mov x1, x26
   0x00000000402d77b4 <+1608>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d77b8 <+1612>: add x0, x0, #0x540
   0x00000000402d77bc <+1616>: add x0, x0, #0x6d0
   0x00000000402d77c0 <+1620>: bl 0x402dd56c <_ZN15tracepoint_base16do_log_backtraceEP12trace_recordRPh>
   0x00000000402d77c4 <+1624>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d77c8 <+1628>: add x0, x0, #0x540
   0x00000000402d77cc <+1632>: add x0, x0, #0x6d0
   0x00000000402d77d0 <+1636>: str x0, [x26]
   0x00000000402d77d4 <+1640>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d77d8 <+1644>: add x0, x0, #0x540
   0x00000000402d77dc <+1648>: add x0, x0, #0x6d0
   0x00000000402d77e0 <+1652>: bl 0x402dd3c0 <_ZN15tracepoint_base10run_probesEv>
   0x00000000402d77e4 <+1656>: msr daif, x0
   0x00000000402d77e8 <+1660>: ldr x27, [sp, #80]
   0x00000000402d77ec <+1664>: b 0x402d74e4 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+888>
   0x00000000402d77f0 <+1668>: ldr x0, [x19, #4184]
   0x00000000402d77f4 <+1672>: str x0, [sp, #136]
   0x00000000402d77f8 <+1676>: add x1, sp, #0x88
   0x00000000402d77fc <+1680>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d7800 <+1684>: add x0, x0, #0x540
   0x00000000402d7804 <+1688>: add x0, x0, #0x420
   0x00000000402d7808 <+1692>: bl 0x402d4860 <(anonymous namespace)::tracepointv<16, std::tuple<long unsigned int>(long unsigned int), identity_assign<long unsigned int> >::trace_slow_path(std::tuple<unsigned long>)>
   0x00000000402d780c <+1696>: b 0x402d7500 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+916>
   0x00000000402d7810 <+1700>: str x27, [sp, #80]
   0x00000000402d7814 <+1704>: stp d8, d9, [sp, #96]
   0x00000000402d7818 <+1708>: adrp x3, 0x4055b000
   0x00000000402d781c <+1712>: add x3, x3, #0xc20
   0x00000000402d7820 <+1716>: mov w2, #0x13b                 // #315
   0x00000000402d7824 <+1720>: adrp x1, 0x4055b000
   0x00000000402d7828 <+1724>: add x1, x1, #0xa60
   0x00000000402d782c <+1728>: adrp x0, 0x4055b000
   0x00000000402d7830 <+1732>: add x0, x0, #0xcc0
   0x00000000402d7834 <+1736>: bl 0x400d892c <__assert_fail(char const*, char const*, unsigned int, char const*)>
   0x00000000402d7838 <+1740>: ldr x0, [x19, #5872]
   0x00000000402d783c <+1744>: cmp x0, x21
   0x00000000402d7840 <+1748>: b.eq 0x402d7538 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+972>  // b.none
   0x00000000402d7844 <+1752>: mov x0, x26
   0x00000000402d7848 <+1756>: bl 0x402d661c <_ZN5sched14thread_runtime26add_context_switch_penaltyEv>
   0x00000000402d784c <+1760>: b 0x402d7538 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+972>
   0x00000000402d7850 <+1764>: add x1, x23, x22
   0x00000000402d7854 <+1768>: mov x0, x21
   0x00000000402d7858 <+1772>: bl 0x402d5814 <_ZN5sched10timer_base3setENSt6chrono10time_pointIN3osv5clock6uptimeENS1_8durationIlSt5ratioILl1ELl1000000000EEEEEE>
   0x00000000402d785c <+1776>: add x1, x19, #0x1, lsl #12
   0x00000000402d7860 <+1780>: add x1, x1, #0x6fa
   0x00000000402d7864 <+1784>: ldrb w0, [x1]
   0x00000000402d7868 <+1788>: ands w0, w0, #0xff
   0x00000000402d786c <+1792>: ldurb w2, [x20, #-32]
   0x00000000402d7870 <+1796>: cset w0, ne  // ne = any
   0x00000000402d7874 <+1800>: cmp w0, w2
   0x00000000402d7878 <+1804>: b.ne 0x402d78e4 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1912>  // b.any
   0x00000000402d787c <+1808>: add x1, x19, #0x1, lsl #12
   0x00000000402d7880 <+1812>: add x1, x1, #0x6f9
   0x00000000402d7884 <+1816>: mov w0, #0x0                   // #0
   0x00000000402d7888 <+1820>: bl 0x40478760 <__aarch64_swp1_acq_rel>
   0x00000000402d788c <+1824>: and w0, w0, #0xff
   0x00000000402d7890 <+1828>: cbnz w0, 0x402d78ec <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1920>
   0x00000000402d7894 <+1832>: mov x0, x25
   0x00000000402d7898 <+1836>: bl 0x402d48f0 <_ZN5sched6thread9switch_toEv>
   0x00000000402d789c <+1840>: mrs x0, tpidr_el0
   0x00000000402d78a0 <+1844>: add x0, x0, #0x0, lsl #12
   0x00000000402d78a4 <+1848>: add x0, x0, #0x90
   0x00000000402d78a8 <+1852>: ldr x0, [x0, #8]
   0x00000000402d78ac <+1856>: ldr x0, [x0, #5904]
   0x00000000402d78b0 <+1860>: cbz x0, 0x402d78cc <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1888>
   0x00000000402d78b4 <+1864>: bl 0x402d7978 <_ZN5sched6thread7destroyEv>
   0x00000000402d78b8 <+1868>: mrs x0, tpidr_el0
   0x00000000402d78bc <+1872>: add x0, x0, #0x0, lsl #12
   0x00000000402d78c0 <+1876>: add x0, x0, #0x90
   0x00000000402d78c4 <+1880>: ldr x0, [x0, #8]
   0x00000000402d78c8 <+1884>: str xzr, [x0, #5904]
   0x00000000402d78cc <+1888>: ldp x19, x20, [sp, #16]
   0x00000000402d78d0 <+1892>: ldp x21, x22, [sp, #32]
   0x00000000402d78d4 <+1896>: ldp x23, x24, [sp, #48]
   0x00000000402d78d8 <+1900>: ldp x25, x26, [sp, #64]
   0x00000000402d78dc <+1904>: ldp x29, x30, [sp], #144
   0x00000000402d78e0 <+1908>: ret
   0x00000000402d78e4 <+1912>: strb w2, [x1]
   0x00000000402d78e8 <+1916>: b 0x402d787c <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1808>
   0x00000000402d78ec <+1920>: bl 0x401f8c50 <_ZN3mmu15flush_tlb_localEv>
   0x00000000402d78f0 <+1924>: b 0x402d7894 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+1832>
   0x00000000402d78f4 <+1928>: mov x0, x25
   0x00000000402d78f8 <+1932>: bl 0x402d65b8 <_ZN5sched14thread_runtime19hysteresis_run_stopEv>
   0x00000000402d78fc <+1936>: ldr x0, [x21, #136]
   0x00000000402d7900 <+1940>: add x0, x0, #0x14
   0x00000000402d7904 <+1944>: mov w1, #0x6                   // #6
   0x00000000402d7908 <+1948>: stlr w1, [x0]
   0x00000000402d790c <+1952>: adrp x0, 0x40785000 <trace_memory_huge_failure+64>
   0x00000000402d7910 <+1956>: ldrb w0, [x0, #2810]
   0x00000000402d7914 <+1960>: cbnz w0, 0x402d7398 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+556>
   0x00000000402d7918 <+1964>: add x1, x21, #0x168
   0x00000000402d791c <+1968>: ldr x0, [x1]
   0x00000000402d7920 <+1972>: add x0, x0, #0x1
   0x00000000402d7924 <+1976>: str x0, [x1]
   0x00000000402d7928 <+1980>: b 0x402d7420 <_ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE+692>
End of assembler dump.

Any ideas what might be wrong?

Nadav Har'El

Feb 17, 2021, 2:25:59 AM2/17/21
to Waldek Kozaczuk, Avi Kivity, OSv Development
On Tue, Feb 16, 2021 at 11:11 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
OK, I think I have narrowed it down to cpu::reschedule_from_interrupt()

Nice.

So beyond the "obvious" stuff we've been talking about (interrupts/exceptions/whatever call reschedule_from_interrupt() need to save the registers), there is another possibility now: maybe the code that reschedule_from_interrupt() is using to restore the registers, namely switch_to(), somehow destroys something when optimization is on.

I don't think there's any way that switch_to() can actually destroy the registers you found being destroyed, because OSv already expects it (as a C++ function which uses registers) to destroy these registers - so whatever code was running in the target thread (e.g., the interrupt handler) is already saving and restoring these registers. So I see (but again, just guesses...) two options:

1. switch_to() or reschedule_from_interrupt() ruins the stack where the caller of reschedule_from_interrupt() saved its registers. However, since it doesn't happen all the time, and you already checked the "obvious" problems (red zone, too-small stack), it has to be something relatively rare, and I don't know what to guess...
2. Maybe the problem is not in an asynchronous call to reschedule_from_interrupt() from an interrupt, where all registers are saved, but in a cooperative reschedule (the user called a function), where nothing except switch_to() saves registers? Maybe switch_to() fails to save/restore some of them? Maybe when the C function reschedule_from_interrupt() returns, there are some callee-saved registers (?) that it should restore but doesn't, because it doesn't know they changed?
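
For reference on option 2: under the standard AArch64 procedure call convention (AAPCS64), a callee must preserve x19-x28, the frame pointer x29, and the low 64 bits of v8-v15 (d8-d15), while x30 is overwritten by every bl. Below is a minimal sketch of that classification. The helper names are hypothetical and are not part of OSv; this is only a checklist of which registers a cooperative switch has to carry across, not a description of what switch_to() actually does.

```cpp
#include <cassert>

// How AAPCS64 treats each general-purpose register across a function call.
enum class reg_class {
    caller_saved,   // x0-x17: arguments, results, temporaries; a call may clobber them
    platform,       // x18: platform register, meaning is OS-specific
    callee_saved,   // x19-x28: a callee must restore them before returning
    frame_pointer,  // x29: must also be preserved by the callee
    link_register   // x30: overwritten by every bl with the return address
};

// Hypothetical helper (illustration only): classify register xN, 0 <= n <= 30.
constexpr reg_class classify_gpr(int n)
{
    if (n <= 17) return reg_class::caller_saved;
    if (n == 18) return reg_class::platform;
    if (n <= 28) return reg_class::callee_saved;
    if (n == 29) return reg_class::frame_pointer;
    return reg_class::link_register;
}

// For SIMD/FP registers, only the low 64 bits of v8-v15 (that is, d8-d15)
// are preserved across calls.
constexpr bool fp_low_half_callee_saved(int n)
{
    return n >= 8 && n <= 15;
}
```

With this in hand, the registers to compare between the -O0 and -O2 builds are the ones classified as callee_saved or frame_pointer, plus d8-d15.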

 

Here is the disassembled working version (compiled with -O2):

Which register got ruined, and what does this code do with this register - and does it do anything different with it from the faulty code?


Waldek Kozaczuk

Feb 18, 2021, 6:32:21 PM2/18/21
to OSv Development
Hi,

On Wednesday, February 17, 2021 at 2:25:59 AM UTC-5 Nadav Har'El wrote:
On Tue, Feb 16, 2021 at 11:11 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
OK, I think I have narrowed it down to cpu::reschedule_from_interrupt()

Nice.

So beyond the "obvious" stuff we've been talking about (interrupts/exceptions/whatever call reschedule_from_interrupt() need to save the registers), there is another possibility now: maybe the code that reschedule_from_interrupt() is using to restore the registers, namely switch_to(), somehow destroys something when optimization is on.

You mean when optimization is off, right?

I don't think there's any way that switch_to() can actually destroy the registers you found being destroyed, because OSv already expects it (as a C++ function which uses registers) to destroy these registers - so whatever code was running in the target thread (e.g., the interrupt handler) is already saving and restoring these registers. So I see (but again, just guesses...) two options:

1. switch_to() or reschedule_from_interrupt() ruins the stack where the caller of reschedule_from_interrupt() saved its registers. However, since it doesn't happen all the time, and you already checked the "obvious" problems (red zone, too-small stack), it has to be something relatively rare, and I don't know what to guess...
2. Maybe the problem is not in an asynchronous call to reschedule_from_interrupt() from an interrupt, where all registers are saved, but in a cooperative reschedule (the user called a function), where nothing except switch_to() saves registers? Maybe switch_to() fails to save/restore some of them? Maybe when the C function reschedule_from_interrupt() returns, there are some callee-saved registers (?) that it should restore but doesn't, because it doesn't know they changed?

I have grown tired for now looking at this but I opened a new issue describing it - https://github.com/cloudius-systems/osv/issues/1124 

Waldek Kozaczuk

Feb 21, 2021, 12:02:53 PM
to OSv Development
The plot thickens.

I am not sure why I missed it but with '-O2' (release mode), the thread::switch_to() gets inlined whereas with '-O0' and '-O1' it does not - reschedule_from_interrupt actually calls it using bl instruction. 

Now, the 'bl' instruction actually implicitly stores the return address for ret in the x30 register. Now switch_to saves the registers x29 and x30 in the prologue:

stp     x29, x30, [sp, #-80]!


and then restores it at the end (meanwhile sp got changed):

ldp     x29, x30, [sp], #80

ret


But it restores it from the different stack, right? So it might return to the wrong place, no? Is this why we are crashing?

So maybe the mystery is solved? But then how is it possible that any context switches work up until this point in debug mode, given that it crashes quite late?

Ideas?

Waldek

Avi Kivity

Feb 21, 2021, 12:09:24 PM
to OSv Development
On Sunday, February 21, 2021 at 7:02:53 PM UTC+2 jwkoz...@gmail.com wrote:
The plot thickens.

I am not sure why I missed it but with '-O2' (release mode), the thread::switch_to() gets inlined whereas with '-O0' and '-O1' it does not - reschedule_from_interrupt actually calls it using bl instruction. 

Now, the 'bl' instruction actually implicitly stores the return address for ret in the x30 register. Now switch_to saves the registers x29 and x30 in the prologue:

stp     x29, x30, [sp, #-80]!


and then restores it at the end (meanwhile sp got changed):

ldp     x29, x30, [sp], #80

ret


But it restores it from the different stack, right? So it might return to the wrong place, no? Is this why we are crashing?


No, switch_to() returns on the same stack. Or rather, after switch_to() runs, it returns on another thread's stack, but with another thread's registers. When control eventually returns to the original thread, the original switch_to() will return on the first stack.
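To illustrate the point about returning on another stack, here is a minimal user-space sketch using the POSIX ucontext API instead of OSv's switch_to(). It is only an analogy: the names run_switch_demo, worker and trace are made up for this example, and swapcontext() saves far more state than OSv's inline assembly does. What it shows is that a call to the switching primitive on one context "returns" only when some other context switches back to it.

```cpp
#include <ucontext.h>
#include <string>
#include <vector>

// Hypothetical demo state; none of these names exist in OSv.
static ucontext_t main_ctx, worker_ctx;
static std::vector<std::string> trace;

static void worker() {
    trace.push_back("worker: first run");
    // Switch away. This call only returns once main switches back to us,
    // and then it returns on the worker stack, not on main's stack.
    swapcontext(&worker_ctx, &main_ctx);
    trace.push_back("worker: resumed after its own swapcontext");
    // Falling off the end resumes uc_link (main_ctx).
}

std::vector<std::string> run_switch_demo() {
    static char stack[64 * 1024];   // separate stack for the worker context
    trace.clear();

    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = stack;
    worker_ctx.uc_stack.ss_size = sizeof(stack);
    worker_ctx.uc_link = &main_ctx;
    makecontext(&worker_ctx, worker, 0);

    trace.push_back("main: switching to worker");
    swapcontext(&main_ctx, &worker_ctx);   // "returns" when worker switches back
    trace.push_back("main: back on main stack");
    swapcontext(&main_ctx, &worker_ctx);   // resume worker where it left off
    trace.push_back("main: worker finished");
    return trace;
}
```

Each swapcontext() call resumes exactly where the target context last gave up the CPU, which is the same property switch_to() relies on.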
 

So maybe the mystery is solved? But then how is it possible that any context switches work up until this point in debug mode, given that it crashes quite late?

Ideas?


Could we have had a stack overrun? Debug mode uses stack like there is no tomorrow.

Waldek Kozaczuk

Feb 21, 2021, 1:51:35 PM
to OSv Development
On Sunday, February 21, 2021 at 12:09:24 PM UTC-5 Avi Kivity wrote:
On Sunday, February 21, 2021 at 7:02:53 PM UTC+2 jwkoz...@gmail.com wrote:
The plot thickens.

I am not sure why I missed it but with '-O2' (release mode), the thread::switch_to() gets inlined whereas with '-O0' and '-O1' it does not - reschedule_from_interrupt actually calls it using bl instruction. 

Now, the 'bl' instruction actually implicitly stores the return address for ret in the x30 register. Now switch_to saves the registers x29 and x30 in the prologue:

stp     x29, x30, [sp, #-80]!


and then restores it at the end (meanwhile sp got changed):

ldp     x29, x30, [sp], #80

ret


But it restores it from the different stack, right? So it might return to the wrong place, no? Is this why we are crashing?


No, switch_to() returns on the same stack. Or rather, after switch_to() runs, it returns on another thread's stack, but with another thread's registers. When control eventually returns to the original thread, the original switch_to() will return on the first stack.

I am not sure about x86_64 but the full disassembled switch_to in debug mode (-O0) for aarch64 looks like this:

   0x000000004043636c <+0>: stp x29, x30, [sp, #-80]!
   0x0000000040436370 <+4>: mov x29, sp
   0x0000000040436374 <+8>: stp x19, x20, [sp, #16]
   0x0000000040436378 <+12>: stp x21, x22, [sp, #32]
   0x000000004043637c <+16>: str x0, [sp, #56]
   0x0000000040436380 <+20>: bl 0x400ce690 <_ZN5sched6thread7currentEv>
   0x0000000040436384 <+24>: str x0, [sp, #72]
   0x0000000040436388 <+28>: ldr x0, [sp, #56]
   0x000000004043638c <+32>: ldr x0, [x0, #112]
   0x0000000040436390 <+36>: msr tpidr_el0, x0
   0x0000000040436394 <+40>: isb
   0x0000000040436398 <+44>: ldr x19, [sp, #72]
   0x000000004043639c <+48>: ldr x20, [sp, #72]
   0x00000000404363a0 <+52>: ldr x21, [sp, #56]
   0x00000000404363a4 <+56>: ldr x22, [sp, #56]
   0x00000000404363a8 <+60>: add x19, x19, #0x50
   0x00000000404363ac <+64>: str x29, [x19]
   0x00000000404363b0 <+68>: mov x2, sp
   0x00000000404363b4 <+72>: adr x1, 0x404363cc <_ZN5sched6thread9switch_toEv+96>
   0x00000000404363b8 <+76>: stp x2, x1, [x20, #96]
   0x00000000404363bc <+80>: ldp x29, x0, [x21, #80]
   0x00000000404363c0 <+84>: ldp x2, x1, [x22, #96]
   0x00000000404363c4 <+88>: mov sp, x2 //stack register pointer changes
   0x00000000404363c8 <+92>: blr x1
   0x00000000404363cc <+96>: nop
   0x00000000404363d0 <+100>: ldp x19, x20, [sp, #16] // restore from new stack
   0x00000000404363d4 <+104>: ldp x21, x22, [sp, #32] // restore from new stack
   0x00000000404363d8 <+108>: ldp x29, x30, [sp], #80 // restore from new stack
   0x00000000404363dc <+112>: ret

Looking at it, it does change the stack register (sp) to a new thread stack (line +88) and then restores 6 registers from the new stack. Now having said that, if the new thread we are switching to was switched from before, then it most likely (unless its stack got corrupted) should have correct values. Now as I think about it, even x30 (lr) should have the right address as well, as all threads go through the same reschedule_from_interrupt/switch_to code, no?

Also, as we can see it uses 80 bytes of the stack (in release build the switch_to uses 48 bytes, but is not called by reschedule_from_interrupt). I ran experiments with increasing stack size twofold or even 4 times - it did not help. So this rather eliminates the possibility of the stack overrun.

What is also interesting is that when I build all code except reschedule_from_interrupt() in debug mode (-O0, or changed to -O1) AND force compiling reschedule_from_interrupt() with -O2 (which obviously inlines the call to switch_to), the crashes go away. Why? All the other code compiled with -O0 still uses a lot of stack, as you say.

I wonder if Nadav's theory about cooperative (non-preemptive) switches (yield()?) might be true. Do we have situations where a thread voluntarily calls reschedule_from_interrupt (which calls switch_to), and then we switch to a new thread T2 that was preempted way before this switch, and then T2 continues on its stack but its registers have not been restored, because the exception frame is not restored on a voluntary switch? Could this happen? But even if it can happen, why do things work in release mode? Because on average it uses way less stack and we are lucky?

Waldek Kozaczuk

Feb 21, 2021, 2:06:31 PM
to OSv Development
After thinking a bit more: switching from yield() to a new thread that was preempted way before that is handled OK, as eventually it should actually pop the exception frame and restore all registers.

But then what about the opposite scenario: some thread gets preempted and we then switch to a new thread that yielded way before? In this case, the new thread would NOT come back to the interrupt routine but rather continue with registers that possibly got overwritten by the thread that got preempted, no? But again, why does inlining help? Again, are we just lucky?

Waldek Kozaczuk

Feb 21, 2021, 3:50:57 PM
to OSv Development
One more thing. The preamble of the reschedule_from_interrupt in release mode (-O2) looks like this:

Dump of assembler code for function _ZN5sched3cpu25reschedule_from_interruptEbNSt6chrono8durationIlSt5ratioILl1ELl1000000000EEEE:
   0x00000000402e48d4 <+0>:     stp     x29, x30, [sp, #-144]!
   0x00000000402e48d8 <+4>:     mov     x29, sp
   0x00000000402e48dc <+8>:     stp     x21, x22, [sp, #32]
   0x00000000402e48e0 <+12>:    mrs     x21, tpidr_el0
   0x00000000402e48e4 <+16>:    add     x3, x21, #0x0, lsl #12
   0x00000000402e48e8 <+20>:    add     x3, x3, #0x70
   0x00000000402e48ec <+24>:    stp     x25, x26, [sp, #64]
   0x00000000402e48f0 <+28>:    and     w25, w1, #0xff
   0x00000000402e48f4 <+32>:    ldr     w1, [x3]
   0x00000000402e48f8 <+36>:    stp     x19, x20, [sp, #16]
   0x00000000402e48fc <+40>:    stp     x23, x24, [sp, #48]
   0x00000000402e4900 <+44>:    stp     x27, x28, [sp, #80]

and the return part:
   0x00000000402e4b94 <+704>:   ldp     x19, x20, [sp, #16]
   0x00000000402e4b98 <+708>:   ldp     x21, x22, [sp, #32]
   0x00000000402e4b9c <+712>:   ldp     x23, x24, [sp, #48]
   0x00000000402e4ba0 <+716>:   ldp     x25, x26, [sp, #64]
   0x00000000402e4ba4 <+720>:   ldp     x27, x28, [sp, #80]
   0x00000000402e4ba8 <+724>:   ldp     x29, x30, [sp], #144
   0x00000000402e4bac <+728>:   ret

As you can see it happens to save and restore all callee save registers (>= x19). The debug one does NOT save/restore those registers. Could this be a reason?

Waldek Kozaczuk

Feb 22, 2021, 12:30:31 AM
to OSv Development
I think I have an explanation of what is going on. Before I present it let me recap the calling convention for aarch64:

Caller:
  1. If we need any of x0-x18 registers, save them. They are corruptible.
  2. Move the first 8 parameters into registers x0-x7.
  3. Push any additional parameters on the stack.
  4. Use BL to call the function.
  5. Evaluate the return code in x0.
  6. Restore any of x0-x18 that we saved in step 1.
Callee:
  1. Push LR (x30) and x19-x29 onto the stack if used by this routine.
  2. Do the work
  3. Put return code in x0.
  4. Pop LR and x19-x29 if pushed in step 1.
  5. Use RET instruction to return execution to the caller (this will implicitly use LR (x30) as an address to return to).
Now imagine the following scenario involving function F executing on thread T1 that calls thread::yield() or another function calling yield():
  1. Function F pushes one of the callee-saved registers - x23 (just an example) - on the T1 stack because it uses it for something and it must do so per the calling convention.
  2. Function F stores some value in x23.
  3. Function F calls thread::yield() directly or indirectly.
  4. Eventually, reschedule_from_interrupt() is called and it calls switch_to() to switch the stack pointer to the new thread T2 stack. Neither the debug version of reschedule_from_interrupt() nor switch_to() stores x23, as they do not use this register (unlike the release version).
  5. At some point later, reschedule_from_interrupt() is called again (not necessarily the very next time) and calls switch_to(), which switches back to T1.
  6. T1 resumes and eventually returns control to the function F right after it called yield().
  7. The code in F after calling yield() reads the value of x23 ... and boom. x23 quite likely contains garbage: F does not restore it itself after calling yield() because, per the calling convention, yield() or its callees should have saved and restored it. But they did not, did they? Or rather, different routines on different threads running on the same cpu in between ruined it.
Why does it all work with the release version? It does because the reschedule_from_interrupt() compiled with -O2 happens to use and save all callee-saved registers x19-x28. So they happen to be restored to correct values after the switch.
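The scenario above can be sketched with a toy model. This is not OSv code: ToyCpu, ToyThread, toy_switch and run_scenario are made-up names, and a single variable stands in for the physical x23 register shared by all threads on one cpu. The flag models whether anything on the switch path saves and restores the callee-saved register.

```cpp
#include <cstdint>

// Toy model: one physical register shared by every thread on this cpu.
struct ToyCpu {
    std::uint64_t x23 = 0;
};

// Per-thread save slot, standing in for the thread's stack frame.
struct ToyThread {
    std::uint64_t saved_x23 = 0;
};

// Switch from `from` to `to`. With save_callee_saved == false this mimics
// the debug build described above, where nothing on the path stores x23.
void toy_switch(ToyCpu& cpu, ToyThread& from, ToyThread& to,
                bool save_callee_saved) {
    if (save_callee_saved) {
        from.saved_x23 = cpu.x23;   // save the outgoing thread's value
        cpu.x23 = to.saved_x23;     // restore the incoming thread's value
    }
    // Otherwise x23 simply carries over into the next thread.
}

// Returns the value function F observes in x23 after yield() returns.
std::uint64_t run_scenario(bool save_callee_saved) {
    ToyCpu cpu;
    ToyThread t1, t2;
    cpu.x23 = 42;                                  // step 2: F stores a value
    toy_switch(cpu, t1, t2, save_callee_saved);    // steps 3-4: yield, go to T2
    cpu.x23 = 0xdead;                              // T2 uses x23 for itself
    toy_switch(cpu, t2, t1, save_callee_saved);    // step 5: back to T1
    return cpu.x23;                                // step 7: F reads x23
}
```

In this model run_scenario(false) hands F the value left behind by T2, while run_scenario(true) hands back the 42 that F stored before yielding.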

So it seems that the right solution is to save and restore x19-x28 (callee saved registers) in switch_to() like so:

diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh
index dff7467c..45aff4a7 100644
--- a/arch/aarch64/arch-switch.hh
+++ b/arch/aarch64/arch-switch.hh
@@ -27,6 +27,7 @@ void thread::switch_to()
 
     asm volatile("\n"
                  "str x29,     %0  \n"
+                "sub sp, sp, #0x50\n"
                  "mov x2, sp       \n"
                  "adr x1, 1f       \n" /* address of label */
                  "stp x2, x1,  %1  \n"
@@ -34,10 +35,23 @@ void thread::switch_to()
                  "ldp x29, x0, %2  \n"
                  "ldp x2, x1,  %3  \n"
 
+                "stp x19, x20, [sp, #0]\n"
+                "stp x21, x22, [sp, #16]\n"
+                "stp x23, x24, [sp, #32]\n"
+                "stp x25, x26, [sp, #48]\n"
+                "stp x27, x28, [sp, #64]\n"
+
                  "mov sp, x2       \n"
                  "blr x1           \n"
 
                  "1:               \n" /* label */
+
+                "ldp x19, x20, [sp, #0]\n"
+                "ldp x21, x22, [sp, #16]\n"
+                "ldp x23, x24, [sp, #32]\n"
+                "ldp x25, x26, [sp, #48]\n"
+                "ldp x27, x28, [sp, #64]\n"
+                "add sp, sp, #0x50\n"
                  :
                  : "Q"(old->_state.fp), "Ump"(old->_state.sp),
                    "Ump"(this->_state.fp), "Ump"(this->_state.sp)

And indeed the crashes in both -O0 and -O1 go away.

Does my explanation have holes? Or am I completely wrong?

BTW could we have a similar problem with x86_64 port? The callee saved registers are RBX, RBP, and R12-R15. And this does not seem to be saving RBX nor R12-R15:

void thread::switch_to()
{
    thread* old = current();
    // writing to fs_base invalidates memory accesses, so surround with
    // barriers
    barrier();
    set_fsbase(reinterpret_cast<u64>(_tcb));
    barrier();
    auto c = _detached_state->_cpu;
    old->_state.exception_stack = c->arch.get_exception_stack();
    c->arch.set_interrupt_stack(&_arch);
    c->arch.set_exception_stack(_state.exception_stack);
    auto fpucw = processor::fnstcw();
    auto mxcsr = processor::stmxcsr();
    asm volatile
        ("mov %%rbp, %c[rbp](%0) \n\t"
         "movq $1f, %c[rip](%0) \n\t"
         "mov %%rsp, %c[rsp](%0) \n\t"
         "mov %c[rsp](%1), %%rsp \n\t"
         "mov %c[rbp](%1), %%rbp \n\t"
         "jmpq *%c[rip](%1) \n\t"
         "1: \n\t"
         :
         : "a"(&old->_state), "c"(&this->_state),
           [rsp]"i"(offsetof(thread_state, rsp)),
           [rbp]"i"(offsetof(thread_state, rbp)),
           [rip]"i"(offsetof(thread_state, rip))
         : "rbx", "rdx", "rsi", "rdi", "r8", "r9",
           "r10", "r11", "r12", "r13", "r14", "r15", "memory");
    // As the catch-all solution, reset FPU state and more specifically
    // its status word. For details why we need it please see issue #1020.
    asm volatile ("emms");
    processor::fldcw(fpucw);
    processor::ldmxcsr(mxcsr);
}

Nadav Har'El

Feb 22, 2021, 1:36:12 AM
to Waldek Kozaczuk, OSv Development
On Mon, Feb 22, 2021 at 7:30 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
I think I have an explanation of what is going on.

Excellent! This seems a very plausible explanation - I think it's similar to one of my wild guesses :-)
I think I understand now why the x86 version worked, and perhaps you can do the same in aarch64. See below:
Note this list of registers here! I think (but it's been years since I looked at this, so I'm rusty...), that the idea
is exactly to have the compiler save the callee-saved registers, if it didn't already. It tells the compiler to
pretend that this assembly instruction just modified all the registers in the list. So now we have a function
switch_to() which thinks it modifies r15 et al., so it needs to save and restore these registers. 

Perhaps exactly the same solution would work for aarch64 as well.

This would be better than explicit copying, because as you noted, in some builds reschedule_from_interrupt() already
saves these registers so there is no need to do it again.

If this is the solution, we need a comment next to this list of registers explaining its raison d'etre :-(

 

Avi Kivity

Feb 22, 2021, 6:30:07 AM
to OSv Development
I think you're completely right, well spotted.

Here's the Linux equivalent:

SYM_FUNC_START(cpu_switch_to)
        mov     x10, #THREAD_CPU_CONTEXT
        add     x8, x0, x10
        mov     x9, sp
        stp     x19, x20, [x8], #16             // store callee-saved registers
        stp     x21, x22, [x8], #16
        stp     x23, x24, [x8], #16
        stp     x25, x26, [x8], #16
        stp     x27, x28, [x8], #16
        stp     x29, x9, [x8], #16
        str     lr, [x8]
        add     x8, x1, x10
        ldp     x19, x20, [x8], #16             // restore callee-saved registers
        ldp     x21, x22, [x8], #16
        ldp     x23, x24, [x8], #16
        ldp     x25, x26, [x8], #16
        ldp     x27, x28, [x8], #16
        ldp     x29, x9, [x8], #16
        ldr     lr, [x8]
        mov     sp, x9
        msr     sp_el0, x1
        ptrauth_keys_install_kernel x1, x8, x9, x10
        scs_save x0, x8
        scs_load x1, x8
        ret
SYM_FUNC_END(cpu_switch_to)
NOKPROBE(cpu_switch_to)


 

Waldek Kozaczuk

Feb 22, 2021, 7:00:48 PM
to OSv Development
Indeed when you disassemble the x64 version of switch_to() (this one is from a release build), it looks like this:

Dump of assembler code for function _ZN5sched6thread9switch_toEv:
   0x00000000403f8140 <+0>: push   %rbp
   0x00000000403f8141 <+1>: mov    %rsp,%rbp
   0x00000000403f8144 <+4>: push   %r15
   0x00000000403f8146 <+6>: push   %r14
   0x00000000403f8148 <+8>: push   %r13
   0x00000000403f814a <+10>: push   %r12
   0x00000000403f814c <+12>: push   %rbx
...
   0x00000000403f81e1 <+161>: pop    %rbx
   0x00000000403f81e2 <+162>: pop    %r12
   0x00000000403f81e4 <+164>: pop    %r13
   0x00000000403f81e6 <+166>: pop    %r14
   0x00000000403f81e8 <+168>: pop    %r15
   0x00000000403f81ea <+170>: pop    %rbp
   0x00000000403f81eb <+171>: ret    
 
So all callee-save registers indeed are pushed and popped from the stack. Some of them like r12 are used by the body of switch_to(), but others are not - r13, r14, r15. This is interesting because according to the documentation of inline assembly - https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html - it does not mention the callee-save registers. The section "6.47.2.6 Clobbers and Scratch Registers" has this statement:

"When the compiler selects which registers to use to represent input and output operands, it does not use any of the clobbered registers. As a result, clobbered registers are available for any use in the assembler code."

But based on how it works with x64, the clobbered registers list is treated somewhat differently from the statement above. Maybe, in addition, it also means that any callee-saved registers in that list will be saved to and restored from the stack around this inline assembly (more or less, as the pushes are actually in the preamble).

Anyway, when I follow the same idea and modify the inline assembly in the aarch64 version of switch_to() to contain the callee-save registers x19-x28 like so:

diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh
index dff7467c..f0ec61f2 100644
--- a/arch/aarch64/arch-switch.hh
+++ b/arch/aarch64/arch-switch.hh
@@ -42,8 +42,9 @@ void thread::switch_to()
                  : "Q"(old->_state.fp), "Ump"(old->_state.sp),
                    "Ump"(this->_state.fp), "Ump"(this->_state.sp)
                  : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
-                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
-                   "x16", "x17", "x18", "x30", "memory");
+                   "x9", "x10", "x11", "x12", "x13", "x14",
+                   "x19", "x20", "x21", "x22", "x23", "x24",
+                  "x25", "x26", "x27", "x28", "x29", "x30", "memory");
 }

then the assembly of switch_to in the debug build shows new pushes and pops as we would expect:

Dump of assembler code for function sched::thread::switch_to():
   0x000000004043717c <+0>: stp x29, x30, [sp, #-128]!
   0x0000000040437180 <+4>: mov x29, sp
   0x0000000040437184 <+8>: stp x19, x20, [sp, #16]
   0x0000000040437188 <+12>: stp x21, x22, [sp, #32]
   0x000000004043718c <+16>: stp x23, x24, [sp, #48]
   0x0000000040437190 <+20>: stp x25, x26, [sp, #64]
   0x0000000040437194 <+24>: stp x27, x28, [sp, #80]
...
   0x00000000404371ec <+112>: ldp x19, x20, [sp, #16]
   0x00000000404371f0 <+116>: ldp x21, x22, [sp, #32]
   0x00000000404371f4 <+120>: ldp x23, x24, [sp, #48]
   0x00000000404371f8 <+124>: ldp x25, x26, [sp, #64]
   0x00000000404371fc <+128>: ldp x27, x28, [sp, #80]
   0x0000000040437200 <+132>: ldp x29, x30, [sp], #128
   0x0000000040437204 <+136>: ret

And indeed when I run the same tests that were broken in both debug builds (-O0, -O1), they all work now, which is great.

Please note that I added x19-x28 registers to the list in the inline assembly, but then I also removed some from the original list - x15-x18. Otherwise, the compiler would complain with this error:
"arch/aarch64/arch-switch.hh: In member function ‘void sched::thread::switch_to()’:
arch/aarch64/arch-switch.hh:28:5: error: ‘asm’ operand has impossible constraints
   28 |     asm volatile("\n"
      |     ^~~"
 
This makes sense as the compiler needs some registers for inputs and outputs as I understand. But I also wonder why the original list had all the registers x0-x18 where only 3 of those - x0, x1, and x2 would be used in the inline assembly (shouldn't it be enough to list only x0, x1, and x2 besides x29 and x30?):

    asm volatile("\n"
                 "str x29,     %0  \n"
                 "mov x2, sp       \n"
                 "adr x1, 1f       \n" /* address of label */
                 "stp x2, x1,  %1  \n"

                 "ldp x29, x0, %2  \n"
                 "ldp x2, x1,  %3  \n"

                 "mov sp, x2       \n"
                 "blr x1           \n"

                 "1:               \n" /* label */

By the way, x0 actually holds the address of "this" (thread object) - the implicit parameter to the switch_to function.

Anyhow, it would all be great, except that the changes I made to the switch_to inline assembly break the release build - OSv seems to hang early, on the 1st thread switch. When I connected with gdb and looked at the assembly of reschedule_from_interrupt() (please note that the release build of aarch64 inlines the call to switch_to(), unlike x64 where neither the release nor the debug build inlines switch_to and both use the CALL instruction), something looks messed up in the part of the assembly of reschedule_from_interrupt where switch_to() was inlined:

   0x00000000402e546c <+764>:   mov     x2, sp
   0x00000000402e5470 <+768>:   adr     x1, 0x402e5488 <sched::cpu::reschedule_from_interrupt(bool, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+792>
   0x00000000402e5474 <+772>:   stp     x2, x1, [x16, #96]
   0x00000000402e5478 <+776>:   ldp     x29, x0, [x1, #80]
   0x00000000402e547c <+780>:   ldp     x2, x1, [x1, #96]
   0x00000000402e5480 <+784>:   mov     sp, x2
   0x00000000402e5484 <+788>:   blr     x1
   0x00000000402e5488 <+792>:   ldr     x19, [sp, #120]
   0x00000000402e548c <+796>:   ldr     x0, [x19, #8]
   0x00000000402e5490 <+800>:   ldr     x0, [x0, #5904]
   0x00000000402e5494 <+804>:   cbz     x0, 0x402e54a4 <sched::cpu::reschedule_from_interrupt(bool, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+820>
   0x00000000402e5498 <+808>:   bl      0x402e5970 <sched::thread::destroy()>
   0x00000000402e549c <+812>:   ldr     x0, [x19, #8]
   0x00000000402e54a0 <+816>:   str     xzr, [x0, #5904]
   0x00000000402e54a4 <+820>:   ldp     x19, x20, [sp, #16]
   0x00000000402e54a8 <+824>:   ldp     x21, x22, [sp, #32]
   0x00000000402e54ac <+828>:   ldp     x23, x24, [sp, #48]
   0x00000000402e54b0 <+832>:   ldp     x25, x26, [sp, #64]
   0x00000000402e54b4 <+836>:   ldp     x27, x28, [sp, #80]
   0x00000000402e54b8 <+840>:   ldp     x29, x30, [sp], #176
   0x00000000402e54bc <+844>:   ret

Starting with line +764, it loads register x2 with the stack pointer. Then in line +768, it loads the address of the label "1:" into the register x1. Then in line +772, it stores the pair of registers x2 and x1 into the fields sp and pc of the old->_state structure to save the stack and instruction pointer of the old thread.

Now the next two lines - +776 and +780 - are weird and I think contain a bug. In both lines, it tries to load pairs of registers - (x29,x0) and (x2,x1) - from memory based on the address stored in x1. But x1 contains the address of the label "1:" - what? As a result, the new stack pointer (line +784) gets loaded with a junk value, 'blr x1' in line +788 ends up jumping to some junk address, and we get an exception and the cpu halts.

Now compare a similar part of the assembly of the reschedule_from_interrupt() with the unmodified switch_to:

   0x00000000402e5480 <+764>:   mov     x2, sp
   0x00000000402e5484 <+768>:   adr     x1, 0x402e549c <sched::cpu::reschedule_from_interrupt(bool, std::chrono::duration<long, std::ratio<1l, 1000000000l> >)+792>
   0x00000000402e5488 <+772>:   stp     x2, x1, [x19, #96]
   0x00000000402e548c <+776>:   ldp     x29, x0, [x22, #80]
   0x00000000402e5490 <+780>:   ldp     x2, x1, [x22, #96]
   0x00000000402e5494 <+784>:   mov     sp, x2
   0x00000000402e5498 <+788>:   blr     x1

In this one both lines +776 and +780 load values from the address specified in the register x22, which seems to be correct when I follow where it originates from.

So what is going on? Is it a compiler bug? Or is there something wrong with our code? Maybe when inlining, the compiler treats the inlined code in switch_to() as a black box and still has no idea it uses register x1, and that is why we have some sort of collision problem?

Anyhow, after some trial and error, I arrived at this version of switch_to(), which seems to work in both the debug (non-inlined) and release (inlined) builds.

diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh
index dff7467c..c9cb07d8 100644
--- a/arch/aarch64/arch-switch.hh
+++ b/arch/aarch64/arch-switch.hh
@@ -27,23 +27,22 @@ void thread::switch_to()
 
     asm volatile("\n"
                  "str x29,     %0  \n"
-                 "mov x2, sp       \n"
-                 "adr x1, 1f       \n" /* address of label */
-                 "stp x2, x1,  %1  \n"
+                 "mov x22, sp      \n"
+                 "adr x21, 1f      \n" /* address of label */
+                 "stp x22, x21, %1 \n"
 
                  "ldp x29, x0, %2  \n"
-                 "ldp x2, x1,  %3  \n"
+                 "ldp x22, x21, %3 \n"
 
-                 "mov sp, x2       \n"
-                 "blr x1           \n"
+                 "mov sp, x22      \n"
+                 "blr x21          \n"
 
                  "1:               \n" /* label */
                  :
                  : "Q"(old->_state.fp), "Ump"(old->_state.sp),
                    "Ump"(this->_state.fp), "Ump"(this->_state.sp)
-                 : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
-                   "x9", "x10", "x11", "x12", "x13", "x14", "x15",
-                   "x16", "x17", "x18", "x30", "memory");
+                 : "x0", "x19", "x20", "x21", "x22", "x23", "x24",
+                  "x25", "x26", "x27", "x28", "x29", "x30", "memory");
 }

In essence, we replace x1 with x21 and x2 with x22 which we know get saved and restored anyway. Then also I am only listing x0 + all callee saved registers in the clobbers list.

Now, I wonder: are we just lucky and it just works? Could that issue when x1 and x2 are used happen again depending on how the compiler chooses to compile the reschedule_from_interrupt() even when we use x21 and x22? 

On top of this, I think that in theory, we might have another issue in aarch64 - regardless of whether switch_to is inlined or not, if reschedule_from_interrupt happens to use any of the callee-saved registers, which switch_to saves/restores, both before and after calling switch_to, we might have a situation where it reads a wrong value because it was ruined by switch_to. I have found a somewhat similar issue here - https://github.com/cloudius-systems/osv/issues/1121.

The situation above can only happen in the portion of code in reschedule_from_interrupt() after switch_to():

    n->switch_to();

    // Note: after the call to n->switch_to(), we should no longer use any of
    // the local variables, nor "this" object, because we just switched to n's
    // stack and the values we can access now are those that existed in the
    // reschedule call which scheduled n out, and will now be returning.
    // So to get the current cpu, we must use cpu::current(), not "this".
    if (cpu::current()->terminating_thread) {
        cpu::current()->terminating_thread->destroy();
        cpu::current()->terminating_thread = nullptr;
    }

Please see the comment about not using the stack. But I think in aarch64 it needs to be stricter than that - it cannot read any callee-saved registers either, because the values set before the call would be ruined by switch_to as it restores those registers from the new thread.

To counter that I think we should put this code in a class static method that we would call after switch_to(), at least for aarch64:

diff --git a/core/sched.cc b/core/sched.cc
index 06f849d1..50d1a7e7 100644
--- a/core/sched.cc
+++ b/core/sched.cc
@@ -229,6 +229,13 @@ void cpu::schedule()
     }
 }
 
+void __attribute__ ((noinline)) cpu::destroy_current_cpu_terminating_thread() {
+    if (cpu::current()->terminating_thread) {
+        cpu::current()->terminating_thread->destroy();
+        cpu::current()->terminating_thread = nullptr;
+    }
+}
+
 void cpu::reschedule_from_interrupt(bool called_from_yield,
                                     thread_runtime::duration preempt_after)
 {
@@ -343,10 +350,11 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
     // stack and the values we can access now are those that existed in the
     // reschedule call which scheduled n out, and will now be returning.
     // So to get the current cpu, we must use cpu::current(), not "this".
-    if (cpu::current()->terminating_thread) {
-         cpu::current()->terminating_thread->destroy();
-         cpu::current()->terminating_thread = nullptr;
-    }
+   destroy_current_cpu_terminating_thread();
 }

So even though we seem to better understand the problem, the correct solution is not that clear. Unless you see one.

Waldek

Nadav Har'El

Feb 23, 2021, 11:50:09 AM
to Waldek Kozaczuk, OSv Development
I think this statement only partially describes what "clobber registers" means.
In my experience, what this list does is that it tells the C compiler (which doesn't understand assembly language - it's a C compiler, not an assembler!) which registers this assembly-language code may be using. The C compiler then needs to do two things:
  1. If the C compiler used one of these registers for its own purposes in the enclosing function before the assembly code, and wants to use it after the assembly code, it will need to save it.
  2. If this is one of the callee saved registers, the ABI says the function must restore this register when it returns. So if the function now includes this (assembly) code which ruins this register, the C compiler must add code to save and restore this register. 
I think this is exactly what we need to save the callee-saved registers.
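A small way to see this behaviour in isolation, assuming GCC or Clang extended asm: the hypothetical helper below (sum_across_clobber is not an OSv function) contains an empty asm statement whose clobber list names the callee-saved registers. The asm body does nothing, so the only effect is on what the compiler believes about those registers. Disassembling the function should show the extra saves/restores in its prologue and epilogue, though I have not checked every compiler version.

```cpp
#include <cstdint>

// Hypothetical helper for illustration only. The empty asm claims to modify
// the callee-saved registers, so the compiler must (a) keep `kept` somewhere
// that survives the statement and (b) save/restore the listed registers so
// that this function still honours the ABI towards its caller.
__attribute__((noinline))
std::uint64_t sum_across_clobber(std::uint64_t a, std::uint64_t b) {
    std::uint64_t kept = a * 3 + b;   // value that is live across the asm
#if defined(__GNUC__) && defined(__aarch64__)
    asm volatile("" ::: "x19", "x20", "x21", "x22", "x23", "x24",
                        "x25", "x26", "x27", "x28", "memory");
#elif defined(__GNUC__) && defined(__x86_64__)
    asm volatile("" ::: "rbx", "r12", "r13", "r14", "r15", "memory");
#endif
    // On other targets the asm is omitted and this is a plain function.
    return kept + a;
}
```

The result is the same with or without the asm statement; only the generated prologue and epilogue differ, which is the effect switch_to() depends on.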
Excellent.


Please note that I added x19-x28 registers to the list in the inline assembly, but then I also removed some from the original list - x15-x18. Otherwise, the compiler would complain with this error:
"arch/aarch64/arch-switch.hh: In member function ‘void sched::thread::switch_to()’:
arch/aarch64/arch-switch.hh:28:5: error: ‘asm’ operand has impossible constraints
   28 |     asm volatile("\n"
      |     ^~~"
 
This makes sense as the compiler needs some registers for inputs and outputs as I understand. But I also wonder why the original list had all the registers x0-x18 where only 3 of those - x0, x1, and x2 would be used in the inline assembly (shouldn't it be enough to list only x0, x1, and x2 besides x29 and x30?):

I don't know who wrote this list originally or why that specific list.
It should list all the callee-saved registers, plus the registers you actually use in the instructions themselves. Not more (I think).
I really don't know. It looks like one, but I'm not "swimming" in assembly language (let alone the Arm one) enough to say for sure :-(
I'm worried that the answer is yes. But then again, if the reason (and I'm not sure...) that x1 and x2 don't work is just a compiler bug, then when this bug is eventually fixed, x21/x22 will work just like x1/x2 do.
 

On top of this, I think that in theory we might have another issue on aarch64: regardless of whether switch_to is inlined, if reschedule_from_interrupt happens to use any of the callee-saved registers that switch_to saves/restores, both before and after calling switch_to, it might read a wrong value because switch_to ruined it. I have found a somewhat similar issue here - https://github.com/cloudius-systems/osv/issues/1121.

This is true :-( 

Maybe the only reliable solution is to write the top-level function (reschedule_from_interrupt) in assembly - it will then save some registers, call the C function, and end with restoring the registers... I still don't understand why we never had these problems in the x86 version. Maybe we were just lucky?

Stewart Hildebrand

Feb 23, 2021, 3:20:04 PM2/23/21
to OSv Development
Wouldn't we NOT want to have x29 (frame pointer) in the clobber list? If I understand the clobber list correctly, the compiler uses it to determine which registers to save/restore before/after the "asm volatile" statement, but only if the compiler happens to need those registers elsewhere in the function (wild guess: does that become more likely if the function gets inlined?). By listing x29 in the clobber list, would we be undoing what we're trying to accomplish with the inline assembly?

Stewart Hildebrand

Feb 24, 2021, 11:42:33 AM2/24/21
to OSv Development
Unsure if this is related or not, but could there be an issue with saving/restoring the FPU state (i.e. Q0-Q31)? I took a look at the disassembly for fpu_state_save/fpu_state_load (arch/aarch64/processor.hh), and I noticed the compiler adds unwanted saves/restores of the registers d8-d15, which are the 64-bit views of the 128-bit FPU/SIMD registers q8-q15. Unless I misread the code, it appears that q8-q15 are not actually restored during a context switch. Here's the disassembly of fpu_state_load with mode=debug:

00000000402bae64 <_ZN9processor14fpu_state_loadEPNS_9fpu_stateE>:
    402bae64:   6dbb27e8        stp     d8, d9, [sp, #-80]!
    402bae68:   6d012fea        stp     d10, d11, [sp, #16]
    402bae6c:   6d0237ec        stp     d12, d13, [sp, #32]
    402bae70:   6d033fee        stp     d14, d15, [sp, #48]
    402bae74:   f90027e0        str     x0, [sp, #72]
    402bae78:   f94027e0        ldr     x0, [sp, #72]
    402bae7c:   ad400400        ldp     q0, q1, [x0]
    402bae80:   f94027e0        ldr     x0, [sp, #72]
    402bae84:   ad410c02        ldp     q2, q3, [x0, #32]
    402bae88:   f94027e0        ldr     x0, [sp, #72]
    402bae8c:   ad421404        ldp     q4, q5, [x0, #64]
    402bae90:   f94027e0        ldr     x0, [sp, #72]
    402bae94:   ad431c06        ldp     q6, q7, [x0, #96]
    402bae98:   f94027e0        ldr     x0, [sp, #72]
    402bae9c:   ad442408        ldp     q8, q9, [x0, #128]
    402baea0:   f94027e0        ldr     x0, [sp, #72]
    402baea4:   ad452c0a        ldp     q10, q11, [x0, #160]
    402baea8:   f94027e0        ldr     x0, [sp, #72]
    402baeac:   ad46340c        ldp     q12, q13, [x0, #192]
    402baeb0:   f94027e0        ldr     x0, [sp, #72]
    402baeb4:   ad473c0e        ldp     q14, q15, [x0, #224]
    402baeb8:   f94027e0        ldr     x0, [sp, #72]
    402baebc:   ad484410        ldp     q16, q17, [x0, #256]
    402baec0:   f94027e0        ldr     x0, [sp, #72]
    402baec4:   ad494c12        ldp     q18, q19, [x0, #288]
    402baec8:   f94027e0        ldr     x0, [sp, #72]
    402baecc:   ad4a5414        ldp     q20, q21, [x0, #320]
    402baed0:   f94027e0        ldr     x0, [sp, #72]
    402baed4:   ad4b5c16        ldp     q22, q23, [x0, #352]
    402baed8:   f94027e0        ldr     x0, [sp, #72]
    402baedc:   ad4c6418        ldp     q24, q25, [x0, #384]
    402baee0:   f94027e0        ldr     x0, [sp, #72]
    402baee4:   ad4d6c1a        ldp     q26, q27, [x0, #416]
    402baee8:   f94027e0        ldr     x0, [sp, #72]
    402baeec:   ad4e741c        ldp     q28, q29, [x0, #448]
    402baef0:   f94027e0        ldr     x0, [sp, #72]
    402baef4:   ad4f7c1e        ldp     q30, q31, [x0, #480]
    402baef8:   f94027e0        ldr     x0, [sp, #72]
    402baefc:   b9420000        ldr     w0, [x0, #512]
    402baf00:   d51b4420        msr     fpsr, x0
    402baf04:   f94027e0        ldr     x0, [sp, #72]
    402baf08:   b9420400        ldr     w0, [x0, #516]
    402baf0c:   d51b4400        msr     fpcr, x0
    402baf10:   d503201f        nop
    402baf14:   6d412fea        ldp     d10, d11, [sp, #16]
    402baf18:   6d4237ec        ldp     d12, d13, [sp, #32]
    402baf1c:   6d433fee        ldp     d14, d15, [sp, #48]
    402baf20:   6cc527e8        ldp     d8, d9, [sp], #80
    402baf24:   d65f03c0        ret
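
A toy model of what the listing above seems to do to q8-q15 (plain C++, no real registers; every name here is hypothetical). It relies on the architectural rule that a write to a Dn register zeroes the upper 64 bits of the corresponding 128-bit register. The prologue spills the low half, the body loads the full 128-bit value, and the epilogue then overwrites it with the spilled low half:

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for one 128-bit SIMD/FP register (e.g. q8), split in halves.
struct vreg {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Models "stp d8, ...": only the low 64 bits are saved.
std::uint64_t spill_d(const vreg& v) { return v.lo; }

// Models "ldp d8, ...": a write through the D view sets the low half
// and zeroes the upper half of the 128-bit register.
void reload_d(vreg& v, std::uint64_t saved)
{
    v.lo = saved;
    v.hi = 0;
}

// Models the debug-build fpu_state_load for a single register:
// prologue spill, the intended 128-bit load, epilogue reload.
vreg model_fpu_state_load(vreg live, const vreg& saved_state)
{
    std::uint64_t spilled = spill_d(live);  // prologue: stp d8, d9, ...
    live = saved_state;                     // body:     ldp q8, q9, ...
    reload_d(live, spilled);                // epilogue: ldp d8, d9, ...
    return live;
}

// Usage: the value loaded from the saved state does not survive.
void demo_q_restore()
{
    vreg before{1, 2};
    vreg state{3, 4};
    vreg after = model_fpu_state_load(before, state);
    assert(after.lo == 1);  // old low half is back
    assert(after.hi == 0);  // upper half was zeroed by the D write
}
```

If this model matches the real behaviour, the state loaded for q8-q15 is lost as soon as the function returns, while q0-q7 and q16-q31 keep the loaded values.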

Waldek Kozaczuk

Feb 24, 2021, 4:13:06 PM2/24/21
to OSv Development
Regarding x29, I agree that it is unnecessary to restore it from the stack. Unless I misunderstand what is going on, the inline assembly sets x29 from the `fp` field of the new thread we are switching to, and that is the same value that was stored on the stack in the preamble the previous time this thread was switched from. So restoring from the stack is unnecessary, but I think it is not harmful.

I actually tried to remove it from the clobber list, but the compiler would still generate code to push/pop both x29 and x30 on/from the stack. I think it does so regardless of the inline assembly's clobbers, because it needs to do it for any function (the standard preamble).

Now the callee-save registers in the clobbers list are treated differently (and they are not mentioned in this doc - https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html). For example, if I add x18 (not callee-save) which is also not used by the compiler, the compiler would NOT push and pop this register. On the other hand, if I remove x19 (first callee-save) from the list, it would also NOT push and pop this register (it would if x19 is in the list). This seems to indicate that callee-save and non-callee-saved registers are treated differently. 

For example this code:
void thread::switch_to()
{
    thread* old = current();
    asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");

    asm volatile("\n"
                 "str x29,     %0  \n"
                 "mov x22, sp      \n"
                 "adr x21, 1f      \n" /* address of label */
                 "stp x22, x21, %1 \n"

                 "ldp x29, x0, %2  \n"
                 "ldp x22, x21, %3 \n"

                 "mov sp, x22      \n"
                 "blr x21          \n"

                 "1:               \n" /* label */
                 :
                 : "Q"(old->_state.fp), "Ump"(old->_state.sp),
                   "Ump"(this->_state.fp), "Ump"(this->_state.sp)
                 : "x0", "x20", "x21", "x22", "x23", "x24",
                   "x25", "x26", "x27", "x28", "x30", "memory");
}

gets translated to:
gdb -batch -ex 'file build/release/loader.elf' -ex 'disassemble sched::thread::switch_to'
Dump of assembler code for function sched::thread::switch_to():
   0x00000000402e2710 <+0>: mrs x2, tpidr_el0
   0x00000000402e2714 <+4>: stp x29, x30, [sp, #-96]!
   0x00000000402e2718 <+8>: add x2, x2, #0x0, lsl #12
   0x00000000402e271c <+12>: add x2, x2, #0x90
   0x00000000402e2720 <+16>: mov x29, sp
   0x00000000402e2724 <+20>: mov x1, x0
   0x00000000402e2728 <+24>: stp x20, x21, [sp, #16]
   0x00000000402e272c <+28>: stp x22, x23, [sp, #32]
   0x00000000402e2730 <+32>: stp x24, x25, [sp, #48]
   0x00000000402e2734 <+36>: stp x26, x27, [sp, #64]
   0x00000000402e2738 <+40>: ldr x2, [x2]
   0x00000000402e273c <+44>: ldr x0, [x0, #112]
   0x00000000402e2740 <+48>: str x28, [sp, #80]
   0x00000000402e2744 <+52>: msr tpidr_el0, x0
   0x00000000402e2748 <+56>: isb
   0x00000000402e274c <+60>: add x3, x2, #0x50
   0x00000000402e2750 <+64>: str x29, [x3]
   0x00000000402e2754 <+68>: mov x22, sp
   0x00000000402e2758 <+72>: adr x21, 0x402e2770 <sched::thread::switch_to()+96>
   0x00000000402e275c <+76>: stp x22, x21, [x2, #96]
   0x00000000402e2760 <+80>: ldp x29, x0, [x1, #80]
   0x00000000402e2764 <+84>: ldp x22, x21, [x1, #96]
   0x00000000402e2768 <+88>: mov sp, x22
   0x00000000402e276c <+92>: blr x21
   0x00000000402e2770 <+96>: ldp x20, x21, [sp, #16]
   0x00000000402e2774 <+100>: ldp x22, x23, [sp, #32]
   0x00000000402e2778 <+104>: ldp x24, x25, [sp, #48]
   0x00000000402e277c <+108>: ldp x26, x27, [sp, #64]
   0x00000000402e2780 <+112>: ldr x28, [sp, #80]
   0x00000000402e2784 <+116>: ldp x29, x30, [sp], #96
   0x00000000402e2788 <+120>: ret
End of assembler dump.

Waldek Kozaczuk

Feb 24, 2021, 4:38:02 PM2/24/21
to OSv Development
This is weird. I see the same thing. These d8-d15 registers do NOT show up in the processor::fpu_state_save() assembly.

They do not show up in the page_fault in release build:
gdb -batch -ex 'file build/release/loader.elf' -ex 'disassemble page_fault'
Dump of assembler code for function page_fault(exception_frame*):
   0x000000004020b994 <+0>: sub sp, sp, #0x270
   0x000000004020b998 <+4>: stp x29, x30, [sp]
   0x000000004020b99c <+8>: mov x29, sp
   0x000000004020b9a0 <+12>: stp x19, x20, [sp, #16]
   0x000000004020b9a4 <+16>: mov x19, x0
   0x000000004020b9a8 <+20>: stp d8, d9, [sp, #32]
   0x000000004020b9ac <+24>: stp d10, d11, [sp, #48]
   0x000000004020b9b0 <+28>: stp d12, d13, [sp, #64]
   0x000000004020b9b4 <+32>: stp d14, d15, [sp, #80]
   0x000000004020b9b8 <+36>: stp q0, q1, [sp, #96]
   0x000000004020b9bc <+40>: stp q2, q3, [sp, #128]
   0x000000004020b9c0 <+44>: stp q4, q5, [sp, #160]
   0x000000004020b9c4 <+48>: stp q6, q7, [sp, #192]
   0x000000004020b9c8 <+52>: stp q8, q9, [sp, #224]
   0x000000004020b9cc <+56>: stp q10, q11, [sp, #256]
   0x000000004020b9d0 <+60>: stp q12, q13, [sp, #288]
   0x000000004020b9d4 <+64>: stp q14, q15, [sp, #320]
   0x000000004020b9d8 <+68>: stp q16, q17, [sp, #352]
   0x000000004020b9dc <+72>: stp q18, q19, [sp, #384]
   0x000000004020b9e0 <+76>: stp q20, q21, [sp, #416]
   0x000000004020b9e4 <+80>: stp q22, q23, [sp, #448]
   0x000000004020b9e8 <+84>: stp q24, q25, [sp, #480]
   0x000000004020b9ec <+88>: add x1, sp, #0x200
   0x000000004020b9f0 <+92>: stp q26, q27, [x1]
   0x000000004020b9f4 <+96>: stp q28, q29, [x1, #32]
   0x000000004020b9f8 <+100>: add x1, sp, #0x200
   0x000000004020b9fc <+104>: stp q30, q31, [x1, #64]
   0x000000004020ba00 <+108>: mrs x1, fpsr
   0x000000004020ba04 <+112>: str w1, [sp, #608]
   0x000000004020ba08 <+116>: mrs x1, fpcr
   0x000000004020ba0c <+120>: str w1, [sp, #612]
   0x000000004020ba10 <+124>: mrs x20, far_el1
   0x000000004020ba14 <+128>: bl 0x4020cb70 <fixup_fault(exception_frame*)>
   0x000000004020ba18 <+132>: tst w0, #0xff
   0x000000004020ba1c <+136>: b.ne 0x4020ba60 <page_fault(exception_frame*)+204>  // b.any
   0x000000004020ba20 <+140>: ldr x0, [x19, #256]
   0x000000004020ba24 <+144>: cbz x0, 0x4020bad8 <page_fault(exception_frame*)+324>
   0x000000004020ba28 <+148>: mrs x0, tpidr_el0
   0x000000004020ba2c <+152>: add x0, x0, #0x0, lsl #12
   0x000000004020ba30 <+156>: add x0, x0, #0x50
   0x000000004020ba34 <+160>: ldr w0, [x0]
   0x000000004020ba38 <+164>: cbnz w0, 0x4020bb04 <page_fault(exception_frame*)+368>
   0x000000004020ba3c <+168>: ldr x0, [x19, #264]
   0x000000004020ba40 <+172>: tbnz w0, #7, 0x4020bae4 <page_fault(exception_frame*)+336>
   0x000000004020ba44 <+176>: msr daifclr, #0x2
   0x000000004020ba48 <+180>: isb
   0x000000004020ba4c <+184>: mov x1, x19
   0x000000004020ba50 <+188>: mov x0, x20
   0x000000004020ba54 <+192>: bl 0x401da290 <mmu::vm_fault(unsigned long, exception_frame*)>
   0x000000004020ba58 <+196>: msr daifset, #0x2
   0x000000004020ba5c <+200>: isb
   0x000000004020ba60 <+204>: ldp q0, q1, [sp, #96]
   0x000000004020ba64 <+208>: ldp q2, q3, [sp, #128]
   0x000000004020ba68 <+212>: ldp q4, q5, [sp, #160]
   0x000000004020ba6c <+216>: ldp q6, q7, [sp, #192]
   0x000000004020ba70 <+220>: ldp q8, q9, [sp, #224]
   0x000000004020ba74 <+224>: ldp q10, q11, [sp, #256]
   0x000000004020ba78 <+228>: ldp q12, q13, [sp, #288]
   0x000000004020ba7c <+232>: ldp q14, q15, [sp, #320]
   0x000000004020ba80 <+236>: ldp q16, q17, [sp, #352]
   0x000000004020ba84 <+240>: ldp q18, q19, [sp, #384]
   0x000000004020ba88 <+244>: ldp q20, q21, [sp, #416]
   0x000000004020ba8c <+248>: ldp q22, q23, [sp, #448]
   0x000000004020ba90 <+252>: ldp q24, q25, [sp, #480]
   0x000000004020ba94 <+256>: add x0, sp, #0x200
   0x000000004020ba98 <+260>: ldp q26, q27, [x0]
   0x000000004020ba9c <+264>: ldp q28, q29, [x0, #32]
   0x000000004020baa0 <+268>: add x0, sp, #0x200
   0x000000004020baa4 <+272>: ldp q30, q31, [x0, #64]
   0x000000004020baa8 <+276>: ldr w0, [sp, #608]
   0x000000004020baac <+280>: msr fpsr, x0
   0x000000004020bab0 <+284>: ldr w0, [sp, #612]
   0x000000004020bab4 <+288>: msr fpcr, x0
   0x000000004020bab8 <+292>: ldp x29, x30, [sp]
   0x000000004020babc <+296>: ldp x19, x20, [sp, #16]
   0x000000004020bac0 <+300>: ldp d8, d9, [sp, #32]
   0x000000004020bac4 <+304>: ldp d10, d11, [sp, #48]
   0x000000004020bac8 <+308>: ldp d12, d13, [sp, #64]
   0x000000004020bacc <+312>: ldp d14, d15, [sp, #80]
   0x000000004020bad0 <+316>: add sp, sp, #0x270
   0x000000004020bad4 <+320>: ret
   0x000000004020bad8 <+324>: adrp x0, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
   0x000000004020badc <+328>: add x0, x0, #0x248
   0x000000004020bae0 <+332>: bl 0x400e8e50 <abort(char const*, ...)>
   0x000000004020bae4 <+336>: adrp x3, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
   0x000000004020bae8 <+340>: adrp x1, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
   0x000000004020baec <+344>: adrp x0, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
   0x000000004020baf0 <+348>: add x3, x3, #0x268
   0x000000004020baf4 <+352>: add x1, x1, #0x278
   0x000000004020baf8 <+356>: add x0, x0, #0x290
   0x000000004020bafc <+360>: mov w2, #0x2f                  // #47
   0x000000004020bb00 <+364>: bl 0x400e8f50 <__assert_fail(char const*, char const*, unsigned int, char const*)>
   0x000000004020bb04 <+368>: adrp x3, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
   0x000000004020bb08 <+372>: adrp x1, 0x40558000 <_ZTSN12_GLOBAL__N_111tracepointvILj1EFSt5tupleIJPvbmEES2_bmEXadL_Z15identity_assignIJS2_bmEES1_IJDpT_EES7_EEEE+48>
   0x000000004020bb0c <+376>: adrp x0, 0x40552000
   0x000000004020bb10 <+380>: add x3, x3, #0x268
   0x000000004020bb14 <+384>: add x1, x1, #0x278
   0x000000004020bb18 <+388>: add x0, x0, #0x450
   0x000000004020bb1c <+392>: mov w2, #0x2e                  // #46
   0x000000004020bb20 <+396>: bl 0x400e8f50 <__assert_fail(char const*, char const*, unsigned int, char const*)>
   0x000000004020bb24 <+400>: ldp q0, q1, [sp, #96]
   0x000000004020bb28 <+404>: ldp q2, q3, [sp, #128]
   0x000000004020bb2c <+408>: ldp q4, q5, [sp, #160]
   0x000000004020bb30 <+412>: ldp q6, q7, [sp, #192]
   0x000000004020bb34 <+416>: ldp q8, q9, [sp, #224]
   0x000000004020bb38 <+420>: ldp q10, q11, [sp, #256]
   0x000000004020bb3c <+424>: ldp q12, q13, [sp, #288]
   0x000000004020bb40 <+428>: ldp q14, q15, [sp, #320]
   0x000000004020bb44 <+432>: ldp q16, q17, [sp, #352]
   0x000000004020bb48 <+436>: ldp q18, q19, [sp, #384]
   0x000000004020bb4c <+440>: ldp q20, q21, [sp, #416]
   0x000000004020bb50 <+444>: ldp q22, q23, [sp, #448]
   0x000000004020bb54 <+448>: ldp q24, q25, [sp, #480]
   0x000000004020bb58 <+452>: add x1, sp, #0x200
   0x000000004020bb5c <+456>: ldp q26, q27, [x1]
   0x000000004020bb60 <+460>: ldp q28, q29, [x1, #32]
   0x000000004020bb64 <+464>: add x1, sp, #0x200
   0x000000004020bb68 <+468>: ldp q30, q31, [x1, #64]
   0x000000004020bb6c <+472>: ldr w1, [sp, #608]
   0x000000004020bb70 <+476>: msr fpsr, x1
   0x000000004020bb74 <+480>: ldr w1, [sp, #612]
   0x000000004020bb78 <+484>: msr fpcr, x1
   0x000000004020bb7c <+488>: bl 0x40475940 <_Unwind_Resume>
   0x000000004020bb80 <+492>: msr daifset, #0x2
   0x000000004020bb84 <+496>: isb
   0x000000004020bb88 <+500>: b 0x4020bb24 <page_fault(exception_frame*)+400>
End of assembler dump.

Compiler bug in debug mode?

These functions are used by fpu_lock, which we use as an RAII construct that lets us save/restore those FPU registers when we handle interrupts and page faults (see the interrupt() and page_fault() functions). So we save/restore those for preemptive switches but not for the non-preemptive (cooperative) ones, which I think is correct. But maybe someone can complain.
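
For readers unfamiliar with the pattern, the RAII idea can be sketched in plain C++ as follows. This is not the OSv implementation: `fake_fpu_regs`, `fake_fpu_lock` and `handle_fault` are invented stand-ins, and the "registers" are just an array, so the sketch shows only the save-on-entry / restore-on-exit shape.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Stand-in for the FPU register file; purely illustrative.
struct fake_fpu_regs {
    std::array<std::uint64_t, 4> q{};
};

// RAII guard: snapshots the registers when constructed and puts the
// snapshot back when it goes out of scope.
class fake_fpu_lock {
public:
    explicit fake_fpu_lock(fake_fpu_regs& live) : _live(live), _saved(live) {}
    ~fake_fpu_lock() { _live = _saved; }
    fake_fpu_lock(const fake_fpu_lock&) = delete;
    fake_fpu_lock& operator=(const fake_fpu_lock&) = delete;
private:
    fake_fpu_regs& _live;   // the registers being protected
    fake_fpu_regs _saved;   // copy taken at construction
};

// Models a handler that is free to trash the FPU registers.
std::uint64_t handle_fault(fake_fpu_regs& live)
{
    fake_fpu_lock guard(live);
    live.q[0] = 0xdead;     // handler scribbles over a register
    return live.q[0];       // read before the guard restores it
}

// Usage: the interrupted code sees its registers unchanged.
void demo_fpu_lock()
{
    fake_fpu_regs regs;
    regs.q[0] = 42;
    std::uint64_t seen = handle_fault(regs);
    assert(seen == 0xdead);
    assert(regs.q[0] == 42);
}
```

The catch discussed in this thread is that with real registers the compiler itself may use the FPU registers around the guard, which a sketch like this cannot show.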

In x64 the fpu_lock is also used to handle signals, but signal handling is missing in aarch64 (see the stubbed call_signal_handler_thunk in arch/aarch64/entry.S).

Stewart Hildebrand

Feb 24, 2021, 5:07:36 PM2/24/21
to OSv Development
Makes sense, since they're not in the clobber lists in fpu_state_save(), so the compiler shouldn't have an awareness that we're modifying those registers. And in fact, we're not in the _save() function, we're just reading them.

They do not show up in the page_fault in release build:

Actually they do
 
gdb -batch -ex 'file build/release/loader.elf' -ex 'disassemble page_fault'
Dump of assembler code for function page_fault(exception_frame*):
   0x000000004020b994 <+0>: sub sp, sp, #0x270
   0x000000004020b998 <+4>: stp x29, x30, [sp]
   0x000000004020b99c <+8>: mov x29, sp
   0x000000004020b9a0 <+12>: stp x19, x20, [sp, #16]
   0x000000004020b9a4 <+16>: mov x19, x0
   0x000000004020b9a8 <+20>: stp d8, d9, [sp, #32]
   0x000000004020b9ac <+24>: stp d10, d11, [sp, #48]
   0x000000004020b9b0 <+28>: stp d12, d13, [sp, #64]
   0x000000004020b9b4 <+32>: stp d14, d15, [sp, #80]

Here
And here
No, the compiler is doing the right thing. Per the AArch64 Procedure Call Standard: "Registers v8-v15 must be preserved by a callee across subroutine calls; ... <snip>...  only the bottom 64-bits of each value stored in v8-v15 need to be preserved." I think the best path forward here is to move fpu_state_save/restore to an assembly file (*.S). I'll work on a patch to do this. I'm thinking I can add them to entry.S.

These functions are used by fpu_lock which we use as RAII construct letting us save/restore those FPU registers when we handle interrupts and page faults (see interrupt() and page_fault() method). So we save/restore those for preemptive switches but not for the non-preemptive (cooperative) ones which I think is correct. But maybe someone can complain.

I can see the thought process behind this... But if a non-preemptive (cooperative) context switch is essentially considered a subroutine call, then we need to at least save/restore d8-d15 to fully adhere to the Procedure Call Standard, because other threads might clobber those. Am I correct in guessing that we're not doing this? I guess I can investigate and try to come up with a patch for this too...

Waldek Kozaczuk

Feb 24, 2021, 6:49:05 PM2/24/21
to OSv Development
You are correct. I missed them. It looks like I need a new pair of glasses :-)  
 
Compiler bug in debug mode?

No, the compiler is doing the right thing. Per the AArch64 Procedure Call Standard: "Registers v8-v15 must be preserved by a callee across subroutine calls; ... <snip>...  only the bottom 64-bits of each value stored in v8-v15 need to be preserved." I think the best path forward here is to move fpu_state_save/restore to an assembly file (*.S). I'll work on a patch to do this. I'm thinking I can add them to entry.S.

Interesting. But given it is an inline assembly which C compiler knows nothing about (what registers are used) how does it know it needs to push registers d8-d15 in processor::fpu_state_load() to follow the Call Standard? I think this does not happen in "normal" functions, does it?

These functions are used by fpu_lock which we use as RAII construct letting us save/restore those FPU registers when we handle interrupts and page faults (see interrupt() and page_fault() method). So we save/restore those for preemptive switches but not for the non-preemptive (cooperative) ones which I think is correct. But maybe someone can complain.
I meant "explain" instead of "complain" :-) 

I can see the thought process behind this... But if a non-preemptive (cooperative) context switch is essentially considered a subroutine call, then we need to at least save/restore d8-d15 to fully adhere to the Procedure Call Standard, because other threads might clobber those. Am I correct in guessing that we're not doing this? I guess I can investigate and try come up with a patch for this too...
I think you are correct. This would be similar to what we have to do with the x19-x28 registers in switch_to(), which I am describing in one of the 30 replies ;-) So to sum up, in switch_to() we need to save/restore both the x19-x28 AND d8-d15 registers to make sure that they are restored properly on a cooperative switch.

Avi Kivity

Feb 25, 2021, 12:14:52 PM2/25/21
to Stewart Hildebrand, OSv Development

For switch_to() I think they need to be added to the clobber list. For exceptions, I think they should be added to the exception entry/exit code.
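
As a rough illustration of the clobber-list suggestion (a sketch, not the actual switch_to(); the macro and function names are invented), the empty asm statement below declares every AAPCS64 callee-saved general-purpose and SIMD/FP register as clobbered. On AArch64 the compiler is then expected to save and restore x19-x28 and the low halves of v8-v15 around it. On other targets the statement degrades to a plain compiler barrier so the snippet still builds.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__aarch64__)
// Callee-saved registers per AAPCS64: x19-x28 and the low 64 bits of v8-v15.
#define SKETCH_CALLEE_SAVED_CLOBBERS                          \
    "x19", "x20", "x21", "x22", "x23", "x24", "x25", "x26",   \
    "x27", "x28",                                             \
    "v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15"
#endif

// Placeholder for the point where another thread would run and could
// leave arbitrary values in the callee-saved registers.
inline void pretend_context_switch()
{
#if defined(__aarch64__)
    asm volatile("" ::: SKETCH_CALLEE_SAVED_CLOBBERS, "memory");
#else
    asm volatile("" ::: "memory");
#endif
}

// Integer and floating-point values that must survive the "switch".
std::uint64_t value_survives_switch(std::uint64_t x)
{
    std::uint64_t kept = x + 1;
    double scaled = static_cast<double>(x) * 2.0;
    pretend_context_switch();
    return kept + static_cast<std::uint64_t>(scaled);
}

// Usage: (5 + 1) + (5 * 2) == 16
void demo_switch_clobbers()
{
    assert(value_survives_switch(5) == 16);
}
```

Note that a clobber only makes the compiler preserve what the ABI requires of a callee, i.e. the low 64 bits of v8-v15, which matches the cooperative-switch case discussed here but not the full-state save needed on exceptions.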



These functions are used by fpu_lock which we use as RAII construct letting us save/restore those FPU registers when we handle interrupts and page faults (see interrupt() and page_fault() method). So we save/restore those for preemptive switches but not for the non-preemptive (cooperative) ones which I think is correct. But maybe someone can complain.

I can see the thought process behind this... But if a non-preemptive (cooperative) context switch is essentially considered a subroutine call, then we need to at least save/restore d8-d15 to fully adhere to the Procedure Call Standard, because other threads might clobber those. Am I correct in guessing that we're not doing this? I guess I can investigate and try come up with a patch for this too...


Right, I think it can just be added to switch_to(). x64 doesn't need it because all fpu registers are clobbered (except mxcsr and mmx state, there's some hack in switch_to()).


Avi Kivity

Feb 25, 2021, 12:25:21 PM2/25/21
to Waldek Kozaczuk, OSv Development

They're in the clobber lists of fpu_state_load(), so the compiler saves them.


Managing the fpu in C++ for aarch64 doesn't work, because the compiler uses those registers freely. It can happen on x86 too, as the compiler is allowed to use those registers for moving data around, but it is rarer. I think it's overall safer to move fpu handling to entry.S for exceptions.



These functions are used by fpu_lock which we use as RAII construct letting us save/restore those FPU registers when we handle interrupts and page faults (see interrupt() and page_fault() method). So we save/restore those for preemptive switches but not for the non-preemptive (cooperative) ones which I think is correct. But maybe someone can complain.
I meant "explain" instead of "complain" :-)


"complain" works well in the area of cpu/asm integration arcana. But you're correct.



I can see the thought process behind this... But if a non-preemptive (cooperative) context switch is essentially considered a subroutine call, then we need to at least save/restore d8-d15 to fully adhere to the Procedure Call Standard, because other threads might clobber those. Am I correct in guessing that we're not doing this? I guess I can investigate and try come up with a patch for this too...
I think you are correct. This would be similar to what we have to do with x19-x28 registers in switch_to() which I am describing in one of the 30 replies ;-) So to sum it in switch_to() we need to save/restore both x19-x28 AND d8-15 registers to make sure that in cooperative switch they would be restored properly.


Yes. And fpu_lock should be migrated out of C++, and probably be part of exception_frame.

