[PATCH v0] aarch64: implement reschedule_from_interrupt() in assembly

Waldemar Kozaczuk

Mar 3, 2021, 4:57:18 PM
to osv...@googlegroups.com, Waldemar Kozaczuk
As issue #1124 and the mailing list thread https://groups.google.com/g/osv-dev/c/ystZElFNdBI
explain, some tests in the aarch64 debug build
fail in a consistent manner with page faults, suggesting that
the values of parameters passed from function to function somehow get
corrupted. After painful research, I pinpointed the crashes to the
reschedule_from_interrupt() method, which does not work correctly in
voluntary (non-preemptive) context switch scenarios where the
thread::yield() method is called by an app. It turns out that in those
situations the so-called callee-saved registers used by an app function
calling thread::yield() are not saved during the context switch and
are not restored when such a thread gets resumed later. The following steps
illustrate such a scenario:

1. Function F pushes one of the callee-saved registers - x23 (just an example) - on the T1
stack because it uses it for something, as it must per the calling convention.
2. Function F stores some value in x23.
3. Function F calls thread::yield() directly or indirectly.
4. Eventually, reschedule_from_interrupt() is called and it calls switch_to() to switch
stack pointer to the new thread T2 stack. The debug version of neither reschedule_from_interrupt()
nor switch_to() stores x23 as they do not use this register (unlike the release version).
5. At some point, later reschedule_from_interrupt() is called again (not necessarily the very next time)
and calls switch_to() that switches back to T1.
6. T1 resumes and eventually returns control to function F right after its call to yield().
7. The code in F after calling yield() reads the value of x23 ... and boom. The x23 register quite likely
contains garbage: F never saved and restored it around the call because, per the calling
convention, yield() or its callees should have preserved it. But they did not, did they?
Rather, various routines on other threads running on the same cpu in the meantime clobbered it.

One way to solve this problem would be to add all callee-saved registers
(x19-x28 and d8-d15 per https://github.com/ARM-software/abi-aa/blob/f52e1ad3f81254497a83578dc102f6aac89e52d0/aapcs64/aapcs64.rst -
see "6.1.1 General-purpose Registers" and "6.1.2 SIMD and Floating-Point Registers" about the v8-v15 registers)
to the list of clobbered registers of the inline assembly in
thread::switch_to() in arch/aarch64/arch-switch.hh. But this does not
always work, as some versions of gcc (see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99342), depending on
the optimization level (-O0, -O2, etc.), may or may not inline switch_to() into
reschedule_from_interrupt() and generate machine code that does not match
our intent. On top of that, the code in reschedule_from_interrupt()
that destroys any terminated thread on the current cpu uses the thread-local
storage (TLS) pointer register - tpidr_el0 - which is not re-read
after switch_to() and causes interesting crashes, as issue #1121
describes.

Given all of the above, and the fact that reschedule_from_interrupt() calls
switch_to() to switch threads - which is very tricky to do right using
pure C++ and some inline assembly - this patch implements a high-level
version of reschedule_from_interrupt() in assembly for aarch64 (the x64
version stays as is).

This patch modifies include/osv/sched.hh and core/sched.cc with a number
of `#ifdef AARCH64_PORT_STUB` blocks to keep reschedule_from_interrupt() as is
for x64 and to split it into two new methods for aarch64 - schedule_next_thread() and
destroy_current_cpu_terminating_thread(). The two new
functions correspond to the parts of reschedule_from_interrupt() up
to the switch_to() call and after it. Please note that
schedule_next_thread() returns a simple two-field struct with pointers to
the old and new thread state (pc, sp, tcb, etc.) that gets used by
reschedule_from_interrupt().

On top of this, this patch adds arch/aarch64/sched.S, which implements
reschedule_from_interrupt() as a C function in assembly. In essence, reschedule_from_interrupt()
calls the new schedule_next_thread(), saves all callee-saved
registers on the old thread's stack, executes a context switch as the old switch_to()
did, restores the callee-saved registers from the new thread's stack and
finally calls destroy_current_cpu_terminating_thread().

This patch has been tested with the -O0, -O1 and -O2 optimization levels
(release, debug 1 and 2) on real ARM hardware and in emulated (QEMU TCG) mode.

Please note that the new code makes the context switch a little bit more
expensive (~3%) per these numbers:

BEFORE:
taskset -c 2-5 ./scripts/run.py -c 1 -e '/tests/misc-ctxsw.so'
OSv v0.55.0-204-g837a9b5d
getauxval() stubbed
eth0: 192.168.122.15
Booted up in 58.19 ms
Cmdline: /tests/misc-ctxsw.so
664 colocated
665 apart
665 nopin

AFTER:
taskset -c 2-5 ./scripts/run.py -c 1 -e '/tests/misc-ctxsw.so'
OSv v0.55.0-205-g7e7c49f7
getauxval() stubbed
eth0: 192.168.122.15
Booted up in 58.20 ms
Cmdline: /tests/misc-ctxsw.so
683 colocated
690 apart
690 nopin

Fixes #1121
Fixes #1124

Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
---
Makefile | 1 +
arch/aarch64/arch-switch.hh | 38 +++--------------
arch/aarch64/arch-thread-state.hh | 1 +
arch/aarch64/sched.S | 70 +++++++++++++++++++++++++++++++
core/sched.cc | 63 ++++++++++++++++++++++++++--
include/osv/sched.hh | 34 ++++++++++++++-
6 files changed, 170 insertions(+), 37 deletions(-)
create mode 100644 arch/aarch64/sched.S

diff --git a/Makefile b/Makefile
index bbc57252..e1866c3b 100644
--- a/Makefile
+++ b/Makefile
@@ -878,6 +878,7 @@ objects += arch/$(arch)/memset.o
objects += arch/$(arch)/memcpy.o
objects += arch/$(arch)/memmove.o
objects += arch/$(arch)/tlsdesc.o
+objects += arch/$(arch)/sched.o
objects += $(libfdt)
endif

diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh
index dff7467c..1a78d132 100644
--- a/arch/aarch64/arch-switch.hh
+++ b/arch/aarch64/arch-switch.hh
@@ -20,32 +20,6 @@ void thread_main_c(sched::thread* t);

namespace sched {

-void thread::switch_to()
-{
- thread* old = current();
- asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");
-
- asm volatile("\n"
- "str x29, %0 \n"
- "mov x2, sp \n"
- "adr x1, 1f \n" /* address of label */
- "stp x2, x1, %1 \n"
-
- "ldp x29, x0, %2 \n"
- "ldp x2, x1, %3 \n"
-
- "mov sp, x2 \n"
- "blr x1 \n"
-
- "1: \n" /* label */
- :
- : "Q"(old->_state.fp), "Ump"(old->_state.sp),
- "Ump"(this->_state.fp), "Ump"(this->_state.sp)
- : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
- "x9", "x10", "x11", "x12", "x13", "x14", "x15",
- "x16", "x17", "x18", "x30", "memory");
-}
-
void thread::switch_to_first()
{
asm volatile ("msr tpidr_el0, %0; isb; " :: "r"(_tcb) : "memory");
@@ -59,14 +33,13 @@ void thread::switch_to_first()

asm volatile("\n"
"ldp x29, x0, %0 \n"
- "ldp x2, x1, %1 \n"
- "mov sp, x2 \n"
- "blr x1 \n"
+ "ldp x22, x21, %1 \n"
+ "mov sp, x22 \n"
+ "blr x21 \n"
:
: "Ump"(this->_state.fp), "Ump"(this->_state.sp)
- : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
- "x9", "x10", "x11", "x12", "x13", "x14", "x15",
- "x16", "x17", "x18", "x30", "memory");
+ : "x0", "x19", "x20", "x21", "x22", "x23", "x24",
+ "x25", "x26", "x27", "x28", "x29", "x30", "memory");
}

void thread::init_stack()
@@ -150,6 +123,7 @@ void thread::setup_tcb()
void* p = aligned_alloc(64, total_tls_size + sizeof(*_tcb));
_tcb = (thread_control_block *)p;
_tcb[0].tls_base = &_tcb[1];
+ _state.tcb = p;
//
// First goes kernel TLS data
auto kernel_tls = _tcb[0].tls_base;
diff --git a/arch/aarch64/arch-thread-state.hh b/arch/aarch64/arch-thread-state.hh
index af424472..6f1b680d 100644
--- a/arch/aarch64/arch-thread-state.hh
+++ b/arch/aarch64/arch-thread-state.hh
@@ -14,6 +14,7 @@ struct thread_state {

void* sp;
void* pc;
+ void* tcb;
};

#endif /* ARCH_THREAD_STATE_HH_ */
diff --git a/arch/aarch64/sched.S b/arch/aarch64/sched.S
new file mode 100644
index 00000000..1e69f8fd
--- /dev/null
+++ b/arch/aarch64/sched.S
@@ -0,0 +1,70 @@
+#void cpu::reschedule_from_interrupt(bool called_from_yield,
+# thread_runtime::duration preempt_after)
+.global reschedule_from_interrupt
+.type reschedule_from_interrupt, @function
+reschedule_from_interrupt:
+ #.cfi_startproc
+ stp x29, x30, [sp, #-176]!
+ #.cfi_adjust_cfa_offset 176
+ mov x29, sp
+ #.cfi_def_cfa_register x29
+
+ #Call cpu_schedule_next_thread() to determine next thread to switch to if any
+ bl cpu_schedule_next_thread
+
+ #The cpu_schedule_next_thread returns thread_switch_data in x0 and x1,
+ #where x0 holds old_thread_state and x1 holds new_thread_state
+ #If cpu_schedule_next_thread() returns thread_switch_data with null pointers, exit
+ cmp x0, #0
+ b.eq 2f
+
+ #Store all callee-save registers on the old thread stack
+ stp x19, x20, [sp, #16]
+ stp x21, x22, [sp, #32]
+ stp x23, x24, [sp, #48]
+ stp x25, x26, [sp, #64]
+ stp x27, x28, [sp, #80]
+
+ stp d8, d9, [sp, #96]
+ stp d10, d11, [sp, #112]
+ stp d12, d13, [sp, #128]
+ stp d14, d15, [sp, #144]
+
+ #Switch to new thread
+ ldr x2, [x1, #32] //Fetch _tcb of the new thread
+ msr tpidr_el0, x2 //Set thread pointer
+ isb
+
+ str x29, [x0, #0] //Save frame pointer of the old thread
+ mov x3, sp //Fetch old thread stack pointer
+ adr x4, 1f //Fetch old thread instruction pointer
+ stp x3, x4, [x0, #16] //Save old thread sp and pc
+
+ ldp x29, x0, [x1, #0] //Set frame pointer of the new thread and this (x0) of the new thread
+ //Please note that the pc may point to thread_main_c(thread*) which is
+ //why we have to set x0 (1st argument) to new thread object
+ ldp x3, x4, [x1, #16] //Fetch new thread sp and pc
+ mov sp, x3 //Set new thread stack pointer
+ blr x4 //Jump to the new thread pc
+
+1:
+ #Restore all callee-save registers from the new thread stack
+ ldp x19, x20, [sp, #16]
+ ldp x21, x22, [sp, #32]
+ ldp x23, x24, [sp, #48]
+ ldp x25, x26, [sp, #64]
+ ldp x27, x28, [sp, #80]
+
+ ldp d8, d9, [sp, #96]
+ ldp d10, d11, [sp, #112]
+ ldp d12, d13, [sp, #128]
+ ldp d14, d15, [sp, #144]
+
+ #Call destroy_current_cpu_terminating_thread()
+ bl destroy_current_cpu_terminating_thread
+
+2:
+ ldp x29, x30, [sp], #176
+ #.cfi_def_cfa sp, 0
+ ret
+ #.cfi_endproc
diff --git a/core/sched.cc b/core/sched.cc
index 06f849d1..74b4ab95 100644
--- a/core/sched.cc
+++ b/core/sched.cc
@@ -225,13 +225,34 @@ void thread::cputime_estimator_get(
void cpu::schedule()
{
WITH_LOCK(irq_lock) {
+#ifdef AARCH64_PORT_STUB
+ reschedule_from_interrupt(sched::cpu::current(), false, thyst);
+#else
current()->reschedule_from_interrupt();
- }
-}
-
+#endif
+ }
+}
+
+// In the aarch64 port, reschedule_from_interrupt() needs to be implemented
+// in assembly (please see arch/aarch64/sched.S) to give us better control
+// over which registers are used and which ones are saved and restored during
+// the context switch. In essence, most of the original reschedule_from_interrupt()
+// code up to switch_to() is shared with the aarch64-specific cpu::schedule_next_thread()
+// that is called from reschedule_from_interrupt() in the arch/aarch64/sched.S assembly.
+// The logic executed after switch_to(), which makes up destroy_current_cpu_terminating_thread(),
+// is called from arch/aarch64/sched.S as well.
+// In the end, we define reschedule_from_interrupt() in C++ for x86_64, and schedule_next_thread()
+// and destroy_current_cpu_terminating_thread() in C++ for aarch64.
+#ifdef AARCH64_PORT_STUB
+inline thread_switch_data cpu::schedule_next_thread(bool called_from_yield,
+ thread_runtime::duration preempt_after)
+{
+ thread_switch_data switch_data;
+#else
void cpu::reschedule_from_interrupt(bool called_from_yield,
thread_runtime::duration preempt_after)
{
+#endif
trace_sched_sched();
assert(sched::exception_depth <= 1);
need_reschedule = false;
@@ -259,7 +280,11 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
// lowest runtime, and update the timer until the next thread's turn.
if (runqueue.empty()) {
preemption_timer.cancel();
+#ifdef AARCH64_PORT_STUB
+ return switch_data;
+#else
return;
+#endif
} else if (!called_from_yield) {
auto &t = *runqueue.begin();
if (p->_runtime.get_local() < t._runtime.get_local()) {
@@ -268,7 +293,11 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
if (delta > 0) {
preemption_timer.set(now + delta);
}
+#ifdef AARCH64_PORT_STUB
+ return switch_data;
+#else
return;
+#endif
}
}
// If we're here, p no longer has the lowest runtime. Before queuing
@@ -336,19 +365,39 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
if (lazy_flush_tlb.exchange(false, std::memory_order_seq_cst)) {
mmu::flush_tlb_local();
}
+#ifdef AARCH64_PORT_STUB
+ switch_data.old_thread_state = &(p->_state);
+ switch_data.new_thread_state = &(n->_state);
+ return switch_data;
+}
+#else
n->switch_to();
+#endif

// Note: after the call to n->switch_to(), we should no longer use any of
// the local variables, nor "this" object, because we just switched to n's
// stack and the values we can access now are those that existed in the
// reschedule call which scheduled n out, and will now be returning.
// So to get the current cpu, we must use cpu::current(), not "this".
+#ifdef AARCH64_PORT_STUB
+extern "C" void destroy_current_cpu_terminating_thread()
+{
+#endif
if (cpu::current()->terminating_thread) {
cpu::current()->terminating_thread->destroy();
cpu::current()->terminating_thread = nullptr;
}
}

+#ifdef AARCH64_PORT_STUB
+extern "C" thread_switch_data cpu_schedule_next_thread(cpu* cpu,
+ bool called_from_yield,
+ thread_runtime::duration preempt_after)
+{
+ return cpu->schedule_next_thread(called_from_yield, preempt_after);
+}
+#endif
+
void cpu::timer_fired()
{
// nothing to do, preemption will happen if needed
@@ -521,7 +570,11 @@ void thread::pin(cpu *target_cpu)
// Note that wakeme is on the same CPU, and irq is disabled,
// so it will not actually run until we stop running.
wakeme->wake_with([&] { do_wakeme = true; });
+#ifdef AARCH64_PORT_STUB
+ reschedule_from_interrupt(source_cpu, false, thyst);
+#else
source_cpu->reschedule_from_interrupt();
+#endif
}
// wakeme will be implicitly join()ed here.
}
@@ -760,7 +813,11 @@ void thread::yield(thread_runtime::duration preempt_after)
}
trace_sched_yield_switch();

+#ifdef AARCH64_PORT_STUB
+ reschedule_from_interrupt(cpu::current(), true, preempt_after);
+#else
cpu::current()->reschedule_from_interrupt(true, preempt_after);
+#endif
}

void thread::set_priority(float priority)
diff --git a/include/osv/sched.hh b/include/osv/sched.hh
index 0fb44f77..1dacd623 100644
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -31,6 +31,9 @@ typedef float runtime_t;

extern "C" {
void smp_main();
+#ifdef AARCH64_PORT_STUB
+void destroy_current_cpu_terminating_thread();
+#endif
};
void smp_launch();

@@ -321,7 +324,12 @@ constexpr thread_runtime::duration thyst = std::chrono::milliseconds(5);
constexpr thread_runtime::duration context_switch_penalty =
std::chrono::microseconds(10);

-
+#ifdef AARCH64_PORT_STUB
+struct thread_switch_data {
+ thread_state* old_thread_state = nullptr;
+ thread_state* new_thread_state = nullptr;
+};
+#endif
/**
* OSv thread
*/
@@ -600,7 +608,9 @@ private:
unsigned allowed_initial_states_mask = 1 << unsigned(status::waiting));
static void sleep_impl(timer &tmr);
void main();
+#ifndef AARCH64_PORT_STUB
void switch_to();
+#endif
void switch_to_first();
void prepare_wait();
void wait();
@@ -702,7 +712,12 @@ private:
std::vector<char*> _tls;
bool _app;
std::shared_ptr<osv::application_runtime> _app_runtime;
+public: //TODO: For whatever reason we need to make this public even though destroy_current_cpu_terminating_thread() is declared as a friend below; the compiler still complains otherwise.
void destroy();
+private:
+#ifdef AARCH64_PORT_STUB
+ friend void ::destroy_current_cpu_terminating_thread();
+#endif
friend class thread_ref_guard;
friend void thread_main_c(thread* t);
friend class wait_guard;
@@ -884,8 +899,13 @@ struct cpu : private timer_base::client {
* @param preempt_after fire the preemption timer after this time period
* (relevant only when "called_from_yield" is TRUE)
*/
+#ifdef AARCH64_PORT_STUB
+ thread_switch_data schedule_next_thread(bool called_from_yield = false,
+ thread_runtime::duration preempt_after = thyst);
+#else
void reschedule_from_interrupt(bool called_from_yield = false,
- thread_runtime::duration preempt_after = thyst);
+ thread_runtime::duration preempt_after = thyst);
+#endif
void enqueue(thread& t);
void init_idle_thread();
virtual void timer_fired() override;
@@ -895,6 +915,12 @@ struct cpu : private timer_base::client {
int renormalize_count;
};

+#ifdef AARCH64_PORT_STUB
+extern "C" void reschedule_from_interrupt(cpu* cpu,
+ bool called_from_yield,
+ thread_runtime::duration preempt_after);
+#endif
+
class cpu::notifier {
public:
explicit notifier(std::function<void ()> cpu_up);
@@ -996,7 +1022,11 @@ inline bool preemptable()
inline void preempt()
{
if (preemptable()) {
+#ifdef AARCH64_PORT_STUB
+ reschedule_from_interrupt(sched::cpu::current(), false, thyst);
+#else
sched::cpu::current()->reschedule_from_interrupt();
+#endif
} else {
// preempt_enable() will pick this up eventually
need_reschedule = true;
--
2.27.0

Waldek Kozaczuk

Mar 3, 2021, 5:50:45 PM
to OSv Development
BTW, the same test app compiled as a PIE (g++ tests/misc-ctxsw.cc -lpthread -o misc-ctxsw) and run natively on the Linux host yields the following on the same hardware:

taskset -c 2-2 ./misc-ctxsw
      4075 colocated
      5319 apart
      1813 nopin

Stewart Hildebrand

Mar 5, 2021, 10:13:10 AM
to OSv Development
On Wednesday, March 3, 2021 at 5:50:45 PM UTC-5 jwkoz...@gmail.com wrote:
On Wed, Mar 3, 2021 at 4:57 PM Waldemar Kozaczuk <jwkoz...@gmail.com> wrote:
Thank you so much for all your work investigating this issue.
 


So, after having attempted to file a bug with GCC and been proven wrong, I've discovered that the following bit of GCC documentation is incorrect:
"When the compiler selects which registers to use to represent input and output operands, it does not use any of the clobbered registers."

In fact, the compiler does use clobbered registers to represent input operands. For more details and an example of this, see the following (invalid) bug report: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99342

It just happened that we were lucky that the compiler hadn't chosen any clobbered registers to represent input operands, until we tried it with GCC 10.2. While we could have solved this by using earlyclobbers (as was suggested in the linked GCC bug report thread), we still wouldn't have satisfied the following requirement from GCC documentation: "the compiler requires the value of the stack pointer to be the same after an asm statement as it was on entry to the statement".
I personally favor correctness over performance...
This fails when trying to compile with GCC 7.5.0:
In file included from core/sched.cc:126:0:
arch/aarch64/arch-switch.hh: In member function ‘void sched::thread::switch_to_first()’:
arch/aarch64/arch-switch.hh:43:1: error: x29 cannot be used in asm here
 }
 ^

Removing x29 from the clobber list allows it to build again with GCC 7.5.0.

You're using x21/x22 instead of x1/x2, and I'm guessing this is an attempt to trick the compiler into not using the clobbered registers to represent the input operands?
Apparently the way to tell the compiler not to use clobbered registers to represent input operands is to use earlyclobbers. Here's the obscure syntax for turning them into earlyclobbers:

asm volatile("\n"
             "ldp x29, x0, %2  \n"
             "ldp x22, x21, %3 \n"
             "mov sp, x22      \n"
             "blr x21          \n"
             : // No output operands - this is to designate the input operands as earlyclobbers
               "=&Ump"(this->_state.fp), "=&Ump"(this->_state.sp)
             : "Ump"(this->_state.fp), "Ump"(this->_state.sp)
             : "x0", "x19", "x20", "x21", "x22", "x23", "x24",
               "x25", "x26", "x27", "x28", "x30", "memory");

The operands apparently have to be listed as both outputs and inputs even if they're only being read in the asm.

It looks like there are other cases throughout the OSv codebase where this could be an issue. I don't think those other potential issues should block this patch from going in.

We're still violating the requirement that the stack pointer should be the same value after an asm statement as it was upon entry, so it seems that switch_to_first() should also be moved to assembly (eventually). Again, I don't think that should block this patch from going in.

Waldek Kozaczuk

Mar 8, 2021, 6:17:15 PM
to OSv Development
It would also be nice to see what others think about the C++ code changes - the #ifdefs, the extra C functions, etc. Is there a better way to re-structure the code?


This is a pretty earth-shattering statement, which seems to imply that even for x86_64 we might no longer be lucky about which registers the compiler chooses to use, so we may need to re-write switch_to() in assembly as well at some point. Clearly, the purpose of the inline assembly in switch_to() is to switch the stack.
Honestly, at this point, we might skip touching switch_to_first() altogether and, as you point out, re-write it in assembly as well. The switch_to_first() function is only used once per cpu on startup, so I do not think it is as essential to fix it now. What do you think?

BTW I think I also missed adding d8-d15 registers to the list of clobbers.

Waldek Kozaczuk

Mar 8, 2021, 10:48:59 PM
to OSv Development
I think that one more register needs to be saved and restored during the context switch - FPCR (the Floating-Point Control Register):

"The FPSR is a status register that holds the cumulative exception bits of the floating-point unit. It contains the fields IDC, IXC, UFC, OFC, DZC, IOC and QC. These fields are not preserved across a public interface and may have any value on entry to a subroutine.

The FPCR is used to control the behavior of the floating-point unit. It is a global register with the following properties.

  • The exception-control bits (8-12), rounding mode bits (22-23) and flush-to-zero bits (24) may be modified by calls to specific support functions that affect the global state of the application.
  • All other bits are reserved and must not be modified. It is not defined whether the bits read as zero or one, or whether they are preserved across a public interface."
Saving and restoring it makes tst-fpu.cc work now for aarch64 where yield() is used.

I do not think we need to save/restore FPSR, the status register.

Waldemar Kozaczuk

Mar 15, 2021, 1:18:35 AM
to osv...@googlegroups.com, Waldemar Kozaczuk
Compared to the original v0 version, this one refines switch_to_first()
to use earlyclobbers. The new context-switch code also saves/restores one extra
register, FPCR, to make tst-fpu.cc work correctly.
Makefile | 1 +
arch/aarch64/arch-switch.hh | 43 ++++--------------
arch/aarch64/arch-thread-state.hh | 1 +
arch/aarch64/sched.S | 75 +++++++++++++++++++++++++++++++
core/sched.cc | 63 ++++++++++++++++++++++++--
include/osv/sched.hh | 34 +++++++++++++-
6 files changed, 178 insertions(+), 39 deletions(-)
create mode 100644 arch/aarch64/sched.S

diff --git a/Makefile b/Makefile
index e6867cac..d8f888c4 100644
--- a/Makefile
+++ b/Makefile
@@ -879,6 +879,7 @@ objects += arch/$(arch)/memset.o
objects += arch/$(arch)/memcpy.o
objects += arch/$(arch)/memmove.o
objects += arch/$(arch)/tlsdesc.o
+objects += arch/$(arch)/sched.o
objects += $(libfdt)
endif

diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh
index dff7467c..bbf832db 100644
@@ -58,15 +32,15 @@ void thread::switch_to_first()
remote_thread_local_var(percpu_base) = _detached_state->_cpu->percpu_base;

asm volatile("\n"
- "ldp x29, x0, %0 \n"
- "ldp x2, x1, %1 \n"
- "mov sp, x2 \n"
- "blr x1 \n"
- :
+ "ldp x29, x0, %2 \n"
+ "ldp x22, x21, %3 \n"
+ "mov sp, x22 \n"
+ "blr x21 \n"
+ : // No output operands - this is to designate the input operands as earlyclobbers
+ "=&Ump"(this->_state.fp), "=&Ump"(this->_state.sp)
: "Ump"(this->_state.fp), "Ump"(this->_state.sp)
- : "x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8",
- "x9", "x10", "x11", "x12", "x13", "x14", "x15",
- "x16", "x17", "x18", "x30", "memory");
+ : "x0", "x19", "x20", "x21", "x22", "x23", "x24",
+ "x25", "x26", "x27", "x28", "x30", "memory");
}

void thread::init_stack()
@@ -150,6 +124,7 @@ void thread::setup_tcb()
void* p = aligned_alloc(64, total_tls_size + sizeof(*_tcb));
_tcb = (thread_control_block *)p;
_tcb[0].tls_base = &_tcb[1];
+ _state.tcb = p;
//
// First goes kernel TLS data
auto kernel_tls = _tcb[0].tls_base;
diff --git a/arch/aarch64/arch-thread-state.hh b/arch/aarch64/arch-thread-state.hh
index af424472..6f1b680d 100644
--- a/arch/aarch64/arch-thread-state.hh
+++ b/arch/aarch64/arch-thread-state.hh
@@ -14,6 +14,7 @@ struct thread_state {

void* sp;
void* pc;
+ void* tcb;
};

#endif /* ARCH_THREAD_STATE_HH_ */
diff --git a/arch/aarch64/sched.S b/arch/aarch64/sched.S
new file mode 100644
index 00000000..98603393
--- /dev/null
+++ b/arch/aarch64/sched.S
@@ -0,0 +1,75 @@
+#void cpu::reschedule_from_interrupt(bool called_from_yield,
+# thread_runtime::duration preempt_after)
+.global reschedule_from_interrupt
+.type reschedule_from_interrupt, @function
+reschedule_from_interrupt:
+ stp x29, x30, [sp, #-192]!
+ mov x29, sp
+
+ #Call cpu_schedule_next_thread() to determine next thread to switch to if any
+ bl cpu_schedule_next_thread
+
+ #The cpu_schedule_next_thread returns thread_switch_data in x0 and x1,
+ #where x0 holds old_thread_state and x1 holds new_thread_state
+ #If cpu_schedule_next_thread() returns thread_switch_data with null pointers, exit
+ cmp x0, #0
+ b.eq 2f
+
+ #Store all regular callee-save registers on the old thread stack
+ stp x19, x20, [sp, #16]
+ stp x21, x22, [sp, #32]
+ stp x23, x24, [sp, #48]
+ stp x25, x26, [sp, #64]
+ stp x27, x28, [sp, #80]
+
+ #Store all SIMD/FP callee-save registers on the old thread stack
+ stp d8, d9, [sp, #96]
+ stp d10, d11, [sp, #112]
+ stp d12, d13, [sp, #128]
+ stp d14, d15, [sp, #144]
+
+ #Store FP Control Register with flags that control rounding, etc
+ mrs x2, fpcr
+ str x2, [sp, #160]
+
+ #Switch to new thread
+ ldr x2, [x1, #32] //Fetch _tcb of the new thread
+ msr tpidr_el0, x2 //Set thread pointer
+ isb
+
+ str x29, [x0, #0] //Save frame pointer of the old thread
+ mov x3, sp //Fetch old thread stack pointer
+ adr x4, 1f //Fetch old thread instruction pointer
+ stp x3, x4, [x0, #16] //Save old thread sp and pc
+
+ ldp x29, x0, [x1, #0] //Set frame pointer of the new thread and this (x0) of the new thread
+ //Please note that the pc may point to thread_main_c(thread*) which is
+ //why we have to set x0 (1st argument) to new thread object
+ ldp x3, x4, [x1, #16] //Fetch new thread sp and pc
+ mov sp, x3 //Set new thread stack pointer
+ blr x4 //Jump to the new thread pc
+
+1:
+ #Restore all regular callee-save registers from the new thread stack
+ ldp x19, x20, [sp, #16]
+ ldp x21, x22, [sp, #32]
+ ldp x23, x24, [sp, #48]
+ ldp x25, x26, [sp, #64]
+ ldp x27, x28, [sp, #80]
+
+ #Restore all SIMD/FP callee-save registers from the new thread stack
+ ldp d8, d9, [sp, #96]
+ ldp d10, d11, [sp, #112]
+ ldp d12, d13, [sp, #128]
+ ldp d14, d15, [sp, #144]
+
+ #Restore FP Control Register with flags that control rounding, etc
+ ldr x2, [sp, #160]
+ msr fpcr, x2
+
+ #Call destroy_current_cpu_terminating_thread()
+ bl destroy_current_cpu_terminating_thread
+
+2:
+ ldp x29, x30, [sp], #192
+ ret
index 0fb44f77..1b2847be 100644

Commit Bot

Mar 15, 2021, 6:50:11 AM
to osv...@googlegroups.com, Waldemar Kozaczuk
From: Waldemar Kozaczuk <jwkoz...@gmail.com>
Committer: Nadav Har'El <n...@scylladb.com>
Branch: master

aarch64: implement reschedule_from_interrupt() in assembly
Message-Id: <20210315051824.8...@gmail.com>

---
diff --git a/Makefile b/Makefile
--- a/Makefile
+++ b/Makefile
@@ -879,6 +879,7 @@ objects += arch/$(arch)/memset.o
objects += arch/$(arch)/memcpy.o
objects += arch/$(arch)/memmove.o
objects += arch/$(arch)/tlsdesc.o
+objects += arch/$(arch)/sched.o
objects += $(libfdt)
endif

diff --git a/arch/aarch64/arch-switch.hh b/arch/aarch64/arch-switch.hh
--- a/arch/aarch64/arch-thread-state.hh
+++ b/arch/aarch64/arch-thread-state.hh
@@ -14,6 +14,7 @@ struct thread_state {

void* sp;
void* pc;
+ void* tcb;
};

#endif /* ARCH_THREAD_STATE_HH_ */
diff --git a/arch/aarch64/sched.S b/arch/aarch64/sched.S
--- a/arch/aarch64/sched.S

Nadav Har'El

Mar 15, 2021, 7:03:00 AM
to Waldemar Kozaczuk, Osv Dev
Thanks, very impressive work!
The code looks very reasonable to me, but I can't really read the arm assembly to verify every small detail. I'm sure you are testing it much better than I can, and it's really great to see how the ARM port is coming along.

I committed this patch as-is, but I have a couple of ideas for future consideration:

1. I see you used the "AARCH64_PORT_STUB" ifdef. I think the original intention of this macro, as its name suggests, was to mark code that was done as a quick hack or stub for aarch64 - and should be revisited in the future. If this isn't quite the case here, and it's not a stub but really fully-working code, maybe you should have used the regular #ifdef __arch64__ (I think), and stop using the AARCH64_PORT_STUB in new code?

2. You split off the "schedule_next_thread" function just for aarch64, with an #if putting the same code in either schedule_next_thread or reschedule_from_interrupt. I wonder if we couldn't have the same schedule_next_thread also for x86, so we don't need a special case for aarch64. Hopefully (?) the extra function won't cause any performance deterioration for x86 if it's inlined and optimized. Moreover, from this discussion it's not entirely clear to me why this whole C function worked correctly in x86 - maybe it was just luck and in the future we'll need to do this for x86 as well.

--
Nadav Har'El
n...@scylladb.com



Waldek Kozaczuk

Mar 18, 2021, 1:40:27 PM
to Nadav Har'El, Osv Dev
On Mon, Mar 15, 2021 at 7:02 AM Nadav Har'El <n...@scylladb.com> wrote:
1. I see you used the "AARCH64_PORT_STUB" ifdef. I think the original intention of this macro, as its name suggests, was to mark code that was done as a quick hack or stub for aarch64 - and should be revisited in the future. If this isn't quite the case here, and it's not a stub but really fully-working code, maybe you should have used the regular #ifdef __arch64__ (I think), and stop using the AARCH64_PORT_STUB in new code?
Very good idea. I will try to prepare a patch that fixes it. 

2. You split off the "schedule_next_thread" function just for aarch64, with an #if putting the same code in either schedule_next_thread or reschedule_from_interrupt. I wonder if we couldn't have the same schedule_next_thread also for x86, so we don't need a special case for aarch64. Hopefully (?) the extra function won't cause any performance deterioration for x86 if it's inlined and optimized. Moreover, from this discussion it's not entirely clear to me why this whole C function worked correctly in x86 - maybe it was just luck and in the future we'll need to do this for x86 as well.
I think it is worth creating an issue. 