[PATCH] threads: allocate application stacks lazily to save memory


Waldemar Kozaczuk

Jan 29, 2020, 8:14:30 AM
to osv...@googlegroups.com, Waldemar Kozaczuk, Matthew Pabst
As issue #143 explains, application stacks are currently eagerly allocated
(aka pre-populated), which may lead to substantial memory waste.
We want those stacks to be lazily allocated instead, so that physical
memory behind the stack gets allocated as needed by the standard fault handling mechanism.

The page fault handling logic requires that both interrupts and preemption
are enabled. Therefore this patch implements a simple strategy: read a single byte
on the stack one page (4K) deeper before disabling interrupts or preemption.

It turns out we cannot read that single byte "blindly", because in some scenarios
interrupts might already be disabled when we are about to disable preemption, and vice versa.
Similarly (and this is more speculative), disabling preemption (and possibly interrupts) may nest.
Therefore we need a "conditional" read of that byte on the stack: more specifically,
we should ONLY read the byte IF both interrupts and preemption are enabled.

This patch achieves this by adding an extra thread-local counter that tracks interrupt
disabling/enabling and operates in a similar way to the preemption counter. Every time interrupts
are disabled we increment the counter, and we decrement it every time interrupts are enabled again.
Then, every time before we disable interrupts or preemption, we check whether both counters
are 0, and only then read a byte to potentially trigger a fault.
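
In pseudo-code the conditional read boils down to the following. This is just a condensed C++ sketch of what the diff below adds in include/osv/counters.hh and arch/x64/arch.hh; the real code additionally handles counter initialization and the AArch64 stub:
```
#include <cstdint>

// Condensed sketch of the mechanism introduced by this patch (see
// include/osv/counters.hh and arch/x64/arch.hh in the diff below).
// Both counters share one union so that a single 64-bit load can tell
// whether either of them is non-zero.
union counters {
    struct {
        unsigned preempt;   // bumped by preempt_disable()
        unsigned irq;       // bumped by irq_disable() and friends
    };
    uint64_t any_is_on;     // non-zero if either counter is non-zero
};

// One instance per thread; the patch defines it in core/sched.cc.
extern counters __thread irq_preempt_counters;

inline void ensure_next_stack_page()
{
    // Pre-fault only when both interrupts and preemption are enabled,
    // i.e. when the page-fault handler is actually allowed to run.
    if (irq_preempt_counters.any_is_on) {
        return;
    }
    // Touch the stack one page (4K) below the stack pointer so that any
    // fault happens now, before interrupts/preemption get disabled.
    char tmp;
    asm volatile("movb -4096(%%rsp), %0" : "=r"(tmp));
}
```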

Please note that, performance-wise, this patch is slightly more expensive than
originally hoped. As the disassembled snippets illustrate, it costs an extra 3-4 instructions
every time we disable preemption or interrupts.

A. Disabling preemption in __tls_get_addr before the patch:
```
4035cae8: e8 83 f6 ff ff callq 4035c170 <elf::get_program()>
4035caed: 4c 8b 2d 6c 40 51 00 mov 0x51406c(%rip),%r13 # 40870b60 <sched::preempt_counter+0x40870b60>
4035caf4: 4c 8b 33 mov (%rbx),%r14
4035caf7: 64 41 83 45 00 01 addl $0x1,%fs:0x0(%r13)
4035cafd: e8 6e f6 ff ff callq 4035c170 <elf::get_program()>
```

B. Disabling preemption in __tls_get_addr after the patch:
```
4035fdd8: e8 63 f6 ff ff callq 4035f440 <elf::get_program()>
4035fddd: 4c 8b 2d a4 4d 51 00 mov 0x514da4(%rip),%r13 # 40874b88 <arch::irq_preempt_counters+0x40874b88>
4035fde4: 4c 8b 33 mov (%rbx),%r14
4035fde7: 64 49 83 7d 00 00 cmpq $0x0,%fs:0x0(%r13)
4035fded: 75 07 jne 4035fdf6 <__tls_get_addr+0x76>
4035fdef: 8a 84 24 00 f0 ff ff mov -0x1000(%rsp),%al
4035fdf6: 64 41 83 45 00 01 addl $0x1,%fs:0x0(%r13)
4035fdfc: e8 3f f6 ff ff callq 4035f440 <elf::get_program()>
```

As an example of the memory savings, Tomcat running ~300 threads ended up using 365MB instead
of the 620MB it used before the patch.

Fixes #143

Signed-off-by: Matthew Pabst <pabstm...@gmail.com>
Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
---
arch/aarch64/arch.hh | 6 ++++++
arch/x64/arch-switch.hh | 10 ++++++++++
arch/x64/arch.hh | 20 ++++++++++++++++++++
core/sched.cc | 5 ++++-
include/osv/counters.hh | 31 +++++++++++++++++++++++++++++++
include/osv/irqlock.hh | 2 ++
include/osv/mmu-defs.hh | 3 +--
include/osv/sched.hh | 9 +++++----
libc/mman.cc | 9 ++++-----
libc/pthread.cc | 5 +++++
10 files changed, 88 insertions(+), 12 deletions(-)
create mode 100644 include/osv/counters.hh

diff --git a/arch/aarch64/arch.hh b/arch/aarch64/arch.hh
index 855f1987..7dae53f5 100644
--- a/arch/aarch64/arch.hh
+++ b/arch/aarch64/arch.hh
@@ -11,6 +11,8 @@
#define ARCH_HH_

#include "processor.hh"
+#include <osv/barrier.hh>
+#include <osv/counters.hh>

// architecture independent interface for architecture dependent operations

@@ -20,6 +22,10 @@ namespace arch {
#define INSTR_SIZE_MIN 4
#define ELF_IMAGE_START (OSV_KERNEL_BASE + 0x10000)

+inline void ensure_next_stack_page() {
+ //TODO: Implement lazy stack check for AARCH64
+}
+
inline void irq_disable()
{
processor::irq_disable();
diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
index 6803498f..048beb6f 100644
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -148,6 +148,13 @@ void thread::init_stack()
_state.rip = reinterpret_cast<void*>(thread_main);
_state.rsp = stacktop;
_state.exception_stack = _arch.exception_stack + sizeof(_arch.exception_stack);
+
+ if (stack.lazy) {
+ // If the thread stack is setup to get lazily allocated and given the thread initially starts
+ // running with preemption disabled, we need to pre-fault the first stack page.
+ volatile char r = *((char *) (stacktop - 1));
+ (void) r; // trick the compiler into thinking that r is used
+ }
}

void thread::setup_tcb()
@@ -311,6 +318,9 @@ void thread::free_tcb()

void thread_main_c(thread* t)
{
+ if (t->get_stack_info().lazy) {
+ arch::irq_preempt_counters.irq = arch::irq_counter_lazy_stack_init_value;
+ }
arch::irq_enable();
#ifdef CONF_preempt
preempt_enable();
diff --git a/arch/x64/arch.hh b/arch/x64/arch.hh
index 17df5f5c..5cadc562 100644
--- a/arch/x64/arch.hh
+++ b/arch/x64/arch.hh
@@ -10,6 +10,8 @@

#include "processor.hh"
#include "msr.hh"
+#include <osv/barrier.hh>
+#include <osv/counters.hh>

// namespace arch - architecture independent interface for architecture
// dependent operations (e.g. irq_disable vs. cli)
@@ -20,8 +22,19 @@ namespace arch {
#define INSTR_SIZE_MIN 1
#define ELF_IMAGE_START OSV_KERNEL_BASE

+inline void ensure_next_stack_page() {
+ if (irq_preempt_counters.any_is_on) {
+ return;
+ }
+
+ char i;
+ asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
+}
+
inline void irq_disable()
{
+ ensure_next_stack_page();
+ ++irq_preempt_counters.irq;
processor::cli();
}

@@ -36,11 +49,15 @@ inline void irq_disable_notrace()
inline void irq_enable()
{
processor::sti();
+ --irq_preempt_counters.irq;
+ barrier();
}

inline void wait_for_interrupt()
{
processor::sti_hlt();
+ --irq_preempt_counters.irq;
+ barrier();
}

inline void halt_no_interrupts()
@@ -78,12 +95,15 @@ private:

inline void irq_flag_notrace::save()
{
+ ensure_next_stack_page();
+ ++irq_preempt_counters.irq;
asm volatile("sub $128, %%rsp; pushfq; popq %0; add $128, %%rsp" : "=r"(_rflags));
}

inline void irq_flag_notrace::restore()
{
asm volatile("sub $128, %%rsp; pushq %0; popfq; add $128, %%rsp" : : "r"(_rflags));
+ --irq_preempt_counters.irq;
}

inline bool irq_flag_notrace::enabled() const
diff --git a/core/sched.cc b/core/sched.cc
index 06f849d1..295e3de0 100644
--- a/core/sched.cc
+++ b/core/sched.cc
@@ -42,6 +42,10 @@ extern char _percpu_start[], _percpu_end[];
using namespace osv;
using namespace osv::clock::literals;

+namespace arch {
+counters __thread irq_preempt_counters = {{1, irq_counter_default_init_value}};
+}
+
namespace sched {

TRACEPOINT(trace_sched_idle, "");
@@ -69,7 +73,6 @@ std::vector<cpu*> cpus __attribute__((init_priority((int)init_prio::cpus)));
thread __thread * s_current;
cpu __thread * current_cpu;

-unsigned __thread preempt_counter = 1;
bool __thread need_reschedule = false;

elf::tls_data tls;
diff --git a/include/osv/counters.hh b/include/osv/counters.hh
new file mode 100644
index 00000000..38d40fb6
--- /dev/null
+++ b/include/osv/counters.hh
@@ -0,0 +1,31 @@
+/*
+ * Copyright (C) 2020 Waldemar Kozaczuk
+ *
+ * This work is open source software, licensed under the terms of the
+ * BSD license as described in the LICENSE file in the top-level directory.
+ */
+
+#ifndef OSV_COUNTERS_HH
+#define OSV_COUNTERS_HH
+
+namespace arch {
+
+// Both preempt and irq counters are colocated next to each other in this
+// union by design to let compiler optimize ensure_next_stack_page() so
+// that it can check if any counter is non-zero by checking single 64-bit any_is_on field
+union counters {
+ struct {
+ unsigned preempt;
+ unsigned irq;
+ };
+ uint64_t any_is_on;
+};
+
+extern counters __thread irq_preempt_counters;
+
+constexpr unsigned irq_counter_default_init_value = 11;
+constexpr unsigned irq_counter_lazy_stack_init_value = 1;
+
+}
+
+#endif //OSV_COUNTERS_HH
diff --git a/include/osv/irqlock.hh b/include/osv/irqlock.hh
index 2d5372fc..1a163e45 100644
--- a/include/osv/irqlock.hh
+++ b/include/osv/irqlock.hh
@@ -34,6 +34,8 @@ inline void irq_save_lock_type::lock()
inline void irq_save_lock_type::unlock()
{
_flags.restore();
+ --arch::irq_preempt_counters.irq;
+ barrier();
}

extern irq_lock_type irq_lock;
diff --git a/include/osv/mmu-defs.hh b/include/osv/mmu-defs.hh
index 18edf441..a731ed6e 100644
--- a/include/osv/mmu-defs.hh
+++ b/include/osv/mmu-defs.hh
@@ -15,8 +15,6 @@

struct exception_frame;

-struct exception_frame;
-
namespace mmu {

constexpr uintptr_t page_size = 4096;
@@ -84,6 +82,7 @@ enum {
mmap_small = 1ul << 5,
mmap_jvm_balloon = 1ul << 6,
mmap_file = 1ul << 7,
+ mmap_stack = 1ul << 8,
};

enum {
diff --git a/include/osv/sched.hh b/include/osv/sched.hh
index 0fb44f77..d0e9b00f 100644
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -335,6 +335,7 @@ public:
void* begin;
size_t size;
void (*deleter)(stack_info si); // null: don't delete
+ bool lazy = false;
static void default_deleter(stack_info si);
};
struct attr {
@@ -978,14 +979,13 @@ inline void release(mutex_t* mtx)
}
}

-extern unsigned __thread preempt_counter;
extern bool __thread need_reschedule;

#ifdef __OSV_CORE__
inline unsigned int get_preempt_counter()
{
barrier();
- return preempt_counter;
+ return arch::irq_preempt_counters.preempt;
}

inline bool preemptable()
@@ -1005,14 +1005,15 @@ inline void preempt()

inline void preempt_disable()
{
- ++preempt_counter;
+ arch::ensure_next_stack_page();
+ ++arch::irq_preempt_counters.preempt;
barrier();
}

inline void preempt_enable()
{
barrier();
- --preempt_counter;
+ --arch::irq_preempt_counters.preempt;
if (preemptable() && need_reschedule && arch::irq_enabled()) {
cpu::schedule();
}
diff --git a/libc/mman.cc b/libc/mman.cc
index d0803ac4..d00befc3 100644
--- a/libc/mman.cc
+++ b/libc/mman.cc
@@ -38,12 +38,11 @@ unsigned libc_flags_to_mmap(int flags)
mmap_flags |= mmu::mmap_populate;
}
if (flags & MAP_STACK) {
- // OSv currently requires that stacks be pinned (see issue #143). So
- // if an application wants to mmap() a stack for pthread_attr_setstack
- // and did us the courtesy of telling this to ue (via MAP_STACK),
- // let's return the courtesy by returning pre-faulted memory.
- // FIXME: If issue #143 is fixed, this workaround should be removed.
+#ifdef AARCH64_PORT_STUB
mmap_flags |= mmu::mmap_populate;
+#else
+ mmap_flags |= mmu::mmap_stack;
+#endif
}
if (flags & MAP_SHARED) {
mmap_flags |= mmu::mmap_shared;
diff --git a/libc/pthread.cc b/libc/pthread.cc
index 8c976bf6..5c00e8b9 100644
--- a/libc/pthread.cc
+++ b/libc/pthread.cc
@@ -140,10 +140,15 @@ namespace pthread_private {
return {attr.stack_begin, attr.stack_size};
}
size_t size = attr.stack_size;
+#ifdef AARCH64_PORT_STUB
void *addr = mmu::map_anon(nullptr, size, mmu::mmap_populate, mmu::perm_rw);
+#else
+ void *addr = mmu::map_anon(nullptr, size, mmu::mmap_stack, mmu::perm_rw);
+#endif
mmu::mprotect(addr, attr.guard_size, 0);
sched::thread::stack_info si{addr, size};
si.deleter = free_stack;
+ si.lazy = true;
return si;
}

--
2.20.1

Waldek Kozaczuk

Jan 29, 2020, 8:17:02 AM
to OSv Development
Please note this is an optimized and final version of the option 1 patch sent earlier.

Waldek Kozaczuk

Jan 30, 2020, 12:18:38 AM
to OSv Development
Unfortunately, after more extensive testing (even though all unit tests pass), I have realized this patch has some major flaws, though hopefully they are repairable.

I tried to run the http test (cd modules/httpserver-api && make check-http) and the app hung. After some debugging, I found this stack trace that explains why:

#0  sched::thread::switch_to (this=0xffff8000009c8040, this@entry=0xffff80007fee1040) at arch/x64/arch-switch.hh:108
#1  0x000000004040c57a in sched::cpu::reschedule_from_interrupt (this=0xffff80000001f040, called_from_yield=called_from_yield@entry=false, preempt_after=..., preempt_after@entry=...) at core/sched.cc:339
#2  0x000000004040d2e8 in sched::cpu::schedule () at include/osv/sched.hh:1316
#3  0x000000004040d406 in sched::thread::wait (this=this@entry=0xffff800003fa8040) at core/sched.cc:1216
#4  0x000000004043a856 in sched::thread::do_wait_for<lockfree::mutex, sched::wait_object<waitqueue> > (mtx=...) at include/osv/mutex.h:41
#5  sched::thread::wait_for<waitqueue&> (mtx=...) at include/osv/sched.hh:1226
#6  waitqueue::wait (this=this@entry=0x408f04d0 <mmu::vma_list_mutex+48>, mtx=...) at core/waitqueue.cc:56
#7  0x00000000403ea41b in rwlock::reader_wait_lockable (this=<optimized out>) at core/rwlock.cc:174
#8  rwlock::rlock (this=this@entry=0x408f04a0 <mmu::vma_list_mutex>) at core/rwlock.cc:29
#9  0x000000004034cbac in rwlock_for_read::lock (this=0x408f04a0 <mmu::vma_list_mutex>) at include/osv/rwlock.h:113
#10 std::lock_guard<rwlock_for_read&>::lock_guard (__m=..., this=<synthetic pointer>) at /usr/include/c++/9/bits/std_mutex.h:159
#11 lock_guard_for_with_lock<rwlock_for_read&>::lock_guard_for_with_lock (lock=..., this=<synthetic pointer>) at include/osv/mutex.h:89
#12 mmu::vm_fault (addr=35184666537984, addr@entry=35184666541728, ef=ef@entry=0xffff800003fad068) at core/mmu.cc:1333
#13 0x00000000403ad539 in page_fault (ef=0xffff800003fad068) at arch/x64/mmu.cc:42
#14 <signal handler called>
#15 arch::ensure_next_stack_page () at arch/x64/arch.hh:37
#16 sched::preempt_disable () at include/osv/sched.hh:1008
#17 preempt_lock_t::lock (this=<optimized out>) at include/osv/preempt-lock.hh:15
#18 std::lock_guard<preempt_lock_t>::lock_guard (__m=..., this=<synthetic pointer>) at /usr/include/c++/9/bits/std_mutex.h:159
#19 lock_guard_for_with_lock<preempt_lock_t>::lock_guard_for_with_lock (lock=..., this=<synthetic pointer>) at include/osv/mutex.h:89
#20 memory::pool::alloc (this=0x40907608 <memory::malloc_pools+168>) at core/mempool.cc:214
#21 0x00000000403f936f in std_malloc (size=80, alignment=16) at core/mempool.cc:1679
#22 0x00000000403f97db in malloc (size=80) at core/mempool.cc:1887
#23 0x00000000404c0089 in operator new(unsigned long) ()
#24 0x000000004034c9d2 in mmu::anon_vma::split (this=0xffffa00001f69e80, edge=4136108032) at include/osv/addr_range.hh:16
#25 0x000000004034ef93 in mmu::evacuate (start=<optimized out>, end=<optimized out>) at /usr/include/boost/move/detail/meta_utils.hpp:267
#26 0x000000004035007c in mmu::allocate (v=v@entry=0xffffa000050e4600, start=start@entry=4127195136, size=size@entry=8912896, search=search@entry=false)
    at core/mmu.cc:1116
#27 0x0000000040350e87 in mmu::map_anon (addr=addr@entry=0xf6000000, size=size@entry=8912896, flags=flags@entry=1, perm=perm@entry=3) at core/mmu.cc:1219
#28 0x000000004047d9f0 in mmap (addr=0xf6000000, length=8912896, prot=<optimized out>, flags=<optimized out>, fd=<optimized out>, offset=0)
    at libc/mman.cc:152
#29 0x0000100000f2dcef in os::Linux::commit_memory_impl(char*, unsigned long, bool) ()
#30 0x0000100000f2dfe9 in os::pd_commit_memory(char*, unsigned long, unsigned long, bool) ()
#31 0x0000100000f22a4a in os::commit_memory(char*, unsigned long, unsigned long, bool) ()
#32 0x0000100000f956db in PSVirtualSpace::expand_by(unsigned long) ()
#33 0x0000100000f96998 in PSYoungGen::resize_generation(unsigned long, unsigned long) ()
#34 0x0000100000f95ad2 in PSYoungGen::resize(unsigned long, unsigned long) ()
#35 0x0000100000f92e9f in PSScavenge::invoke_no_policy() ()
#36 0x0000100000f93673 in PSScavenge::invoke() ()
#37 0x0000100000f4de5d in ParallelScavengeHeap::failed_mem_allocate(unsigned long) ()
#38 0x00001000010cfd97 in VM_ParallelGCFailedAllocation::doit() ()
#39 0x00001000010d7572 in VM_Operation::evaluate() ()
#40 0x00001000010d526b in VMThread::evaluate_operation(VM_Operation*) ()
#41 0x00001000010d65ab in VMThread::loop() ()
#42 0x00001000010d6a13 in VMThread::run() ()
#43 0x0000100000f2e152 in java_start(Thread*) ()
#44 0x00000000404773ba in pthread_private::pthread::<lambda()>::operator() (__closure=0xffffa00002ddbd00) at libc/pthread.cc:115
#45 std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /usr/include/c++/9/bits/std_function.h:300
#46 0x000000004040da1c in sched::thread_main_c (t=0xffff800003fa8040) at arch/x64/arch-switch.hh:326
#47 0x00000000403ad2f3 in thread_main () at arch/x64/entry.S:113

First off, I realized that reschedule_from_interrupt() happens within the context of irq_lock, which ends up calling the irq disable/enable methods that increment/decrement the thread-local irq counter. This means that, at least in this case where irq_lock is used (and I think in most others), the counter can get incremented on a different thread than the one that decrements it. It looks like I have just been lucky that all other tests have worked so far, or possibly in most cases where interrupts are enabled/disabled not much stack is used. But clearly I have to either replace this counter mechanism with something else or adjust it.

The first bug is not really the reason for the deadlock you see above - frame 27 (core/mmu.cc:1219) and frame 12 (core/mmu.cc:1333). I think we have a situation where an ongoing memory allocation (see memory::pool::alloc) requires disabling preemption, which makes ensure_next_stack_page() trigger a fault that tries to allocate memory and ends up trying to acquire the same lock - mmu::vma_list_mutex.

I am not sure what the right way to fix it is. One way to prevent this deadlock could be to prevent the fault itself. I experimented a bit, and adding these 3 lines to mmap() to trigger an earlier fault ensuring 2 pages of stack made the deadlock go away:

--- a/libc/mman.cc
+++ b/libc/mman.cc
@@ -154,6 +149,9 @@ void *mmap(void *addr, size_t length, int prot, int flags,
             }
         }
         try {
+            char i;
+            asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
+            asm volatile("movb -8096(%%rsp), %0" : "=r"(i));
             ret = mmu::map_anon(addr, length, mmap_flags, mmap_perm);
         } catch (error& err) {
             err.to_libc(); // sets errno

Now the most troubling question is whether this is just an example of a situation where memory allocation causes a page fault via ensure_next_stack_page() on disabling preemption or interrupts, in which case we simply need to pre-fault deeper in all such cases (hopefully not too many), or whether there are many other types of scenarios we have not foreseen.

Waldek



Waldek Kozaczuk

Jan 30, 2020, 2:35:18 PM
to OSv Development
This is a false alarm; I somehow got it wrong in my head. The irq counter getting incremented on one thread when interrupts are disabled and then decremented on another one after a context switch is OK. It should not cause the issues I imagined :-)

The first bug is not really the reason for the deadlock you see above - frame 27 (core/mmu.cc:1219) and frame 12 (core/mmu.cc:1333). I think we have a situation where an ongoing memory allocation (see memory::pool::alloc) requires disabling preemption, which makes ensure_next_stack_page() trigger a fault that tries to allocate memory and ends up trying to acquire the same lock - mmu::vma_list_mutex.

I am not sure what the right way to fix it is. One way to prevent this deadlock could be to prevent the fault itself. I experimented a bit, and adding these 3 lines to mmap() to trigger an earlier fault ensuring 2 pages of stack made the deadlock go away:

--- a/libc/mman.cc
+++ b/libc/mman.cc
@@ -154,6 +149,9 @@ void *mmap(void *addr, size_t length, int prot, int flags,
             }
         }
         try {
+            char i;
+            asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
+            asm volatile("movb -8096(%%rsp), %0" : "=r"(i));
             ret = mmu::map_anon(addr, length, mmap_flags, mmap_perm);
         } catch (error& err) {
             err.to_libc(); // sets errno

Now the most troubling question is whether this is just an example of a situation where memory allocation causes a page fault via ensure_next_stack_page() on disabling preemption or interrupts, in which case we simply need to pre-fault deeper in all such cases (hopefully not too many), or whether there are many other types of scenarios we have not foreseen.

This is obviously still a problem. But now I have a better understanding of it and more confidence that there is a single category this deadlock falls into, and that there is a solution to deal with it.

Let me explain what exactly is happening per this stack trace. First, mmu::map_anon() acquires the lock vma_list_mutex for write; then, downstream, malloc() called by mmu::anon_vma::split() eventually leads to preempt_disable() being called, which causes a fault that eventually tries to acquire the same lock vma_list_mutex for read, which it cannot (see mmu::vm_fault()). Even if this lock were recursive (which I think it is not), handling a memory allocation that ends up requiring a vma list change while a vma list change is already in progress (as triggered in this example by mmu::map_anon()) cannot really work, as it is a logical recursion or "chicken-or-egg" problem, right?

So really the only solution is to prevent the page fault caused by ensure_next_stack_page() downstream. One way could be to pre-fault 2 pages of the stack in all the places in core/mmu.cc before they acquire the lock vma_list_mutex for write, so that once ensure_next_stack_page() is later called, it would not trigger a page fault. Based on some initial tests, the deadlock issue went away. Do you think it is the right fix?
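
In code form the idea looks roughly like the sketch below. This is illustrative only; the revised patch that follows implements it as prevent_stack_page_fault() (calling arch::ensure_next_two_stack_pages()) together with a lazy_stack_flags_guard in core/mmu.cc:
```
// Sketch: pre-fault two stack pages before taking vma_list_mutex for
// write, so that any ensure_next_stack_page() called downstream (e.g.
// from preempt_disable() inside malloc) finds the stack already
// populated and never recurses into vm_fault() under the write lock.
static void prevent_stack_page_fault()
{
    char tmp;
    asm volatile("movb -4096(%%rsp), %0" : "=r"(tmp));
    asm volatile("movb -8192(%%rsp), %0" : "=r"(tmp));
}

// A write-lock call site in core/mmu.cc would then look along these lines:
//
//     prevent_stack_page_fault();
//     SCOPE_LOCK(vma_list_mutex.for_write());
//     // ... allocate/modify vmas ...
```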

Waldemar Kozaczuk

Feb 2, 2020, 11:15:30 AM
to osv...@googlegroups.com, Waldemar Kozaczuk, Matthew Pabst
arch/aarch64/arch.hh | 5 ++++
arch/x64/arch-switch.hh | 12 ++++++++++
arch/x64/arch.hh | 34 ++++++++++++++++++++++++++
arch/x64/mmu.cc | 5 ++++
core/mmu.cc | 24 +++++++++++++++++++
core/sched.cc | 6 ++++-
include/osv/counters.hh | 53 +++++++++++++++++++++++++++++++++++++++++
include/osv/irqlock.hh | 4 ++++
include/osv/mmu-defs.hh | 3 +--
include/osv/sched.hh | 10 ++++----
libc/mman.cc | 9 ++++---
libc/pthread.cc | 10 +++++++-
12 files changed, 162 insertions(+), 13 deletions(-)
create mode 100644 include/osv/counters.hh

diff --git a/arch/aarch64/arch.hh b/arch/aarch64/arch.hh
index 855f1987..473e84e9 100644
--- a/arch/aarch64/arch.hh
+++ b/arch/aarch64/arch.hh
@@ -20,6 +20,11 @@ namespace arch {
#define INSTR_SIZE_MIN 4
#define ELF_IMAGE_START (OSV_KERNEL_BASE + 0x10000)

+inline void ensure_next_stack_page()
+{
+ //TODO: Implement lazy stack check for AARCH64
+}
+
inline void irq_disable()
{
processor::irq_disable();
diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
index 6803498f..3e52f7f8 100644
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -148,6 +148,13 @@ void thread::init_stack()
_state.rip = reinterpret_cast<void*>(thread_main);
_state.rsp = stacktop;
_state.exception_stack = _arch.exception_stack + sizeof(_arch.exception_stack);
+
+ if (stack.lazy) {
+ // If the thread stack is setup to get lazily allocated and given the thread initially starts
+ // running with preemption disabled, we need to pre-fault the first stack page.
+ volatile char r = *((char *) (stacktop - 1));
+ (void) r; // trick the compiler into thinking that r is used
+ }
}

void thread::setup_tcb()
@@ -311,6 +318,11 @@ void thread::free_tcb()

void thread_main_c(thread* t)
{
+ // Disable lazy stack pre-fault logic for this thread
+ // if stack is not lazy (most likely kernel thread)
+ if (t->get_stack_info().lazy) {
+ arch::irq_preempt_counters._data.set_lazy_flags_bit(arch::lazy_stack_disable, false);
+ }
arch::irq_enable();
#ifdef CONF_preempt
preempt_enable();
diff --git a/arch/x64/arch.hh b/arch/x64/arch.hh
index 17df5f5c..74b728dc 100644
--- a/arch/x64/arch.hh
+++ b/arch/x64/arch.hh
@@ -10,6 +10,8 @@

#include "processor.hh"
#include "msr.hh"
+#include <osv/barrier.hh>
+#include <osv/counters.hh>

// namespace arch - architecture independent interface for architecture
// dependent operations (e.g. irq_disable vs. cli)
@@ -20,8 +22,33 @@ namespace arch {
#define INSTR_SIZE_MIN 1
#define ELF_IMAGE_START OSV_KERNEL_BASE

+inline void ensure_next_stack_page() {
+ // Because both irq and preempt counters and lazy stack
+ // flags are co-located within same 8-bytes long union
+ // we can check single field for non-zero value to determine
+ // if we should trigger pre-fault on lazy stack
+ if (irq_preempt_counters.lazy_stack_no_pre_fault) {
+ return;
+ }
+
+ char i;
+ asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
+}
+
+inline void ensure_next_two_stack_pages() {
+ if (irq_preempt_counters.lazy_stack_no_pre_fault) {
+ return;
+ }
+
+ char i;
+ asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
+ asm volatile("movb -8192(%%rsp), %0" : "=r"(i));
+}
+
inline void irq_disable()
{
+ ensure_next_stack_page();
+ ++irq_preempt_counters._data.irq;
processor::cli();
}

@@ -36,11 +63,15 @@ inline void irq_disable_notrace()
inline void irq_enable()
{
processor::sti();
+ --irq_preempt_counters._data.irq;
+ barrier();
}

inline void wait_for_interrupt()
{
processor::sti_hlt();
+ --irq_preempt_counters._data.irq;
+ barrier();
}

inline void halt_no_interrupts()
@@ -78,12 +109,15 @@ private:

inline void irq_flag_notrace::save()
{
+ ensure_next_stack_page();
+ ++irq_preempt_counters._data.irq;
asm volatile("sub $128, %%rsp; pushfq; popq %0; add $128, %%rsp" : "=r"(_rflags));
}

inline void irq_flag_notrace::restore()
{
asm volatile("sub $128, %%rsp; pushq %0; popfq; add $128, %%rsp" : : "r"(_rflags));
+ --irq_preempt_counters._data.irq;
}

inline bool irq_flag_notrace::enabled() const
diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
index 24da5caa..ae619a9f 100644
--- a/arch/x64/mmu.cc
+++ b/arch/x64/mmu.cc
@@ -37,6 +37,11 @@ void page_fault(exception_frame *ef)
assert(sched::preemptable());
assert(ef->rflags & processor::rflags_if);

+ // Even though we are on exception stack that is not lazy
+ // for pre-caution we are disabling the pre-fault logic
+ // for the time the fault is being handled
+ arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_fault_disable);
+
// And since we may sleep, make sure interrupts are enabled.
DROP_LOCK(irq_lock) { // irq_lock is acquired by HW
mmu::vm_fault(addr, ef);
diff --git a/core/mmu.cc b/core/mmu.cc
index ff3fab47..b6e7cc36 100644
--- a/core/mmu.cc
+++ b/core/mmu.cc
@@ -42,6 +42,18 @@ extern const char text_start[], text_end[];

namespace mmu {

+static void prevent_stack_page_fault() {
+ // We need to ensure that lazy stack is populated
+ // deeply enough (2 pages) for all cases where
+ // the vma_list_mutex is taken for write so that we can disable
+ // the pre-fault check (arch::ensure_next_stack_page) which,
+ // if a page fault were triggered, would make the page-fault handling logic
+ // attempt to take the same vma_list_mutex for read and end up in a deadlock
+#ifndef AARCH64_PORT_STUB
+ arch::ensure_next_two_stack_pages();
+#endif
+}
+
namespace bi = boost::intrusive;

class vma_compare {
@@ -1183,6 +1195,8 @@ static void nohugepage(void* addr, size_t length)

error advise(void* addr, size_t size, int advice)
{
+ prevent_stack_page_fault();
+ arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_mmu_disable);
WITH_LOCK(vma_list_mutex.for_write()) {
if (!ismapped(addr, size)) {
return make_error(ENOMEM);
@@ -1215,6 +1229,8 @@ void* map_anon(const void* addr, size_t size, unsigned flags, unsigned perm)
size = align_up(size, mmu::page_size);
auto start = reinterpret_cast<uintptr_t>(addr);
auto* vma = new mmu::anon_vma(addr_range(start, start + size), perm, flags);
+ prevent_stack_page_fault();
+ arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_mmu_disable);
SCOPE_LOCK(vma_list_mutex.for_write());
auto v = (void*) allocate(vma, start, size, search);
if (flags & mmap_populate) {
@@ -1241,6 +1257,8 @@ void* map_file(const void* addr, size_t size, unsigned flags, unsigned perm,
auto start = reinterpret_cast<uintptr_t>(addr);
auto *vma = f->mmap(addr_range(start, start + size), flags | mmap_file, perm, offset).release();
void *v;
+ prevent_stack_page_fault();
+ arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_mmu_disable);
WITH_LOCK(vma_list_mutex.for_write()) {
v = (void*) allocate(vma, start, size, search);
if (flags & mmap_populate) {
@@ -1607,6 +1625,8 @@ ulong map_jvm(unsigned char* jvm_addr, size_t size, size_t align, balloon_ptr b)

auto* vma = new mmu::jvm_balloon_vma(jvm_addr, start, start + size, b, v->perm(), v->flags());

+ prevent_stack_page_fault();
+ arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_mmu_disable);
WITH_LOCK(vma_list_mutex.for_write()) {
// This means that the mapping that we had before was a balloon mapping
// that was laying around and wasn't updated to an anon mapping. If we
@@ -1868,6 +1888,8 @@ void free_initial_memory_range(uintptr_t addr, size_t size)

error mprotect(const void *addr, size_t len, unsigned perm)
{
+ prevent_stack_page_fault();
+ arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_mmu_disable);
SCOPE_LOCK(vma_list_mutex.for_write());

if (!ismapped(addr, len)) {
@@ -1879,6 +1901,8 @@ error mprotect(const void *addr, size_t len, unsigned perm)

error munmap(const void *addr, size_t length)
{
+ prevent_stack_page_fault();
+ arch::lazy_stack_flags_guard lazy_stack_guard(arch::lazy_stack_mmu_disable);
SCOPE_LOCK(vma_list_mutex.for_write());

length = align_up(length, mmu::page_size);
diff --git a/core/sched.cc b/core/sched.cc
index 06f849d1..4510a85e 100644
--- a/core/sched.cc
+++ b/core/sched.cc
@@ -42,6 +42,11 @@ extern char _percpu_start[], _percpu_end[];
using namespace osv;
using namespace osv::clock::literals;

+namespace arch {
+// By default disable lazy stack pre-fault logic
+counters __thread irq_preempt_counters = {{1, 1, 1}};
+}
+
namespace sched {

TRACEPOINT(trace_sched_idle, "");
@@ -69,7 +74,6 @@ std::vector<cpu*> cpus __attribute__((init_priority((int)init_prio::cpus)));
thread __thread * s_current;
cpu __thread * current_cpu;

-unsigned __thread preempt_counter = 1;
bool __thread need_reschedule = false;

elf::tls_data tls;
diff --git a/include/osv/counters.hh b/include/osv/counters.hh
new file mode 100644
index 00000000..b862f32c
--- /dev/null
+++ b/include/osv/counters.hh
@@ -0,0 +1,53 @@
+/*
+ * Copyright (C) 2020 Waldemar Kozaczuk
+ *
+ * This work is open source software, licensed under the terms of the
+ * BSD license as described in the LICENSE file in the top-level directory.
+ */
+
+#ifndef OSV_COUNTERS_HH
+#define OSV_COUNTERS_HH
+
+namespace arch {
+
+// Both preempt and irq counters are co-located next to each other
+// along with the flags that drive lazy stack related logic in this
+// union by design. This is so that the compiler can optimize ensure_next_stack_page()
+// to test against single 64-bit lazy_stack_no_pre_fault field
+union counters {
+ struct data {
+ unsigned preempt;
+ u16 irq;
+ u16 lazy_stack_flags;
+
+ inline void set_lazy_flags_bit(unsigned nr, bool v) {
+ lazy_stack_flags &= ~(u64(1) << nr);
+ lazy_stack_flags |= u64(v) << nr;
+ }
+ } _data;
+ uint64_t lazy_stack_no_pre_fault;
+};
+
+extern counters __thread irq_preempt_counters;
+
+class lazy_stack_flags_guard {
+ unsigned _flag_nr;
+public:
+ lazy_stack_flags_guard(unsigned flag_nr) {
+ _flag_nr = flag_nr;
+ irq_preempt_counters._data.set_lazy_flags_bit(_flag_nr, true);
+ }
+ ~lazy_stack_flags_guard() {
+ irq_preempt_counters._data.set_lazy_flags_bit(_flag_nr, false);
+ }
+};
+
+enum {
+ lazy_stack_disable = 0,
+ lazy_stack_mmu_disable = 1,
+ lazy_stack_fault_disable = 2,
+};
+
+}
+
+#endif //OSV_COUNTERS_HH
diff --git a/include/osv/irqlock.hh b/include/osv/irqlock.hh
index 2d5372fc..1e0ab6fa 100644
--- a/include/osv/irqlock.hh
+++ b/include/osv/irqlock.hh
@@ -34,6 +34,10 @@ inline void irq_save_lock_type::lock()
inline void irq_save_lock_type::unlock()
{
_flags.restore();
+#ifndef AARCH64_PORT_STUB
+ --arch::irq_preempt_counters._data.irq;
+ barrier();
+#endif
}

extern irq_lock_type irq_lock;
diff --git a/include/osv/mmu-defs.hh b/include/osv/mmu-defs.hh
index 18edf441..a731ed6e 100644
--- a/include/osv/mmu-defs.hh
+++ b/include/osv/mmu-defs.hh
@@ -15,8 +15,6 @@

struct exception_frame;

-struct exception_frame;
-
namespace mmu {

constexpr uintptr_t page_size = 4096;
@@ -84,6 +82,7 @@ enum {
mmap_small = 1ul << 5,
mmap_jvm_balloon = 1ul << 6,
mmap_file = 1ul << 7,
+ mmap_stack = 1ul << 8,
};

enum {
diff --git a/include/osv/sched.hh b/include/osv/sched.hh
index 0fb44f77..96453a8b 100644
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -25,6 +25,7 @@
#include <osv/rcu.hh>
#include <osv/clock.hh>
#include <osv/timer-set.hh>
+#include "osv/counters.hh"
#include <string.h>

typedef float runtime_t;
@@ -335,6 +336,7 @@ public:
void* begin;
size_t size;
void (*deleter)(stack_info si); // null: don't delete
+ bool lazy = false;
static void default_deleter(stack_info si);
};
struct attr {
@@ -978,14 +980,13 @@ inline void release(mutex_t* mtx)
}
}

-extern unsigned __thread preempt_counter;
extern bool __thread need_reschedule;

#ifdef __OSV_CORE__
inline unsigned int get_preempt_counter()
{
barrier();
- return preempt_counter;
+ return arch::irq_preempt_counters._data.preempt;
}

inline bool preemptable()
@@ -1005,14 +1006,15 @@ inline void preempt()

inline void preempt_disable()
{
- ++preempt_counter;
+ arch::ensure_next_stack_page();
+ ++arch::irq_preempt_counters._data.preempt;
barrier();
}

inline void preempt_enable()
{
barrier();
- --preempt_counter;
+ --arch::irq_preempt_counters._data.preempt;
if (preemptable() && need_reschedule && arch::irq_enabled()) {
cpu::schedule();
}
diff --git a/libc/mman.cc b/libc/mman.cc
index d0803ac4..d00befc3 100644
--- a/libc/mman.cc
+++ b/libc/mman.cc
@@ -38,12 +38,11 @@ unsigned libc_flags_to_mmap(int flags)
mmap_flags |= mmu::mmap_populate;
}
if (flags & MAP_STACK) {
- // OSv currently requires that stacks be pinned (see issue #143). So
- // if an application wants to mmap() a stack for pthread_attr_setstack
- // and did us the courtesy of telling this to ue (via MAP_STACK),
- // let's return the courtesy by returning pre-faulted memory.
- // FIXME: If issue #143 is fixed, this workaround should be removed.
+#ifdef AARCH64_PORT_STUB
mmap_flags |= mmu::mmap_populate;
+#else
+ mmap_flags |= mmu::mmap_stack;
+#endif
}
if (flags & MAP_SHARED) {
mmap_flags |= mmu::mmap_shared;
diff --git a/libc/pthread.cc b/libc/pthread.cc
index 8c976bf6..2279ac0c 100644
--- a/libc/pthread.cc
+++ b/libc/pthread.cc
@@ -140,10 +140,18 @@ namespace pthread_private {
return {attr.stack_begin, attr.stack_size};
}
size_t size = attr.stack_size;
- void *addr = mmu::map_anon(nullptr, size, mmu::mmap_populate, mmu::perm_rw);
+#ifdef AARCH64_PORT_STUB
+ unsigned stack_flags = mmu::mmap_populate;
+#else
+ unsigned stack_flags = mmu::mmap_stack;
+#endif
+ void *addr = mmu::map_anon(nullptr, size, stack_flags, mmu::perm_rw);
mmu::mprotect(addr, attr.guard_size, 0);
sched::thread::stack_info si{addr, size};
si.deleter = free_stack;
+#ifndef AARCH64_PORT_STUB
+ si.lazy = true;
+#endif
return si;
}

--
2.20.1

Waldek Kozaczuk

Feb 10, 2020, 12:14:52 AM
to OSv Development
Nadav,

What do you think about this patch? Also, do you think that the deadlock issue related to vma_list_mutex that I discovered and had to work around (see my previous email in this chain) is the only one we need to worry about, or are there any potential hidden problems like that which I have failed to notice?

Shall I make this lazy-stack feature activated behind an #ifdef, and only make it permanent once we feel confident about it?

Waldek

Waldek Kozaczuk

Mar 1, 2020, 10:30:07 AM
to OSv Development
Bump!

Waldek Kozaczuk

Mar 8, 2020, 1:16:30 AM
to OSv Development
Bump again :-)

Waldek Kozaczuk

Jun 7, 2020, 9:37:10 AM
to OSv Development
Nadav,

I am not sure if this patch still applies to the source tree at this point, but could you please try to review it as is? Otherwise I will revise it against the latest master.

Waldek
