As issue #143 explains, application stacks are currently allocated eagerly
(i.e. pre-populated), which may lead to substantial memory waste.
We want those stacks to be allocated lazily instead, so that the physical
memory behind a stack gets allocated as needed by the standard fault-handling mechanism.
The page-fault handling logic requires that both interrupts and preemption
be enabled. Therefore this patch implements a simple strategy: read a single byte
on the stack one page (4K) deeper before disabling interrupts or preemption.
It turns out we cannot read that single byte "blindly", because in some scenarios
interrupts might already be disabled when we are about to disable preemption, and vice versa.
Similarly, and more speculatively, disabling preemption (and possibly interrupts) may nest.
Therefore we need a "conditional" read of that byte on the stack. More specifically,
we should ONLY read the byte IF both interrupts and preemption are enabled.
This patch achieves that by adding an extra thread-local counter to track interrupt
disabling/enabling, which operates in a way similar to the preemption counter. So every time
interrupts are disabled we increment the counter, and every time interrupts are enabled
again we decrement it. Then, every time before we disable interrupts or preemption, we check
whether both counters are 0, and only then read a byte to potentially trigger a fault.
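To make the intended control flow easier to follow, here is a condensed, self-contained
sketch of that conditional read. It mirrors the ensure_next_stack_page() helper and the
counter bookkeeping added in arch/x64/arch.hh and include/osv/sched.hh below; the plain
thread-local variables stand in for sched::preempt_counter and mmu::irq_counter:
```cpp
// Stand-ins for the per-thread counters used by the patch:
// sched::preempt_counter and the new mmu::irq_counter.
static __thread unsigned preempt_counter;
static __thread unsigned irq_counter;

// Read one byte a full page (4096 bytes) below the current stack
// pointer so that, if that page is not yet populated, the page fault
// is taken now, while interrupts and preemption are still enabled.
static inline void ensure_next_stack_page()
{
    if (preempt_counter || irq_counter) {
        return; // not safe to fault: something is already disabled
    }
    char c;
    asm volatile("movb -4096(%%rsp), %0" : "=r"(c));
}

// Every "disable" path first pre-faults the next stack page (if safe)
// and then bumps the matching counter, for example:
static inline void irq_disable()
{
    ensure_next_stack_page();
    ++irq_counter;
    asm volatile("cli"); // stand-in for processor::cli()
}

static inline void preempt_disable()
{
    ensure_next_stack_page();
    ++preempt_counter;
    asm volatile("" ::: "memory"); // compiler barrier, as in sched.hh
}
```
The corresponding "enable" paths decrement the counters after re-enabling, as the
hunks for irq_enable(), irq_save_lock_type::unlock() and preempt_enable() show below.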
Please note that, performance-wise, this patch is not as "cheap" as was possibly hoped.
As the disassembled snippets below illustrate, it costs 7-8 extra instructions
every time we disable preemption or interrupts. It would be nice to properly measure
the added cost of this patch (see the micro-benchmark sketch after the snippets):
A. Disabling preemption in __tls_get_addr before the patch:
```
4035cae8: e8 83 f6 ff ff callq 4035c170 <elf::get_program()>
4035caed: 4c 8b 2d 6c 40 51 00 mov 0x51406c(%rip),%r13 # 40870b60 <sched::preempt_counter+0x40870b60>
4035caf4: 4c 8b 33 mov (%rbx),%r14
4035caf7: 64 41 83 45 00 01 addl $0x1,%fs:0x0(%r13)
4035cafd: e8 6e f6 ff ff callq 4035c170 <elf::get_program()>
```
B. Disabling preemption in __tls_get_addr after the patch:
```
40360c78: e8 33 f6 ff ff callq 403602b0 <elf::get_program()>
40360c7d: 4c 8b 2d dc 4e 51 00 mov 0x514edc(%rip),%r13 # 40875b60 <sched::preempt_counter+0x40875b5c>
40360c84: 48 8b 15 05 4f 51 00 mov 0x514f05(%rip),%rdx # 40875b90 <mmu::irq_counter+0x40875b90>
40360c8b: 4c 8b 33 mov (%rbx),%r14
40360c8e: 64 41 8b 45 00 mov %fs:0x0(%r13),%eax
40360c93: 89 c6 mov %eax,%esi
40360c95: 64 0b 32 or %fs:(%rdx),%esi
40360c98: 75 07 jne 40360ca1 <__tls_get_addr+0x81>
40360c9a: 8a 94 24 00 f0 ff ff mov -0x1000(%rsp),%dl
40360ca1: 83 c0 01 add $0x1,%eax
40360ca4: 64 41 89 45 00 mov %eax,%fs:0x0(%r13)
40360ca9: e8 02 f6 ff ff callq 403602b0 <elf::get_program()>
```
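A micro-benchmark along the following lines could quantify that cost. This is only a
sketch: the iteration count and the use of __rdtsc() are illustrative, and it assumes
the program is built and run inside OSv so that sched::preempt_disable() and
sched::preempt_enable() from include/osv/sched.hh are available:
```cpp
#include <x86intrin.h>   // __rdtsc()
#include <cstdio>
#include <osv/sched.hh>  // sched::preempt_disable() / sched::preempt_enable()

// Rough cycles per preempt_disable()/preempt_enable() pair; run once
// before and once after the patch and compare the two numbers.
int main()
{
    constexpr int iterations = 10000000;
    unsigned long long start = __rdtsc();
    for (int i = 0; i < iterations; i++) {
        sched::preempt_disable();
        sched::preempt_enable();
    }
    unsigned long long cycles = __rdtsc() - start;
    printf("%.1f cycles per disable/enable pair\n",
           (double)cycles / iterations);
    return 0;
}
```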
As an example of the memory savings, Tomcat with ~300 threads ended up using 365MB
instead of the 620MB it used before the patch.
Fixes #143
Signed-off-by: Matthew Pabst <pabstm...@gmail.com>
Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
---
arch/x64/arch-switch.hh | 10 ++++++++++
arch/x64/arch.hh | 29 +++++++++++++++++++++++++++++
arch/x64/exceptions.cc | 1 +
core/mmu.cc | 2 ++
include/osv/irqlock.hh | 6 ++++++
include/osv/mmu-defs.hh | 1 +
include/osv/sched.hh | 2 ++
libc/mman.cc | 7 +------
libc/pthread.cc | 3 ++-
9 files changed, 54 insertions(+), 7 deletions(-)
diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
index 6803498f..eefa505e 100644
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -148,6 +148,13 @@ void thread::init_stack()
_state.rip = reinterpret_cast<void*>(thread_main);
_state.rsp = stacktop;
_state.exception_stack = _arch.exception_stack + sizeof(_arch.exception_stack);
+
+ if (stack.lazy) {
+ // If the thread stack is lazily allocated and given the thread initially starts
+ // running with preemption disabled, we need to pre-fault the first stack page.
+ volatile char r = *((char *) (stacktop - 1));
+ (void) r; // trick the compiler into thinking that r is used
+ }
}
void thread::setup_tcb()
@@ -311,6 +318,9 @@ void thread::free_tcb()
void thread_main_c(thread* t)
{
+ if (t->get_stack_info().lazy) {
+ mmu::irq_counter = mmu::irq_counter_lazy_stack_init_value;
+ }
arch::irq_enable();
#ifdef CONF_preempt
preempt_enable();
diff --git a/arch/x64/arch.hh b/arch/x64/arch.hh
index 17df5f5c..d246df0c 100644
--- a/arch/x64/arch.hh
+++ b/arch/x64/arch.hh
@@ -10,18 +10,40 @@
#include "processor.hh"
#include "msr.hh"
+#include <osv/barrier.hh>
// namespace arch - architecture independent interface for architecture
// dependent operations (e.g. irq_disable vs. cli)
+namespace mmu {
+constexpr unsigned irq_counter_default_init_value = 11;
+constexpr unsigned irq_counter_lazy_stack_init_value = 1;
+extern unsigned __thread irq_counter;
+}
+
+namespace sched {
+extern unsigned __thread preempt_counter;
+}
+
namespace arch {
#define CACHELINE_ALIGNED __attribute__((aligned(64)))
#define INSTR_SIZE_MIN 1
#define ELF_IMAGE_START OSV_KERNEL_BASE
+inline void ensure_next_stack_page() {
+ if (sched::preempt_counter || mmu::irq_counter) {
+ return;
+ }
+
+ char i;
+ asm volatile("movb -4096(%%rsp), %0" : "=r"(i));
+}
+
inline void irq_disable()
{
+ ensure_next_stack_page();
+ ++mmu::irq_counter;
processor::cli();
}
@@ -36,11 +58,15 @@ inline void irq_disable_notrace()
inline void irq_enable()
{
processor::sti();
+ barrier();
+ --mmu::irq_counter;
}
inline void wait_for_interrupt()
{
processor::sti_hlt();
+ barrier();
+ --mmu::irq_counter;
}
inline void halt_no_interrupts()
@@ -78,12 +104,15 @@ private:
inline void irq_flag_notrace::save()
{
+ ensure_next_stack_page();
+ ++mmu::irq_counter;
asm volatile("sub $128, %%rsp; pushfq; popq %0; add $128, %%rsp" : "=r"(_rflags));
}
inline void irq_flag_notrace::restore()
{
asm volatile("sub $128, %%rsp; pushq %0; popfq; add $128, %%rsp" : : "r"(_rflags));
+ --mmu::irq_counter;
}
inline bool irq_flag_notrace::enabled() const
diff --git a/arch/x64/exceptions.cc b/arch/x64/exceptions.cc
index 7c9eaf51..889ca48c 100644
--- a/arch/x64/exceptions.cc
+++ b/arch/x64/exceptions.cc
@@ -302,6 +302,7 @@ extern "C" void simd_exception(exception_frame *ef)
extern "C" void nmi(exception_frame* ef)
{
+ //TODO: Possibly needs to be handled
while (true) {
processor::cli_hlt();
}
diff --git a/core/mmu.cc b/core/mmu.cc
index ff3fab47..7c6a4a97 100644
--- a/core/mmu.cc
+++ b/core/mmu.cc
@@ -42,6 +42,8 @@ extern const char text_start[], text_end[];
namespace mmu {
+unsigned __thread irq_counter = irq_counter_default_init_value;
+
namespace bi = boost::intrusive;
class vma_compare {
diff --git a/include/osv/irqlock.hh b/include/osv/irqlock.hh
index 2d5372fc..01609b7f 100644
--- a/include/osv/irqlock.hh
+++ b/include/osv/irqlock.hh
@@ -10,6 +10,10 @@
#include "arch.hh"
+namespace mmu {
+extern unsigned __thread irq_counter;
+}
+
class irq_lock_type {
public:
static void lock() { arch::irq_disable(); }
@@ -34,6 +38,8 @@ inline void irq_save_lock_type::lock()
inline void irq_save_lock_type::unlock()
{
_flags.restore();
+ barrier();
+ --mmu::irq_counter;
}
extern irq_lock_type irq_lock;
diff --git a/include/osv/mmu-defs.hh b/include/osv/mmu-defs.hh
index 18edf441..694815f8 100644
--- a/include/osv/mmu-defs.hh
+++ b/include/osv/mmu-defs.hh
@@ -84,6 +84,7 @@ enum {
mmap_small = 1ul << 5,
mmap_jvm_balloon = 1ul << 6,
mmap_file = 1ul << 7,
+ mmap_stack = 1ul << 8,
};
enum {
diff --git a/include/osv/sched.hh b/include/osv/sched.hh
index 0fb44f77..fd5e003c 100644
--- a/include/osv/sched.hh
+++ b/include/osv/sched.hh
@@ -335,6 +335,7 @@ public:
void* begin;
size_t size;
void (*deleter)(stack_info si); // null: don't delete
+ bool lazy = false;
static void default_deleter(stack_info si);
};
struct attr {
@@ -1005,6 +1006,7 @@ inline void preempt()
inline void preempt_disable()
{
+ arch::ensure_next_stack_page();
++preempt_counter;
barrier();
}
diff --git a/libc/mman.cc b/libc/mman.cc
index d0803ac4..de7ee4dd 100644
--- a/libc/mman.cc
+++ b/libc/mman.cc
@@ -38,12 +38,7 @@ unsigned libc_flags_to_mmap(int flags)
mmap_flags |= mmu::mmap_populate;
}
if (flags & MAP_STACK) {
- // OSv currently requires that stacks be pinned (see issue #143). So
- // if an application wants to mmap() a stack for pthread_attr_setstack
- // and did us the courtesy of telling this to ue (via MAP_STACK),
- // let's return the courtesy by returning pre-faulted memory.
- // FIXME: If issue #143 is fixed, this workaround should be removed.
- mmap_flags |= mmu::mmap_populate;
+ mmap_flags |= mmu::mmap_stack;
}
if (flags & MAP_SHARED) {
mmap_flags |= mmu::mmap_shared;
diff --git a/libc/pthread.cc b/libc/pthread.cc
index 8c976bf6..c7dbd017 100644
--- a/libc/pthread.cc
+++ b/libc/pthread.cc
@@ -140,10 +140,11 @@ namespace pthread_private {
return {attr.stack_begin, attr.stack_size};
}
size_t size = attr.stack_size;
- void *addr = mmu::map_anon(nullptr, size, mmu::mmap_populate, mmu::perm_rw);
+ void *addr = mmu::map_anon(nullptr, size, mmu::mmap_stack, mmu::perm_rw);
mmu::mprotect(addr, attr.guard_size, 0);
sched::thread::stack_info si{addr, size};
si.deleter = free_stack;
+ si.lazy = true;
return si;
}
--
2.20.1