OSv can run a statically linked executable


Waldek Kozaczuk

Apr 23, 2023, 11:26:53 PM
to OSv Development
Hi,

Over the past week, I have been working to get OSv to run a simple "hello world" app (aka native-example) built as a position-dependent statically linked executable. In essence, I picked up where Pekka Enberg left off over 8 years ago (see https://github.com/cloudius-systems/osv/tree/static-elf). Given that OSv these days has fairly robust support for over 70 syscalls (and 60 more that should be trivial to add), the remaining work is much more manageable.

./scripts/firecracker.py -e /hello
OSv v0.57.0-37-g0de155a4
Booted up in 5.23 ms
Cmdline: /hello
 -> syscall: 107
 -> syscall: 102
 -> syscall: 108
 -> syscall: 104
 -> syscall: 158
 -> syscall: 012
 -> syscall: 012
 -> syscall: 158
 -> syscall: 218
 -> syscall: 273
 -> syscall: 063
 -> syscall: 302
 -> syscall: 089
 -> syscall: 318
 -> syscall: 228
 -> syscall: 228
 -> syscall: 012
 -> syscall: 012
 -> syscall: 010
 -> syscall: 262
 -> syscall: 016
 -> syscall: 001
Hello from C code
 -> syscall: 231

I will be sending a series of proper patches later, after I clean up some issues, but in essence here is a list of things I had to do, including what Pekka started:
  1. Tweak dynamic linker to support static executable:
    • Handle missing DT_SYMTAB, DT_STRTAB and DT_NEEDED.
    • Handle ET_EXEC
    • Support statically-linked executable base address
  2. Add basic handling of static ELF entry point and initial stack state setup (see figure 3.9 ("Initial Process Stack") of the x86-64 ABI specification - https://refspecs.linuxbase.org/elf/x86_64-abi-0.99.pdf)
    • make sure the RDX register is zeroed and the basic AUX vector with AT_RANDOM is set up
    • more is left to do to fully support argv and full aux vector
  3. Add support of the brk() syscall (see issue 1138)
  4. Add dummy support of sys_set_robust_list and set_tid_address syscalls (possibly needs something more for multithreaded apps).
  5. Support the arch_prctl syscall that sets the app TLS 
    • this was by far the most complicated element; it required changing OSv to store a new per-cpu data pointer in the GS register and enhancing both the syscall handler and the interrupt/page fault handler to detect and, if needed, switch the FS base to the kernel TLS on entry and back to the app one on exit (see https://github.com/cloudius-systems/osv/issues/1137#issuecomment-1512315880)
  6. Fix a potential bug in handling TCGETS in the console driver.
  7. Implement sys_prlimit
  8. Enable the readlink, geteuid and getegid syscalls
This was enough to run a single-threaded app, but we will need to implement the clone syscall to support multi-threaded apps. In addition, we would want to support static PIEs as well, which I hope should not be very difficult.
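
For readers unfamiliar with item 2, the initial stack that a static executable expects at its entry point can be sketched roughly as below. This is only an illustration of the layout in figure 3.9 of the x86-64 ABI, not the actual OSv code: `build_initial_stack` and its parameters are hypothetical names, and the image is built in a plain vector instead of on a real stack.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Auxiliary vector entry types (values from the Linux ABI).
constexpr uint64_t AT_NULL_TYPE = 0;    // terminates the aux vector
constexpr uint64_t AT_RANDOM_TYPE = 25; // address of 16 random bytes

// Hypothetical helper: returns the 8-byte words the entry point should see,
// lowest address first, so element 0 is what %rsp points to.
std::vector<uint64_t> build_initial_stack(const std::vector<uint64_t>& argv_ptrs,
                                          const std::vector<uint64_t>& envp_ptrs,
                                          uint64_t random_bytes_addr)
{
    std::vector<uint64_t> stack;
    stack.push_back(argv_ptrs.size());             // argc
    for (auto p : argv_ptrs) stack.push_back(p);   // argv[0..argc-1]
    stack.push_back(0);                            // argv terminator
    for (auto p : envp_ptrs) stack.push_back(p);   // environment pointers
    stack.push_back(0);                            // envp terminator
    stack.push_back(AT_RANDOM_TYPE);               // aux entry: type
    stack.push_back(random_bytes_addr);            // aux entry: value
    stack.push_back(AT_NULL_TYPE);                 // end of aux vector
    stack.push_back(0);
    return stack;
}
```

A real implementation would additionally copy the argument strings and the random bytes somewhere above this area and keep %rsp 16-byte aligned at entry, as the ABI requires.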

Regards,
Waldek 

Dor Laor

Apr 24, 2023, 1:56:50 AM
to Waldek Kozaczuk, OSv Development
Very impressive, it improves OSv's ability to be a very safe and
fast sandbox


Nadav Har'El

Apr 24, 2023, 4:34:06 AM
to Waldek Kozaczuk, OSv Development
On Mon, Apr 24, 2023 at 6:26 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Hi,

Over the recent week, I have been working to get OSv to run a simple "hello world" app (aka native-example) built as a position-dependent statically linked executable.

Nice!
 
  5. Support the arch_prctl syscall that sets the app TLS
    • this was by far the most complicated element; it required changing OSv to store a new per-cpu data pointer in the GS register and enhancing both the syscall handler and the interrupt/page fault handler to detect and, if needed, switch the FS base to the kernel TLS on entry and back to the app one on exit (see https://github.com/cloudius-systems/osv/issues/1137#issuecomment-1512315880)

If this has noticeable overhead, perhaps it makes sense to make it optional?
 
  6. Fixing a potential bug in handling TCGETS in the console driver.
I'm curious what this bug was - I am personally fond of this area of this code, as you can see from the history
lesson in drivers/line-discipline.cc :-)
 
  7. Implement sys_prlimit
  8. Enable the readlink, geteuid and getegid
I think we already had those - or did you mean the system call?
 
This was enough to run a single-threaded app but we will need to implement the clone syscall to support multi-threaded apps.

Very nice. You can probably start by implementing the "simple" case of clone() used by a simple multi-threaded application and
leave the other cases with UNIMPLEMENTED (or "ignore" various parameters and leave them to be perfected later, with WARN_ONCE)
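
A minimal sketch of that "simple case first" approach could look like the following. The flag values are copied from the Linux UAPI headers; the assumption (worth verifying with strace) is that a simple pthread_create() in glibc ends up passing exactly this combination, 0x3d0f00. `classify_clone` and `clone_kind` are hypothetical names, not existing OSv code.

```cpp
#include <cassert>
#include <cstdint>

// Flag values as defined in the Linux UAPI <linux/sched.h>.
constexpr uint64_t K_CSIGNAL              = 0x000000ff; // exit signal mask
constexpr uint64_t K_CLONE_VM             = 0x00000100;
constexpr uint64_t K_CLONE_FS             = 0x00000200;
constexpr uint64_t K_CLONE_FILES          = 0x00000400;
constexpr uint64_t K_CLONE_SIGHAND        = 0x00000800;
constexpr uint64_t K_CLONE_THREAD         = 0x00010000;
constexpr uint64_t K_CLONE_SYSVSEM        = 0x00040000;
constexpr uint64_t K_CLONE_SETTLS         = 0x00080000;
constexpr uint64_t K_CLONE_PARENT_SETTID  = 0x00100000;
constexpr uint64_t K_CLONE_CHILD_CLEARTID = 0x00200000;

// Assumed flag set of a plain thread creation (0x3d0f00).
constexpr uint64_t THREAD_FLAGS =
    K_CLONE_VM | K_CLONE_FS | K_CLONE_FILES | K_CLONE_SIGHAND |
    K_CLONE_THREAD | K_CLONE_SYSVSEM | K_CLONE_SETTLS |
    K_CLONE_PARENT_SETTID | K_CLONE_CHILD_CLEARTID;

enum class clone_kind { thread, unimplemented };

// Accept only the exact "new thread" combination; everything else
// (fork-like clones, new namespaces, ...) is reported as unimplemented.
clone_kind classify_clone(uint64_t flags)
{
    uint64_t without_signal = flags & ~K_CSIGNAL;
    return without_signal == THREAD_FLAGS ? clone_kind::thread
                                          : clone_kind::unimplemented;
}
```

In a real handler the `unimplemented` branch is where UNIMPLEMENTED or WARN_ONCE would go.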
 

Waldek Kozaczuk

Apr 24, 2023, 2:08:26 PM
to Nadav Har'El, OSv Development
On Mon, Apr 24, 2023 at 4:34 AM Nadav Har'El <n...@scylladb.com> wrote:
On Mon, Apr 24, 2023 at 6:26 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Hi,

Over the recent week, I have been working to get OSv to run a simple "hello world" app (aka native-example) built as a position-dependent statically linked executable.

Nice!
 
  5. Support the arch_prctl syscall that sets the app TLS
    • this was by far the most complicated element; it required changing OSv to store a new per-cpu data pointer in the GS register and enhancing both the syscall handler and the interrupt/page fault handler to detect and, if needed, switch the FS base to the kernel TLS on entry and back to the app one on exit (see https://github.com/cloudius-systems/osv/issues/1137#issuecomment-1512315880)

If this has noticeable overhead, perhaps it makes sense to make it optional?
I have not measured it in any formal way. But when testing some of the earlier versions of the code, I could see the context switch time (the colocated one measured by misc-ctxsw) go up from 313 to 362 ns caused by adding this line: 

processor::wrmsr(msr::IA32_FS_KERNEL_BASE, reinterpret_cast<u64>(_tcb));
which may indirectly measure the cost of changing the GS or FS base with the wrmsr instruction at ~50 ns (yikes). I would think the FSGSBASE instructions (wrfsbase/wrgsbase) should be faster.

Here is a subset of the changes I had to make to the context-switching code and the interrupt/syscall handler:

1. Add 2 new fields to the thread control block:

    unsigned long app_tcb;   // holds the address the app passed to arch_prctl
    long kernel_tcb_counter; // if 0, we have to do an app/kernel/app FS base switch


2. Setup new per-cpu data intended to hold a pointer to the tcb:

--- a/arch/x64/arch-cpu.hh
+++ b/arch/x64/arch-cpu.hh

+struct tcb_data {
+    u64 kernel_tcb;
+    u64 tmp[2];
+};
+
 struct arch_cpu {
     arch_cpu();
     processor::aligned_task_state_segment atss;
@@ -46,6 +52,7 @@ struct arch_cpu {
     u32 apic_id;
     u32 acpi_id;
     u64 gdt[nr_gdt];
+    tcb_data _tcb_data;
     void init_on_cpu();
     void set_ist_entry(unsigned ist, char* base, size_t size);
     char* get_ist_entry(unsigned ist);
@@ -181,6 +188,8 @@ inline void arch_cpu::init_on_cpu()
     processor::init_fpu();
 
     processor::init_syscall();
+
+    processor::wrmsr(msr::IA32_GS_BASE, reinterpret_cast<u64>(&_tcb_data.kernel_tcb));
 }



3. Change kernel fs pointer on each context switch.

--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh

@@ -81,11 +81,13 @@ void thread::switch_to()
...
     c->arch.set_exception_stack(_state.exception_stack);
+    c->arch._tcb_data.kernel_tcb = reinterpret_cast<u64>(_tcb); //This should be very fast
     auto fpucw = processor::fnstcw();
...
@@ -258,6 +260,7 @@ void thread::setup_tcb()
     else {
         _tcb->syscall_stack_top = 0;
     }
+    _tcb->kernel_tcb_counter = 1; //By default disable fs base switch
 }


4. Handle fs switch if necessary on entry/exit of syscall/exception/page fault handler:

The change below covers only syscall entry; we have to do the opposite on exit, and something similar in the page fault/interrupt handlers (and possibly the signal handler as well).

@@ -174,6 +214,26 @@ syscall_entry:
     .cfi_register rip, rcx # rcx took previous rip value
     .cfi_register rflags, r11 # r11 took previous rflags value
     # There is no ring transition and rflags are left unchanged.
+    #
+    # app->kernel tcb switch
+    movq %rax, %gs:8  # save register rax so we can restore it later
+    movq %gs:0, %rax  # copy address of kernel tcb to the temp register rax
+    #1. Check if kernel_tcb_counter 0 and jump over to 3 if not (no need to do fsbase switch)
+    cmpq $0, 40(%rax)
+    jne on_kernel_tcb
+
+    #2. If zero set fs MSR to kernel tcb
+    movq %rbx, %gs:16  # save register rbx so we can restore it later
+    movq (%rax), %rbx # set kernel tcb
+    wrfsbase %rbx //TODO: In reality we need to check if wrfsbase is available and use wrmsr if not
+    movq %gs:16, %rbx
+
+on_kernel_tcb:
+    #3. Increment counter (for nested case)
+    incq 40(%rax)
+    #4. Restore %rax
+    movq %gs:8, %rax
+
     #
     # Unfortunately the mov instruction cannot be used to dereference an address
     # on syscall stack pointed by address in TCB (%fs:16) - double memory dereference.



I did measure that the context switch code is not affected in any way. I am sure the syscall/page fault/interrupt handlers are affected, but hopefully only by a tiny bit in all cases except when an application thread (of the static ELF) gets interrupted, triggers a page fault, or makes a syscall. In other words, I hope that kernel threads and normal (non-static-ELF) threads would not be affected. We could also add the necessary #ifdef static_elf.
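
To make the counter logic easier to follow, here is a small C++ model of the entry/exit protocol. It is a sketch under assumptions, not the real handler code: the exit path is my reading of "do the opposite for exit", and `cpu_model`, `tcb_model`, `kernel_entry` and `kernel_exit` are hypothetical names. Writing `fs_base` stands in for wrfsbase/wrmsr.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-thread state mirroring the two new TCB fields.
struct tcb_model {
    uint64_t kernel_tcb;      // FS base the kernel code expects
    uint64_t app_tcb;         // FS base the app set via arch_prctl
    long kernel_tcb_counter;  // 0 => FS base currently points to the app TCB
};

// Hypothetical CPU state; fs_writes counts the expensive FS base writes.
struct cpu_model {
    uint64_t fs_base;
    int fs_writes;
};

// Entry into a syscall/interrupt/page fault handler.
void kernel_entry(cpu_model& cpu, tcb_model& t)
{
    if (t.kernel_tcb_counter == 0) {  // coming from app code
        cpu.fs_base = t.kernel_tcb;
        ++cpu.fs_writes;
    }
    ++t.kernel_tcb_counter;           // track nesting
}

// Exit from the handler (assumed mirror image of the entry path).
void kernel_exit(cpu_model& cpu, tcb_model& t)
{
    --t.kernel_tcb_counter;
    if (t.kernel_tcb_counter == 0) {  // returning to app code
        cpu.fs_base = t.app_tcb;
        ++cpu.fs_writes;
    }
}
```

Because setup_tcb() initializes the counter to 1, threads that never call arch_prctl should never see it reach 0 in this model and so never pay for an FS base write; only a nested entry (say, a page fault inside a syscall) on a static-ELF thread skips the switch on the inner level.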

Any ideas on how to measure how much slower the interrupt/syscall/page fault handler would become?

  6. Fixing a potential bug in handling TCGETS in the console driver.
I'm curious what this bug was - I am personally fond of this area of this code, as you can see from the history
lesson in drivers/line-discipline.cc :-)
I think it may have to do with a size difference of the termios struct between glibc and OSv. The symptom seemed to be a corrupted stack after the ioctl syscall that ended up calling the code to handle TCGETS. This change seems to fix it:

--- a/drivers/console.cc
+++ b/drivers/console.cc
@@ -68,7 +68,16 @@ console_ioctl(u_long request, void *arg)
 {
     switch (request) {
     case TCGETS:
-        *static_cast<termios*>(arg) = tio;
+        //*static_cast<termios*>(arg) = tio;
+        {
+          termios *in = static_cast<termios*>(arg);
+          in->c_iflag = tio.c_iflag;
+          in->c_oflag = tio.c_oflag;
+          in->c_cflag = tio.c_cflag;
+          in->c_lflag = tio.c_lflag;
+          in->c_line = tio.c_line;
+        }
         return 0;
  
I think I have missed the c_cc field.

 
  7. Implement sys_prlimit
  8. Enable the readlink, geteuid and getegid
I think we already had those - or did you mean the system call?
Yes, just add SYSCALLx() macros to linux.cc 

Nadav Har'El

Apr 24, 2023, 4:51:44 PM
to Waldek Kozaczuk, OSv Development
On Mon, Apr 24, 2023 at 9:08 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:



  6. Fixing a potential bug in handling TCGETS in the console driver.
I'm curious what this bug was - I am personally fond of this area of this code, as you can see from the history
lesson in drivers/line-discipline.cc :-)
I think it may have to do with some size difference of the termios struct between glibc and OSv. The symptom seemed to be a corrupted stack after ioctl syscall call that ended up calling the code to handle TCGETS. This change seems to fix it:

--- a/drivers/console.cc
+++ b/drivers/console.cc
@@ -68,7 +68,16 @@ console_ioctl(u_long request, void *arg)
 {
     switch (request) {
     case TCGETS:
-        *static_cast<termios*>(arg) = tio;
+        //*static_cast<termios*>(arg) = tio;
+        {
+          termios *in = static_cast<termios*>(arg);
+          in->c_iflag = tio.c_iflag;
+          in->c_oflag = tio.c_oflag;
+          in->c_cflag = tio.c_cflag;
+          in->c_lflag = tio.c_lflag;
+          in->c_line = tio.c_line;
+        }
         return 0;
  
I think I have missed the c_cc field.


Ok, I think I know what's going on.

OSv's "struct termios" from include/api/termios.h is identical to the one glibc defines in /usr/include/bits/termios-struct.h.
But looking at the code above, it turns out that glibc does NOT assume that the kernel uses this termios structure, but something else called
__kernel_termios

So although our tcgetattr() function should return our usual termios structure as-is, TCGETS should do something different - it should write a __kernel_termios structure.
I think __kernel_termios is the ktermios that you have in /usr/include/asm-generic/termbits.h - you can see there that NCCS (the number of control characters) is lower, just 19 instead of 32, which explains the overflow you noticed.

I think the fix is simple - just copy a part of the termios struct - only up to the 19th c_cc member, not the whole thing.
Nadav.
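
A rough sketch of that partial copy, with locally defined structs so that it compiles on its own. `user_termios` and `kernel_termios` are hypothetical stand-ins whose layouts are assumed to match the glibc and kernel structures respectively; a real patch would use the actual headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t USER_NCCS = 32;   // NCCS as seen by glibc/OSv termios
constexpr std::size_t KERNEL_NCCS = 19; // NCCS in asm-generic/termbits.h

// Stand-in for OSv's (glibc-compatible) struct termios.
struct user_termios {
    unsigned int c_iflag, c_oflag, c_cflag, c_lflag;
    unsigned char c_line;
    unsigned char c_cc[USER_NCCS];
    unsigned int c_ispeed, c_ospeed;
};

// Stand-in for what the caller of ioctl(TCGETS) actually allocated.
struct kernel_termios {
    unsigned int c_iflag, c_oflag, c_cflag, c_lflag;
    unsigned char c_line;
    unsigned char c_cc[KERNEL_NCCS];
};

// Copy only what fits into the smaller structure, including the first
// 19 control characters, instead of assigning the whole 32-entry struct.
void tcgets_copy(kernel_termios* out, const user_termios& tio)
{
    out->c_iflag = tio.c_iflag;
    out->c_oflag = tio.c_oflag;
    out->c_cflag = tio.c_cflag;
    out->c_lflag = tio.c_lflag;
    out->c_line = tio.c_line;
    std::memcpy(out->c_cc, tio.c_cc, KERNEL_NCCS);
}
```

Unlike the field-by-field change quoted above, this also fills in c_cc, which the caller presumably expects to be valid after TCGETS.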

Waldek Kozaczuk

Apr 25, 2023, 10:55:10 PM
to OSv Development
On Monday, April 24, 2023 at 2:08:26 PM UTC-4 Waldek Kozaczuk wrote:
On Mon, Apr 24, 2023 at 4:34 AM Nadav Har'El <n...@scylladb.com> wrote:
On Mon, Apr 24, 2023 at 6:26 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Hi,

Over the recent week, I have been working to get OSv to run a simple "hello world" app (aka native-example) built as a position-dependent statically linked executable.

Nice!
 
  5. Support the arch_prctl syscall that sets the app TLS
    • this was by far the most complicated element; it required changing OSv to store a new per-cpu data pointer in the GS register and enhancing both the syscall handler and the interrupt/page fault handler to detect and, if needed, switch the FS base to the kernel TLS on entry and back to the app one on exit (see https://github.com/cloudius-systems/osv/issues/1137#issuecomment-1512315880)

If this has noticeable overhead, perhaps it makes sense to make it optional?
I have not measured it in any formal way. But when testing some of the earlier versions of the code, I could see the context switch time (the colocated one measured by misc-ctxsw) go up from 313 to 362 ns caused by adding this line: 

processor::wrmsr(msr::IA32_FS_KERNEL_BASE, reinterpret_cast<u64>(_tcb));
which may indirectly measure the cost of the code to change GS or FS base using the MSR instruction at ~ 50 ns (yikes). I would think the FSGSBASE instruction should be faster.

BTW, I have indirectly measured the cost of wrmsr and wrgsbase by modifying thread::switch_to() like this and running misc-ctxsw (I assume the cost of wrfsbase would be identical):
 
+uint32_t IA32_GS_BASE = 0xc0000101;
 void thread::switch_to()
 {
     thread* old = current();
@@ -81,6 +82,8 @@ void thread::switch_to()
     // barriers
     barrier();
     set_fsbase(reinterpret_cast<u64>(_tcb));
+    //asm volatile("wrgsbase %0" : : "r"(reinterpret_cast<u64>(_tcb)));
+    //processor::wrmsr(IA32_GS_BASE, reinterpret_cast<u64>(_tcb));
     barrier();
     auto c = _detached_state->_cpu;
     old->_state.exception_stack = c->arch.get_exception_stack();

With wrgsbase uncommented, the cost of the colocated context switch barely budged. On average, I could see a difference of at most 1-2 ns, if any; sometimes the times were identical. So it seems wrgsbase is pretty cheap, though we should still avoid calling it in the interrupt/page fault/syscall handler.

On the other hand, with the wrmsr code uncommented, the cost of the context switch went up by ~50 ns, so this instruction is very expensive. That is also why we especially need to avoid it if wrgsbase is not available.
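
Since the choice between the two instructions depends on CPU support, a check along these lines could decide it once at boot. This is a sketch, not OSv code: `cpu_has_fsgsbase`, `fsbase_writer` and `choose_fsbase_writer` are hypothetical names, and `__get_cpuid_count` comes from the GCC/Clang <cpuid.h> header.

```cpp
#include <cassert>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

// Returns true if the CPU advertises the FSGSBASE instructions
// (CPUID leaf 7, subleaf 0, EBX bit 0). Always false on non-x86 builds.
bool cpu_has_fsgsbase()
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        return false; // leaf 7 not supported
    }
    return (ebx & 1u) != 0;
#else
    return false;
#endif
}

enum class fsbase_writer { wrfsbase, wrmsr };

// Prefer the cheap instruction, fall back to the expensive MSR write.
fsbase_writer choose_fsbase_writer(bool has_fsgsbase)
{
    return has_fsgsbase ? fsbase_writer::wrfsbase : fsbase_writer::wrmsr;
}
```

Note that the CPUID bit alone is not sufficient: the kernel also has to set CR4.FSGSBASE before wrfsbase/wrgsbase can be executed without faulting.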