[PATCH] Move kernel to 0x40200000 address (1 GiB higher) in virtual memory


Waldemar Kozaczuk

Jun 16, 2019, 1:05:57 AM
to osv...@googlegroups.com, Waldemar Kozaczuk
This patch provides all necessary changes to move the OSv kernel 1 GiB higher
in virtual memory space, to start at 0x40200000. Most changes involve adding
or subtracting 0x40000000 (OSV_KERNEL_VM_SHIFT) in all relevant places. Please
note that the kernel is still loaded at 2 MiB in physical memory.

The motivation for this patch is to make as much space as possible (or just enough)
in virtual memory to allow running unmodified Linux non-PIE executables (issue #190).
Even though, due to the advancement of ASLR, more and more applications are PIEs (Position
Independent Executables), which OSv supports pretty well, there are still many
non-PIEs (position-dependent executables) out there. The most prominent one is
actually the JVM, most distributions of which come with a tiny (~20K) bootstrap java non-PIE
executable. There are many other examples where a small non-PIE executable loads
other shared libraries.

As issue #1043 explains, there are at least 3 possible solutions, and
this patch implements the 3rd (last) one described there. Please note that in the future,
with little effort, we could provide a slightly better scheme for OSV_KERNEL_VM_SHIFT
that would allow us to place the kernel even higher, at the end of the 2 GiB limit (small memory model),
and thus support virtually any non-PIE built using the small memory model.

Fixes #1043

Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
---
Makefile | 6 +++--
arch/x64/arch-setup.cc | 39 +++++++++++++++++-----------
arch/x64/boot.S | 38 +++++++++++++++++++++-------
arch/x64/entry-xen.S | 3 ++-
arch/x64/loader.ld | 53 ++++++++++++++++++++++-----------------
arch/x64/vmlinux-boot64.S | 6 +++--
core/elf.cc | 2 +-
core/mmu.cc | 8 +++---
loader.cc | 3 ++-
9 files changed, 100 insertions(+), 58 deletions(-)

diff --git a/Makefile b/Makefile
index 314e74f4..fcb0121a 100644
--- a/Makefile
+++ b/Makefile
@@ -312,7 +312,7 @@ gcc-sysroot = $(if $(CROSS_PREFIX), --sysroot external/$(arch)/gcc.bin) \
# To add something that will *not* be part of the main kernel, you can do:
#
# mydir/*.o EXTRA_FLAGS = <MY_STUFF>
-EXTRA_FLAGS = -D__OSV_CORE__ -DOSV_KERNEL_BASE=$(kernel_base) -DOSV_LZKERNEL_BASE=$(lzkernel_base)
+EXTRA_FLAGS = -D__OSV_CORE__ -DOSV_KERNEL_BASE=$(kernel_base) -DOSV_LZKERNEL_BASE=$(lzkernel_base) -DOSV_KERNEL_VM_SHIFT=$(kernel_vm_shift)
EXTRA_LIBS =
COMMON = $(autodepend) -g -Wall -Wno-pointer-arith $(CFLAGS_WERROR) -Wformat=0 -Wno-format-security \
-D __BSD_VISIBLE=1 -U _FORTIFY_SOURCE -fno-stack-protector $(INCLUDES) \
@@ -421,6 +421,7 @@ ifeq ($(arch),x64)
# lzkernel_base is where the compressed kernel is loaded from disk.
kernel_base := 0x200000
lzkernel_base := 0x100000
+kernel_vm_shift := 0x40000000

$(out)/arch/x64/boot16.o: $(out)/lzloader.elf
$(out)/boot.bin: arch/x64/boot16.ld $(out)/arch/x64/boot16.o
@@ -480,6 +481,7 @@ endif # x64
ifeq ($(arch),aarch64)

kernel_base := 0x40080000
+kernel_vm_shift := 0x0

include $(libfdt_base)/Makefile.libfdt
libfdt-source := $(patsubst %.c, $(libfdt_base)/%.c, $(LIBFDT_SRCS))
@@ -1872,7 +1874,7 @@ stage1: $(stage1_targets) links
.PHONY: stage1

$(out)/loader.elf: $(stage1_targets) arch/$(arch)/loader.ld $(out)/bootfs.o
- $(call quiet, $(LD) -o $@ --defsym=OSV_KERNEL_BASE=$(kernel_base) \
+ $(call quiet, $(LD) -o $@ --defsym=OSV_KERNEL_BASE=$(kernel_base) --defsym=OSV_KERNEL_VM_SHIFT=$(kernel_vm_shift) \
-Bdynamic --export-dynamic --eh-frame-hdr --enable-new-dtags \
$(^:%.ld=-T %.ld) \
--whole-archive \
diff --git a/arch/x64/arch-setup.cc b/arch/x64/arch-setup.cc
index 62236486..210f2d7e 100644
--- a/arch/x64/arch-setup.cc
+++ b/arch/x64/arch-setup.cc
@@ -85,12 +85,15 @@ extern boot_time_chart boot_time;
// it by placing address of start32 at the known offset at memory
// as defined by section .start32_address in loader.ld
extern "C" void start32();
-void * __attribute__((section (".start32_address"))) start32_address = reinterpret_cast<void*>(&start32);
+void * __attribute__((section (".start32_address"))) start32_address =
+ reinterpret_cast<void*>((long)&start32 - OSV_KERNEL_VM_SHIFT);

void arch_setup_free_memory()
{
- static ulong edata;
+ static ulong edata, edata_phys;
asm ("movl $.edata, %0" : "=rm"(edata));
+ edata_phys = edata - OSV_KERNEL_VM_SHIFT;
+
// copy to stack so we don't free it now
auto omb = *osv_multiboot_info;
auto mb = omb.mb;
@@ -129,13 +132,13 @@ void arch_setup_free_memory()
// page tables have been set up, so we can't reference the memory being
// freed.
for_each_e820_entry(e820_buffer, e820_size, [] (e820ent ent) {
- // can't free anything below edata, it's core code.
+ // can't free anything below edata_phys, it's core code.
// can't free anything below kernel at this moment
- if (ent.addr + ent.size <= edata) {
+ if (ent.addr + ent.size <= edata_phys) {
return;
}
- if (intersects(ent, edata)) {
- ent = truncate_below(ent, edata);
+ if (intersects(ent, edata_phys)) {
+ ent = truncate_below(ent, edata_phys);
}
// ignore anything above 1GB, we haven't mapped it yet
if (intersects(ent, initial_map)) {
@@ -149,21 +152,27 @@ void arch_setup_free_memory()
auto base = reinterpret_cast<void*>(get_mem_area_base(area));
mmu::linear_map(base, 0, initial_map, initial_map);
}
- // map the core, loaded 1:1 by the boot loader
- mmu::phys elf_phys = reinterpret_cast<mmu::phys>(elf_header);
- elf_start = reinterpret_cast<void*>(elf_header);
- elf_size = edata - elf_phys;
- mmu::linear_map(elf_start, elf_phys, elf_size, OSV_KERNEL_BASE);
+ // Map the core, loaded by the boot loader
+ // In order to properly setup mapping between virtual
+ // and physical we need to take into account where kernel
+ // is loaded in physical memory - elf_phys_start - and
+ // where it is linked to start in virtual memory - elf_start
+ static mmu::phys elf_phys_start = reinterpret_cast<mmu::phys>(elf_header);
+ // There is simple invariant between elf_phys_start and elf_start
+ // as expressed by the assignment below
+ elf_start = reinterpret_cast<void*>(elf_phys_start + OSV_KERNEL_VM_SHIFT);
+ elf_size = edata_phys - elf_phys_start;
+ mmu::linear_map(elf_start, elf_phys_start, elf_size, OSV_KERNEL_BASE);
// get rid of the command line, before low memory is unmapped
parse_cmdline(mb);
// now that we have some free memory, we can start mapping the rest
mmu::switch_to_runtime_page_tables();
for_each_e820_entry(e820_buffer, e820_size, [] (e820ent ent) {
//
- // Free the memory below elf_start which we could not before
- if (ent.addr < (u64)elf_start) {
- if (ent.addr + ent.size >= (u64)elf_start) {
- ent = truncate_above(ent, (u64) elf_start);
+ // Free the memory below elf_phys_start which we could not before
+ if (ent.addr < (u64)elf_phys_start) {
+ if (ent.addr + ent.size >= (u64)elf_phys_start) {
+ ent = truncate_above(ent, (u64) elf_phys_start);
}
mmu::free_initial_memory_range(ent.addr, ent.size);
return;
diff --git a/arch/x64/boot.S b/arch/x64/boot.S
index 1402e5d0..91c25d5a 100644
--- a/arch/x64/boot.S
+++ b/arch/x64/boot.S
@@ -24,13 +24,25 @@
.align 4096
.global ident_pt_l4
ident_pt_l4:
- .quad ident_pt_l3 + 0x67
+ # The addresses of the paging tables have to be the physical ones, so we have to
+ # manually subtract OSV_KERNEL_VM_SHIFT in all relevant places
+ .quad ident_pt_l3 + 0x67 - OSV_KERNEL_VM_SHIFT
.rept 511
.quad 0
.endr
ident_pt_l3:
- .quad ident_pt_l2 + 0x67
- .rept 511
+ # Each of the 512 entries in this table maps the very 1st 512 GiB of
+ # virtual address space 1 GiB at a time
+ # The very 1st entry maps 1st GiB 1:1 by pointing to ident_pt_l2 table
+ # that specifies addresses of every one of 512 2MiB slots of physical memory
+ .quad ident_pt_l2 + 0x67 - OSV_KERNEL_VM_SHIFT
+ # The 2nd entry maps 2nd GiB to the same 1st GiB of physical memory by pointing
+ # to the same ident_pt_l2 table as the 1st entry above
+ # This way we effectively provide correct mapping for the kernel linked
+ # to start at 1 GiB + 2 MiB (0x40200000) in virtual memory and point to
+ # 2 MiB address (0x200000) where it starts in physical memory
+ .quad ident_pt_l2 + 0x67 - OSV_KERNEL_VM_SHIFT
+ .rept 510
.quad 0
.endr
ident_pt_l2:
@@ -42,7 +54,8 @@ ident_pt_l2:

gdt_desc:
.short gdt_end - gdt - 1
- .long gdt
+ # subtract OSV_KERNEL_VM_SHIFT because when gdt_desc is referenced, the memory is mapped 1:1
+ .long gdt - OSV_KERNEL_VM_SHIFT

# Set up the 64-bit compatible version of GDT description structure
# that points to the same GDT (Global segments Descriptors Table) and
@@ -53,7 +66,8 @@ gdt_desc:
.align 8
gdt64_desc:
.short gdt_end - gdt - 1
- .quad gdt
+ # subtract OSV_KERNEL_VM_SHIFT because when gdt64_desc is referenced, the memory is mapped 1:1
+ .quad gdt - OSV_KERNEL_VM_SHIFT

.align 8
gdt = . - 8
@@ -77,10 +91,12 @@ init_stack_top = .
.globl start32
.globl start32_from_64
start32:
+ # Because the memory is mapped 1:1 at this point, we have to manually
+ # subtract OSV_KERNEL_VM_SHIFT from virtual addresses in all relevant places
# boot16.S set %eax to ELF start address, we'll use it later
mov %eax, %ebp
mov $0x0, %edi
- lgdt gdt_desc
+ lgdt gdt_desc-OSV_KERNEL_VM_SHIFT

# Add an address the vmlinux_entry64 will jump to when
# switching from 64-bit to 32-bit mode
@@ -91,7 +107,7 @@ start32_from_64:
mov %eax, %fs
mov %eax, %gs
mov %eax, %ss
- ljmp $0x18, $1f
+ ljmp $0x18, $1f-OSV_KERNEL_VM_SHIFT
1:
and $~7, %esp
# Enable PAE (Physical Address Extension) - ability to address 64GB
@@ -101,6 +117,9 @@ start32_from_64:

# Set root of a page table in cr3
lea ident_pt_l4, %eax
+ # The address of the root paging table has to be physical
+ # so subtract OSV_KERNEL_VM_SHIFT from ident_pt_l4
+ sub $OSV_KERNEL_VM_SHIFT, %eax
mov %eax, %cr3

# Set long mode
@@ -128,7 +147,7 @@ start64:
jz start64_continue
call extract_linux_boot_params
mov $0x1000, %rbx
- mov $0x200000, %rbp
+ mov $OSV_KERNEL_BASE, %rbp

start64_continue:
lea .bss, %rdi
@@ -168,6 +187,7 @@ smpboot:
mov smpboot_cr4-smpboot, %eax
mov %eax, %cr4
lea ident_pt_l4, %eax
+ sub $OSV_KERNEL_VM_SHIFT, %eax
mov %eax, %cr3
mov smpboot_efer-smpboot, %eax
mov smpboot_efer+4-smpboot, %edx
@@ -181,7 +201,7 @@ smpboot:

smpboot_gdt_desc:
.short gdt_end - gdt - 1
- .long gdt
+ .long gdt - OSV_KERNEL_VM_SHIFT
.global smpboot_cr0
smpboot_cr0:
.long 0
diff --git a/arch/x64/entry-xen.S b/arch/x64/entry-xen.S
index 11f72da4..81342284 100644
--- a/arch/x64/entry-xen.S
+++ b/arch/x64/entry-xen.S
@@ -23,7 +23,7 @@

elfnote_val(XEN_ELFNOTE_ENTRY, xen_start)
elfnote_val(XEN_ELFNOTE_HYPERCALL_PAGE, hypercall_page)
-elfnote_val(XEN_ELFNOTE_VIRT_BASE, 0)
+elfnote_val(XEN_ELFNOTE_VIRT_BASE, OSV_KERNEL_VM_SHIFT)
elfnote_str(XEN_ELFNOTE_XEN_VERSION, "xen-3.0")
elfnote_str(XEN_ELFNOTE_GUEST_OS, "osv")
elfnote_str(XEN_ELFNOTE_GUEST_VERSION, "?.?")
@@ -50,4 +50,5 @@ xen_start:
mov %rsp, xen_bootstrap_end
mov %rsi, %rdi
call xen_init
+ mov $0x0, %rdi
jmp start64
diff --git a/arch/x64/loader.ld b/arch/x64/loader.ld
index caae1f68..a3ac0790 100644
--- a/arch/x64/loader.ld
+++ b/arch/x64/loader.ld
@@ -15,15 +15,21 @@ SECTIONS
* We can't export the ELF header base as a symbol, because ld
* insists on moving stuff around if we do.
*
+ * Make kernel start OSV_KERNEL_VM_SHIFT bytes higher than where it
+ * starts in physical memory. Also put AT() expressions in all
+ * sections to enforce their placements in physical memory lower
+ * by OSV_KERNEL_VM_SHIFT bytes.
+ */
+ . = OSV_KERNEL_BASE + 0x800 + OSV_KERNEL_VM_SHIFT;
+ /*
* Place address of start32 routine at predefined offset in memory
*/
- . = OSV_KERNEL_BASE + 0x800;
- .start32_address : {
+ .start32_address : AT(ADDR(.start32_address) - OSV_KERNEL_VM_SHIFT) {
*(.start32_address)
}
- . = OSV_KERNEL_BASE + 0x1000;
- .dynamic : { *(.dynamic) } :dynamic :text
- .text : {
+ . = OSV_KERNEL_BASE + 0x1000 + OSV_KERNEL_VM_SHIFT;
+ .dynamic : AT(ADDR(.dynamic) - OSV_KERNEL_VM_SHIFT) { *(.dynamic) } :dynamic :text
+ .text : AT(ADDR(.text) - OSV_KERNEL_VM_SHIFT) {
text_start = .;
*(.text.hot .text.hot.*)
*(.text.unlikely .text.*_unlikely)
@@ -31,60 +37,61 @@ SECTIONS
*(.text.startup .text.startup.*)
*(.text .text.*)
text_end = .;
+ PROVIDE(low_vmlinux_entry64 = vmlinux_entry64 - OSV_KERNEL_VM_SHIFT);
} :text
. = ALIGN(8);
- .fixup : {
+ .fixup : AT(ADDR(.fixup) - OSV_KERNEL_VM_SHIFT) {
fault_fixup_start = .;
*(.fixup)
fault_fixup_end = .;
} :text

. = ALIGN(8);
- .memcpy_decode : {
+ .memcpy_decode : AT(ADDR(.memcpy_decode) - OSV_KERNEL_VM_SHIFT) {
memcpy_decode_start = .;
*(.memcpy_decode)
memcpy_decode_end = .;
} :text

- .eh_frame : { *(.eh_frame) } : text
- .rodata : { *(.rodata*) } :text
- .eh_frame : { *(.eh_frame) } :text
- .eh_frame_hdr : { *(.eh_frame_hdr) } :text :eh_frame
- .note : { *(.note*) } :text :note
- .gcc_except_table : { *(.gcc_except_table) *(.gcc_except_table.*) } : text
- .tracepoint_patch_sites ALIGN(8) : {
+ .eh_frame : AT(ADDR(.eh_frame) - OSV_KERNEL_VM_SHIFT) { *(.eh_frame) } : text
+ .rodata : AT(ADDR(.rodata) - OSV_KERNEL_VM_SHIFT) { *(.rodata*) } :text
+ .eh_frame : AT(ADDR(.eh_frame) - OSV_KERNEL_VM_SHIFT) { *(.eh_frame) } :text
+ .eh_frame_hdr : AT(ADDR(.eh_frame_hdr) - OSV_KERNEL_VM_SHIFT) { *(.eh_frame_hdr) } :text :eh_frame
+ .note : AT(ADDR(.note) - OSV_KERNEL_VM_SHIFT) { *(.note*) } :text :note
+ .gcc_except_table : AT(ADDR(.gcc_except_table) - OSV_KERNEL_VM_SHIFT) { *(.gcc_except_table) *(.gcc_except_table.*) } : text
+ .tracepoint_patch_sites ALIGN(8) : AT(ADDR(.tracepoint_patch_sites) - OSV_KERNEL_VM_SHIFT) {
__tracepoint_patch_sites_start = .;
*(.tracepoint_patch_sites)
__tracepoint_patch_sites_end = .;
} : text
- .data.rel.ro : { *(.data.rel.ro.local* .gnu.linkonce.d.rel.ro.local.*) *(.data.rel.ro .data.rel.ro.* .gnu.linkonce.d.rel.ro.*) } : text
- .data : { *(.data) } :text
+ .data.rel.ro : AT(ADDR(.data.rel.ro) - OSV_KERNEL_VM_SHIFT) { *(.data.rel.ro.local* .gnu.linkonce.d.rel.ro.local.*) *(.data.rel.ro .data.rel.ro.* .gnu.linkonce.d.rel.ro.*) } : text
+ .data : AT(ADDR(.data) - OSV_KERNEL_VM_SHIFT) { *(.data) } :text
_init_array_start = .;
- .init_array : {
+ .init_array : AT(ADDR(.init_array) - OSV_KERNEL_VM_SHIFT) {
*(SORT_BY_INIT_PRIORITY(.init_array.*) SORT_BY_INIT_PRIORITY(.ctors.*))
*(.init_array .ctors)
} : text
_init_array_end = .;
. = ALIGN(4096);
- .percpu : {
+ .percpu : AT(ADDR(.percpu) - OSV_KERNEL_VM_SHIFT) {
_percpu_start = .;
*(.percpu)
. = ALIGN(4096);
_percpu_end = .;
}
- .percpu_workers : {
+ .percpu_workers : AT(ADDR(.percpu_workers) - OSV_KERNEL_VM_SHIFT) {
_percpu_workers_start = .;
*(.percpu_workers)
_percpu_workers_end = .;
}
. = ALIGN(64);
- .tdata : { *(.tdata .tdata.* .gnu.linkonce.td.*) } :tls :text
- .tbss : {
+ .tdata : AT(ADDR(.tdata) - OSV_KERNEL_VM_SHIFT) { *(.tdata .tdata.* .gnu.linkonce.td.*) } :tls :text
+ .tbss : AT(ADDR(.tbss) - OSV_KERNEL_VM_SHIFT) {
*(.tbss .tbss.* .gnu.linkonce.tb.*)
. = ALIGN(64);
} :tls :text
.tls_template_size = SIZEOF(.tdata) + SIZEOF(.tbss);
- .bss : { *(.bss .bss.*) } :text
+ .bss : AT(ADDR(.bss) - OSV_KERNEL_VM_SHIFT) { *(.bss .bss.*) } :text
. = ALIGN(64);
tcb0 = .;
. = . + .tls_template_size + 256;
@@ -114,4 +121,4 @@ PHDRS {
eh_frame PT_GNU_EH_FRAME;
note PT_NOTE;
}
-ENTRY(vmlinux_entry64);
+ENTRY(low_vmlinux_entry64);
diff --git a/arch/x64/vmlinux-boot64.S b/arch/x64/vmlinux-boot64.S
index 230afd3c..12047513 100644
--- a/arch/x64/vmlinux-boot64.S
+++ b/arch/x64/vmlinux-boot64.S
@@ -13,7 +13,9 @@ vmlinux_entry64:
mov %rsi, %rdi

# Load the 64-bit version of the GDT
- lgdt gdt64_desc
+ # Because the memory is mapped 1:1 at this point, we have to manually
+ # subtract OSV_KERNEL_VM_SHIFT from the gdt address
+ lgdt gdt64_desc-OSV_KERNEL_VM_SHIFT

# Setup the stack to switch back to 32-bit mode in order
# to converge with the code that sets up transiton to 64-bit mode later.
@@ -32,6 +34,6 @@ vmlinux_entry64:
# to start32_from_64 which is where the boot process converges.
subq $8, %rsp
movl $0x18, 4(%rsp)
- movl $start32_from_64, %eax
+ movl $start32_from_64-OSV_KERNEL_VM_SHIFT, %eax # Because memory is mapped 1:1 subtract OSV_KERNEL_VM_SHIFT
movl %eax, (%rsp)
lret
diff --git a/core/elf.cc b/core/elf.cc
index fc2ee0c3..477a0177 100644
--- a/core/elf.cc
+++ b/core/elf.cc
@@ -1099,7 +1099,7 @@ void create_main_program()
program::program(void* addr)
: _next_alloc(addr)
{
- _core = std::make_shared<memory_image>(*this, (void*)ELF_IMAGE_START);
+ _core = std::make_shared<memory_image>(*this, (void*)(ELF_IMAGE_START + OSV_KERNEL_VM_SHIFT));
assert(_core->module_index() == core_module_index);
_core->load_segments();
set_search_path({"/", "/usr/lib"});
diff --git a/core/mmu.cc b/core/mmu.cc
index f9294125..75366360 100644
--- a/core/mmu.cc
+++ b/core/mmu.cc
@@ -91,12 +91,12 @@ phys pte_level_mask(unsigned level)
return ~((phys(1) << shift) - 1);
}

+static void *elf_phys_start = (void*)OSV_KERNEL_BASE;
void* phys_to_virt(phys pa)
{
- // The ELF is mapped 1:1
void* phys_addr = reinterpret_cast<void*>(pa);
- if ((phys_addr >= elf_start) && (phys_addr < elf_start + elf_size)) {
- return phys_addr;
+ if ((phys_addr >= elf_phys_start) && (phys_addr < elf_phys_start + elf_size)) {
+ return (void*)(phys_addr + OSV_KERNEL_VM_SHIFT);
}

return phys_mem + pa;
@@ -108,7 +108,7 @@ phys virt_to_phys(void *virt)
{
// The ELF is mapped 1:1
if ((virt >= elf_start) && (virt < elf_start + elf_size)) {
- return reinterpret_cast<phys>(virt);
+ return reinterpret_cast<phys>((void*)(virt - OSV_KERNEL_VM_SHIFT));
}

#if CONF_debug_memory
diff --git a/loader.cc b/loader.cc
index 7d88e609..7ac99ef5 100644
--- a/loader.cc
+++ b/loader.cc
@@ -102,7 +102,8 @@ void premain()

arch_init_premain();

- auto inittab = elf::get_init(elf_header);
+ auto inittab = elf::get_init(reinterpret_cast<elf::Elf64_Ehdr*>(
+ (void*)elf_header + OSV_KERNEL_VM_SHIFT));

if (inittab.tls.start == nullptr) {
debug_early("premain: failed to get TLS data from ELF\n");
--
2.20.1

Nadav Har'El

Jun 16, 2019, 5:40:46 AM
to Waldemar Kozaczuk, Osv Dev
On Sun, Jun 16, 2019 at 8:05 AM Waldemar Kozaczuk <jwkoz...@gmail.com> wrote:
This patch provides all necessary changes to move OSv kernel by 1 GiB higher
in virtual memory space to start at 0x40200000. Most changes involve adding
or substracting 0x40000000 (OSV_KERNEL_VM_SHIFT) in all relevant places. Please
note that the kernel is still loaded at 2MiB in physical memory.

Hi, overall I think this is a very good direction, because there's indeed no reason why the
kernel's physical and virtual addresses need to be the same, and that assumption was
hidden throughout our code. I do have a couple of questions inline below, though.

By the way, I personally feel that it would be more convenient to just control the virtual
address of the kernel directly - not as a "shift" from its physical address - but I guess it's
just a matter of taste, and some of the code below is indeed more natural when written
with the "shift" available directly. Perhaps we could have the best of both worlds -
define the kernel *position* as the primary setting, and then define OSV_KERNEL_VM_SHIFT
as simply the difference between the kernel's virtual position and its physical position.

Also, I see your patch makes 0x40000000 the default. I'm not sure I'm following all
the details here - is there any *downside* to this change from the previous default?
I can't think of any, but wondering if there's something I am missing.
Oh, but doesn't this mean that this only works correctly when OSV_KERNEL_VM_SHIFT is *exactly* 1 GB?
I.e., the reason why you want the mapping of the second gigabyte to be identical to the first gigabyte is
just because the shift is exactly 1GB?

If this is the case (please correct me if I misunderstood!), this code needs to be more sophisticated to handle
a general OSV_KERNEL_VM_SHIFT, or you need some sort of compile-time check to verify that
OSV_KERNEL_VM_SHIFT is set to 1 GB and nothing else.
Good catch.
I have no idea what this does :-(
 
 elfnote_str(XEN_ELFNOTE_XEN_VERSION, "xen-3.0")
 elfnote_str(XEN_ELFNOTE_GUEST_OS, "osv")
 elfnote_str(XEN_ELFNOTE_GUEST_VERSION, "?.?")
@@ -50,4 +50,5 @@ xen_start:
     mov %rsp, xen_bootstrap_end
     mov %rsi, %rdi
     call xen_init
+    mov $0x0, %rdi
     jmp start64
diff --git a/arch/x64/loader.ld b/arch/x64/loader.ld
index caae1f68..a3ac0790 100644
--- a/arch/x64/loader.ld
+++ b/arch/x64/loader.ld
@@ -15,15 +15,21 @@ SECTIONS
         * We can't export the ELF header base as a symbol, because ld
         * insists on moving stuff around if we do.
         *
+        * Make kernel start OSV_KERNEL_VM_SHIFT bytes higher than where it
+        * starts in physical memory. Also put AT() expressions in all
+        * sections to enforce their placements in physical memory lower
+        * by OSV_KERNEL_VM_SHIFT bytes.

I'm afraid I didn't understand the "AT()" part. Why do we need to add OSV_KERNEL_VM_SHIFT
on the first line ( . = OSV_KERNEL_BASE + 0x800 + OSV_KERNEL_VM_SHIFT ) but then
subtract it back in every single address calculation that follows? I didn't understand what
you are trying to achieve.
--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/20190616050548.7888-1-jwkozaczuk%40gmail.com.
For more options, visit https://groups.google.com/d/optout.

Waldek Kozaczuk

Jun 16, 2019, 8:22:03 AM
to OSv Development


On Sunday, June 16, 2019 at 5:40:46 AM UTC-4, Nadav Har'El wrote:
On Sun, Jun 16, 2019 at 8:05 AM Waldemar Kozaczuk <jwkoz...@gmail.com> wrote:
This patch provides all necessary changes to move OSv kernel by 1 GiB higher
in virtual memory space to start at 0x40200000. Most changes involve adding
or substracting 0x40000000 (OSV_KERNEL_VM_SHIFT) in all relevant places. Please
note that the kernel is still loaded at 2MiB in physical memory.

Hi, overall I think this is a very good direction, because there's indeed no reason why the
kernel's physical and virtual addresses need to be the same and the assumption that it
was was hidden throughout our code. I do have a couple of questions indline below though.

By the way, I personally feel that it would be more convenient to just control the virtual
address of the kernel directly - not as a "shift" from its physical address, but I guess it's
just a matter of taste, and some of the code below is indeed more natural when written
with the "shift" available directly. Perhaps we could have the best of both worlds -
define the kernel *position* as the primary setting, and then define OSV_KERNEL_VM_SHIFT
as simply the subtraction of the kerne's virtual position and its physical position.
Either way there should be only a single variable, so either:
1) the shift is a function of the virtual address (shift = virtual - physical), or
2) the virtual address is a function of the shift (virtual = shift + physical).
We might first apply the other patch I sent - "Make rebuilding loader.elf automatic and more efficient when changing kernel_base" - which could make it handy
to define it in a single header rather than in the Makefile. In either case this could be changed later.

Also, I see your patch makes 0x40000000 the default. I'm not sure I'm following all
the details here - is there any *downside* to this change from the previous default?
I can't think of any, but wondering if there's something I am missing.
I am not sure what change you are talking about. Just to clarify (per this patch):
OSV_KERNEL_VM_SHIFT = 0x40000000
elf_phys_start = 0x200000
elf_start = elf_phys_start + OSV_KERNEL_VM_SHIFT = 0x40200000
So in the comments I meant that "really" the kernel starts at 0x40200000 in VMA.
That is correct. The general scheme (which I am planning to make part of a next patch at some point) should be this:
OSV_KERNEL_VM_SHIFT = 1 GiB + N * 2 MiB, where 0 <= N < 500 (more or less, as the last 24 MiB of the 2nd GiB should be enough for the kernel).
But then, instead of re-using and pointing to the ident_pt_l2 table, I will have to define an extra instance of an ident_pt_l2-equivalent table where the first N entries will be zero.
 

If this is the case (please correct me if I misunderstood!), this code need to be more sophisticated to handle
a general OSV_KERNEL_VM_SHIFT, or you need some sort of compile time check to verify that
OSV_KERNEL_VM_SHIFT must be set to 1GB and nothing else.
Well, this will be enforced by the linker, which should complain if the kernel code goes beyond the 2 GiB limit in VM (small model).
Please also note that (I think) we unnecessarily set ebp/rbp in all these places and then pass it all the way to arch-setup.cc, but in reality the 0x200000 is predefined in the Makefile, so I am not sure what the point of this code in the *.S files is. Either way, we can remove this redundancy later.
Here is the article I read - https://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management (look for the "Start Of Day" section)
This part is critical to keep the segments (and the sections they are made of) in the right place in physical memory. It makes the paddr stay intact at 0x200000, which for example firecracker relies on. Without the AT() expressions, loader.elf would be over 1 GB in size.

Waldek Kozaczuk

Jun 16, 2019, 8:24:53 AM
to OSv Development


On Sunday, June 16, 2019 at 8:22:03 AM UTC-4, Waldek Kozaczuk wrote:


Also, possibly OSV_KERNEL_VM_BASE would be a better name than OSV_KERNEL_VM_SHIFT.

Nadav Har'El

Jun 16, 2019, 8:44:11 AM
to Waldek Kozaczuk, OSv Development
On Sun, Jun 16, 2019 at 3:22 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:


Either way there should be only single variable so either: 
1) the shift one is a function of the virtual address (shift = virtual - physical) or 
2) virtual address is a function of the shift (virtual = shift + physical)

Right. I thought it would be more convenient to choose the virtual address, not the shift,
but we can do this later (if at all).
 
We might first apply other patch I sent "Make rebuilding loader.elf automatic and more efficient when changing kernel_base" that could make it handy
to define it in single header rather than in Makefile. In either case this could be changed later.


Also, I see your patch makes 0x40000000 the default. I'm not sure I'm following all
the details here - is there any *downside* to this change from the previous default?
I can't think of any, but wondering if there's something I am missing.
I am not sure what change you are talking about. Just to clarify (per this patch):
OSV_KERNEL_VM_SHIFT = 0x40000000
elf_phys_start = 0x200000
elf_start = elf_phys_start + OSV_KERNEL_VM_SHIFT = 0x40200000
So in the comments I meant that "really" kernel starts at 0x40200000 in VMA.

Yes, I know. I mean, this patch not only adds a new optional feature of moving the
kernel's virtual-memory address, it also makes a new address the default. The question
is whether there is any *downside* to changing this default. I know it makes some
non-PIE code easier to run, but does it make other things harder to run? I think the
answer is no - there is no downside - but I wanted to hear your confirmation.

+    # Each of the 512 entries in this table maps the very 1st 512 GiB of
+    # virtual address space 1 GiB at a time
+    # The very 1st entry maps 1st GiB 1:1 by pointing to ident_pt_l2 table
+    # that specifies addresses of every one of 512 2MiB slots of physical memory
+    .quad ident_pt_l2 + 0x67 - OSV_KERNEL_VM_SHIFT
+    # The 2nd entry maps 2nd GiB to the same 1st GiB of physical memory by pointing
+    # to the same ident_pt_l2 table as the 1st entry above
+    # This way we effectively provide correct mapping for the kernel linked
+    # to start at 1 GiB + 2 MiB (0x40200000) in virtual memory and point to
+    # 2 MiB address (0x200000) where it starts in physical memory
+    .quad ident_pt_l2 + 0x67 - OSV_KERNEL_VM_SHIFT

Oh, but doesn't this mean that this only works correctly when OSV_KERNEL_VM_SHIFT is *exactly* 1 GB?
I.e., the reason why you want the mapping of the second gigabyte to be identical to the first gigabyte is
just because the shift is exactly 1GB?
That is correct. The general scheme (which I am planning to make part of the next patch at some time) should be this:
OSV_KERNEL_VM_SHIFT = 1 GiB + N * 2MiB where 0 =< N < 500 (more less as last 24MB of the 2nd GB should be enough for the kernel).
But then instead of re-using and pointing to the ident_pt_l2 table I will have to define extra instance of ident_pt_l2-equivalent-table where the first N entries will be zero.  

So although in all other places in the code OSV_KERNEL_VM_SHIFT can be anything, in this specific
place you really do assume that OSV_KERNEL_VM_SHIFT = 1 GB and nothing else will work? If this
is the case, I think you should add a static assertion (if that's possible in assembly code?) or
equivalent, to fail the compilation if OSV_KERNEL_VM_SHIFT != 1 GB.
 
 

If this is the case (please correct me if I misunderstood!), this code needs to be more sophisticated to handle
a general OSV_KERNEL_VM_SHIFT, or you need some sort of compile time check to verify that
OSV_KERNEL_VM_SHIFT must be set to 1GB and nothing else.
Well, this will be enforced by the linker, which should complain if the kernel code goes beyond the 2GiB limit in VM (small model).

But the user can set OSV_KERNEL_VM_SHIFT to 0 or 0.1 GB or 1.3 GB or whatever, and all of these almost
work (the kernel doesn't go beyond 2GB) but if I understand correctly, will then fail because of this one piece of code?

 

     # Set long mode
@@ -128,7 +147,7 @@ start64:
     jz start64_continue
     call extract_linux_boot_params
     mov $0x1000, %rbx
-    mov $0x200000, %rbp
+    mov $OSV_KERNEL_BASE, %rbp

Good catch.
Please also note that (I think) we unnecessarily set ebp/rbp in all these places and then pass it all the way to arch-setup.cc, but in reality the 0x200000 is predefined in the makefile, so I am not sure what the point of this code in the *.S files is. Either way we can remove this redundancy later.

I agree.
 


 elfnote_val(XEN_ELFNOTE_ENTRY, xen_start)
 elfnote_val(XEN_ELFNOTE_HYPERCALL_PAGE, hypercall_page)
-elfnote_val(XEN_ELFNOTE_VIRT_BASE, 0)
+elfnote_val(XEN_ELFNOTE_VIRT_BASE, OSV_KERNEL_VM_SHIFT)

I have no idea what this does :-(
Here is the article I read -  https://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management (look for "Start Of Day" section)

I guess we'll need to check on Xen (maybe an older AWS VM with  Xen) to see if this actually works :-(
I don't have Xen installed locally (and never did...).
So, you basically need all the addresses in loader.elf to remain physical - the small numbers below 1GB? So why did you need to set "." in the beginning to the high number (above 1GB)? I am afraid I still don't understand the combination of these things: setting "." to something but then overriding it at every point.

Nadav Har'El

Jun 16, 2019, 8:45:22 AM
to Waldek Kozaczuk, OSv Development
On Sun, Jun 16, 2019 at 3:24 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:


On Sunday, June 16, 2019 at 8:22:03 AM UTC-4, Waldek Kozaczuk wrote:


On Sunday, June 16, 2019 at 5:40:46 AM UTC-4, Nadav Har'El wrote:
On Sun, Jun 16, 2019 at 8:05 AM Waldemar Kozaczuk <jwkoz...@gmail.com> wrote:
This patch provides all necessary changes to move OSv kernel by 1 GiB higher
in virtual memory space to start at 0x40200000. Most changes involve adding
or subtracting 0x40000000 (OSV_KERNEL_VM_SHIFT) in all relevant places. Please
note that the kernel is still loaded at 2MiB in physical memory.

Hi, overall I think this is a very good direction, because there's indeed no reason why the
kernel's physical and virtual addresses need to be the same, and that assumption was
hidden throughout our code. I do have a couple of questions inline below though.

By the way, I personally feel that it would be more convenient to just control the virtual
address of the kernel directly - not as a "shift" from its physical address, but I guess it's
just a matter of taste, and some of the code below is indeed more natural when written
with the "shift" available directly. Perhaps we could have the best of both worlds -
define the kernel *position* as the primary setting, and then define OSV_KERNEL_VM_SHIFT
as simply the subtraction of the kernel's virtual position and its physical position.
Either way there should be only a single variable, so either:
1) the shift one is a function of the virtual address (shift = virtual - physical) or 
2) virtual address is a function of the shift (virtual = shift + physical)
We might first apply the other patch I sent, "Make rebuilding loader.elf automatic and more efficient when changing kernel_base", which could make it handy
to define it in a single header rather than in the Makefile. In either case this could be changed later.
Also, possibly OSV_KERNEL_VM_BASE is a better name than OSV_KERNEL_VM_SHIFT.

OSV_KERNEL_VM_SHIFT is a good name for the 0x40000000
and OSV_KERNEL_VM_BASE for the 0x40200000
as you said, we just need one of them and the second can be calculated
by adding/subtracting two constants.
 
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/f2da1cc4-76e6-4823-8f32-31c82a865d11%40googlegroups.com.

Waldek Kozaczuk

Jun 16, 2019, 9:37:38 AM
to Nadav Har'El, OSv Development
Now I really understand your question. You are right. 

+    # Each of the 512 entries in this table maps the very 1st 512 GiB of
+    # virtual address space 1 GiB at a time
+    # The very 1st entry maps 1st GiB 1:1 by pointing to ident_pt_l2 table
+    # that specifies addresses of every one of 512 2MiB slots of physical memory
+    .quad ident_pt_l2 + 0x67 - OSV_KERNEL_VM_SHIFT
+    # The 2nd entry maps 2nd GiB to the same 1st GiB of physical memory by pointing
+    # to the same ident_pt_l2 table as the 1st entry above
+    # This way we effectively provide correct mapping for the kernel linked
+    # to start at 1 GiB + 2 MiB (0x40200000) in virtual memory and point to
+    # 2 MiB address (0x200000) where it starts in physical memory
+    .quad ident_pt_l2 + 0x67 - OSV_KERNEL_VM_SHIFT

Oh, but doesn't this mean that this only works correctly when OSV_KERNEL_VM_SHIFT is *exactly* 1 GB?
I.e., the reason why you want the mapping of the second gigabyte to be identical to the first gigabyte is
just because the shift is exactly 1GB?
That is correct. The general scheme (which I am planning to make part of the next patch at some time) should be this:
OSV_KERNEL_VM_SHIFT = 1 GiB + N * 2MiB where 0 <= N < 500 (more or less, as the last 24MB of the 2nd GB should be enough for the kernel).
But then, instead of re-using and pointing to the ident_pt_l2 table, I will have to define an extra instance of an ident_pt_l2-equivalent table where the first N entries will be zero.

So although in all other places in the code, OSV_KERNEL_VM_SHIFT can be anything, in this specific
place you really do assume that OSV_KERNEL_VM_SHIFT = 1 GB and nothing else will work? If this
is the case, I think you should add a static assertion (if it's possible in the assembly code?) or
equivalent, to fail the compilation if OSV_KERNEL_VM_SHIFT != 1 GB.
Indeed, at least with this patch (until we have a more dynamic formula), it all comes down to a single constraint: we reuse the pl3 table, so the shift has to be 1GB. How can we assert it with the compiler?

 
 

If this is the case (please correct me if I misunderstood!), this code needs to be more sophisticated to handle
a general OSV_KERNEL_VM_SHIFT, or you need some sort of compile time check to verify that
OSV_KERNEL_VM_SHIFT must be set to 1GB and nothing else.
Well, this will be enforced by the linker, which should complain if the kernel code goes beyond the 2GiB limit in VM (small model).

But the user can set OSV_KERNEL_VM_SHIFT to 0 or 0.1 GB or 1.3 GB or whatever, and all of these almost
work (the kernel doesn't go beyond 2GB) but if I understand correctly, will then fail because of this one piece of code?
Right. See above. 
 

     # Set long mode
@@ -128,7 +147,7 @@ start64:
     jz start64_continue
     call extract_linux_boot_params
     mov $0x1000, %rbx
-    mov $0x200000, %rbp
+    mov $OSV_KERNEL_BASE, %rbp

Good catch.
Please also note that (I think) we unnecessarily set ebp/rbp in all these places and then pass it all the way to arch-setup.cc, but in reality the 0x200000 is predefined in the makefile, so I am not sure what the point of this code in the *.S files is. Either way we can remove this redundancy later.

I agree.
 


 elfnote_val(XEN_ELFNOTE_ENTRY, xen_start)
 elfnote_val(XEN_ELFNOTE_HYPERCALL_PAGE, hypercall_page)
-elfnote_val(XEN_ELFNOTE_VIRT_BASE, 0)
+elfnote_val(XEN_ELFNOTE_VIRT_BASE, OSV_KERNEL_VM_SHIFT)

I have no idea what this does :-(
Here is the article I read -  https://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management (look for "Start Of Day" section)

I guess we'll need to check on Xen (maybe an older AWS VM with  Xen) to see if this actually works :-(
I don't have Xen installed locally (and never did...).
I should test using EC2 but I am afraid it will not cover all cases with Xen. Is it true that on EC2 OSv starts booting in real mode (we supply usr.img, not an ELF)? But you can also boot it using loader.elf. Either way it seems we need a Xen server to test.

Can any Xen gurus shed light on this matter?

Nadav Har'El

Jun 16, 2019, 9:57:58 AM
to Waldek Kozaczuk, OSv Development
On Sun, Jun 16, 2019 at 4:37 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:

+    # virtual address space 1 GiB at a time
+    # The very 1st entry maps 1st GiB 1:1 by pointing to ident_pt_l2 table
+    # that specifies addresses of every one of 512 2MiB slots of physical memory
+    .quad ident_pt_l2 + 0x67 - OSV_KERNEL_VM_SHIFT
+    # The 2nd entry maps 2nd GiB to the same 1st GiB of physical memory by pointing
+    # to the same ident_pt_l2 table as the 1st entry above
+    # This way we effectively provide correct mapping for the kernel linked
+    # to start at 1 GiB + 2 MiB (0x40200000) in virtual memory and point to
+    # 2 MiB address (0x200000) where it starts in physical memory
+    .quad ident_pt_l2 + 0x67 - OSV_KERNEL_VM_SHIFT

Oh, but doesn't this mean that this only works correctly when OSV_KERNEL_VM_SHIFT is *exactly* 1 GB?
I.e., the reason why you want the mapping of the second gigabyte to be identical to the first gigabyte is
just because the shift is exactly 1GB?
That is correct. The general scheme (which I am planning to make part of the next patch at some time) should be this:
OSV_KERNEL_VM_SHIFT = 1 GiB + N * 2MiB where 0 <= N < 500 (more or less, as the last 24MB of the 2nd GB should be enough for the kernel).
But then, instead of re-using and pointing to the ident_pt_l2 table, I will have to define an extra instance of an ident_pt_l2-equivalent table where the first N entries will be zero.

So although in all other places in the code, OSV_KERNEL_VM_SHIFT can be anything, in this specific
place you really do assume that OSV_KERNEL_VM_SHIFT = 1 GB and nothing else will work? If this
is the case, I think you should add a static assertion (if it's possible in the assembly code?) or
equivalent, to fail the compilation if OSV_KERNEL_VM_SHIFT != 1 GB.
Indeed, at least with this patch (until we have a more dynamic formula), it all comes down to a single constraint: we reuse the pl3 table, so the shift has to be 1GB. How can we assert it with the compiler?

Since  OSV_KERNEL_VM_SHIFT is a preprocessor macro, maybe we can check it with the preprocessor - even in the assembler code, something like:

#if OSV_KERNEL_VM_SHIFT != 0x40000000 && OSV_KERNEL_VM_SHIFT != 0
#error This code only works correctly for OSV_KERNEL_VM_SHIFT = 0x40000000 or 0
#endif

Put it right next to the code which only works for these cases (correct me if I'm wrong, but
I think the same code works for either 0 or 0x40000000, and nothing else?) and check that you get a compilation error for OSV_KERNEL_VM_SHIFT = 0x12345.

Waldek Kozaczuk

Jun 18, 2019, 5:35:04 PM
to OSv Development
As I somehow suspected, mapping the kernel higher breaks Xen, at least on EC2. I still have not gotten my laptop properly set up to run OSv on Xen (which I will be detailing in my other email), so I cannot reproduce anything locally yet. Very frustrating.

The good news is that OSv at least starts the boot process, displays "OSv .." and exits somewhere in premain, just before executing the ELF init functions. Also, the very latest master works on EC2, and my kernel mapping changes break it.

I managed to add a couple of debug_early() calls and here is what I got:
OSv v0.53.0-35-g61070e27
-> arch_init_premain()
-> features_type() constructor
process_xen_bits
------> xen_init(): Start
------> xen_init(): After cpuid
------> xen_init(): Before XENVER_get_features

The arch/x64/xen.cc changes:
diff --git a/arch/x64/xen.cc b/arch/x64/xen.cc
index 462c266c..0247d5eb 100644
--- a/arch/x64/xen.cc
+++ b/arch/x64/xen.cc
@@ -169,17 +169,24 @@ gsi_level_interrupt *xen_set_callback(int irqno)
 
 void xen_init(processor::features_type &features, unsigned base)
 {
+    debug_early("------> xen_init(): Start\n");
+
     // Base + 1 would have given us the version number, it is mostly
     // uninteresting for us now
     auto x = processor::cpuid(base + 2);
     processor::wrmsr(x.b, cast_pointer(&hypercall_page));
+    //processor::wrmsr(x.b, cast_pointer(mmu::virt_to_phys(&hypercall_page)));
 
+    debug_early("------> xen_init(): After cpuid\n");
     struct xen_feature_info info;
     // To fill up the array used by C code
     for (int i = 0; i < XENFEAT_NR_SUBMAPS; i++) {
         info.submap_idx = i;
-        if (version_hypercall(XENVER_get_features, &info) < 0)
+        debug_early("------> xen_init(): Before XENVER_get_features\n");
+        if (version_hypercall(XENVER_get_features, &info) < 0) {
+            debug_early("------> xen_init(): XENVER_get_features failed!\n");
             assert(0);
+        }
         for (int j = 0; j < 32; j++)
             xen_features[i * 32 + j] = !!(info.submap & 1<<j);
     }
@@ -188,18 +195,25 @@ void xen_init(processor::features_type &features, unsigned base)
     if (!features.xen_vector_callback)
         evtchn_irq_is_legacy();
 
+    debug_early("------> xen_init(): Before XENMEM_add_to_physmap\n");
     struct xen_add_to_physmap map;
     map.domid = DOMID_SELF;
     map.idx = 0;
     map.space = 0;
-    map.gpfn = cast_pointer(&xen_shared_info) >> 12;
+    map.gpfn = (cast_pointer(mmu::virt_to_phys(&xen_shared_info))) >> 12;
 
     // 7 => add to physmap
-    if (memory_hypercall(XENMEM_add_to_physmap, &map))
+    if (memory_hypercall(XENMEM_add_to_physmap, &map)) {
+        debug_early("------> xen_init(): XENMEM_add_to_physmap failed!\n");
         assert(0);
+    }
+
+    debug_early("------> xen_init(): After XENMEM_add_to_physmap\n");
 
     features.xen_pci = xen_pci_enabled();
     HYPERVISOR_shared_info = reinterpret_cast<shared_info_t *>(&xen_shared_info);
+
+    debug_early("------> xen_init(): End\n");
 }

So it looks like it breaks right in version_hypercall()  and we do not even see another statement before abort(). 

My theory is that there is a missing minus or plus OSV_KERNEL_VM_SHIFT somewhere in that code or in mmu::virt_to_phys() or phys_to_virt(). As you can see I already made a change around setting the gpfn field (address of the mapping table?, https://github.com/cloudius-systems/osv/blob/a3cd022fcda2c88eae89476aa6c29e3c4be04926/bsd/sys/xen/interface/memory.h#L194-L217) which I am guessing needs to be a physical page address >> 12. I also thought that writing the address of hypercall_page to the MSR required a physical address. But when I made this change (see the commented line), it broke on that call - I did not see the following debug statement.

It is very hard to find a very good explanation of what exactly needs to happen. I have tried to look at the Linux code as an example but I am still confused about how the xen init process should look as far as what types of addresses (physical or virtual) need to be passed at what point. For example, per this - https://github.com/torvalds/linux/blob/83f3ef3de625a5766de2382f9e077d4daafd5bac/arch/x86/xen/enlighten_pvh.c#L26-L40 - Linux passes the physical address of the hypercall page. When we do it, it breaks.

Overall I am not sure I understand what exact mode of Xen virtualization OSv is coded to work under - HVM and/or PV (see https://wiki.xen.org/wiki/Xen_Project_Software_Overview#HVM)? I think when it boots on EC2, or when run with './scripts/run.py -p xen' (please note there is also a xenpv mode), Xen uses HVM (hardware assisted virtualization) but also exposes para-virtual xen devices (block and net) which OSv takes advantage of. When xenpv is used, Xen uses loader.elf (I think), so this is when entry-xen.S comes into play. In this mode Xen is supposed to set up all page tables, as we do not go through the boot.S logic.

In any case I am also not clear how exactly the memory mapping works under Xen. In HVM mode OSv goes through boot16.S/boot.S, so it seems to be setting up paging tables the same way it does for example under KVM or another hypervisor. Does it mean that CR3 works just like on KVM? But then I was reading there are special hypercalls to set up the mapping. What is the point of the memory_hypercall XENMEM_add_to_physmap() and why do we need it if we have CR3? When we mmap(), do we do something extra (hypercalls?) on Xen? I could not find any evidence of it.

Any ideas/suggestions what might be wrong and what should be fixed?

The most frustrating part is that I cannot reproduce it locally.

Waldek Kozaczuk

Jun 18, 2019, 10:05:27 PM
to OSv Development
Finally:
OSv v0.53.0-35-g61070e27
-> arch_init_premain(): elf_start = 0000000000000000
-> features_type() constructor
process_xen_bits
------> xen_init(): Start
------> xen_init(): After cpuid
------> xen_init(): Before XENVER_get_features
------> xen_init(): Before XENMEM_add_to_physmap
------> xen_init(): After XENMEM_add_to_physmap
------> xen_init(): End
-> premain() -> before init tab
-> premain() -> after init tab
1 CPUs detected
Firmware vendor: Xen
bsd: initializing - done
VFS: mounting ramfs at /
VFS: mounting devfs at /dev
net: initializing - done
vga: Add VGA device instance
eth0: ethernet address: 0a:cd:94:8b:60:d2
backend features: feature-sg feature-gso-tcp4
1024MB <Virtual Block Device> at device/vbd/51712
random: intel drng, rdrand registered as a source.
random: <Software, Yarrow> initialized
VFS: unmounting /dev
VFS: mounting rofs at /rofs
failed to mount /rofs, error = No error information
VFS: mounting zfs at /zfs
zfs: mounting osv/zfs from device /dev/vblk0.1
random: device unblocked.
VFS: mounting devfs at /dev
VFS: mounting procfs at /proc
program zpool.so returned 1
BSD shrinker: event handler list found: 0xffffa00000b79a00
BSD shrinker found: 1
BSD shrinker: unlocked, running
[E/28 bsd-log]: eth0: link state changed to DOWN

[E/28 bsd-log]: eth0: link state changed to UP

[I/28 dhcp]: Broadcasting DHCPDISCOVER message with xid: [1369833657]
[I/28 dhcp]: Waiting for IP...
[I/28 dhcp]: Broadcasting DHCPDISCOVER message with xid: [999080978]
[I/28 dhcp]: Waiting for IP...
[I/197 dhcp]: DHCP received hostname: ip-172-31-6-160

[I/197 dhcp]: Received DHCPOFFER message from DHCP server: 172.31.0.1 regarding offerred IP address: 172.31.6.160
[I/197 dhcp]: Broadcasting DHCPREQUEST message with xid: [999080978] to SELECT offered IP: 172.31.6.160
[I/197 dhcp]: DHCP received hostname: ip-172-31-6-160

[I/197 dhcp]: Received DHCPACK message from DHCP server: 172.31.0.1 regarding offerred IP address: 172.31.6.160
[I/197 dhcp]: Server acknowledged IP 172.31.6.160 for interface eth0 with time to lease in seconds: 3600
eth0: 172.31.6.160
[I/197 dhcp]: Configuring eth0: ip 172.31.6.160 subnet mask 255.255.240.0 gateway 172.31.0.1 MTU 9001
[E/197 bsd-log]: eth0: link state changed to DOWN

[E/197 bsd-log]: eth0: link state changed to UP

[I/197 dhcp]: Set hostname to: ip-172-31-6-160
disk read (real mode): 1992.96ms, (+1992.96ms)
uncompress lzloader.elf: 2019.09ms, (+26.13ms)
TLS initialization: 2045.69ms, (+26.60ms)
.init functions: 2055.63ms, (+9.94ms)
SMP launched: 2059.00ms, (+3.37ms)
VFS initialized: 2069.98ms, (+10.97ms)
Network initialized: 2072.19ms, (+2.21ms)
pvpanic done: 2072.43ms, (+0.24ms)
pci enumerated: 2085.52ms, (+13.09ms)
drivers probe: 2085.52ms, (+0.00ms)
drivers loaded: 2272.28ms, (+186.76ms)
ZFS mounted: 3042.23ms, (+769.95ms)
Total time: 6137.31ms, (+3095.08ms)
Hello from C code
[I/0 dhcp]: Unicasting DHCPRELEASE message with xid: [2052720604] from client: 172.31.6.160 to server: 172.31.0.1
VFS: unmounting /dev
VFS: unmounting /proc
VFS: unmounting /
Powering off.

This is the output from EC2 instance,

Turns out my thinking was right, just the execution was poor. Simply put, virt_to_phys() was the wrong tool at the time - the variables it depends on had not been set yet!

So a simple subtraction does the job:
     // Base + 1 would have given us the version number, it is mostly
     // uninteresting for us now
     auto x = processor::cpuid(base + 2);
-    processor::wrmsr(x.b, cast_pointer(&hypercall_page));
+    processor::wrmsr(x.b, cast_pointer(&hypercall_page) - OSV_KERNEL_VM_SHIFT);

..

+    debug_early("------> xen_init(): Before XENMEM_add_to_physmap\n");
     struct xen_add_to_physmap map;
     map.domid = DOMID_SELF;
     map.idx = 0;
     map.space = 0;
-    map.gpfn = cast_pointer(&xen_shared_info) >> 12;
+    map.gpfn = (cast_pointer(&xen_shared_info) - OSV_KERNEL_VM_SHIFT) >> 12;


I would still love to test the non-HVM (PV?) path but I need to be able to locally run OSv on Xen.

Waldek

Nadav Har'El

Jun 19, 2019, 4:45:37 AM
to Waldek Kozaczuk, OSv Development
Great, thanks!
 
I would still love to test the non-HVM (PV?) path but I need to be able to locally run OSv on Xen.

Unfortunately I can't help you with that :-(
If you have specific questions, you can try "git blame" to see who is to "blame" for lines of code you don't understand, and try to ask this person directly.

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.

Waldemar Kozaczuk

Jun 20, 2019, 12:07:17 AM
to osv...@googlegroups.com, Waldemar Kozaczuk
This patch provides all necessary changes to move OSv kernel by 1 GiB higher
in virtual memory space to start at 0x40200000. Most changes involve adding
or subtracting 0x40000000 (OSV_KERNEL_VM_SHIFT) in all relevant places. Please
note that the kernel is still loaded at 2MiB in physical memory.

The motivation for this patch is to make as much space as possible (or just enough)
in virtual memory to allow running unmodified Linux non-PIE executables (issue #190).
Even though, due to the advancement of ASLR, more and more applications are PIEs (Position
Independent Executables), which are pretty well supported by OSv, there are still many
non-PIEs (Position Dependent Executables) out there. The most prominent one is
actually the JVM, most distributions of which come with a tiny (~20K) bootstrap java non-PIE
executable. There are many other examples where a small non-PIE executable loads
other shared libraries.

As issue #1043 explains there are at least 3 possible solutions and
this patch implements the 3rd (last) one described there. Please note that in the future,
with little effort, we could provide a slightly better scheme for OSV_KERNEL_VM_SHIFT
that would allow us to place the kernel even higher, at the end of the 2GiB limit
(small memory model), and thus support virtually any non-PIE built using the small memory model.

Due to its impact this patch has been tested on following hypervisors:
- QEMU without KVM
- QEMU with KVM
- Firecracker
- VirtualBox 6
- VMware Player
- XEN on EC2
- XEN locally in HVM mode

Fixes #1043

Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
---
Makefile | 7 +++++-
arch/x64/arch-setup.cc | 39 ++++++++++++++++++------------
arch/x64/boot.S | 41 +++++++++++++++++++++++++-------
arch/x64/entry-xen.S | 3 ++-
arch/x64/loader.ld | 50 ++++++++++++++++++++-------------------
arch/x64/vmlinux-boot64.S | 6 +++--
arch/x64/xen.cc | 4 ++--
core/elf.cc | 2 +-
core/mmu.cc | 9 ++++---
loader.cc | 3 ++-
10 files changed, 103 insertions(+), 61 deletions(-)

diff --git a/Makefile b/Makefile
index 314e74f4..3c5a2077 100644
--- a/Makefile
+++ b/Makefile
@@ -312,7 +312,7 @@ gcc-sysroot = $(if $(CROSS_PREFIX), --sysroot external/$(arch)/gcc.bin) \
# To add something that will *not* be part of the main kernel, you can do:
#
# mydir/*.o EXTRA_FLAGS = <MY_STUFF>
-EXTRA_FLAGS = -D__OSV_CORE__ -DOSV_KERNEL_BASE=$(kernel_base) -DOSV_LZKERNEL_BASE=$(lzkernel_base)
+EXTRA_FLAGS = -D__OSV_CORE__ -DOSV_KERNEL_BASE=$(kernel_base) -DOSV_KERNEL_VM_SHIFT=$(kernel_vm_shift) -DOSV_LZKERNEL_BASE=$(lzkernel_base)
EXTRA_LIBS =
COMMON = $(autodepend) -g -Wall -Wno-pointer-arith $(CFLAGS_WERROR) -Wformat=0 -Wno-format-security \
-D __BSD_VISIBLE=1 -U _FORTIFY_SOURCE -fno-stack-protector $(INCLUDES) \
@@ -421,6 +421,7 @@ ifeq ($(arch),x64)
# lzkernel_base is where the compressed kernel is loaded from disk.
kernel_base := 0x200000
lzkernel_base := 0x100000
+kernel_vm_base := 0x40200000

$(out)/arch/x64/boot16.o: $(out)/lzloader.elf
$(out)/boot.bin: arch/x64/boot16.ld $(out)/arch/x64/boot16.o
@@ -480,6 +481,7 @@ endif # x64
ifeq ($(arch),aarch64)

kernel_base := 0x40080000
+kernel_vm_base := 0x40080000

include $(libfdt_base)/Makefile.libfdt
libfdt-source := $(patsubst %.c, $(libfdt_base)/%.c, $(LIBFDT_SRCS))
@@ -500,6 +502,8 @@ $(out)/loader.img: $(out)/preboot.bin $(out)/loader-stripped.elf

endif # aarch64

+kernel_vm_shift := $(shell printf "0x%X" $(shell expr $$(( $(kernel_vm_base) - $(kernel_base) )) ))
+
$(out)/bsd/sys/crypto/rijndael/rijndael-api-fst.o: COMMON+=-fno-strict-aliasing
$(out)/bsd/sys/crypto/sha2/sha2.o: COMMON+=-fno-strict-aliasing
$(out)/bsd/sys/net/route.o: COMMON+=-fno-strict-aliasing
@@ -1873,6 +1877,7 @@ stage1: $(stage1_targets) links

$(out)/loader.elf: $(stage1_targets) arch/$(arch)/loader.ld $(out)/bootfs.o
$(call quiet, $(LD) -o $@ --defsym=OSV_KERNEL_BASE=$(kernel_base) \
+ --defsym=OSV_KERNEL_VM_BASE=$(kernel_vm_base) --defsym=OSV_KERNEL_VM_SHIFT=$(kernel_vm_shift) \
index 1402e5d0..bb157bff 100644
--- a/arch/x64/boot.S
+++ b/arch/x64/boot.S
@@ -24,13 +24,28 @@
.align 4096
.global ident_pt_l4
ident_pt_l4:
- .quad ident_pt_l3 + 0x67
+ # The addresses of the paging tables have to be the physical ones, so we have to
+ # manually subtract OSV_KERNEL_VM_SHIFT in all relevant places
+ .quad ident_pt_l3 + 0x67 - OSV_KERNEL_VM_SHIFT
.rept 511
.quad 0
.endr
+#if OSV_KERNEL_VM_SHIFT != 0x40000000 && OSV_KERNEL_VM_SHIFT != 0
+#error This code only works correctly for OSV_KERNEL_VM_SHIFT = 0x40000000 or 0
+#endif
ident_pt_l3:
- .quad ident_pt_l2 + 0x67
- .rept 511
+ # Each of the 512 entries in this table maps the very 1st 512 GiB of
+ # virtual address space 1 GiB at a time
+ # The very 1st entry maps 1st GiB 1:1 by pointing to ident_pt_l2 table
+ # that specifies addresses of every one of 512 2MiB slots of physical memory
+ .quad ident_pt_l2 + 0x67 - OSV_KERNEL_VM_SHIFT
+ # The 2nd entry maps 2nd GiB to the same 1st GiB of physical memory by pointing
+ # to the same ident_pt_l2 table as the 1st entry above
+ # This way we effectively provide correct mapping for the kernel linked
+ # to start at 1 GiB + 2 MiB (0x40200000) in virtual memory and point to
+ # 2 MiB address (0x200000) where it starts in physical memory
+ .quad ident_pt_l2 + 0x67 - OSV_KERNEL_VM_SHIFT
+ .rept 510
.quad 0
.endr
ident_pt_l2:
@@ -42,7 +57,8 @@ ident_pt_l2:

gdt_desc:
.short gdt_end - gdt - 1
- .long gdt
+ # subtract OSV_KERNEL_VM_SHIFT because when gdt_desc is referenced, the memory is mapped 1:1
+ .long gdt - OSV_KERNEL_VM_SHIFT

# Set up the 64-bit compatible version of GDT description structure
# that points to the same GDT (Global segments Descriptors Table) and
@@ -53,7 +69,8 @@ gdt_desc:
.align 8
gdt64_desc:
.short gdt_end - gdt - 1
- .quad gdt
+ # subtract OSV_KERNEL_VM_SHIFT because when gdt64_desc is referenced, the memory is mapped 1:1
+ .quad gdt - OSV_KERNEL_VM_SHIFT

.align 8
gdt = . - 8
@@ -77,10 +94,12 @@ init_stack_top = .
.globl start32
.globl start32_from_64
start32:
+    # Because the memory is mapped 1:1 at this point, we have to manually
+ # subtract OSV_KERNEL_VM_SHIFT from virtual addresses in all relevant places
# boot16.S set %eax to ELF start address, we'll use it later
mov %eax, %ebp
mov $0x0, %edi
- lgdt gdt_desc
+ lgdt gdt_desc-OSV_KERNEL_VM_SHIFT

# Add an address the vmlinux_entry64 will jump to when
# switching from 64-bit to 32-bit mode
@@ -91,7 +110,7 @@ start32_from_64:
mov %eax, %fs
mov %eax, %gs
mov %eax, %ss
- ljmp $0x18, $1f
+ ljmp $0x18, $1f-OSV_KERNEL_VM_SHIFT
1:
and $~7, %esp
# Enable PAE (Physical Address Extension) - ability to address 64GB
@@ -101,6 +120,9 @@ start32_from_64:

# Set root of a page table in cr3
lea ident_pt_l4, %eax
+ # The address of the root paging table has to be physical
+    # so subtract OSV_KERNEL_VM_SHIFT from ident_pt_l4
+ sub $OSV_KERNEL_VM_SHIFT, %eax
mov %eax, %cr3

# Set long mode
@@ -128,7 +150,7 @@ start64:
jz start64_continue
call extract_linux_boot_params
mov $0x1000, %rbx
- mov $0x200000, %rbp
+ mov $OSV_KERNEL_BASE, %rbp

start64_continue:
lea .bss, %rdi
@@ -168,6 +190,7 @@ smpboot:
mov smpboot_cr4-smpboot, %eax
mov %eax, %cr4
lea ident_pt_l4, %eax
+ sub $OSV_KERNEL_VM_SHIFT, %eax
mov %eax, %cr3
mov smpboot_efer-smpboot, %eax
mov smpboot_efer+4-smpboot, %edx
@@ -181,7 +204,7 @@ smpboot:

smpboot_gdt_desc:
.short gdt_end - gdt - 1
- .long gdt
+ .long gdt - OSV_KERNEL_VM_SHIFT
.global smpboot_cr0
smpboot_cr0:
.long 0
diff --git a/arch/x64/entry-xen.S b/arch/x64/entry-xen.S
index 11f72da4..81342284 100644
--- a/arch/x64/entry-xen.S
+++ b/arch/x64/entry-xen.S
@@ -23,7 +23,7 @@

elfnote_val(XEN_ELFNOTE_ENTRY, xen_start)
elfnote_val(XEN_ELFNOTE_HYPERCALL_PAGE, hypercall_page)
-elfnote_val(XEN_ELFNOTE_VIRT_BASE, 0)
+elfnote_val(XEN_ELFNOTE_VIRT_BASE, OSV_KERNEL_VM_SHIFT)
elfnote_str(XEN_ELFNOTE_XEN_VERSION, "xen-3.0")
elfnote_str(XEN_ELFNOTE_GUEST_OS, "osv")
elfnote_str(XEN_ELFNOTE_GUEST_VERSION, "?.?")
@@ -50,4 +50,5 @@ xen_start:
mov %rsp, xen_bootstrap_end
mov %rsi, %rdi
call xen_init
+ mov $0x0, %rdi
jmp start64
diff --git a/arch/x64/loader.ld b/arch/x64/loader.ld
index caae1f68..8b82b1bb 100644
--- a/arch/x64/loader.ld
+++ b/arch/x64/loader.ld
@@ -14,16 +14,17 @@ SECTIONS
*
* We can't export the ELF header base as a symbol, because ld
* insists on moving stuff around if we do.
- *
+ */
+ . = OSV_KERNEL_VM_BASE + 0x800;
+ /*
* Place address of start32 routine at predefined offset in memory
*/
- . = OSV_KERNEL_BASE + 0x800;
- .start32_address : {
+ .start32_address : AT(ADDR(.start32_address) - OSV_KERNEL_VM_SHIFT) {
*(.start32_address)
}
- . = OSV_KERNEL_BASE + 0x1000;
- .dynamic : { *(.dynamic) } :dynamic :text
- .text : {
+ . = OSV_KERNEL_VM_BASE + 0x1000;
+ .dynamic : AT(ADDR(.dynamic) - OSV_KERNEL_VM_SHIFT) { *(.dynamic) } :dynamic :text
+ .text : AT(ADDR(.text) - OSV_KERNEL_VM_SHIFT) {
text_start = .;
*(.text.hot .text.hot.*)
*(.text.unlikely .text.*_unlikely)
@@ -31,60 +32,61 @@ SECTIONS
@@ -114,4 +116,4 @@ PHDRS {
diff --git a/arch/x64/xen.cc b/arch/x64/xen.cc
index 462c266c..c02bf62c 100644
--- a/arch/x64/xen.cc
+++ b/arch/x64/xen.cc
@@ -172,7 +172,7 @@ void xen_init(processor::features_type &features, unsigned base)
// Base + 1 would have given us the version number, it is mostly
// uninteresting for us now
auto x = processor::cpuid(base + 2);
- processor::wrmsr(x.b, cast_pointer(&hypercall_page));
+ processor::wrmsr(x.b, cast_pointer(&hypercall_page) - OSV_KERNEL_VM_SHIFT);

struct xen_feature_info info;
// To fill up the array used by C code
@@ -192,7 +192,7 @@ void xen_init(processor::features_type &features, unsigned base)
map.domid = DOMID_SELF;
map.idx = 0;
map.space = 0;
- map.gpfn = cast_pointer(&xen_shared_info) >> 12;
+ map.gpfn = (cast_pointer(&xen_shared_info) - OSV_KERNEL_VM_SHIFT) >> 12;

// 7 => add to physmap
if (memory_hypercall(XENMEM_add_to_physmap, &map))
diff --git a/core/elf.cc b/core/elf.cc
index fc2ee0c3..477a0177 100644
--- a/core/elf.cc
+++ b/core/elf.cc
@@ -1099,7 +1099,7 @@ void create_main_program()
program::program(void* addr)
: _next_alloc(addr)
{
- _core = std::make_shared<memory_image>(*this, (void*)ELF_IMAGE_START);
+ _core = std::make_shared<memory_image>(*this, (void*)(ELF_IMAGE_START + OSV_KERNEL_VM_SHIFT));
assert(_core->module_index() == core_module_index);
_core->load_segments();
set_search_path({"/", "/usr/lib"});
diff --git a/core/mmu.cc b/core/mmu.cc
index f9294125..575834ca 100644
--- a/core/mmu.cc
+++ b/core/mmu.cc
@@ -91,12 +91,12 @@ phys pte_level_mask(unsigned level)
return ~((phys(1) << shift) - 1);
}

+static void *elf_phys_start = (void*)OSV_KERNEL_BASE;
void* phys_to_virt(phys pa)
{
- // The ELF is mapped 1:1
void* phys_addr = reinterpret_cast<void*>(pa);
- if ((phys_addr >= elf_start) && (phys_addr < elf_start + elf_size)) {
- return phys_addr;
+ if ((phys_addr >= elf_phys_start) && (phys_addr < elf_phys_start + elf_size)) {
+ return (void*)(phys_addr + OSV_KERNEL_VM_SHIFT);
}

return phys_mem + pa;
@@ -106,9 +106,8 @@ phys virt_to_phys_pt(void* virt);

phys virt_to_phys(void *virt)
{
- // The ELF is mapped 1:1

Waldek Kozaczuk

unread,
Jun 20, 2019, 12:15:42 AM6/20/19
to OSv Development
I did manage to run OSv on my local Xen in HVM mode. At least almost: it gets stuck right after or before connecting to a disk (similar to the one reported here on our mailing list - https://groups.google.com/d/msg/osv-dev/qXK3eIP9NAw/TJqwjr-HEgAJ). In any case it is passing the phase where it was crashing before on EC2, and given that EC2 works I think Xen HVM is fine.

I also tried to boot using loader.elf on Xen (run.py -p xenpv). It looks like it crashes even without my changes, so my patch might or might not be correct there. But I think that is a separate story.

To unsubscribe from this group and stop receiving emails from it, send an email to osv...@googlegroups.com.

Nadav Har'El

unread,
Jun 20, 2019, 12:55:32 PM6/20/19
to Waldemar Kozaczuk, Osv Dev
Hi, thanks. I committed this patch because I think it is indeed a step in the right direction, and you did impressive work in getting everything to work on all these different hypervisors.

But now as a followup, I'd like you to please reconsider the changes to loader.ld, and whether all of it is really necessary.
I don't understand why any of this AT() stuff is needed, or, if it's needed, why only in the specific places you added it - e.g., why did ".tbss" need it while ".comment" did not (just to pick two random examples).

In theory, the addresses in the kernel's ELF should all use virtual addresses, because that will be the situation when it runs.
But I'm guessing (?) that the problem you had is that
  1. We need to *load* the kernel from disk into the physical address, not the virtual address.
  2. When we start to run the kernel, maybe the virtual  mapping isn't yet available? 
If these are the problems, can't we fix them directly? E.g., load the kernel from disk into a fixed physical address (OSV_KERNEL_BASE, not related to any address specified in the kernel's elf), then set up the mapping and run it from the virtual address. In fact, isn't this what we already do - set up the mapping before jumping into the kernel's ordinary C code?

--
Nadav Har'El
n...@scylladb.com



Waldek Kozaczuk

unread,
Jun 20, 2019, 2:00:52 PM6/20/19
to OSv Development
Nadav,

As I understand (but I am not an ELF expert), the AT directives are used to drive the so-called LMA (load address), which is typically equal to the virtual address (VMA) - http://www.scoberlin.de/content/media/http/informatik/gcc_docs/ld_3.html#SEC6. More specifically, it drives the values of the p_paddr fields of the program headers, which are normally disregarded.

Anyway, there were two problems I had to fix using the AT directives:
  1. Without the AT directives, loader.elf would grow to some 1GB, I am guessing because of the difference between virtual addresses, physical addresses and file offsets. It could have been doing something wrong, though, and AT may not have been necessary.
  2. The physical segment addresses (p_paddr) should, I think, in general not matter: in all situations where we (boot16.S) control loading the kernel (loader.elf), we can load the kernel anywhere in physical memory and p_paddr in loader.elf does not matter. But firecracker is different - it does not use usr.img; instead, it loads the OSv loader.elf directly and uses p_paddr to know where to place the segments in physical memory. And OSv cares that it starts at 0x200000. So even if I fixed the above (1GB), firecracker (and possibly in the future QEMU 4.0, which can load an ELF kernel as well) would still need correct p_paddr values.
BTW I based it all on what I have found in the Linux linker script - https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S. If I run readelf -l for vmlinux file here is what I get:

readelf -l ~/projects/firecracker/hello-vmlinux.bin 

Elf file type is EXEC (Executable file)
Entry point 0x1000000
There are 5 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000200000 0xffffffff81000000 0x0000000001000000
                 0x0000000000b6e000 0x0000000000b6e000  R E    0x200000
  LOAD           0x0000000000e00000 0xffffffff81c00000 0x0000000001c00000
                 0x00000000000aa000 0x00000000000aa000  RW     0x200000
  LOAD           0x0000000001000000 0x0000000000000000 0x0000000001caa000
                 0x000000000001f6d8 0x000000000001f6d8  RW     0x200000
  LOAD           0x00000000010ca000 0xffffffff81cca000 0x0000000001cca000
                 0x0000000000125000 0x000000000040c000  RWE    0x200000
  NOTE           0x0000000000a031d4 0xffffffff818031d4 0x00000000018031d4
                 0x0000000000000024 0x0000000000000024         0x4

 Section to Segment mapping:
  Segment Sections...
   00     .text .notes __ex_table .rodata .pci_fixup __ksymtab __ksymtab_gpl __kcrctab __kcrctab_gpl __ksymtab_strings __param __modver 
   01     .data __bug_table .vvar 
   02     .data..percpu 
   03     .init.text .altinstr_aux .init.data .x86_cpu_dev.init .parainstructions .altinstructions .altinstr_replacement .iommu_table .apicdrivers .exit.text .smp_locks .data_nosave .bss .brk 
   04     .notes 


I may have missed AT for some sections but maybe they are irrelevant. But for consistency I should fix it.

I hope you have found my explanation satisfactory. 

Regards,
Waldek

PS. I think you have not committed this patch yet ;-)

Commit Bot

unread,
Jun 23, 2019, 5:06:22 AM6/23/19
to osv...@googlegroups.com, Waldemar Kozaczuk
From: Waldemar Kozaczuk <jwkoz...@gmail.com>
Committer: Nadav Har'El <n...@scylladb.com>
Branch: master

Move kernel to 0x40200000 address (1 GiB higher) in virtual memory
Message-Id: <20190620040707.2...@gmail.com>

---
diff --git a/Makefile b/Makefile
--- a/arch/x64/entry-xen.S
+++ b/arch/x64/entry-xen.S
@@ -23,7 +23,7 @@

elfnote_val(XEN_ELFNOTE_ENTRY, xen_start)
elfnote_val(XEN_ELFNOTE_HYPERCALL_PAGE, hypercall_page)
-elfnote_val(XEN_ELFNOTE_VIRT_BASE, 0)
+elfnote_val(XEN_ELFNOTE_VIRT_BASE, OSV_KERNEL_VM_SHIFT)
elfnote_str(XEN_ELFNOTE_XEN_VERSION, "xen-3.0")
elfnote_str(XEN_ELFNOTE_GUEST_OS, "osv")
elfnote_str(XEN_ELFNOTE_GUEST_VERSION, "?.?")
@@ -50,4 +50,5 @@ xen_start:
mov %rsp, xen_bootstrap_end
mov %rsi, %rdi
call xen_init
+ mov $0x0, %rdi
jmp start64
diff --git a/arch/x64/loader.ld b/arch/x64/loader.ld
--- a/arch/x64/loader.ld
+++ b/arch/x64/loader.ld
@@ -14,77 +14,79 @@ SECTIONS
*
* We can't export the ELF header base as a symbol, because ld
* insists on moving stuff around if we do.
- *
+ */
+ . = OSV_KERNEL_VM_BASE + 0x800;
+ /*
* Place address of start32 routine at predefined offset in memory
*/
- . = OSV_KERNEL_BASE + 0x800;
- .start32_address : {
+ .start32_address : AT(ADDR(.start32_address) - OSV_KERNEL_VM_SHIFT) {
*(.start32_address)
}
- . = OSV_KERNEL_BASE + 0x1000;
- .dynamic : { *(.dynamic) } :dynamic :text
- .text : {
+ . = OSV_KERNEL_VM_BASE + 0x1000;
+ .dynamic : AT(ADDR(.dynamic) - OSV_KERNEL_VM_SHIFT) { *(.dynamic) } :dynamic :text
+ .text : AT(ADDR(.text) - OSV_KERNEL_VM_SHIFT) {
text_start = .;
*(.text.hot .text.hot.*)
*(.text.unlikely .text.*_unlikely)
*(.text.fixup)
--- a/arch/x64/vmlinux-boot64.S
+++ b/arch/x64/vmlinux-boot64.S
@@ -13,7 +13,9 @@ vmlinux_entry64:
mov %rsi, %rdi

# Load the 64-bit version of the GDT
- lgdt gdt64_desc
+ # Because the memory is mapped 1:1 at this point, we have to manualy
+ # subtract OSV_KERNEL_VM_SHIFT from the gdt address
+ lgdt gdt64_desc-OSV_KERNEL_VM_SHIFT

# Setup the stack to switch back to 32-bit mode in order
# to converge with the code that sets up transiton to 64-bit mode later.
@@ -32,6 +34,6 @@ vmlinux_entry64:
# to start32_from_64 which is where the boot process converges.
subq $8, %rsp
movl $0x18, 4(%rsp)
- movl $start32_from_64, %eax
+ movl $start32_from_64-OSV_KERNEL_VM_SHIFT, %eax # Because memory is mapped 1:1 subtract OSV_KERNEL_VM_SHIFT
movl %eax, (%rsp)
lret
diff --git a/arch/x64/xen.cc b/arch/x64/xen.cc
--- a/arch/x64/xen.cc
+++ b/arch/x64/xen.cc
@@ -172,7 +172,7 @@ void xen_init(processor::features_type &features, unsigned base)
// Base + 1 would have given us the version number, it is mostly
// uninteresting for us now
auto x = processor::cpuid(base + 2);
- processor::wrmsr(x.b, cast_pointer(&hypercall_page));
+ processor::wrmsr(x.b, cast_pointer(&hypercall_page) - OSV_KERNEL_VM_SHIFT);

struct xen_feature_info info;
// To fill up the array used by C code
@@ -192,7 +192,7 @@ void xen_init(processor::features_type &features, unsigned base)
map.domid = DOMID_SELF;
map.idx = 0;
map.space = 0;
- map.gpfn = cast_pointer(&xen_shared_info) >> 12;
+ map.gpfn = (cast_pointer(&xen_shared_info) - OSV_KERNEL_VM_SHIFT) >> 12;

// 7 => add to physmap
if (memory_hypercall(XENMEM_add_to_physmap, &map))
diff --git a/core/elf.cc b/core/elf.cc
--- a/core/elf.cc
+++ b/core/elf.cc
@@ -1099,7 +1099,7 @@ void create_main_program()
program::program(void* addr)
: _next_alloc(addr)
{
- _core = std::make_shared<memory_image>(*this, (void*)ELF_IMAGE_START);
+ _core = std::make_shared<memory_image>(*this, (void*)(ELF_IMAGE_START + OSV_KERNEL_VM_SHIFT));
assert(_core->module_index() == core_module_index);
_core->load_segments();
set_search_path({"/", "/usr/lib"});
diff --git a/core/mmu.cc b/core/mmu.cc
--- a/core/mmu.cc
+++ b/core/mmu.cc
@@ -91,12 +91,12 @@ phys pte_level_mask(unsigned level)
return ~((phys(1) << shift) - 1);
}

+static void *elf_phys_start = (void*)OSV_KERNEL_BASE;
void* phys_to_virt(phys pa)
{
- // The ELF is mapped 1:1
void* phys_addr = reinterpret_cast<void*>(pa);
- if ((phys_addr >= elf_start) && (phys_addr < elf_start + elf_size)) {
- return phys_addr;
+ if ((phys_addr >= elf_phys_start) && (phys_addr < elf_phys_start + elf_size)) {
+ return (void*)(phys_addr + OSV_KERNEL_VM_SHIFT);
}

return phys_mem + pa;
@@ -106,9 +106,8 @@ phys virt_to_phys_pt(void* virt);

phys virt_to_phys(void *virt)
{
- // The ELF is mapped 1:1
if ((virt >= elf_start) && (virt < elf_start + elf_size)) {
- return reinterpret_cast<phys>(virt);
+ return reinterpret_cast<phys>((void*)(virt - OSV_KERNEL_VM_SHIFT));
}

#if CONF_debug_memory
diff --git a/loader.cc b/loader.cc

Nadav Har'El

unread,
Jun 23, 2019, 7:19:23 AM6/23/19
to Waldek Kozaczuk, OSv Development
I just realized I told you that I committed this patch, but forgot to push it until now :-( So now it's really in.
Anyway, continuing the interesting discussion,

On Thu, Jun 20, 2019 at 9:00 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Nadav,

As I understand (but I am not ELF-expert), the AT directives are used to drive so-called LMA (load address) which typically is equal to virtual address (VMA) - http://www.scoberlin.de/content/media/http/informatik/gcc_docs/ld_3.html#SEC6. More specifically it drives values of p_addr fields of the program headers which normally are disregarded. 

Interesting. I wasn't familiar with this.
I guess my main question is why can't we continue with LMA being equal to VMA.
Why would we care about "LMA"? My thinking is:
  1. Assembly-language code loads the kernel image into an area of memory that our loader decides, regardless of what's written in the image (this place will have a different physical and virtual address - OSV_KERNEL_BASE and OSV_KERNEL_VM_BASE).
  2. Now we call the constructor program::program(). Hopefully at this point the memory is already mapped, and the virtual addresses work. We don't need those physical address.
  3. There is indeed a p_paddr address in our ELF structures, but it isn't used anywhere, and comments call it "reserved" or "ignored"...
So if all of the above was true, I still don't understand why those AT() things are needed


Anyway, there were two problems I had to fix using the AT directives:
  1. Without the AT directives, the loader.elf would grow to some 1GB, I am guessing, because of the difference between virtual, physical and file offsets. It could have been doing something wrong though and AT maybe was not necessary.
This is interesting. I can't say I understand why, without these AT directives, the file grows by 1GB (I assume it has a gigabyte of zeros before the actual content).
Or why does the AT solve this issue? With AT, is it "just" one megabyte of zeros, or none at all? I don't see this 1MB of zeros (although I remember in the past you did say you saw them?)

Without this "AT" stuff, what does readelf -l  **/loader.elf show?
  1. The physical segment addresses (p_addr) I think in general should not matter and in all situations when we (boot16.S) control loading the kernel (loader.elf) we can load the kernel anywhere in physical memory and p_addr in the loader.elf does not matter. But firecracker is different - it does not use usr.img instead, it loads OSv loader.elf and it uses p_addr to know where to place the segments in physical memory. And OSv cares that it starts at 0x200000 OSv. So even if I fixed the above (1GB), firecracker (and possibly in future QEMU 4.0 which can load ELF kernel as well) would still need correct p_addr
BTW I based it all on what I have found in the Linux linker script - https://github.com/torvalds/linux/blob/master/arch/x86/kernel/vmlinux.lds.S.

Interesting! So apparently this idea has merit, and I should ask less frantically to remove them, but I would still like to understand what this merit is :-(

By the way, I see that for some of the sections in OSv's loader.ld (towards the end of this file) you didn't add the "AT" thing. Why? Is it just because you noticed these sections don't exist at all? (maybe they existed in some other versions of the compilers, I don't remember why we added each of them). (I see you also answered this below).
As I said above, I guess these sections simply didn't appear in the executable. But I don't know; maybe they appeared in other compiler versions, or something.

I hope you have found my explanation satisfactory. 

Thanks. I'm happy you're looking into all these issues, and the actual patch you did seems to be working very well in practice.
But unfortunately I have to say I still don't understand *why* the "AT" thing is needed.

I dug around Linux's git and found (via heavy archaeology) the following patch which introduced this "AT" trick the first time:

commit ad0d75ebacdbf1004d004803df0ba371c6bdbe2a
Author: Eric W. Biederman <ebie...@xmission.com>
Date:   Sat Jun 25 14:57:47 2005 -0700

Where the author describes the reason for this AT() to "help crash dump analysis tools" and to "allows bootloaders that load vmlinux as a standard ELF executable". It doesn't say anything about a smaller executable or anything relevant to why OSv might need this.

I think the AT thing is actually correct, so let's keep it - just add it to all sections, not just some of them. But I'm still curious why it helped with anything for our needs.

By the way, I tried to google a bit and found https://lore.kernel.org/patchwork/patch/329626/ which suggests PHDRS could replace all these dozens of ugly AT()s. But I have no idea if it's true:
All that AT(ADDR(blah) - LOAD_OFFSET) stuff is cumbersome, but if it's
at least consistent with other architectures then it may not such a
disaster.  It's not universal though: less than 50% of the arches in
the kernel currently seem to use this.

An alternative, cleaner approach might be to specify segment load
addresses directly using PHDRS { }.  This should at leat allow the
load address to be specified once per phdr, rather than once per section.
To which the other person replied:
I agree it is not nice, but I once did try to make PHDRS work as you
described, but was never successful. IIRC there were serious linker
bugs) As you note the AT method is consistent with other arches, and
the generic vmlinux.lds.h
To which the original author responded:
A quick experiment shows that 

	PHDRS {
		kernel PT_LOAD AT(PHYS_OFFSET + LOAD_OFFSET);
	}

	/* ... */

	SECTIONS {
		.head.text {
			/* ... */
		} :kernel

		/* ... */
	}

can produce a sensible-looking vmlinux at least with my version of the
tools.

As you observe, GNU ld behaviour in this area tends to be rather patchily
specified, buggy or both.  That does argue in favour of reusing the
same techniques already used for other arches, though.


Regards,
Waldek

PS. I think you have not committed this patch yet ;-)

I forgot to (I committed it, but didn't "push"!), but today I did.


Waldek Kozaczuk

unread,
Jun 23, 2019, 8:12:37 AM6/23/19
to Nadav Har'El, OSv Development
Sorry if I did not make it clear enough in my previous email. The main reason we need AT is (as Biederman states as well for Linux) firecracker, which loads the OSv loader.elf as a Linux vmlinux.

Waldek Kozaczuk

unread,
Jun 23, 2019, 1:02:15 PM6/23/19
to OSv Development
Please see this bit of code from firecracker - https://github.com/firecracker-microvm/firecracker/blob/master/kernel/src/loader/mod.rs#L76-L145 - which loads OSv loader.elf.

Nadav Har'El

unread,
Jun 23, 2019, 1:46:12 PM6/23/19
to Waldek Kozaczuk, OSv Development
On Sun, Jun 23, 2019 at 3:12 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:


On Sun, Jun 23, 2019 at 07:19 Nadav Har'El <n...@scylladb.com> wrote:
I dug around Linux's git and found (via heavy archaeology) the following patch which introduced this "AT" trick the first time:

commit ad0d75ebacdbf1004d004803df0ba371c6bdbe2a
Author: Eric W. Biederman <ebie...@xmission.com>
Date:   Sat Jun 25 14:57:47 2005 -0700

Where the author describes the reason for this AT() to "help crash dump analysis tools" and to "allows bootloaders that load vmlinux as a standard ELF executable". It doesn't say anything about a smaller executable or anything relevant to why OSv might need this.
Sorry if I did not make it clear enough in my previous email.  The main reason we need AT is (as Biederman states as well for Linux) the firecracker that loads OSv loader.elf as a Linux vmlinux. 

Ok, so that was the crucial piece I was missing :-)