[PATCH] Ignore missing symbols when processing certain relocation types on load

43 views
Skip to first unread message

Waldemar Kozaczuk

unread,
Aug 20, 2019, 12:20:53 AM8/20/19
to osv...@googlegroups.com, Waldemar Kozaczuk
The commit https://github.com/cloudius-systems/osv/commit/f5cc12d56dd986d2c7982c5738b1f859702b07fb
addressed the issue #993 to relax handling of missing symbols in relocate_pltgot()
when loading ELF objects with BIND_NOW. This patch on other hand attempts to close
the gap of handling missing symbols this time in relocate_rela() during loading phase
of ELF objects by dynamic linker.

The solution is similar to what we do in relocate_pltgot()
- let missing symbol get ignored and hope they will not be used.
However in this case we cannot rely on lazy symbol resolution
logic that would catch missing symbol later when used and abort.
Instead we place an indicator of missing symbol (special invalid address)
which would trigger page fault when symbol accessed. This indicator allows
us to distinguish missing symbol used scenario from other page fault related ones.
This unfortunately does not tell us which missing symbol was used.
In future we could place indicator + offset where offset would point
to a name of the missing symbol in some missing symbols table.

Fixes #1023

Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
---
arch/x64/arch-elf.cc | 20 ++++++++++++++++----
arch/x64/arch-elf.hh | 2 ++
arch/x64/mmu.cc | 4 ++++
3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x64/arch-elf.cc b/arch/x64/arch-elf.cc
index ebe40996..4c513150 100644
--- a/arch/x64/arch-elf.cc
+++ b/arch/x64/arch-elf.cc
@@ -70,16 +70,28 @@ bool object::arch_relocate_rela(u32 type, u32 sym, void *addr,
memcpy(addr, sm.relocated_addr(), sm.size());
break;
}
- case R_X86_64_64:
- *static_cast<void**>(addr) = symbol(sym).relocated_addr() + addend;
+ case R_X86_64_64: {
+ auto _sym = symbol(sym, true);
+ if (_sym.symbol) {
+ *static_cast<void**>(addr) = _sym.relocated_addr() + addend;
+ } else {
+ *static_cast<void**>(addr) = MISSING_SYMBOL_INDICATOR;
+ }
break;
+ }
case R_X86_64_RELATIVE:
*static_cast<void**>(addr) = _base + addend;
break;
case R_X86_64_JUMP_SLOT:
- case R_X86_64_GLOB_DAT:
- *static_cast<void**>(addr) = symbol(sym).relocated_addr();
+ case R_X86_64_GLOB_DAT: {
+ auto _sym = symbol(sym, true);
+ if (_sym.symbol) {
+ *static_cast<void**>(addr) = _sym.relocated_addr();
+ } else {
+ *static_cast<void**>(addr) = MISSING_SYMBOL_INDICATOR;
+ }
break;
+ }
// The next 3 types are intended to relocate symbols of thread local variables
// defined with __thread modifier
//
diff --git a/arch/x64/arch-elf.hh b/arch/x64/arch-elf.hh
index 1811ceb5..b4c63967 100644
--- a/arch/x64/arch-elf.hh
+++ b/arch/x64/arch-elf.hh
@@ -34,6 +34,8 @@ enum {
R_X86_64_IRELATIVE = 37, // word64 indirect(B + A)
};

+void * const MISSING_SYMBOL_INDICATOR = (void*)0xffffeeeeddddcccc;
+
/* for pltgot relocation */
#define ARCH_JUMP_SLOT R_X86_64_JUMP_SLOT

diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
index 2f1ba5e2..441b6c45 100644
--- a/arch/x64/mmu.cc
+++ b/arch/x64/mmu.cc
@@ -6,6 +6,7 @@
*/

#include "arch-cpu.hh"
+#include "arch-elf.hh"
#include <osv/debug.hh>
#include <osv/sched.hh>
#include <osv/mmu.hh>
@@ -28,6 +29,9 @@ void page_fault(exception_frame *ef)
if (!pc) {
abort("trying to execute null pointer");
}
+ if (pc == MISSING_SYMBOL_INDICATOR) {
+ abort("trying to execute missing symbol");
+ }
// The following code may sleep. So let's verify the fault did not happen
// when preemption was disabled, or interrupts were disabled.
assert(sched::preemptable());
--
2.20.1

Nadav Har'El

unread,
Aug 20, 2019, 9:19:12 AM8/20/19
to Waldemar Kozaczuk, Osv Dev
On Tue, Aug 20, 2019 at 7:20 AM Waldemar Kozaczuk <jwkoz...@gmail.com> wrote:
The commit https://github.com/cloudius-systems/osv/commit/f5cc12d56dd986d2c7982c5738b1f859702b07fb
addressed the issue #993 to relax handling of missing symbols in relocate_pltgot()
when loading ELF objects with BIND_NOW. This patch on other hand attempts to close
the gap of handling missing symbols this time in relocate_rela() during loading phase
of ELF objects by dynamic linker.

Do you have an understanding *why* issue #1023 is different from anything we've seen
before, where the function is relocated by relocate_rela() and not relocate_pltgot()?
My guess was that this function is somehow accessed as a *variable* (e.g., its pointer is
copied or something) instead of as a real function call, but this was just a guess...


The solution is similar to what we do in relocate_pltgot()
- let missing symbol get ignored and hope they will not be used.
However in this case we cannot rely on lazy symbol resolution
logic that would catch missing symbol later when used and abort.
Instead we place an indicator of missing symbol (special invalid address)
which would trigger page fault when symbol accessed. This indicator allows
us to distinguish missing symbol used scenario from other page fault related ones.
This unfortunately does not tell us which missing symbol was used.
In future we could place indicator + offset where offset would point
to a name of the missing symbol in some missing symbols table.

Fixes #1023

Did you test this on an application (like the rust example #1023) where before the
application couldn't run, and now it can? Or does the application really try to use
the invalid pointer, and crash at that point?
Can you please remind me why this is an invalid pointer? Does it not have enough f's in the beginning to be a valid pointer?

+
 /* for pltgot relocation */
 #define ARCH_JUMP_SLOT R_X86_64_JUMP_SLOT

diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
index 2f1ba5e2..441b6c45 100644
--- a/arch/x64/mmu.cc
+++ b/arch/x64/mmu.cc
@@ -6,6 +6,7 @@
  */

 #include "arch-cpu.hh"
+#include "arch-elf.hh"
 #include <osv/debug.hh>
 #include <osv/sched.hh>
 #include <osv/mmu.hh>
@@ -28,6 +29,9 @@ void page_fault(exception_frame *ef)
     if (!pc) {
         abort("trying to execute null pointer");
     }
+    if (pc == MISSING_SYMBOL_INDICATOR) {
+        abort("trying to execute missing symbol");

Do you have a test case where you actually see this message?
Because I wonder if the invalid pointer actually gets *executed* (so pc = ...) - it is also possible the pointer get followed, not executed. I think "pc" isn't the general indication of where the page fault happened.


     // The following code may sleep. So let's verify the fault did not happen
     // when preemption was disabled, or interrupts were disabled.
     assert(sched::preemptable());
--
2.20.1

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/20190820042043.25133-1-jwkozaczuk%40gmail.com.

Waldek Kozaczuk

unread,
Aug 20, 2019, 11:25:13 AM8/20/19
to OSv Development
Thanks for the review. You can hopefully find some more related background info in this group post from a couple of days ago - https://groups.google.com/forum/#!msg/osv-dev/kmkqlZt2DRY/ijpde19DDQAJ


On Tuesday, August 20, 2019 at 9:19:12 AM UTC-4, Nadav Har'El wrote:

On Tue, Aug 20, 2019 at 7:20 AM Waldemar Kozaczuk <jwkoz...@gmail.com> wrote:
The commit https://github.com/cloudius-systems/osv/commit/f5cc12d56dd986d2c7982c5738b1f859702b07fb
addressed the issue #993 to relax handling of missing symbols in relocate_pltgot()
when loading ELF objects with BIND_NOW. This patch on other hand attempts to close
the gap of handling missing symbols this time in relocate_rela() during loading phase
of ELF objects by dynamic linker.

Do you have an understanding *why* issue #1023 is different from anything we've seen
before, where the function is relocated by relocate_rela() and not relocate_pltgot()?
My guess was that this function is somehow accessed as a *variable* (e.g., its pointer is
copied or something) instead of as a real function call, but this was just a guess...

Fixing #1023 was not a priority to me until I discovered it was preventing me from running ffmpeg (https://github.com/cloudius-systems/osv-apps/tree/master/ffmpeg) that I am helping someone else to run. This app requires compiling from source and it used to work at some point. At first, it would fail with missing versionsort symbol. After I added this, it would fail with a missing symbol from the family of posix_spawnattr_* functions. This rang a bell and this is how it led me to the issue 1023. After making simple modification like in an original RFC patch (see same mailing link) the app would happily work. As a matter of fact, I was able to run a version of ffmpeg from host. 

Honestly, I do not understand why this app, in this case, needs symbols resolution processed through relocate_rela() vs relocate_pltgot(). Actually, that is why I was asking the question about R_X86_64_GLOB_DAT which I would think should apply to variables, not functions. But I think you guess about being function pointers is a good one. Could it be different compiler flags?


The solution is similar to what we do in relocate_pltgot()
- let missing symbol get ignored and hope they will not be used.
However in this case we cannot rely on lazy symbol resolution
logic that would catch missing symbol later when used and abort.
Instead we place an indicator of missing symbol (special invalid address)
which would trigger page fault when symbol accessed. This indicator allows
us to distinguish missing symbol used scenario from other page fault related ones.
This unfortunately does not tell us which missing symbol was used.
In future we could place indicator + offset where offset would point
to a name of the missing symbol in some missing symbols table.

Fixes #1023

Did you test this on an application (like the rust example #1023) where before the
application couldn't run, and now it can?
I did test with ffmpeg which is the original reason for this patch. I did not try to recreate it with the original rust example which I should. 
 
Or does the application really try to use
the invalid pointer, and crash at that point?
In the ffmpeg case it never used the symbols it needed to be resolved through relocate_rela():

/usr/lib/libmount.so.1: ignoring missing symbol versionsort of type: 6
/usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_destroy of type: 6
/usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_setsigmask of type: 6
/usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_setsigdefault of type: 6
/usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawn_file_actions_adddup2 of type: 6
/usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawn_file_actions_init of type: 6
/usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawn_file_actions_destroy of type: 6
/usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_setflags of type: 6
/usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_init of type: 6
/usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnp of type: 6
/usr/lib/libgmp.so.10: ignoring missing symbol obstack_vprintf of type: 1
/usr/lib/libasound.so.2: ignoring missing symbol versionsort of type: 6
/usr/lib/libnsl.so.1: ignoring missing symbol _libc_intl_domainname of type: 6
/usr/lib/libasyncns.so.0: ignoring missing symbol __res_query of type: 6
/usr/lib/libasyncns.so.0: ignoring missing symbol __res_search of type: 6

So no it would not crash. Fffmpeg is a pretty complex app (toolbox for all kinds of media processing use cases) and depends on many shared libraries (or at least the version I compiled and the one provided as a package by Fedora or Ubuntu). I guess in all the use cases I have tested the particular symbols though missing were not used. I think that applies to many libraries - many symbols provided but not used by application actually.



 

Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
---
 arch/x64/arch-elf.cc | 20 ++++++++++++++++----
 arch/x64/arch-elf.hh |  2 ++
 arch/x64/mmu.cc      |  4 ++++
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/arch/x64/arch-elf.cc b/arch/x64/arch-elf.cc
index ebe40996..4c513150 100644
--- a/arch/x64/arch-elf.cc
+++ b/arch/x64/arch-elf.cc
@@ -70,16 +70,28 @@ bool object::arch_relocate_rela(u32 type, u32 sym, void *addr,
         memcpy(addr, sm.relocated_addr(), sm.size());
         break;
     }
-    case R_X86_64_64:
-        *static_cast<void**>(addr) = symbol(sym).relocated_addr() + addend;
+    case R_X86_64_64: {
+        auto _sym = symbol(sym, true);
+        if (_sym.symbol) {
+            *static_cast<void**>(addr) = _sym.relocated_addr() + addend;
+        } else {
+            *static_cast<void**>(addr) = MISSING_SYMBOL_INDICATOR;
+        }
         break;
+    }
The reason I set the address to MISSING_SYMBOL_INDICATOR  is NOT to prevent the app from crashing but to better inform why it crashes if missing symbol is used later. Otherwise it would show up as a regular null pointer fault.
I tried to pick something that would never be a valid pointer. Is there a better address I should use? 

+
 /* for pltgot relocation */
 #define ARCH_JUMP_SLOT R_X86_64_JUMP_SLOT

diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
index 2f1ba5e2..441b6c45 100644
--- a/arch/x64/mmu.cc
+++ b/arch/x64/mmu.cc
@@ -6,6 +6,7 @@
  */

 #include "arch-cpu.hh"
+#include "arch-elf.hh"
 #include <osv/debug.hh>
 #include <osv/sched.hh>
 #include <osv/mmu.hh>
@@ -28,6 +29,9 @@ void page_fault(exception_frame *ef)
     if (!pc) {
         abort("trying to execute null pointer");
     }
+    if (pc == MISSING_SYMBOL_INDICATOR) {
+        abort("trying to execute missing symbol");

Do you have a test case where you actually see this message?
Yes. I was able to manually create tests that would trigger this with missing symbol scenario with R_X86_64_GLOB_DAT and R_X86_64_64 relocation type. In the latter case, I used the graalvm app that uses mprotect and I manually misspelled to make it not found and the message was properly handled. I also conducted a similar test for R_X86_64_GLOB_DAT.
 
Because I wonder if the invalid pointer actually gets *executed* (so pc = ...) - it is also possible the pointer get followed, not executed. I think "pc" isn't the general indication of where the page fault happened.
Not sure I understand what you are saying here. I did happen in the 2 test scenarios I ran. I see that this particular page fault logic looks at RIP value which only apply to function execution, right? Do we have another page fault handler when data is actually read/written to where we should supply similar logic?


     // The following code may sleep. So let's verify the fault did not happen
     // when preemption was disabled, or interrupts were disabled.
     assert(sched::preemptable());
--
2.20.1

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv...@googlegroups.com.

Waldek Kozaczuk

unread,
Aug 20, 2019, 11:32:07 AM8/20/19
to OSv Development
To add to this: the bottom line is that this smaller version of the patch:

diff --git a/arch/x64/arch-elf.cc b/arch/x64/arch-elf.cc
index ebe40996..4c513150 100644
--- a/arch/x64/arch-elf.cc
+++ b/arch/x64/arch-elf.cc
@@ -70,16 +70,28 @@ bool object::arch_relocate_rela(u32 type, u32 sym, void *addr,
         memcpy(addr, sm.relocated_addr(), sm.size());
         break;
     }
-    case R_X86_64_64:
-        *static_cast<void**>(addr) = symbol(sym).relocated_addr() + addend;
+    case R_X86_64_64: {
+        auto _sym = symbol(sym, true);
+        if (_sym.symbol) {
+            *static_cast<void**>(addr) = _sym.relocated_addr() + addend;
+        }
         break;
+    }
     case R_X86_64_RELATIVE:
         *static_cast<void**>(addr) = _base + addend;
         break;
     case R_X86_64_JUMP_SLOT:
-    case R_X86_64_GLOB_DAT:
-        *static_cast<void**>(addr) = symbol(sym).relocated_addr();
+    case R_X86_64_GLOB_DAT: {
+        auto _sym = symbol(sym, true);
+        if (_sym.symbol) {
+            *static_cast<void**>(addr) = _sym.relocated_addr();
+        }
         break;
+    }
     // The next 3 types are intended to relocate symbols of thread local variables
     // defined with __thread modifier
     //

would suffice to make ffmpeg run. 

The remaining part is to better inform the user if such a missing symbol is ACTUALLY used.

Nadav Har'El

unread,
Aug 21, 2019, 3:22:58 AM8/21/19
to Waldek Kozaczuk, OSv Development
Hi, thanks. I don't have satisfactory answers for everything, but I wrote below some answers with background information which might be helpful. Please read all the way to the hand - I have answers sprinkled all over.

On Tue, Aug 20, 2019 at 6:25 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Thanks for the review. You can hopefully find some more related background info in this group post from a couple of days ago - https://groups.google.com/forum/#!msg/osv-dev/kmkqlZt2DRY/ijpde19DDQAJ

On Tuesday, August 20, 2019 at 9:19:12 AM UTC-4, Nadav Har'El wrote:

On Tue, Aug 20, 2019 at 7:20 AM Waldemar Kozaczuk <jwkoz...@gmail.com> wrote:
The commit https://github.com/cloudius-systems/osv/commit/f5cc12d56dd986d2c7982c5738b1f859702b07fb
addressed the issue #993 to relax handling of missing symbols in relocate_pltgot()
when loading ELF objects with BIND_NOW. This patch on other hand attempts to close
the gap of handling missing symbols this time in relocate_rela() during loading phase
of ELF objects by dynamic linker.

Do you have an understanding *why* issue #1023 is different from anything we've seen
before, where the function is relocated by relocate_rela() and not relocate_pltgot()?
My guess was that this function is somehow accessed as a *variable* (e.g., its pointer is
copied or something) instead of as a real function call, but this was just a guess...

Fixing #1023 was not a priority to me until I discovered it was preventing me from running ffmpeg (https://github.com/cloudius-systems/osv-apps/tree/master/ffmpeg) that I am helping someone else to run. This app requires compiling from source and it used to work at some point. At first, it would fail with missing versionsort symbol. After I added this, it would fail with a missing symbol from the family of posix_spawnattr_* functions. This rang a bell and this is how it led me to the issue 1023. After making simple modification like in an original RFC patch (see same mailing link) the app would happily work. As a matter of fact, I was able to run a version of ffmpeg from host. 

Honestly, I do not understand why this app, in this case, needs symbols resolution processed through relocate_rela() vs relocate_pltgot(). Actually, that is why I was asking the question about R_X86_64_GLOB_DAT which I would think should apply to variables, not functions. But I think you guess about being function pointers is a good one. Could it be different compiler flags?

What bothers me is that I don't know... One of the lines you showed was:
/usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_destroy of type: 6

So I went and took a look at the source code of librsvg, and it doesn't seem to use posix_spawnattr_destroy at all, let alone use it in some strange way like assume it is a variable or something...

So I'm still at a complete loss what really happens that we're trying to solve, or why we only started to see this problem now, after running dozens of different executables without this problem.

That being said, I don't dispute you are solving a real problem that really prevents you from running things, so I won't refuse to commit this, even if we don't fully understand all the details.
Although x86_64 supports 64-bit pointers, they knew that applications don't really need - yet - to address the full 2^64 bits of memory.
So the processor allows you to choose how many bits you *really* want. The default is 48 bits, allowing you to address 262 terabytes of memory, and quite enough for today's standards. This uses 4-level page tables. You can also choose 57 bits, and 5-level page tables (https://en.wikipedia.org/wiki/Intel_5-level_paging) but OSv neither does this nor want it.

When 48-bit addresses are supported, valid addresses are so-called "canonical" addresses, where all the highest bits cannot be the same. For 48 bits it means the last legal positive addresses is 0x00007FFFFFFFFFFF, while the most negative address is 0xFFFF800000000000. If you take an address which is not canonical in that sense , for example, 0xF000000000000000, then this address is completely illegal. If you take an address which is canonical, but simply not in the page table, you get a regular page fault - this is what happens with the "0" address. But you need to make sure you never allocate it.

You chose 0xffffeeeeddddcccc. This is in fact a legal canonical address. In it may be possible that our malloc() (see mempool.cc) allocates it. Whether it does, I don't know.
What happens if you use a non-canonical address? I'm not even sure if you get the normal #PF or something else like #GP.


+
 /* for pltgot relocation */
 #define ARCH_JUMP_SLOT R_X86_64_JUMP_SLOT

diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
index 2f1ba5e2..441b6c45 100644
--- a/arch/x64/mmu.cc
+++ b/arch/x64/mmu.cc
@@ -6,6 +6,7 @@
  */

 #include "arch-cpu.hh"
+#include "arch-elf.hh"
 #include <osv/debug.hh>
 #include <osv/sched.hh>
 #include <osv/mmu.hh>
@@ -28,6 +29,9 @@ void page_fault(exception_frame *ef)
     if (!pc) {
         abort("trying to execute null pointer");
     }
+    if (pc == MISSING_SYMBOL_INDICATOR) {
+        abort("trying to execute missing symbol");

Do you have a test case where you actually see this message?
Yes. I was able to manually create tests that would trigger this with missing symbol scenario with R_X86_64_GLOB_DAT and R_X86_64_64 relocation type. In the latter case, I used the graalvm app that uses mprotect and I manually misspelled to make it not found and the message was properly handled. I also conducted a similar test for R_X86_64_GLOB_DAT.

I didn't understand how you recreated this. Did you have a test case where the code really used the function relocated via R_X86_64_GLOB_DAT?

 
Because I wonder if the invalid pointer actually gets *executed* (so pc = ...) - it is also possible the pointer get followed, not executed. I think "pc" isn't the general indication of where the page fault happened.
Not sure I understand what you are saying here. I did happen in the 2 test scenarios I ran. I see that this particular page fault logic looks at RIP value which only apply to function execution, right? Do we have another page fault handler when data is actually read/written to where we should supply similar logic?

When you get a page fault, there can be two things that happened: One option is that the instruction tried to access some memory address which isn't mapped, or mapped for wrong permissions (e.g., the instruction tried to write into memory mapped read-only). In this case, we have the broken address in the cr2 register. Another option is that the *instruction* itself could not be executed - because the current pc (program counter) points to memory not mapped as executable - or - often - pc is 0 means that someone tried to *execute* the null pointer.

Because this last case happens commonly, we have a special message for it in our page_fault() handler. But it only happens if someone tries to *execute* a null pointer. If you try to read or write from a null pointer, you won't get that message.

Similarly, your test for pc will only catch trying to executing this fake address - not reading or writing for it. To catch the latter you would need to also check "addr" (i.e., cr2).

In all the examples we discussed so far, these relocations were used for functions, so presumably if ever used, these addresses will indeed be *executed*, so your test would be good enough. But if this relocation is used for a variable, then this address might be read or written, not just executed. Moreover, if the address is "close" (e.g., same page) as the fake address, it is probably an indication that someone tried to read or write a field inside that variable, instead of its first byte.


     // The following code may sleep. So let's verify the fault did not happen
     // when preemption was disabled, or interrupts were disabled.
     assert(sched::preemptable());
--
2.20.1

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/20190820042043.25133-1-jwkozaczuk%40gmail.com.

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/3a4fe9b0-e2a1-46fc-8982-b8a06fee2be3%40googlegroups.com.

Waldek Kozaczuk

unread,
Aug 25, 2019, 1:37:57 AM8/25/19
to OSv Development

On Wednesday, August 21, 2019 at 3:22:58 AM UTC-4, Nadav Har'El wrote:
Hi, thanks. I don't have satisfactory answers for everything, but I wrote below some answers with background information which might be helpful. Please read all the way to the hand - I have answers sprinkled all over.

On Tue, Aug 20, 2019 at 6:25 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Thanks for the review. You can hopefully find some more related background info in this group post from a couple of days ago - https://groups.google.com/forum/#!msg/osv-dev/kmkqlZt2DRY/ijpde19DDQAJ

On Tuesday, August 20, 2019 at 9:19:12 AM UTC-4, Nadav Har'El wrote:

On Tue, Aug 20, 2019 at 7:20 AM Waldemar Kozaczuk <jwkoz...@gmail.com> wrote:
The commit https://github.com/cloudius-systems/osv/commit/f5cc12d56dd986d2c7982c5738b1f859702b07fb
addressed the issue #993 to relax handling of missing symbols in relocate_pltgot()
when loading ELF objects with BIND_NOW. This patch on other hand attempts to close
the gap of handling missing symbols this time in relocate_rela() during loading phase
of ELF objects by dynamic linker.

Do you have an understanding *why* issue #1023 is different from anything we've seen
before, where the function is relocated by relocate_rela() and not relocate_pltgot()?
My guess was that this function is somehow accessed as a *variable* (e.g., its pointer is
copied or something) instead of as a real function call, but this was just a guess...

Fixing #1023 was not a priority to me until I discovered it was preventing me from running ffmpeg (https://github.com/cloudius-systems/osv-apps/tree/master/ffmpeg) that I am helping someone else to run. This app requires compiling from source and it used to work at some point. At first, it would fail with missing versionsort symbol. After I added this, it would fail with a missing symbol from the family of posix_spawnattr_* functions. This rang a bell and this is how it led me to the issue 1023. After making simple modification like in an original RFC patch (see same mailing link) the app would happily work. As a matter of fact, I was able to run a version of ffmpeg from host. 

Honestly, I do not understand why this app, in this case, needs symbols resolution processed through relocate_rela() vs relocate_pltgot(). Actually, that is why I was asking the question about R_X86_64_GLOB_DAT which I would think should apply to variables, not functions. But I think you guess about being function pointers is a good one. Could it be different compiler flags?

What bothers me is that I don't know... One of the lines you showed was:
/usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_destroy of type: 6

So I went and took a look at the source code of librsvg, and it doesn't seem to use posix_spawnattr_destroy at all, let alone use it in some strange way like assume it is a variable or something...
 
I dug a little deeper and it turns out librsvg was rewritten couple of years to use Rust inside but still provide C interfaces - see https://github.com/GNOME/librsvg/tree/8c5587d7c7e86cbfea1f3b14420d4bdba1f14d31/rsvg_internals/src and couple of presentations from main Readme. The Rust code seems to be using libc crate - https://github.com/rust-lang/libc - which seems to be providing Rust placeholders to all libc symbols -  https://github.com/rust-lang/libc/blob/692dc544dbaddf660d1aef37bf1846d714e3f7cf/src/unix/linux_like/linux/mod.rs#L2717 including the one you were looking for in librsvg code and could not find. I have very little experience with Rust and this specific libc crate but I am speculating that any Rust app/library that depends on libc gets all its symbols whether they are used or NOT. This kind of is similar o the symbols you originally reported missing in the issue 1023 which was about trying to run the httpserver in Rust.

So I'm still at a complete loss what really happens that we're trying to solve, or why we only started to see this problem now, after running dozens of different executables without this problem.

That being said, I don't dispute you are solving a real problem that really prevents you from running things, so I won't refuse to commit this, even if we don't fully understand all the details. 

Just like with the solution you proposed in https://github.com/cloudius-systems/osv/issues/993 to delay resolution of missing symbols in the hope the will not be used, I am trying to fill similar gap and let more unmodified binaries from Linux host run on OSv. Except in the case of 1023, where missing symbols are processed in "rela" phase, we cannot rely on a similar mechanism like elf_resolve_pltgot to catch symbol later when actually used. But similarly, as with 993 I want to ignore symbols identified as missing during rela phase with hope they will not be actually used and create a mechanism that would detect that missing symbol was actually used.
Do you suggest to use a different address then? Or maybe a different mechanism? The bottom line is that if I only ignore the missing symbol in arch_relocate_rela and unfortunately it does get used later, the user will get regular "0" address page fault in most cases which would not tell him that missing symbol was used in such case. Thoughts?

Waldek Kozaczuk

unread,
Aug 25, 2019, 10:47:22 PM8/25/19
to OSv Development


On Sunday, August 25, 2019 at 1:37:57 AM UTC-4, Waldek Kozaczuk wrote:

On Wednesday, August 21, 2019 at 3:22:58 AM UTC-4, Nadav Har'El wrote:
Hi, thanks. I don't have satisfactory answers for everything, but I wrote below some answers with background information which might be helpful. Please read all the way to the hand - I have answers sprinkled all over.

On Tue, Aug 20, 2019 at 6:25 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Thanks for the review. You can hopefully find some more related background info in this group post from a couple of days ago - https://groups.google.com/forum/#!msg/osv-dev/kmkqlZt2DRY/ijpde19DDQAJ

On Tuesday, August 20, 2019 at 9:19:12 AM UTC-4, Nadav Har'El wrote:

On Tue, Aug 20, 2019 at 7:20 AM Waldemar Kozaczuk <jwkoz...@gmail.com> wrote:
The commit https://github.com/cloudius-systems/osv/commit/f5cc12d56dd986d2c7982c5738b1f859702b07fb
addressed the issue #993 to relax handling of missing symbols in relocate_pltgot()
when loading ELF objects with BIND_NOW. This patch on other hand attempts to close
the gap of handling missing symbols this time in relocate_rela() during loading phase
of ELF objects by dynamic linker.

Do you have an understanding *why* issue #1023 is different from anything we've seen
before, where the function is relocated by relocate_rela() and not relocate_pltgot()?
My guess was that this function is somehow accessed as a *variable* (e.g., its pointer is
copied or something) instead of as a real function call, but this was just a guess...

Fixing #1023 was not a priority to me until I discovered it was preventing me from running ffmpeg (https://github.com/cloudius-systems/osv-apps/tree/master/ffmpeg) that I am helping someone else to run. This app requires compiling from source and it used to work at some point. At first, it would fail with missing versionsort symbol. After I added this, it would fail with a missing symbol from the family of posix_spawnattr_* functions. This rang a bell and this is how it led me to the issue 1023. After making simple modification like in an original RFC patch (see same mailing link) the app would happily work. As a matter of fact, I was able to run a version of ffmpeg from host. 

Honestly, I do not understand why this app, in this case, needs symbols resolution processed through relocate_rela() vs relocate_pltgot(). Actually, that is why I was asking the question about R_X86_64_GLOB_DAT which I would think should apply to variables, not functions. But I think you guess about being function pointers is a good one. Could it be different compiler flags?

What bothers me is that I don't know... One of the lines you showed was:
/usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_destroy of type: 6

So I went and took a look at the source code of librsvg, and it doesn't seem to use posix_spawnattr_destroy at all, let alone use it in some strange way like assume it is a variable or something...
 
I dug a little deeper and it turns out librsvg was rewritten couple of years to use Rust inside but still provide C interfaces - see https://github.com/GNOME/librsvg/tree/8c5587d7c7e86cbfea1f3b14420d4bdba1f14d31/rsvg_internals/src and couple of presentations from main Readme. The Rust code seems to be using libc crate - https://github.com/rust-lang/libc - which seems to be providing Rust placeholders to all libc symbols -  https://github.com/rust-lang/libc/blob/692dc544dbaddf660d1aef37bf1846d714e3f7cf/src/unix/linux_like/linux/mod.rs#L2717 including the one you were looking for in librsvg code and could not find. I have very little experience with Rust and this specific libc crate but I am speculating that any Rust app/library that depends on libc gets all its symbols whether they are used or NOT. This kind of is similar o the symbols you originally reported missing in the issue 1023 which was about trying to run the httpserver in Rust.
Another explanation is that librsvg is linked as dylib and it should as cdylib. Per this link - https://doc.rust-lang.org/edition-guide/rust-2018/platform-and-target-support/cdylib-crates-for-c-interoperability.html - dylib has extra stuff that cdylib linked lib does not. 

So I'm still at a complete loss what really happens that we're trying to solve, or why we only started to see this problem now, after running dozens of different executables without this problem.

That being said, I don't dispute you are solving a real problem that really prevents you from running things, so I won't refuse to commit this, even if we don't fully understand all the details. 

Just like with the solution you proposed in https://github.com/cloudius-systems/osv/issues/993 to delay resolution of missing symbols in the hope the will not be used, I am trying to fill similar gap and let more unmodified binaries from Linux host run on OSv. Except in the case of 1023, where missing symbols are processed in "rela" phase, we cannot rely on a similar mechanism like elf_resolve_pltgot to catch symbol later when actually used. But similarly, as with 993 I want to ignore symbols identified as missing during rela phase with hope they will not be actually used and create a mechanism that would detect that missing symbol was actually used.


The solution is similar to what we do in relocate_pltgot()
- let missing symbol get ignored and hope they will not be used.
However in this case we cannot rely on lazy symbol resolution
logic that would catch missing symbol later when used and abort.
Instead we place an indicator of missing symbol (special invalid address)
which would trigger page fault when symbol accessed. This indicator allows
us to distinguish missing symbol used scenario from other page fault related ones.
This unfortunately does not tell us which missing symbol was used.
In future we could place indicator + offset where offset would point
to a name of the missing symbol in some missing symbols table.

Fixes #1023

Did you test this on an application (like the rust example #1023) where before the
application couldn't run, and now it can?
I did test with ffmpeg which is the original reason for this patch. I did not try to recreate it with the original rust example which I should. 
I did run the original version of Rust httpserver which was used to identify the issue #1023 by reverting this patch https://github.com/cloudius-systems/osv-apps/commit/6329570cc685a6ba543c2ddecfb8b7173c3b3832 or manually changing cdylib do dylib in Cargo.toml. And now it works with my patch.

Nadav Har'El

unread,
Aug 26, 2019, 4:54:59 AM8/26/19
to Waldek Kozaczuk, OSv Development
On Sun, Aug 25, 2019 at 8:37 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
+void * const MISSING_SYMBOL_INDICATOR = (void*)0xffffeeeeddddcccc;

Can you please remind me why this is an invalid pointer? Does it not have enough f's in the beginning to be a valid pointer?
I tried to pick something that would never be a valid pointer. Is there a better address I should use? 

Although x86_64 supports 64-bit pointers, they knew that applications don't really need - yet - to address the full 2^64 bits of memory.
So the processor allows you to choose how many bits you *really* want. The default is 48 bits, allowing you to address 262 terabytes of memory, and quite enough for today's standards. This uses 4-level page tables. You can also choose 57 bits, and 5-level page tables (https://en.wikipedia.org/wiki/Intel_5-level_paging) but OSv neither does this nor want it.

When 48-bit addresses are supported, valid addresses are so-called "canonical" addresses, where all the highest bits cannot be the same. For 48 bits it means the last legal positive addresses is 0x00007FFFFFFFFFFF, while the most negative address is 0xFFFF800000000000. If you take an address which is not canonical in that sense , for example, 0xF000000000000000, then this address is completely illegal. If you take an address which is canonical, but simply not in the page table, you get a regular page fault - this is what happens with the "0" address. But you need to make sure you never allocate it.

You chose 0xffffeeeeddddcccc. This is in fact a legal canonical address. In it may be possible that our malloc() (see mempool.cc) allocates it. Whether it does, I don't know.
What happens if you use a non-canonical address? I'm not even sure if you get the normal #PF or something else like #GP.
Do you suggest to use a different address then? Or maybe a different mechanism? The bottom line is that if I only ignore the missing symbol in arch_relocate_rela and unfortunately it does get used later, the user will get regular "0" address page fault in most cases which would not tell him that missing symbol was used in such case. Thoughts?

Yes, I would choose an address we can actually be sure is illegal - either a non-canonical address as I explained above (but please verify if you still get a #PF and not a #GP), or a canonical address which we know for some reason is not allocated.
E.g., we could allocate a single page and mprotect() it to have zero permissions, and use that address.
We could also do the same without allocating a physical page at all - just mark the *virtual* address taken, without actually allocating any physical page. This will require more elaborate coding so probably not worth the extra effort just to save 4K of memory but you could leave it as a FIXME.

If you really want to spend a long time on this feature(you probably shouldn't), you can even make each variable point to a different page (or different positions in the same page, but watch out for the possibility of someone accessing an offset into a variable..) so we can print a different error message for each of the missing variables.


 
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/15086b1a-553a-496b-9c0b-498973da1ec1%40googlegroups.com.

Waldemar Kozaczuk

unread,
Aug 26, 2019, 9:56:22 PM8/26/19
to osv...@googlegroups.com, Waldemar Kozaczuk
The commit https://github.com/cloudius-systems/osv/commit/f5cc12d56dd986d2c7982c5738b1f859702b07fb
addressed the issue #993 to relax handling of missing symbols in relocate_pltgot()
when loading ELF objects with BIND_NOW. This patch on other hand attempts to close
the gap to allow running unmodified Linux executables "as-is" that reference symbols
missing in OSv kernel and relocated by relocate_rela() during loading phase but never
used later during execution. Such missing symbols would typically include global variables
and pointers to functions.

The solution is similar to what relocate_pltgot() does - let missing symbol
get ignored and hope it will not be used. However in this case we cannot
rely on lazy symbol resolution logic that would catch missing symbol later
if used through elf_resolve_pltgot() later and abort. Instead we allocate a single
page of memory - "missing_symbols_page" - setup to trigger a fault when any
read, write or execution attempt is made. Anytime missing symbol is identified
in relocate_rela(), we set symbol address in GOT to point to this new page address.
The page fault against missing_symbols_page is recognized as an attempt to access
missing symbol (invoke a function or read/write variable) and program is aborted
with stack trace like this:

trying to execute or access missing symbol
[backtrace]
0x00000000403a7540 <page_fault+288>
0x00000000403a6316 <???+1077568278>

One way to test this patch is to modify Cargo.toml in rust-httpserver app
and change 'crate-type' to 'dylib' and run the app. Another way
is to misspell or comment out 'mprotect' function in libc/mman.cc
and build and run graalvm-example.

Fixes #1023

Signed-off-by: Waldemar Kozaczuk <jwkoz...@gmail.com>
---
arch/x64/arch-elf.cc | 20 ++++++++++++++++----
arch/x64/mmu.cc | 4 ++++
core/elf.cc | 14 ++++++++++++++
include/osv/elf.hh | 3 +++
loader.cc | 1 +
5 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/arch/x64/arch-elf.cc b/arch/x64/arch-elf.cc
index ebe40996..665f06ff 100644
--- a/arch/x64/arch-elf.cc
+++ b/arch/x64/arch-elf.cc
@@ -70,16 +70,28 @@ bool object::arch_relocate_rela(u32 type, u32 sym, void *addr,
memcpy(addr, sm.relocated_addr(), sm.size());
break;
}
- case R_X86_64_64:
- *static_cast<void**>(addr) = symbol(sym).relocated_addr() + addend;
+ case R_X86_64_64: {
+ auto _sym = symbol(sym, true);
+ if (_sym.symbol) {
+ *static_cast<void**>(addr) = _sym.relocated_addr() + addend;
+ } else {
+ *static_cast<void**>(addr) = missing_symbols_page_addr;
+ }
break;
+ }
case R_X86_64_RELATIVE:
*static_cast<void**>(addr) = _base + addend;
break;
case R_X86_64_JUMP_SLOT:
- case R_X86_64_GLOB_DAT:
- *static_cast<void**>(addr) = symbol(sym).relocated_addr();
+ case R_X86_64_GLOB_DAT: {
+ auto _sym = symbol(sym, true);
+ if (_sym.symbol) {
+ *static_cast<void**>(addr) = _sym.relocated_addr();
+ } else {
+ *static_cast<void**>(addr) = missing_symbols_page_addr;
+ }
break;
+ }
// The next 3 types are intended to relocate symbols of thread local variables
// defined with __thread modifier
//
diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
index 2f1ba5e2..24da5caa 100644
--- a/arch/x64/mmu.cc
+++ b/arch/x64/mmu.cc
@@ -13,6 +13,7 @@
#include <osv/interrupt.hh>
#include <osv/migration-lock.hh>
#include <osv/prio.hh>
+#include <osv/elf.hh>
#include "exceptions.hh"

void page_fault(exception_frame *ef)
@@ -28,6 +29,9 @@ void page_fault(exception_frame *ef)
if (!pc) {
abort("trying to execute null pointer");
}
+ if (reinterpret_cast<void*>(addr & ~(mmu::page_size - 1)) == elf::missing_symbols_page_addr) {
+ abort("trying to execute or access missing symbol");
+ }
// The following code may sleep. So let's verify the fault did not happen
// when preemption was disabled, or interrupts were disabled.
assert(sched::preemptable());
diff --git a/core/elf.cc b/core/elf.cc
index cb932044..85b2eb96 100644
--- a/core/elf.cc
+++ b/core/elf.cc
@@ -1108,6 +1108,20 @@ void object::init_static_tls()
}
}

+// This allocates single page of memory and sets its permissions so
+// that any read, write or execution attempt would trigger a page fault
+// to indicate to user that application tried to access missing symbol
+// ignored by relocate_rela().
+// TODO: In future this page can store the name addresses for each missing symbol
+// and allow informing user which particular symbol was missing
+void *missing_symbols_page_addr;
+void setup_missing_symbols_detector()
+{
+ missing_symbols_page_addr = mmu::map_anon(nullptr, mmu::page_size, mmu::mmap_populate, mmu::perm_rw);
+ assert(missing_symbols_page_addr);
+ mmu::mprotect(missing_symbols_page_addr, mmu::page_size, 0);
+}
+
program* s_program;

void create_main_program()
diff --git a/include/osv/elf.hh b/include/osv/elf.hh
index 775d8c8d..c38c5c25 100644
--- a/include/osv/elf.hh
+++ b/include/osv/elf.hh
@@ -634,6 +634,9 @@ private:
std::stack<std::vector<std::shared_ptr<object>>> _loaded_objects_stack;
};

+extern void *missing_symbols_page_addr;
+void setup_missing_symbols_detector();
+
void create_main_program();

/**
diff --git a/loader.cc b/loader.cc
index 7ac99ef5..661979e3 100644
--- a/loader.cc
+++ b/loader.cc
@@ -591,6 +591,7 @@ void main_cont(int loader_argc, char** loader_argv)
enable_backtraces();
}
sched::init_detached_threads_reaper();
+ elf::setup_missing_symbols_detector();

bsd_init();

--
2.20.1

Waldek Kozaczuk

unread,
Aug 29, 2019, 7:58:24 AM8/29/19
to OSv Development
Any reason we should not apply this version of the patch?

Commit Bot

unread,
Aug 31, 2019, 10:54:59 AM8/31/19
to osv...@googlegroups.com, Waldemar Kozaczuk
From: Waldemar Kozaczuk <jwkoz...@gmail.com>
Committer: Waldemar Kozaczuk <jwkoz...@gmail.com>
Branch: master

Ignore missing symbols when processing certain relocation types on load
diff --git a/arch/x64/arch-elf.cc b/arch/x64/arch-elf.cc
--- a/arch/x64/mmu.cc
+++ b/arch/x64/mmu.cc
@@ -13,6 +13,7 @@
#include <osv/interrupt.hh>
#include <osv/migration-lock.hh>
#include <osv/prio.hh>
+#include <osv/elf.hh>
#include "exceptions.hh"

void page_fault(exception_frame *ef)
@@ -28,6 +29,9 @@ void page_fault(exception_frame *ef)
if (!pc) {
abort("trying to execute null pointer");
}
+ if (reinterpret_cast<void*>(addr & ~(mmu::page_size - 1)) ==
elf::missing_symbols_page_addr) {
+ abort("trying to execute or access missing symbol");
+ }
// The following code may sleep. So let's verify the fault did not
happen
// when preemption was disabled, or interrupts were disabled.
assert(sched::preemptable());
diff --git a/core/elf.cc b/core/elf.cc
--- a/core/elf.cc
+++ b/core/elf.cc
@@ -1114,6 +1114,20 @@ void object::init_static_tls()
}
}

+// This allocates single page of memory and sets its permissions so
+// that any read, write or execution attempt would trigger a page fault
+// to indicate to user that application tried to access missing symbol
+// ignored by relocate_rela().
+// TODO: In future this page can store the name addresses for each missing
symbol
+// and allow informing user which particular symbol was missing
+void *missing_symbols_page_addr;
+void setup_missing_symbols_detector()
+{
+ missing_symbols_page_addr = mmu::map_anon(nullptr, mmu::page_size,
mmu::mmap_populate, mmu::perm_rw);
+ assert(missing_symbols_page_addr);
+ mmu::mprotect(missing_symbols_page_addr, mmu::page_size, 0);
+}
+
program* s_program;

void create_main_program()
diff --git a/include/osv/elf.hh b/include/osv/elf.hh
--- a/include/osv/elf.hh
+++ b/include/osv/elf.hh
@@ -635,6 +635,9 @@ private:
std::stack<std::vector<std::shared_ptr<object>>> _loaded_objects_stack;
};

+extern void *missing_symbols_page_addr;
+void setup_missing_symbols_detector();
+
void create_main_program();

/**
diff --git a/loader.cc b/loader.cc
Reply all
Reply to author
Forward
0 new messages