OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg

37 views
Skip to first unread message

Waldek Kozaczuk

unread,
Nov 22, 2018, 2:55:03 PM11/22/18
to OSv Development
I see this crash:

frame=  275 fps= 47 q=-0.0 size=     520kB time=00:00:1pa2ge. f2a1 ublitt roauttes=i d3e4 8a.p6pklbiictast/iso ns,p eeadd=d2r200001d00000 0 0
[registers]
RIP: 0x000000000044ef5f <???+4517727>
RFL: 0x0000000000010202  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x8000000000000000  RBX: 0x0000200001d00004  RCX: 0x0000000000000002  RDX: 0x0000200001cff66c
RSI: 0xfffffffffffffffc  RDI: 0x8000000000000000  RBP: 0x0000200001cff7a0  R8:  0x0000000000004000
R9:  0x00000000ffffffe5  R10: 0x0000000000004000  R11: 0x8000000000000000  R12: 0x0000000000000000
R13: 0x00000000ffffffe5  R14: 0x0000000000004000  R15: 0x8000000000000000  RSP: 0x0000200001cfda50
Aborted

[backtrace]
0x0000000000346ce2 <???+3435746>
0x0000000000347946 <mmu::vm_fault(unsigned long, exception_frame*)+310>
0x00000000003a222b <page_fault+123>
0x00000000003a10a6 <???+3805350>

Please note that ffmpeg is constantly printing to screen (vga or serial console?) some output about progress.

Once connected to gdb I see this stacktrace:

(gdb) bt
#0  0x00000000003a83d2 in processor::cli_hlt () at arch/x64/processor.hh:247
#1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
#2  osv::halt () at arch/x64/power.cc:24
#3  0x000000000023ef34 in abort (fmt=fmt@entry=0x63095b "Aborted\n") at runtime.cc:132
#4  0x0000000000202765 in abort () at runtime.cc:98
#5  0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, ef=0xffff800006550068) at core/mmu.cc:1316
#6  0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184402497536, ef=ef@entry=0xffff800006550068) at core/mmu.cc:1330
#7  0x00000000003a222c in page_fault (ef=0xffff800006550068) at arch/x64/mmu.cc:38
#8  <signal handler called>
#9  0x000000000044ef5f in fmt_fp (f=0x200001cffa50, y=0, w=0, p=2, fl=0, t=102) at libc/stdio/vfprintf.c:300
#10 0x0000000000000000 in ?? ()

I wonder if this is related to https://github.com/cloudius-systems/osv/issues/1010 (though no httpserver at all) and this https://github.com/cloudius-systems/osv/issues/536.

Please note that the common thing between all these stack traces is fmt_fp() function in libc/stdio/vfprintf.c:300. Coincidence?

Wadek

Waldek Kozaczuk

unread,
Nov 22, 2018, 2:58:40 PM11/22/18
to OSv Development
To reproduce (may not happen every time):

1) build with ffmpeg (enable h265 in ffmpeg Makefile)
./scripts/build image=ffmpeg

2) In one terminal start ffmpeg to receive video on host:
cd apps/ffmpeg/ROOTFS
LD_LIBRARY_PATH=. ./ffmpeg.so -i tcp://0.0.0.0:12345?listen -c copy /home/wkozaczuk/test.mp4

3) In another terminal start ffmpeg to transcode and send video:
./scripts/run.py -c 8 -e '/ffmpeg.so -i http://clips.vorwaerts-gmbh.de/VfE_html5.mp4 -c:v libx265 -crf 28 -c:a aac -b:a 128k -f mpegts tcp://192.168.122.1:12345'

Waldek

Waldek Kozaczuk

unread,
Nov 25, 2018, 11:06:14 PM11/25/18
to OSv Development
I have updated #1010 where I think it is possibly a duplicate to #536. As far as #1010 I did an experiment where I changed priority field of type float in JSON thread structure to long (which makes it not use fmt_fp()) and the crash went away. So I wonder if this problem is similar in that the the underlying cause might be related to how OSv handles FPU state when concurrent threads are trying to printf floating numbers (%f) (that is why it happens in fmt_fp()) or possibly there is a bug in this code that was fixed in musl.

I will try to create simple test that printfs floating point number in multiple threads to see if we can narrow it down even more.

Nadav Har'El

unread,
Nov 26, 2018, 9:27:43 AM11/26/18
to Waldek Kozaczuk, osv...@googlegroups.com
Interesting. Please open a new issue about it but link to #1010 and #536, which I agree - are probably all exactly the same bug.

Such crashes always smell like FPU-state-saving bugs but in this case (which are notoriously difficult to fix), but I wonder...
The code in line 300 is:

       do {
                *z = y;
                y = 1000000000*(y-*z++);
        } while (y);
 
   
Can you print "y" and "z" with the debugger?

The FPU bugs usually cause crashes when memory-copying code using FPU overwrites random parts of memory. So why here do we always get a crash in exactly the same place with this "z", and what overwrote it and when??? (z is set just a couple of lines above and increased here in a tight loop, what can overwrite it?

Another possibility is that "y" is kept in a floating point register and clobbered by some interrupt that doesn't save FPU state (or something like that) which causes the loop to continue forever.
If you can reliably reproduce this, you can add various printouts or variables to help debug what goes wrong with "z" or "y".

Finally, I'm not familiar with this code - this can also be a musl bug - see for example https://www.openwall.com/lists/musl/2014/03/22/7 which says MUSL did have known bugs in this area. You should be able to fairly easily port a new musl's vfprintf.c to OSv (do diff libc/stdio/vfprintf.c musl/src/stdio/vfprintf.c to see what were the only changes OSv deliberately did to musl's vfprintf.c).


Waldek Kozaczuk

unread,
Nov 26, 2018, 11:52:23 PM11/26/18
to OSv Development
Upgrading vfprintf.c to the latest version from musl did not help. So back to square one.
#3  0x000000000023ef34 in abort (fmt=fmt@entry=0x63089b "Aborted\n") at runtime.cc:132
#4  0x0000000000202765 in abort () at runtime.cc:98
#5  0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, ef=0xffff800003431068) at core/mmu.cc:1316
#6  0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984, ef=ef@entry=0xffff800003431068) at core/mmu.cc:1330
#7  0x00000000003a222c in page_fault (ef=0xffff800003431068) at arch/x64/mmu.cc:38
#8  <signal handler called>
#9  0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:300
#10 0x0000000000000000 in ?? ()
(gdb) frame 9
#9  0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:300
300 *z = y;
(gdb) display y
1: y = 0
(gdb) p/f y
$1 = 0
(gdb) p z
$2 = (uint32_t *) 0x200000200000
(gdb) p z - big // big is a beginning of the buffer that is around 1800 long 
$3 = 4632
(gdb) p big
$4 = {0 <repeats 1828 times>}


The FPU bugs usually cause crashes when memory-copying code using FPU overwrites random parts of memory. So why here do we always get a crash in exactly the same place with this "z", and what overwrote it and when??? (z is set just a couple of lines above and increased here in a tight loop, what can overwrite it?

Another possibility is that "y" is kept in a floating point register and clobbered by some interrupt that doesn't save FPU state (or something like that) which causes the loop to continue forever.
If you can reliably reproduce this, you can add various printouts or variables to help debug what goes wrong with "z" or "y".

What is interesting when I modified vfprintf.c like the I could not reproduce the error any more (or maybe less frequent) after around 20 runs of ffmpeg. Normally it happens every time in the scenario #1010.

diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c
index aac790c0..1e116038 100644
--- a/libc/stdio/vfprintf.c
+++ b/libc/stdio/vfprintf.c
@@ -8,6 +8,7 @@
 #include <inttypes.h>
 #include <math.h>
 #include <float.h>
+#include <assert.h>
 
 /* Some useful macros */
 
@@ -296,9 +297,14 @@ static int fmt_fp(FILE *f, long double y, int w, int p, int fl, int t)
        if (e2<0) a=r=z=big;
        else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
 
+        int steps = 0;
        do {
                *z = y;
                y = 1000000000*(y-*z++);
+                steps++;
+               if(steps > 2000) {
+                  assert(0);
+                }
        } while (y);
 
        while (e2>0) {

Could it be because gcc optimized code differently and did not put "y" in the FPU registerr?

BTW I wonder if this part of some optimization suggestions in #536 is any relevant - https://github.com/cloudius-systems/osv/issues/536#issuecomment-62365543. Would it be possible to explicitly force gcc not to use FPU register for y? Would it degrade performance a lot? In ether case even if possible I am guessing the underlying issue of not saving FPU state somewhere could hurt us later in other place of the code anyway, correct? 
 
Finally when you say FPU state got corrupted or not saved/restored do you mean same thing? Can you elaborate?
I see this struct in processor.hh:
struct fpu_state {
    char legacy[512];
    char xsavehdr[24];
    char reserved[40];
    char ymm[256];
} __attribute__((packed));

Is this struct getting corrupted/overwritten in memory by some code or is it that some interrupt handler does not save/restore FPU using fpu_lock class like it is done by interrupt, page_fault handler, signal and syscall handler? Is this what you are saying difficult to trace?

Nadav Har'El

unread,
Nov 27, 2018, 3:36:14 AM11/27/18
to Waldek Kozaczuk, osv...@googlegroups.com
On Tue, Nov 27, 2018 at 6:52 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Upgrading vfprintf.c to the latest version from musl did not help. So back to square one.

:-(



(gdb) bt
#0  0x00000000003a83d2 in processor::cli_hlt () at arch/x64/processor.hh:247
#1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
#2  osv::halt () at arch/x64/power.cc:24
#3  0x000000000023ef34 in abort (fmt=fmt@entry=0x63089b "Aborted\n") at runtime.cc:132
#4  0x0000000000202765 in abort () at runtime.cc:98
#5  0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, ef=0xffff800003431068) at core/mmu.cc:1316
#6  0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984, ef=ef@entry=0xffff800003431068) at core/mmu.cc:1330
#7  0x00000000003a222c in page_fault (ef=0xffff800003431068) at arch/x64/mmu.cc:38
#8  <signal handler called>
#9  0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:300
#10 0x0000000000000000 in ?? ()
(gdb) frame 9
#9  0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:300
300 *z = y;
(gdb) display y
1: y = 0

If y is really 0 (and not some sort of "almost 0 printed as 0", the loop should have stopped. But maybe it's also possible that gdb thinks it is 0 because that's what the memory variable says, but the compiler actually didn't update the memory variable and only keeps y in a register.... You can disassemble the relevant part of this function to see what the compiled looks like.

That being said, something I don't understand: even if y is corrupted once, I would expect it to perhaps double the amount this loop tries to write. It seems the loop ran much more than that. Could y have been corrupted more than once? Could y have reached some broken state which "looks" like zero but never evaluates to be equal to zero?

 
(gdb) p/f y
$1 = 0
(gdb) p z
$2 = (uint32_t *) 0x200000200000

Interesting. This is exactly the beginning of another huge page, so it makes sense we would get a page fault at this place, after already looping many more times than we should.
Because you used a function call here, the compiler is forced to save its FPU state... Can you try just "break" here and do the print or assert after the loop?

+                }
        } while (y);
 
        while (e2>0) {

Could it be because gcc optimized code differently and did not put "y" in the FPU registerr?

BTW I wonder if this part of some optimization suggestions in #536 is any relevant - https://github.com/cloudius-systems/osv/issues/536#issuecomment-62365543. Would it be possible to explicitly force gcc not to use FPU register for y? Would it degrade performance a lot? In ether case even if possible I am guessing the underlying issue of not saving FPU state somewhere could hurt us later in other place of the code anyway, correct? 

I'm less worried about fixing this specific bug, and much more worried that we have a bigger bug which can crop up randomly in other workloads...
 
 
Finally when you say FPU state got corrupted or not saved/restored do you mean same thing? Can you elaborate?
I see this struct in processor.hh:
struct fpu_state {
    char legacy[512];
    char xsavehdr[24];
    char reserved[40];
    char ymm[256];
} __attribute__((packed));

Is this struct getting corrupted/overwritten in memory by some code or is it that some interrupt handler does not save/restore FPU using fpu_lock class like it is done by interrupt, page_fault handler, signal and syscall handler? Is this what you are saying difficult to trace?

The basic issue is that our kernel code uses the FPU - both for traditional FPU work (e.g., floating-point calculations in the scheduler, etc.) and for modern SIMD trickery generated by the compiler or hand-written by us (e.g., use some 128-bit FPU instruction to copy 128 bits at a time).  As I mentioned above, when the compiler calls a function, it saves whatever FPU work it was in the middle of. But when something happens asynchronously - e.g, an interrupt, exception, etc., the code called (interrupt handler, etc.) needs to save the entire FPU state because it has no idea what the user's code might be in the middle of doing.

We obviously have been doing this for a long time, but in some cases we forgot to save the FPU state or did it in the wrong time and this ets discovered and exposed by new compiler optimization which uses FPU in more and more unexpected places (e.g., see 4349bfd3583df03d44cda480049e630081d3a20b for one I fixed last year). Unfortunately it is possible we still have such bugs and they are hard to hunt down (the last bug I solved by reviewing disassembled code....).

You're right that it's also possible that we did save the FPU state correctly, but then the interrupt handler or another thread corrupted the saved buffer, so when restoring it, we restore crap. I've never seen this happening, but it definitely can...

Waldek Kozaczuk

unread,
Nov 27, 2018, 1:36:28 PM11/27/18
to OSv Development
Here is the disassembled portion of the code in vfprint.c:
480         else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
 481 
 482         do {
 483                 *z = y;
 484      540:       d9 c0                   fld    %st(0)
 485      542:       d9 ad 0c e3 ff ff       fldcw  -0x1cf4(%rbp)
 486      548:       df bd 00 e3 ff ff       fistpll -0x1d00(%rbp)
 487      54e:       d9 ad 0e e3 ff ff       fldcw  -0x1cf2(%rbp)
 488                 y = 1000000000*(y-*z++);
 489      554:       48 83 c3 04             add    $0x4,%rbx
 490                 *z = y;
 491      558:       48 8b 85 00 e3 ff ff    mov    -0x1d00(%rbp),%rax
 492      55f:       89 43 fc                mov    %eax,-0x4(%rbx)
 493                 y = 1000000000*(y-*z++);
 494      562:       89 c0                   mov    %eax,%eax
 495      564:       48 89 85 f0 e2 ff ff    mov    %rax,-0x1d10(%rbp)
 496      56b:       df ad f0 e2 ff ff       fildll -0x1d10(%rbp)
 497      571:       de e9                   fsubrp %st,%st(1)
 498      573:       d8 c9                   fmul   %st(1),%st
 499         } while (y);
 500      575:       d9 ee                   fldz
 501      577:       d9 c9                   fxch   %st(1)
 502      579:       db e9                   fucomi %st(1),%st
 503      57b:       dd d9                   fstp   %st(1)
 504      57d:       7a c1                   jp     540 <fmt_fp+0x140>
 505      57f:       75 bf                   jne    540 <fmt_fp+0x140>
 506      581:       dd d8                   fstp   %st(0)
 507      583:       dd d8                   fstp   %st(0)
 508 
 509         while (e2>0) {
 510      585:       4c 8b a5 e0 e2 ff ff    mov    -0x1d20(%rbp),%r12

Obviously it does use FP registers. I wonder if there is some kind of arithmetic exception encountered (overflow when multiplying) which we do not handle correctly. I think we have handler that would simply abort in this case.

Waldek

Waldek Kozaczuk

unread,
Nov 27, 2018, 3:01:38 PM11/27/18
to OSv Development
Also I wonder if the musl sources are compiled correctly as far as floating point handling is concerned - https://wiki.musl-libc.org/mathematical-library.html

Sent from my iPhone
--
You received this message because you are subscribed to a topic in the Google Groups "OSv Development" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/osv-dev/ZAjvi_vEhnA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to osv-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nadav Har'El

unread,
Nov 27, 2018, 3:12:08 PM11/27/18
to Waldek Kozaczuk, osv...@googlegroups.com
Indeed, seems like a loop that works on fpu registers and stack. The actual loop's test, while(y) is the "fucomi" instruction which compares two floating point values one of which being a zero created by "fldz". My completely unproven suspicion is that in the middle of this loop we get an interrupt (possibly also leading to a context switch, running another thread, and only much later returning to this thread), and for some reason the floating point state (which includes the register stack, etc.) is not saved correctly - or not restored correctly (perhaps restored from a corrupted array?). If after such corruption, "y" (in whatever register it sits) becomes, for example, NaN, the loop will never finish. I wonder if we can print these registers from gdb to see if perhaps gdb showing "y=0" isn't really correct.
 
I wonder if there is some kind of arithmetic exception encountered (overflow when multiplying) which we do not handle correctly. I think we have handler that would simply abort in this case.

I don't think this is the case. Floating point exceptions are rarely enabled (see also https://github.com/cloudius-systems/osv/issues/855 where we discovered qemu doesn't even support them correctly), and usually overflow when multiplying will just create an "inf" value.

Nadav Har'El

unread,
Nov 27, 2018, 5:53:17 PM11/27/18
to Waldek Kozaczuk, osv...@googlegroups.com

On Tue, Nov 27, 2018 at 10:11 PM Nadav Har'El <n...@scylladb.com> wrote:
Indeed, seems like a loop that works on fpu registers and stack. The actual loop's test, while(y) is the "fucomi" instruction which compares two floating point values one of which being a zero created by "fldz". My completely unproven suspicion is that in the middle of this loop we get an interrupt (possibly also leading to a context switch, running another thread, and only much later returning to this thread), and for some reason the floating point state (which includes the register stack, etc.) is not saved correctly - or not restored correctly (perhaps restored from a corrupted array?). If after such corruption, "y" (in whatever register it sits) becomes, for example, NaN, the loop will never finish. I wonder if we can print these registers from gdb to see if perhaps gdb showing "y=0" isn't really correct.

Ok, so I started theorizing what might cause this...
If I remember correctly, OSv currently always saves the FPU state on some stack, using the fpu_lock type.
Could we possibly be using stacks which are too small to hold this FPU state?
In arch/x64/arch-cpu.hh we set a 4096 byte stack for nested exceptions, 4096 byte stack for interrupts, and 4096*4 byte stack for normal exceptions. Maybe one of these is too small? If you can easily reproduce this bug, can you add a zero to all of these and see if maybe the bug goes away with bigger stacks?

Waldek Kozaczuk

unread,
Nov 27, 2018, 6:02:54 PM11/27/18
to Nadav Har'El, osv...@googlegroups.com
Did not help. 

Sent from my iPhone

Waldek Kozaczuk

unread,
Nov 27, 2018, 6:07:55 PM11/27/18
to Nadav Har'El, osv...@googlegroups.com
I also checked if y == 0 in Gdb when I connected after the crash and indeed it was true. 

What about eflags which are set by FUCOMI? Are we saving those?

Sent from my iPhone

Waldek Kozaczuk

unread,
Nov 27, 2018, 6:30:03 PM11/27/18
to Nadav Har'El, OSv Development
I was also reading about "lazy" vs "eager" FPU state save/restore in Linux and how due to some security reasons they are advocating users to switch to the eager. I think the eager means that on each context switch FPU state gets saved/restored regardless if FPU registered are used.

Is OSv using "eager" or "lazy" strategy? I am guessing probably the lazy one. 

Also I am not sure about cassandra (ant original older bug) but ffmpeg is very heavy on floating point arithmetic so maybe that exposes FPU bugs more easily. 

Waldek Kozaczuk

unread,
Nov 27, 2018, 10:54:02 PM11/27/18
to OSv Development
I do not think I understand mechanics of how context switch in handled in OSv but fundamentally whenever current thread is preempted for whatever reason (timer expired, etc), OSv thread scheduler must somehow save all 16 registers, flags and everything else but FPU state somewhere and then upon selecting new thread for execution it must restore values of its 16 registers plus everything else so that the newly switched thread sees "the world" as it was when it was last running. I think more less it is done through TSS (task state segment) for each cpu and saving/restoring happens automatically (or not?).

In any case as far as FPU state goes there is number of places (interrupts, page fault, syscall handler) where fpu_lock construct is used to isolate any changes to FPU registers made by OSv code executed by corresponding code for these interrupts, etc. That way OSv code should not corrupt any FPU state.

But what mechanism does exist in OSv to make sure that concurrent application threads that use floating point registers (like vfprintf used by httpserver thread or any other floating point arithmetic used by ffmpeg threads for example) do NOT corrupt each other FPU state when preempted? 

Here is what I found on internet (https://wiki.osdev.org/Context_Switching):

"The FPU/MMX and SSE state could be saved and reloaded, but the CPU can also be tricked into generating an exception the first time that an FPU/MMX or SSE instruction is used by copying the hardware context switch mechanism (setting the TS flag in CR0)."

Does this suggest it should be possible to register an exception handler that would be called when first FPU operation is executed to mark to save FPU state when preempted and then restore when coming back to this thread? But I do not see where such code exists in OSv. Or have I missed it?

Waldek

Nadav Har'El

unread,
Nov 28, 2018, 3:58:19 AM11/28/18
to Waldek Kozaczuk, osv...@googlegroups.com
On Wed, Nov 28, 2018 at 5:54 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
I do not think I understand mechanics of how context switch in handled in OSv but fundamentally whenever current thread is preempted for whatever reason (timer expired, etc), OSv thread scheduler must somehow save all 16 registers, flags and everything else but FPU state somewhere and then upon selecting new thread for execution it must restore values of its 16 registers plus everything else so that the newly switched thread sees "the world" as it was when it was last running. I think more less it is done through TSS (task state segment) for each cpu and saving/restoring happens automatically (or not?).

A context switch (from one thread to another thread) can happen in two ways. The first is voluntary - the first thread calls some blocking function (sleep(), read(), etc.) and goes to sleep and replaced by another thread. The second is involuntary - we get an asynchronous event during the run of the thread - an interrupt, exception, or signal - and the scheduler decides to do a context switch at that point.

In the first case (voluntary switching) our life is easy. The user thread called a *function*. The ABI requires that when calling a function, the caller must save many of the registers (known as "caller-saved") himself because these can be used by the called function. It also requires that the caller cannot assume anything about the FPU state (because the callee is allowed to use the FPU too!). For this reason the thread context switch code (switching between user threads) only needs to save some of the registers, and does NOT need to save the FPU state.

The situation is different with involuntary context switches. When an asynchronous event, e.g., an interrupt, occurs, the user thread is in a random position in the code. It may be using all its registers, and the FPU state (including the old-style FPU and the new SSE and AVX registers). Because our interrupt handler (which may do anything from running the scheduler to reading a page from disk on page fault) may need to use any of these registers, all of them, including the FPU, need to be saved on interrupt time. The interrupt has a separate stack, and the FPU is saved on this stack (see fpu_lock use in interrupt()). When the interrupt finishes, this FPU is restored. This includes involuntary context switching: thread A receives an interrupt, saves the FPU, does something and decides to switch to thread B, and a while later we switch back to thread A at which point the interrupt handler "returns" and restores the  FPU state.

I can't think of a case where the user thread (doing that printf loop) can get interrupted in the middle but we *dont'* save the FPU. So if we do, there remains the possibility that we manage to ruin it before saving (this was issue 4349bfd3583df03d44cda480049e630081d3a20b) or that something else entirely overwrites and corrupts the saved state.

An idea worth trying: We can verify that the latter (corruption of the saved state) is not the problem by adding a simple checksum to the saved state in fpu_lock, and before restoring it, check that the checksum is intact.
 

In any case as far as FPU state goes there is number of places (interrupts, page fault, syscall handler) where fpu_lock construct is used to isolate any changes to FPU registers made by OSv code executed by corresponding code for these interrupts, etc. That way OSv code should not corrupt any FPU state.

But what mechanism does exist in OSv to make sure that concurrent application threads that use floating point registers (like vfprintf used by httpserver thread or any other floating point arithmetic used by ffmpeg threads for example) do NOT corrupt each other FPU state when preempted? 

Here is what I found on internet (https://wiki.osdev.org/Context_Switching):

"The FPU/MMX and SSE state could be saved and reloaded, but the CPU can also be tricked into generating an exception the first time that an FPU/MMX or SSE instruction is used by copying the hardware context switch mechanism (setting the TS flag in CR0)."

OSv does not use the lazy FPU trick. This trick has rather gone out of style when use of the FPU became much more commonly used, especially with modern compilers generating SSE instructions for things that have nothing to do with floating point.

The way this trick worked was to say, on context switch, that the processor does *not* have an FPU (by setting the TS bit in the CR0 register), and not save (yet) the outgoing thread's FPU state. When, hopefully a long time later, another thread uses the first FPU instruction, it will get a #NM exception (no-math, i.e., no FPU available). The OS catches this exception, and only *then* saves the old FPU state, loads the current thread's saved FPU state, and restarts the instruction (and this time it will work).

But OSv does not do this, it will be rather pointless where a lot of code - including the kernel's scheduler - uses FPU instructions.


Does this suggest it should be possible to register an exception handler that would be called when first FPU operation is executed to mark to save FPU state when preempted and then restore when coming back to this thread? But I do not see where such code exists in OSv. Or have I missed it?

Waldek

On Tuesday, November 27, 2018 at 6:30:03 PM UTC-5, Waldek Kozaczuk wrote:
I was also reading about "lazy" vs "eager" FPU state save/restore in Linux and how due to some security reasons they are advocating users to switch to the eager. I think the eager means that on each context switch FPU state gets saved/restored regardless if FPU registered are used.

Is OSv using "eager" or "lazy" strategy? I am guessing probably the lazy one. 

Also I am not sure about cassandra (ant original older bug) but ffmpeg is very heavy on floating point arithmetic so maybe that exposes FPU bugs more easily. 

On Tue, Nov 27, 2018 at 6:07 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
I also checked if y == 0 in Gdb when I connected after the crash and indeed it was true. 

What about eflags which are set by FUCOMI? Are we saving those?

Sent from my iPhone

On Nov 27, 2018, at 17:53, Nadav Har'El <n...@scylladb.com> wrote:


On Tue, Nov 27, 2018 at 10:11 PM Nadav Har'El <n...@scylladb.com> wrote:
Indeed, seems like a loop that works on fpu registers and stack. The actual loop's test, while(y) is the "fucomi" instruction which compares two floating point values one of which being a zero created by "fldz". My completely unproven suspicion is that in the middle of this loop we get an interrupt (possibly also leading to a context switch, running another thread, and only much later returning to this thread), and for some reason the floating point state (which includes the register stack, etc.) is not saved correctly - or not restored correctly (perhaps restored from a corrupted array?). If after such corruption, "y" (in whatever register it sits) becomes, for example, NaN, the loop will never finish. I wonder if we can print these registers from gdb to see if perhaps gdb showing "y=0" isn't really correct.

Ok, so I started theorizing what might cause this...
If I remember correctly, OSv currently always saves the FPU state on some stack, using the fpu_lock type.
Could we possibly be using stacks which are too small to hold this FPU state?
In arch/x64/arch-cpu.hh we set a 4096 byte stack for nested exceptions, 4096 byte stack for interrupts, and 4096*4 byte stack for normal exceptions. Maybe one of these is too small? If you can easily reproduce this bug, can you add a zero to all of these and see if maybe the bug goes away with bigger stacks?

--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.

Waldek Kozaczuk

unread,
Nov 28, 2018, 7:18:26 AM11/28/18
to Nadav Har'El, osv...@googlegroups.com
Thanks for a very thorough explanation.  

Sent from my iPhone

On Nov 28, 2018, at 03:58, Nadav Har'El <n...@scylladb.com> wrote:

On Wed, Nov 28, 2018 at 5:54 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
I do not think I understand mechanics of how context switch in handled in OSv but fundamentally whenever current thread is preempted for whatever reason (timer expired, etc), OSv thread scheduler must somehow save all 16 registers, flags and everything else but FPU state somewhere and then upon selecting new thread for execution it must restore values of its 16 registers plus everything else so that the newly switched thread sees "the world" as it was when it was last running. I think more less it is done through TSS (task state segment) for each cpu and saving/restoring happens automatically (or not?).

A context switch (from one thread to another thread) can happen in two ways. The first is voluntary - the first thread calls some blocking function (sleep(), read(), etc.) and goes to sleep and replaced by another thread. The second is involuntary - we get an asynchronous event during the run of the thread - an interrupt, exception, or signal - and the scheduler decides to do a context switch at that point.

In the first case (voluntary switching) our life is easy. The user thread called a *function*. The ABI requires that when calling a function, the caller must save many of the registers (known as "caller-saved") himself because these can be used by the called function. It also requires that the caller cannot assume anything about the FPU state (because the callee is allowed to use the FPU too!). For this reason the thread context switch code (switching between user threads) only needs to save some of the registers, and does NOT need to save the FPU state.

The situation is different with involuntary context switches. When an asynchronous event, e.g., an interrupt, occurs, the user thread is in a random position in the code. It may be using all its registers, and the FPU state (including the old-style FPU and the new SSE and AVX registers). Because our interrupt handler (which may do anything from running the scheduler to reading a page from disk on page fault) may need to use any of these registers, all of them, including the FPU, need to be saved on interrupt time. The interrupt has a separate stack, and the FPU is saved on this stack (see fpu_lock use in interrupt()). When the interrupt finishes, this FPU is restored. This includes involuntary context switching: thread A receives an interrupt, saves the FPU, does something and decides to switch to thread B, and a while later we switch back to thread A at which point the interrupt handler "returns" and restores the  FPU state.
Does involuntary case include scenario when enough time designated for current thread by scheduler expires? I would imaging this would qualify as interrupt?


I can't think of a case where the user thread (doing that printf loop) can get interrupted in the middle but we *dont'* save the FPU. So if we do, there remains the possibility that we manage to ruin it before saving (this was issue 4349bfd3583df03d44cda480049e630081d3a20b) or that something else entirely overwrites and corrupts the saved state.

An idea worth trying: We can verify that the latter (corruption of the saved state) is not the problem by adding a simple checksum to the saved state in fpu_lock, and before restoring it, check that the checksum is intact.
I will give it a try. 

Nadav Har'El

unread,
Nov 28, 2018, 8:14:05 AM11/28/18
to Waldek Kozaczuk, osv...@googlegroups.com
On Wed, Nov 28, 2018 at 2:18 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
On Nov 28, 2018, at 03:58, Nadav Har'El <n...@scylladb.com> wrote:

The situation is different with involuntary context switches. When an asynchronous event, e.g., an interrupt, occurs, the user thread is in a random position in the code. It may be using all its registers, and the FPU state (including the old-style FPU and the new SSE and AVX registers). Because our interrupt handler (which may do anything from running the scheduler to reading a page from disk on page fault) may need to use any of these registers, all of them, including the FPU, need to be saved on interrupt time. The interrupt has a separate stack, and the FPU is saved on this stack (see fpu_lock use in interrupt()). When the interrupt finishes, this FPU is restored. This includes involuntary context switching: thread A receives an interrupt, saves the FPU, does something and decides to switch to thread B, and a while later we switch back to thread A at which point the interrupt handler "returns" and restores the  FPU state.
Does involuntary case include scenario when enough time designated for current thread by scheduler expires? I would imaging this would qualify as interrupt?

Indeed. We set a timer to when this thread's runtime quota will expire, and this timer generates an interrupt. Check out interrupt() - after saving the FPU state and acknowledging the interrupt (EOI), it calls the scheduler (sched::preempt()). This will decide which thread to run next - it may run the same thread again, or a different thread. When sched::preempt() returns - possibly a long time later, it means the scheduler decided to run *this* thread again. At that point, interrupt() returns and just before returning it restores the FPU state automatically. 

Waldek Kozaczuk

unread,
Nov 28, 2018, 2:20:28 PM11/28/18
to OSv Development
The title of this thread indicates that the crash happens sporadically and indeed it is the case with the scenario described in 1st and 2nd email of this thread. The symptom (stack trace with fmt_fp() function captured by gdb) looks identical to what was reported in https://github.com/cloudius-systems/osv/issues/1010 and this https://github.com/cloudius-systems/osv/issues/536. We cannot be certain about whether the root cause is the same. However it is worth mentioning that crash happens with same symptom almost every time if I run the scenario described in issue #1010 which is different (but similar) than scenario described in 1st and 2nd email of this thread. I am emphasizing this because we have a way to replicate one of these crashes (possibly same root cause) in very repeatable way in case we have ideas of things to try.

Regarding your excellent explanations how non-FPU and FPU state is saved/restored by OSv I wanted to run a scenario to see if I understand it well (please forgive my ignorance if I am missing something obvious). Imagine we have 2 user threads (uT1 and uT2) executing some logic that involves floating point operations in a loop (like the one in the disassembled code). Ffmpeg is a good example as it must be very CPU intensive and doing all kind of FP computations to transcode video. Let us say on given cpu this sequence happens in time:

->uT1 is running (executing long FP logic loop) 
--> timer triggered interrupt happens because allotted time slice expired [some FPU registers had some inflight data = FPU has state S1)
----> OSv saves FPU state and other registers using fpu_lock construct
-------> OSv identifies next thread to run which is uT2
---------> OSv restores FPU to state S1 (fpu_lock goes out of scope) and other registers as well and switches to uT2
--------------> UT2 is running (executing long FP logic loop) 
--> timer triggered interrupt happens again because allotted time slice expired [some FPU registers had some inflight data = FPU has state S2)
... same thing 
... same thing 
---------> OSv restores FPU to state S2 and other registers as well and switches back to uT1
--------------> UT1 is resuming where it left above and instead of FPU be in expected S1 state it sees FPU in state S2 -> crash????

What am I missing in my scenario?

Lastly I am adding the disassembled portion of fmt_fp() with applied patch like below that makes the repeatable crash (from issue #1010) go away or at least way less frequent. 

diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c
index aac790c0..1e116038 100644
--- a/libc/stdio/vfprintf.c
+++ b/libc/stdio/vfprintf.c
@@ -8,6 +8,7 @@
#include <inttypes.h>
#include <math.h>
#include <float.h>
+#include <assert.h>
 /* Some useful macros */
@@ -296,9 +297,14 @@ static int fmt_fp(FILE *f, long double y, int w, int p, int fl, int t)
       if (e2<0) a=r=z=big;
       else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
+        int steps = 0;
       do {
               *z = y;
               y = 1000000000*(y-*z++);
+                steps++;
+               if(steps > 2000) {
+                  assert(0);
+                }
       } while (y);
        while (e2>0) {

Your explanation was that because of added assert() function call we force compiler to generate code to save/restore any used FPU registers. But if we do not see the crash with this patch applied happeny any more, then the assert function does not get called and FPU not saved/restored, so how this helps make this crash go away?

Portion of disassembled code:
      if (e2<0) a=r=z=big;
        else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;

        int steps = 0;
        do {
                *z = y;
     501:       d9 bd 0e e3 ff ff       fnstcw -0x1cf2(%rbp)
        if (e2<0) a=r=z=big;
     507:       48 8d 85 40 e3 ff ff    lea    -0x1cc0(%rbp),%rax
     50e:       45 85 c9                test   %r9d,%r9d
     511:       48 8d 95 cc fe ff ff    lea    -0x134(%rbp),%rdx
     518:       48 89 c1                mov    %rax,%rcx
                *z = y;
     51b:       0f b7 85 0e e3 ff ff    movzwl -0x1cf2(%rbp),%eax
        if (e2<0) a=r=z=big;
     522:       48 0f 49 ca             cmovns %rdx,%rcx
                *z = y;
     526:       80 cc 0c                or     $0xc,%ah
        if (e2<0) a=r=z=big;
     529:       48 89 8d d8 e2 ff ff    mov    %rcx,-0x1d28(%rbp)
                y = 1000000000*(y-*z++);
     530:       48 8d 59 04             lea    0x4(%rcx),%rbx
     534:       48 8d 91 44 1f 00 00    lea    0x1f44(%rcx),%rdx
                *z = y;
     53b:       66 89 85 0c e3 ff ff    mov    %ax,-0x1cf4(%rbp)
     542:       d9 c0                   fld    %st(0)
     544:       d9 ad 0c e3 ff ff       fldcw  -0x1cf4(%rbp)
     54a:       df bd 00 e3 ff ff       fistpll -0x1d00(%rbp)
     550:       d9 ad 0e e3 ff ff       fldcw  -0x1cf2(%rbp)
     556:       48 8b 85 00 e3 ff ff    mov    -0x1d00(%rbp),%rax
     55d:       89 01                   mov    %eax,(%rcx)
                y = 1000000000*(y-*z++);
     55f:       89 c0                   mov    %eax,%eax
     561:       48 89 85 f8 e2 ff ff    mov    %rax,-0x1d08(%rbp)
     568:       df ad f8 e2 ff ff       fildll -0x1d08(%rbp)
     56e:       de e9                   fsubrp %st,%st(1)
     570:       d9 05 00 00 00 00       flds   0x0(%rip)        # 576 <fmt_fp+0x176>
     576:       dc c9                   fmul   %st,%st(1)
                steps++;
                if(steps > 2000) {
     578:       eb 48                   jmp    5c2 <fmt_fp+0x1c2>
     57a:       d9 c9                   fxch   %st(1)
     57c:       eb 04                   jmp    582 <fmt_fp+0x182>
     57e:       66 90                   xchg   %ax,%ax
     580:       d9 c9                   fxch   %st(1)
                *z = y;
     582:       d9 c0                   fld    %st(0)
     584:       d9 ad 0c e3 ff ff       fldcw  -0x1cf4(%rbp)
     58a:       df bd 00 e3 ff ff       fistpll -0x1d00(%rbp)
     590:       d9 ad 0e e3 ff ff       fldcw  -0x1cf2(%rbp)
                y = 1000000000*(y-*z++);
     596:       48 83 c3 04             add    $0x4,%rbx
                *z = y;
     59a:       48 8b 85 00 e3 ff ff    mov    -0x1d00(%rbp),%rax
     5a1:       89 43 fc                mov    %eax,-0x4(%rbx)
                y = 1000000000*(y-*z++);
     5a4:       89 c0                   mov    %eax,%eax
     5a6:       48 89 85 f8 e2 ff ff    mov    %rax,-0x1d08(%rbp)
     5ad:       df ad f8 e2 ff ff       fildll -0x1d08(%rbp)
     5b3:       de e9                   fsubrp %st,%st(1)
     5b5:       d8 c9                   fmul   %st(1),%st
                if(steps > 2000) {
     5b7:       48 39 d3                cmp    %rdx,%rbx
     5ba:       0f 84 99 10 00 00       je     1659 <fmt_fp+0x1259>
     5c0:       d9 c9                   fxch   %st(1)
                   assert(0);
                }
        } while (y);
     5c2:       d9 ee                   fldz
     5c4:       d9 ca                   fxch   %st(2)
     5c6:       db ea                   fucomi %st(2),%st
     5c8:       dd da                   fstp   %st(2)
     5ca:       7a ae                   jp     57a <fmt_fp+0x17a>
     5cc:       75 b2                   jne    580 <fmt_fp+0x180>
     5ce:       dd d8                   fstp   %st(0)
     5d0:       dd d8                   fstp   %st(0)

Nadav Har'El

unread,
Nov 28, 2018, 4:51:08 PM11/28/18
to Waldek Kozaczuk, osv...@googlegroups.com
On Wed, Nov 28, 2018 at 9:20 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
The title of this thread indicates that the crash happens sporadically and indeed it is the case with the scenario described in 1st and 2nd email of this thread. The symptom (stack trace with fmt_fp() function captured by gdb) looks identical to what was reported in https://github.com/cloudius-systems/osv/issues/1010 and this https://github.com/cloudius-systems/osv/issues/536. We cannot be certain about whether the root cause is the same. However it is worth mentioning that crash happens with same symptom almost every time if I run the scenario described in issue #1010 which is different (but similar) than scenario described in 1st and 2nd email of this thread. I am emphasizing this because we have a way to replicate one of these crashes (possibly same root cause) in very repeatable way in case we have ideas of things to try.

That is good. If we give up on this effort (at least temporarily), please summarize what you know about reproducing this issue in a new bug-tracker issue (and reference 1010 and 536) so we could return to it later.


Regarding your excellent explanations how non-FPU and FPU state is saved/restored by OSv I wanted to run a scenario to see if I understand it well (please forgive my ignorance if I am missing something obvious). Imagine we have 2 user threads (uT1 and uT2) executing some logic that involves floating point operations in a loop (like the one in the disassembled code). Ffmpeg is a good example as it must be very CPU intensive and doing all kind of FP computations to transcode video. Let us say on given cpu this sequence happens in time:

->uT1 is running (executing long FP logic loop) 
--> timer triggered interrupt happens because allotted time slice expired [some FPU registers had some inflight data = FPU has state S1)
----> OSv saves FPU state and other registers using fpu_lock construct

So far, right. This state is saved inside the interrupt stack, which while thread uT1 is running, is  uT1->_arch->interrupt_stack.
By the way, note that the regular (non-FPU) registers are also on this stack - the code in entry.S saves them on the stack and passes the pointer to it when it calls interrupt().

-------> OSv identifies next thread to run which is uT2
---------> OSv restores FPU to state S1 (fpu_lock goes out of scope) and other registers as well and switches to uT2

No, this is not what happens now (at least, not if you mean that "S1"  was uT1's state...). Before interrupt() ends or fpu_lock runs out of scope, interrupt calls sched::preempt().
This is a weird function. It doesn't do something quickly and return :-) Rather, it calls the scheduler to decide which thread to run next. If it decides to run uT1 again, it just returns normally (and the FPU state is restored and the interrupt returns).
But if it decides to run uT2 it doesn't quite return... Rather, it *switches* to thread uT2. This means we switch to uT2's stack, and jump to the program counter where uT2 was before. And where was uT2 running before? One of the likely options was that it was itself in interrupt() calling preempt() :-) So now, finally, when uT2 resumes, its call to preempt() "returns", and we reach the end of the interrupt() function. Remember that at this point we're running uT2, with uT2's stack. So when interrupt() ends, the fpu_lock goes out of scope, and the FPU state is popped from *uT2*'s stack. This is was uT2's FPU state, saved long ago, and now we're restoring it, and when interrupt() returns, uT2's correct FPU state will have been restored.
Yes, my thesis (and I'm too tired now to verify this in the assembler code below) was that because there is this function call (and the compiler doesn't know if it is commonly called or not), the compiler doesn't want to leave y in a register, so it will load and store it from memory. But this is just a guess.
You can try instead of assert(0), do

asm volatile ("cli; hlt" : : : "memory");

This is not a function call, so it will not change the compiled code. It will allow you to connect with a debugger to print more stuff at that point. Note that this cli; hlt stuff will only hang the single CPU that ran this command (if you can reproduce this with -c1, this will be the only cpu so things will be easier).

 
--

Waldek Kozaczuk

unread,
Nov 28, 2018, 5:03:31 PM11/28/18
to OSv Development


On Wednesday, November 28, 2018 at 4:51:08 PM UTC-5, Nadav Har'El wrote:
On Wed, Nov 28, 2018 at 9:20 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
The title of this thread indicates that the crash happens sporadically and indeed it is the case with the scenario described in 1st and 2nd email of this thread. The symptom (stack trace with fmt_fp() function captured by gdb) looks identical to what was reported in https://github.com/cloudius-systems/osv/issues/1010 and this https://github.com/cloudius-systems/osv/issues/536. We cannot be certain about whether the root cause is the same. However it is worth mentioning that crash happens with same symptom almost every time if I run the scenario described in issue #1010 which is different (but similar) than scenario described in 1st and 2nd email of this thread. I am emphasizing this because we have a way to replicate one of these crashes (possibly same root cause) in very repeatable way in case we have ideas of things to try.

That is good. If we give up on this effort (at least temporarily), please summarize what you know about reproducing this issue in a new bug-tracker issue (and reference 1010 and 536) so we could return to it later.


Regarding your excellent explanations how non-FPU and FPU state is saved/restored by OSv I wanted to run a scenario to see if I understand it well (please forgive my ignorance if I am missing something obvious). Imagine we have 2 user threads (uT1 and uT2) executing some logic that involves floating point operations in a loop (like the one in the disassembled code). Ffmpeg is a good example as it must be very CPU intensive and doing all kind of FP computations to transcode video. Let us say on given cpu this sequence happens in time:

->uT1 is running (executing long FP logic loop) 
--> timer triggered interrupt happens because allotted time slice expired [some FPU registers had some inflight data = FPU has state S1)
----> OSv saves FPU state and other registers using fpu_lock construct

So far, right. This state is saved inside the interrupt stack, which while thread uT1 is running, is  uT1->_arch->interrupt_stack.
By the way, note that the regular (non-FPU) registers are also on this stack - the code in entry.S saves them on the stack and passes the pointer to it when it calls interrupt().

-------> OSv identifies next thread to run which is uT2
---------> OSv restores FPU to state S1 (fpu_lock goes out of scope) and other registers as well and switches to uT2

No, this is not what happens now (at least, not if you mean that "S1"  was uT1's state...). Before interrupt() ends or fpu_lock runs out of scope, interrupt calls sched::preempt().
This is a weird function. It doesn't do something quickly and return :-) Rather, it calls the scheduler to decide which thread to run next. If it decides to run uT1 again, it just returns normally (and the FPU state is restored and the interrupt returns).
But if it decides to run uT2 it doesn't quite return... Rather, it *switches* to thread uT2. This means we switch to uT2's stack, and jump to the program counter where uT2 was before. And where was uT2 running before? One of the likely options was that it was itself in interrupt() calling preempt() :-) So now, finally, when uT2 resumes, its call to preempt() "returns", and we reach the end of the interrupt() function. Remember that at this point we're running uT2, with uT2's stack. So when interrupt() ends, the fpu_lock goes out of scope, and the FPU state is popped from *uT2*'s stack. This is was uT2's FPU state, saved long ago, and now we're restoring it, and when interrupt() returns, uT2's correct FPU state will have been restored.
Thanks. I get it now. The key thing is that FPU state is saved (pushed) to current and restored from (popped) new thread's stack. 

Waldek Kozaczuk

unread,
Nov 29, 2018, 12:28:53 PM11/29/18
to OSv Development
Nadav,

I have found this email thread around patches adding AVX support - https://groups.google.com/forum/#!msg/osv-dev/DIbzudXr0jo/glTrLgjFOssJ;context-place=forum/osv-dev.

There is a comment you left specifically about "per-thread fpu save area". Maybe that is something we need to pursue once we confirm that the fpu_state on interrupt stack somehow gets corrupted. I still need to code that checksum logic to verify that indeed it happens.

"When we leave a thread during an involuntary context switch (e.g., interrupt), we need to XSAVEOPT its state in the same place we last XRESTORed it from when entering this thread - so this can't be a random location on the stack chosen at interrupt time (as in inplace_fpu_state) because then it won't be the same address used last time and the optimization won't work. So we need to have a per-thread fpu save area, and save the state *there* on interrupt. Funny enough, we already had such a tread->_fpu area in the past, but removed it (commit 202b2cccc14a43d088002a80f7add51e3e4bc4ce)."

Waldek Kozaczuk

unread,
Nov 29, 2018, 2:19:48 PM11/29/18
to OSv Development
Implemented checksum logic (please let me know if I have made any stupid mistakes).

diff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.cc
index bb96af3d..34bf6081 100644
--- a/arch/x64/arch-cpu.cc
+++ b/arch/x64/arch-cpu.cc
@@ -42,6 +42,14 @@ exception_guard::~exception_guard()
     sched::cpu::current()->arch.exit_exception();
 }
 
+static inline u64 calculate_fpu_state_checksum(processor::fpu_state *s) {
+    u64 val = 0;
+    char* data = s->legacy;
+    for (u64 i = 0; i < sizeof(processor::fpu_state) - sizeof(u64); i++)
+       val += data[i];
+    return val;
+}
+
 extern "C"
 [[gnu::target("no-sse")]]
 void fpu_state_init_xsave(processor::fpu_state *s) {
@@ -61,21 +69,25 @@ void fpu_state_init_fxsave(processor::fpu_state *s) {
 extern "C"
 void fpu_state_save_xsave(processor::fpu_state *s) {
     processor::xsave(s, -1);
+    s->checksum = calculate_fpu_state_checksum(s);
 }
 
 extern "C"
 void fpu_state_save_fxsave(processor::fpu_state *s) {
     processor::fxsave(s);
+    s->checksum = calculate_fpu_state_checksum(s);
 }
 
 extern "C"
 void fpu_state_restore_xsave(processor::fpu_state *s) {
     processor::xrstor(s, -1);
+    assert(s->checksum == calculate_fpu_state_checksum(s));
 }
 
 extern "C"
 void fpu_state_restore_fxsave(processor::fpu_state *s) {
     processor::fxrstor(s);
+    assert(s->checksum == calculate_fpu_state_checksum(s));
 }
 
 extern "C"
diff --git a/arch/x64/processor.hh b/arch/x64/processor.hh
index 6b1fbf66..a26fb028 100644
--- a/arch/x64/processor.hh
+++ b/arch/x64/processor.hh
@@ -343,6 +343,7 @@ struct fpu_state {
     char xsavehdr[24];
     char reserved[40];
     char ymm[256];
+    u64 checksum;
 } __attribute__((packed));
 
 inline void fxsave(fpu_state* s)

:-( Unfortunately no fpu_state corruption got detected. And I am still getting same faults in same place.

Waldek Kozaczuk

unread,
Nov 29, 2018, 2:26:10 PM11/29/18
to OSv Development


On Wednesday, November 28, 2018 at 4:51:08 PM UTC-5, Nadav Har'El wrote:
On Wed, Nov 28, 2018 at 9:20 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
The title of this thread indicates that the crash happens sporadically and indeed it is the case with the scenario described in 1st and 2nd email of this thread. The symptom (stack trace with fmt_fp() function captured by gdb) looks identical to what was reported in https://github.com/cloudius-systems/osv/issues/1010 and this https://github.com/cloudius-systems/osv/issues/536. We cannot be certain about whether the root cause is the same. However it is worth mentioning that crash happens with same symptom almost every time if I run the scenario described in issue #1010 which is different (but similar) than scenario described in 1st and 2nd email of this thread. I am emphasizing this because we have a way to replicate one of these crashes (possibly same root cause) in very repeatable way in case we have ideas of things to try.

That is good. If we give up on this effort (at least temporarily), please summarize what you know about reproducing this issue in a new bug-tracker issue (and reference 1010 and 536) so we could return to it later.

Waldek Kozaczuk

unread,
Nov 29, 2018, 2:33:26 PM11/29/18
to OSv Development
I wonder if per this - https://software.intel.com/en-us/node/523375 - we are enabling correct bits for xsave if caste is what being used here.

“The second operand is a save/restore mask specifying the saved/restored extended states. The value of the mask is ANDed with XFEATURE_ENABLED_MASK(XCR0). A particular extended state is saved/restored only if the corresponding bit of both save/restore mask and XFEATURE_ENABLED_MASK is set to '1'.”

Waldek Kozaczuk

unread,
Nov 29, 2018, 6:19:45 PM11/29/18
to OSv Development
Unless I made a mistake in the code that verifies fpu state corruption which showed no corruption I think it might be worth to ponder possibility of different nature of the root cause of this and other issues where fmt_fp() shows up in the stack trace.

What if the problem is actually caused by wrong/incompatible compiler options that drive what machine code is generated to handle floating point operation. For example I noticed that when I configured musl independently on my machine I saw it add following options related to floating point math:

CFLAGS_C99FSE = -std=c99 -nostdinc -ffreestanding -fexcess-precision=standard -frounding-math -Wa,--noexecstack

I have not enough experience to make any conclusive observations but two of those -fexcess-precision and -frounding-math do affet code using float, double, long double etc. I did not see those in main OSv makefile so I wonder if musl compiled code is wrong and misbehaves.

I think FPU has control word with RC fields that control rounding (http://www.website.masmforum.com/tutorials/fptute/fpuchap1.htm) what if our code setups FPU rounding that is incompatible with how we compile musl. Just wild theory. 

Adding musl math library documentation - https://wiki.musl-libc.org/mathematical-library.html- there are paragraphs that talk about rounding and precision.

Waldek Kozaczuk

unread,
Nov 30, 2018, 4:34:21 PM11/30/18
to OSv Development
I think I have found the culprit! 

Related to what I am pointing out here - https://groups.google.com/forum/#!topic/osv-dev/-7vIPwOJsyg - I thought about enabling some FPU exceptions which are disabled by default. Once I changed 0x037f to 0x037e which enables invalid operand exception and ran my test and got this exception:

[registers] fps=0.0 q=-0.0 size=N/A time=00:00:14.00 bitrate=N/A dup=0 drop=273 speed=  28x    
RIP: 0x000000000044fb28 <???+4520744>
RFL: 0x0000000000010206  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x0000000000000040  RBX: 0x00001000000faa22  RCX: 0x00002000001fd960  RDX: 0x00002000001fd618
RSI: 0x000000000000000a  RDI: 0x00002000001fd580  RBP: 0x00002000001fd5f0  R8:  0x0000000000000000
R9:  0x0000000000000000  R10: 0x0000000000000000  R11: 0x00000000ffffffff  R12: 0x0000000000000000
R13: 0x0000000000000000  R14: 0x0000000000000000  R15: 0x00000000ffffffff  RSP: 0x00002000001fd508
DUMMY_HANDLER for math_fault aborting.

[backtrace]
0x00000000003a3ad6 <math_fault+22>
0x00000000003a11dc <???+3805660>
0x000000000045218b <vfprintf+171>
0x0000000000454138 <vsnprintf+216>
0x00001000000bfdbe <std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > __gnu_cxx::__to_xstring<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char>(int (*)(char*, unsigned long, char const*, __va_list_tag*), unsigned long, char const*, ...)+174>
0x00001000000f1447 <httpserver::json::formatter::to_json[abi:cxx11](float)+103>
0x00002000001fdf3f <???+2088767>
0x00002000001fdf27 <???+2088743>
0x0000100003c5a49f <???+63284383>
0x4810c0834808578a <???+1208506250>

Once I put breakpoint in dump_registers() and connected with gdd to get stacktrace I saw this:
Breakpoint 1, dump_registers (ef=0xffff800002a79068) at arch/x64/dump.cc:17
17          osv::lookup_name_demangled(
(gdb) bt
#0  dump_registers (ef=0xffff800002a79068) at arch/x64/dump.cc:17
#1  0x00000000003a3ac9 in math_fault (ef=<optimized out>) at arch/x64/exceptions.cc:338
#2  <signal handler called>
#3  0x000000000044fb28 in pop_arg (arg=arg@entry=0x2000001fd580, type=<optimized out>, ap=ap@entry=0x2000001fd618)
   at libc/stdio/vfprintf.c:165
#4  0x000000000045161c in pop_arg (ap=0x2000001fd618, type=<optimized out>, arg=0x2000001fd580) at libc/stdio/vfprintf.c:143
#5  printf_core (f=f@entry=0x2000001fd7a0, fmt=fmt@entry=0x1000000faa20 "%f", ap=ap@entry=0x2000001fd618,
    nl_arg=nl_arg@entry=0x2000001fd6b0, nl_type=nl_type@entry=0x2000001fd630) at libc/stdio/vfprintf.c:538
#6  0x000000000045218c in vfprintf (f=f@entry=0x2000001fd7a0, fmt=0x1000000faa20 "%f", ap=<optimized out>) at libc/stdio/vfprintf.c:675
#7  0x0000000000454139 in vsnprintf (s=0x2000001fd8c0 "\256", n=<optimized out>, fmt=<optimized out>, ap=<optimized out>)
   at libc/stdio/vsnprintf.c:33
#8  0x00001000000bfdbf in ?? ()
#9  0x00000000000000ae in ?? ()
#10 0x0000000000550b48 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned
#11 0x0000000000000000 in ?? ()

Here is the code around line 165 in vfprintf.c
140 static void pop_arg(union arg *arg, int type, va_list *ap)
141 {
142         /* Give the compiler a hint for optimizing the switch. */
143         if ((unsigned)type > MAXSTATE) return;
144         switch (type) {
145                case PTR:        arg->p = va_arg(*ap, void *);
146         break; case INT:        arg->i = va_arg(*ap, int);
147         break; case UINT:       arg->i = va_arg(*ap, unsigned int);
148 #ifndef LONG_IS_INT
149         break; case LONG:       arg->i = va_arg(*ap, long);
150         break; case ULONG:      arg->i = va_arg(*ap, unsigned long);
151 #endif
152         break; case ULLONG:     arg->i = va_arg(*ap, unsigned long long);
153         break; case SHORT:      arg->i = (short)va_arg(*ap, int);
154         break; case USHORT:     arg->i = (unsigned short)va_arg(*ap, int);
155         break; case CHAR:       arg->i = (signed char)va_arg(*ap, int);
156         break; case UCHAR:      arg->i = (unsigned char)va_arg(*ap, int);
157 #ifdef ODD_TYPES
158         break; case LLONG:      arg->i = va_arg(*ap, long long);
159         break; case SIZET:      arg->i = va_arg(*ap, size_t);
160         break; case IMAX:       arg->i = va_arg(*ap, intmax_t);
161         break; case UMAX:       arg->i = va_arg(*ap, uintmax_t);
162         break; case PDIFF:      arg->i = va_arg(*ap, ptrdiff_t);
163         break; case UIPTR:      arg->i = (uintptr_t)va_arg(*ap, void *);
164 #endif
165         break; case DBL:        arg->f = va_arg(*ap, double);
166         break; case LDBL:       arg->f = va_arg(*ap, long double);
167         }
168 }

What could have gone wrong in the line 165 that triggered invalid operand exception?

My theory is this:

1. Httpserver tries to print thread list using this code:
118     os_threads.set_handler([](const_req req) {
119         using namespace std::chrono;
120         httpserver::json::Threads threads;
121         threads.time_ms = duration_cast<milliseconds>
122             (osv::clock::wall::now().time_since_epoch()).count();
123         httpserver::json::Thread thread;
124         sched::with_all_threads([&](sched::thread &t) {
125             thread.id = t.id();
126             thread.status = t.get_status();
127             auto tcpu = t.tcpu();
128             thread.cpu = tcpu ? tcpu->id : -1;
129             thread.cpu_ms = duration_cast<milliseconds>(t.thread_clock()).count();
130             thread.switches = t.stat_switches.get();
131             thread.migrations = t.stat_migrations.get();
132             thread.preemptions = t.stat_preemptions.get();
133             thread.name = t.name();
134             thread.priority = t.priority();
135             thread.stack_size = t.get_stack_info().size;
136             thread.status = t.get_status();
137             threads.list.push(thread);
138         });
139         return threads;
140     });
This code uses sched::with_all_threads() to iterate over the list of threads which uses mutex guarding against adding new threads and destroying existing one. However I think this does not guard any of the values grabbed to populate httpserver::json::Thread from being concurrently modified by OSv scheduler (am I wrong?). The priority field is float which when copied in not-thread-safe way ends up being junk which does not represent any valid float value (invalid operand).

2. The code downstream to serialize JSON calls vfprintf which at line 165 tries to extract the single element from va_list. This probably ends up using FPU which rejects corrupted priority float value. Because normally invalid operand exception is disabled vfprintf happily continues and gets to that famous loop which probably keeps raising same exception which gets ignored until we reach the beginning of the unmapped page and boom !!!

How do you like my theory?

I think this theory has its shortcomings - I think it explains open issues #536 and #1010 (and possibly to soon closed #490) but I think it fails to explain the newest one I created - https://github.com/cloudius-systems/osv/issues/1018 (there is no httpserver running). Maybe the latter one is due to some issues with missing initialization of FPU control word (but I doubt it).

Waldek

Nadav Har'El

unread,
Nov 30, 2018, 7:33:54 PM11/30/18
to Waldek Kozaczuk, osv...@googlegroups.com
On Fri, Nov 30, 2018 at 11:34 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
I think I have found the culprit! 

If you remember, I suggested in the past that if the number being printed had been a NaN, the loop in question might have gone on forever. However, IIRC the code checks for NaN and infinity before entering the loop. This is why I suggested maybe an FPU corruption might have changed some good number into a NaN.

Now you're suggesting, maybe the original number passed to printf() is somehow invalid, and the printf() code doesn't recognize it as special and uses it in the loop causing the loop to never end. You should be able to *directly* test this hypotesis - pick an "invalid" number, pass it to printf(), and see if runs into the same loop (and crash) we are chasing. But what "invalid" number to pick? I don't know, I'm not an expert enough in the floating point standard. I'd start with an sNaN (signalling NaN), and if that doesn't help, take the same floating point number which crashed the http example.


Related to what I am pointing out here - https://groups.google.com/forum/#!topic/osv-dev/-7vIPwOJsyg - I thought about enabling some FPU exceptions which are disabled by default. Once I changed 0x037f to 0x037e which enables invalid operand exception and ran my test and got this exception:

[registers] fps=0.0 q=-0.0 size=N/A time=00:00:14.00 bitrate=N/A dup=0 drop=273 speed=  28x    
RIP: 0x000000000044fb28 <???+4520744>
RFL: 0x0000000000010206  CS:  0x0000000000000008  SS:  0x0000000000000010
RAX: 0x0000000000000040  RBX: 0x00001000000faa22  RCX: 0x00002000001fd960  RDX: 0x00002000001fd618
RSI: 0x000000000000000a  RDI: 0x00002000001fd580  RBP: 0x00002000001fd5f0  R8:  0x0000000000000000
R9:  0x0000000000000000  R10: 0x0000000000000000  R11: 0x00000000ffffffff  R12: 0x0000000000000000
R13: 0x0000000000000000  R14: 0x0000000000000000  R15: 0x00000000ffffffff  RSP: 0x00002000001fd508
DUMMY_HANDLER for math_fault aborting.

This analysis is interesting - it shows we are trying to print invalid floating-point numbers, which is definitely a bug. However, I have no idea if without enabling the invalid-operand exception the result would have been an endless loop in printf(). If the answer is no, then you found another bug, but not the same bug we are chasing :-(

If you can find a way to print the floating-point number which caused this crash (its representation as an integer, of course, not with %f :-)), you can try calling printf() with the same number and seeing whether it goes into an endless loop. If it does, then we really have our culprit, or even two: First, printf() shouldn't crash on an invalid number. Second, why we get an invalid number in the first place.
It is true that there is no special protection for the priority field, but there are two reasons why I doubt that this is the problem (although I can't be sure...):
1. priority is a 32-bit "float" (not a 64-bit double). An aligned 32-bit number should be atomic, I believe.
2. In any case, we don't commonly change the priority field of threads after they are created. I'm not sure I have even one example of this. It's possible, but I doubt we actually do it?


2. The code downstream to serialize JSON calls vfprintf which at line 165 tries to extract the single element from va_list. This probably ends up using FPU which rejects corrupted priority float value. Because normally invalid operand exception is disabled vfprintf happily continues and gets to that famous loop which probably keeps raising same exception which gets ignored until we reach the beginning of the unmapped page and boom !!!

How do you like my theory?

I like it, but it has some big holes that can be tested (see above).


I think this theory has its shortcomings - I think it explains open issues #536 and #1010 (and possibly to soon closed #490) but I think it fails to explain the newest one I created - https://github.com/cloudius-systems/osv/issues/1018 (there is no httpserver running). Maybe the latter one is due to some issues with missing initialization of FPU control word (but I doubt it).

Waldek

On Thursday, November 29, 2018 at 6:19:45 PM UTC-5, Waldek Kozaczuk wrote:
Unless I made a mistake in the code that verifies fpu state corruption which showed no corruption I think it might be worth to ponder possibility of different nature of the root cause of this and other issues where fmt_fp() shows up in the stack trace.

What if the problem is actually caused by wrong/incompatible compiler options that drive what machine code is generated to handle floating point operation. For example I noticed that when I configured musl independently on my machine I saw it add following options related to floating point math:

CFLAGS_C99FSE = -std=c99 -nostdinc -ffreestanding -fexcess-precision=standard -frounding-math -Wa,--noexecstack

I have not enough experience to make any conclusive observations but two of those -fexcess-precision and -frounding-math do affet code using float, double, long double etc. I did not see those in main OSv makefile so I wonder if musl compiled code is wrong and misbehaves.

I think FPU has control word with RC fields that control rounding (http://www.website.masmforum.com/tutorials/fptute/fpuchap1.htm) what if our code setups FPU rounding that is incompatible with how we compile musl. Just wild theory. 

Adding musl math library documentation - https://wiki.musl-libc.org/mathematical-library.html- there are paragraphs that talk about rounding and precision.

On Thursday, November 29, 2018 at 2:33:26 PM UTC-5, Waldek Kozaczuk wrote:
I wonder if per this - https://software.intel.com/en-us/node/523375 - we are enabling correct bits for xsave if caste is what being used here.

“The second operand is a save/restore mask specifying the saved/restored extended states. The value of the mask is ANDed with XFEATURE_ENABLED_MASK(XCR0). A particular extended state is saved/restored only if the corresponding bit of both save/restore mask and XFEATURE_ENABLED_MASK is set to '1'.”

--

Nadav Har'El

unread,
Nov 30, 2018, 7:38:38 PM11/30/18
to Waldek Kozaczuk, osv...@googlegroups.com
On Fri, Nov 30, 2018 at 1:19 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Unless I made a mistake in the code that verifies fpu state corruption which showed no corruption I think it might be worth to ponder possibility of different nature of the root cause of this and other issues where fmt_fp() shows up in the stack trace.

What if the problem is actually caused by wrong/incompatible compiler options that drive what machine code is generated to handle floating point operation. For example I noticed that when I configured musl independently on my machine I saw it add following options related to floating point math:

CFLAGS_C99FSE = -std=c99 -nostdinc -ffreestanding -fexcess-precision=standard -frounding-math -Wa,--noexecstack

I have not enough experience to make any conclusive observations but two of those -fexcess-precision and -frounding-math do affet code using float, double, long double etc. I did not see those in main OSv makefile so I wonder if musl compiled code is wrong and misbehaves.

The gcc manual says that -fexcess-precision=standard is the default if -std=c99 is used. We use -std=gnu99, so I wonder if it also enabled by default. In any case, I doubt (but as usual, not sure), that rounding, in whatever manner, can cause the endless loop we're seeing.


I think FPU has control word with RC fields that control rounding (http://www.website.masmforum.com/tutorials/fptute/fpuchap1.htm) what if our code setups FPU rounding that is incompatible with how we compile musl. Just wild theory. 

Adding musl math library documentation - https://wiki.musl-libc.org/mathematical-library.html- there are paragraphs that talk about rounding and precision.

I never gave this issue any thought, so if you do, and come up with any conclusions, I'll be happy to know. We can easily change the Makefile to compile the Musl code differently, if needed, or set the default FPU control word differently.

Nadav Har'El

unread,
Nov 30, 2018, 7:43:31 PM11/30/18
to Waldek Kozaczuk, osv...@googlegroups.com
On Thu, Nov 29, 2018 at 9:33 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
I wonder if per this - https://software.intel.com/en-us/node/523375 - we are enabling correct bits for xsave if caste is what being used here.

“The second operand is a save/restore mask specifying the saved/restored extended states. The value of the mask is ANDed with XFEATURE_ENABLED_MASK(XCR0). A particular extended state is saved/restored only if the corresponding bit of both save/restore mask and XFEATURE_ENABLED_MASK is set to '1'.”

Interesting question. In arch/x64/arch-cpu.hh, arch_cpu::init_on_cpu(), you can see:

    if (features().xsave) {
        auto bits = xcr0_x87 | xcr0_sse;
        if (features().avx) {
            bits |= xcr0_avx;
        }
        write_xcr(xcr0, bits);
    }

So we're turning on the "x87", "sse", and "avx" bits. Are there any other bits we need to set (or at least, not zero with this overwrite-all-bits code)?

Waldek Kozaczuk

unread,
Dec 1, 2018, 12:46:48 AM12/1/18
to OSv Development
My theory has fallen apart: I modified httpserver like this to pass constant float value of 0.5 and still getting same crash:

index 358a86c6..126ab5f4 100644
--- a/modules/httpserver-api/api/os.cc
+++ b/modules/httpserver-api/api/os.cc
@@ -131,7 +131,7 @@ void init(routes& routes)
             thread.migrations = t.stat_migrations.get();
             thread.preemptions = t.stat_preemptions.get();
             thread.name = t.name();
-            thread.priority = t.priority();
+            thread.priority = 1.0;//t.priority();
             thread.stack_size = t.get_stack_info().size;
             thread.status = t.get_status();
             threads.list.push(thread);

 I even created separate test program that keeps sprintf'ing both NAN and regular float numbers using 5 threads to replace ffmpeg and I could not replicate same program running exact same scenario with httpserver.

As I wrote in other thread I added checksum logic to detect FPU state corruption that unfortunately does not detect anything. But is this logic actually correct?

Ediff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.cc
index 6b1fbf66..f699d5b8 100644
--- a/arch/x64/processor.hh
+++ b/arch/x64/processor.hh
@@ -343,6 +343,7 @@ struct fpu_state {
     char xsavehdr[24];
     char reserved[40];
     char ymm[256];
+    u64 checksum;
 } __attribute__((packed));
 
 inline void fxsave(fpu_state* s)

Waldek Kozaczuk

unread,
Dec 1, 2018, 11:26:37 AM12/1/18
to OSv Development
Here we go again ... I was thinking more about it. 

First of regarding potential corruption of FPU state during involuntary interrupt task switch, unless my code that verifies it is not the case using the checksum logic (see my previous email with the patch), there could be another possibility - we are telling xsave/xstor to save/restore to/from wrong location in memory. Meaning that whatever got saved stays not corrupted but somehow we are restoring wrong data for whatever reason. Probably very unlikely. 

Secondly I think that part of our theory, that this famous loop in vfprintf loops infinitely until tries to access unmapped page of memory, is very likely correct. But it might happening for different reason. So far we thought that simply the number is NaN or corrupt that makes FPU Instruction calculate NAN forever. But what if for whatever reason we are experiencing FPU stack overflow - all registers are taken and any instruction trying to load data into it keeps failing because it is full. I believe this exception is disabled but I can enable it to see if I am correct and it happens somewhere in the code. Now what if after voluntary or involuntary thread switch the FPU stack is full and the thread switched to simply does not know it. Whose responsibility is in this case to make sure that FPU is in clean state (empty)?

Maybe the involuntary switch has/should have two sub-cases:
1) the thread we are switching to before was also involuntarily switched and we need to restore its FPU state from interrupt stack which we do 
2) the thread we are switching to voluntarily yielded and ... I think FPU should be in somewhat clean state, at least be empty unless something left it full before, right?

My thinking might be all wrong. But just food for thought. 

Waldek

Sent from my iPhone
--
You received this message because you are subscribed to a topic in the Google Groups "OSv Development" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/osv-dev/ZAjvi_vEhnA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to osv-dev+u...@googlegroups.com.

Waldek Kozaczuk

unread,
Dec 1, 2018, 11:31:43 AM12/1/18
to OSv Development
Relatedly I wonder how much this is related to that old patch - https://github.com/cloudius-systems/osv/commit/0959f20863b119c5ca7d1e663bc22e6329814386. I do quite understand why we ALWAYS save FPU status of old thread to use same value to restore for new thread. What we both threads switched voluntarily? Is the outcome correct?

Waldek Kozaczuk

unread,
Dec 1, 2018, 12:26:25 PM12/1/18
to OSv Development
One more thing - per https://wiki.osdev.org/FPU there are also bits in cr0 like the ts one that enables (?) lazy FPU state switching. We do not seem to be setting cr0 register anywhere.

Nadav Har'El

unread,
Dec 2, 2018, 5:36:12 AM12/2/18
to Waldek Kozaczuk, osv...@googlegroups.com
On Thu, Nov 29, 2018 at 9:19 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Implemented checksum logic (please let me know if I have made any stupid mistakes).

diff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.cc
index bb96af3d..34bf6081 100644
--- a/arch/x64/arch-cpu.cc
+++ b/arch/x64/arch-cpu.cc
@@ -42,6 +42,14 @@ exception_guard::~exception_guard()
     sched::cpu::current()->arch.exit_exception();
 }
 
+static inline u64 calculate_fpu_state_checksum(processor::fpu_state *s) {
+    u64 val = 0;
+    char* data = s->legacy;
+    for (u64 i = 0; i < sizeof(processor::fpu_state) - sizeof(u64); i++)

What is this "- sizeof(u64)"?

+       val += data[i];

Addition isn't the greatest hash function, but it is probably good enough to detect random corruption.
 
+    return val;
+}
+
 extern "C"
 [[gnu::target("no-sse")]]
 void fpu_state_init_xsave(processor::fpu_state *s) {
@@ -61,21 +69,25 @@ void fpu_state_init_fxsave(processor::fpu_state *s) {
 extern "C"
 void fpu_state_save_xsave(processor::fpu_state *s) {
     processor::xsave(s, -1);
+    s->checksum = calculate_fpu_state_checksum(s);

I hope you verified that indeed xsave or fxsave was used in your test setup, and not the old x87 FPU saving code?

Nadav Har'El

unread,
Dec 2, 2018, 7:08:12 AM12/2/18
to Waldek Kozaczuk, osv...@googlegroups.com
On Sat, Dec 1, 2018 at 6:26 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Here we go again ... I was thinking more about it. 

First of regarding potential corruption of FPU state during involuntary interrupt task switch, unless my code that verifies it is not the case using the checksum logic (see my previous email with the patch), there could be another possibility - we are telling xsave/xstor to save/restore to/from wrong location in memory. Meaning that whatever got saved stays not corrupted but somehow we are restoring wrong data for whatever reason. Probably very unlikely. 

Yes, I think it is unlikely - since you are checking the checksum, it means we are restoring valid saved state, but someone else's saved state. I don't have an idea how this can happen.

If we rule out corruption of the saved FPU state, I can think of three remaining options:

1. This is not a saved-FPU-state corruption - printf() really goes into an endless loop when it gets a certain input. I'm still hoping you can get more detailed printouts on what exactly what printf()'s *input* when it went into this endless loop. One thing you can do is to have a variable (perhaps thread-local global if you don't want it optimized out?) holding that printf_fp's last input, and when you get a crash, print this global. Or something.

2. This is a situation of some asynchronous even where we ran a handler - especially the scheduler which uses the FPU - but we *forgot* to save the FPU state. I don't know what this can be. This can perhaps be debugged with tracepoints - enable various tracepoints, especially the scheduler, and fpu saves/restores, and after the crash look at the tracepoint log. Can we find unexpected scheduler calls without an fpu save?

3. This is a situation where we saved the FPU, but *after* doing some FPU work. We had and fix this kind of issue in 4349bfd3583df03d44cda480049e630081d3a20b but it's more surprising to see this kind of with x87 registers, not SSE, I can't think of a reason why the compiler will touch these registers without our intent.


Secondly I think that part of our theory, that this famous loop in vfprintf loops infinitely until tries to access unmapped page of memory, is very likely correct. But it might happening for different reason. So far we thought that simply the number is NaN or corrupt that makes FPU Instruction calculate NAN forever. But what if for whatever reason we are experiencing FPU stack overflow - all registers are taken and any instruction trying to load data into it keeps failing because it is full.

This is an interesting point. The eight x87 registers are freely addressable (st0 through st7), but can *also* be used as a stack. The current position is stored as three bits in the floating-point status word (FSW), and the fxsave (or newer xsave...) saves this as well on involuntary context switches.

The calling convention when calling a function (i.e., voluntary context switch) is that this stack is popped. The same is true on return (except st0 can be used for returning a floating-point value, if this is a function returning a floating point value). I don't know what will happen if a function violates this convention.
 
I believe this exception is disabled but I can enable it to see if I am correct and it happens somewhere in the code. Now what if after voluntary or involuntary thread switch the FPU stack is full and the thread switched to simply does not know it. Whose responsibility is in this case to make sure that FPU is in clean state (empty)?

When we switch *back* to an old thread, we should switch to its old FPU stack, including its old position in the x87 stack, so if it's full, it's this thread's fault.
But on a *new* thread, we do need to make sure it got a clean slate. We had a bug in this long ago, which 0e8d9b5e0a1af02a31799fe1d40e434bc2c2a5fb fixed - thread_main_c() now calls processor::init_fpu(). This calls the FNINIT instruction, which among other things, resets the stack postion (in FSW) to 0.

But now that I think about this, in case of an involuntary switch to the kernel (e.g., interrupt) what FPU state does the kernel code get? What if the kernel gets a full fpu stack and then can't do its own FPU calculations? Should fpu_lock() run fpu_init() after saving the FPU state?
In any case, I don't think this has anything to do with the problem at hand. I'll open a separate issue.


Maybe the involuntary switch has/should have two sub-cases:
1) the thread we are switching to before was also involuntarily switched and we need to restore its FPU state from interrupt stack which we do 
2) the thread we are switching to voluntarily yielded and ... I think FPU should be in somewhat clean state, at least be empty unless something left it full before, right?

I think you may have indeed found a problem (I'm not sure it's related to the bug we're chasing, but a problem nontheless, and *may* be related):

We already have these two cases, implicitly: As I explained in a previous mail, when we context switch *back* to a thread, we get back to where it was when it gave up the CPU:
1. If this was an involuntary context switch, we find ourselves returning (very belatedly) from the scheduler to the end of the interrupt() function, and now this function ends and pops the saved fpu (from the thread's interrupt stack), and finally returns from interrupt() to the original thread.
2. If this was a voluntary context switch, the scheduler returns into whatever function which called it (sleep(), read(), etc.) and that returns to. The floating point state was never saved (it doesn't need to be, because a function was called) but, on return, it was never restored. It just gets a copy of whatever junk was there from the previous thread or OSv code that ran. The junk st0..st7 registers are fine - the caller cannot count on anything in there too. But the control registers might now have junk as well, and that OSv does have to restore, and that it does since commit 0959f20863b119c5ca7d1e663bc22e6329814386. However, we will never restore the floating point *status* word (FSW). So if one thread was interrupted with a full fp stack, another thread we switch to might be returned-to with a full stack.

You can try adding an FNINIT in arch-switch.hh *before* processor::fldcw(fpucw) and see if it fixes anything.

By the way, to be honest, all this discussion is about x87. I don't know what is the modern SSE equivalent :-(

Finally, it may indeed make sense to enable fpu stack overflow exceptions (and their SSE equivalent??) to see if we have any of these problems.




You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.

Nadav Har'El

unread,
Dec 2, 2018, 9:01:04 AM12/2/18
to Waldek Kozaczuk, osv...@googlegroups.com
On Sun, Dec 2, 2018 at 2:07 PM Nadav Har'El <n...@scylladb.com> wrote:

2. If this was a voluntary context switch, the scheduler returns into whatever function which called it (sleep(), read(), etc.) and that returns to. The floating point state was never saved (it doesn't need to be, because a function was called) but, on return, it was never restored. It just gets a copy of whatever junk was there from the previous thread or OSv code that ran. The junk st0..st7 registers are fine - the caller cannot count on anything in there too. But the control registers might now have junk as well, and that OSv does have to restore, and that it does since commit 0959f20863b119c5ca7d1e663bc22e6329814386. However, we will never restore the floating point *status* word (FSW). So if one thread was interrupted with a full fp stack, another thread we switch to might be returned-to with a full stack.

You can try adding an FNINIT in arch-switch.hh *before* processor::fldcw(fpucw) and see if it fixes anything.


I wouldn't want to fix this without first proving it is needed. Either (hopefully!) it (i.e., adding FNINIT (or maybe FINIT?) and EMMS in the switch_to() code) will help in the bug you can reproduce, or creating a new reproducer - e.g., two user-space threads, one filling the FPU stack with garbage and the other trying a loop of normal FPU work and seeing if something fails.

Nadav.

Waldek Kozaczuk

unread,
Dec 2, 2018, 3:41:56 PM12/2/18
to Nadav Har'El, OSv Development
On Sun, Dec 2, 2018 at 5:36 AM Nadav Har'El <n...@scylladb.com> wrote:
On Thu, Nov 29, 2018 at 9:19 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Implemented checksum logic (please let me know if I have made any stupid mistakes).

diff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.cc
index bb96af3d..34bf6081 100644
--- a/arch/x64/arch-cpu.cc
+++ b/arch/x64/arch-cpu.cc
@@ -42,6 +42,14 @@ exception_guard::~exception_guard()
     sched::cpu::current()->arch.exit_exception();
 }
 
+static inline u64 calculate_fpu_state_checksum(processor::fpu_state *s) {
+    u64 val = 0;
+    char* data = s->legacy;
+    for (u64 i = 0; i < sizeof(processor::fpu_state) - sizeof(u64); i++)

What is this "- sizeof(u64)"?
To make sure we are not summing to "hash" the checksum field itself. Otherwise the calculated checksum would be different between save and restore. Am I wrong? 

+       val += data[i];

Addition isn't the greatest hash function, but it is probably good enough to detect random corruption.
 
+    return val;
+}
+
 extern "C"
 [[gnu::target("no-sse")]]
 void fpu_state_init_xsave(processor::fpu_state *s) {
@@ -61,21 +69,25 @@ void fpu_state_init_fxsave(processor::fpu_state *s) {
 extern "C"
 void fpu_state_save_xsave(processor::fpu_state *s) {
     processor::xsave(s, -1);
+    s->checksum = calculate_fpu_state_checksum(s);

I hope you verified that indeed xsave or fxsave was used in your test setup, and not the old x87 FPU saving code?

It was as I artificially changed assert in fpu_state_restore_xsave to something that would be false and indeed the assert being raised. I also put breakpoints to inspect that checksum is something non-zero. What was interesting very often it would show same value for given thread. Maybe because FPU is not that often most of the time.
Since I expanded this code to another check so the diff looks like this below. The point of the new saved flag was to verify that we are restoring data that was indeed save for this thread previously - another thought I had, but it did not detect any problems unless my detection logic with "saved" flag is incorrect.
diff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.cc
index bb96af3d..153aaa9a 100644
--- a/arch/x64/arch-cpu.cc
+++ b/arch/x64/arch-cpu.cc
@@ -42,6 +42,14 @@ exception_guard::~exception_guard()
     sched::cpu::current()->arch.exit_exception();
 }
 
+static inline u64 calculate_fpu_state_checksum(processor::fpu_state *s) {
+    u64 val = 0;
+    char* data = s->legacy;
+    for (u64 i = 0; i < sizeof(processor::fpu_state) - 2 * sizeof(u64); i++)
+       val += data[i];
+    return val;
+}
+
 extern "C"
 [[gnu::target("no-sse")]]
 void fpu_state_init_xsave(processor::fpu_state *s) {
@@ -52,30 +60,43 @@ void fpu_state_init_xsave(processor::fpu_state *s) {
     // FPU state to the stack, so must not touch any of the FPU registers.
     // This is why the "gnu::target(no-sse)" specification above is critical.
     memset(s->xsavehdr, 0, sizeof(s->xsavehdr));
+    s->saved = 0;
 }
 
 extern "C"
 void fpu_state_init_fxsave(processor::fpu_state *s) {
+    s->saved = 0;
 }
 
 extern "C"
 void fpu_state_save_xsave(processor::fpu_state *s) {
     processor::xsave(s, -1);
+    s->checksum = calculate_fpu_state_checksum(s);
+    s->saved = 1;
 }
 
 extern "C"
 void fpu_state_save_fxsave(processor::fpu_state *s) {
     processor::fxsave(s);
+    s->checksum = calculate_fpu_state_checksum(s);
+    s->saved = 1;
 }
 
 extern "C"
 void fpu_state_restore_xsave(processor::fpu_state *s) {
+    assert(s->saved == 1);
     processor::xrstor(s, -1);
+    assert(s->checksum == calculate_fpu_state_checksum(s));
+    s->saved = 0;
+
 }
 
 extern "C"
 void fpu_state_restore_fxsave(processor::fpu_state *s) {
+    assert(s->saved == 1);
     processor::fxrstor(s);
+    assert(s->checksum == calculate_fpu_state_checksum(s));
+    s->saved = 0;
 }
 
 extern "C"
diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
index e3769540..7cd8d181 100644
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -103,6 +103,7 @@ void thread::switch_to()
            [rip]"i"(offsetof(thread_state, rip))
          : "rbx", "rdx", "rsi", "rdi", "r8", "r9",
+    s->saved = 0;
+
 }
 
 extern "C"
 void fpu_state_restore_fxsave(processor::fpu_state *s) {
+    assert(s->saved == 1);
     processor::fxrstor(s);
+    assert(s->checksum == calculate_fpu_state_checksum(s));
+    s->saved = 0;
 }
 
 extern "C"
diff --git a/arch/x64/processor.hh b/arch/x64/processor.hh
index 6b1fbf66..e9d0c4f8 100644
--- a/arch/x64/processor.hh
+++ b/arch/x64/processor.hh
@@ -339,10 +339,14 @@ inline u64 ticks()
 }
 
 struct fpu_state {
     char legacy[512];
          char xsavehdr[24];
     char reserved[40];
     char ymm[256];
+    u64 checksum;
+    u64 saved;
 } __attribute__((packed));
 
 inline void fxsave(fpu_state* s)
Addition isn't the greatest hash function, but it is probably good enough to detect random corruption.

Waldek Kozaczuk

unread,
Dec 2, 2018, 5:05:27 PM12/2/18
to OSv Development
Forgot to reply to entire group. 

---------- Forwarded message ---------
From: Waldek Kozaczuk <jwkoz...@gmail.com>
Date: Sun, Dec 2, 2018 at 16:26
Subject: Re: OSv crashes fairly sporadically with page fault when transcoding video with ffmpeg
To: Nadav Har'El <n...@scylladb.com>


I think I have good news but I am adding my comments inlined.
On Sun, Dec 2, 2018 at 7:08 AM Nadav Har'El <n...@scylladb.com> wrote:
On Sat, Dec 1, 2018 at 6:26 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
Here we go again ... I was thinking more about it. 

First of regarding potential corruption of FPU state during involuntary interrupt task switch, unless my code that verifies it is not the case using the checksum logic (see my previous email with the patch), there could be another possibility - we are telling xsave/xstor to save/restore to/from wrong location in memory. Meaning that whatever got saved stays not corrupted but somehow we are restoring wrong data for whatever reason. Probably very unlikely. 

Yes, I think it is unlikely - since you are checking the checksum, it means we are restoring valid saved state, but someone else's saved state. I don't have an idea how this can happen.

If we rule out corruption of the saved FPU state, I can think of three remaining options:

1. This is not a saved-FPU-state corruption - printf() really goes into an endless loop when it gets a certain input. I'm still hoping you can get more detailed printouts on what exactly what printf()'s *input* when it went into this endless loop. One thing you can do is to have a variable (perhaps thread-local global if you don't want it optimized out?) holding that printf_fp's last input, and when you get a crash, print this global. Or something. 
2. This is a situation of some asynchronous even where we ran a handler - especially the scheduler which uses the FPU - but we *forgot* to save the FPU state. I don't know what this can be. This can perhaps be debugged with tracepoints - enable various tracepoints, especially the scheduler, and fpu saves/restores, and after the crash look at the tracepoint log. Can we find unexpected scheduler calls without an fpu save?

3. This is a situation where we saved the FPU, but *after* doing some FPU work. We had and fix this kind of issue in 4349bfd3583df03d44cda480049e630081d3a20b but it's more surprising to see this kind of with x87 registers, not SSE, I can't think of a reason why the compiler will touch these registers without our intent.


Secondly I think that part of our theory, that this famous loop in vfprintf loops infinitely until tries to access unmapped page of memory, is very likely correct. But it might happening for different reason. So far we thought that simply the number is NaN or corrupt that makes FPU Instruction calculate NAN forever. But what if for whatever reason we are experiencing FPU stack overflow - all registers are taken and any instruction trying to load data into it keeps failing because it is full.

This is an interesting point. The eight x87 registers are freely addressable (st0 through st7), but can *also* be used as a stack. The current position is stored as three bits in the floating-point status word (FSW), and the fxsave (or newer xsave...) saves this as well on involuntary context switches.
Here is extra evidence that seems to validate at least some of the elements of my theory.

When I add this to vfprintf:
diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.c
index aac790c0..5cd31b3e 100644
--- a/libc/stdio/vfprintf.c
+++ b/libc/stdio/vfprintf.c
@@ -209,6 +209,9 @@ static char *fmt_u(uintmax_t x, char *s)
 
 static int fmt_fp(FILE *f, long double y, int w, int p, int fl, int t)
 {
+       uint16_t fpu_status;
+       asm volatile ("fnstsw %0" : "=m" (fpu_status));
+
        uint32_t big[(LDBL_MAX_EXP+LDBL_MANT_DIG)/9+1];
        uint32_t *a, *d, *r, *z;
        int e2=0, e, i, j, l;
@@ -654,8 +657,16 @@ static int printf_core(FILE *f, const char *fmt, va_list *ap, union arg *nl_arg,
        return 1;
 }
 
+//The TLS version cannot be printed by GDB:
+//(gdb) p vfprintf_fpu_status
+// Cannot find thread-local storage for Thread 1, executable file build/release.x64/loader.elf:
+// TLS not supported or disabled on this target
+//static __thread uint16_t vfprintf_fpu_status = 0;
+static uint16_t vfprintf_fpu_status = 0;
 int vfprintf(FILE *restrict f, const char *restrict fmt, va_list ap)
 {
+       asm volatile ("fnstsw %0" : "=m" (vfprintf_fpu_status));
+
        va_list ap2;
        int nl_type[NL_ARGMAX+1] = {0};
        union arg nl_arg[NL_ARGMAX+1];

 and capture both variable (FPU status word) after crash:
(gdb) bt
#0  0x00000000003a8542 in processor::cli_hlt () at arch/x64/processor.hh:247
#1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
#2  osv::halt () at arch/x64/power.cc:24
#3  0x000000000023ef34 in abort (fmt=fmt@entry=0x63199b "Aborted\n") at runtime.cc:132
#4  0x0000000000202765 in abort () at runtime.cc:98
#5  0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, ef=0xffff800004b8f068) at core/mmu.cc:1316
#6  0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984, ef=ef@entry=0xffff800004b8f068) at core/mmu.cc:1330
#7  0x00000000003a239c in page_fault (ef=0xffff800004b8f068) at arch/x64/mmu.cc:38
#8  <signal handler called>
#9  0x000000000044ff77 in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:303
#10 0x0000000000000000 in ?? ()
(gdb) frame 9
#9  0x000000000044ff77 in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:303
303 *z = y;
(gdb) p/t fpu_status
$1 = 1100001
(gdb) p/t vfprintf_fpu_status
$2 = 11000001101001

According to this http://www.ray.masmcode.com/tutorial/fpuchap1.htm#sword - FPU status word at the beginning of vfprint does not look anything clean. So something left it not clean before.  

The calling convention when calling a function (i.e., voluntary context switch) is that this stack is popped. The same is true on return (except st0 can be used for returning a floating-point value, if this is a function returning a floating point value). I don't know what will happen if a function violates this convention.
 
I believe this exception is disabled but I can enable it to see if I am correct and it happens somewhere in the code. Now what if after voluntary or involuntary thread switch the FPU stack is full and the thread switched to simply does not know it. Whose responsibility is in this case to make sure that FPU is in clean state (empty)?

When we switch *back* to an old thread, we should switch to its old FPU stack, including its old position in the x87 stack, so if it's full, it's this thread's fault.
But on a *new* thread, we do need to make sure it got a clean slate. We had a bug in this long ago, which 0e8d9b5e0a1af02a31799fe1d40e434bc2c2a5fb fixed - thread_main_c() now calls processor::init_fpu(). This calls the FNINIT instruction, which among other things, resets the stack postion (in FSW) to 0.

But now that I think about this, in case of an involuntary switch to the kernel (e.g., interrupt) what FPU state does the kernel code get? What if the kernel gets a full fpu stack and then can't do its own FPU calculations? Should fpu_lock() run fpu_init() after saving the FPU state?
In any case, I don't think this has anything to do with the problem at hand. I'll open a separate issue.


Maybe the involuntary switch has/should have two sub-cases:
1) the thread we are switching to before was also involuntarily switched and we need to restore its FPU state from interrupt stack which we do 
2) the thread we are switching to voluntarily yielded and ... I think FPU should be in somewhat clean state, at least be empty unless something left it full before, right?

I think you may have indeed found a problem (I'm not sure it's related to the bug we're chasing, but a problem nontheless, and *may* be related):

We already have these two cases, implicitly: As I explained in a previous mail, when we context switch *back* to a thread, we get back to where it was when it gave up the CPU:
1. If this was an involuntary context switch, we find ourselves returning (very belatedly) from the scheduler to the end of the interrupt() function, and now this function ends and pops the saved fpu (from the thread's interrupt stack), and finally returns from interrupt() to the original thread.
2. If this was a voluntary context switch, the scheduler returns into whatever function which called it (sleep(), read(), etc.) and that returns to. The floating point state was never saved (it doesn't need to be, because a function was called) but, on return, it was never restored. It just gets a copy of whatever junk was there from the previous thread or OSv code that ran. The junk st0..st7 registers are fine - the caller cannot count on anything in there too. But the control registers might now have junk as well, and that OSv does have to restore, and that it does since commit 0959f20863b119c5ca7d1e663bc22e6329814386. However, we will never restore the floating point *status* word (FSW). So if one thread was interrupted with a full fp stack, another thread we switch to might be returned-to with a full stack.

You can try adding an FNINIT in arch-switch.hh *before* processor::fldcw(fpucw) and see if it fixes anything.

Indeed this what I did:
diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hh
index e3769540..7cd8d181 100644
--- a/arch/x64/arch-switch.hh
+++ b/arch/x64/arch-switch.hh
@@ -103,6 +103,7 @@ void thread::switch_to()
            [rip]"i"(offsetof(thread_state, rip))
          : "rbx", "rdx", "rsi", "rdi", "r8", "r9",
            "r10", "r11", "r12", "r13", "r14", "r15", "memory");
+    asm volatile("fninit");
     processor::fldcw(fpucw);
     processor::ldmxcsr(mxcsr);
 }

And at least the issue described in #1010 went away based on around 30-40 executions which used to fail almost every time. So I think is very good news !!!  

Waldek Kozaczuk

unread,
Dec 2, 2018, 5:43:00 PM12/2/18
to OSv Development
In one of the issues (536?) there was some discovery suggesting that certain compiler flags would fix it. So I wonder if depending on compiler flags FNINIT is sometimes skipped for optimization reasons. See this from “Simply FPU blog”:

“FINIT / FNINIT (Initialize FPU)

Syntax: finit (no operand)
fninit (no operand)
The FINIT and FNINIT have exactly the same coding except that the FINIT is automatically preceded by the FWAIT code.
This instruction initializes the FPU by resetting all the registers and flags to their default values. (Refer to Chap.1 Description of FPU Internals for default values.)

Due to some of the restrictions of the FPU, it is good programming practice to take the precaution of initializing it before starting any computation, unless it is known that data has been left in it for further processing. If several sections of code use the FPU and the programmer is careless about not leaving the data registers "empty" in some of them, they can fill up very rapidly.””

Nadav Har'El

unread,
Dec 3, 2018, 2:45:39 AM12/3/18
to Waldek Kozaczuk, osv...@googlegroups.com
On Sun, Dec 2, 2018 at 11:26 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
I think I have good news but I am adding my comments inlined.

This is excellent news!
I think, then, we'll need to add FNINIT or maybe just FEMMS (see my last comment in https://github.com/cloudius-systems/osv/issues/1019) to either the FPU fpu_lock code, or to the context switch code. I think just the former would be better, for several reasons:
1. It will protect the kernel's FPU code, not just the other threads, from starting with a filled stack.
2. A workload which does a lot of voluntary context switching will not suffer any performance loss (interrupts are already slow). tests/misc-ctxsw should not show any performance degredation, I think. In general OSv already optimizes voluntary context switches more than involuntary ones (interrupts).
3. It will be enough to protect the switched-to thread because the kernel will not leave anything in the stack (I think... because thread::switch_to() is a function...

It will be nice to have a test for this. Perhaps, as I suggested earlier, one thread in a busy loop filling the fpu stack, or doing sse (xmm) work, or both (two test cases) and another thread in a busy loop doing normal floating point work, and see if one bothers the other (or the kernel).


And at least the issue described in #1010 went away based on around 30-40 executions which used to fail almost every time. So I think is very good news !!!  

:-)

Waldek Kozaczuk

unread,
Dec 3, 2018, 7:19:38 AM12/3/18
to Nadav Har'El, osv...@googlegroups.com


Sent from my iPhone

On Dec 3, 2018, at 02:45, Nadav Har'El <n...@scylladb.com> wrote:


On Sun, Dec 2, 2018 at 11:26 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:
I think I have good news but I am adding my comments inlined.

This is excellent news!
I think, then, we'll need to add FNINIT or maybe just FEMMS (see my last comment in https://github.com/cloudius-systems/osv/issues/1019) to either the FPU fpu_lock code, or to the context switch code. I think just the former would be better, for several reasons:
I have not tested if former fixes anything. But I agree with you that it should. 

1. It will protect the kernel's FPU code, not just the other threads, from starting with a filled stack.
2. A workload which does a lot of voluntary context switching will not suffer any performance loss (interrupts are already slow). tests/misc-ctxsw should not show any performance degredation, I think. In general OSv already optimizes voluntary context switches more than involuntary ones (interrupts).
3. It will be enough to protect the switched-to thread because the kernel will not leave anything in the stack (I think... because thread::switch_to() is a function...

It will be nice to have a test for this. Perhaps, as I suggested earlier, one thread in a busy loop filling the fpu stack, or doing sse (xmm) work, or both (two test cases) and another thread in a busy loop doing normal floating point work, and see if one bothers the other (or the kernel).
Any good examples off code using xmm can you point me to? Do you think I will have to code this in assembly ?
Reply all
Reply to author
Forward
0 new messages