#3 0x000000000023ef34 in abort (fmt=fmt@entry=0x63089b "Aborted\n") at runtime.cc:132#4 0x0000000000202765 in abort () at runtime.cc:98#5 0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, ef=0xffff800003431068) at core/mmu.cc:1316#6 0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984, ef=ef@entry=0xffff800003431068) at core/mmu.cc:1330#7 0x00000000003a222c in page_fault (ef=0xffff800003431068) at arch/x64/mmu.cc:38#8 <signal handler called>#9 0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:300#10 0x0000000000000000 in ?? ()(gdb) frame 9#9 0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:300300 *z = y;(gdb) display y1: y = 0(gdb) p/f y$1 = 0(gdb) p z$2 = (uint32_t *) 0x200000200000(gdb) p z - big // big is a beginning of the buffer that is around 1800 long $3 = 4632
(gdb) p big$4 = {0 <repeats 1828 times>}
The FPU bugs usually cause crashes when memory-copying code using FPU overwrites random parts of memory. So why here do we always get a crash in exactly the same place with this "z", and what overwrote it and when??? (z is set just a couple of lines above and increased here in a tight loop, what can overwrite it?Another possibility is that "y" is kept in a floating point register and clobbered by some interrupt that doesn't save FPU state (or something like that) which causes the loop to continue forever.If you can reliably reproduce this, you can add various printouts or variables to help debug what goes wrong with "z" or "y".
diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.cindex aac790c0..1e116038 100644--- a/libc/stdio/vfprintf.c+++ b/libc/stdio/vfprintf.c@@ -8,6 +8,7 @@ #include <inttypes.h> #include <math.h> #include <float.h>+#include <assert.h> /* Some useful macros */ @@ -296,9 +297,14 @@ static int fmt_fp(FILE *f, long double y, int w, int p, int fl, int t) if (e2<0) a=r=z=big; else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1; + int steps = 0; do { *z = y; y = 1000000000*(y-*z++);+ steps++;+ if(steps > 2000) {+ assert(0);+ } } while (y); while (e2>0) {struct fpu_state { char legacy[512]; char xsavehdr[24]; char reserved[40]; char ymm[256];} __attribute__((packed));Upgrading vfprintf.c to the latest version from musl did not help. So back to square one.
(gdb) bt#0 0x00000000003a83d2 in processor::cli_hlt () at arch/x64/processor.hh:247#1 arch::halt_no_interrupts () at arch/x64/arch.hh:48#2 osv::halt () at arch/x64/power.cc:24#3 0x000000000023ef34 in abort (fmt=fmt@entry=0x63089b "Aborted\n") at runtime.cc:132#4 0x0000000000202765 in abort () at runtime.cc:98#5 0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, ef=0xffff800003431068) at core/mmu.cc:1316#6 0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984, ef=ef@entry=0xffff800003431068) at core/mmu.cc:1330#7 0x00000000003a222c in page_fault (ef=0xffff800003431068) at arch/x64/mmu.cc:38#8 <signal handler called>#9 0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:300#10 0x0000000000000000 in ?? ()(gdb) frame 9#9 0x000000000044ef0f in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:300300 *z = y;(gdb) display y1: y = 0
(gdb) p/f y$1 = 0(gdb) p z$2 = (uint32_t *) 0x200000200000
+ }} while (y);while (e2>0) {
Could it be because gcc optimized code differently and did not put "y" in the FPU registerr?BTW I wonder if this part of some optimization suggestions in #536 is any relevant - https://github.com/cloudius-systems/osv/issues/536#issuecomment-62365543. Would it be possible to explicitly force gcc not to use FPU register for y? Would it degrade performance a lot? In ether case even if possible I am guessing the underlying issue of not saving FPU state somewhere could hurt us later in other place of the code anyway, correct?
Finally when you say FPU state got corrupted or not saved/restored do you mean same thing? Can you elaborate?I see this struct in processor.hh:struct fpu_state {char legacy[512];char xsavehdr[24];char reserved[40];char ymm[256];} __attribute__((packed));
Is this struct getting corrupted/overwritten in memory by some code or is it that some interrupt handler does not save/restore FPU using fpu_lock class like it is done by interrupt, page_fault handler, signal and syscall handler? Is this what you are saying difficult to trace?
480 else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1; 481 482 do { 483 *z = y; 484 540: d9 c0 fld %st(0) 485 542: d9 ad 0c e3 ff ff fldcw -0x1cf4(%rbp) 486 548: df bd 00 e3 ff ff fistpll -0x1d00(%rbp) 487 54e: d9 ad 0e e3 ff ff fldcw -0x1cf2(%rbp) 488 y = 1000000000*(y-*z++); 489 554: 48 83 c3 04 add $0x4,%rbx 490 *z = y; 491 558: 48 8b 85 00 e3 ff ff mov -0x1d00(%rbp),%rax 492 55f: 89 43 fc mov %eax,-0x4(%rbx) 493 y = 1000000000*(y-*z++); 494 562: 89 c0 mov %eax,%eax 495 564: 48 89 85 f0 e2 ff ff mov %rax,-0x1d10(%rbp) 496 56b: df ad f0 e2 ff ff fildll -0x1d10(%rbp) 497 571: de e9 fsubrp %st,%st(1) 498 573: d8 c9 fmul %st(1),%st 499 } while (y); 500 575: d9 ee fldz 501 577: d9 c9 fxch %st(1) 502 579: db e9 fucomi %st(1),%st 503 57b: dd d9 fstp %st(1) 504 57d: 7a c1 jp 540 <fmt_fp+0x140> 505 57f: 75 bf jne 540 <fmt_fp+0x140> 506 581: dd d8 fstp %st(0) 507 583: dd d8 fstp %st(0) 508 509 while (e2>0) { 510 585: 4c 8b a5 e0 e2 ff ff mov -0x1d20(%rbp),%r12--
You received this message because you are subscribed to a topic in the Google Groups "OSv Development" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/osv-dev/ZAjvi_vEhnA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to osv-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
I wonder if there is some kind of arithmetic exception encountered (overflow when multiplying) which we do not handle correctly. I think we have handler that would simply abort in this case.
Indeed, seems like a loop that works on fpu registers and stack. The actual loop's test, while(y) is the "fucomi" instruction which compares two floating point values one of which being a zero created by "fldz". My completely unproven suspicion is that in the middle of this loop we get an interrupt (possibly also leading to a context switch, running another thread, and only much later returning to this thread), and for some reason the floating point state (which includes the register stack, etc.) is not saved correctly - or not restored correctly (perhaps restored from a corrupted array?). If after such corruption, "y" (in whatever register it sits) becomes, for example, NaN, the loop will never finish. I wonder if we can print these registers from gdb to see if perhaps gdb showing "y=0" isn't really correct.
I do not think I understand mechanics of how context switch in handled in OSv but fundamentally whenever current thread is preempted for whatever reason (timer expired, etc), OSv thread scheduler must somehow save all 16 registers, flags and everything else but FPU state somewhere and then upon selecting new thread for execution it must restore values of its 16 registers plus everything else so that the newly switched thread sees "the world" as it was when it was last running. I think more less it is done through TSS (task state segment) for each cpu and saving/restoring happens automatically (or not?).
In any case as far as FPU state goes there is number of places (interrupts, page fault, syscall handler) where fpu_lock construct is used to isolate any changes to FPU registers made by OSv code executed by corresponding code for these interrupts, etc. That way OSv code should not corrupt any FPU state.But what mechanism does exist in OSv to make sure that concurrent application threads that use floating point registers (like vfprintf used by httpserver thread or any other floating point arithmetic used by ffmpeg threads for example) do NOT corrupt each other FPU state when preempted?Here is what I found on internet (https://wiki.osdev.org/Context_Switching):"The FPU/MMX and SSE state could be saved and reloaded, but the CPU can also be tricked into generating an exception the first time that an FPU/MMX or SSE instruction is used by copying the hardware context switch mechanism (setting the TS flag in CR0)."
Does this suggest it should be possible to register an exception handler that would be called when first FPU operation is executed to mark to save FPU state when preempted and then restore when coming back to this thread? But I do not see where such code exists in OSv. Or have I missed it?Waldek
On Tuesday, November 27, 2018 at 6:30:03 PM UTC-5, Waldek Kozaczuk wrote:I was also reading about "lazy" vs "eager" FPU state save/restore in Linux and how due to some security reasons they are advocating users to switch to the eager. I think the eager means that on each context switch FPU state gets saved/restored regardless if FPU registered are used.Is OSv using "eager" or "lazy" strategy? I am guessing probably the lazy one.Also I am not sure about cassandra (ant original older bug) but ffmpeg is very heavy on floating point arithmetic so maybe that exposes FPU bugs more easily.On Tue, Nov 27, 2018 at 6:07 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:I also checked if y == 0 in Gdb when I connected after the crash and indeed it was true.What about eflags which are set by FUCOMI? Are we saving those?Sent from my iPhoneOn Tue, Nov 27, 2018 at 10:11 PM Nadav Har'El <n...@scylladb.com> wrote:Indeed, seems like a loop that works on fpu registers and stack. The actual loop's test, while(y) is the "fucomi" instruction which compares two floating point values one of which being a zero created by "fldz". My completely unproven suspicion is that in the middle of this loop we get an interrupt (possibly also leading to a context switch, running another thread, and only much later returning to this thread), and for some reason the floating point state (which includes the register stack, etc.) is not saved correctly - or not restored correctly (perhaps restored from a corrupted array?). If after such corruption, "y" (in whatever register it sits) becomes, for example, NaN, the loop will never finish. I wonder if we can print these registers from gdb to see if perhaps gdb showing "y=0" isn't really correct.Ok, so I started theorizing what might cause this...If I remember correctly, OSv currently always saves the FPU state on some stack, using the fpu_lock type.Could we possibly be using stacks which are too small to hold this FPU state?In arch/x64/arch-cpu.hh we set a 4096 byte stack for nested exceptions, 4096 byte stack for interrupts, and 4096*4 byte stack for normal exceptions. Maybe one of these is too small? If you can easily reproduce this bug, can you add a zero to all of these and see if maybe the bug goes away with bigger stacks?
--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
On Wed, Nov 28, 2018 at 5:54 AM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:I do not think I understand mechanics of how context switch in handled in OSv but fundamentally whenever current thread is preempted for whatever reason (timer expired, etc), OSv thread scheduler must somehow save all 16 registers, flags and everything else but FPU state somewhere and then upon selecting new thread for execution it must restore values of its 16 registers plus everything else so that the newly switched thread sees "the world" as it was when it was last running. I think more less it is done through TSS (task state segment) for each cpu and saving/restoring happens automatically (or not?).A context switch (from one thread to another thread) can happen in two ways. The first is voluntary - the first thread calls some blocking function (sleep(), read(), etc.) and goes to sleep and replaced by another thread. The second is involuntary - we get an asynchronous event during the run of the thread - an interrupt, exception, or signal - and the scheduler decides to do a context switch at that point.In the first case (voluntary switching) our life is easy. The user thread called a *function*. The ABI requires that when calling a function, the caller must save many of the registers (known as "caller-saved") himself because these can be used by the called function. It also requires that the caller cannot assume anything about the FPU state (because the callee is allowed to use the FPU too!). For this reason the thread context switch code (switching between user threads) only needs to save some of the registers, and does NOT need to save the FPU state.The situation is different with involuntary context switches. When an asynchronous event, e.g., an interrupt, occurs, the user thread is in a random position in the code. It may be using all its registers, and the FPU state (including the old-style FPU and the new SSE and AVX registers). Because our interrupt handler (which may do anything from running the scheduler to reading a page from disk on page fault) may need to use any of these registers, all of them, including the FPU, need to be saved on interrupt time. The interrupt has a separate stack, and the FPU is saved on this stack (see fpu_lock use in interrupt()). When the interrupt finishes, this FPU is restored. This includes involuntary context switching: thread A receives an interrupt, saves the FPU, does something and decides to switch to thread B, and a while later we switch back to thread A at which point the interrupt handler "returns" and restores the FPU state.
I can't think of a case where the user thread (doing that printf loop) can get interrupted in the middle but we *dont'* save the FPU. So if we do, there remains the possibility that we manage to ruin it before saving (this was issue 4349bfd3583df03d44cda480049e630081d3a20b) or that something else entirely overwrites and corrupts the saved state.An idea worth trying: We can verify that the latter (corruption of the saved state) is not the problem by adding a simple checksum to the saved state in fpu_lock, and before restoring it, check that the checksum is intact.
On Nov 28, 2018, at 03:58, Nadav Har'El <n...@scylladb.com> wrote:Does involuntary case include scenario when enough time designated for current thread by scheduler expires? I would imaging this would qualify as interrupt?The situation is different with involuntary context switches. When an asynchronous event, e.g., an interrupt, occurs, the user thread is in a random position in the code. It may be using all its registers, and the FPU state (including the old-style FPU and the new SSE and AVX registers). Because our interrupt handler (which may do anything from running the scheduler to reading a page from disk on page fault) may need to use any of these registers, all of them, including the FPU, need to be saved on interrupt time. The interrupt has a separate stack, and the FPU is saved on this stack (see fpu_lock use in interrupt()). When the interrupt finishes, this FPU is restored. This includes involuntary context switching: thread A receives an interrupt, saves the FPU, does something and decides to switch to thread B, and a while later we switch back to thread A at which point the interrupt handler "returns" and restores the FPU state.
diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.cindex aac790c0..1e116038 100644--- a/libc/stdio/vfprintf.c+++ b/libc/stdio/vfprintf.c@@ -8,6 +8,7 @@ #include <inttypes.h> #include <math.h> #include <float.h>+#include <assert.h> /* Some useful macros */ @@ -296,9 +297,14 @@ static int fmt_fp(FILE *f, long double y, int w, int p, int fl, int t) if (e2<0) a=r=z=big; else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1; + int steps = 0; do { *z = y; y = 1000000000*(y-*z++);+ steps++;+ if(steps > 2000) {+ assert(0);+ } } while (y); while (e2>0) { if (e2<0) a=r=z=big; else a=r=z=big+sizeof(big)/sizeof(*big) - LDBL_MANT_DIG - 1;
int steps = 0; do { *z = y; 501: d9 bd 0e e3 ff ff fnstcw -0x1cf2(%rbp) if (e2<0) a=r=z=big; 507: 48 8d 85 40 e3 ff ff lea -0x1cc0(%rbp),%rax 50e: 45 85 c9 test %r9d,%r9d 511: 48 8d 95 cc fe ff ff lea -0x134(%rbp),%rdx 518: 48 89 c1 mov %rax,%rcx *z = y; 51b: 0f b7 85 0e e3 ff ff movzwl -0x1cf2(%rbp),%eax if (e2<0) a=r=z=big; 522: 48 0f 49 ca cmovns %rdx,%rcx *z = y; 526: 80 cc 0c or $0xc,%ah if (e2<0) a=r=z=big; 529: 48 89 8d d8 e2 ff ff mov %rcx,-0x1d28(%rbp) y = 1000000000*(y-*z++); 530: 48 8d 59 04 lea 0x4(%rcx),%rbx 534: 48 8d 91 44 1f 00 00 lea 0x1f44(%rcx),%rdx *z = y; 53b: 66 89 85 0c e3 ff ff mov %ax,-0x1cf4(%rbp) 542: d9 c0 fld %st(0) 544: d9 ad 0c e3 ff ff fldcw -0x1cf4(%rbp) 54a: df bd 00 e3 ff ff fistpll -0x1d00(%rbp) 550: d9 ad 0e e3 ff ff fldcw -0x1cf2(%rbp) 556: 48 8b 85 00 e3 ff ff mov -0x1d00(%rbp),%rax 55d: 89 01 mov %eax,(%rcx) y = 1000000000*(y-*z++); 55f: 89 c0 mov %eax,%eax 561: 48 89 85 f8 e2 ff ff mov %rax,-0x1d08(%rbp) 568: df ad f8 e2 ff ff fildll -0x1d08(%rbp) 56e: de e9 fsubrp %st,%st(1) 570: d9 05 00 00 00 00 flds 0x0(%rip) # 576 <fmt_fp+0x176> 576: dc c9 fmul %st,%st(1) steps++; if(steps > 2000) { 578: eb 48 jmp 5c2 <fmt_fp+0x1c2> 57a: d9 c9 fxch %st(1) 57c: eb 04 jmp 582 <fmt_fp+0x182> 57e: 66 90 xchg %ax,%ax
580: d9 c9 fxch %st(1) *z = y; 582: d9 c0 fld %st(0) 584: d9 ad 0c e3 ff ff fldcw -0x1cf4(%rbp) 58a: df bd 00 e3 ff ff fistpll -0x1d00(%rbp) 590: d9 ad 0e e3 ff ff fldcw -0x1cf2(%rbp) y = 1000000000*(y-*z++); 596: 48 83 c3 04 add $0x4,%rbx *z = y; 59a: 48 8b 85 00 e3 ff ff mov -0x1d00(%rbp),%rax 5a1: 89 43 fc mov %eax,-0x4(%rbx) y = 1000000000*(y-*z++); 5a4: 89 c0 mov %eax,%eax 5a6: 48 89 85 f8 e2 ff ff mov %rax,-0x1d08(%rbp) 5ad: df ad f8 e2 ff ff fildll -0x1d08(%rbp) 5b3: de e9 fsubrp %st,%st(1) 5b5: d8 c9 fmul %st(1),%st if(steps > 2000) { 5b7: 48 39 d3 cmp %rdx,%rbx 5ba: 0f 84 99 10 00 00 je 1659 <fmt_fp+0x1259> 5c0: d9 c9 fxch %st(1) assert(0); } } while (y); 5c2: d9 ee fldz 5c4: d9 ca fxch %st(2) 5c6: db ea fucomi %st(2),%st 5c8: dd da fstp %st(2) 5ca: 7a ae jp 57a <fmt_fp+0x17a> 5cc: 75 b2 jne 580 <fmt_fp+0x180> 5ce: dd d8 fstp %st(0) 5d0: dd d8 fstp %st(0)The title of this thread indicates that the crash happens sporadically and indeed it is the case with the scenario described in 1st and 2nd email of this thread. The symptom (stack trace with fmt_fp() function captured by gdb) looks identical to what was reported in https://github.com/cloudius-systems/osv/issues/1010 and this https://github.com/cloudius-systems/osv/issues/536. We cannot be certain about whether the root cause is the same. However it is worth mentioning that crash happens with same symptom almost every time if I run the scenario described in issue #1010 which is different (but similar) than scenario described in 1st and 2nd email of this thread. I am emphasizing this because we have a way to replicate one of these crashes (possibly same root cause) in very repeatable way in case we have ideas of things to try.
Regarding your excellent explanations how non-FPU and FPU state is saved/restored by OSv I wanted to run a scenario to see if I understand it well (please forgive my ignorance if I am missing something obvious). Imagine we have 2 user threads (uT1 and uT2) executing some logic that involves floating point operations in a loop (like the one in the disassembled code). Ffmpeg is a good example as it must be very CPU intensive and doing all kind of FP computations to transcode video. Let us say on given cpu this sequence happens in time:->uT1 is running (executing long FP logic loop)--> timer triggered interrupt happens because allotted time slice expired [some FPU registers had some inflight data = FPU has state S1)----> OSv saves FPU state and other registers using fpu_lock construct
-------> OSv identifies next thread to run which is uT2---------> OSv restores FPU to state S1 (fpu_lock goes out of scope) and other registers as well and switches to uT2
--
On Wed, Nov 28, 2018 at 9:20 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:The title of this thread indicates that the crash happens sporadically and indeed it is the case with the scenario described in 1st and 2nd email of this thread. The symptom (stack trace with fmt_fp() function captured by gdb) looks identical to what was reported in https://github.com/cloudius-systems/osv/issues/1010 and this https://github.com/cloudius-systems/osv/issues/536. We cannot be certain about whether the root cause is the same. However it is worth mentioning that crash happens with same symptom almost every time if I run the scenario described in issue #1010 which is different (but similar) than scenario described in 1st and 2nd email of this thread. I am emphasizing this because we have a way to replicate one of these crashes (possibly same root cause) in very repeatable way in case we have ideas of things to try.That is good. If we give up on this effort (at least temporarily), please summarize what you know about reproducing this issue in a new bug-tracker issue (and reference 1010 and 536) so we could return to it later.Regarding your excellent explanations how non-FPU and FPU state is saved/restored by OSv I wanted to run a scenario to see if I understand it well (please forgive my ignorance if I am missing something obvious). Imagine we have 2 user threads (uT1 and uT2) executing some logic that involves floating point operations in a loop (like the one in the disassembled code). Ffmpeg is a good example as it must be very CPU intensive and doing all kind of FP computations to transcode video. Let us say on given cpu this sequence happens in time:->uT1 is running (executing long FP logic loop)--> timer triggered interrupt happens because allotted time slice expired [some FPU registers had some inflight data = FPU has state S1)----> OSv saves FPU state and other registers using fpu_lock constructSo far, right. This state is saved inside the interrupt stack, which while thread uT1 is running, is uT1->_arch->interrupt_stack.By the way, note that the regular (non-FPU) registers are also on this stack - the code in entry.S saves them on the stack and passes the pointer to it when it calls interrupt().-------> OSv identifies next thread to run which is uT2---------> OSv restores FPU to state S1 (fpu_lock goes out of scope) and other registers as well and switches to uT2No, this is not what happens now (at least, not if you mean that "S1" was uT1's state...). Before interrupt() ends or fpu_lock runs out of scope, interrupt calls sched::preempt().This is a weird function. It doesn't do something quickly and return :-) Rather, it calls the scheduler to decide which thread to run next. If it decides to run uT1 again, it just returns normally (and the FPU state is restored and the interrupt returns).But if it decides to run uT2 it doesn't quite return... Rather, it *switches* to thread uT2. This means we switch to uT2's stack, and jump to the program counter where uT2 was before. And where was uT2 running before? One of the likely options was that it was itself in interrupt() calling preempt() :-) So now, finally, when uT2 resumes, its call to preempt() "returns", and we reach the end of the interrupt() function. Remember that at this point we're running uT2, with uT2's stack. So when interrupt() ends, the fpu_lock goes out of scope, and the FPU state is popped from *uT2*'s stack. This is was uT2's FPU state, saved long ago, and now we're restoring it, and when interrupt() returns, uT2's correct FPU state will have been restored.
diff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.ccindex bb96af3d..34bf6081 100644--- a/arch/x64/arch-cpu.cc+++ b/arch/x64/arch-cpu.cc@@ -42,6 +42,14 @@ exception_guard::~exception_guard() sched::cpu::current()->arch.exit_exception(); } +static inline u64 calculate_fpu_state_checksum(processor::fpu_state *s) {+ u64 val = 0;+ char* data = s->legacy;+ for (u64 i = 0; i < sizeof(processor::fpu_state) - sizeof(u64); i++)+ val += data[i];+ return val;+}+ extern "C" [[gnu::target("no-sse")]] void fpu_state_init_xsave(processor::fpu_state *s) {@@ -61,21 +69,25 @@ void fpu_state_init_fxsave(processor::fpu_state *s) { extern "C" void fpu_state_save_xsave(processor::fpu_state *s) { processor::xsave(s, -1);+ s->checksum = calculate_fpu_state_checksum(s); } extern "C" void fpu_state_save_fxsave(processor::fpu_state *s) { processor::fxsave(s);+ s->checksum = calculate_fpu_state_checksum(s); } extern "C" void fpu_state_restore_xsave(processor::fpu_state *s) { processor::xrstor(s, -1);+ assert(s->checksum == calculate_fpu_state_checksum(s)); } extern "C" void fpu_state_restore_fxsave(processor::fpu_state *s) { processor::fxrstor(s);+ assert(s->checksum == calculate_fpu_state_checksum(s)); } extern "C"diff --git a/arch/x64/processor.hh b/arch/x64/processor.hhindex 6b1fbf66..a26fb028 100644--- a/arch/x64/processor.hh+++ b/arch/x64/processor.hh@@ -343,6 +343,7 @@ struct fpu_state { char xsavehdr[24]; char reserved[40]; char ymm[256];+ u64 checksum; } __attribute__((packed)); inline void fxsave(fpu_state* s)On Wed, Nov 28, 2018 at 9:20 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:The title of this thread indicates that the crash happens sporadically and indeed it is the case with the scenario described in 1st and 2nd email of this thread. The symptom (stack trace with fmt_fp() function captured by gdb) looks identical to what was reported in https://github.com/cloudius-systems/osv/issues/1010 and this https://github.com/cloudius-systems/osv/issues/536. We cannot be certain about whether the root cause is the same. However it is worth mentioning that crash happens with same symptom almost every time if I run the scenario described in issue #1010 which is different (but similar) than scenario described in 1st and 2nd email of this thread. I am emphasizing this because we have a way to replicate one of these crashes (possibly same root cause) in very repeatable way in case we have ideas of things to try.That is good. If we give up on this effort (at least temporarily), please summarize what you know about reproducing this issue in a new bug-tracker issue (and reference 1010 and 536) so we could return to it later.
CFLAGS_C99FSE = -std=c99 -nostdinc -ffreestanding -fexcess-precision=standard -frounding-math -Wa,--noexecstack
[registers] fps=0.0 q=-0.0 size=N/A time=00:00:14.00 bitrate=N/A dup=0 drop=273 speed= 28x RIP: 0x000000000044fb28 <???+4520744>RFL: 0x0000000000010206 CS: 0x0000000000000008 SS: 0x0000000000000010RAX: 0x0000000000000040 RBX: 0x00001000000faa22 RCX: 0x00002000001fd960 RDX: 0x00002000001fd618RSI: 0x000000000000000a RDI: 0x00002000001fd580 RBP: 0x00002000001fd5f0 R8: 0x0000000000000000R9: 0x0000000000000000 R10: 0x0000000000000000 R11: 0x00000000ffffffff R12: 0x0000000000000000R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x00000000ffffffff RSP: 0x00002000001fd508DUMMY_HANDLER for math_fault aborting.
[backtrace]0x00000000003a3ad6 <math_fault+22>0x00000000003a11dc <???+3805660>0x000000000045218b <vfprintf+171>0x0000000000454138 <vsnprintf+216>0x00001000000bfdbe <std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > __gnu_cxx::__to_xstring<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char>(int (*)(char*, unsigned long, char const*, __va_list_tag*), unsigned long, char const*, ...)+174>0x00001000000f1447 <httpserver::json::formatter::to_json[abi:cxx11](float)+103>0x00002000001fdf3f <???+2088767>0x00002000001fdf27 <???+2088743>0x0000100003c5a49f <???+63284383>0x4810c0834808578a <???+1208506250>Breakpoint 1, dump_registers (ef=0xffff800002a79068) at arch/x64/dump.cc:1717 osv::lookup_name_demangled((gdb) bt#0 dump_registers (ef=0xffff800002a79068) at arch/x64/dump.cc:17#1 0x00000000003a3ac9 in math_fault (ef=<optimized out>) at arch/x64/exceptions.cc:338#2 <signal handler called>#3 0x000000000044fb28 in pop_arg (arg=arg@entry=0x2000001fd580, type=<optimized out>, ap=ap@entry=0x2000001fd618) at libc/stdio/vfprintf.c:165#4 0x000000000045161c in pop_arg (ap=0x2000001fd618, type=<optimized out>, arg=0x2000001fd580) at libc/stdio/vfprintf.c:143#5 printf_core (f=f@entry=0x2000001fd7a0, fmt=fmt@entry=0x1000000faa20 "%f", ap=ap@entry=0x2000001fd618, nl_arg=nl_arg@entry=0x2000001fd6b0, nl_type=nl_type@entry=0x2000001fd630) at libc/stdio/vfprintf.c:538#6 0x000000000045218c in vfprintf (f=f@entry=0x2000001fd7a0, fmt=0x1000000faa20 "%f", ap=<optimized out>) at libc/stdio/vfprintf.c:675#7 0x0000000000454139 in vsnprintf (s=0x2000001fd8c0 "\256", n=<optimized out>, fmt=<optimized out>, ap=<optimized out>) at libc/stdio/vsnprintf.c:33#8 0x00001000000bfdbf in ?? ()#9 0x00000000000000ae in ?? ()#10 0x0000000000550b48 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_replace(unsigned long, unsigned#11 0x0000000000000000 in ?? ()140 static void pop_arg(union arg *arg, int type, va_list *ap)141 {142 /* Give the compiler a hint for optimizing the switch. */143 if ((unsigned)type > MAXSTATE) return;144 switch (type) {145 case PTR: arg->p = va_arg(*ap, void *);146 break; case INT: arg->i = va_arg(*ap, int);147 break; case UINT: arg->i = va_arg(*ap, unsigned int);148 #ifndef LONG_IS_INT149 break; case LONG: arg->i = va_arg(*ap, long);150 break; case ULONG: arg->i = va_arg(*ap, unsigned long);151 #endif152 break; case ULLONG: arg->i = va_arg(*ap, unsigned long long);153 break; case SHORT: arg->i = (short)va_arg(*ap, int);154 break; case USHORT: arg->i = (unsigned short)va_arg(*ap, int);155 break; case CHAR: arg->i = (signed char)va_arg(*ap, int);156 break; case UCHAR: arg->i = (unsigned char)va_arg(*ap, int);157 #ifdef ODD_TYPES158 break; case LLONG: arg->i = va_arg(*ap, long long);159 break; case SIZET: arg->i = va_arg(*ap, size_t);160 break; case IMAX: arg->i = va_arg(*ap, intmax_t);161 break; case UMAX: arg->i = va_arg(*ap, uintmax_t);162 break; case PDIFF: arg->i = va_arg(*ap, ptrdiff_t);163 break; case UIPTR: arg->i = (uintptr_t)va_arg(*ap, void *);164 #endif165 break; case DBL: arg->f = va_arg(*ap, double);166 break; case LDBL: arg->f = va_arg(*ap, long double);167 }168 }118 os_threads.set_handler([](const_req req) {119 using namespace std::chrono;120 httpserver::json::Threads threads;121 threads.time_ms = duration_cast<milliseconds>122 (osv::clock::wall::now().time_since_epoch()).count();123 httpserver::json::Thread thread;124 sched::with_all_threads([&](sched::thread &t) {126 thread.status = t.get_status();127 auto tcpu = t.tcpu();128 thread.cpu = tcpu ? tcpu->id : -1;129 thread.cpu_ms = duration_cast<milliseconds>(t.thread_clock()).count();130 thread.switches = t.stat_switches.get();131 thread.migrations = t.stat_migrations.get();132 thread.preemptions = t.stat_preemptions.get();133 thread.name = t.name();134 thread.priority = t.priority();135 thread.stack_size = t.get_stack_info().size;136 thread.status = t.get_status();137 threads.list.push(thread);138 });139 return threads;140 });I think I have found the culprit!
Related to what I am pointing out here - https://groups.google.com/forum/#!topic/osv-dev/-7vIPwOJsyg - I thought about enabling some FPU exceptions which are disabled by default. Once I changed 0x037f to 0x037e which enables invalid operand exception and ran my test and got this exception:[registers] fps=0.0 q=-0.0 size=N/A time=00:00:14.00 bitrate=N/A dup=0 drop=273 speed= 28xRIP: 0x000000000044fb28 <???+4520744>RFL: 0x0000000000010206 CS: 0x0000000000000008 SS: 0x0000000000000010RAX: 0x0000000000000040 RBX: 0x00001000000faa22 RCX: 0x00002000001fd960 RDX: 0x00002000001fd618RSI: 0x000000000000000a RDI: 0x00002000001fd580 RBP: 0x00002000001fd5f0 R8: 0x0000000000000000R9: 0x0000000000000000 R10: 0x0000000000000000 R11: 0x00000000ffffffff R12: 0x0000000000000000R13: 0x0000000000000000 R14: 0x0000000000000000 R15: 0x00000000ffffffff RSP: 0x00002000001fd508DUMMY_HANDLER for math_fault aborting.
2. The code downstream to serialize JSON calls vfprintf which at line 165 tries to extract the single element from va_list. This probably ends up using FPU which rejects corrupted priority float value. Because normally invalid operand exception is disabled vfprintf happily continues and gets to that famous loop which probably keeps raising same exception which gets ignored until we reach the beginning of the unmapped page and boom !!!How do you like my theory?
I think this theory has its shortcomings - I think it explains open issues #536 and #1010 (and possibly to soon closed #490) but I think it fails to explain the newest one I created - https://github.com/cloudius-systems/osv/issues/1018 (there is no httpserver running). Maybe the latter one is due to some issues with missing initialization of FPU control word (but I doubt it).Waldek
On Thursday, November 29, 2018 at 6:19:45 PM UTC-5, Waldek Kozaczuk wrote:Unless I made a mistake in the code that verifies fpu state corruption which showed no corruption I think it might be worth to ponder possibility of different nature of the root cause of this and other issues where fmt_fp() shows up in the stack trace.What if the problem is actually caused by wrong/incompatible compiler options that drive what machine code is generated to handle floating point operation. For example I noticed that when I configured musl independently on my machine I saw it add following options related to floating point math:CFLAGS_C99FSE = -std=c99 -nostdinc -ffreestanding -fexcess-precision=standard -frounding-math -Wa,--noexecstack
I have not enough experience to make any conclusive observations but two of those -fexcess-precision and -frounding-math do affet code using float, double, long double etc. I did not see those in main OSv makefile so I wonder if musl compiled code is wrong and misbehaves.I think FPU has control word with RC fields that control rounding (http://www.website.masmforum.com/tutorials/fptute/fpuchap1.htm) what if our code setups FPU rounding that is incompatible with how we compile musl. Just wild theory.Adding musl math library documentation - https://wiki.musl-libc.org/mathematical-library.html- there are paragraphs that talk about rounding and precision.
On Thursday, November 29, 2018 at 2:33:26 PM UTC-5, Waldek Kozaczuk wrote:I wonder if per this - https://software.intel.com/en-us/node/523375 - we are enabling correct bits for xsave if caste is what being used here.
“The second operand is a save/restore mask specifying the saved/restored extended states. The value of the mask is ANDed with XFEATURE_ENABLED_MASK(XCR0). A particular extended state is saved/restored only if the corresponding bit of both save/restore mask and XFEATURE_ENABLED_MASK is set to '1'.”
--
Unless I made a mistake in the code that verifies fpu state corruption which showed no corruption I think it might be worth to ponder possibility of different nature of the root cause of this and other issues where fmt_fp() shows up in the stack trace.What if the problem is actually caused by wrong/incompatible compiler options that drive what machine code is generated to handle floating point operation. For example I noticed that when I configured musl independently on my machine I saw it add following options related to floating point math:CFLAGS_C99FSE = -std=c99 -nostdinc -ffreestanding -fexcess-precision=standard -frounding-math -Wa,--noexecstack
I have not enough experience to make any conclusive observations but two of those -fexcess-precision and -frounding-math do affet code using float, double, long double etc. I did not see those in main OSv makefile so I wonder if musl compiled code is wrong and misbehaves.
I think FPU has control word with RC fields that control rounding (http://www.website.masmforum.com/tutorials/fptute/fpuchap1.htm) what if our code setups FPU rounding that is incompatible with how we compile musl. Just wild theory.Adding musl math library documentation - https://wiki.musl-libc.org/mathematical-library.html- there are paragraphs that talk about rounding and precision.
I wonder if per this - https://software.intel.com/en-us/node/523375 - we are enabling correct bits for xsave if caste is what being used here.
“The second operand is a save/restore mask specifying the saved/restored extended states. The value of the mask is ANDed with XFEATURE_ENABLED_MASK(XCR0). A particular extended state is saved/restored only if the corresponding bit of both save/restore mask and XFEATURE_ENABLED_MASK is set to '1'.”
index 358a86c6..126ab5f4 100644--- a/modules/httpserver-api/api/os.cc+++ b/modules/httpserver-api/api/os.cc@@ -131,7 +131,7 @@ void init(routes& routes) thread.migrations = t.stat_migrations.get(); thread.preemptions = t.stat_preemptions.get(); thread.name = t.name();- thread.priority = t.priority();+ thread.priority = 1.0;//t.priority(); thread.stack_size = t.get_stack_info().size; thread.status = t.get_status(); threads.list.push(thread);Ediff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.cc--
You received this message because you are subscribed to a topic in the Google Groups "OSv Development" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/osv-dev/ZAjvi_vEhnA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to osv-dev+u...@googlegroups.com.
Implemented checksum logic (please let me know if I have made any stupid mistakes).diff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.ccindex bb96af3d..34bf6081 100644--- a/arch/x64/arch-cpu.cc+++ b/arch/x64/arch-cpu.cc@@ -42,6 +42,14 @@ exception_guard::~exception_guard()sched::cpu::current()->arch.exit_exception();}+static inline u64 calculate_fpu_state_checksum(processor::fpu_state *s) {+ u64 val = 0;+ char* data = s->legacy;+ for (u64 i = 0; i < sizeof(processor::fpu_state) - sizeof(u64); i++)
+ val += data[i];
+ return val;+}+extern "C"[[gnu::target("no-sse")]]void fpu_state_init_xsave(processor::fpu_state *s) {@@ -61,21 +69,25 @@ void fpu_state_init_fxsave(processor::fpu_state *s) {extern "C"void fpu_state_save_xsave(processor::fpu_state *s) {processor::xsave(s, -1);+ s->checksum = calculate_fpu_state_checksum(s);
Here we go again ... I was thinking more about it.First of regarding potential corruption of FPU state during involuntary interrupt task switch, unless my code that verifies it is not the case using the checksum logic (see my previous email with the patch), there could be another possibility - we are telling xsave/xstor to save/restore to/from wrong location in memory. Meaning that whatever got saved stays not corrupted but somehow we are restoring wrong data for whatever reason. Probably very unlikely.
Secondly I think that part of our theory, that this famous loop in vfprintf loops infinitely until tries to access unmapped page of memory, is very likely correct. But it might happening for different reason. So far we thought that simply the number is NaN or corrupt that makes FPU Instruction calculate NAN forever. But what if for whatever reason we are experiencing FPU stack overflow - all registers are taken and any instruction trying to load data into it keeps failing because it is full.
I believe this exception is disabled but I can enable it to see if I am correct and it happens somewhere in the code. Now what if after voluntary or involuntary thread switch the FPU stack is full and the thread switched to simply does not know it. Whose responsibility is in this case to make sure that FPU is in clean state (empty)?
Maybe the involuntary switch has/should have two sub-cases:1) the thread we are switching to before was also involuntarily switched and we need to restore its FPU state from interrupt stack which we do2) the thread we are switching to voluntarily yielded and ... I think FPU should be in somewhat clean state, at least be empty unless something left it full before, right?
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
2. If this was a voluntary context switch, the scheduler returns into whatever function which called it (sleep(), read(), etc.) and that returns to. The floating point state was never saved (it doesn't need to be, because a function was called) but, on return, it was never restored. It just gets a copy of whatever junk was there from the previous thread or OSv code that ran. The junk st0..st7 registers are fine - the caller cannot count on anything in there too. But the control registers might now have junk as well, and that OSv does have to restore, and that it does since commit 0959f20863b119c5ca7d1e663bc22e6329814386. However, we will never restore the floating point *status* word (FSW). So if one thread was interrupted with a full fp stack, another thread we switch to might be returned-to with a full stack.You can try adding an FNINIT in arch-switch.hh *before* processor::fldcw(fpucw) and see if it fixes anything.
On Thu, Nov 29, 2018 at 9:19 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:Implemented checksum logic (please let me know if I have made any stupid mistakes).diff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.ccindex bb96af3d..34bf6081 100644--- a/arch/x64/arch-cpu.cc+++ b/arch/x64/arch-cpu.cc@@ -42,6 +42,14 @@ exception_guard::~exception_guard()sched::cpu::current()->arch.exit_exception();}+static inline u64 calculate_fpu_state_checksum(processor::fpu_state *s) {+ u64 val = 0;+ char* data = s->legacy;+ for (u64 i = 0; i < sizeof(processor::fpu_state) - sizeof(u64); i++)What is this "- sizeof(u64)"?
+ val += data[i];Addition isn't the greatest hash function, but it is probably good enough to detect random corruption.+ return val;+}+extern "C"[[gnu::target("no-sse")]]void fpu_state_init_xsave(processor::fpu_state *s) {@@ -61,21 +69,25 @@ void fpu_state_init_fxsave(processor::fpu_state *s) {extern "C"void fpu_state_save_xsave(processor::fpu_state *s) {processor::xsave(s, -1);+ s->checksum = calculate_fpu_state_checksum(s);I hope you verified that indeed xsave or fxsave was used in your test setup, and not the old x87 FPU saving code?
diff --git a/arch/x64/arch-cpu.cc b/arch/x64/arch-cpu.ccindex bb96af3d..153aaa9a 100644
--- a/arch/x64/arch-cpu.cc+++ b/arch/x64/arch-cpu.cc@@ -42,6 +42,14 @@ exception_guard::~exception_guard()sched::cpu::current()->arch.exit_exception();}+static inline u64 calculate_fpu_state_checksum(processor::fpu_state *s) {+ u64 val = 0;+ char* data = s->legacy;
+ for (u64 i = 0; i < sizeof(processor::fpu_state) - 2 * sizeof(u64); i++)
+ val += data[i];
+ return val;+}+extern "C"[[gnu::target("no-sse")]]void fpu_state_init_xsave(processor::fpu_state *s) {
@@ -52,30 +60,43 @@ void fpu_state_init_xsave(processor::fpu_state *s) {// FPU state to the stack, so must not touch any of the FPU registers.// This is why the "gnu::target(no-sse)" specification above is critical.memset(s->xsavehdr, 0, sizeof(s->xsavehdr));+ s->saved = 0;}extern "C"void fpu_state_init_fxsave(processor::fpu_state *s) {+ s->saved = 0;
}extern "C"void fpu_state_save_xsave(processor::fpu_state *s) {processor::xsave(s, -1);+ s->checksum = calculate_fpu_state_checksum(s);
+ s->saved = 1;
}extern "C"void fpu_state_save_fxsave(processor::fpu_state *s) {processor::fxsave(s);+ s->checksum = calculate_fpu_state_checksum(s);
+ s->saved = 1;
}extern "C"void fpu_state_restore_xsave(processor::fpu_state *s) {
+ assert(s->saved == 1);
processor::xrstor(s, -1);+ assert(s->checksum == calculate_fpu_state_checksum(s));
+ s->saved = 0;+
}extern "C"void fpu_state_restore_fxsave(processor::fpu_state *s) {
+ assert(s->saved == 1);
processor::fxrstor(s);+ assert(s->checksum == calculate_fpu_state_checksum(s));
+ s->saved = 0;}extern "C"diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hhindex e3769540..7cd8d181 100644--- a/arch/x64/arch-switch.hh+++ b/arch/x64/arch-switch.hh@@ -103,6 +103,7 @@ void thread::switch_to()[rip]"i"(offsetof(thread_state, rip)): "rbx", "rdx", "rsi", "rdi", "r8", "r9",+ s->saved = 0;+
}extern "C"void fpu_state_restore_fxsave(processor::fpu_state *s) {
+ assert(s->saved == 1);
processor::fxrstor(s);+ assert(s->checksum == calculate_fpu_state_checksum(s));
+ s->saved = 0;
}extern "C"diff --git a/arch/x64/processor.hh b/arch/x64/processor.hh
index 6b1fbf66..e9d0c4f8 100644--- a/arch/x64/processor.hh+++ b/arch/x64/processor.hh@@ -339,10 +339,14 @@ inline u64 ticks()}struct fpu_state {char legacy[512];
char xsavehdr[24];char reserved[40];char ymm[256];+ u64 checksum;
+ u64 saved;
} __attribute__((packed));inline void fxsave(fpu_state* s)
On Sat, Dec 1, 2018 at 6:26 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:Here we go again ... I was thinking more about it.First of regarding potential corruption of FPU state during involuntary interrupt task switch, unless my code that verifies it is not the case using the checksum logic (see my previous email with the patch), there could be another possibility - we are telling xsave/xstor to save/restore to/from wrong location in memory. Meaning that whatever got saved stays not corrupted but somehow we are restoring wrong data for whatever reason. Probably very unlikely.Yes, I think it is unlikely - since you are checking the checksum, it means we are restoring valid saved state, but someone else's saved state. I don't have an idea how this can happen.If we rule out corruption of the saved FPU state, I can think of three remaining options:1. This is not a saved-FPU-state corruption - printf() really goes into an endless loop when it gets a certain input. I'm still hoping you can get more detailed printouts on what exactly what printf()'s *input* when it went into this endless loop. One thing you can do is to have a variable (perhaps thread-local global if you don't want it optimized out?) holding that printf_fp's last input, and when you get a crash, print this global. Or something.
2. This is a situation of some asynchronous even where we ran a handler - especially the scheduler which uses the FPU - but we *forgot* to save the FPU state. I don't know what this can be. This can perhaps be debugged with tracepoints - enable various tracepoints, especially the scheduler, and fpu saves/restores, and after the crash look at the tracepoint log. Can we find unexpected scheduler calls without an fpu save?3. This is a situation where we saved the FPU, but *after* doing some FPU work. We had and fix this kind of issue in 4349bfd3583df03d44cda480049e630081d3a20b but it's more surprising to see this kind of with x87 registers, not SSE, I can't think of a reason why the compiler will touch these registers without our intent.Secondly I think that part of our theory, that this famous loop in vfprintf loops infinitely until tries to access unmapped page of memory, is very likely correct. But it might happening for different reason. So far we thought that simply the number is NaN or corrupt that makes FPU Instruction calculate NAN forever. But what if for whatever reason we are experiencing FPU stack overflow - all registers are taken and any instruction trying to load data into it keeps failing because it is full.This is an interesting point. The eight x87 registers are freely addressable (st0 through st7), but can *also* be used as a stack. The current position is stored as three bits in the floating-point status word (FSW), and the fxsave (or newer xsave...) saves this as well on involuntary context switches.
diff --git a/libc/stdio/vfprintf.c b/libc/stdio/vfprintf.cindex aac790c0..5cd31b3e 100644--- a/libc/stdio/vfprintf.c+++ b/libc/stdio/vfprintf.c@@ -209,6 +209,9 @@ static char *fmt_u(uintmax_t x, char *s)
static int fmt_fp(FILE *f, long double y, int w, int p, int fl, int t)
{+ uint16_t fpu_status;+ asm volatile ("fnstsw %0" : "=m" (fpu_status));+uint32_t big[(LDBL_MAX_EXP+LDBL_MANT_DIG)/9+1];uint32_t *a, *d, *r, *z;int e2=0, e, i, j, l;@@ -654,8 +657,16 @@ static int printf_core(FILE *f, const char *fmt, va_list *ap, union arg *nl_arg,return 1;}+//The TLS version cannot be printed by GDB:+//(gdb) p vfprintf_fpu_status+// Cannot find thread-local storage for Thread 1, executable file build/release.x64/loader.elf:+// TLS not supported or disabled on this target+//static __thread uint16_t vfprintf_fpu_status = 0;+static uint16_t vfprintf_fpu_status = 0;int vfprintf(FILE *restrict f, const char *restrict fmt, va_list ap){+ asm volatile ("fnstsw %0" : "=m" (vfprintf_fpu_status));+va_list ap2;int nl_type[NL_ARGMAX+1] = {0};union arg nl_arg[NL_ARGMAX+1];
(gdb) bt#0 0x00000000003a8542 in processor::cli_hlt () at arch/x64/processor.hh:247#1 arch::halt_no_interrupts () at arch/x64/arch.hh:48#2 osv::halt () at arch/x64/power.cc:24#3 0x000000000023ef34 in abort (fmt=fmt@entry=0x63199b "Aborted\n") at runtime.cc:132#4 0x0000000000202765 in abort () at runtime.cc:98#5 0x0000000000346ce3 in mmu::vm_sigsegv (addr=<optimized out>, ef=0xffff800004b8f068) at core/mmu.cc:1316#6 0x0000000000347947 in mmu::vm_fault (addr=addr@entry=35184374185984, ef=ef@entry=0xffff800004b8f068) at core/mmu.cc:1330#7 0x00000000003a239c in page_fault (ef=0xffff800004b8f068) at arch/x64/mmu.cc:38#8 <signal handler called>#9 0x000000000044ff77 in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:303#10 0x0000000000000000 in ?? ()(gdb) frame 9#9 0x000000000044ff77 in fmt_fp (f=0x2000001fd710, y=0, w=0, p=6, fl=0, t=102) at libc/stdio/vfprintf.c:303303 *z = y;(gdb) p/t fpu_status$1 = 1100001(gdb) p/t vfprintf_fpu_status$2 = 11000001101001
The calling convention when calling a function (i.e., voluntary context switch) is that this stack is popped. The same is true on return (except st0 can be used for returning a floating-point value, if this is a function returning a floating point value). I don't know what will happen if a function violates this convention.I believe this exception is disabled but I can enable it to see if I am correct and it happens somewhere in the code. Now what if after voluntary or involuntary thread switch the FPU stack is full and the thread switched to simply does not know it. Whose responsibility is in this case to make sure that FPU is in clean state (empty)?When we switch *back* to an old thread, we should switch to its old FPU stack, including its old position in the x87 stack, so if it's full, it's this thread's fault.But on a *new* thread, we do need to make sure it got a clean slate. We had a bug in this long ago, which 0e8d9b5e0a1af02a31799fe1d40e434bc2c2a5fb fixed - thread_main_c() now calls processor::init_fpu(). This calls the FNINIT instruction, which among other things, resets the stack postion (in FSW) to 0.But now that I think about this, in case of an involuntary switch to the kernel (e.g., interrupt) what FPU state does the kernel code get? What if the kernel gets a full fpu stack and then can't do its own FPU calculations? Should fpu_lock() run fpu_init() after saving the FPU state?In any case, I don't think this has anything to do with the problem at hand. I'll open a separate issue.Maybe the involuntary switch has/should have two sub-cases:1) the thread we are switching to before was also involuntarily switched and we need to restore its FPU state from interrupt stack which we do2) the thread we are switching to voluntarily yielded and ... I think FPU should be in somewhat clean state, at least be empty unless something left it full before, right?I think you may have indeed found a problem (I'm not sure it's related to the bug we're chasing, but a problem nontheless, and *may* be related):We already have these two cases, implicitly: As I explained in a previous mail, when we context switch *back* to a thread, we get back to where it was when it gave up the CPU:1. If this was an involuntary context switch, we find ourselves returning (very belatedly) from the scheduler to the end of the interrupt() function, and now this function ends and pops the saved fpu (from the thread's interrupt stack), and finally returns from interrupt() to the original thread.2. If this was a voluntary context switch, the scheduler returns into whatever function which called it (sleep(), read(), etc.) and that returns to. The floating point state was never saved (it doesn't need to be, because a function was called) but, on return, it was never restored. It just gets a copy of whatever junk was there from the previous thread or OSv code that ran. The junk st0..st7 registers are fine - the caller cannot count on anything in there too. But the control registers might now have junk as well, and that OSv does have to restore, and that it does since commit 0959f20863b119c5ca7d1e663bc22e6329814386. However, we will never restore the floating point *status* word (FSW). So if one thread was interrupted with a full fp stack, another thread we switch to might be returned-to with a full stack.You can try adding an FNINIT in arch-switch.hh *before* processor::fldcw(fpucw) and see if it fixes anything.
diff --git a/arch/x64/arch-switch.hh b/arch/x64/arch-switch.hhindex e3769540..7cd8d181 100644--- a/arch/x64/arch-switch.hh+++ b/arch/x64/arch-switch.hh@@ -103,6 +103,7 @@ void thread::switch_to()[rip]"i"(offsetof(thread_state, rip)): "rbx", "rdx", "rsi", "rdi", "r8", "r9",
"r10", "r11", "r12", "r13", "r14", "r15", "memory");+ asm volatile("fninit");processor::fldcw(fpucw);processor::ldmxcsr(mxcsr);}
I think I have good news but I am adding my comments inlined.
And at least the issue described in #1010 went away based on around 30-40 executions which used to fail almost every time. So I think is very good news !!!
On Sun, Dec 2, 2018 at 11:26 PM Waldek Kozaczuk <jwkoz...@gmail.com> wrote:I think I have good news but I am adding my comments inlined.This is excellent news!I think, then, we'll need to add FNINIT or maybe just FEMMS (see my last comment in https://github.com/cloudius-systems/osv/issues/1019) to either the FPU fpu_lock code, or to the context switch code. I think just the former would be better, for several reasons:
1. It will protect the kernel's FPU code, not just the other threads, from starting with a filled stack.2. A workload which does a lot of voluntary context switching will not suffer any performance loss (interrupts are already slow). tests/misc-ctxsw should not show any performance degredation, I think. In general OSv already optimizes voluntary context switches more than involuntary ones (interrupts).3. It will be enough to protect the switched-to thread because the kernel will not leave anything in the stack (I think... because thread::switch_to() is a function...It will be nice to have a test for this. Perhaps, as I suggested earlier, one thread in a busy loop filling the fpu stack, or doing sse (xmm) work, or both (two test cases) and another thread in a busy loop doing normal floating point work, and see if one bothers the other (or the kernel).