I am running a Golang application on OSv under Firecracker. Occasionally (it's proving difficult to reproduce reliably) I see an assertion failure early in the OSv startup sequence. The console output for my release build is as follows:

OSv v0.55.0-168-g31a04239
1 CPUs detected
Firmware vendor: Unknown
bsd: initializing
Assertion failed: sched::preemptable() (arch/x64/mmu.cc: page_fault: 37)
[backtrace]
It appears to be hitting the assert well before my Go application is launched, so my impression is that this isn't being caused by anything my application is doing,
though I haven't yet been able to confirm whether the same behaviour occurs with a base image (no apps included).
As you can see, the backtrace is empty (perhaps that's the norm for a release build?), so there are no additional clues. If I can find a way to reproduce with a debug build and/or with a base image, I will update this thread.

In most cases, my OSv image starts up without issue, and in those cases the console output around the same area looks as follows:

OSv v0.55.0-168-g31a04239
1 CPUs detected
Firmware vendor: Unknown
bsd: initializing - done
VFS: mounting ramfs at /
VFS: mounting devfs at /dev
net: initializing - done

I can see there are numerous previous discussions around the "page fault when preemption is disabled" assertion, but these seem to be tied to application behaviour. Is an assert during "bsd: initializing" a known issue for which there is already a patch?
Otherwise, if there is additional info that I can capture to help isolate the cause, let me know.
--
You received this message because you are subscribed to the Google Groups "OSv Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/17c2a0f9-970f-4b8a-b2c4-18b3c678f027n%40googlegroups.com.
To add weight to the timing-issue theory, we are seeing the crash much more often (est. 1 in 20) when:
- The -verbose OSv startup argument is supplied
or
- We TCP-connect to the OSv instance whilst it is starting up
Is there somewhere I can raise a bug report for this?
The patch you supplied to obtain more detailed information isn't working as you intended, I think, since we're seeing 'exception nested too deeply' instead of the exception data. So I modified dump.cc:dump_registers() to set the 'name' to an empty string and commented out the call to osv::lookup_name_demangled().
The image was built with fs=rofs and I ran with default firecracker.py memory settings (128M).
I was then able to capture the register information at the time of the Assert:
OSv v0.55.0-237-g7cd33926
1 CPUs detected
Firmware vendor: Unknown
bsd: initializing
[registers]
RIP: 0x00000000403ebffe <>
RFL: 0x0000000000010246 CS: 0x0000000000000008 SS: 0x0000000000000010
RAX: 0x0000000000000001 RBX: 0xfffffffffffff900 RCX: 0x0000000000000020 RDX: 0x0000000000000000
RSI: 0xffff800000a31000 RDI: 0x0000000000000000 RBP: 0xffff80000015acd0 R8: 0x0000000000000007
R9: 0xffff80000000d518 R10: 0x0000000000000000 R11: 0xffff80000014a170 R12: 0xffff8000001cb040
R13: 0xffff80000000d518 R14: 0x0000000000000000 R15: 0xffff80000014a170 RSP: 0xffff80000015aca0
Assertion failed: sched::preemptable() (arch/x64/mmu.cc: page_fault: 42)
[backtrace]
[registers]
RIP: 0x00000000403e943a <>
RFL: 0x0000000000010046 CS: 0x0000000000000008 SS: 0x0000000000000010
RAX: 0x0000000000000000 RBX: 0xfffffffffffff900 RCX: 0x0000000000000000 RDX: 0x0000000000000001
RSI: 0x0000000000000007 RDI: 0x0000000000000001 RBP: 0xffff800000149120 R8: 0x0000000000000008
R9: 0x0000000000000001 R10: 0x0000000000000040 R11: 0x0000000000000002 R12: 0xffff8000001cb040
R13: 0x0000000000000001 R14: 0x0000000000000040 R15: 0x0000000000000002 RSP: 0xffff800000149100
I used the approach below to look up the offending function from the supplied RIP; please advise if the approach I've used is incorrect:
1. readelf -s build/release.x64/loader.elf > symbols.txt
2. Look up the RIP in symbols.txt to find the function this address falls within
Which gives me, for the first RIP (0x00000000403ebffe):
6828: 00000000403ebdd0 1054 FUNC GLOBAL DEFAULT 4 _ZN6memory9page_pool2l16r
And for the second RIP (0x00000000403e943a):
8594: 00000000403e93f0 203 FUNC GLOBAL DEFAULT 4 _ZN6memory9page_pool2l116
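For what it's worth, step 2 above can be automated. Below is a small sketch (a hypothetical helper, not part of OSv) that parses the `readelf -s` output and finds the FUNC symbol whose [value, value+size) range contains a given address; it assumes the plain decimal-size format readelf printed above, and the demo uses the first symbol line from this post.

```python
#!/usr/bin/env python3
# Hypothetical helper: resolve an address to the containing function using
# lines from `readelf -s build/release.x64/loader.elf > symbols.txt`.
import re

# Matches e.g. "  6828: 00000000403ebdd0  1054 FUNC GLOBAL DEFAULT    4 name"
SYM_RE = re.compile(
    r'^\s*\d+:\s+([0-9a-fA-F]+)\s+(\d+)\s+FUNC\s+\S+\s+\S+\s+\S+\s+(\S+)')

def resolve(addr, lines):
    """Return (name, offset) for the FUNC symbol whose address range
    contains addr, or None if no symbol matches."""
    for line in lines:
        m = SYM_RE.match(line)
        if not m:
            continue
        value, size, name = int(m.group(1), 16), int(m.group(2)), m.group(3)
        if value <= addr < value + size:
            return name, addr - value
    return None

if __name__ == '__main__':
    sample = [
        "  6828: 00000000403ebdd0  1054 FUNC GLOBAL DEFAULT    4 _ZN6memory9page_pool2l16r",
    ]
    name, off = resolve(0x403ebffe, sample)
    print('%s+0x%x' % (name, off))  # prints _ZN6memory9page_pool2l16r+0x22e
```

In practice you would pass the whole of symbols.txt to resolve(); note that readelf truncates long mangled names, so piping through c++filt on the full name is still needed for demangling.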
Responses below to some of the points from your previous reply.
Firecracker version differences are ruled out, I think (see my earlier comment where I mention that I was previously running FC v0.22.0 and saw the same issue). Now I'm running the very latest (v0.24.2).
Here's my gcc and go version information:
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
go version go1.16.2 linux/amd64
I agree with your comment. This assert occurs so early in the kernel startup sequence that it seems unlikely to me to be an OOM-related issue. In your tests, did you run the curltest shell script (as provided in my last post)? This, along with the verbose kernel arg, certainly seems to induce a failure more frequently, though it still happens without these.
Is there anything further that can be done to trace the cause?
void l1::refill()
{
    SCOPE_LOCK(preempt_lock);
    auto& pbuf = get_l1();
    if (pbuf.nr + page_batch::nr_pages < pbuf.max / 2) {
        auto* pb = global_l2.alloc_page_batch();
        if (pb) {
            // Other threads might have filled the array while we waited for
            // the page batch. Make sure there is enough room to add the pages
            // we just acquired, otherwise return them.
            if (pbuf.nr + page_batch::nr_pages <= pbuf.max) {
                for (auto& page : pb->pages) {
                    pbuf.push(page);
                }
            } else {
                global_l2.free_page_batch(pb);
            }
        }
    }
}
Thanks. I'm not sure what to look for to be able to verify that percpu_l1 is initialised correctly. I can see that it is a structure with various fields and function references, but I'm not clear how much of the field content should be initialised at the point of the failure. Regardless, I added some additional debug to at least show the percpu_l1 pointer content at the time of initialisation (so we can see this has been set up by the time the assert occurs).

Another possible clue might be that I have occasionally also seen a failure like this:

OSv v0.55.0-237-g7cd33926
1 CPUs detected
Firmware vendor: Unknown
trying to execute or access missing symbol
[backtrace]

Looking in the page_fault() function (where we made the change to call dump_registers()), I can see that the "trying to execute or access missing symbol" message is being output for another error condition. Perhaps this is a different issue, but I bring it to your attention just in case it is a clue to a single root cause.
while true; do ./scripts/firecracker.py >> /tmp/running; done
I could run it 71,700 times (over 71K) with no errors.
With verbose enabled:
while true; do ./scripts/firecracker.py -V >> /tmp/running2; done
I could run it 46,691 times (over 46K) with no errors.
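Since the failing run is rare, a loop that just appends all output to one file makes the failure easy to miss. Below is a sketch of a wrapper (hypothetical, not an OSv script) that re-runs a boot command until its output matches a failure pattern, then returns that run's output; the firecracker.py path and assertion text in the comment are assumptions to be adapted.

```python
#!/usr/bin/env python3
# Sketch of a stress wrapper: re-run a boot command until its combined
# stdout/stderr contains a failure pattern, then report which run failed.
import subprocess

def run_until_failure(cmd, pattern, max_runs=100000):
    """Run cmd repeatedly; return (run_number, output) of the first run
    whose output contains pattern, or None if all max_runs runs pass."""
    for i in range(1, max_runs + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        combined = result.stdout + result.stderr
        if pattern in combined:
            return i, combined
    return None

# Example invocation (assumed paths/strings, adapt to your setup):
#   hit = run_until_failure(['./scripts/firecracker.py', '-V'],
#                           'Assertion failed: sched::preemptable()')
#   if hit:
#       print('failed on run %d' % hit[0])
```

This way only the failing run's console output needs to be kept, which would also make it easy to record the failure rate with and without -V.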
Let me share some of my host-machine-specific info (a 5-6 year old MacBook Pro):
more /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 70
model name : Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
stepping : 1
microcode : 0x1c
cpu MHz : 876.525
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
vmx flags : vnmi preemption_timer invvpid ept_x_only ept_ad ept_1gb flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest ple shadow_vmcs
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit srbds
bogomips : 4589.68
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
./scripts/run.py -k -p qemu_microvm --nics=0 -c 1
I used the same recipe: 3 terminals running similar commands to start the C native example on a ROFS image. After I connected with gdb I saw this:
(gdb) osv info threads
id address name cpu status vruntime total_cpu_time location
1 (0xffff800000014040) reclaimer cpu0 waiting 6.51405e-24 1229 condvar::wait(lockfree::mutex*, sched::timer*) at core/condvar.cc:43
2 (0xffff80000004b040) kvm_wall_clock_ cpu0 waiting 4.76921e-24 8226 std::_Function_handler<void(), kvmclock::kvmclock()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/10/bits/std_function.h:291
3 (0xffff800000068040) page-access-sca cpu0 queued 1.92449e-23 33196 pagecache::access_scanner::run() at core/pagecache.cc:695
4 (0xffff800000080040) page_pool_l2 cpu0 queued 1.40362e-23 24193 memory::page_pool::l2::fill_thread() at core/mempool.cc:1435
5 (0xffff80000008d040) itimer-real cpu? unstarted 0 0 ?? at arch/x64/entry.S:112
6 (0xffff800000094040) itimer-virt cpu? unstarted 0 0 ?? at arch/x64/entry.S:112
7 (0xffff800000150040) balancer0 cpu0 queued 4.63251e-21 8154897 memory::page_pool::l1::alloc_page_local() at core/mempool.cc:1403
8 (0xffff800000168040) rcu0 cpu0 waiting 2.04856e-23 25420 osv::rcu::cpu_quiescent_state_thread::do_work() at core/rcu.cc:178
9 (0xffff800000180040) idle0 cpu0 queued inf 12481 std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/10/bits/std_function.h:291
10 (0xffff800000197040) >init cpu0 waiting 7.75482e-24 13338 std::_Function_handler<void(), sched::thread::reaper::reaper()::<lambda()> >::_M_invoke(const std::_Any_data &) at /usr/include/c++/10/bits/std_function.h:291
11 (0xffff8000001b3040) thread taskq cpu0 waiting 1.16969e-23 10158 synch_port::_msleep(void*, mtx*, int, char const*, int) at bsd/porting/synch.cc:101
12 (0xffff8000001bf040) callout cpu0 waiting 1.15418e-23 9891 _callout_thread() at bsd/porting/callout.cc:118
13 (0xffff8000001d9040) page_pool_l1_0 cpu0 running -2.4551e-21 709254 sched::thread::switch_to() at arch/x64/arch-switch.hh:108
Number of threads: 13
(gdb) bt
#0 abort (fmt=fmt@entry=0x4063f860 "Assertion failed: %s (%s: %s: %d)\n") at runtime.cc:106
#1 0x0000000040237f29 in __assert_fail (expr=<optimized out>, file=<optimized out>, line=<optimized out>, func=<optimized out>) at runtime.cc:140
#2 0x0000000040390970 in page_fault (ef=0xffff80000094a048) at arch/x64/mmu.cc:37
#3 <signal handler called>
#4 0x00000000403e6266 in memory::page_pool::l1::pop (this=0xffff8000001d7040) at core/mempool.cc:1191
#5 memory::page_pool::l1::alloc_page_local () at core/mempool.cc:1411
#6 0x00000000403e8cfa in memory::page_pool::l1::alloc_page () at core/mempool.cc:1174
#7 memory::untracked_alloc_page () at core/mempool.cc:1671
#8 memory::pool::add_page (this=0x4090c968 <memory::malloc_pools+72>) at core/mempool.cc:254
#9 0x00000000403e8e18 in memory::pool::alloc (this=0x4090c968 <memory::malloc_pools+72>) at core/mempool.cc:221
#10 0x00000000403e8f9c in std_malloc (size=8, alignment=8) at core/mempool.cc:1782
#11 0x00000000403e9457 in malloc (size=8) at core/mempool.cc:2001
#12 0x00000000404b0439 in operator new(unsigned long) ()
#13 0x0000000040354018 in __gnu_cxx::new_allocator<elf::object*>::allocate (__n=1, this=<optimized out>) at /usr/include/c++/10/ext/new_allocator.h:103
#14 std::allocator_traits<std::allocator<elf::object*> >::allocate (__a=..., __n=1) at /usr/include/c++/10/bits/alloc_traits.h:460
#15 std::_Vector_base<elf::object*, std::allocator<elf::object*> >::_M_allocate (__n=1, this=0xffff8000001dd2e0)
at /usr/include/c++/10/bits/stl_vector.h:346
#16 std::vector<elf::object*, std::allocator<elf::object*> >::reserve (__n=1, this=0xffff8000001dd2e0) at /usr/include/c++/10/bits/vector.tcc:78
#17 elf::program::modules_get (this=0xffffa0000009adf0) at core/elf.cc:1810
#18 0x00000000403548d0 in elf::program::with_modules<elf::program::lookup_addr(void const*)::<lambda(const elf::program::modules_list&)> > (f=...,
this=0xffffa0000009adf0) at include/osv/elf.hh:702
#19 elf::program::lookup_addr (this=0xffffa0000009adf0,
addr=addr@entry=0x40237f28 <__assert_fail(char const*, char const*, unsigned int, char const*)+24>) at core/elf.cc:1621
#20 0x00000000404273ec in osv::lookup_name_demangled (addr=0x40237f28 <__assert_fail(char const*, char const*, unsigned int, char const*)+24>,
buf=0xffff8000001dd7a0 "", len=1024) at core/demangle.cc:47
#21 0x0000000040237a1a in print_backtrace () at runtime.cc:86
#22 0x0000000040237edc in abort (fmt=fmt@entry=0x4063f860 "Assertion failed: %s (%s: %s: %d)\n") at runtime.cc:122
#23 0x0000000040237f29 in __assert_fail (expr=<optimized out>, file=<optimized out>, line=<optimized out>, func=<optimized out>) at runtime.cc:140
#24 0x0000000040390970 in page_fault (ef=0xffff8000001de068) at arch/x64/mmu.cc:37
#25 <signal handler called>
#26 0x00000000403e860e in memory::page_pool::l1::push (page=0xffff800000a27000, this=0xffff8000001d7040) at core/mempool.cc:1198
#27 memory::page_pool::l1::refill () at core/mempool.cc:1379
#28 0x00000000403e88f5 in memory::page_pool::l1::fill_thread () at core/mempool.cc:1356
#29 0x00000000403eb579 in memory::page_pool::l1::l1(sched::cpu*)::{lambda()#1}::operator()() const (__closure=<optimized out>) at core/mempool.cc:1164
#30 std::__invoke_impl<void, memory::page_pool::l1::l1(sched::cpu*)::{lambda()#1}&>(std::__invoke_other, memory::page_pool::l1::l1(sched::cpu*)::{lambda()#1}&) (__f=...) at /usr/include/c++/10/bits/invoke.h:60
#31 std::__invoke_r<void, memory::page_pool::l1::l1(sched::cpu*)::{lambda()#1}&>(memory::page_pool::l1::l1(sched::cpu*)::{lambda()#1}&) (__fn=...)
at /usr/include/c++/10/bits/invoke.h:153
#32 std::_Function_handler<void (), memory::page_pool::l1::l1(sched::cpu*)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...)
at /usr/include/c++/10/bits/std_function.h:291
#33 0x00000000403fbc0a in sched::thread::main (this=0xffff8000001d9040) at core/sched.cc:1267
#34 sched::thread_main_c (t=0xffff8000001d9040) at arch/x64/arch-switch.hh:325
#35 0x0000000040390733 in thread_main () at arch/x64/entry.S:113
(gdb) frame 26
#26 0x00000000403e860e in memory::page_pool::l1::push (page=0xffff800000a27000, this=0xffff8000001d7040) at core/mempool.cc:1198
1198 l1_pool_stats[cpu_id]._nr = nr;
(gdb) p cpu_id
$1 = 0
(gdb) p l1_pool_stats
$2 = std::vector of length 0, capacity 0
(gdb) p l1_pool_stats[0]
Cannot access memory at address 0x0
(gdb) p smp_allocator
$3 = true