This is really related to the "OOM query" thread, but I wanted to start a new email as the other thread has gotten quite long.
In any case, we are troubleshooting an app crash which happens almost instantly after boot, and one of the thread stack traces looks like this:
(gdb) bt
#0 0x00000000403a7bea in processor::cli_hlt () at
arch/x64/processor.hh:247
#1 nmi (ef=0xffff80003fa1c068) at arch/x64/exceptions.cc:306
#2 <signal handler called>
#3 0x00000000403940a3 in memcpy_repmov_ssse3 (dest=0x2000415014c0,
src=0x20004e7851d4, n=16) at /usr/include/c++/9/array:185
#4 0x0000100001756a5b in ?? ()
#5 0x0000000000000000 in ?? ()
Also, this is with the last 2 patches - "[PATCH V2 1/2] mempool: fix a bug in page_range_allocator() when handling worst case O(n) scenario" and "[PATCH V2 2/2] mempool: use map_anon() for large allocations or when memory is fragmented" - applied to address fragmentation; they make malloc_large() use mmu::map_anon() in certain cases.
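To illustrate what I mean, the second patch roughly does something like the sketch below (simplified from memory, not the actual patch; the threshold name and the fragmentation check are made-up placeholders, and I may have the map_anon() flags slightly wrong):

// Rough sketch of the idea behind patch 2/2 - a fragment of the mempool
// code, not the real patch. large_alloc_use_map_anon_threshold and
// is_memory_fragmented() are placeholders for illustration only.
static void* malloc_large(size_t size, size_t alignment)
{
    if (size >= large_alloc_use_map_anon_threshold || is_memory_fragmented(size)) {
        // Back the allocation with virtually-contiguous memory mapped
        // from (possibly scattered) physical pages.
        return mmu::map_anon(nullptr, size, mmu::mmap_populate, mmu::perm_rw);
    }
    // Old path: contiguous physical memory from the page range allocator.
    return alloc_contiguous_from_page_ranges(size, alignment);
}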
So as you can tell, memcpy (or specifically memcpy_repmov_ssse3()) triggers an NMI (non-maskable interrupt) exception while copying between memory areas allocated with mmu::map_anon() (see dest=0x2000415014c0, src=0x20004e7851d4, n=16). I really have no idea why. But I have a hunch that it possibly happens because the mapping tables are not being refreshed/flushed properly. Possibly the allocation is requested on one CPU and then memcpy() is called on another one which does not see the mapping yet. Or maybe the TLB needs to be flushed. From a cursory reading it looks like mmu::map_anon() might be doing it (somewhere downstream), but I am not 100% sure.
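One way I am thinking of testing that theory: right after the map_anon() call, touch the whole range on the allocating CPU before handing the pointer out. If the mapping/TLB theory is right, the fault should move there instead of showing up later in memcpy() on another CPU. A rough diagnostic fragment (not a fix, and assuming the same map_anon() call as in the sketch above):

// Diagnostic only: fault/resolve every page of the new mapping on the
// CPU that created it, so a missing or stale mapping shows up here.
void* addr = mmu::map_anon(nullptr, size, mmu::mmap_populate, mmu::perm_rw);
volatile char* p = static_cast<volatile char*>(addr);
for (size_t off = 0; off < size; off += mmu::page_size) {
    p[off] = 0;  // write one byte per page
}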
Or maybe this NMI is caused by a misaligned memory allocation (I had a question in my patch about whether it really addresses alignment properly). Or maybe a bug in my patch? Or maybe there is something fundamentally different in the way memory is allocated with map_anon() vs. allocations using contiguous physical memory.
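For the alignment theory, the cheapest check I can think of is asserting on the pointer the large-allocation path is about to return when it comes from map_anon() (again just a diagnostic fragment; "alignment" here is whatever the caller requested):

// map_anon() should give back page-aligned memory, so anything up to
// page_size alignment ought to be fine; larger requested alignments
// would need extra handling.
assert(reinterpret_cast<uintptr_t>(addr) % alignment == 0);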
Does anybody have other smart ideas?
Waldek