NMI crash in memcpy() between memory areas allocated with mmu::map_anon()


Waldek Kozaczuk

Mar 25, 2020, 11:48:52 AM
to OSv Development
This is really related to the "OOM query" thread, but I wanted to send a new email as the other thread has gotten quite long.

In any case, we are troubleshooting an app crash which happens almost instantly after boot, and the stack trace of one of the threads looks like this:

(gdb) bt
#0  0x00000000403a7bea in processor::cli_hlt () at arch/x64/processor.hh:247
#1  nmi (ef=0xffff80003fa1c068) at arch/x64/exceptions.cc:306
#2  <signal handler called>
#3  0x00000000403940a3 in memcpy_repmov_ssse3 (dest=0x2000415014c0, src=0x20004e7851d4, n=16) at /usr/include/c++/9/array:185
#4  0x0000100001756a5b in ?? ()
#5  0x0000000000000000 in ?? ()

This is with the last 2 patches applied: "[PATCH V2 1/2] mempool: fix a bug in page_range_allocator() when handling worst case O(n) scenario" and "[PATCH V2 2/2] mempool: use map_anon() for large allocations or when memory is fragmented". They address fragmentation by making malloc_large() use mmu::map_anon() in certain cases.

So, as you can tell, memcpy() (or specifically memcpy_repmov_ssse3()) triggers an NMI (non-maskable interrupt) exception while copying between memory areas allocated with mmu::map_anon() (see dest=0x2000415014c0, src=0x20004e7851d4, n=16). I really have no idea why, but I have a hunch that the mapping tables are not being refreshed or flushed properly. Possibly the allocation is requested on one CPU and then memcpy() is called on another one which does not see the mapping yet. Or maybe the TLB needs to be flushed. From a cursory reading it looks like mmu::map_anon() might be doing that (somewhere downstream), but I am not 100% sure.

Or maybe this NMI is caused by a misaligned memory allocation (I had a question in my patch about whether it really handles alignment properly). Or maybe it is a bug in my patch? Or maybe there is some fundamental difference between memory allocated with map_anon() and memory allocated from contiguous physical memory.

Does anybody have other smart ideas?

Waldek

Waldek Kozaczuk

Mar 26, 2020, 3:50:40 PM
to OSv Development
This was actually caused by a bug in one of the older versions of the "mempool: use map_anon() for large allocations or when memory is fragmented" patch. It turns out I forgot that object_size() also needs to account for map_anon()-based allocations, and to do so properly ;-) My latest version of this patch (version 4) should work better, and I added a unit test for it. It still needs to be reviewed.
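
To illustrate the kind of bug described above, here is a minimal, self-contained sketch. It is a toy model and not OSv's mempool code: the namespace toy, the header struct, the side table, and the functions alloc_contiguous(), alloc_mapped(), and release() are all hypothetical names made up for this example, and plain malloc() stands in for both backing stores. The point it tries to show is only that once an allocator has two allocation paths, a size query has to recognize which path a pointer came from; a version that always reads an in-band header would return garbage for the second path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <map>

// Toy model only; every name below is hypothetical and none of it is OSv code.
namespace toy {

constexpr std::size_t page_size = 4096;

// In-band header placed in front of "contiguous" allocations.
struct alignas(std::max_align_t) header {
    std::size_t size;
};

// Side table for "mapped" allocations, which carry no in-band header.
// Maps the start address to the usable size in bytes.
inline std::map<std::uintptr_t, std::size_t>& mapped_regions() {
    static std::map<std::uintptr_t, std::size_t> regions;
    return regions;
}

// Path 1: the size is recorded in a header just before the returned pointer.
inline void* alloc_contiguous(std::size_t n) {
    auto* h = static_cast<header*>(std::malloc(sizeof(header) + n));
    if (!h) return nullptr;
    h->size = n;
    return h + 1;
}

// Path 2: stands in for a map_anon()-style allocation. The size is rounded
// up to whole pages and recorded out of band. Plain malloc() is used as the
// backing store here purely to keep the sketch portable.
inline void* alloc_mapped(std::size_t n) {
    std::size_t rounded = (n + page_size - 1) / page_size * page_size;
    void* p = std::malloc(rounded);
    if (!p) return nullptr;
    mapped_regions()[reinterpret_cast<std::uintptr_t>(p)] = rounded;
    return p;
}

// The size query must check the mapped path first. Reading a header in
// front of a mapped pointer would access memory that is not a header at all.
inline std::size_t object_size(void* p) {
    assert(p != nullptr);
    auto it = mapped_regions().find(reinterpret_cast<std::uintptr_t>(p));
    if (it != mapped_regions().end()) {
        return it->second;
    }
    return (static_cast<header*>(p) - 1)->size;
}

// Frees an allocation from either path and drops its side-table entry.
inline void release(void* p) {
    if (!p) return;
    auto it = mapped_regions().find(reinterpret_cast<std::uintptr_t>(p));
    if (it != mapped_regions().end()) {
        mapped_regions().erase(it);
        std::free(p);
    } else {
        std::free(static_cast<header*>(p) - 1);
    }
}

} // namespace toy
```

In this sketch, toy::object_size(toy::alloc_contiguous(100)) reports 100, while toy::object_size(toy::alloc_mapped(5000)) reports 8192 because the mapped path rounds up to two 4096-byte pages. Any caller that sizes a copy from such a query depends on both branches being right.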

Rick Payne

Mar 28, 2020, 6:32:36 PM
to osv...@googlegroups.com

With your latest 2 patches, our production box, which was having
problems, has run fine for the last 48 hours. Thanks for working so hard
on fixing it! It has been quite the pain point for us.

Are bugs 784 and 1077 something we should worry about?

Rick
