+some folks who I think might work on THP-related stuff
-fsdevel because this doesn't have much to do with filesystem stuff
=== Short summary ===
I believe the issue here is a race between /proc/*/smaps and
split_huge_page_to_list():
The codepath for /proc/*/smaps walks the pagetables and (e.g. in
smaps_account()) calls page_mapcount() not just on pages from normal
PTEs but also on migration entries (since commit b1d4d9e0cbd0a
"proc/smaps: carefully handle migration entries", from Linux v3.5).
page_mapcount() expects compound pages to be stable.
The split_huge_page_to_list() path first protects the compound page by
locking it and replacing all its PTEs with migration entries (since
the THP rewrite in v4.5, I think?), then does the actual splitting
using __split_huge_page().
So there's a mismatch of expectations here:
The smaps code expects that migration entries point to stable compound
pages, while the THP code expects that it's okay to split a compound
page while it has migration entries.
I'm not sure what the best way to fix it is. I guess the smaps code is
at fault here, and the following options might be ways to fix it:
1. skip migration entries completely in smaps?
2. let smaps assume that the mapcount is 1 for all migration entries?
3. try to be fully accurate by waiting for the migration entry to
   change back to normal? (probably a bad idea, way too messy...)
Note that the mapcount of a page that is being migrated doesn't really
represent how many MMs are using the page anyway, so we wouldn't
really be making the current situation much worse by fudging the
mapcount to 1 in the smaps code...
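To make option 2 a bit more concrete, here is a rough userspace sketch
of the accounting decision. This is not kernel code: struct acct_page,
struct acct_stats and acct_one() are made-up names for illustration, and
the "migration" flag is assumed to be passed down by whoever decoded the
PTE. The only point is that the mapcount is never read when the entry is
a migration entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the field this sketch needs. */
struct acct_page {
    int mapcount;   /* what page_mapcount() would return if it were safe to call */
};

/* Hypothetical stand-in for the per-mapping smaps counters. */
struct acct_stats {
    unsigned long private_bytes;
    unsigned long shared_bytes;
};

/*
 * Account one PTE-sized piece of a mapping. If the PTE is a migration
 * entry, the page may be in the middle of a split, so the mapcount is not
 * read at all and the page is treated as if it were mapped exactly once.
 */
void acct_one(struct acct_stats *st, const struct acct_page *page,
              unsigned long size, bool migration)
{
    int mapcount;

    assert(st != NULL && page != NULL);

    mapcount = migration ? 1 : page->mapcount;
    if (mapcount >= 2)
        st->shared_bytes += size;
    else
        st->private_bytes += size;
}
```

With this policy, a page that really is mapped by several MMs shows up
as private while it is under migration, which is the fudging described
above.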
By the way, I think the /proc/*/pagemap code (pte_to_pagemap_entry())
has the same issue?
I have asked syzkaller to test what happens if smaps_account() is
hacked up to avoid page_mapcount() for migration entries, and that
seems to fix the crash:
https://groups.google.com/g/syzkaller-bugs/c/9AZZCz4OtvE/m/WZDYXEKKAgAJ
Original syzkaller report with some more details below:
The "syz repro" link shows that this appears to be some kind of race
involving mmap(), madvise(..., MADV_FREE), and
/proc/$pid/smaps_rollup.
Note the '"threaded":true,"collide":true,"repeat":true' bit, which
indicates that syzbot is repeatedly trying to hit this race by
executing operations on multiple threads simultaneously.
The MADV_FREE path involves split_huge_page(page); and in the linked
console output we can see the page state:
[ 294.055186][T10694] page:ffffea0001328000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x20000 pfn:0x4ca00
[ 294.069056][T10694] memcg:ffff88814011c000
[ 294.073365][T10694] anon flags: 0xfff0000008001d(locked|uptodate|dirty|lru|swapbacked|node=0|zone=1|lastcpupid=0x7ff)
[ 294.084590][T10694] raw: 00fff0000008001d ffffea00013e0508 ffffea0001328048 ffff888045983e01
[ 294.093279][T10694] raw: 0000000000020000 0000000000000000 00000001ffffffff ffff88814011c000
[ 294.102469][T10694] page dumped because: VM_BUG_ON_PAGE(!PageHead(page))
[ 294.109422][T10694] page_owner tracks the page as allocated
[ 294.115444][T10694] page last allocated via order 0, migratetype Movable, gfp_mask 0x3d20ca(GFP_TRANSHUGE_LIGHT|__GFP_NORETRY|__GFP_THISNODE), pid 10690, ts 293625395715, free_ts 293374863045
[ 294.133087][T10694] get_page_from_freelist+0x779/0xa20
[ 294.138623][T10694] __alloc_pages+0x26c/0x5f0
[ 294.143337][T10694] alloc_pages_vma+0x9c7/0xe70
[ 294.148655][T10694] do_huge_pmd_anonymous_page+0x5b9/0xce0
[ 294.155039][T10694] handle_mm_fault+0x207e/0x2620
[ 294.160759][T10694] do_user_addr_fault+0x8ce/0x1120
[ 294.165961][T10694] exc_page_fault+0xa1/0x1e0
[ 294.171144][T10694] asm_exc_page_fault+0x1e/0x30
The message "page last allocated via order 0" is kinda misleading; as
you can see from the stacktrace and the GFP_TRANSHUGE_LIGHT flag, this
page was actually allocated as part of a hugepage, but was then later
split: __split_page_owner() fixes up the order in the page_owner
metadata, and is called from __split_huge_page() via
split_page_owner().
[...]
> ------------[ cut here ]------------
> kernel BUG at include/linux/page-flags.h:686!
> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 10694 Comm: syz-executor.0 Not tainted 5.13.0-rc3-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:PageDoubleMap include/linux/page-flags.h:686 [inline]
> RIP: 0010:__page_mapcount+0x2b3/0x2d0 mm/util.c:728
> Code: e8 72 25 cf ff 4c 89 ff 48 c7 c6 40 fb 39 8a e8 03 4c 04 00 0f 0b e8 5c 25 cf ff 4c 89 ff 48 c7 c6 40 fc 39 8a e8 ed 4b 04 00 <0f> 0b e8 46 25 cf ff 4c 89 ff 48 c7 c6 80 fc 39 8a e8 d7 4b 04 00
> RSP: 0018:ffffc90001ff7460 EFLAGS: 00010246
> RAX: e8070b6faabf8b00 RBX: 00fff0000008001d RCX: ffff888047280000
> RDX: 0000000000000000 RSI: 000000000000ffff RDI: 000000000000ffff
> RBP: 0000000000000000 R08: ffffffff81ce2584 R09: ffffed1017363f24
> R10: ffffed1017363f24 R11: 0000000000000000 R12: 1ffffd4000265001
> R13: 00000000ffffffff R14: dffffc0000000000 R15: ffffea0001328000
> FS: 00007f6e83636700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000568000 CR3: 000000002b559000 CR4: 00000000001506e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> page_mapcount include/linux/mm.h:873 [inline]
This is page_mapcount():
870 static inline int page_mapcount(struct page *page)
871 {
872         if (unlikely(PageCompound(page)))
873                 return __page_mapcount(page);   <===========================
874         return atomic_read(&page->_mapcount) + 1;
875 }
So the page that page_mapcount() was called on is/was marked as
PageCompound() at that time, but then, in __page_mapcount(), we crash:
714 /* Slow path of page_mapcount() for compound pages */
715 int __page_mapcount(struct page *page)
716 {
717         int ret;
718
719         ret = atomic_read(&page->_mapcount) + 1;
720         /*
721          * For file THP page->_mapcount contains total number of mapping
722          * of the page: no need to look into compound_mapcount.
723          */
724         if (!PageAnon(page) && !PageHuge(page))
725                 return ret;
726         page = compound_head(page);
727         ret += atomic_read(compound_mapcount_ptr(page)) + 1;
728         if (PageDoubleMap(page))   <===========================
729                 ret--;
730         return ret;
731 }
which means the supposed compound head page is not a compound page (anymore).
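As a rough illustration of that check-then-use window, here is a
single-threaded userspace model (not kernel code; struct thp_model_page,
thp_model_split() and thp_model_page_mapcount() are made-up names). The
"between" callback stands in for whatever another CPU does after the
compound check and before the head-only flag is read. Where the kernel
would hit VM_BUG_ON_PAGE(!PageHead(page)), the model returns -1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a THP head page. */
struct thp_model_page {
    bool head;              /* models PageHead() */
    int mapcount;           /* models _mapcount + 1 */
    int compound_mapcount;  /* models compound_mapcount + 1 */
    bool double_map;        /* models PageDoubleMap() */
};

/* Models the effect of a split on the head page's compound state. */
void thp_model_split(struct thp_model_page *page)
{
    page->head = false;
    page->compound_mapcount = 0;
    page->double_map = false;
}

/*
 * Models page_mapcount() being called on a head page. `between` (may be
 * NULL) runs after the compound check and before the head-only flag is
 * read, to simulate a concurrent split. Returns -1 where the kernel
 * would BUG.
 */
int thp_model_page_mapcount(struct thp_model_page *page,
                            void (*between)(struct thp_model_page *))
{
    int ret;

    assert(page != NULL);

    if (!page->head)            /* not compound: fast path */
        return page->mapcount;

    ret = page->mapcount;       /* slow path starts here */
    if (between)
        between(page);          /* the "other CPU" runs now */
    ret += page->compound_mapcount;
    if (!page->head)
        return -1;              /* PageDoubleMap() on a non-head page */
    if (page->double_map)
        ret--;
    return ret;
}
```

Passing NULL for "between" gives the undisturbed result; passing
thp_model_split reproduces the ordering that the crash suggests.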
> smaps_account+0x79d/0x980 fs/proc/task_mmu.c:467
> smaps_pte_entry fs/proc/task_mmu.c:533 [inline]
> smaps_pte_range+0x6ed/0xfc0 fs/proc/task_mmu.c:596
In here we're holding the PTE lock, so either the page of interest has
a non-zero (non-compound) mapcount, *OR* there is a migration PTE for
the page.
split_huge_page_to_list() takes a locked page and unmaps it from all
pagetables with unmap_page(), but because TTU_SPLIT_FREEZE is set,
try_to_unmap_one() will not simply clear the PTE but instead replace
it with a migration entry, so the /proc/*/smaps code can still observe
the page after that point.
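The difference between a plain unmap and a "freeze" unmap can be
modelled in userspace like this (again not kernel code; all model_*
names are invented for illustration, and the split_freeze parameter
stands in for TTU_SPLIT_FREEZE):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_pte_kind {
    MODEL_PTE_NONE,
    MODEL_PTE_PRESENT,
    MODEL_PTE_MIGRATION,
};

/* Hypothetical stand-in for one page table entry. */
struct model_pte {
    enum model_pte_kind kind;
    const void *page;   /* page the entry refers to, NULL if none */
};

/* Models unmapping a single PTE, with or without the freeze behaviour. */
void model_unmap_one(struct model_pte *pte, bool split_freeze)
{
    assert(pte != NULL);

    if (pte->kind != MODEL_PTE_PRESENT)
        return;

    if (split_freeze) {
        /* The entry changes type but keeps pointing at the page. */
        pte->kind = MODEL_PTE_MIGRATION;
    } else {
        pte->kind = MODEL_PTE_NONE;
        pte->page = NULL;
    }
}

/* Models which page a pagetable walker can still reach through the PTE. */
const void *model_walker_page(const struct model_pte *pte)
{
    assert(pte != NULL);
    return pte->kind == MODEL_PTE_NONE ? NULL : pte->page;
}
```

After a freeze unmap, model_walker_page() still returns the page, which
is the situation the smaps walker ends up in.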