[syzbot] [mm?] KCSAN: data-race in __anon_vma_prepare / __vmf_anon_prepare


syzbot

11:32 AM
to Liam.H...@oracle.com, ak...@linux-foundation.org, da...@kernel.org, harr...@oracle.com, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, lorenzo...@oracle.com, ri...@surriel.com, syzkall...@googlegroups.com, vba...@suse.cz
Hello,

syzbot found the following issue on:

HEAD commit: cfd4039213e7 Merge tag 'io_uring-6.19-20251208' of git://g..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=1554d992580000
kernel config: https://syzkaller.appspot.com/x/.config?x=c3201432211be40f
dashboard link: https://syzkaller.appspot.com/bug?extid=f5d897f5194d92aa1769
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/9f556ae6e3c4/disk-cfd40392.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/efcf53c1d459/vmlinux-cfd40392.xz
kernel image: https://storage.googleapis.com/syzbot-assets/858f42961336/bzImage-cfd40392.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+f5d897...@syzkaller.appspotmail.com

==================================================================
BUG: KCSAN: data-race in __anon_vma_prepare / __vmf_anon_prepare

write to 0xffff88811c751e80 of 8 bytes by task 13471 on cpu 1:
__anon_vma_prepare+0x172/0x2f0 mm/rmap.c:212
__vmf_anon_prepare+0x91/0x100 mm/memory.c:3673
hugetlb_no_page+0x1c4/0x10d0 mm/hugetlb.c:5782
hugetlb_fault+0x4cf/0xce0 mm/hugetlb.c:-1
handle_mm_fault+0x1894/0x2c60 mm/memory.c:6578
do_user_addr_fault+0x3fe/0x1080 arch/x86/mm/fault.c:1387
handle_page_fault arch/x86/mm/fault.c:1476 [inline]
exc_page_fault+0x62/0xa0 arch/x86/mm/fault.c:1532
asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:618
fault_in_readable+0xad/0x170 mm/gup.c:-1
fault_in_iov_iter_readable+0x129/0x210 lib/iov_iter.c:106
generic_perform_write+0x3cf/0x490 mm/filemap.c:4363
shmem_file_write_iter+0xc5/0xf0 mm/shmem.c:3490
new_sync_write fs/read_write.c:593 [inline]
vfs_write+0x52a/0x960 fs/read_write.c:686
ksys_pwrite64 fs/read_write.c:793 [inline]
__do_sys_pwrite64 fs/read_write.c:801 [inline]
__se_sys_pwrite64 fs/read_write.c:798 [inline]
__x64_sys_pwrite64+0xfd/0x150 fs/read_write.c:798
x64_sys_call+0x9f7/0x3000 arch/x86/include/generated/asm/syscalls_64.h:19
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xd8/0x2a0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff88811c751e80 of 8 bytes by task 13473 on cpu 0:
__vmf_anon_prepare+0x26/0x100 mm/memory.c:3667
hugetlb_no_page+0x1c4/0x10d0 mm/hugetlb.c:5782
hugetlb_fault+0x4cf/0xce0 mm/hugetlb.c:-1
handle_mm_fault+0x1894/0x2c60 mm/memory.c:6578
faultin_page mm/gup.c:1126 [inline]
__get_user_pages+0x1024/0x1ed0 mm/gup.c:1428
populate_vma_page_range mm/gup.c:1860 [inline]
__mm_populate+0x243/0x3a0 mm/gup.c:1963
mm_populate include/linux/mm.h:3701 [inline]
vm_mmap_pgoff+0x232/0x2e0 mm/util.c:586
ksys_mmap_pgoff+0x268/0x310 mm/mmap.c:604
x64_sys_call+0x16bb/0x3000 arch/x86/include/generated/asm/syscalls_64.h:10
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xd8/0x2a0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0x0000000000000000 -> 0xffff888104ecca28

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 13473 Comm: syz.2.3219 Tainted: G W syzkaller #0 PREEMPT(voluntary)
Tainted: [W]=WARN
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

Dmitry Vyukov

11:43 AM
to syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, da...@kernel.org, harr...@oracle.com, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, lorenzo...@oracle.com, ri...@surriel.com, syzkall...@googlegroups.com, vba...@suse.cz
Hi Harry,

I see you've been debugging:
KASAN: slab-use-after-free Read in folio_remove_rmap_ptes
https://lore.kernel.org/all/694e3dc6.050a022...@google.com/T/

Can that bug be caused by this data race?
Below is an explanation by Gemini LLM as to why this race is harmful.
Obviously take it with a grain of salt, but with my limited mm
knowledge it does not look immediately wrong (re rmap invariant).

However, now digging into the details, I see that this patch by Lorenzo
is also marked as fixing "KASAN: slab-use-after-free Read in
folio_remove_rmap_ptes":

mm/vma: fix anon_vma UAF on mremap() faulted, unfaulted merge
https://lore.kernel.org/all/b7930ad2b1503a657e29fe928eb33061d7e...@oracle.com/T/

So perhaps the race is still benign (or points to another issue?)

Here is what LLM said about the race:
-----

The bug report is actionable and points to a harmful data race in the Linux
kernel's memory management subsystem, specifically in the handling of
anonymous `hugetlb` mappings.

**Analysis:**

1. **Race Location:** The data race occurs on the `vma->anon_vma` field
of a `struct vm_area_struct`.
* **Writer:** Task 13471 executes `__anon_vma_prepare` in `mm/rmap.c`.
This function initializes the `anon_vma` for a VMA. It holds
`mm->page_table_lock` and writes to `vma->anon_vma` (line 211 in the
viewed source, corresponding to the report's `mm/rmap.c:212` area).
* **Reader:** Task 13473 executes `__vmf_anon_prepare` in `mm/memory.c`.
This function is an optimization wrapper that checks if
`vma->anon_vma` is already set (line 3666/3667) to avoid the overhead
of `__anon_vma_prepare`. This check is performed **without** holding
`mm->page_table_lock`.

2. **Consistency:** The report is consistent. Both tasks are handling
`hugetlb` page faults (`hugetlb_no_page` -> `__vmf_anon_prepare`).
The writer stack shows it proceeded into `__anon_vma_prepare` (implying
`vma->anon_vma` was NULL initially), while the reader stack shows it
reading `vma->anon_vma`. The value change `0x0000000000000000 ->
0xffff888104ecca28` confirms initialization from NULL to a pointer.

3. **Harmfulness (Why it is not benign):**
* In `__anon_vma_prepare`, the code currently initializes
`vma->anon_vma` **before** linking the VMA to the `anon_vma`
structure via `anon_vma_chain_link`.
 * ```c
   vma->anon_vma = anon_vma;
   anon_vma_chain_link(vma, avc, anon_vma);
   ```
* Because the reader (`__vmf_anon_prepare`) checks `vma->anon_vma`
locklessly, it can see the non-NULL value before `anon_vma_chain_link`
has completed (due to compiler/CPU reordering or simple preemption
between the two statements).
* If the reader proceeds, it assumes the `anon_vma` is fully ready.
It then maps a page and sets `folio->mapping = anon_vma`.
* However, if `anon_vma_chain_link` hasn't finished, the `anon_vma`
(specifically its interval tree) does not yet contain the entry for
this `vma`.
* This breaks the reverse mapping (rmap) invariant. If the kernel
subsequently tries to unmap or migrate this page (finding it via
`folio->mapping`), `rmap_walk` will fail to find the VMA in the
`anon_vma`'s interval tree. This can lead to pages being effectively
pinned, migration failures, or in worst-case scenarios (like memory
corruption handling or specific reclaim paths), logical errors where
a page is assumed unmapped when it is not.

4. **Fix:** The fix requires enforcing ordering. `vma->anon_vma` should
be set **after** `anon_vma_chain_link` is complete, and `smp_store_release`
/ `smp_load_acquire` (or equivalent barriers) should be used to ensure the
reader observes the fully initialized state.

Jann Horn

12:00 PM
to Dmitry Vyukov, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, da...@kernel.org, harr...@oracle.com, linux-...@vger.kernel.org, linu...@kvack.org, lorenzo...@oracle.com, ri...@surriel.com, syzkall...@googlegroups.com, vba...@suse.cz
On Wed, Jan 14, 2026 at 5:43 PM Dmitry Vyukov <dvy...@google.com> wrote:
> On Wed, 14 Jan 2026 at 17:32, syzbot
> <syzbot+f5d897...@syzkaller.appspotmail.com> wrote:
> > ==================================================================
> > BUG: KCSAN: data-race in __anon_vma_prepare / __vmf_anon_prepare
> >
> > write to 0xffff88811c751e80 of 8 bytes by task 13471 on cpu 1:
> > __anon_vma_prepare+0x172/0x2f0 mm/rmap.c:212
> > __vmf_anon_prepare+0x91/0x100 mm/memory.c:3673
> > hugetlb_no_page+0x1c4/0x10d0 mm/hugetlb.c:5782
> > hugetlb_fault+0x4cf/0xce0 mm/hugetlb.c:-1
> > handle_mm_fault+0x1894/0x2c60 mm/memory.c:6578
[...]
> > read to 0xffff88811c751e80 of 8 bytes by task 13473 on cpu 0:
> > __vmf_anon_prepare+0x26/0x100 mm/memory.c:3667
> > hugetlb_no_page+0x1c4/0x10d0 mm/hugetlb.c:5782
> > hugetlb_fault+0x4cf/0xce0 mm/hugetlb.c:-1
> > handle_mm_fault+0x1894/0x2c60 mm/memory.c:6578
[...]
This data race is not specific to hugetlb at all, and it isn't caused
by any recent changes. It's a longstanding thing in core MM, but it's
pretty benign as far as I know.

Fundamentally, the field vma->anon_vma can be read while only holding
the mmap lock in read mode; and it can concurrently be changed from
NULL to non-NULL.

One scenario to cause such a data race is to create a new anonymous
VMA, then trigger two concurrent page faults inside this VMA. Assume a
configuration with VMA locking disabled for simplicity, so that both
faults happen under the mmap lock in read mode. This will lead to two
concurrent calls to __vmf_anon_prepare()
(https://elixir.bootlin.com/linux/v6.18.5/source/mm/memory.c#L3623),
both threads only holding the mmap_lock in read mode.
__vmf_anon_prepare() is essentially this (from
https://elixir.bootlin.com/linux/v6.18.5/source/mm/memory.c#L3623,
with VMA locking code removed):

vm_fault_t __vmf_anon_prepare(struct vm_fault *vmf)
{
        struct vm_area_struct *vma = vmf->vma;
        vm_fault_t ret = 0;

        if (likely(vma->anon_vma))
                return 0;
        [...]
        if (__anon_vma_prepare(vma))
                ret = VM_FAULT_OOM;
        [...]
        return ret;
}

int __anon_vma_prepare(struct vm_area_struct *vma)
{
        struct mm_struct *mm = vma->vm_mm;
        struct anon_vma *anon_vma, *allocated;
        struct anon_vma_chain *avc;

        [...]

        [... allocate stuff ...]

        anon_vma_lock_write(anon_vma);
        /* page_table_lock to protect against threads */
        spin_lock(&mm->page_table_lock);
        if (likely(!vma->anon_vma)) {
                vma->anon_vma = anon_vma;
                [...]
        }
        spin_unlock(&mm->page_table_lock);
        anon_vma_unlock_write(anon_vma);

        [... cleanup ...]

        return 0;

        [... error handling ...]
}

So if one thread reaches the "vma->anon_vma = anon_vma" assignment
while the other thread is running the "if (likely(vma->anon_vma))"
check, you get a (AFAIK benign) data race.

Dmitry Vyukov

12:06 PM
to Jann Horn, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, da...@kernel.org, harr...@oracle.com, linux-...@vger.kernel.org, linu...@kvack.org, lorenzo...@oracle.com, ri...@surriel.com, syzkall...@googlegroups.com, vba...@suse.cz
Thanks for checking, Jann.

To double-check:

"vma->anon_vma = anon_vma" is done without a store-release, so the lockless
readers can't safely read the anon_vma's contents. Is it correct that none
of them actually dereference the anon_vma?

Also, anon_vma_chain_link and num_active_vmas++ indeed happen after
assignment to anon_vma:

/* page_table_lock to protect against threads */
spin_lock(&mm->page_table_lock);
if (likely(!vma->anon_vma)) {
        vma->anon_vma = anon_vma;
        anon_vma_chain_link(vma, avc, anon_vma);
        anon_vma->num_active_vmas++;
        allocated = NULL;
        avc = NULL;
}
spin_unlock(&mm->page_table_lock);

So the lockless readers that observe anon_vma!=NULL won't rely on
these invariants, right?

Jann Horn

12:30 PM
to Dmitry Vyukov, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, da...@kernel.org, harr...@oracle.com, linux-...@vger.kernel.org, linu...@kvack.org, lorenzo...@oracle.com, ri...@surriel.com, syzkall...@googlegroups.com, vba...@suse.cz
I think you are right that this should be using store-release;
searching around, I also mentioned this in
<https://lore.kernel.org/all/CAG48ez0qsAM-dkOUDetmNBSK...@mail.gmail.com/>:

| > +Note that there are some exceptions to this - the `anon_vma`
field is permitted
| > +to be written to under mmap read lock and is instead serialised
by the `struct
| > +mm_struct` field `page_table_lock`. In addition the `vm_mm` and all
|
| Hm, we really ought to add some smp_store_release() and READ_ONCE(),
| or something along those lines, around our ->anon_vma accesses...
| especially the "vma->anon_vma = anon_vma" assignment in
| __anon_vma_prepare() looks to me like, on architectures like arm64
| with write-write reordering, we could theoretically end up making a
| new anon_vma pointer visible to a concurrent page fault before the
| anon_vma has been initialized? Though I have no idea if that is
| practically possible, stuff would have to be reordered quite a bit for
| that to happen...

I just noticed that I tried fixing this back in 2023; I don't remember
why that didn't end up landing. The memory ordering was kind of messy to
think about:
<https://lore.kernel.org/all/20230726214103....@google.com/>

> Also, anon_vma_chain_link and num_active_vmas++ indeed happen after
> assignment to anon_vma:
>
> /* page_table_lock to protect against threads */
> spin_lock(&mm->page_table_lock);
> if (likely(!vma->anon_vma)) {
>         vma->anon_vma = anon_vma;
>         anon_vma_chain_link(vma, avc, anon_vma);
>         anon_vma->num_active_vmas++;
>         allocated = NULL;
>         avc = NULL;
> }
> spin_unlock(&mm->page_table_lock);
>
> So the lockless readers that observe anon_vma!=NULL won't rely on
> these invariants, right?

Yeah, that stuff should be sufficiently protected because of the anon_vma lock.

Jann Horn

12:49 PM
to Dmitry Vyukov, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, da...@kernel.org, harr...@oracle.com, linux-...@vger.kernel.org, linu...@kvack.org, lorenzo...@oracle.com, ri...@surriel.com, syzkall...@googlegroups.com, vba...@suse.cz
Er, except it actually isn't entirely, as I noticed in that old patch I linked:

@@ -1072,7 +1071,15 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *
 static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old,
 					  struct vm_area_struct *a, struct vm_area_struct *b)
 {
 	if (anon_vma_compatible(a, b)) {
-		struct anon_vma *anon_vma = READ_ONCE(old->anon_vma);
+		/*
+		 * Pairs with smp_store_release() in __anon_vma_prepare().
+		 *
+		 * We could get away with a READ_ONCE() here, but
+		 * smp_load_acquire() ensures that the following
+		 * list_is_singular() check on old->anon_vma_chain doesn't race
+		 * with __anon_vma_prepare().
+		 */
+		struct anon_vma *anon_vma = smp_load_acquire(&old->anon_vma);

 		if (anon_vma && list_is_singular(&old->anon_vma_chain))
 			return anon_vma;

That list_is_singular(&old->anon_vma_chain) does plain loads on the
list_head, which can concurrently be modified by anon_vma_chain_link()
(which is called from __anon_vma_prepare()). I think that... probably
shouldn't cause any functional problems, but it is ugly.

Lorenzo Stoakes

1:02 PM
to Jann Horn, Dmitry Vyukov, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, da...@kernel.org, harr...@oracle.com, linux-...@vger.kernel.org, linu...@kvack.org, ri...@surriel.com, syzkall...@googlegroups.com, vba...@suse.cz
Well, isn't that what the page_table_lock is for...?

As far as the page fault is concerned, it only really cares about whether
the anon_vma exists, not whether it's initialised.

The operations that check/modify fields within the anon_vma are protected by the
anon rmap lock (my recent series takes advantage of this to avoid holding that
lock during AVC allocation for instance).

This lock also protects the interval tree.

Yeah, I'm not sure this is really hugely important, as getting this
slightly wrong only very rarely leads to slightly less efficient lock
scalability.

>
> if (anon_vma && list_is_singular(&old->anon_vma_chain))
> return anon_vma;
>
> That list_is_singular(&old->anon_vma_chain) does plain loads on the
> list_head, which can concurrently be modified by anon_vma_chain_link()

We're no longer using that directly as per my latest changes :)

But I don't think it really matters.

> (which is called from __anon_vma_prepare()). I think that... probably
> shouldn't cause any functional problems, but it is ugly.

But yeah this seems pretty benign.

Thanks, Lorenzo

Jann Horn

1:24 PM
to Lorenzo Stoakes, Dmitry Vyukov, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, da...@kernel.org, harr...@oracle.com, linux-...@vger.kernel.org, linu...@kvack.org, ri...@surriel.com, syzkall...@googlegroups.com, vba...@suse.cz
The page_table_lock prevents writer-writer data races, but not
reader-writer data races. (It is only held by writers, not by
readers.)

Hmm, yeah, I'm not sure if anything in the page fault path actually
directly accesses the anon_vma. The page fault path does eventually
re-publish the anon_vma pointer with `WRITE_ONCE(folio->mapping,
(struct address_space *) anon_vma)` in __folio_set_anon() though,
which could then potentially allow a third thread to walk through
folio->mapping and observe the uninitialized anon_vma...

Looking at the situation on latest stable (v6.18.5), two racing faults
on _adjacent_ anonymous VMAs could also end up with one thread writing
->anon_vma while the other thread executes reusable_anon_vma(),
loading the pointer to that anon_vma and accessing its
->anon_vma_chain.