[syzbot] [mm?] BUG: unable to handle kernel paging request in vma_merge_existing

syzbot

unread,

Mar 20, 2025, 7:09:38 PM (6 days ago) Mar 20

to Liam.H...@oracle.com, ak...@linux-foundation.org, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, lorenzo...@oracle.com, syzkall...@googlegroups.com, vba...@suse.cz

Hello,

syzbot found the following issue on:

HEAD commit: eb88e6bfbc0a Merge tag 'fsnotify_for_v6.14-rc7' of git://g..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=11e6c83f980000
kernel config: https://syzkaller.appspot.com/x/.config?x=77423669c2b8fa9
dashboard link: https://syzkaller.appspot.com/bug?extid=20ed41006cf9d842c2b5
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
userspace arch: i386

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7feb34a89c2a/non_bootable_disk-eb88e6bf.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/ded0ce69669f/vmlinux-eb88e6bf.xz
kernel image: https://storage.googleapis.com/syzbot-assets/6e6fa3c719e7/bzImage-eb88e6bf.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+20ed41...@syzkaller.appspotmail.com

RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
BUG: unable to handle page fault for address: fffffffffffffff4
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD df84067 P4D df84067 PUD df86067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 17805 Comm: syz.8.3237 Not tainted 6.14.0-rc6-syzkaller-00212-geb88e6bfbc0a #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:vma_merge_existing_range+0x266/0x2070 mm/vma.c:734
Code: e8 5f 25 ad ff 48 8b 14 24 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 1c 19 00 00 48 8b 04 24 48 8b 74 24 08 <4c> 8b 38 4c 89 ff e8 9f 1f ad ff 48 8b 44 24 08 49 39 c7 0f 83 db
RSP: 0000:ffffc9000319f988 EFLAGS: 00010246
RAX: fffffffffffffff4 RBX: ffffc9000319fae8 RCX: ffffffff820cd3e5
RDX: 1ffffffffffffffe RSI: 0000000080c2a000 RDI: 0000000000000005
RBP: 0000000080ce2000 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
R13: ffffc9000319fb08 R14: ffff888025eddc98 R15: ffff88804eec0a00
FS: 0000000000000000(0000) GS:ffff88802b500000(0063) knlGS:00000000f5106b40
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: fffffffffffffff4 CR3: 00000000614d6000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
vma_modify.constprop.0+0x87/0x410 mm/vma.c:1517
vma_modify_flags_uffd+0x241/0x2e0 mm/vma.c:1598
userfaultfd_clear_vma+0x91/0x130 mm/userfaultfd.c:1906
userfaultfd_release_all+0x2ae/0x4c0 mm/userfaultfd.c:2024
userfaultfd_release+0xf4/0x1c0 fs/userfaultfd.c:865
__fput+0x3ff/0xb70 fs/file_table.c:464
task_work_run+0x14e/0x250 kernel/task_work.c:227
resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
exit_to_user_mode_loop kernel/entry/common.c:114 [inline]
exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline]
__syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline]
syscall_exit_to_user_mode+0x27b/0x2a0 kernel/entry/common.c:218
__do_fast_syscall_32+0x80/0x120 arch/x86/entry/common.c:390
do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:412
entry_SYSENTER_compat_after_hwframe+0x84/0x8e
RIP: 0023:0xf7fe6579
Code: b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 00 00 00 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
RSP: 002b:00000000f510655c EFLAGS: 00000296 ORIG_RAX: 0000000000000135
RAX: 0000000000000001 RBX: 0000000080000180 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Modules linked in:
CR2: fffffffffffffff4
---[ end trace 0000000000000000 ]---
RIP: 0010:vma_merge_existing_range+0x266/0x2070 mm/vma.c:734
Code: e8 5f 25 ad ff 48 8b 14 24 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 1c 19 00 00 48 8b 04 24 48 8b 74 24 08 <4c> 8b 38 4c 89 ff e8 9f 1f ad ff 48 8b 44 24 08 49 39 c7 0f 83 db
RSP: 0000:ffffc9000319f988 EFLAGS: 00010246
RAX: fffffffffffffff4 RBX: ffffc9000319fae8 RCX: ffffffff820cd3e5
RDX: 1ffffffffffffffe RSI: 0000000080c2a000 RDI: 0000000000000005
RBP: 0000000080ce2000 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001
R13: ffffc9000319fb08 R14: ffff888025eddc98 R15: ffff88804eec0a00
FS: 0000000000000000(0000) GS:ffff88802b500000(0063) knlGS:00000000f5106b40
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: fffffffffffffff4 CR3: 00000000614d6000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess):
0: e8 5f 25 ad ff call 0xffad2564
5: 48 8b 14 24 mov (%rsp),%rdx
9: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
10: fc ff df
13: 48 c1 ea 03 shr $0x3,%rdx
17: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
1b: 0f 85 1c 19 00 00 jne 0x193d
21: 48 8b 04 24 mov (%rsp),%rax
25: 48 8b 74 24 08 mov 0x8(%rsp),%rsi
* 2a: 4c 8b 38 mov (%rax),%r15 <-- trapping instruction
2d: 4c 89 ff mov %r15,%rdi
30: e8 9f 1f ad ff call 0xffad1fd4
35: 48 8b 44 24 08 mov 0x8(%rsp),%rax
3a: 49 39 c7 cmp %rax,%r15
3d: 0f .byte 0xf
3e: 83 .byte 0x83
3f: db .byte 0xdb

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

syzbot

unread,

Mar 20, 2025, 8:02:40 PM (6 days ago) Mar 20

to pfal...@suse.de, ak...@linux-foundation.org, ja...@google.com, liam.h...@oracle.com, linux-...@vger.kernel.org, linu...@kvack.org, lorenzo...@oracle.com, pfal...@suse.de, syzkall...@googlegroups.com, vba...@suse.cz

> Ahh, fun bug. This *seems* to be the bug:
>
> First, in vma_modify:
>
> merged = vma_merge_existing_range(vmg);
> if (merged)
> return merged;
> if (vmg_nomem(vmg))
> return ERR_PTR(-ENOMEM);
>
> then, all the way up to userfaultfd_release_all (the return value propagates
> vma_modify -> vma_modify_flags_uffd -> userfaultfd_clear_vma):
>
> prev = NULL;
> for_each_vma(vmi, vma) {
> cond_resched();
> BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
> !!(vma->vm_flags & __VM_UFFD_FLAGS));
> if (vma->vm_userfaultfd_ctx.ctx != ctx) {
> prev = vma;
> continue;
> }
>
> vma = userfaultfd_clear_vma(&vmi, prev, vma,
> vma->vm_start, vma->vm_end);
> prev = vma;
> }
>
> So, if uffd gets an IS_ERR(vma), it keeps going and takes that vma as the prev value,
> which leads to that ERR_PTR(-ENOMEM) deref crash (-12 = -ENOMEM = 0xffffff4).
> This situation is kind of awkward because ->release() errors don't mean a thing.
> So, I have another idea (pasting for syzbot) which might just be cromulent.
> Untested, but thoughts?
>
> #syz test

This crash does not have a reproducer. I cannot test it.

>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index d06453fa8aba..fb835d82eb84 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -2023,6 +2023,8 @@ void userfaultfd_release_all(struct mm_struct *mm,
>
> vma = userfaultfd_clear_vma(&vmi, prev, vma,
> vma->vm_start, vma->vm_end);
> + if (WARN_ON(IS_ERR(vma)))
> + break;
> prev = vma;
> }
> mmap_write_unlock(mm);
> diff --git a/mm/vma.c b/mm/vma.c
> index 71ca012c616c..b2167b7dc27d 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -1517,8 +1517,16 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg)
> merged = vma_merge_existing_range(vmg);
> if (merged)
> return merged;
> - if (vmg_nomem(vmg))
> + if (vmg_nomem(vmg)) {
> + /* If we can avoid failing the whole modification
> + * due to a merge OOM and validly keep going
> + * (we're modifying the whole VMA), return vma intact.
> + * It won't get merged, but such is life - we're avoiding
> + * OOM conditions in other parts of mm/ this way */
> + if (start <= vma->vm_start && end >= vma->vm_end)
> + return vma;
> return ERR_PTR(-ENOMEM);
> + }
>
> /* Split any preceding portion of the VMA. */
> if (vma->vm_start < start) {
>
> --
> Pedro

Jann Horn

unread,

Mar 20, 2025, 8:12:13 PM (6 days ago) Mar 20

to Pedro Falcato, syzbot, lorenzo...@oracle.com, Liam.H...@oracle.com, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz

On Thu, Mar 20, 2025 at 9:02 PM Pedro Falcato <pfal...@suse.de> wrote:
> On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote:

> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index d06453fa8aba..fb835d82eb84 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -2023,6 +2023,8 @@ void userfaultfd_release_all(struct mm_struct *mm,
>
> vma = userfaultfd_clear_vma(&vmi, prev, vma,
> vma->vm_start, vma->vm_end);
> + if (WARN_ON(IS_ERR(vma)))
> + break;

If this WARN_ON() was ever actually hit, I think we'd leave dangling
pointers in VMAs? As much as Linus hates BUG_ON(), I personally think
that would be a situation warranting BUG_ON(), or at least

CHECK_DATA_CORRUPTION(). That said:

> prev = vma;
> }
> mmap_write_unlock(mm);
> diff --git a/mm/vma.c b/mm/vma.c
> index 71ca012c616c..b2167b7dc27d 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -1517,8 +1517,16 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg)
> merged = vma_merge_existing_range(vmg);
> if (merged)
> return merged;
> - if (vmg_nomem(vmg))
> + if (vmg_nomem(vmg)) {
> + /* If we can avoid failing the whole modification
> + * due to a merge OOM and validly keep going
> + * (we're modifying the whole VMA), return vma intact.
> + * It won't get merged, but such is life - we're avoiding
> + * OOM conditions in other parts of mm/ this way */
> + if (start <= vma->vm_start && end >= vma->vm_end)
> + return vma;
> return ERR_PTR(-ENOMEM);
> + }

Along the lines of your idea, perhaps we could add a parameter "bool
never_fail" to vma_modify() that is passed through to
vma_merge_existing_range(), and guarantee that it never fails when
that parameter is set? Then we could also check that never_fail is
only used in cases where no split is necessary. That somewhat avoids
having this kind of check that only ever runs in error conditions...

Lorenzo Stoakes

unread,

Mar 20, 2025, 8:53:42 PM (6 days ago) Mar 20

to Jann Horn, Pedro Falcato, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz

On this - urgency is less as vmg_nomem() really can only occur with fault
injection afaict as it is a 'too small to fail' scenario, so we won't see
this IRL.

So we can relax a bit given LSF timing etc, and do something for 6.15rc1
I'd say :) then we can backport as needed obviously.

Also if this did somehow happen under such extreme memory pressure the
process would soon be torn down along with the system...

Sorry Jann, no security flaw here I would say :P

Oh lord... thanks for your analysis Pedro! But also, oh lord.

Yeah this 'pointer actually is an error value' thing is a thing I've seen
(and *ahem* caused) before.

It's funny because userfaultfd_clear_vma() -already has handling_ for errors...:

if (!IS_ERR(ret))
userfaultfd_reset_ctx(ret);

And yet...

> >
> > prev = NULL;
> > for_each_vma(vmi, vma) {
> > cond_resched();
> > BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
> > !!(vma->vm_flags & __VM_UFFD_FLAGS));
> > if (vma->vm_userfaultfd_ctx.ctx != ctx) {
> > prev = vma;
> > continue;
> > }
> >
> > vma = userfaultfd_clear_vma(&vmi, prev, vma,
> > vma->vm_start, vma->vm_end);
> > prev = vma;
> > }
> >
> > So, if uffd gets an IS_ERR(vma), it keeps going and takes that vma as the prev value,
> > which leads to that ERR_PTR(-ENOMEM) deref crash (-12 = -ENOMEM = 0xffffff4).
> > This situation is kind of awkward because ->release() errors don't mean a thing.
> > So, I have another idea (pasting for syzbot) which might just be cromulent.
> > Untested, but thoughts?

Yeah that seems to match the values, I did wonder about this 'underflowed
u64' value and whether it was an error one and yeah.. sigh.

> >
> > #syz test

> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index d06453fa8aba..fb835d82eb84 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -2023,6 +2023,8 @@ void userfaultfd_release_all(struct mm_struct *mm,
> >
> > vma = userfaultfd_clear_vma(&vmi, prev, vma,
> > vma->vm_start, vma->vm_end);
> > + if (WARN_ON(IS_ERR(vma)))
> > + break;
>
> If this WARN_ON() was ever actually hit, I think we'd leave dangling
> pointers in VMAs? As much as Linus hates BUG_ON(), I personally think
> that would be a situation warranting BUG_ON(), or at least
> CHECK_DATA_CORRUPTION(). That said:

Yeah indeed, agreed.

>
> > prev = vma;
> > }
> > mmap_write_unlock(mm);
> > diff --git a/mm/vma.c b/mm/vma.c
> > index 71ca012c616c..b2167b7dc27d 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -1517,8 +1517,16 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg)
> > merged = vma_merge_existing_range(vmg);
> > if (merged)
> > return merged;
> > - if (vmg_nomem(vmg))
> > + if (vmg_nomem(vmg)) {
> > + /* If we can avoid failing the whole modification
> > + * due to a merge OOM and validly keep going
> > + * (we're modifying the whole VMA), return vma intact.
> > + * It won't get merged, but such is life - we're avoiding
> > + * OOM conditions in other parts of mm/ this way */
> > + if (start <= vma->vm_start && end >= vma->vm_end)
> > + return vma;

I do not like this solution at all, sorry.

I mean I get what you're doing and it's smart to try to find a means out of
this in general :) but let me explain my reasoning:

For one this is uffd's fault, and the fix clearly needs to be there.

But also, we _can't be sure_ vma is valid any more. The second it goes off
to vma_merge_existing_range() it might be removed, which is why it's
critical to only use 'merged'.

Now you might be able to prove that _right now_ it'd be ok, if you do this
check, because vma_complete() does the delete and only if either
vma_iter_prealloc() or dup_anon_vma() fails would we return -ENOMEM and
these happen _before_ vma_complete(), but that's an _implementation detail_
and now we've made an assumption that this is the case here.

An implicit effectively precondition on something unexpressed like that is
asking for trouble, really don't want to go that way.

> > return ERR_PTR(-ENOMEM);
> > + }

>
> Along the lines of your idea, perhaps we could add a parameter "bool
> never_fail" to vma_modify() that is passed through to
> vma_merge_existing_range(), and guarantee that it never fails when
> that parameter is set? Then we could also check that never_fail is
> only used in cases where no split is necessary. That somewhat avoids
> having this kind of check that only ever runs in error conditions...

Yeah hmmm, again this is _not where the problem is_. And we're doing it for
_one case only_, who _must_ succeed. Right?

Buuuut then again, we could add a _feature_ (it'd be something in VMG not a
bool, hey what are helper structs for right? :P)

We coould add a special mode that says we __GFP_NOFAIL, we do that in
vms_abort_munmap_vmas() and man that was under similar circumstances (hey
remember that fun Liam? :)

But at the same time, it feels icky, and we probably don't want to
proliferate this pattern too much.

So I'd kind of rather prefer a _general_ no-fail unmap that the uffd
release code could invoke.

Perhaps we could genericise the vms_abort_munmap_vmas():

mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);

And make that available or some form of it, to do the 'simplest' thing in
this scenario.

If we called that say vma_emergency_clear() then we could do something
like:

void userfaultfd_release_all(struct mm_struct *mm,
struct userfaultfd_ctx *ctx)
{
...

for_each_vma(vmi, vma) {
unsigned long start = vma->vm_start;
unsigned long end = vma->vm_end;

...

vma = userfaultfd_clear_vma(&vmi, prev, vma,
vma->vm_start, vma->vm_end);

if (IS_ERR(vma)) {
/*
* We can no longer rely on VMA state, we must
* unconditionally leave a gap.
*/
vma_emergency_clear(mm, start, end);
vma_iter_reset(&vmi); /* Probably? */
prev = NULL; /* Probably? */
continue;
}
}
}

I mean this isn't fun either. I wonder if we (that probably mean sme)
should go audit cases like this.

Another possible solution is to add a flag that _explicitly asserts and
documents_ that you require that no VMA be removed before attempting to
allocate.

Or we could make that an _explicit_ assumption?

And then the uffd code itself could cache vma and take Pedro's solution on
that basis?

void userfaultfd_release_all(struct mm_struct *mm,
struct userfaultfd_ctx *ctx)
{
...

for_each_vma(vmi, vma) {
struct vm_area_struct *old = vma;

...

vma = userfaultfd_clear_vma(&vmi, prev, vma,
vma->vm_start, vma->vm_end);

if (IS_ERR(vma)) {
BUG_ON(vma != -ENOMEM); /* Sorry Linus! */

/*
* OK we assert above that vma must remain intact if we fail to allocate,
* We are in an extreme memory pressure state, we simply cannot clear this VMA. Move on.
*/
prev = old;
}

...
}
}

I mean it's going to be dirty whichever way round we do this.

Thoughts guys?

But key point being - this is not quite as urgent as it might seem :)

Jann Horn

unread,

Mar 20, 2025, 11:05:22 PM (6 days ago) Mar 20

to Lorenzo Stoakes, Pedro Falcato, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz

On Thu, Mar 20, 2025 at 9:53 PM Lorenzo Stoakes
<lorenzo...@oracle.com> wrote:
> On Thu, Mar 20, 2025 at 09:11:33PM +0100, Jann Horn wrote:
> > On Thu, Mar 20, 2025 at 9:02 PM Pedro Falcato <pfal...@suse.de> wrote:
> > > On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote:

[...]

> > > Ahh, fun bug. This *seems* to be the bug:
> > >
> > > First, in vma_modify:
> > >
> > > merged = vma_merge_existing_range(vmg);
> > > if (merged)
> > > return merged;
> > > if (vmg_nomem(vmg))
> > > return ERR_PTR(-ENOMEM);
> > >
> > > then, all the way up to userfaultfd_release_all (the return value propagates
> > > vma_modify -> vma_modify_flags_uffd -> userfaultfd_clear_vma):

[...]

> > > diff --git a/mm/vma.c b/mm/vma.c
> > > index 71ca012c616c..b2167b7dc27d 100644
> > > --- a/mm/vma.c
> > > +++ b/mm/vma.c
> > > @@ -1517,8 +1517,16 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg)
> > > merged = vma_merge_existing_range(vmg);
> > > if (merged)
> > > return merged;
> > > - if (vmg_nomem(vmg))
> > > + if (vmg_nomem(vmg)) {
> > > + /* If we can avoid failing the whole modification
> > > + * due to a merge OOM and validly keep going
> > > + * (we're modifying the whole VMA), return vma intact.
> > > + * It won't get merged, but such is life - we're avoiding
> > > + * OOM conditions in other parts of mm/ this way */
> > > + if (start <= vma->vm_start && end >= vma->vm_end)
> > > + return vma;
>
> I do not like this solution at all, sorry.
>
> I mean I get what you're doing and it's smart to try to find a means out of
> this in general :) but let me explain my reasoning:
>
> For one this is uffd's fault, and the fix clearly needs to be there.

I mean... this worked fine back in, for example, 5.4 -
userfaultfd_release() would loop over the VMAs, change some stuff in
some of them, and merge them where possible, and there was no way
anything could fail. It's the VMA subsystem that changed its API...

> But also, we _can't be sure_ vma is valid any more. The second it goes off
> to vma_merge_existing_range() it might be removed, which is why it's
> critical to only use 'merged'.
>
> Now you might be able to prove that _right now_ it'd be ok, if you do this
> check, because vma_complete() does the delete and only if either
> vma_iter_prealloc() or dup_anon_vma() fails would we return -ENOMEM and
> these happen _before_ vma_complete(), but that's an _implementation detail_
> and now we've made an assumption that this is the case here.
>
> An implicit effectively precondition on something unexpressed like that is
> asking for trouble, really don't want to go that way.
>
>
> > > return ERR_PTR(-ENOMEM);
> > > + }
>
> >
> > Along the lines of your idea, perhaps we could add a parameter "bool
> > never_fail" to vma_modify() that is passed through to
> > vma_merge_existing_range(), and guarantee that it never fails when
> > that parameter is set? Then we could also check that never_fail is
> > only used in cases where no split is necessary. That somewhat avoids
> > having this kind of check that only ever runs in error conditions...
>
> Yeah hmmm, again this is _not where the problem is_. And we're doing it for
> _one case only_, who _must_ succeed. Right?

It seems to me like it is... theoretically very reasonable of
userfaultfd to expect to have a "reliably change the flags of an
entire VMA" operation, and if the normal VMA code doesn't provide that
because of maple tree internals in the merging case, then it would be
reasonable for the VMA code to provide an alternative that does
provide this?

> Buuuut then again, we could add a _feature_ (it'd be something in VMG not a
> bool, hey what are helper structs for right? :P)
>
> We coould add a special mode that says we __GFP_NOFAIL, we do that in
> vms_abort_munmap_vmas() and man that was under similar circumstances (hey
> remember that fun Liam? :)
>
> But at the same time, it feels icky, and we probably don't want to
> proliferate this pattern too much.
>
> So I'd kind of rather prefer a _general_ no-fail unmap that the uffd
> release code could invoke.
>
> Perhaps we could genericise the vms_abort_munmap_vmas():
>
> mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);
>
> And make that available or some form of it, to do the 'simplest' thing in
> this scenario.

The userfaultfd release code doesn't want an "unmap" operation. It
just wants to remove the __VM_UFFD_FLAGS flags and set the
vma->vm_userfaultfd_ctx pointer to NULL.
The VMA code then sometimes sees an opportunity to merge with adjacent
VMAs; and this merging is what's failing.
So if we're willing to tolerate having adjacent VMAs that are
mergeable but aren't merged after an allocation failure, then instead
of using __GFP_NOFAIL for the merge, we could also just ignore merge
failures, at least when some "never fail" flag is set?

[...]

> Another possible solution is to add a flag that _explicitly asserts and
> documents_ that you require that no VMA be removed before attempting to
> allocate.
>
> Or we could make that an _explicit_ assumption?

I don't think I understand this part. What VMA removal and allocation
are you talking about?

> And then the uffd code itself could cache vma and take Pedro's solution on
> that basis?
>
> void userfaultfd_release_all(struct mm_struct *mm,
> struct userfaultfd_ctx *ctx)
> {
> ...
>
> for_each_vma(vmi, vma) {
> struct vm_area_struct *old = vma;
>
> ...
>
> vma = userfaultfd_clear_vma(&vmi, prev, vma,
> vma->vm_start, vma->vm_end);
> if (IS_ERR(vma)) {
> BUG_ON(vma != -ENOMEM); /* Sorry Linus! */
>
> /*
> * OK we assert above that vma must remain intact if we fail to allocate,
> * We are in an extreme memory pressure state, we simply cannot clear this VMA. Move on.
> */
> prev = old;
> }
>
> ...
> }
> }
>
> I mean it's going to be dirty whichever way round we do this.
>
> Thoughts guys?

I guess my main thought on this is: I would prefer it if we keep any
code that runs only in near-impossible cases as simple as possible,
because issues in those codepaths will take longer to find.

Pedro Falcato

unread,

Mar 20, 2025, 11:54:34 PM (6 days ago) Mar 20

to syzbot, lorenzo...@oracle.com, Liam.H...@oracle.com, ak...@linux-foundation.org, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz

On Thu, Mar 20, 2025 at 12:09:36PM -0700, syzbot wrote:

Ahh, fun bug. This *seems* to be the bug:

First, in vma_modify:

merged = vma_merge_existing_range(vmg);
if (merged)
return merged;
if (vmg_nomem(vmg))
return ERR_PTR(-ENOMEM);

then, all the way up to userfaultfd_release_all (the return value propagates
vma_modify -> vma_modify_flags_uffd -> userfaultfd_clear_vma):

prev = NULL;
for_each_vma(vmi, vma) {
cond_resched();
BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
!!(vma->vm_flags & __VM_UFFD_FLAGS));
if (vma->vm_userfaultfd_ctx.ctx != ctx) {
prev = vma;
continue;
}

vma = userfaultfd_clear_vma(&vmi, prev, vma,
vma->vm_start, vma->vm_end);

prev = vma;
}

So, if uffd gets an IS_ERR(vma), it keeps going and takes that vma as the prev value,
which leads to that ERR_PTR(-ENOMEM) deref crash (-12 = -ENOMEM = 0xffffff4).
This situation is kind of awkward because ->release() errors don't mean a thing.
So, I have another idea (pasting for syzbot) which might just be cromulent.
Untested, but thoughts?

#syz test

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d06453fa8aba..fb835d82eb84 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -2023,6 +2023,8 @@ void userfaultfd_release_all(struct mm_struct *mm,

vma = userfaultfd_clear_vma(&vmi, prev, vma,
vma->vm_start, vma->vm_end);

+ if (WARN_ON(IS_ERR(vma)))
+ break;

prev = vma;
}
mmap_write_unlock(mm);

diff --git a/mm/vma.c b/mm/vma.c
index 71ca012c616c..b2167b7dc27d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -1517,8 +1517,16 @@ static struct vm_area_struct *vma_modify(struct vma_merge_struct *vmg)
merged = vma_merge_existing_range(vmg);
if (merged)
return merged;
- if (vmg_nomem(vmg))
+ if (vmg_nomem(vmg)) {
+ /* If we can avoid failing the whole modification
+ * due to a merge OOM and validly keep going
+ * (we're modifying the whole VMA), return vma intact.
+ * It won't get merged, but such is life - we're avoiding
+ * OOM conditions in other parts of mm/ this way */
+ if (start <= vma->vm_start && end >= vma->vm_end)
+ return vma;

Lorenzo Stoakes

unread,

Mar 21, 2025, 9:10:57 AM (5 days ago) Mar 21

to Jann Horn, Pedro Falcato, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz

TL;DR: After actually sleeping :)

Original clear all implementation only tried to merge, so the split bits of
vma_modify() for it are no-ops (we iterate VMAs explicitly and try to clear the
whole of each VMA).

So we can simply go with a mix of Jann's and Pedro's suggestions and pass in a
flag that says 'I'm fine with this OOM'ing on the attempt, if that happens just
give up on the merge and let's reset the VMA in question.

I will write a patch for this shortly.

Thanks guys!

OK you make a fair point, actually (after some sleep :P sorry was running on 3
hours yesterday).

On reflection adding an option to the merge that says 'if we can't allocate,
give up on merge and return NULL' is sensible, and we can add a parameter to
vma_modify() saying the same thing, should split fail, which is another error
case that must be considered here.

So this code path is going vma-by-vma so will never split, we need not worry
about this.

The original implementation just tried to merge, I made it so we shared code
instead of duplicating all over the place as we do the same operation only we
might be within a VMA in userfaultfd_unregister().

So this is really simple actaully we just need a 'ok best effort try to merge,
we don't want to know if OOM, if you go OOM there, give up'.

But for aforementioned reasons about implicit assumptions this _has_ to be an
explicit flag.

Which I will add :)

Yes.

>
> > Buuuut then again, we could add a _feature_ (it'd be something in VMG not a
> > bool, hey what are helper structs for right? :P)
> >
> > We coould add a special mode that says we __GFP_NOFAIL, we do that in
> > vms_abort_munmap_vmas() and man that was under similar circumstances (hey
> > remember that fun Liam? :)
> >
> > But at the same time, it feels icky, and we probably don't want to
> > proliferate this pattern too much.
> >
> > So I'd kind of rather prefer a _general_ no-fail unmap that the uffd
> > release code could invoke.
> >
> > Perhaps we could genericise the vms_abort_munmap_vmas():
> >
> > mas_store_gfp(mas, NULL, GFP_KERNEL|__GFP_NOFAIL);
> >
> > And make that available or some form of it, to do the 'simplest' thing in
> > this scenario.
>
> The userfaultfd release code doesn't want an "unmap" operation. It
> just wants to remove the __VM_UFFD_FLAGS flags and set the
> vma->vm_userfaultfd_ctx pointer to NULL.
> The VMA code then sometimes sees an opportunity to merge with adjacent
> VMAs; and this merging is what's failing.
> So if we're willing to tolerate having adjacent VMAs that are
> mergeable but aren't merged after an allocation failure, then instead
> of using __GFP_NOFAIL for the merge, we could also just ignore merge
> failures, at least when some "never fail" flag is set?

Yes agreed.

Agreed.

syzbot

unread,

Mar 23, 2025, 4:49:26 PM (3 days ago) Mar 23

to Liam.H...@oracle.com, ak...@linux-foundation.org, ja...@google.com, liam.h...@oracle.com, linux-...@vger.kernel.org, linu...@kvack.org, lorenzo...@oracle.com, pfal...@suse.de, syzkall...@googlegroups.com, vba...@suse.cz

syzbot has found a reproducer for the following issue on:

HEAD commit: 586de92313fc Merge tag 'i2c-for-6.14-rc8' of git://git.ker..
git tree: upstream
console+strace: https://syzkaller.appspot.com/x/log.txt?x=162b7e98580000
kernel config: https://syzkaller.appspot.com/x/.config?x=2e330e9768b5b8ff

dashboard link: https://syzkaller.appspot.com/bug?extid=20ed41006cf9d842c2b5
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40

syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1196f3b0580000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=17d3dc4c580000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/3cb302fb058e/disk-586de923.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/6d42da95fefe/vmlinux-586de923.xz
kernel image: https://storage.googleapis.com/syzbot-assets/5a9e686ee97d/bzImage-586de923.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+20ed41...@syzkaller.appspotmail.com

RSP: 002b:00007fb22df1c208 EFLAGS: 00000246 ORIG_RAX: 0000000000000013
RAX: 0000000000000020 RBX: 00007fb22dfeb3c8 RCX: 00007fb22df69099
RDX: 0000000000000006 RSI: 0000200000000240 RDI: 0000000000000003
RBP: 00007fb22dfeb3c0 R08: 00007fb22df1bfa7 R09: 0000000000000033
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb22dfb8284
R13: 00007fb22df1c210 R14: 0000000000000001 R15: 0000200000000240

</TASK>
BUG: unable to handle page fault for address: fffffffffffffff4
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD df84067 P4D df84067 PUD df86067 PMD 0
Oops: Oops: 0000 [#1] PREEMPT SMP KASAN NOPTI

CPU: 0 UID: 0 PID: 5822 Comm: syz-executor515 Not tainted 6.14.0-rc7-syzkaller-00205-g586de92313fc #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2025

RIP: 0010:vma_merge_existing_range+0x266/0x2070 mm/vma.c:734

Code: e8 0f 47 ac ff 48 8b 14 24 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 1c 19 00 00 48 8b 04 24 48 8b 74 24 08 <4c> 8b 38 4c 89 ff e8 4f 41 ac ff 48 8b 44 24 08 49 39 c7 0f 83 db
RSP: 0018:ffffc900034d7950 EFLAGS: 00010246
RAX: fffffffffffffff4 RBX: ffffc900034d7ab0 RCX: ffffffff820db0e5
RDX: 1ffffffffffffffe RSI: 0000200000807000 RDI: 0000000000000005
RBP: 0000200000ce2000 R08: 0000000000000005 R09: 0000000000000000

R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001

R13: ffffc900034d7ad0 R14: ffff8880349961f0 R15: ffff8880122a6e00
FS: 00007fb22df1c6c0(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffff4 CR3: 000000007d69e000 CR4: 00000000003526f0

DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
vma_modify.constprop.0+0x87/0x410 mm/vma.c:1517
vma_modify_flags_uffd+0x241/0x2e0 mm/vma.c:1598
userfaultfd_clear_vma+0x91/0x130 mm/userfaultfd.c:1906
userfaultfd_release_all+0x2ae/0x4c0 mm/userfaultfd.c:2024
userfaultfd_release+0xf4/0x1c0 fs/userfaultfd.c:865
__fput+0x3ff/0xb70 fs/file_table.c:464
task_work_run+0x14e/0x250 kernel/task_work.c:227

ptrace_notify+0x10e/0x130 kernel/signal.c:2522
ptrace_report_syscall include/linux/ptrace.h:415 [inline]
ptrace_report_syscall_exit include/linux/ptrace.h:477 [inline]
syscall_exit_work kernel/entry/common.c:173 [inline]
syscall_exit_to_user_mode_prepare+0x126/0x290 kernel/entry/common.c:200
__syscall_exit_to_user_mode_work kernel/entry/common.c:205 [inline]
syscall_exit_to_user_mode+0x11/0x2a0 kernel/entry/common.c:218
do_syscall_64+0xda/0x250 arch/x86/entry/common.c:89
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fb22df69099
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 31 1b 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fb22df1c208 EFLAGS: 00000246 ORIG_RAX: 0000000000000013
RAX: 0000000000000020 RBX: 00007fb22dfeb3c8 RCX: 00007fb22df69099
RDX: 0000000000000006 RSI: 0000200000000240 RDI: 0000000000000003
RBP: 00007fb22dfeb3c0 R08: 00007fb22df1bfa7 R09: 0000000000000033
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb22dfb8284
R13: 00007fb22df1c210 R14: 0000000000000001 R15: 0000200000000240

</TASK>
Modules linked in:
CR2: fffffffffffffff4
---[ end trace 0000000000000000 ]---
RIP: 0010:vma_merge_existing_range+0x266/0x2070 mm/vma.c:734

Code: e8 0f 47 ac ff 48 8b 14 24 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02 00 0f 85 1c 19 00 00 48 8b 04 24 48 8b 74 24 08 <4c> 8b 38 4c 89 ff e8 4f 41 ac ff 48 8b 44 24 08 49 39 c7 0f 83 db
RSP: 0018:ffffc900034d7950 EFLAGS: 00010246
RAX: fffffffffffffff4 RBX: ffffc900034d7ab0 RCX: ffffffff820db0e5
RDX: 1ffffffffffffffe RSI: 0000200000807000 RDI: 0000000000000005
RBP: 0000200000ce2000 R08: 0000000000000005 R09: 0000000000000000

R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000001

R13: ffffc900034d7ad0 R14: ffff8880349961f0 R15: ffff8880122a6e00
FS: 00007fb22df1c6c0(0000) GS:ffff8880b8600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffffffffffff4 CR3: 000000007d69e000 CR4: 00000000003526f0

DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess):

0: e8 0f 47 ac ff call 0xffac4714

5: 48 8b 14 24 mov (%rsp),%rdx
9: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
10: fc ff df
13: 48 c1 ea 03 shr $0x3,%rdx
17: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1)
1b: 0f 85 1c 19 00 00 jne 0x193d
21: 48 8b 04 24 mov (%rsp),%rax
25: 48 8b 74 24 08 mov 0x8(%rsp),%rsi
* 2a: 4c 8b 38 mov (%rax),%r15 <-- trapping instruction
2d: 4c 89 ff mov %r15,%rdi

30: e8 4f 41 ac ff call 0xffac4184

35: 48 8b 44 24 08 mov 0x8(%rsp),%rax
3a: 49 39 c7 cmp %rax,%r15
3d: 0f .byte 0xf
3e: 83 .byte 0x83
3f: db .byte 0xdb

---

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

syzbot

unread,

Mar 24, 2025, 12:31:08 AM (3 days ago) Mar 24

to Liam.H...@oracle.com, ak...@linux-foundation.org, brad.s...@opensrcsec.com, ja...@google.com, liam.h...@oracle.com, linux-...@vger.kernel.org, linu...@kvack.org, lorenzo...@oracle.com, pfal...@suse.de, syzkall...@googlegroups.com, vba...@suse.cz

syzbot has bisected this issue to:

commit 47b16d0462a460000b8f05dfb1292377ac48f3ca
Author: Lorenzo Stoakes <lorenzo...@oracle.com>
Date: Sat Feb 22 16:19:52 2025 +0000

mm: abort vma_modify() on merge out of memory failure

bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=1415fe98580000
start commit: 586de92313fc Merge tag 'i2c-for-6.14-rc8' of git://git.ker..
git tree: upstream
final oops: https://syzkaller.appspot.com/x/report.txt?x=1615fe98580000
console output: https://syzkaller.appspot.com/x/log.txt?x=1215fe98580000

kernel config: https://syzkaller.appspot.com/x/.config?x=2e330e9768b5b8ff
dashboard link: https://syzkaller.appspot.com/bug?extid=20ed41006cf9d842c2b5

syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1196f3b0580000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=17d3dc4c580000

Reported-by: syzbot+20ed41...@syzkaller.appspotmail.com
Fixes: 47b16d0462a4 ("mm: abort vma_modify() on merge out of memory failure")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

Reply all

Reply to author

Forward

[syzbot] [mm?] BUG: unable to handle kernel paging request in vma_merge_existing_range

syzbot

syzbot

Jann Horn

Lorenzo Stoakes

Jann Horn

Pedro Falcato

Lorenzo Stoakes

syzbot

syzbot