[syzbot] [mm?] general protection fault in find_mergeable_anon_vma


syzbot

Dec 9, 2024, 6:20:22 AM
to Liam.H...@oracle.com, ak...@linux-foundation.org, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, lorenzo...@oracle.com, syzkall...@googlegroups.com, vba...@suse.cz
Hello,

syzbot found the following issue on:

HEAD commit: feffde684ac2 Merge tag 'for-6.13-rc1-tag' of git://git.ker..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=17f85fc0580000
kernel config: https://syzkaller.appspot.com/x/.config?x=50c7a61469ce77e7
dashboard link: https://syzkaller.appspot.com/bug?extid=2d788f4f7cb660dac4b7
compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7feb34a89c2a/non_bootable_disk-feffde68.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/6135c7297e8e/vmlinux-feffde68.xz
kernel image: https://storage.googleapis.com/syzbot-assets/6c154fdcc9cb/bzImage-feffde68.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+2d788f...@syzkaller.appspotmail.com

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000080: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000400-0x0000000000000407]
CPU: 0 UID: 0 PID: 5319 Comm: syz.0.0 Not tainted 6.13.0-rc1-syzkaller-00025-gfeffde684ac2 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:anon_vma_compatible mm/vma.c:1804 [inline]
RIP: 0010:reusable_anon_vma mm/vma.c:1837 [inline]
RIP: 0010:find_mergeable_anon_vma+0x1e4/0x8f0 mm/vma.c:1863
Code: 00 00 00 00 fc ff df 41 80 3c 06 00 74 08 4c 89 ff e8 10 39 10 00 4d 8b 37 4d 89 ec 49 c1 ec 03 48 b8 00 00 00 00 00 fc ff df <41> 80 3c 04 00 74 08 4c 89 ef e8 ed 38 10 00 49 8b 5d 00 4c 89 f7
RSP: 0018:ffffc9000d3df500 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: ffffc9000d3df540 RCX: ffff88801cf80000
RDX: 0000000000000000 RSI: ffffffff900062a0 RDI: 0000000000000000
RBP: ffffc9000d3df610 R08: 0000000000000005 R09: ffffffff8bc6b642
R10: 0000000000000003 R11: ffff88801cf80000 R12: 0000000000000080
R13: 0000000000000406 R14: 0000000021000000 R15: ffff8880120d4ca0
FS: 00007f137f7e86c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 0000000040256000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__anon_vma_prepare+0xd9/0x4a0 mm/rmap.c:199
anon_vma_prepare include/linux/rmap.h:164 [inline]
uprobe_write_opcode+0x1a95/0x2d80 kernel/events/uprobes.c:516
install_breakpoint+0x4fc/0x660 kernel/events/uprobes.c:1135
register_for_each_vma+0xa08/0xc50 kernel/events/uprobes.c:1275
uprobe_register+0x811/0x970 kernel/events/uprobes.c:1384
bpf_uprobe_multi_link_attach+0xaca/0xdd0 kernel/trace/bpf_trace.c:3442
link_create+0x6d7/0x870 kernel/bpf/syscall.c:5399
__sys_bpf+0x4bc/0x810 kernel/bpf/syscall.c:5860
__do_sys_bpf kernel/bpf/syscall.c:5897 [inline]
__se_sys_bpf kernel/bpf/syscall.c:5895 [inline]
__x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5895
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f137e97ff19
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f137f7e8058 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 00007f137eb46080 RCX: 00007f137e97ff19
RDX: 000000000000003c RSI: 00000000200012c0 RDI: 000000000000001c
RBP: 00007f137e9f3986 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007f137eb46080 R15: 00007fff36be56b8
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:anon_vma_compatible mm/vma.c:1804 [inline]
RIP: 0010:reusable_anon_vma mm/vma.c:1837 [inline]
RIP: 0010:find_mergeable_anon_vma+0x1e4/0x8f0 mm/vma.c:1863
Code: 00 00 00 00 fc ff df 41 80 3c 06 00 74 08 4c 89 ff e8 10 39 10 00 4d 8b 37 4d 89 ec 49 c1 ec 03 48 b8 00 00 00 00 00 fc ff df <41> 80 3c 04 00 74 08 4c 89 ef e8 ed 38 10 00 49 8b 5d 00 4c 89 f7
RSP: 0018:ffffc9000d3df500 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: ffffc9000d3df540 RCX: ffff88801cf80000
RDX: 0000000000000000 RSI: ffffffff900062a0 RDI: 0000000000000000
RBP: ffffc9000d3df610 R08: 0000000000000005 R09: ffffffff8bc6b642
R10: 0000000000000003 R11: ffff88801cf80000 R12: 0000000000000080
R13: 0000000000000406 R14: 0000000021000000 R15: ffff8880120d4ca0
FS: 00007f137f7e86c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020002240 CR3: 0000000040256000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess), 6 bytes skipped:
0: df 41 80 filds -0x80(%rcx)
3: 3c 06 cmp $0x6,%al
5: 00 74 08 4c add %dh,0x4c(%rax,%rcx,1)
9: 89 ff mov %edi,%edi
b: e8 10 39 10 00 call 0x103920
10: 4d 8b 37 mov (%r15),%r14
13: 4d 89 ec mov %r13,%r12
16: 49 c1 ec 03 shr $0x3,%r12
1a: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
21: fc ff df
* 24: 41 80 3c 04 00 cmpb $0x0,(%r12,%rax,1) <-- trapping instruction
29: 74 08 je 0x33
2b: 4c 89 ef mov %r13,%rdi
2e: e8 ed 38 10 00 call 0x103920
33: 49 8b 5d 00 mov 0x0(%r13),%rbx
37: 4c 89 f7 mov %r14,%rdi


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

Lorenzo Stoakes

Dec 9, 2024, 7:53:57 AM
to syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
On Mon, Dec 09, 2024 at 03:20:19AM -0800, syzbot wrote:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: feffde684ac2 Merge tag 'for-6.13-rc1-tag' of git://git.ker..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=17f85fc0580000
> kernel config: https://syzkaller.appspot.com/x/.config?x=50c7a61469ce77e7
> dashboard link: https://syzkaller.appspot.com/bug?extid=2d788f4f7cb660dac4b7
> compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
>
> Unfortunately, I don't have any reproducer for this issue yet.

Points to this being racy.

>
> Downloadable assets:
> disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7feb34a89c2a/non_bootable_disk-feffde68.raw.xz
> vmlinux: https://storage.googleapis.com/syzbot-assets/6135c7297e8e/vmlinux-feffde68.xz
> kernel image: https://storage.googleapis.com/syzbot-assets/6c154fdcc9cb/bzImage-feffde68.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+2d788f...@syzkaller.appspotmail.com
>
> Oops: general protection fault, probably for non-canonical address 0xdffffc0000000080: 0000 [#1] PREEMPT SMP KASAN NOPTI
> KASAN: null-ptr-deref in range [0x0000000000000400-0x0000000000000407]

This doesn't make a huge amount of sense to me, the VMA is not 0x400 (1,024)
bytes in size... and the actual faulting offset seems to be 0xdffffc0000000080
which is 0x80 off from some KASAN-specified value?

This would be vma->vm_file. But that also doesn't really make any sense.

But I wonder...

I see in the report at [0] that there's a failure injection in vm_area_dup() on
fork:

[ 73.842623][ T5318] ? kmem_cache_alloc_noprof+0x48/0x380
[ 73.844725][ T5318] ? __pfx___might_resched+0x10/0x10
[ 73.846687][ T5318] should_fail_ex+0x3b0/0x4e0
[ 73.848496][ T5318] should_failslab+0xac/0x100
[ 73.850232][ T5318] ? vm_area_dup+0x27/0x290
[ 73.852017][ T5318] kmem_cache_alloc_noprof+0x70/0x380
[ 73.854011][ T5318] vm_area_dup+0x27/0x290
[ 73.855771][ T5318] copy_mm+0xc1d/0x1f90

I also see in the fork logic we have the following code on error path:

mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
mas_store(&vmi.mas, XA_ZERO_ENTRY);

And XA_ZERO_ENTRY is 0x406.
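
As an aside, 0x406 falls straight out of the XArray internal-entry encoding:
in include/linux/xarray.h, xa_mk_internal(v) is ((v) << 2 | 2) and
XA_ZERO_ENTRY is xa_mk_internal(257). A throwaway userspace check of the
arithmetic:

#include <stdio.h>

int main(void)
{
	/* xa_mk_internal(v) encodes v as ((v) << 2 | 2); XA_ZERO_ENTRY is
	 * xa_mk_internal(257) - see include/linux/xarray.h. */
	unsigned long xa_zero = (257UL << 2) | 2;

	printf("%#lx\n", xa_zero); /* prints 0x406 - the value in R13 */
	return 0;
}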

Now if _somehow_ the VMA was being looked up without XA_ZERO_ENTRY being
properly accounted for, this might explain it, and why all the !vma logic would
be bypassed.

[0]: https://syzkaller.appspot.com/x/log.txt?x=17f85fc0580000

I mean the weird thing for me here is that mtree_load() has:

if (xa_is_zero(entry))
return NULL;

So you'd think it'd pick this up, but maybe if we're not actually holding
the right lock we get a partial write/race of some kind
and... yeah. Anything's possible then I suppose...

> CPU: 0 UID: 0 PID: 5319 Comm: syz.0.0 Not tainted 6.13.0-rc1-syzkaller-00025-gfeffde684ac2 #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
> RIP: 0010:anon_vma_compatible mm/vma.c:1804 [inline]

This is in:

static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
{
return a->vm_end == b->vm_start && <-- this line

This suggests that either a->vm_end (offset 0x8 into the VMA) or b->vm_start
(offset 0 into the VMA) is being null pointer deref'd, assuming the compiler is
specifically referring to this _typographical_ line rather than the expression
as a whole.

> RIP: 0010:reusable_anon_vma mm/vma.c:1837 [inline]
> RIP: 0010:find_mergeable_anon_vma+0x1e4/0x8f0 mm/vma.c:1863
> Code: 00 00 00 00 fc ff df 41 80 3c 06 00 74 08 4c 89 ff e8 10 39 10 00 4d 8b 37 4d 89 ec 49 c1 ec 03 48 b8 00 00 00 00 00 fc ff df <41> 80 3c 04 00 74 08 4c 89 ef e8 ed 38 10 00 49 8b 5d 00 4c 89 f7
> RSP: 0018:ffffc9000d3df500 EFLAGS: 00010203
> RAX: dffffc0000000000 RBX: ffffc9000d3df540 RCX: ffff88801cf80000
> RDX: 0000000000000000 RSI: ffffffff900062a0 RDI: 0000000000000000
> RBP: ffffc9000d3df610 R08: 0000000000000005 R09: ffffffff8bc6b642
> R10: 0000000000000003 R11: ffff88801cf80000 R12: 0000000000000080
> R13: 0000000000000406 R14: 0000000021000000 R15: ffff8880120d4ca0
> FS: 00007f137f7e86c0(0000) GS:ffff88801fc00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020000140 CR3: 0000000040256000 CR4: 0000000000352ef0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> __anon_vma_prepare+0xd9/0x4a0 mm/rmap.c:199
> anon_vma_prepare include/linux/rmap.h:164 [inline]
> uprobe_write_opcode+0x1a95/0x2d80 kernel/events/uprobes.c:516

Here we find the VMA via:

old_page = get_user_page_vma_remote(mm, vaddr, gup_flags, &vma);

Actually one unfortunate thing here is... ugh god.

I think there might be a bug in get_user_page_vma_remote()...

I will check in more detail but I don't see anything that will prevent the
mmap lock from being dropped before we perform the
vma_lookup()... FOLL_UNLOCKABLE will be set due to the &local_lock
shenanigans in get_user_pages_remote(), and if we get a page after a
dropped lock and try to vma_lookup() we could be racing... :/

Let me look into that more...

Jann Horn

Dec 9, 2024, 8:36:07 AM
to Lorenzo Stoakes, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
On Mon, Dec 9, 2024 at 1:53 PM Lorenzo Stoakes
<lorenzo...@oracle.com> wrote:
> On Mon, Dec 09, 2024 at 03:20:19AM -0800, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit: feffde684ac2 Merge tag 'for-6.13-rc1-tag' of git://git.ker..
> > git tree: upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=17f85fc0580000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=50c7a61469ce77e7
> > dashboard link: https://syzkaller.appspot.com/bug?extid=2d788f4f7cb660dac4b7
> > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
>
> Points to this being racey.
>
> >
> > Downloadable assets:
> > disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7feb34a89c2a/non_bootable_disk-feffde68.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/6135c7297e8e/vmlinux-feffde68.xz
> > kernel image: https://storage.googleapis.com/syzbot-assets/6c154fdcc9cb/bzImage-feffde68.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+2d788f...@syzkaller.appspotmail.com
> >
> > Oops: general protection fault, probably for non-canonical address 0xdffffc0000000080: 0000 [#1] PREEMPT SMP KASAN NOPTI
> > KASAN: null-ptr-deref in range [0x0000000000000400-0x0000000000000407]
>
> This doesn't make a huge amount of sense to me, the VMA is not 0x400 (1,024)
> bytes in size... and the actual faulting offset seems to be 0xdffffc0000000080
> which is 0x80 off from some KASAN-specified value?

If you look at the disassembly, you can see this:

13: 4d 89 ec mov %r13,%r12
16: 49 c1 ec 03 shr $0x3,%r12
1a: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
21: fc ff df
* 24: 41 80 3c 04 00 cmpb $0x0,(%r12,%rax,1) <-- trapping instruction

R13 is 0000000000000406, that's the address we're about to access.
This code is trying to read KASAN shadow memory for that address by
reading from 0xdffffc0000000000 + (address >> 3), which for real kernel
addresses gives you an address in the "KASAN shadow memory" range (see
https://kernel.org/doc/html/latest/arch/x86/x86_64/mm.html), but for
addresses in the low half of the address space gives you non-canonical
addresses starting with 0xdfff that cause #GP on access.
The second line "KASAN: null-ptr-deref in range
[0x0000000000000400-0x0000000000000407]" is basically computed by
doing that calculation in reverse.
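
A throwaway userspace sketch of that forward mapping, using the usual x86-64
generic-KASAN shadow offset and scale (the constants are assumptions taken
from the docs above, not read out of this particular kernel build):

#include <stdint.h>
#include <stdio.h>

#define KASAN_SHADOW_OFFSET 0xdffffc0000000000ULL /* x86-64 generic KASAN */
#define KASAN_SHADOW_SCALE_SHIFT 3

static uint64_t shadow_of(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
}

int main(void)
{
	/* 0x406 is the value in R13; its shadow byte lands at a
	 * non-canonical 0xdfff... address, hence the #GP on access. */
	printf("%#llx\n", (unsigned long long)shadow_of(0x406));
	/* prints 0xdffffc0000000080 - the faulting address in the oops */
	return 0;
}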

> This would be vma->vm_file. But that also doesn't really make any sense.
>
> But I wonder...
>
> I see in the report at [0] that there's a failure injection in vm_area_dup() on
> fork:
>
> [ 73.842623][ T5318] ? kmem_cache_alloc_noprof+0x48/0x380
> [ 73.844725][ T5318] ? __pfx___might_resched+0x10/0x10
> [ 73.846687][ T5318] should_fail_ex+0x3b0/0x4e0
> [ 73.848496][ T5318] should_failslab+0xac/0x100
> [ 73.850232][ T5318] ? vm_area_dup+0x27/0x290
> [ 73.852017][ T5318] kmem_cache_alloc_noprof+0x70/0x380
> [ 73.854011][ T5318] vm_area_dup+0x27/0x290
> [ 73.855771][ T5318] copy_mm+0xc1d/0x1f90
>
> I also see in the fork logic we have the following code on error path:
>
> mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
> mas_store(&vmi.mas, XA_ZERO_ENTRY);
>
> And XA_ZERO_ENTRY is 0x406.

That matches...

Lorenzo Stoakes

Dec 9, 2024, 8:41:27 AM
to syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
OK I think I know what's going on.

As mentioned below fault injection results in:

mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
mas_store(&vmi.mas, XA_ZERO_ENTRY);

Being set in the mm.

Then, in find_mergeable_anon_vma():

struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
{
...

/* Try next first. */
next = vma_iter_load(&vmi);
if (next) {
anon_vma = reusable_anon_vma(next, vma, next);
...
}

...
}

So we use vma_iter_load() -> mas_walk(), which can return an XA_ZERO_ENTRY.

So here next might be equal to XA_ZERO_ENTRY.

Then in reusable_anon_vma(), where b == next == XA_ZERO_ENTRY:

static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old,
struct vm_area_struct *a,
struct vm_area_struct *b)
{
if (anon_vma_compatible(a, b)) {
...
}
...
}

And in anon_vma_compatible():

static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *b)
{
return a->vm_end == b->vm_start && ...
}

So b->vm_start, which is at offset 0 into the vm_area_struct, attempts an
access at 0x406 and *boom*.

So we need to be a lot more careful about our use of XA_ZERO_ENTRY.

As per Jann's reply to the thread, R13 holds the actual address being
dereferenced, which is 0x406, i.e. XA_ZERO_ENTRY, so I think this explanation
is pretty much confirmed.
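
Purely to illustrate the failure mode (this is not a proposed fix - the real
question is whether the walk should ever surface an internal entry here at
all), the kind of guard that would avoid the dereference is something like:

/* Illustrative only: treat an XArray internal entry coming back from the
 * walk as "no VMA" rather than dereferencing it as a vm_area_struct. */
next = vma_iter_load(&vmi);
if (next && !xa_is_zero(next)) {
	anon_vma = reusable_anon_vma(next, vma, next);
	if (anon_vma)
		return anon_vma;
}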

Liam - thoughts?

Jann Horn

Dec 9, 2024, 8:52:57 AM
to Lorenzo Stoakes, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
On Mon, Dec 9, 2024 at 1:53 PM Lorenzo Stoakes
<lorenzo...@oracle.com> wrote:
>
You fixed another issue in this area a month ago, right?
(https://project-zero.issues.chromium.org/373391951,
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f64e67e5d3a45a4a04286c47afade4b518acd47b,
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=985da552a98e27096444508ce5d853244019111f)

And we came to the conclusion that MMs whose VMAs have not been
completely copied and might have XA_ZERO_ENTRY entries left should
never become visible to anything other than the MM teardown code?
Hm, aren't we holding an mmap_write_lock() across the whole operation
in register_for_each_vma()? I don't think FOLL_UNLOCKABLE will be set,
the call from get_user_pages_remote() to is_valid_gup_args() passes
the caller's "locked" parameter, not &local_locked.

Lorenzo Stoakes

Dec 9, 2024, 8:54:57 AM
to Jann Horn, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
Ah thanks.

>
> > This would be vma->vm_file. But that also doesn't really make any sense.
> >
> > But I wonder...
> >
> > I see in the report at [0] that there's a failure injection in vm_area_dup() on
> > fork:
> >
> > [ 73.842623][ T5318] ? kmem_cache_alloc_noprof+0x48/0x380
> > [ 73.844725][ T5318] ? __pfx___might_resched+0x10/0x10
> > [ 73.846687][ T5318] should_fail_ex+0x3b0/0x4e0
> > [ 73.848496][ T5318] should_failslab+0xac/0x100
> > [ 73.850232][ T5318] ? vm_area_dup+0x27/0x290
> > [ 73.852017][ T5318] kmem_cache_alloc_noprof+0x70/0x380
> > [ 73.854011][ T5318] vm_area_dup+0x27/0x290
> > [ 73.855771][ T5318] copy_mm+0xc1d/0x1f90
> >
> > I also see in the fork logic we have the following code on error path:
> >
> > mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
> > mas_store(&vmi.mas, XA_ZERO_ENTRY);
> >
> > And XA_ZERO_ENTRY is 0x406.
>
> That matches...

And I wasn't aware that R13 was equal to the _actual_ address dereferenced -
really useful to know. I mentioned it in my mega reply where I figured out
how we end up trying to deref this... :) Yes, I think this confirms the
theory.

Lorenzo Stoakes

Dec 9, 2024, 8:58:32 AM
to Jann Horn, syzbot, Liam.H...@oracle.com, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
That's for ksm/uffd though, neither pertinent here.

>
> And we came to the conclusion that MMs whose VMAs have not been
> completely copied and might have XA_ZERO_ENTRY entries left should
> never become visible to anything other than the MM teardown code?

Well if we came to that conclusion, it was wrong! :)

Error paths at play again. I mean I think probably the slab allocation is 'too
small to fail' _in reality_. But somebody will point out some horrendous way
involving a fatal signal or what-not where we could hit this. Maybe.
Yeah, I was just about to reply saying this - that code should be cleaned up
a bit to make it clear... But yeah, it's the bool *locked of the invoker, and
can't be &local_locked.

So yes this rules out get_user_page_vma_remote() as a problem, which is
good, because I wrote that :P

Liam R. Howlett

Dec 9, 2024, 10:34:05 AM
to Lorenzo Stoakes, Jann Horn, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
* Lorenzo Stoakes <lorenzo...@oracle.com> [241209 08:58]:
The mm_struct isn't fully initialized at this point - and won't be once
the dup_mmap() fails. How exactly are we getting to this point in the
first place?

I have some ideas on fixing this particular issue in the not fully
initialised mm structure, but we will still be using a
not-fully-initialised mm structure and that sounds wrong on a whole
other level.

Thanks,
Liam


Lorenzo Stoakes

Dec 9, 2024, 10:53:25 AM
to Liam R. Howlett, Jann Horn, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
It seems like uprobe can still connect at least via bpf... it uses
dup_mmap_sem to prevent races with dup_mmap(), but then in no way checks to
see if the fork _succeeded_ and assumes that the uprobe is good to go.

I wonder if we can tell uprobe... not to do this in that case :)

Maybe some MMF_xxx flag could help us? I guess we're full up there... but maybe
MMF_UNSTABLE somehow?

>
> Thanks,
> Liam
>
>

Liam R. Howlett

Dec 9, 2024, 11:13:04 AM
to Lorenzo Stoakes, mhir...@kernel.org, ol...@redhat.com, pet...@infradead.org, Jann Horn, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
+Cc maintainers listed of kernel/events/uprobe.c

TL;DR:
dup_mmap() fails, but uprobe thinks it's fine and keeps trying to use an
incomplete mm_struct.

We're looking for a way to signal to uprobe to abort, cleanly.

Looking at kernel/fork.c, dup_mmap():

fail_uprobe_end:
uprobe_end_dup_mmap();
return retval;

So uprobe is aware it could fail, but releases the semaphore and then
doesn't check if the mm struct is okay to use.

What should happen in the failed mm_struct case?

Thanks,
Liam

* Lorenzo Stoakes <lorenzo...@oracle.com> [241209 10:53]:

Lorenzo Stoakes

Dec 9, 2024, 12:09:30 PM
to Liam R. Howlett, mhir...@kernel.org, ol...@redhat.com, pet...@infradead.org, Jann Horn, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
On Mon, Dec 09, 2024 at 11:12:52AM -0500, Liam R. Howlett wrote:
> +Cc maintainers listed of kernel/events/uprobe.c
>
> TL;DR:
> dup_mmap() fails, but uprobe thinks it's fine and keeps trying to use an
> incomplete mm_struct.
>
> We're looking for a way to signal to uprobe to abort, cleanly.
>
> Looking at kernel/fork.c, dup_mmap():
>
> fail_uprobe_end:
> uprobe_end_dup_mmap();
> return retval;
>
> So uprobe is aware it could fail, but releases the semaphore and then
> doesn't check if the mm struct is okay to use.
>
> What should happen in the failed mm_struct case?
>
> Thanks,
> Liam
>

(As discussed on IRC) how about moving up the dup_mmap_sem lock one level, we
can put the mm before the rmap lookup in build_map_info() is able to find it,
which should avoid the whole issue?

Untested patch attached.

----8<----
From 629b04fe8cfdf6b4fad0ff91a316d4643ccd737d Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo...@oracle.com>
Date: Mon, 9 Dec 2024 16:58:14 +0000
Subject: [PATCH] tmp

---
kernel/fork.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 1450b461d196..4d62e776c413 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -639,7 +639,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
LIST_HEAD(uf);
VMA_ITERATOR(vmi, mm, 0);

- uprobe_start_dup_mmap();
if (mmap_write_lock_killable(oldmm)) {
retval = -EINTR;
goto fail_uprobe_end;
@@ -783,7 +782,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
else
dup_userfaultfd_fail(&uf);
fail_uprobe_end:
- uprobe_end_dup_mmap();
return retval;

fail_nomem_anon_vma_fork:
@@ -1692,9 +1690,11 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
if (!mm_init(mm, tsk, mm->user_ns))
goto fail_nomem;

+ uprobe_start_dup_mmap();
err = dup_mmap(mm, oldmm);
if (err)
goto free_pt;
+ uprobe_end_dup_mmap();

mm->hiwater_rss = get_mm_rss(mm);
mm->hiwater_vm = mm->total_vm;
@@ -1709,6 +1709,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
mm->binfmt = NULL;
mm_init_owner(mm, NULL);
mmput(mm);
+ uprobe_end_dup_mmap();

fail_nomem:
return NULL;
--
2.47.1

Oleg Nesterov

Dec 10, 2024, 10:06:03 AM
to Lorenzo Stoakes, Liam R. Howlett, mhir...@kernel.org, pet...@infradead.org, Jann Horn, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
On 12/09, Lorenzo Stoakes wrote:
>
> (As discussed on IRC) how about moving up the dup_mmap_sem lock one level, we
> can put the mm before the rmap lookup in build_map_info() is able to find it,
> which should avoid the whole issue?

Not sure I fully understand the problem, but so far I see nothing wrong in
this idea. However,

> @@ -1692,9 +1690,11 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
> if (!mm_init(mm, tsk, mm->user_ns))
> goto fail_nomem;
>
> + uprobe_start_dup_mmap();
> err = dup_mmap(mm, oldmm);
> if (err)
> goto free_pt;
> + uprobe_end_dup_mmap();

If try_module_get(mm->binfmt->module)) fails after that, dup_mm() does
"goto free_pt;" and in this case ...

> @@ -1709,6 +1709,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
> mm->binfmt = NULL;
> mm_init_owner(mm, NULL);
> mmput(mm);
> + uprobe_end_dup_mmap();

... we have the unbalanced uprobe_end_dup_mmap().

Also. Perhaps we can change dup_mmap() to set MMF_XXX before uprobe_end_dup_mmap(),

fail_uprobe_end:
+ if (retval)
+ set_bit(MMF_XXX, &mm->flags);
uprobe_end_dup_mmap();
return retval;

Then build_map_info() can check this flag. I guess we can reuse some of the
MMF_OOM_ bits? Maybe MMF_UNSTABLE...
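
Consumer side, just as a sketch (MMF_XXX is a placeholder name, and whether
build_map_info()'s rmap walk is the right place to check is an open question):

/* Sketch only: in build_map_info(), skip any mm flagged as a failed,
 * partially-copied dup_mmap(). */
if (test_bit(MMF_XXX, &vma->vm_mm->flags))
	continue;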

Oleg.

Lorenzo Stoakes

Dec 10, 2024, 10:15:12 AM
to Oleg Nesterov, Liam R. Howlett, mhir...@kernel.org, pet...@infradead.org, Jann Horn, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
On Tue, Dec 10, 2024 at 04:05:28PM +0100, Oleg Nesterov wrote:
> On 12/09, Lorenzo Stoakes wrote:
> >
> > (As discussed on IRC) how about moving up the dup_mmap_sem lock one level, we
> > can put the mm before the rmap lookup in build_map_info() is able to find it,
> > which should avoid the whole issue?
>
> Not sure I fully understand the problem, but so far I see nothing wrong in
> this idea. However,
>
> > @@ -1692,9 +1690,11 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
> > if (!mm_init(mm, tsk, mm->user_ns))
> > goto fail_nomem;
> >
> > + uprobe_start_dup_mmap();
> > err = dup_mmap(mm, oldmm);
> > if (err)
> > goto free_pt;
> > + uprobe_end_dup_mmap();
>
> If try_module_get(mm->binfmt->module)) fails after that, dup_mm() does
> "goto free_pt;" and in this case ...
>
> > @@ -1709,6 +1709,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk,
> > mm->binfmt = NULL;
> > mm_init_owner(mm, NULL);
> > mmput(mm);
> > + uprobe_end_dup_mmap();
>
> ... we have the unbalanced uprobe_end_dup_mmap().

Right yeah we can obviously fix that, good spot though.
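
Something along these lines, perhaps (untested sketch on top of the earlier
patch; the idea being that when dup_mmap() itself fails we only drop
dup_mmap_sem once the partial mm has been put, while the try_module_get()
failure path never touches it):

	uprobe_start_dup_mmap();
	err = dup_mmap(mm, oldmm);
	if (err) {
		/* mirror free_pt's teardown, then drop the sem */
		mm->binfmt = NULL;
		mm_init_owner(mm, NULL);
		mmput(mm);
		uprobe_end_dup_mmap();
		goto fail_nomem;
	}
	uprobe_end_dup_mmap();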

>
> Also. Perhaps we can change dup_mmap() to set MMF_XXX before uprobe_end_dup_mmap(),
>
> fail_uprobe_end:
> + if (retval)
> + set_bit(MMF_XXX, &mm->flags);
> uprobe_end_dup_mmap();
> return retval;
>
> Then build_map_info() can check this flag. I guess we can reuse some of the
> MMF_OOM_ bits? Maybe MMF_UNSTABLE...

Well this was my initial suggestion :) There is some concern as to how this will
interact with OOM however.

But I think my original suggestion, modified to fix the issue you rightly raise,
is a good proximate solution to keep things simple.

>
> Oleg.
>

Peter Zijlstra

Dec 10, 2024, 10:19:07 AM
to Lorenzo Stoakes, Liam R. Howlett, mhir...@kernel.org, ol...@redhat.com, Jann Horn, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, syzkall...@googlegroups.com, vba...@suse.cz
On Mon, Dec 09, 2024 at 05:09:13PM +0000, Lorenzo Stoakes wrote:
> On Mon, Dec 09, 2024 at 11:12:52AM -0500, Liam R. Howlett wrote:
> > +Cc maintainers listed of kernel/events/uprobe.c
> >
> > TL;DR:
> > dup_mmap() fails, but uprobe thinks it's fine and keeps trying to use an
> > incomplete mm_struct.
> >
> > We're looking for a way to signal to uprobe to abort, cleanly.
> >
> > Looking at kernel/fork.c, dup_mmap():
> >
> > fail_uprobe_end:
> > uprobe_end_dup_mmap();
> > return retval;
> >
> > So uprobe is aware it could fail, but releases the semaphore and then
> > doesn't check if the mm struct is okay to use.
> >
> > What should happen in the failed mm_struct case?
> >
> > Thanks,
> > Liam
> >
>
> (As discussed on IRC) how about moving up the dup_mmap_sem lock one level, we
> can put the mm before the rmap lookup in build_map_info() is able to find it,
> which should avoid the whole issue?
>
> Untested patch attached.

Urgh, bit of a maze this. So BPF does a uprobe_register(), which walks
rmap and finds an incomplete mm.

uprobe_{start,end}_dup_mmap() serialize uprobe_register(), but are too
narrow to cover the whole fail case.

So *bang* happens.

The below moves this serialization up to cover the whole of dup_mmap(),
such that register can no longer find partial mm's in the rmap.

Which will cure the problem, but it does leave me uncomfortable vs other
rmap users.

Also, you need to fix the fail_uprobe_end label; that's left dangling as is.
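
Something like this on the early exit, presumably (sketch only), since with
the calls hoisted into dup_mm() the label body is just the return:

	/* With uprobe_{start,end}_dup_mmap() moved up to dup_mm(), the
	 * fail_uprobe_end label in dup_mmap() is just "return retval;", so
	 * the early exit can return directly and the label can be renamed
	 * or dropped. */
	if (mmap_write_lock_killable(oldmm))
		return -EINTR;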