KASAN: use-after-free Read in __do_page_fault


syzbot

Oct 30, 2017, 3:12:05 PM
to JBeu...@suse.com, h...@zytor.com, jpoi...@redhat.com, kirill....@linux.intel.com, ldu...@linux.vnet.ibm.com, linux-...@vger.kernel.org, lu...@kernel.org, mi...@redhat.com, syzkall...@googlegroups.com, tg...@linutronix.de, x...@kernel.org
Hello,

syzkaller hit the following crash on
887c8ba753fbe809ba93fa3cfd0cc46db18d37d4
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
compiler: gcc (GCC) 7.1.1 20170620
.config is attached
Raw console output is attached.

syzkaller reproducer is attached. See https://goo.gl/kgGztJ
for information about syzkaller reproducers


BUG: KASAN: use-after-free in arch_local_irq_enable
arch/x86/include/asm/paravirt.h:787 [inline]
BUG: KASAN: use-after-free in __do_page_fault+0xc03/0xd60
arch/x86/mm/fault.c:1357
Read of size 8 at addr ffff8801cbfd3090 by task syz-executor7/3660

CPU: 1 PID: 3660 Comm: syz-executor7 Not tainted 4.14.0-rc3+ #23
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:52
print_address_description+0x73/0x250 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351 [inline]
kasan_report+0x25b/0x340 mm/kasan/report.c:409
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
arch_local_irq_enable arch/x86/include/asm/paravirt.h:787 [inline]
__do_page_fault+0xc03/0xd60 arch/x86/mm/fault.c:1357
do_page_fault+0xee/0x720 arch/x86/mm/fault.c:1520
page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1066
RIP: 0023:0x8073f4f
RSP: 002b:00000000f7f89bd0 EFLAGS: 00010202
RAX: 00000000f7f89c8c RBX: 0000000000000400 RCX: 000000000000000e
RDX: 00000000f7f8aa88 RSI: 0000000020012fe0 RDI: 00000000f7f89c8c
RBP: 0000000008128000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Allocated by task 3660:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
kmem_cache_alloc+0x12e/0x760 mm/slab.c:3561
kmem_cache_zalloc include/linux/slab.h:656 [inline]
mmap_region+0x7ee/0x15a0 mm/mmap.c:1658
do_mmap+0x6a1/0xd50 mm/mmap.c:1468
do_mmap_pgoff include/linux/mm.h:2150 [inline]
vm_mmap_pgoff+0x1de/0x280 mm/util.c:333
SYSC_mmap_pgoff mm/mmap.c:1518 [inline]
SyS_mmap_pgoff+0x23b/0x5f0 mm/mmap.c:1476
do_syscall_32_irqs_on arch/x86/entry/common.c:329 [inline]
do_fast_syscall_32+0x3f2/0xf05 arch/x86/entry/common.c:391
entry_SYSENTER_compat+0x51/0x60 arch/x86/entry/entry_64_compat.S:124

Freed by task 3667:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
__cache_free mm/slab.c:3503 [inline]
kmem_cache_free+0x77/0x280 mm/slab.c:3763
remove_vma+0x162/0x1b0 mm/mmap.c:176
remove_vma_list mm/mmap.c:2475 [inline]
do_munmap+0x82a/0xdf0 mm/mmap.c:2714
mmap_region+0x59e/0x15a0 mm/mmap.c:1631
do_mmap+0x6a1/0xd50 mm/mmap.c:1468
do_mmap_pgoff include/linux/mm.h:2150 [inline]
vm_mmap_pgoff+0x1de/0x280 mm/util.c:333
SYSC_mmap_pgoff mm/mmap.c:1518 [inline]
SyS_mmap_pgoff+0x23b/0x5f0 mm/mmap.c:1476
do_syscall_32_irqs_on arch/x86/entry/common.c:329 [inline]
do_fast_syscall_32+0x3f2/0xf05 arch/x86/entry/common.c:391
entry_SYSENTER_compat+0x51/0x60 arch/x86/entry/entry_64_compat.S:124

The buggy address belongs to the object at ffff8801cbfd3040
which belongs to the cache vm_area_struct of size 200
The buggy address is located 80 bytes inside of
200-byte region [ffff8801cbfd3040, ffff8801cbfd3108)
The buggy address belongs to the page:
page:ffffea00072ff4c0 count:1 mapcount:0 mapping:ffff8801cbfd3040 index:0x0
flags: 0x200000000000100(slab)
raw: 0200000000000100 ffff8801cbfd3040 0000000000000000 000000010000000f
raw: ffffea000730c7a0 ffffea00072ff7a0 ffff8801dae069c0 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff8801cbfd2f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff8801cbfd3000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> ffff8801cbfd3080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8801cbfd3100: fb fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb
ffff8801cbfd3180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzk...@googlegroups.com.

syzbot will keep track of this bug report.
Once a fix for this bug is committed, please reply to this email with:
#syz fix: exact-commit-title
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug
report.
config.txt
raw.log
repro.txt

Dmitry Vyukov

Oct 30, 2017, 3:15:30 PM
to syzbot, JBeu...@suse.com, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, ldu...@linux.vnet.ibm.com, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linu...@kvack.org
On Mon, Oct 30, 2017 at 10:12 PM, syzbot
<bot+6a5269ce759a7bb127...@syzkaller.appspotmail.com>
wrote:
I guess this is more related to mm than to x86, so +mm maintainers.
This continues to happen, in particular on upstream
781402340475144bb360e32bb7437fa4b84cadc3 (Oct 28).



Vlastimil Babka

Oct 31, 2017, 8:00:27 AM
to Dmitry Vyukov, syzbot, JBeu...@suse.com, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, ldu...@linux.vnet.ibm.com, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linu...@kvack.org
On 10/30/2017 08:15 PM, Dmitry Vyukov wrote:
> On Mon, Oct 30, 2017 at 10:12 PM, syzbot
> <bot+6a5269ce759a7bb127...@syzkaller.appspotmail.com>
> wrote:
>> Hello,
>>
>> syzkaller hit the following crash on
>> 887c8ba753fbe809ba93fa3cfd0cc46db18d37d4
>> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached
>> Raw console output is attached.
>>
>> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>> for information about syzkaller reproducers
>>
>>
>> BUG: KASAN: use-after-free in arch_local_irq_enable
>> arch/x86/include/asm/paravirt.h:787 [inline]
>> BUG: KASAN: use-after-free in __do_page_fault+0xc03/0xd60
>> arch/x86/mm/fault.c:1357
>> Read of size 8 at addr ffff8801cbfd3090 by task syz-executor7/3660

Why would local_irq_enable() touch a vma object? Is the stack unwinder
confused or what?
arch/x86/mm/fault.c:1357 means the "else" path of if (user_mode(regs)),
but the page fault's RIP is userspace? Strange.
This would mean that mmap_sem is not doing its job and we raced with a
vma removal. Or the rbtree is broken and contains a vma that has been
freed. Hmm, or the vmacache is broken? You could try removing the 3
lines starting with vmacache_find() in find_vma().

>> The buggy address belongs to the object at ffff8801cbfd3040
>> which belongs to the cache vm_area_struct of size 200
>> The buggy address is located 80 bytes inside of
>> 200-byte region [ffff8801cbfd3040, ffff8801cbfd3108)

My vm_area_struct is 192 bytes; it could be that your layout is
different due to .config. At offset 80 I have vma->vm_flags. That is
checked by __do_page_fault(), but only after vma->vm_start (offset 0).
Of course, reordering is possible.

Dmitry Vyukov

Oct 31, 2017, 8:42:28 AM
to Vlastimil Babka, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, ldu...@linux.vnet.ibm.com, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linu...@kvack.org
It seems that the compiler over-optimizes things and messes up the debug info.
I just re-reproduced this on upstream
15f859ae5c43c7f0a064ed92d33f7a5bc5de6de0 and got the same report:

==================================================================
BUG: KASAN: use-after-free in arch_local_irq_enable
arch/x86/include/asm/paravirt.h:787 [inline]
BUG: KASAN: use-after-free in __do_page_fault+0xc03/0xd60
arch/x86/mm/fault.c:1357
Read of size 8 at addr ffff880064d19aa0 by task syz-executor/8001

CPU: 0 PID: 8001 Comm: syz-executor Not tainted 4.14.0-rc6+ #12
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x194/0x257 lib/dump_stack.c:52
print_address_description+0x73/0x250 mm/kasan/report.c:252
kasan_report_error mm/kasan/report.c:351 [inline]
kasan_report+0x25b/0x340 mm/kasan/report.c:409
__asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
arch_local_irq_enable arch/x86/include/asm/paravirt.h:787 [inline]
__do_page_fault+0xc03/0xd60 arch/x86/mm/fault.c:1357
do_page_fault+0xee/0x720 arch/x86/mm/fault.c:1520
do_async_page_fault+0x82/0x110 arch/x86/kernel/kvm.c:273
async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
RIP: 0033:0x441bd0
RSP: 002b:00007f2ed8229798 EFLAGS: 00010202
RAX: 00007f2ed82297c0 RBX: 0000000000000000 RCX: 000000000000000e
RDX: 0000000000000400 RSI: 0000000020012fe0 RDI: 00007f2ed82297c0
RBP: 0000000000748020 R08: 0000000000000400 R09: 0000000000000000
R10: 0000000020012fee R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000008430 R14: 00000000006ec4d0 R15: 00007f2ed822a700

Allocated by task 8001:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:489
kmem_cache_alloc+0x12e/0x760 mm/slab.c:3561
kmem_cache_zalloc include/linux/slab.h:656 [inline]
mmap_region+0x7ee/0x15a0 mm/mmap.c:1658
do_mmap+0x69b/0xd40 mm/mmap.c:1468
do_mmap_pgoff include/linux/mm.h:2150 [inline]
vm_mmap_pgoff+0x1de/0x280 mm/util.c:333
SYSC_mmap_pgoff mm/mmap.c:1518 [inline]
SyS_mmap_pgoff+0x23b/0x5f0 mm/mmap.c:1476
SYSC_mmap arch/x86/kernel/sys_x86_64.c:99 [inline]
SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:90
entry_SYSCALL_64_fastpath+0x1f/0xbe

Freed by task 8007:
save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
save_stack+0x43/0xd0 mm/kasan/kasan.c:447
set_track mm/kasan/kasan.c:459 [inline]
kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
__cache_free mm/slab.c:3503 [inline]
kmem_cache_free+0x77/0x280 mm/slab.c:3763
remove_vma+0x162/0x1b0 mm/mmap.c:176
remove_vma_list mm/mmap.c:2475 [inline]
do_munmap+0x82a/0xdf0 mm/mmap.c:2714
mmap_region+0x59e/0x15a0 mm/mmap.c:1631
do_mmap+0x69b/0xd40 mm/mmap.c:1468
do_mmap_pgoff include/linux/mm.h:2150 [inline]
vm_mmap_pgoff+0x1de/0x280 mm/util.c:333
SYSC_mmap_pgoff mm/mmap.c:1518 [inline]
SyS_mmap_pgoff+0x23b/0x5f0 mm/mmap.c:1476
SYSC_mmap arch/x86/kernel/sys_x86_64.c:99 [inline]
SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:90
entry_SYSCALL_64_fastpath+0x1f/0xbe

The buggy address belongs to the object at ffff880064d19a50
which belongs to the cache vm_area_struct of size 200
The buggy address is located 80 bytes inside of
200-byte region [ffff880064d19a50, ffff880064d19b18)
The buggy address belongs to the page:
page:ffffea0001934640 count:1 mapcount:0 mapping:ffff880064d19000 index:0x0
flags: 0x100000000000100(slab)
raw: 0100000000000100 ffff880064d19000 0000000000000000 000000010000000f
raw: ffffea00018a3a60 ffffea0001940be0 ffff88006c5f79c0 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff880064d19980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff880064d19a00: fb fb fc fc fc fc fc fc fc fc fb fb fb fb fb fb
>ffff880064d19a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff880064d19b00: fb fb fb fc fc fc fc fc fc fc fc fb fb fb fb fb
ffff880064d19b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


Here is a disassembly of the function:
https://gist.githubusercontent.com/dvyukov/5a56c66ce605168c951a321d94df6e3a/raw/538d4ce72ceb5631dfcc866ccde46c74543de1cf/gistfile1.txt

Seems to be vma->vm_flags at offset 80.

I think the size of 200 reported by slab is OK as it can do some rounding.
Everything points to a vma object.

Vlastimil Babka

Oct 31, 2017, 9:20:15 AM
to Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, ldu...@linux.vnet.ibm.com, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linu...@kvack.org
You can see it from the disasm? I can't make much of it, unfortunately,
the added kasan calls obscure it a lot for me. But I suspect it might be
the vma_pkey() thing which reads from vma->vm_flags. What happens when
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS is disabled? (or is it already?)

Also did you try the vmacache shortcut test suggested in my previous mail?

Vlastimil Babka

Oct 31, 2017, 9:58:00 AM
to Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, ldu...@linux.vnet.ibm.com, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linu...@kvack.org, Andrea Arcangeli, Linus Torvalds, Thorsten Leemhuis
+CC Andrea, Thorsten, Linus
OK, so I opened the google groups link in the report's signature and
looked at the attached config there, which says protkeys are enabled.
Also looked at the repro.txt attachment:
#{Threaded:true Collide:true Repeat:true Procs:8 Sandbox:none Fault:false FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true WaitRepeat:true Debug:false Repro:false}
mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
mmap(&(0x7f0000011000/0x3000)=nil, 0x3000, 0x1, 0x32, 0xffffffffffffffff, 0x0)
r0 = userfaultfd(0x0)
ioctl$UFFDIO_API(r0, 0xc018aa3f, &(0x7f0000002000-0x18)={0xaa, 0x0, 0x0})
ioctl$UFFDIO_REGISTER(r0, 0xc020aa00, &(0x7f0000019000)={{&(0x7f0000012000/0x2000)=nil, 0x2000}, 0x1, 0x0})
r1 = gettid()
syz_open_dev$evdev(&(0x7f0000013000-0x12)="2f6465762f696e7075742f6576656e742300", 0x0, 0x0)
tkill(r1, 0x7)

The userfaultfd() caught my attention, so I checked handle_userfault(),
which seems to do up_read(&mm->mmap_sem); in some cases this is later
followed by down_read(&mm->mmap_sem); and a return of VM_FAULT_NOPAGE.
However, __do_page_fault() only expects mmap_sem to be released when
handle_mm_fault() returns with VM_FAULT_RETRY. It doesn't expect it to
be released and then acquired again, because then the vma can indeed be
gone. It seems the vma wasn't touched after that point until the
vma_pkey() read was added by commit a3c4fb7c9c2e ("x86/mm: Fix fault
error path using unsafe vma pointer") in rc3, which tried to fix a
similar problem but ran into this corner case?

So I suspect a3c4fb7c9c2e is the culprit and thus a regression.

Kirill A. Shutemov

Oct 31, 2017, 10:11:55 AM
to Vlastimil Babka, Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, ldu...@linux.vnet.ibm.com, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linu...@kvack.org, Andrea Arcangeli, Linus Torvalds, Thorsten Leemhuis
I wonder if we can move "pkey = vma_pkey(vma);" before handle_mm_fault()?
pkey can't change during page fault handling, can it?

--
Kirill A. Shutemov

Vlastimil Babka

Oct 31, 2017, 10:28:32 AM
to Kirill A. Shutemov, Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, ldu...@linux.vnet.ibm.com, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linu...@kvack.org, Andrea Arcangeli, Linus Torvalds, Thorsten Leemhuis
Hmm, that could indeed work; Dmitry, can you try the patch below?
But it still seems rather fragile, so I'd hope Andrea can make it more
robust, or at least make sure that we don't reintroduce this kind of
problem in the future (explicitly set vma to NULL with a comment?).

----8<----
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e2baeaa053a5..9bd16fc621db 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1441,6 +1441,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
* the fault. Since we never set FAULT_FLAG_RETRY_NOWAIT, if
* we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
*/
+ pkey = vma_pkey(vma);
fault = handle_mm_fault(vma, address, flags);
major |= fault & VM_FAULT_MAJOR;

@@ -1467,7 +1468,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
return;
}

- pkey = vma_pkey(vma);
up_read(&mm->mmap_sem);
if (unlikely(fault & VM_FAULT_ERROR)) {
mm_fault_error(regs, error_code, address, &pkey, fault);

Linus Torvalds

Oct 31, 2017, 11:37:48 AM
to Vlastimil Babka, Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, Laurent Dufour, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linux-mm, Andrea Arcangeli, Thorsten Leemhuis
On Tue, Oct 31, 2017 at 6:57 AM, Vlastimil Babka <vba...@suse.cz> wrote:
>
> However, __do_page_fault() only expects that mmap_sem to be released
> when handle_mm_fault() returns with VM_FAULT_RETRY. It doesn't expect it
> to be released and then acquired again, because then vma can be indeed
> gone.

Yes. Accessing "vma" after calling "handle_mm_fault()" is a bug. An
unfortunate issue with userfaultfd.

The suggested fix to simply look up pkey beforehand seems sane and simple.

But sadly, from a quick check, it looks like arch/um/ has the same
bug, but even worse. It will do

(a) handle_mm_fault() in a loop without re-calculating vma. Don't ask me why.

(b) flush_tlb_page(vma, address); afterwards

but much more importantly, I think __get_user_pages() is broken in two ways:

- faultin_page() does:

ret = handle_mm_fault(vma, address, fault_flags);
...
if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))

(easily fixed the same way)

- more annoyingly and harder to fix: the retry case in
__get_user_pages(), and the VMA saving there.

Ho humm.

Andrea, looking at that get_user_pages() case, I really think it's
userfaultfd that is broken.

Could we perhaps limit userfaultfd to _only_ do the VM_FAULT_RETRY,
and simply fail for non-retry faults?

Linus

Andrea Arcangeli

Oct 31, 2017, 3:13:41 PM
to Linus Torvalds, Vlastimil Babka, Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, Laurent Dufour, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linux-mm, Thorsten Leemhuis
On Tue, Oct 31, 2017 at 08:37:47AM -0700, Linus Torvalds wrote:
> Yes. Accessing "vma" after calling "handle_mm_fault()" is a bug. An
> unfortunate issue with userfaultfd.
>
> The suggested fix to simply look up pkey beforehand seems sane and simple.

Agreed.

>
> But sadly, from a quick check, it looks like arch/um/ has the same
> bug, but even worse. It will do
>
> (a) handle_mm_fault() in a loop without re-calculating vma. Don't ask me why.
>
> (b) flush_tlb_page(vma, address); afterwards

Yes, that flush_tlb_page is unsafe. Luckily it's only using it for
vma->vm_mm so it doesn't sound major issue to fix it.

>
> but much more importantly, I think __get_user_pages() is broken in two ways:
>
> - faultin_page() does:
>
> ret = handle_mm_fault(vma, address, fault_flags);
> ...
> if ((ret & VM_FAULT_WRITE) && !(vma->vm_flags & VM_WRITE))
>
> (easily fixed the same way)
>
> - more annoyingly and harder to fix: the retry case in
> __get_user_pages(), and the VMA saving there.
>
> Ho humm.
>
> Andrea, looking at that get_user_pages() case, I really think it's
> userfaultfd that is broken.
>
> Could we perhaps limit userfaultfd to _only_ do the VM_FAULT_RETRY,
> and simply fail for non-retry faults?

In the get_user_pages case we already limit it to do only
VM_FAULT_RETRY so no use after free should materialize whenever gup is
involved.

The problematic path for the return to userland (get_user_pages
returns to kernel) is this one:

if (return_to_userland) {
if (signal_pending(current) &&
!fatal_signal_pending(current)) {
/*
* If we got a SIGSTOP or SIGCONT and this is
* a normal userland page fault, just let
* userland return so the signal will be
* handled and gdb debugging works. The page
* fault code immediately after we return from
* this function is going to release the
* mmap_sem and it's not depending on it
* (unlike gup would if we were not to return
* VM_FAULT_RETRY).
*
* If a fatal signal is pending we still take
* the streamlined VM_FAULT_RETRY failure path
* and there's no need to retake the mmap_sem
* in such case.
*/
down_read(&mm->mmap_sem);
ret = VM_FAULT_NOPAGE;
}
}

We could remove the above branch altogether and then
handle_userfault() would always return VM_FAULT_RETRY whenever it
decides to release the mmap_sem. The above makes debugging with gdb
more user friendly and it potentially lowers the latency of signals as
signals can unblock handle_userfault. The downside is that the return
to userland cannot dereference the vma after calling handle_mm_fault.

Thanks,
Andrea

Andrea Arcangeli

Oct 31, 2017, 3:15:09 PM
to Vlastimil Babka, Kirill A. Shutemov, Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, ldu...@linux.vnet.ibm.com, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linu...@kvack.org, Linus Torvalds, Thorsten Leemhuis
On Tue, Oct 31, 2017 at 03:28:26PM +0100, Vlastimil Babka wrote:
> Hmm that could indeed work, Dmitry can you try the patch below?
> But it still seems rather fragile so I'd hope Andrea can do it more
> robust, or at least make sure that we don't reintroduce this kind of
> problem in the future (explicitly set vma to NULL with a comment?).

Reviewed-by: Andrea Arcangeli <aarc...@redhat.com>

Vlastimil Babka

Nov 1, 2017, 3:43:04 AM
to Andrea Arcangeli, Kirill A. Shutemov, Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, ldu...@linux.vnet.ibm.com, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linu...@kvack.org, Linus Torvalds, Thorsten Leemhuis
On 10/31/2017 08:15 PM, Andrea Arcangeli wrote:
> On Tue, Oct 31, 2017 at 03:28:26PM +0100, Vlastimil Babka wrote:
>> Hmm that could indeed work, Dmitry can you try the patch below?
>> But it still seems rather fragile so I'd hope Andrea can do it more
>> robust, or at least make sure that we don't reintroduce this kind of
>> problem in the future (explicitly set vma to NULL with a comment?).
>
> Reviewed-by: Andrea Arcangeli <aarc...@redhat.com>

Thanks. OK so here's the full patch for the immediate issue, unless we
decide to do something more general.

----8<----
From a5f887fcac65372f4e76a290ed59855de0b08e2e Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vba...@suse.cz>
Date: Wed, 1 Nov 2017 08:21:25 +0100
Subject: [PATCH] x86/mm: fix use-after-free of vma during userfaultfd fault

Syzkaller with KASAN has reported a use-after-free of vma->vm_flags in
__do_page_fault() with the following reproducer:

#{Threaded:true Collide:true Repeat:true Procs:8 Sandbox:none Fault:false FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true WaitRepeat:true Debug:false Repro:false}
mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
mmap(&(0x7f0000011000/0x3000)=nil, 0x3000, 0x1, 0x32, 0xffffffffffffffff, 0x0)
r0 = userfaultfd(0x0)
ioctl$UFFDIO_API(r0, 0xc018aa3f, &(0x7f0000002000-0x18)={0xaa, 0x0, 0x0})
ioctl$UFFDIO_REGISTER(r0, 0xc020aa00, &(0x7f0000019000)={{&(0x7f0000012000/0x2000)=nil, 0x2000}, 0x1, 0x0})
r1 = gettid()
syz_open_dev$evdev(&(0x7f0000013000-0x12)="2f6465762f696e7075742f6576656e742300", 0x0, 0x0)
tkill(r1, 0x7)

The vma should be pinned by mmap_sem, but handle_userfault() will in some
scenarios release it and then acquire again, so when we return to
__do_page_fault() with other result than VM_FAULT_RETRY, the vma might be gone.
However, since a3c4fb7c9c2e ("x86/mm: Fix fault error path using unsafe vma
pointer") there is a vma_pkey() read of vma->vm_flags after that point, which
can thus become use-after-free. Fix this by moving the read before calling
handle_mm_fault().

Reported-by: syzbot <bot+6a5269ce759a7bb127...@syzkaller.appspotmail.com>
Reported-by: Dmitry Vyukov <dvy...@google.com>
Suggested-by: Kirill A. Shutemov <kir...@shutemov.name>
Fixes: a3c4fb7c9c2e ("x86/mm: Fix fault error path using unsafe vma pointer")
Reviewed-by: Andrea Arcangeli <aarc...@redhat.com>
Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
arch/x86/mm/fault.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e2baeaa053a5..2f45a959aec2 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1440,7 +1440,13 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
* make sure we exit gracefully rather than endlessly redo
* the fault. Since we never set FAULT_FLAG_RETRY_NOWAIT, if
* we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
+ *
+ * Since handle_userfault() may also release and reacquire mmap_sem
+ * in some scenario (and not return VM_FAULT_RETRY), we have to be
+ * careful about not touching vma after handling the fault. So we
+ * read the pkey beforehand.
*/
+ pkey = vma_pkey(vma);
fault = handle_mm_fault(vma, address, flags);
major |= fault & VM_FAULT_MAJOR;

@@ -1467,7 +1473,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
return;
}

- pkey = vma_pkey(vma);
up_read(&mm->mmap_sem);
if (unlikely(fault & VM_FAULT_ERROR)) {
mm_fault_error(regs, error_code, address, &pkey, fault);
--
2.14.3

Andrea Arcangeli

Nov 1, 2017, 6:17:48 AM
to Vlastimil Babka, Kirill A. Shutemov, Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, ldu...@linux.vnet.ibm.com, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linu...@kvack.org, Linus Torvalds, Thorsten Leemhuis
On Wed, Nov 01, 2017 at 08:42:57AM +0100, Vlastimil Babka wrote:
> The vma should be pinned by mmap_sem, but handle_userfault() will in some
> scenarios release it and then acquire again, so when we return to

In the above message and especially in the below comment, I would
suggest to take the opportunity to more accurately document the
specific scenario instead of "some scenario" which is only "A return
to userland to repeat the page fault later with a VM_FAULT_NOPAGE
retval (potentially after handling any pending signal during the
return to userland). The return to userland is identified whenever
FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in vmf->flags".

> + * in some scenario (and not return VM_FAULT_RETRY), we have to be

Thanks,
Andrea

Vlastimil Babka

Nov 1, 2017, 8:14:48 AM
to Andrea Arcangeli, Kirill A. Shutemov, Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, ldu...@linux.vnet.ibm.com, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linu...@kvack.org, Linus Torvalds, Thorsten Leemhuis
On 11/01/2017 11:17 AM, Andrea Arcangeli wrote:
> On Wed, Nov 01, 2017 at 08:42:57AM +0100, Vlastimil Babka wrote:
>> The vma should be pinned by mmap_sem, but handle_userfault() will in some
>> scenarios release it and then acquire again, so when we return to
>
> In the above message and especially in the below comment, I would
> suggest to take the opportunity to more accurately document the
> specific scenario instead of "some scenario" which is only "A return
> to userland to repeat the page fault later with a VM_FAULT_NOPAGE
> retval (potentially after handling any pending signal during the
> return to userland). The return to userland is identified whenever
> FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in vmf->flags".

OK, updated patch below
----8<----
From d72b9960310b959ccc2c211d90bc5215ee4560ee Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vba...@suse.cz>
Date: Wed, 1 Nov 2017 08:21:25 +0100
Subject: [PATCH] x86/mm: fix use-after-free of vma during userfaultfd fault

Syzkaller with KASAN has reported a use-after-free of vma->vm_flags in
__do_page_fault() with the following reproducer:

mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
mmap(&(0x7f0000011000/0x3000)=nil, 0x3000, 0x1, 0x32, 0xffffffffffffffff, 0x0)
r0 = userfaultfd(0x0)
ioctl$UFFDIO_API(r0, 0xc018aa3f, &(0x7f0000002000-0x18)={0xaa, 0x0, 0x0})
ioctl$UFFDIO_REGISTER(r0, 0xc020aa00, &(0x7f0000019000)={{&(0x7f0000012000/0x2000)=nil, 0x2000}, 0x1, 0x0})
r1 = gettid()
syz_open_dev$evdev(&(0x7f0000013000-0x12)="2f6465762f696e7075742f6576656e742300", 0x0, 0x0)
tkill(r1, 0x7)

The vma should be pinned by mmap_sem, but handle_userfault() might (in a return
to userspace scenario) release it and then acquire again, so when we return to
__do_page_fault() (with other result than VM_FAULT_RETRY), the vma might be
gone. Specifically, per Andrea the scenario is "A return to userland to repeat
the page fault later with a VM_FAULT_NOPAGE retval (potentially after handling
any pending signal during the return to userland). The return to userland is
identified whenever FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in
vmf->flags"

However, since a3c4fb7c9c2e ("x86/mm: Fix fault error path using unsafe vma
pointer") there is a vma_pkey() read of vma->vm_flags after that point, which
can thus become use-after-free. Fix this by moving the read before calling
handle_mm_fault().

Reported-by: syzbot <bot+6a5269ce759a7bb127...@syzkaller.appspotmail.com>
Reported-by: Dmitry Vyukov <dvy...@google.com>
Suggested-by: Kirill A. Shutemov <kir...@shutemov.name>
Fixes: a3c4fb7c9c2e ("x86/mm: Fix fault error path using unsafe vma pointer")
Reviewed-by: Andrea Arcangeli <aarc...@redhat.com>
Signed-off-by: Vlastimil Babka <vba...@suse.cz>
---
arch/x86/mm/fault.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e2baeaa053a5..7101c281c7ce 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1440,7 +1440,17 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,
* make sure we exit gracefully rather than endlessly redo
* the fault. Since we never set FAULT_FLAG_RETRY_NOWAIT, if
* we get VM_FAULT_RETRY back, the mmap_sem has been unlocked.
+ *
+ * Note that handle_userfault() may also release and reacquire mmap_sem
+ * (and not return with VM_FAULT_RETRY), when returning to userland to
+ * repeat the page fault later with a VM_FAULT_NOPAGE retval
+ * (potentially after handling any pending signal during the return to
+ * userland). The return to userland is identified whenever
+ * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
+ * Thus we have to be careful about not touching vma after handling the
+ * fault, so we read the pkey beforehand.
*/
+ pkey = vma_pkey(vma);
fault = handle_mm_fault(vma, address, flags);
major |= fault & VM_FAULT_MAJOR;

@@ -1467,7 +1477,6 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code,

Linus Torvalds

Nov 1, 2017, 11:26:25 AM
to Andrea Arcangeli, Vlastimil Babka, Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, Laurent Dufour, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linux-mm, Thorsten Leemhuis
Honestly, I would *much* prefer that.

> The above makes debugging with gdb
> more user friendly and it potentially lowers the latency of signals as
> signals can unblock handle_userfault.

I don't disagree about that, but why don't you use VM_FAULT_RETRY and
not re-take the mmap_sem? Then we wouldn't have a special case for
userfaultfd at all.

I see the gdb issue, but I wonder if we shouldn't fix that differently
by changing the retry logic in the fault handler.

In particular, right now we do

- Retry at most once

- handle fatal signals specially

and I think the gdb case actually shows that both of those decisions
may have been wrong, or at least something we could improve on?

Maybe we should return to user space on _any_ pending signal? That
might help latency for other things than gdb (think ^Z etc, but also
things that catch SIGSEGV and abort).

And maybe we should allow FAULT_FLAG_ALLOW_RETRY to just go on forever
- although that will want to verify that every case that returns
VM_FAULT_RETRY does wait for the condition it dropped the mmap
semaphore and is retrying for.

There aren't that many places that return VM_FAULT_RETRY. The
important one is lock_page_or_retry(), and that one does wait for the
page.

(There's one in mm/shmem.c too, and obviously the userfaultfd case,
but those seem to do waiting too).
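The two ideas above could be sketched roughly as follows. This is a hypothetical, non-runnable fragment against the 4.14-era retry path in __do_page_fault() (which at the time cleared FAULT_FLAG_ALLOW_RETRY and only broke out of the retry on fatal signals); the exact flag handling and every VM_FAULT_RETRY call site would need auditing before anything like this could be applied:

```c
	if (unlikely(fault & VM_FAULT_RETRY)) {
		/* mmap_sem was already released by handle_mm_fault() */

		/*
		 * Sketch: return to userland on *any* pending signal when
		 * the fault came from user mode, not just on fatal ones.
		 * The fault will simply be repeated after the signal has
		 * been handled.
		 */
		if ((flags & FAULT_FLAG_USER) && signal_pending(current))
			return;

		if (flags & FAULT_FLAG_ALLOW_RETRY) {
			/*
			 * Sketch: do NOT clear FAULT_FLAG_ALLOW_RETRY here,
			 * allowing more than one retry, on the assumption
			 * that everything returning VM_FAULT_RETRY really
			 * waits for the condition it dropped mmap_sem for.
			 */
			flags |= FAULT_FLAG_TRIED;
			goto retry;
		}
	}
```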

So maybe we could just fix the gdb case without that userfaultfd hack?

Linus

Laurent Dufour

Nov 2, 2017, 6:00:39 AM
to Vlastimil Babka, Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linu...@kvack.org, Andrea Arcangeli, Linus Torvalds, Thorsten Leemhuis
Hi Vlastimil,

Sorry for the late answer, I took a few days off.
Yes, that's my mistake.

My patch removed the use of vma once mmap_sem is released, but that was
mainly done in the error path. I moved the read of the vma's pkey before
releasing the mmap_sem, but did so in the no-error path as well, leading to
the use-after-free you have seen.

As suggested and done later in this thread, reading the vma's pkey value
before calling handle_mm_fault() will solve this issue. This is safe since
the vma's pkey can't be changed while the mmap_sem is held.

Thanks,
Laurent.

Andrea Arcangeli

Nov 2, 2017, 3:36:49 PM
to Linus Torvalds, Vlastimil Babka, Dmitry Vyukov, syzbot, Jan Beulich, H. Peter Anvin, Josh Poimboeuf, Kirill A. Shutemov, Laurent Dufour, LKML, Andy Lutomirski, Ingo Molnar, syzkall...@googlegroups.com, Thomas Gleixner, the arch/x86 maintainers, Andrew Morton, Michal Hocko, Hugh Dickins, David Rientjes, linux-mm, Thorsten Leemhuis
Yes I prefer that as well, it's much more generic and cleaner that
way.

> I see the gdb issue, but I wonder if we shouldn't fix that differently
> by changing the retry logic in the fault handler.
>
> In particular, right now we do
>
> - Retry at most once
>
> - handle fatal signals specially
>
> and I think the gdb case actually shows that both of those decisions
> may have been wrong, or at least something we could improve on?
>
> Maybe we should return to user space on _any_ pending signal? That
> might help latency for other things than gdb (think ^Z etc, but also
> things that catch SIGSEGV and abort).

That would be an ideal solution: then we can drop the special case
from handle_userfault() entirely, and the signal check will cover more
cases.

If VM_FAULT_RETRY is returned but there are pending signals and the
page fault was invoked from userland, we will ignore the
VM_FAULT_RETRY request and return to userland instead.

> And maybe we should allow FAULT_FLAG_ALLOW_RETRY to just go on forever

Returning VM_FAULT_RETRY more than once is already a dependency in the
future userfaultfd WP support (at the moment to stay on the safe side
it only allows one more VM_FAULT_RETRY), as we may writeprotect
individual pagetables while there's already a page fault in flight, in
turn requiring two VM_FAULT_RETRY in a row if the second page fault
invocation ends up in handle_userfault() because of a concurrent
UFFDIO_WRITEPROTECT.

Such an issue cannot materialize with "missing" userfaults because by
definition if the virtual page is missing there cannot be already
other page faults in flight on it.

It'd be cleaner to just let it go on forever.

> - although that will want to verify that every case that returns
> VM_FAULT_RETRY does wait for the condition it dropped the mmap
> semaphore and is retrying for.

Agreed.

> There aren't that many places that return VM_FAULT_RETRY. The
> important one is lock_page_or_retry(), and that one does wait for the
> page.
>
> (There's one in mm/shmem.c too, and obviously the userfaultfd case,
> but those seem to do waiting too).
>
> So maybe we could just fix the gdb case without that userfaultfd hack?

That would be great, I will look into implementing the above.

Thanks!
Andrea

Dmitry Vyukov

Feb 14, 2018, 9:07:38 AM
to syzbot, syzkall...@googlegroups.com
For some reason syzbot did not recognize the Reported-by tag.

#syz fix: x86/mm: fix use-after-free of vma during userfaultfd fault


On Mon, Oct 30, 2017 at 8:12 PM, syzbot
<bot+6a5269ce759a7bb127...@syzkaller.appspotmail.com>
wrote: