[syzbot] general protection fault in PageHeadHuge


syzbot

Sep 23, 2022, 12:05:41 PM
to ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, mike.k...@oracle.com, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
Hello,

syzbot found the following issue on:

HEAD commit: 483fed3b5dc8 Add linux-next specific files for 20220921
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=16f0a418880000
kernel config: https://syzkaller.appspot.com/x/.config?x=849cb9f70f15b1ba
dashboard link: https://syzkaller.appspot.com/bug?extid=152d76c44ba142f8992b
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12b97b64880000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11fb9040880000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/1cb3f4618323/disk-483fed3b.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/cc02cb30b495/vmlinux-483fed3b.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+152d76...@syzkaller.appspotmail.com

general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 PID: 3617 Comm: syz-executor722 Not tainted 6.0.0-rc6-next-20220921-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/16/2022
RIP: 0010:PagePoisoned include/linux/page-flags.h:304 [inline]
RIP: 0010:PageHead include/linux/page-flags.h:787 [inline]
RIP: 0010:PageHeadHuge+0x1d/0x200 mm/hugetlb.c:1892
Code: ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 54 55 48 89 fd 53 e8 54 c9 b9 ff 48 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 a2 01 00 00 48 8b 5d 00 48 c7 c7 ff ff ff ff 48
RSP: 0018:ffffc90003e7f5a0 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffffc90003e7f788 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81c2cb2c RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffffc90003e7f798
R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000003f4
FS: 00007f5642262700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000203f4ef0 CR3: 000000007adcc000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
folio_test_hugetlb include/linux/page-flags.h:831 [inline]
folio_file_page include/linux/pagemap.h:683 [inline]
shmem_fault+0x27c/0x8a0 mm/shmem.c:2130
__do_fault+0x107/0x600 mm/memory.c:4191
do_shared_fault mm/memory.c:4597 [inline]
do_fault mm/memory.c:4675 [inline]
handle_pte_fault mm/memory.c:4943 [inline]
__handle_mm_fault+0x2200/0x3a40 mm/memory.c:5085
handle_mm_fault+0x1c8/0x780 mm/memory.c:5206
do_user_addr_fault+0x475/0x1210 arch/x86/mm/fault.c:1428
handle_page_fault arch/x86/mm/fault.c:1519 [inline]
exc_page_fault+0x94/0x170 arch/x86/mm/fault.c:1575
asm_exc_page_fault+0x22/0x30 arch/x86/include/asm/idtentry.h:570
RIP: 0010:__put_user_nocheck_4+0x3/0x11
Code: 00 00 48 39 d9 73 54 0f 01 cb 66 89 01 31 c9 0f 01 ca c3 0f 1f 44 00 00 48 bb fd ef ff ff ff 7f 00 00 48 39 d9 73 34 0f 01 cb <89> 01 31 c9 0f 01 ca c3 66 0f 1f 44 00 00 48 bb f9 ef ff ff ff 7f
RSP: 0018:ffffc90003e7fa00 EFLAGS: 00050293
RAX: 0000000000000000 RBX: ffffc90003e7fdf4 RCX: 00000000203f4ef0
RDX: ffff888020c51d40 RSI: ffffffff8726d52f RDI: 0000000000000005
RBP: ffffc90003e7fdb0 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000002 R14: 00000000203f4ef0 R15: 0000000000000000
____sys_recvmsg+0x3ba/0x610 net/socket.c:2714
___sys_recvmsg+0xf2/0x180 net/socket.c:2743
do_recvmmsg+0x25e/0x6e0 net/socket.c:2837
__sys_recvmmsg net/socket.c:2916 [inline]
__do_sys_recvmmsg net/socket.c:2939 [inline]
__se_sys_recvmmsg net/socket.c:2932 [inline]
__x64_sys_recvmmsg+0x20b/0x260 net/socket.c:2932
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f56422dabb9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 91 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f5642262208 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 00007f564235c4b8 RCX: 00007f56422dabb9
RDX: 0000000000010106 RSI: 00000000200000c0 RDI: 0000000000000003
RBP: 00007f564235c4b0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000246 R12: 00007f564235c4bc
R13: 00007fffbde3618f R14: 00007f5642262300 R15: 0000000000022000
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:PagePoisoned include/linux/page-flags.h:304 [inline]
RIP: 0010:PageHead include/linux/page-flags.h:787 [inline]
RIP: 0010:PageHeadHuge+0x1d/0x200 mm/hugetlb.c:1892
Code: ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 54 55 48 89 fd 53 e8 54 c9 b9 ff 48 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 a2 01 00 00 48 8b 5d 00 48 c7 c7 ff ff ff ff 48
RSP: 0018:ffffc90003e7f5a0 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffffc90003e7f788 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81c2cb2c RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffffc90003e7f798
R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000003f4
FS: 00007f5642262700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f56419ff718 CR3: 000000007adcc000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess):
0: ff 66 66 jmpq *0x66(%rsi)
3: 2e 0f 1f 84 00 00 00 nopl %cs:0x0(%rax,%rax,1)
a: 00 00
c: 90 nop
d: 41 54 push %r12
f: 55 push %rbp
10: 48 89 fd mov %rdi,%rbp
13: 53 push %rbx
14: e8 54 c9 b9 ff callq 0xffb9c96d
19: 48 89 ea mov %rbp,%rdx
1c: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
23: fc ff df
26: 48 c1 ea 03 shr $0x3,%rdx
* 2a: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction
2e: 0f 85 a2 01 00 00 jne 0x1d6
34: 48 8b 5d 00 mov 0x0(%rbp),%rbx
38: 48 c7 c7 ff ff ff ff mov $0xffffffffffffffff,%rdi
3f: 48 rex.W


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches

Mike Kravetz

Sep 23, 2022, 5:11:34 PM
to Matthew Wilcox, syzbot, Hugh Dickins, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
While it is true that the addressing exception happens in the hugetlb
routine PageHeadHuge(), the reason is that it is passed a NULL page
pointer, via the call to folio_file_page() at the end of shmem_fault().

	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
				  gfp, vma, vmf, &ret);
	if (err)
		return vmf_error(err);

	vmf->page = folio_file_page(folio, vmf->pgoff);
	return ret;

The code assumes that if shmem_get_folio_gfp() returns zero, the folio
pointer will have been set to a folio. However, it looks like there are
a few places in shmem_get_folio_gfp() where it will return zero and not
set folio.

In this specific case, it is the code block:

	if (vma && userfaultfd_missing(vma)) {
		*fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
		return 0;
	}

I could try to sort this out, but I believe Matthew has the most context as
he has been changing this code recently.
--
Mike Kravetz

Hugh Dickins

Sep 23, 2022, 8:15:03 PM
to Mike Kravetz, Peter Xu, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, Hugh Dickins, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
Thanks Mike. Yes, it looks like userfaultfd's successful returns from
shmem_get_folio_gfp() have (correctly?) filled in vmf->page directly,
and are not setting *foliop at all. So this would affect any
userfaultfd usage of shmem, not just an odd syzbot corner case.

I do think shmem_fault() ought at least to initialize its folio pointer
(as shmem_file_read_iter() correctly does); and if it does that, then we
don't need to weed out the userfaultfd cases as such, but just avoid
folio_file_page() when folio is still NULL.

I've neither tried the syzbot test case (but see its .c does involve
userfaultfd), nor tried userfaultfd shmem: Peter, could you give this a
try - we'd expect userfaultfd shmem to crash on linux-next without the
patch below.

And it implies that nobody has been trying userfaultfd shmem on recent
linux-next, so whether vmf->page is then set correctly we're not sure.

But me, I cannot even get the latest linux-next to boot: next job.

Hugh

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2060,7 +2060,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = file_inode(vma->vm_file);
 	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
-	struct folio *folio;
+	struct folio *folio = NULL;
 	int err;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
@@ -2127,7 +2127,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 				  gfp, vma, vmf, &ret);
 	if (err)
 		return vmf_error(err);
-	vmf->page = folio_file_page(folio, vmf->pgoff);
+	if (folio)
+		vmf->page = folio_file_page(folio, vmf->pgoff);
 	return ret;
 }

Nathan Chancellor

Sep 23, 2022, 8:18:28 PM
to Hugh Dickins, Mike Kravetz, Peter Xu, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com

Peter Xu

Sep 23, 2022, 8:58:13 PM
to Hugh Dickins, Mike Kravetz, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
Hi, Hugh, Mike,

On Fri, Sep 23, 2022 at 05:14:38PM -0700, Hugh Dickins wrote:
AFAICT handle_userfault() doesn't set vmf->page; it relies on returning
"*fault_type" as VM_FAULT_RETRY so the fault will be retried, assuming
the pgtable has been installed properly when it returns.
The patch itself looks correct to me. Thanks.

Reviewed-by: Peter Xu <pet...@redhat.com>

Not sure whether it can be directly squashed into 371133f76a6d.

--
Peter Xu

Hugh Dickins

Sep 24, 2022, 3:29:13 AM
to Nathan Chancellor, Hugh Dickins, Mike Kravetz, Peter Xu, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
Many thanks Nathan, yes, you did save me a bisect with that link.
And I managed to get a patch to revert the badness,
I'll reply there on that thread now.

Hugh

Peter Xu

Sep 24, 2022, 11:06:53 AM
to Hugh Dickins, Mike Kravetz, Mike Kravetz, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
Sorry I forgot to reply on this one.

I didn't try linux-next, but I can easily reproduce this with mm-unstable
already, and I verified that Hugh's patch fixes the problem for shmem.

When I was testing I found hugetlb selftest is broken too but with some
other errors:

$ sudo ./userfaultfd hugetlb 100 10
...
bounces: 6, mode: racing ver read, ERROR: unexpected write fault (errno=0, line=779)

The failing check was making sure all MISSING events are not triggered by
writes, but frankly I don't really know why it's required, and that check
has existed since the 1st commit, when the test was introduced.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c47174fc362a089b1125174258e53ef4a69ce6b8
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/vm/userfaultfd.c?id=c47174fc362a089b1125174258e53ef4a69ce6b8#n291

And obviously some recent hugetlb-related change caused that to happen.

Dropping that check can definitely work, but I'll have a closer look soon
too to make sure I didn't miss something. Mike, please also let me know if
you are aware of this problem.
--
Peter Xu

Mike Kravetz

Sep 24, 2022, 3:01:28 PM
to Peter Xu, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
On 09/24/22 11:06, Peter Xu wrote:
>
> Sorry I forgot to reply on this one.
>
> I didn't try linux-next, but I can easily reproduce this with mm-unstable
> already, and I verified that Hugh's patch fixes the problem for shmem.
>
> When I was testing I found hugetlb selftest is broken too but with some
> other errors:
>
> $ sudo ./userfaultfd hugetlb 100 10
> ...
> bounces: 6, mode: racing ver read, ERROR: unexpected write fault (errno=0, line=779)
>
> The failing check was making sure all MISSING events are not triggered by
> writes, but frankly I don't really know why it's required, and that check
> existed since the 1st commit when test was introduced.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c47174fc362a089b1125174258e53ef4a69ce6b8
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/vm/userfaultfd.c?id=c47174fc362a089b1125174258e53ef4a69ce6b8#n291
>
> And obviously some recent hugetlb-related change caused that to happen.
>
> Dropping that check can definitely work, but I'll have a closer look soon
> too to make sure I didn't miss something. Mike, please also let me know if
> you are aware of this problem.
>

Peter, I am not aware of this problem. I really should make running ALL
hugetlb tests part of my regular routine.

If you do not beat me to it, I will take a look in the next few days.
--
Mike Kravetz

Matthew Wilcox

Sep 24, 2022, 5:56:26 PM
to Mike Kravetz, syzbot, Hugh Dickins, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com, vishal...@gmail.com
Vishal sent me a patch a few days ago, but I was on holiday so haven't
seen it until now.

From: "Vishal Moola (Oracle)" <vishal...@gmail.com>
To: wi...@infradead.org
Cc: "Vishal Moola (Oracle)" <vishal...@gmail.com>
Subject: [PATCH 1/1] Fix an issue in shmem_fault()
Date: Wed, 21 Sep 2022 17:38:56 -0700
Message-Id: <20220922003855.23...@gmail.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220922003855.23...@gmail.com>
References: <20220922003855.23...@gmail.com>

If shmem_get_folio_gfp returns 0 AND does not assign folio,
folio_file_page() runs into issues. Make sure to assign vmf->page only
if a folio is assigned, otherwise set it to NULL.

Signed-off-by: Vishal Moola (Oracle) <vishal...@gmail.com>
---
mm/shmem.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index d6921b8e2cb5..986c07362eab 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2060,7 +2060,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = file_inode(vma->vm_file);
 	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
-	struct folio *folio;
+	struct folio *folio = NULL;
 	int err;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
@@ -2127,7 +2127,10 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 				  gfp, vma, vmf, &ret);
 	if (err)
 		return vmf_error(err);
-	vmf->page = folio_file_page(folio, vmf->pgoff);
+	if (folio)
+		vmf->page = folio_file_page(folio, vmf->pgoff);
+	else
+		vmf->page = NULL;
 	return ret;
 }

--
2.36.1

Andrew Morton

Sep 25, 2022, 2:55:08 PM
to Hugh Dickins, Mike Kravetz, Peter Xu, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
On Fri, 23 Sep 2022 17:14:38 -0700 (PDT) Hugh Dickins <hu...@google.com> wrote:

> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2060,7 +2060,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
> struct vm_area_struct *vma = vmf->vma;
> struct inode *inode = file_inode(vma->vm_file);
> gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
> - struct folio *folio;
> + struct folio *folio = NULL;
> int err;
> vm_fault_t ret = VM_FAULT_LOCKED;
>
> @@ -2127,7 +2127,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
> gfp, vma, vmf, &ret);
> if (err)
> return vmf_error(err);
> - vmf->page = folio_file_page(folio, vmf->pgoff);
> + if (folio)
> + vmf->page = folio_file_page(folio, vmf->pgoff);
> return ret;
> }

I grabbed this as a fix against
shmem-convert-shmem_fault-to-use-shmem_get_folio_gfp.patch. Hopefully
someone can send along something more formal when ready.

Peter Xu

Sep 25, 2022, 8:11:15 PM
to Mike Kravetz, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
Just to update - my bisection points to 00cdec99f3eb ("hugetlbfs: revert
use i_mmap_rwsem to address page fault/truncate race", 2022-09-21).

I don't understand how they are related so far, though. It should be a
timing thing because the failure cannot be reproduced on a VM but only on
the host, and it can also pass sometimes even on the host but rarely.

Logically all the uffd messages in the stress test should be generated by
the locking thread, upon:

pthread_mutex_lock(area_mutex(area_dst, page_nr));

I thought a common scheme for the lock() fast path would already be a
userspace cmpxchg(), and that should already be a write fault.

For example, I did some stupid hack on the test and I can trigger the write
check fault with anonymous easily with an explicit cmpxchg on byte offset 128:

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 74babdbc02e5..a7d6938d4553 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -637,6 +637,10 @@ static void *locking_thread(void *arg)
 	} else
 		page_nr += 1;
 	page_nr %= nr_pages;
+	char *ptr = area_dst + (page_nr * page_size) + 128;
+	char _old = 0, new = 1;
+	(void)__atomic_compare_exchange_n(ptr, &_old, new, false,
+					  __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
 	pthread_mutex_lock(area_mutex(area_dst, page_nr));
 	count = *area_count(area_dst, page_nr);
 	if (count != count_verify[page_nr])

I'll need some more time thinking about it before I send a patch to drop
the write check..

Thanks,

--
Peter Xu

Hugh Dickins

Sep 27, 2022, 12:32:31 AM
to Matthew Wilcox, Mike Kravetz, syzbot, Hugh Dickins, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com, vishal...@gmail.com, pet...@redhat.com
Acked-by: Hugh Dickins <hu...@google.com>

Peter, thank you for reviewing and advising and testing mine, and
Andrew, thank you for taking it into mm-unstable. But I think that
since Vishal sent his first, thank you, it is his that should go in.

Me, I tend to avoid an "else", but that's not a very good reason to
prefer mine. If, in my ignorance, I'd been right about userfaultfd
setting vmf->page, then mine would have been the right patch; but
Peter showed I was wrong on that, so Vishal's seems slightly the better.

Matthew, please send Vishal's with your signoff on to Andrew;
or I can do so (mutatis mutandis) if you prefer.

Thanks,
Hugh

Mike Kravetz

Sep 27, 2022, 12:24:54 PM
to Peter Xu, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
Thanks Peter!

After your analysis, I also started looking at this.
- I did reproduce a few times in a VM
- On bare metal (a laptop) I could reproduce, but it would take several (10's of) runs

> Logically all the uffd messages in the stress test should be generated by
> the locking thread, upon:
>
> pthread_mutex_lock(area_mutex(area_dst, page_nr));

I personally find that test program hard to understand/follow; it takes me
a day or so to understand what it is doing, then I immediately lose context
when I stop looking at it. :(

So, as you mention below the program is depending on pthread_mutex_lock()
doing a read fault before a write.

> I thought a common scheme for lock() fast path should already be an
> userspace cmpxchg() and that should be a write fault already.
>
> For example, I did some stupid hack on the test and I can trigger the write
> check fault with anonymous easily with an explicit cmpxchg on byte offset 128:
>
> diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> index 74babdbc02e5..a7d6938d4553 100644
> --- a/tools/testing/selftests/vm/userfaultfd.c
> +++ b/tools/testing/selftests/vm/userfaultfd.c
> @@ -637,6 +637,10 @@ static void *locking_thread(void *arg)
> } else
> page_nr += 1;
> page_nr %= nr_pages;
> + char *ptr = area_dst + (page_nr * page_size) + 128;
> + char _old = 0, new = 1;
> + (void)__atomic_compare_exchange_n(ptr, &_old, new, false,
> + __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
> pthread_mutex_lock(area_mutex(area_dst, page_nr));
> count = *area_count(area_dst, page_nr);
> if (count != count_verify[page_nr])
>
> I'll need some more time thinking about it before I send a patch to drop
> the write check..

I did another stupid hack, and duplicated the statement:
	count = *area_count(area_dst, page_nr);
before the
	pthread_mutex_lock(area_mutex(area_dst, page_nr));

This should guarantee a read fault independent of what pthread_mutex_lock
does. However, it still results in the occasional "ERROR: unexpected write
fault". So, something else is happening. I will continue to experiment
and think about this.
--
Mike Kravetz

Peter Xu

Sep 27, 2022, 12:46:25 PM
to Mike Kravetz, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
On Tue, Sep 27, 2022 at 09:24:37AM -0700, Mike Kravetz wrote:
> This should guarantee a read fault independent of what pthread_mutex_lock
> does. However, it still results in the occasional "ERROR: unexpected write
> fault". So, something else if happening. I will continue to experiment
> and think about this.

Thanks for verifying this, Mike. I hadn't replied yet, but I did have some
updates on my side too. I planned to look deeper and wait to reply until
then, because I do think there's something special about hugetlb that I
still don't understand. I just keep getting distracted by other things..
but maybe I should share what I have already.

I think I already know what had guaranteed the read faults - the NPTL
pthread lib implements mutexes of different types, and the 1st instruction
of lock() is to fetch the mutex type (at offset 0x10), so it knows how to
proceed:

(gdb) disas pthread_mutex_lock
Dump of assembler code for function ___pthread_mutex_lock:
   0x00007ffff7e3b0d0 <+0>:	endbr64
   0x00007ffff7e3b0d4 <+4>:	mov    0x10(%rdi),%eax   <---- read 0x10 of &mutex
   0x00007ffff7e3b0d7 <+7>:	mov    %eax,%edx
   0x00007ffff7e3b0d9 <+9>:	and    $0x17f,%edx
(gdb) ptype pthread_mutex_t
type = union {
    struct __pthread_mutex_s __data;
    char __size[40];
    long __align;
}
(gdb) ptype struct __pthread_mutex_s
type = struct __pthread_mutex_s {
    int __lock;
    unsigned int __count;
    int __owner;
    unsigned int __nusers;
    int __kind;       <--- 0x10 offset here
    short __spins;
    short __elision;
    __pthread_list_t __list;
}

I looked directly at the asm dumped from the binary I tested to make sure
it's accurate. So it means that with the current uffd selftest there should
logically never be a write missing fault (I still don't think it's good to
have the write check though.. but that's a separate question).

It also means hugetlb does something special here. It really smells like
for some reason the hugetlb pgtable got evicted after UFFDIO_COPY while
locking_thread was running; then any further lock() (e.g. cmpxchg) or
modification of the counter could trigger a write fault.

OTOH this also explains why the futex code was never exercised by the
userfaultfd selftest: futex() will always come after that "read the
type of mutex" access.. which is something I want to rework a bit, so as to
have the uffd selftest cover gup as you rightfully used to suggest. But
that's nothing urgent, and can come after we know what's special
about the new hugetlb code.

I'll also keep update if I figured something more out of it.

--
Peter Xu

Mike Kravetz

Sep 29, 2022, 7:34:04 PM
to Peter Xu, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
I was able to do a little more debugging:

As you know, the hugetlb call path to handle_userfault() is something
like this:

hugetlb_fault
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
	if (huge_pte_none_mostly())
		hugetlb_no_page()
			page = find_lock_page(mapping, idx);
			if (!page) {
				if (userfaultfd_missing(vma))
					mutex_unlock(&hugetlb_fault_mutex_table[hash]);
					return handle_userfault()
			}

For anon mappings, find_lock_page() will never find a page, so as long
as huge_pte_none_mostly() is true we will call into handle_userfault().

Since your analysis shows the testcase should never call handle_userfault() for
a write fault, I simply added a 'if (flags & FAULT_FLAG_WRITE) printk' before
the call to handle_userfault(). Sure enough, I saw plenty of printk messages.

Then, before calling handle_userfault() I added code to take the page table
lock and test huge_pte_none_mostly() again. In every FAULT_FLAG_WRITE case,
this second test of huge_pte_none_mostly() was false. So, the condition
changed between the check in hugetlb_fault and the check (with the page
table lock held) in hugetlb_no_page.

IIUC, the only code that should be modifying the pte in this test is
hugetlb_mcopy_atomic_pte(). It also holds the hugetlb_fault_mutex while
updating the pte.

It 'appears' that hugetlb_fault is not seeing the updated pte and I can
only guess that it is due to some caching issues.

After writing the pte in hugetlb_mcopy_atomic_pte() there is this comment.

/* No need to invalidate - it was non-present before */
update_mmu_cache(dst_vma, dst_addr, dst_pte);

I suspect that is true. However, it seems like this test depends on all
CPUs seeing the updated pte immediately?

I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
any difference. Suggestions would be appreciated as cache/tlb/??? flushing
issues take me a while to figure out.
--
Mike Kravetz

Peter Xu

Sep 29, 2022, 9:30:07 PM
to Mike Kravetz, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
Hi, Mike,

On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
> I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
> any difference. Suggestions would be appreciated as cache/tlb/??? flushing
> issues take me a while to figure out.

It seems the UFFDIO_COPY for hugetlb is the major issue here, in that for
private mappings we don't inject the page cache.

I think it makes sense in that e.g. we don't want to allow a private
mapping to be able to write to the page cache. But afaict that's not
correct because UFFDIO_COPY resolves exactly page faults in page cache
layer for file backed memories. So what we should do is inject page cache
but mark the page RO, waiting for a coming CoW if needed.

I'll attach one fix patch that starts to inject the page into the page
cache for UFFDIO_COPY+hugetlb even if the mapping is private, along with a
test patch, because otherwise the private hugetlb selftest won't work after
the fix is applied - in the selftest we used to use DONTNEED to drop the
private mapping, but IMHO that's not enough; we need to drop the
page cache too (after the fix).

Feel free to try with the two patches applied. It started to work for
me for the current issue.

I didn't post them out yet because after I applied the two patches I
found other issues - the reserved pages are messed up and leaked. I'll
keep looking at the leak issue tomorrow, but please also let me know if you
figure out anything suspicious, as you're definitely much more fluent in
the reservation code.

And that's not the only issue I found - shmem can have other issues
regarding private mappings; shmem does it right on the page cache insertion
but not the rest, I think.. I'll look into them one by one. It's quite
interesting to dig multiple things out of the write check symptoms..

--
Peter Xu

Peter Xu

Sep 29, 2022, 9:35:53 PM
to Mike Kravetz, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
The patches..

--
Peter Xu
0001-mm-hugetlb-Insert-page-cache-for-UFFDIO_COPY-even-if.patch
0002-selftests-vm-Use-memfd-for-hugetlb-tests.patch

Peter Xu

Sep 30, 2022, 12:05:18 PM
to Mike Kravetz, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
This morning when I went back and rethought the matter, I found that the
common hugetlb path handles private anonymous mappings with an entirely
empty page cache, as you explained above. In that sense the two patches I
posted may not really make sense even if they pass the tests, and maybe
that's also the reason the reservations got messed up. This is also
something I found after reading more of the reservation code: no matter
whether hugetlb mappings are private or shared, we only reserve that exact
number of pages at mmap() time.

Indeed, with that in mind UFFDIO_COPY should also work, because the hugetlb
fault handler checks the pte before the page cache, so uffd missing
should still work as expected.

It makes sense especially for hugetlb to do that, since otherwise there
could be plenty of zero huge pages sitting in the page cache. I'm not sure
whether this is the reason hugetlb does it differently (e.g. compared to
shmem?); it'd be great if I could get a confirmation. If it's true, please
ignore the two patches I posted.

I think what you analyzed is correct in that the pte shouldn't go away
after being armed once. One more thing I tried (yesterday, actually) was to
SIGBUS the process when the write missing event was generated, and I can
see the user stack points to the cmpxchg() in pthread_mutex_lock(). It
means it had indeed moved forward and passed the mutex type check; it also
means it should already have seen a !none pte with at least read
permission. In that sense it matches the "missing TLB" possibility
mentioned above, because with a missing TLB entry it should keep
getting stuck at the read, not the write. It's still unclear why the pte
can somehow go away from under us and why it quickly re-appears according
to your experiment.

--
Peter Xu

Mike Kravetz

Sep 30, 2022, 5:38:11 PM
to Peter Xu, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
I 'think' it is more of a race with all CPUs seeing the pte update. To be
honest, I cannot wrap my head around how that can happen.

I did not really like your idea of adding anon (or private) pages to the
page cache. As you discovered, there is code, like reservations, which
depends on the current behavior.

It seems to me that for 'missing' hugetlb faults there are two specific cases:
1) Shared or file-backed mappings. In this case, the page cache is the
'source of truth'. If there is not a page in the page cache, then we
hand off to userfault with VM_UFFD_MISSING.
2) Anon or private mappings. In this case, pages are not in the page cache.
The page table is the 'source of truth'. Early in hugetlb fault processing
we check the page table (huge_pte_none_mostly). However, as my debug code
has shown, checking the page table again with the lock held will reveal that
the pte has in fact been updated.

My thought was that regular anon pages would have the same issue. So, I looked
at the calling code there. In do_anonymous_page() there is this block:

	/* Use the zero-page for reads */
	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
			!mm_forbids_zeropage(vma->vm_mm)) {
		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
						vma->vm_page_prot));
		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
				vmf->address, &vmf->ptl);
		if (!pte_none(*vmf->pte)) {
			update_mmu_tlb(vma, vmf->address, vmf->pte);
			goto unlock;
		}
		ret = check_stable_address_space(vma->vm_mm);
		if (ret)
			goto unlock;
		/* Deliver the page fault to userland, check inside PT lock */
		if (userfaultfd_missing(vma)) {
			pte_unmap_unlock(vmf->pte, vmf->ptl);
			return handle_userfault(vmf, VM_UFFD_MISSING);
		}
		goto setpte;
	}

Notice that here the pte is checked while holding the page table lock in
order to make the decision to call handle_userfault().

In my testing, if we check huge_pte_none_mostly() while holding the page table
lock before calling handle_userfault, we will not experience the failure. Can
you see if this also resolves the issue in your environment? I do not love
this solution, as I still cannot explain how this code is missing the pte
update.

From f910e7155d6831514165af35e0d75574124a4477 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.k...@oracle.com>
Date: Fri, 30 Sep 2022 13:45:08 -0700
Subject: [PATCH] hugetlb: check pte with page table lock before handing to
userfault

In hugetlb_no_page we decide a page is missing if not present in the
page cache. This is perfectly valid for shared/file mappings where
pages must exist in the page cache. For anon/private mappings, the page
table must be checked. This is done early in hugetlb_fault processing
and is the reason we enter hugetlb_no_page. However, the early check is
made without holding the page table lock. There could be racing updates
to the pte entry, so check again with page table lock held before
deciding to call handle_userfault.

Signed-off-by: Mike Kravetz <mike.k...@oracle.com>
---
mm/hugetlb.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 60e077ce6ca7..4cb44a4629b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5560,10 +5560,29 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		if (idx >= size)
 			goto out;
 		/* Check for page in userfault range */
-		if (userfaultfd_missing(vma))
+		if (userfaultfd_missing(vma)) {
+			/*
+			 * For missing pages, the page cache (checked above) is
+			 * the 'source of truth' for shared mappings.  For anon
+			 * mappings, the source of truth is the page table.  We
+			 * already checked huge_pte_none_mostly() in
+			 * hugetlb_fault.  However, there could be racing
+			 * updates.  Check again while holding page table lock
+			 * before handing off to userfault.
+			 */
+			if (!(vma->vm_flags & VM_MAYSHARE)) {
+				ptl = huge_pte_lock(h, mm, ptep);
+				if (!huge_pte_none_mostly(huge_ptep_get(ptep))) {
+					spin_unlock(ptl);
+					ret = 0;
+					goto out;
+				}
+				spin_unlock(ptl);
+			}
 			return hugetlb_handle_userfault(vma, mapping, idx,
 						flags, haddr, address,
 						VM_UFFD_MISSING);
+		}
 
 		page = alloc_huge_page(vma, haddr, 0);
 		if (IS_ERR(page)) {
--
2.37.3

Peter Xu

Sep 30, 2022, 10:47:53 PM
to Mike Kravetz, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
Hi, Mike,
I don't like that either, especially since I noticed it breaks the
reservations... :)

Note that my previous patch wasn't adding anon or private pages into the
page cache; PageAnon() will be false for the pages being added. I was
literally inserting into the page cache for UFFDIO_COPY rather than
directly making the page anonymous. Actually, that's what shmem does.

> As you discovered, there is code like reservations which depend
> on current behavior.
>
> It seems to me that for 'missing' hugetlb faults there are two specific cases:
> 1) Shared or file backed mappings. In this case, the page cache is the
> 'source of truth'. If there is not a page in the page cache, then we
> hand off to userfault with VM_UFFD_MISSING.
> 2) anon or private mappings. In this case, pages are not in the page cache.
> The page table is the 'source of truth'.

Sorry, I can't say I fully agree with it.

IMHO the missing mode is really about the page cache. Imagine we create a
private mapping upon a hugetlbfs file that already has some pages written.
When a page fault triggers, we should never generate a missing event on a
cache hit, even if the pte is null. Maybe that's not exactly what you
meant, but the wording is kind of misleading here.

I'd say it's really that hugetlbfs is special in how it treats private
pages. Note that shmem doesn't treat them like that, as I mentioned. But
AFAICT hugetlbfs is different in a good way, because otherwise we could be
wasting quite a few useless zero pages dangling in the page cache, and
AFAICT that's still what we do with shmem - just try to create a 20M shmem
anonymous file, privately map and write to it, and see how much memory
we'll consume. It's not optimal, but still correct IMHO.

> Early in hugetlb fault processing
> we check the page table (huge_pte_none_mostly). However, as my debug code
> has shown, checking the page table again with lock held will reveal that
> the pte has in fact been updated.

Right, I found it the hard way. I should have been reading the code more
carefully: it's caused by CoW, where we need a clear+flush to ensure
TLB consistency (in hugetlb_wp):

	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
		ClearHPageRestoreReserve(new_page);

		/* Break COW or unshare */
		huge_ptep_clear_flush(vma, haddr, ptep);
		mmu_notifier_invalidate_range(mm, range.start, range.end);
		page_remove_rmap(old_page, vma, true);
		hugepage_add_new_anon_rmap(new_page, vma, haddr);
		set_huge_pte_at(mm, haddr, ptep,
				make_huge_pte(vma, new_page, !unshare));
		SetHPageMigratable(new_page);
		/* Make the old page be freed below */
		new_page = old_page;
	}

The early TLB flush is there to avoid inconsistent TLB entries across
different cores.

So far I still don't know why the hugetlb commit changed the timing, but
this time, since I tracked the pgtables, I'm more sure of the cause.

>
> My thought was that regular anon pages would have the same issue. So, I looked
> at the calling code there. In do_anonymous_page() there is this block:
>
> 	/* Use the zero-page for reads */
> 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> 			!mm_forbids_zeropage(vma->vm_mm)) {
> 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
> 						vma->vm_page_prot));
> 		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> 				vmf->address, &vmf->ptl);
> 		if (!pte_none(*vmf->pte)) {
> 			update_mmu_tlb(vma, vmf->address, vmf->pte);
> 			goto unlock;
> 		}
> 		ret = check_stable_address_space(vma->vm_mm);
> 		if (ret)
> 			goto unlock;
> 		/* Deliver the page fault to userland, check inside PT lock */
> 		if (userfaultfd_missing(vma)) {
> 			pte_unmap_unlock(vmf->pte, vmf->ptl);
> 			return handle_userfault(vmf, VM_UFFD_MISSING);
> 		}
> 		goto setpte;
> 	}
>
> Notice that here the pte is checked while holding the page table lock while
> to make the decision to call handle_userfault().

Right. That's a great finding, thanks. It's just that I only appreciated
it after I found the whole story on the in-progress CoW. I could have
done better.

>
> In my testing, if we check huge_pte_none_mostly() while holding the page table
> lock before calling handle_userfault we will not experience the failure. Can
> you see if this also resolves the issue in your environment? I do not love
> this solution as I still can not explain how this code is missing the pte
> update.

Yes, this patch will fix it; I tested it. But IMHO quite a bit of the
wording is misleading, as I tried to explain above regarding why the page
cache still matters (and matters the most, IMHO, for MISSING events).

Meanwhile, IMHO there's a better way to address this - we're in
hugetlb_no_page() but we're relying on a pte that was read _without_ the
pgtable lock, which means it can be unstable or already changed. We do a
proper pte_same() check for most of the rest of the hugetlb_no_page() code
path, but we forgot to do that for userfaultfd.

For example, IMO we shouldn't apply this "check pte under lock" to
private mappings only. If the pte changed, for either private or shared
mappings, it's probably already illegal to be inside hugetlb_no_page().
Logically, for shared mappings the pte can change in any form too, as long
as it's under the pgtable lock, so the locked check should apply to shared
mappings too, IMHO.

In summary, I attached another patch that addresses it in the way I
described above (only compile-tested, and without a commit message,
because I need to go very soon). Let me know what you think is the best
way to approach this.

(While debugging this problem I also found a few other issues; I won't
make this email any longer, but will verify them next week and follow up
from there.)

Thanks,
--
Peter Xu

Peter Xu

Sep 30, 2022, 11:01:17 PM
to Mike Kravetz, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
Obviously I never remember to attach things when I mean to. Sorry for the
noise; attached this time.
--
Peter Xu
0001-mm-hugetlb-Fix-race-condition-of-uffd-missing-handli.patch

Mike Kravetz

Oct 2, 2022, 9:17:03 PM
to Peter Xu, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
On 09/30/22 23:01, Peter Xu wrote:
> On Fri, Sep 30, 2022 at 10:47:45PM -0400, Peter Xu wrote:
> From fe9e50551f3fdb7107315784affca4f9b1c4720f Mon Sep 17 00:00:00 2001
> From: Peter Xu <pet...@redhat.com>
> Date: Fri, 30 Sep 2022 22:22:44 -0400
> Subject: [PATCH] mm/hugetlb: Fix race condition of uffd missing handling
> Content-type: text/plain
>
> Signed-off-by: Peter Xu <pet...@redhat.com>
> ---
> mm/hugetlb.c | 36 +++++++++++++++++++++++++++++++++---
> 1 file changed, 33 insertions(+), 3 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dd29cba46e9e..5015d8aa5da4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5557,9 +5557,39 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  	if (!page) {
>  		/* Check for page in userfault range */
>  		if (userfaultfd_missing(vma)) {
> -			ret = hugetlb_handle_userfault(vma, mapping, idx,
> -						       flags, haddr, address,
> -						       VM_UFFD_MISSING);
> +			bool same;
> +
> +			/*
> +			 * Since hugetlb_no_page() was examining pte
> +			 * without pgtable lock, we need to re-test under
> +			 * lock because the pte may not be stable and could
> +			 * have changed from under us.  Try to detect
> +			 * either changed or during-changing ptes and bail
> +			 * out properly.
> +			 *
> +			 * One example of changing pte is in-progress CoW
> +			 * of private mapping, which will clear+flush pte
> +			 * then reinstall the new one.
> +			 *
> +			 * Note that userfaultfd is actually fine with
> +			 * false positives (e.g. caused by pte changed),
> +			 * but not wrong logical events (e.g. caused by
> +			 * reading a pte during changing).  The latter can
> +			 * confuse the userspace, so the strictness is very
> +			 * much preferred.  E.g., MISSING event should
> +			 * never happen on the page after UFFDIO_COPY has
> +			 * correctly installed the page and returned.
> +			 */

Thanks Peter!

The wording and pte_same check here are better than what I proposed. I think
that last paragraph above should go into the commit message, as it describes
user-visible effects (a MISSING event after UFFDIO_COPY has correctly
installed the page and returned).

This seems to have existed since hugetlb userfault support was added; it just
became exposed recently due to locking changes going into 6.1. However, I
think it may have existed in the window after hugetlb userfault support was
added and before the current i_mmap_sema locking for pmd sharing was added.
Just a long way of saying I am not sure cc: stable is of much value.

--
Mike Kravetz

> +			ptl = huge_pte_lock(h, mm, ptep);
> +			same = pte_same(huge_ptep_get(ptep), old_pte);
> +			spin_unlock(ptl);
> +			if (same)
> +				ret = hugetlb_handle_userfault(vma, mapping, idx,
> +							       flags, haddr, address,
> +							       VM_UFFD_MISSING);
> +			else
> +				/* PTE changed or unstable, retry */
> +				ret = 0;
>  			goto out;
>  		}
>
> --
> 2.37.3
>

Peter Xu

Oct 3, 2022, 11:24:26 AM
to Mike Kravetz, Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot, ak...@linux-foundation.org, linux-...@vger.kernel.org, linu...@kvack.org, ll...@lists.linux.dev, nat...@kernel.org, ndesau...@google.com, songm...@bytedance.com, syzkall...@googlegroups.com, tr...@redhat.com
Will do.

>
> This seems to have existed since hugetlb userfault support was added. It just
> became exposed recently due to locking changes going into 6.1. However, I
> think it may have existed in the window after hugetlb userfault support was
> added and before current i_mmap_sema locking for pmd sharing was added.

Agreed.

> Just a long way of saying I am not sure cc stable if of much value.

Logically the change is stable material. I worry that after the uffd-wp
integration with hugetlb it's indeed possible to trigger on the CoWs we're
already encountering, so IMO it's still something good to have for 5.19.

I just saw that you proposed a similar fix in 4643d67e8cb0b35 for a similar
page migration race three years ago. I'm not sure whether it can also
happen with uffd missing mode even before uffd-wp was introduced.

I think I'll first post the patch with a Fixes tag attached but without
tagging stable; let me know your thoughts. No worries about the backport -
I can take care of that and the tests.

I also plan to add your Co-developed-by tag if you're fine with it, because
this patch is a collaborative effort IMO; please let me know, either here
or by replying directly to the patch once it's posted, if you think that's
improper in any form.

Thanks Mike!

--
Peter Xu
