Simplified reproducer that does not depend on a race with the
child process.

As before, PAE is cleared on the head page, because it is/was
COW-shared with a child process.

We then register more than one consecutive tail page of that
THP through io_uring, GUP-pinning them. These tail pages are not
COW-shared and, therefore, still have PAE set.
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <liburing.h>

int main(void)
{
	struct io_uring_params params = {
		.wq_fd = -1,
	};
	struct iovec iovec;
	const size_t pagesize = getpagesize();
	size_t size = 2048 * pagesize;
	char *addr;
	int fd;

	/* We need a THP-aligned area. */
	addr = mmap((char *)0x20000000u, size, PROT_WRITE|PROT_READ,
		    MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
	if (addr == MAP_FAILED) {
		perror("MAP_FIXED failed");
		return 1;
	}

	if (madvise(addr, size, MADV_HUGEPAGE)) {
		perror("MADV_HUGEPAGE failed");
		return 1;
	}

	/* Populate a THP. */
	memset(addr, 0, size);

	/* COW-share only the first page ... */
	if (madvise(addr + pagesize, size - pagesize, MADV_DONTFORK)) {
		perror("MADV_DONTFORK failed");
		return 1;
	}

	/* ... using fork(). This will clear PAE on the head page. */
	if (fork() == 0)
		exit(0);

	/* Setup io_uring. */
	fd = syscall(__NR_io_uring_setup, 1024, &params);
	if (fd < 0) {
		perror("__NR_io_uring_setup failed");
		return 1;
	}

	/* Register (GUP-pin) two consecutive tail pages. */
	iovec.iov_base = addr + pagesize;
	iovec.iov_len = 2 * pagesize;
	syscall(__NR_io_uring_register, fd, IORING_REGISTER_BUFFERS, &iovec, 1);
	return 0;
}
[ 108.070381][ T14] kernel BUG at mm/gup.c:71!
[ 108.070502][ T14] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
[ 108.117202][ T14] Modules linked in:
[ 108.119105][ T14] CPU: 1 UID: 0 PID: 14 Comm: kworker/u32:1 Not tainted 6.16.0-rc2-syzkaller-g9aa9b43d689e #0 PREEMPT
[ 108.123672][ T14] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20250221-8.fc42 02/21/2025
[ 108.127458][ T14] Workqueue: iou_exit io_ring_exit_work
[ 108.129812][ T14] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 108.133091][ T14] pc : sanity_check_pinned_pages+0x7cc/0x7d0
[ 108.135566][ T14] lr : sanity_check_pinned_pages+0x7cc/0x7d0
[ 108.138025][ T14] sp : ffff800097ac7640
[ 108.139859][ T14] x29: ffff800097ac7660 x28: dfff800000000000 x27: 1fffffbff80d3000
[ 108.143185][ T14] x26: 01ffc0000002007c x25: 01ffc0000002007c x24: fffffdffc0698000
[ 108.146599][ T14] x23: fffffdffc0698000 x22: ffff800097ac76e0 x21: 01ffc0000002007c
[ 108.150025][ T14] x20: 0000000000000000 x19: ffff800097ac76e0 x18: 00000000ffffffff
[ 108.153449][ T14] x17: 703e2d6f696c6f66 x16: ffff80008ae33808 x15: ffff700011ed61d4
[ 108.156892][ T14] x14: 1ffff00011ed61d4 x13: 0000000000000004 x12: ffffffffffffffff
[ 108.160267][ T14] x11: ffff700011ed61d4 x10: 0000000000ff0100 x9 : f6672ecf4f89d700
[ 108.163782][ T14] x8 : f6672ecf4f89d700 x7 : 0000000000000001 x6 : 0000000000000001
[ 108.167180][ T14] x5 : ffff800097ac6d58 x4 : ffff80008f727060 x3 : ffff80008054c348
[ 108.170807][ T14] x2 : 0000000000000000 x1 : 0000000100000000 x0 : 0000000000000061
[ 108.174205][ T14] Call trace:
[ 108.175649][ T14] sanity_check_pinned_pages+0x7cc/0x7d0 (P)
[ 108.178138][ T14] unpin_user_page+0x80/0x10c
[ 108.180189][ T14] io_release_ubuf+0x84/0xf8
[ 108.182196][ T14] io_free_rsrc_node+0x250/0x57c
[ 108.184345][ T14] io_rsrc_data_free+0x148/0x298
[ 108.186493][ T14] io_sqe_buffers_unregister+0x84/0xa0
[ 108.188991][ T14] io_ring_ctx_free+0x48/0x480
[ 108.191057][ T14] io_ring_exit_work+0x764/0x7d8
[ 108.193207][ T14] process_one_work+0x7e8/0x155c
[ 108.195431][ T14] worker_thread+0x958/0xed8
[ 108.197561][ T14] kthread+0x5fc/0x75c
[ 108.199362][ T14] ret_from_fork+0x10/0x20
When only pinning a single tail page (iovec.iov_len = pagesize), it works as
expected. So when we pin two tail pages, we end up calling
io_release_ubuf()->unpin_user_page() on the head page, meaning that
"imu->bvec[i].bv_page" points at the wrong folio page (IOW, one we never
pinned).
So it's related to the io_coalesce_buffer() machinery.
And in fact, in there, we have this weird logic:
/* Store head pages only*/
new_array = kvmalloc_array(nr_folios, sizeof(struct page *), GFP_KERNEL);
...
Essentially discarding the subpage information when coalescing tail pages.
I am afraid the whole io_check_coalesce_buffer() + io_coalesce_buffer() logic
might be flawed (can we -- in theory -- coalesce different folio page ranges in
a GUP result?).
@Jens, I am not sure if this only triggers a warning when unpinning, or if we
actually mess up imu->bvec[i].bv_page and end up reading/writing pages we
never pinned in the first place.
Can you look into that, as you are more familiar with the logic?