[syzbot] general protection fault in skb_dequeue (3)

15 views
Skip to first unread message

syzbot

unread,
Feb 1, 2023, 5:04:42 AM2/1/23
to da...@davemloft.net, dhow...@redhat.com, edum...@google.com, h...@lst.de, jhub...@nvidia.com, joha...@sipsolutions.net, ku...@kernel.org, linux-...@vger.kernel.org, linux-w...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot found the following issue on:

HEAD commit: 80bd9028feca Add linux-next specific files for 20230131
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1468e369480000
kernel config: https://syzkaller.appspot.com/x/.config?x=904dc2f450eaad4a
dashboard link: https://syzkaller.appspot.com/bug?extid=a440341a59e3b7142895
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12c5d2be480000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11259a79480000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/924618188238/disk-80bd9028.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/7a03cf86e545/vmlinux-80bd9028.xz
kernel image: https://storage.googleapis.com/syzbot-assets/568e80043a41/bzImage-80bd9028.xz

The issue was bisected to:

commit 920756a3306a35f1c08f25207d375885bef98975
Author: David Howells <dhow...@redhat.com>
Date: Sat Jan 21 12:51:18 2023 +0000

block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages

bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=170384f9480000
final oops: https://syzkaller.appspot.com/x/report.txt?x=148384f9480000
console output: https://syzkaller.appspot.com/x/log.txt?x=108384f9480000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+a44034...@syzkaller.appspotmail.com
Fixes: 920756a3306a ("block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages")

general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 PID: 2838 Comm: kworker/u4:6 Not tainted 6.2.0-rc6-next-20230131-syzkaller-09515-g80bd9028feca #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
Workqueue: phy4 ieee80211_iface_work
RIP: 0010:__skb_unlink include/linux/skbuff.h:2321 [inline]
RIP: 0010:__skb_dequeue include/linux/skbuff.h:2337 [inline]
RIP: 0010:skb_dequeue+0xf5/0x180 net/core/skbuff.c:3511
Code: 8d 7e 08 49 8b 5c 24 08 48 b8 00 00 00 00 00 fc ff df 49 c7 44 24 08 00 00 00 00 48 89 fa 49 c7 04 24 00 00 00 00 48 c1 ea 03 <80> 3c 02 00 75 6d 48 89 da 49 89 5e 08 48 b8 00 00 00 00 00 fc ff
RSP: 0018:ffffc9000ca2fc80 EFLAGS: 00010002
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff8808951d RDI: 0000000000000008
RBP: 0000000000000293 R08: 0000000000000001 R09: 0000000000000003
R10: fffff52001945f7e R11: 0000000000000000 R12: ffff88801d8f63c0
R13: ffff888075675880 R14: 0000000000000000 R15: ffff888075675868
FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4a51f6d150 CR3: 0000000072a78000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
ieee80211_iface_work+0x369/0xd70 net/mac80211/iface.c:1631
process_one_work+0x9bf/0x1820 kernel/workqueue.c:2390
worker_thread+0x669/0x1090 kernel/workqueue.c:2537
kthread+0x2e8/0x3a0 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:__skb_unlink include/linux/skbuff.h:2321 [inline]
RIP: 0010:__skb_dequeue include/linux/skbuff.h:2337 [inline]
RIP: 0010:skb_dequeue+0xf5/0x180 net/core/skbuff.c:3511
Code: 8d 7e 08 49 8b 5c 24 08 48 b8 00 00 00 00 00 fc ff df 49 c7 44 24 08 00 00 00 00 48 89 fa 49 c7 04 24 00 00 00 00 48 c1 ea 03 <80> 3c 02 00 75 6d 48 89 da 49 89 5e 08 48 b8 00 00 00 00 00 fc ff
RSP: 0018:ffffc9000ca2fc80 EFLAGS: 00010002
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff8808951d RDI: 0000000000000008
RBP: 0000000000000293 R08: 0000000000000001 R09: 0000000000000003
R10: fffff52001945f7e R11: 0000000000000000 R12: ffff88801d8f63c0
R13: ffff888075675880 R14: 0000000000000000 R15: ffff888075675868
FS: 0000000000000000(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f4a51f6d150 CR3: 0000000072a78000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess):
0: 8d 7e 08 lea 0x8(%rsi),%edi
3: 49 8b 5c 24 08 mov 0x8(%r12),%rbx
8: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
f: fc ff df
12: 49 c7 44 24 08 00 00 movq $0x0,0x8(%r12)
19: 00 00
1b: 48 89 fa mov %rdi,%rdx
1e: 49 c7 04 24 00 00 00 movq $0x0,(%r12)
25: 00
26: 48 c1 ea 03 shr $0x3,%rdx
* 2a: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction
2e: 75 6d jne 0x9d
30: 48 89 da mov %rbx,%rdx
33: 49 89 5e 08 mov %rbx,0x8(%r14)
37: 48 rex.W
38: b8 00 00 00 00 mov $0x0,%eax
3d: 00 fc add %bh,%ah
3f: ff .byte 0xff


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches

Hillf Danton

unread,
Feb 1, 2023, 6:53:16 AM2/1/23
to syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On Wed, 01 Feb 2023 02:04:41 -0800
> syzbot found the following issue on:
>
> HEAD commit: 80bd9028feca Add linux-next specific files for 20230131
> git tree: linux-next
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11259a79480000

Flush work before ieee80211_setup_sdata() which initializes skb queue.

#syz test https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 80bd9028feca

--- x/net/mac80211/iface.c
+++ y/net/mac80211/iface.c
@@ -630,6 +630,7 @@ static void ieee80211_do_stop(struct iee
skb_queue_purge(&sdata->skb_queue);
skb_queue_purge(&sdata->status_queue);
}
+ flush_work(&sdata->work);

spin_lock_irqsave(&local->queue_stop_reason_lock, flags);
for (i = 0; i < IEEE80211_MAX_QUEUES; i++) {
--

syzbot

unread,
Feb 1, 2023, 10:15:20 AM2/1/23
to hda...@sina.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch but the reproducer is still triggering an issue:
BUG: Bad rss-counter state

BUG: Bad rss-counter state mm:ffff88802b07c800 type:MM_ANONPAGES val:1
BUG: non-zero pgtables_bytes on freeing mm: 8192


Tested on:

commit: 80bd9028 Add linux-next specific files for 20230131
git tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=128b1afd480000
kernel config: https://syzkaller.appspot.com/x/.config?x=904dc2f450eaad4a
dashboard link: https://syzkaller.appspot.com/bug?extid=a440341a59e3b7142895
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
patch: https://syzkaller.appspot.com/x/patch.diff?x=15e89d43480000

Hillf Danton

unread,
Feb 1, 2023, 6:39:28 PM2/1/23
to syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On Wed, 01 Feb 2023 02:04:41 -0800
> syzbot found the following issue on:
>
> HEAD commit: 80bd9028feca Add linux-next specific files for 20230131
> git tree: linux-next
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11259a79480000

Flush work before ieee80211_setup_sdata() which initializes skb queue.

#syz test https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master

syzbot

unread,
Feb 2, 2023, 12:53:19 AM2/2/23
to hda...@sina.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-and-tested-by: syzbot+a44034...@syzkaller.appspotmail.com

Tested on:

commit: 9f266cca Merge tag 'for_linus' of git://git.kernel.org..
git tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
console output: https://syzkaller.appspot.com/x/log.txt?x=10ae0ac3480000
kernel config: https://syzkaller.appspot.com/x/.config?x=723d250bd16cf869
dashboard link: https://syzkaller.appspot.com/bug?extid=a440341a59e3b7142895
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
patch: https://syzkaller.appspot.com/x/patch.diff?x=126ad755480000

Note: testing is done by a robot and is best-effort only.

David Howells

unread,
Feb 2, 2023, 3:52:22 AM2/2/23
to jhub...@nvidia.com, David Hildenbrand, syzbot, dhow...@redhat.com, da...@davemloft.net, edum...@google.com, h...@lst.de, joha...@sipsolutions.net, ku...@kernel.org, linux-...@vger.kernel.org, linux-w...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
Hi John, David,

Could you have a look at this?

> syzbot found the following issue on:
>
> HEAD commit: 80bd9028feca Add linux-next specific files for 20230131
> git tree: linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1468e369480000
> kernel config: https://syzkaller.appspot.com/x/.config?x=904dc2f450eaad4a
> dashboard link: https://syzkaller.appspot.com/bug?extid=a440341a59e3b7142895
> compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12c5d2be480000
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11259a79480000
> ...
> The issue was bisected to:
>
> commit 920756a3306a35f1c08f25207d375885bef98975
> Author: David Howells <dhow...@redhat.com>
> Date: Sat Jan 21 12:51:18 2023 +0000
>
> block: Convert bio_iov_iter_get_pages to use iov_iter_extract_pages
>
> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=170384f9480000
> final oops: https://syzkaller.appspot.com/x/report.txt?x=148384f9480000
> console output: https://syzkaller.appspot.com/x/log.txt?x=108384f9480000
> ...
> general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
> KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
> CPU: 0 PID: 2838 Comm: kworker/u4:6 Not tainted 6.2.0-rc6-next-20230131-syzkaller-09515-g80bd9028feca #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
> Workqueue: phy4 ieee80211_iface_work
> RIP: 0010:__skb_unlink include/linux/skbuff.h:2321 [inline]
> RIP: 0010:__skb_dequeue include/linux/skbuff.h:2337 [inline]
> RIP: 0010:skb_dequeue+0xf5/0x180 net/core/skbuff.c:3511

I don't think this is specifically related to anything networking. I've run
it a few times and weird stuff happens in various places. I'm wondering if
it's related to FOLL_PIN in some way.

The syzbot test in question does the following:

#{"repeat":true,"procs":1,"slowdown":1,"sandbox":"none","sandbox_arg":0,"netdev":true,"cgroups":true,"close_fds":true,"usb":true,"wifi":true,"sysctl":true,"tmpdir":true}
socket(0x0, 0x2, 0x0)
epoll_create(0x7)
r0 = creat(&(0x7f0000000040)='./bus\x00', 0x9)
ftruncate(r0, 0x800)
lseek(r0, 0x200, 0x2)
r1 = open(&(0x7f0000000000)='./bus\x00', 0x24000, 0x0) <-- O_DIRECT
sendfile(r0, r1, 0x0, 0x1dd00)

Basically a DIO splice from a file to itself.

I've hand-written my own much simpler tester (see attached). You need to run
at least two copies in parallel, I think, to trigger the bug. It's possible
truncate is interfering somehow.

David
---
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/wait.h>

#define file_size 0x800
#define send_size 0x1dd00
#define repeat_count 1000

int main(int argc, char *argv[])
{
int in, out, i, wt;

if (argc != 2 || !argv[1][0]) {
fprintf(stderr, "Usage: %s <file>\n", argv[0]);
exit(2);
}

for (i = 0; i < repeat_count; i++) {
switch (fork()) {
case -1:
perror("fork");
exit(1);
case 0:
out = creat(argv[1], 0666);
if (out < 0) {
perror(argv[1]);
exit(1);
}

if (ftruncate(out, file_size) < 0) {
perror("ftruncate");
exit(1);
}

if (lseek(out, file_size, SEEK_SET) < 0) {
perror("lseek");
exit(1);
}

in = open(argv[1], O_RDONLY | O_DIRECT | O_NOFOLLOW);
if (in < 0) {
perror("open");
exit(1);
}

if (sendfile(out, in, NULL, send_size) < 0) {
perror("sendfile");
exit(1);
}
exit(0);

default:
if (wait(&wt) < 0) {
perror("wait");
exit(1);
}
break;
}
}

exit(0);
}

David Hildenbrand

unread,
Feb 2, 2023, 4:19:15 AM2/2/23
to David Howells, jhub...@nvidia.com, syzbot, da...@davemloft.net, edum...@google.com, h...@lst.de, joha...@sipsolutions.net, ku...@kernel.org, linux-...@vger.kernel.org, linux-w...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
[as raised on IRC]

At first, I wondered if that's related to shared anonymous pages getting
pinned R/O that would trigger COW-unsharing ... but I don't even see
where we are supposed to use FOLL_PIN vs. FOLL_GET here? IOW, we're not
even supposed to access user space memory (neither FOLL_GET nor
FOLL_PIN) but still end up with a change in behavior.

--
Thanks,

David / dhildenb

David Howells

unread,
Feb 2, 2023, 10:15:19 AM2/2/23
to David Hildenbrand, dhow...@redhat.com, jhub...@nvidia.com, syzbot, da...@davemloft.net, edum...@google.com, h...@lst.de, joha...@sipsolutions.net, ku...@kernel.org, linux-...@vger.kernel.org, linux-w...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
David Hildenbrand <da...@redhat.com> wrote:

> At first, I wondered if that's related to shared anonymous pages getting
> pinned R/O that would trigger COW-unsharing ... but I don't even see where we
> are supposed to use FOLL_PIN vs. FOLL_GET here? IOW, we're not even supposed
> to access user space memory (neither FOLL_GET nor FOLL_PIN) but still end up
> with a change in behavior.

I'm not sure that using FOLL_PIN is part of the problem here.

sendfile() is creating a transient buffer attached to a pipe, doing a DIO read
into it (which uses iov_iter_extract_pages() to populate a bio) and then feeds
the pages from the pipe one at a time using a BVEC-type iterator to a buffered
write.

Note that iov_iter_extract_pages() does not pin/get pages when accessing a
BVEC, KVEC, XARRAY or PIPE iterator. However, in this case, this should not
matter as the pipe is holding refs on the pages in the buffer.

I have found that the issue goes away if I add an extra get_page() into
iov_iter_extract_pipe_pages() and don't release it (the page is then leaked).

This makes me think that the problem might be due to the pages getting
recycled from the pipe before DIO has finished writing to them - but that
shouldn't be the case as the splice has to be synchronous - we can't mark data
in the pipe as 'produced' until we've finished reading it.

I've also turned on CONFIG_DEBUG_PAGE_REF with a twist of my own to use a page
flag to mark the pages of interest and only trace those pages (see attached
patch) - and that consistently shows the pages being released after the bio is
cleared.

Further, when it does arise, the issue only happens when two or more copies of
the test program are run in parallel - which makes me wonder if truncate is
implicated, except that that would have to happen inside the filesystem and be
nothing to do with the pipe in the middle of sendfile().

David
--
commit 8ec9d7d951166d77e283079151b496632c290958
Author: David Howells <dhow...@redhat.com>
Date: Fri Nov 13 13:21:01 2020 +0000

mm: Add a page bit to trigger page_ref tracing

Add a facility to mark a page using an extra page bit so that the page is
flagged in tracing. Also make it possible to restrict the tracepoint to
only marked pages. The mark is automatically removed when the page is
freed.

Enable the followng:

CONFIG_DEBUG_PAGE_REF
CONFIG_DEBUG_PAGE_MARK
CONFIG_DEBUG_PAGE_REF_ONLY_MARKED

and then enable the page_ref tracepoints:

echo 1 >/sys/kernel/debug/tracing/events/page_ref/enable

and it shows the fate of pages created by readahead for the netfs helpers.

This also contains some bits to turn it on. If you want to trace a
filesystem other than XFS, you need to change the magic numbers in the
patch.

Signed-off-by: David Howells <dhow...@redhat.com>

diff --git a/block/bio.c b/block/bio.c
index fc57f0aa098e..69b62ebf68ed 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -20,6 +20,7 @@
#include <linux/blk-crypto.h>
#include <linux/xarray.h>

+#include <trace/events/page_ref.h>
#include <trace/events/block.h>
#include "blk.h"
#include "blk-rq-qos.h"
@@ -1174,10 +1175,17 @@ void __bio_release_pages(struct bio *bio, bool mark_dirty)
struct bio_vec *bvec;

bio_for_each_segment_all(bvec, bio, iter_all) {
- if (mark_dirty && !PageCompound(bvec->bv_page))
- set_page_dirty_lock(bvec->bv_page);
- bio_release_page(bio, bvec->bv_page);
+ if (PageDebugMark(bvec->bv_page))
+ trace_page_ref_set(bvec->bv_page, 999);
}
+
+ if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+ bio_flagged(bio, BIO_PAGE_PINNED))
+ bio_for_each_segment_all(bvec, bio, iter_all) {
+ if (mark_dirty && !PageCompound(bvec->bv_page))
+ set_page_dirty_lock(bvec->bv_page);
+ bio_release_page(bio, bvec->bv_page);
+ }
}
EXPORT_SYMBOL_GPL(__bio_release_pages);

diff --git a/fs/splice.c b/fs/splice.c
index 5969b7a1d353..4c2f13f5d6d5 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -304,6 +304,7 @@ ssize_t generic_file_splice_read(struct file *in, loff_t *ppos,
int ret;

iov_iter_pipe(&to, ITER_DEST, pipe, len);
+ to.debug = true;
init_sync_kiocb(&kiocb, in);
kiocb.ki_pos = *ppos;
ret = call_read_iter(in, &kiocb, &to);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index b2c09997d79c..cafa26637067 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,8 +484,8 @@ void zero_fill_bio(struct bio *bio);

static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
{
- if (bio_flagged(bio, BIO_PAGE_REFFED) ||
- bio_flagged(bio, BIO_PAGE_PINNED))
+ //if (bio_flagged(bio, BIO_PAGE_REFFED) ||
+ // bio_flagged(bio, BIO_PAGE_PINNED))
__bio_release_pages(bio, mark_dirty);
}

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 69e93a0c1277..80cbf784239e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -138,6 +138,9 @@ enum pageflags {
#endif
#ifdef CONFIG_KASAN_HW_TAGS
PG_skip_kasan_poison,
+#endif
+#ifdef CONFIG_DEBUG_PAGE_MARK
+ PG_debug_mark,
#endif
__NR_PAGEFLAGS,

@@ -694,6 +697,15 @@ static __always_inline bool PageKsm(struct page *page)
TESTPAGEFLAG_FALSE(Ksm, ksm)
#endif

+#ifdef CONFIG_DEBUG_PAGE_MARK
+/*
+ * Debug marks are just used for page_ref tracepoint control and display.
+ */
+PAGEFLAG(DebugMark, debug_mark, PF_ANY)
+#else
+TESTPAGEFLAG_FALSE(DebugMark, debug_mark)
+#endif
+
u64 stable_page_flags(struct page *page);

/**
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index d7c2d33baa7f..7bc1a94d9cbb 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -24,7 +24,11 @@ DECLARE_TRACEPOINT(page_ref_unfreeze);
*
* See trace_##name##_enabled(void) in include/linux/tracepoint.h
*/
-#define page_ref_tracepoint_active(t) tracepoint_enabled(t)
+#ifndef CONFIG_DEBUG_PAGE_REF_ONLY_MARKED
+#define page_ref_tracepoint_active(p, t) tracepoint_enabled(t)
+#else
+#define page_ref_tracepoint_active(p, t) (tracepoint_enabled(t) && PageDebugMark(p))
+#endif

extern void __page_ref_set(struct page *page, int v);
extern void __page_ref_mod(struct page *page, int v);
@@ -36,7 +40,7 @@ extern void __page_ref_unfreeze(struct page *page, int v);

#else

-#define page_ref_tracepoint_active(t) false
+#define page_ref_tracepoint_active(page, t) false

static inline void __page_ref_set(struct page *page, int v)
{
@@ -97,7 +101,7 @@ static inline int page_count(const struct page *page)
static inline void set_page_count(struct page *page, int v)
{
atomic_set(&page->_refcount, v);
- if (page_ref_tracepoint_active(page_ref_set))
+ if (page_ref_tracepoint_active(page, page_ref_set))
__page_ref_set(page, v);
}

@@ -118,7 +122,7 @@ static inline void init_page_count(struct page *page)
static inline void page_ref_add(struct page *page, int nr)
{
atomic_add(nr, &page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod))
+ if (page_ref_tracepoint_active(page, page_ref_mod))
__page_ref_mod(page, nr);
}

@@ -130,7 +134,7 @@ static inline void folio_ref_add(struct folio *folio, int nr)
static inline void page_ref_sub(struct page *page, int nr)
{
atomic_sub(nr, &page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod))
+ if (page_ref_tracepoint_active(page, page_ref_mod))
__page_ref_mod(page, -nr);
}

@@ -143,7 +147,7 @@ static inline int page_ref_sub_return(struct page *page, int nr)
{
int ret = atomic_sub_return(nr, &page->_refcount);

- if (page_ref_tracepoint_active(page_ref_mod_and_return))
+ if (page_ref_tracepoint_active(page, page_ref_mod_and_return))
__page_ref_mod_and_return(page, -nr, ret);
return ret;
}
@@ -156,7 +160,7 @@ static inline int folio_ref_sub_return(struct folio *folio, int nr)
static inline void page_ref_inc(struct page *page)
{
atomic_inc(&page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod))
+ if (page_ref_tracepoint_active(page, page_ref_mod))
__page_ref_mod(page, 1);
}

@@ -168,7 +172,7 @@ static inline void folio_ref_inc(struct folio *folio)
static inline void page_ref_dec(struct page *page)
{
atomic_dec(&page->_refcount);
- if (page_ref_tracepoint_active(page_ref_mod))
+ if (page_ref_tracepoint_active(page, page_ref_mod))
__page_ref_mod(page, -1);
}

@@ -181,7 +185,7 @@ static inline int page_ref_sub_and_test(struct page *page, int nr)
{
int ret = atomic_sub_and_test(nr, &page->_refcount);

- if (page_ref_tracepoint_active(page_ref_mod_and_test))
+ if (page_ref_tracepoint_active(page, page_ref_mod_and_test))
__page_ref_mod_and_test(page, -nr, ret);
return ret;
}
@@ -195,7 +199,7 @@ static inline int page_ref_inc_return(struct page *page)
{
int ret = atomic_inc_return(&page->_refcount);

- if (page_ref_tracepoint_active(page_ref_mod_and_return))
+ if (page_ref_tracepoint_active(page, page_ref_mod_and_return))
__page_ref_mod_and_return(page, 1, ret);
return ret;
}
@@ -209,7 +213,7 @@ static inline int page_ref_dec_and_test(struct page *page)
{
int ret = atomic_dec_and_test(&page->_refcount);

- if (page_ref_tracepoint_active(page_ref_mod_and_test))
+ if (page_ref_tracepoint_active(page, page_ref_mod_and_test))
__page_ref_mod_and_test(page, -1, ret);
return ret;
}
@@ -223,7 +227,7 @@ static inline int page_ref_dec_return(struct page *page)
{
int ret = atomic_dec_return(&page->_refcount);

- if (page_ref_tracepoint_active(page_ref_mod_and_return))
+ if (page_ref_tracepoint_active(page, page_ref_mod_and_return))
__page_ref_mod_and_return(page, -1, ret);
return ret;
}
@@ -237,7 +241,7 @@ static inline bool page_ref_add_unless(struct page *page, int nr, int u)
{
bool ret = atomic_add_unless(&page->_refcount, nr, u);

- if (page_ref_tracepoint_active(page_ref_mod_unless))
+ if (page_ref_tracepoint_active(page, page_ref_mod_unless))
__page_ref_mod_unless(page, nr, ret);
return ret;
}
@@ -317,7 +321,7 @@ static inline int page_ref_freeze(struct page *page, int count)
{
int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);

- if (page_ref_tracepoint_active(page_ref_freeze))
+ if (page_ref_tracepoint_active(page, page_ref_freeze))
__page_ref_freeze(page, count, ret);
return ret;
}
@@ -333,7 +337,7 @@ static inline void page_ref_unfreeze(struct page *page, int count)
VM_BUG_ON(count == 0);

atomic_set_release(&page->_refcount, count);
- if (page_ref_tracepoint_active(page_ref_unfreeze))
+ if (page_ref_tracepoint_active(page, page_ref_unfreeze))
__page_ref_unfreeze(page, count);
}

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 514e3b7b06b8..89272c05d74d 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -45,6 +45,7 @@ struct iov_iter {
bool nofault;
bool data_source;
bool user_backed;
+ bool debug;
union {
size_t iov_offset;
int last_offset;
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 412b5a46374c..5f3b9b0e4b53 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -103,6 +103,12 @@
#define IF_HAVE_PG_SKIP_KASAN_POISON(flag,string)
#endif

+#ifdef CONFIG_DEBUG_PAGE_MARK
+#define IF_HAVE_PG_DEBUG_MARK(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_DEBUG_MARK(flag,string)
+#endif
+
#define __def_pageflag_names \
{1UL << PG_locked, "locked" }, \
{1UL << PG_waiters, "waiters" }, \
@@ -132,7 +138,8 @@ IF_HAVE_PG_IDLE(PG_young, "young" ) \
IF_HAVE_PG_IDLE(PG_idle, "idle" ) \
IF_HAVE_PG_ARCH_X(PG_arch_2, "arch_2" ) \
IF_HAVE_PG_ARCH_X(PG_arch_3, "arch_3" ) \
-IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
+IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison") \
+IF_HAVE_PG_DEBUG_MARK(PG_debug_mark, "debug_mark" )

#define show_page_flags(flags) \
(flags) ? __print_flags(flags, "|", \
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index d69a05950555..7ac030208a2c 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -10,9 +10,11 @@
#include <linux/vmalloc.h>
#include <linux/splice.h>
#include <linux/compat.h>
+#include <linux/page-flags.h>
#include <net/checksum.h>
#include <linux/scatterlist.h>
#include <linux/instrumented.h>
+#include <trace/events/page_ref.h>

#define PIPE_PARANOIA /* for now */

@@ -1331,6 +1333,10 @@ static ssize_t pipe_get_pages(struct iov_iter *i,
struct page *page = append_pipe(i, left, &off);
if (!page)
break;
+ if (i->debug && !PageDebugMark(page)) {
+ //SetPageDebugMark(page);
+ get_page(page);
+ }
chunk = min_t(size_t, left, PAGE_SIZE - off);
get_page(*p++ = page);
}
@@ -1955,6 +1961,11 @@ static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
struct page *page = append_pipe(i, left, &offset);
if (!page)
break;
+ if (i->debug && !PageDebugMark(page)) {
+ SetPageDebugMark(page);
+ trace_page_ref_set(page, 888);
+ //get_page(page);
+ }
chunk = min_t(size_t, left, PAGE_SIZE - offset);
left -= chunk;
*p++ = page;
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index fca699ad1fb0..111a946a676f 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -149,6 +149,23 @@ config DEBUG_PAGE_REF
kernel code. However the runtime performance overhead is virtually
nil until the tracepoints are actually enabled.

+config DEBUG_PAGE_MARK
+ bool "Reserve a page bit to mark pages to be debugged"
+ depends on DEBUG_PAGE_REF
+ help
+ This option adds an extra page flag that can be used to mark pages
+ for debugging. The mark can be observed in the page_ref tracepoints.
+ The mark isn't set on any pages without alteration of the code. This
+ is intended for filesystem debugging and code to set the mark must be
+ added manually into the source.
+
+config DEBUG_PAGE_REF_ONLY_MARKED
+ bool "Only trace marked pages"
+ depends on DEBUG_PAGE_REF && DEBUG_PAGE_MARK
+ help
+ This option restricts the page_ref tracepoints to only track marked
+ pages.
+
config DEBUG_RODATA_TEST
bool "Testcase for the marking rodata read-only"
depends on STRICT_KERNEL_RWX
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0745aedebb37..37f146e5b2eb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1102,6 +1102,9 @@ static inline void __free_one_page(struct page *page,

VM_BUG_ON(!zone_is_initialized(zone));
VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
+#ifdef CONFIG_DEBUG_PAGE_MARK
+ ClearPageDebugMark(page);
+#endif

VM_BUG_ON(migratetype == -1);
if (likely(!is_migrate_isolate(migratetype)))
diff --git a/mm/readahead.c b/mm/readahead.c
index b10f0cf81d80..c5558daf3a56 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -248,6 +248,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
folio = filemap_alloc_folio(gfp_mask, 0);
if (!folio)
break;
+#define XFS_SUPER_MAGIC 0x58465342 /* "XFSB" */
+ if (mapping->host->i_sb->s_magic == XFS_SUPER_MAGIC)
+ folio_set_debug_mark(folio);
+
if (filemap_add_folio(mapping, folio, index + i,
gfp_mask) < 0) {
folio_put(folio);
@@ -809,6 +813,7 @@ void readahead_expand(struct readahead_control *ractl,
page = __page_cache_alloc(gfp_mask);
if (!page)
return;
+ //SetPageDebugMark(page);
if (add_to_page_cache_lru(page, mapping, index, gfp_mask) < 0) {
put_page(page);
return;
@@ -832,6 +837,7 @@ void readahead_expand(struct readahead_control *ractl,
page = __page_cache_alloc(gfp_mask);
if (!page)
return;
+ //SetPageDebugMark(page);
if (add_to_page_cache_lru(page, mapping, index, gfp_mask) < 0) {
put_page(page);
return;

David Howells

unread,
Feb 3, 2023, 3:36:20 AM2/3/23
to John Hubbard, dhow...@redhat.com, David Hildenbrand, syzbot, h...@lst.de, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Interestingly, the problem also goes away if the ftruncate is removed from the
test program.

Are pages ever spliced from the pipe directly into the pagecache on both xfs
and ext4? I'm not sure whether that could happen as the test seeks 2048 bytes
in before doing the sendfile().

David

John Hubbard

unread,
Feb 3, 2023, 4:39:52 AM2/3/23
to David Howells, David Hildenbrand, syzbot, da...@davemloft.net, edum...@google.com, h...@lst.de, joha...@sipsolutions.net, ku...@kernel.org, linux-...@vger.kernel.org, linux-w...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
On 2/2/23 00:52, David Howells wrote:
> Hi John, David,
>
> Could you have a look at this?

Sure. So far, I have reproduced a crash using your simplified test
program (it required three simulaneous running copies), and will look
deeper now.

In case it illuminates anything, the crash looked like this (below), and
was obtained *without* setting KASAN. Also a minor point: this is from a
git branch of the last commit in the series (commit fd20d0c1852e "block:
convert bio_map_user_iov to use iov_iter_extract_pages"), rather than
from top of linux-next.

Kernel panic - not syncing: corrupted stack end detected inside scheduler
CPU: 2 PID: 27177 Comm: syzbot_howells Not tainted 6.2.0-rc5-hubbard-github+ #3
Hardware name: ASUS X299-A/PRIME X299-A, BIOS 1503 08/03/2018
Call Trace:
<TASK>
dump_stack_lvl+0x4c/0x63
panic+0x113/0x2c4
? folio_wait_bit_common+0xf6/0x360
__schedule+0xd1b/0xd20
schedule+0x5d/0xe0
io_schedule+0x42/0x70
folio_wait_bit_common+0x123/0x360
? __pfx_wake_page_function+0x10/0x10
folio_wait_writeback+0x24/0x100
__filemap_fdatawait_range+0x7a/0x120
? filemap_fdatawrite_wbc+0x69/0x80
? __filemap_fdatawrite_range+0x58/0x80
filemap_write_and_wait_range+0x84/0xb0
__iomap_dio_rw+0x183/0x830
? __lock_acquire+0x3b4/0x2620
iomap_dio_rw+0xe/0x40
ext4_file_read_iter+0x141/0x1c0
generic_file_splice_read+0x90/0x160
splice_direct_to_actor+0xb1/0x210
? __pfx_direct_splice_actor+0x10/0x10
do_splice_direct+0x8c/0xd0
do_sendfile+0x352/0x600
do_syscall_64+0x37/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7f322d5116be
Code: c3 0f 1f 00 4c 89 d2 4c 89 c6 e9 fd fd ff ff 0f 1f 44 00 00 31 c0 c3 0f 1f 44 00 00 f3 0f 1e fa 49 89 ca b8 28 00 00 00 0f 05 <48> 3d 01 f0 ff ff8
RSP: 002b:00007ffd8c914538 EFLAGS: 00000202 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 00007ffd8c914678 RCX: 00007f322d5116be
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000001 R09: 00007f322d7f6740
R10: 000000000001dd00 R11: 0000000000000202 R12: 0000000000000000
R13: 00007ffd8c914690 R14: 0000558a11e29d78 R15: 00007f322d843020
</TASK>

thanks,
--
John Hubbard
NVIDIA

John Hubbard

unread,
Feb 3, 2023, 4:39:58 AM2/3/23
to David Howells, David Hildenbrand, syzbot, da...@davemloft.net, edum...@google.com, h...@lst.de, joha...@sipsolutions.net, ku...@kernel.org, linux-...@vger.kernel.org, linux-w...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
On 2/2/23 07:15, David Howells wrote:
> David Hildenbrand <da...@redhat.com> wrote:
>
>> At first, I wondered if that's related to shared anonymous pages getting
>> pinned R/O that would trigger COW-unsharing ... but I don't even see where we
>> are supposed to use FOLL_PIN vs. FOLL_GET here? IOW, we're not even supposed
>> to access user space memory (neither FOLL_GET nor FOLL_PIN) but still end up
>> with a change in behavior.
>
> I'm not sure that using FOLL_PIN is part of the problem here.

I agree. It's really not.

>
> sendfile() is creating a transient buffer attached to a pipe, doing a DIO read
> into it (which uses iov_iter_extract_pages() to populate a bio) and then feeds
> the pages from the pipe one at a time using a BVEC-type iterator to a buffered
> write.
>
> Note that iov_iter_extract_pages() does not pin/get pages when accessing a
> BVEC, KVEC, XARRAY or PIPE iterator. However, in this case, this should not
> matter as the pipe is holding refs on the pages in the buffer.
>
> I have found that the issue goes away if I add an extra get_page() into
> iov_iter_extract_pipe_pages() and don't release it (the page is then leaked).
>
> This makes me think that the problem might be due to the pages getting
> recycled from the pipe before DIO has finished writing to them - but that
> shouldn't be the case as the splice has to be synchronous - we can't mark data
> in the pipe as 'produced' until we've finished reading it.


So I thought about this for a while, and one really big glaring point
that stands out for me is: before this commit, we had this:

iov_iter_get_pages()
__iov_iter_get_pages_alloc()
pipe_get_pages()
get_page()

But now, based on the claim from various folks that "pipe cases don't
require a get_page()", we have boldly--too boldy, I believe-- moved
directly into case that doesn't do a get_page():

iov_iter_extract_pipe_pages()
...(nothing)

And your testing backs this up: adding the get_page() back hides the
failure.

So that's as far as I've got, but I am really suspecting that this is
where the root cause is: pipe references are not as locked down as we
think they are. At least in this case.

David Howells

unread,
Feb 3, 2023, 9:36:46 AM2/3/23
to John Hubbard, dhow...@redhat.com, David Hildenbrand, syzbot, h...@lst.de, linux-...@vger.kernel.org, syzkall...@googlegroups.com
I think I have managed to isolate the bug to the read side of sendfile() or
the pipe in the middle by the following:

In iter_file_splice_write(), I allocate a permanent page:

+ mutex_lock(&splice_tmp_lock);
+ if (!splice_tmp) {
+ pr_notice("alloc splice_tmp\n");
+ splice_tmp = alloc_page(GFP_USER);
+ if (splice_tmp) {
+ SetPageDebugMark(splice_tmp);
+ page_ref_add(splice_tmp, 100);
+ }
+ }
+ mutex_unlock(&splice_tmp_lock);
+ if (!splice_tmp)
+ return -ENOMEM;
+

and then stick it into the BVEC iter to be handed over to vfs_iter_write()
instead of buf->page:

- array[n].bv_page = buf->page;
+ array[n].bv_page = splice_tmp;
+ trace_page_ref_set(splice_tmp, 887);
array[n].bv_len = this_len;
array[n].bv_offset = buf->offset;

that prevents vfs_iter_write() from ever seeing the pages from the pipe - but
the crash still happens even with this change.

One thing that does make me wonder is that I've made bio_release_pages()
always call __bio_release_pages() and made the latter dump all the pages in
the bio to the trace with val=999 - but some of the time, I don't see that
happening, so I'm wondering if there are some bio structs with pointers to
released pages floating around and not getting cleaned up.

I've added some event-specific tracepoints using the page_ref_set tracepoint,
using val=N as a key indicate the event:

val=888 -> Page seen in iter_file_splice_write() with PG_debug_mark set.
val=887 -> splice_tmp subbed for page
val=777 -> Page obtained from append_pipe() in iov_iter_extract_pipe_pages()
val=623 -> Page listed in bio_endio()
val=666 -> Page added by __bio_add_page()
val=98x -> xth page listed by __bio_release_pages()
val=-1 -> put_page()

Here's an excerpt from the trace when it looks right (note I removed
mapcount=0 from all lines):

page_ref_set: pfn=0x10b70b flags=debug_mark count=1 mapping=0000000000000000 mt=0 val=777
page_ref_set: pfn=0x10b70b flags=debug_mark count=1 mapping=0000000000000000 mt=0 val=666
page_ref_set: pfn=0x10b70b flags=debug_mark count=1 mapping=0000000000000000 mt=0 val=623
page_ref_set: pfn=0x10b70b flags=debug_mark count=1 mapping=0000000000000000 mt=0 val=980
page_ref_set: pfn=0x10b70b flags=debug_mark count=1 mapping=0000000000000000 mt=0 val=888
page_ref_set: pfn=0x10642f flags=debug_mark count=101 mapping=0000000000000000 mt=0 val=887
page_ref_mod_and_test: pfn=0x116209 flags=uptodate|dirty|debug_mark count=2 mapping=00000000d367da24 mt=1 val=-1 ret=0
page_ref_mod_and_test: pfn=0x10b70b flags=debug_mark count=0 mapping=0000000000000000 mt=0 val=-1 ret=1
page_ref_set: pfn=0x116209 flags=locked|uptodate|debug_mark count=3 mapping=00000000d367da24 mt=1 val=666
page_ref_mod_and_test: pfn=0x116209 flags=uptodate|lru|writeback|debug_mark count=2 mapping=00000000d367da24 mt=1 val=-1 ret=0
page_ref_mod_and_test: pfn=0x116209 flags=uptodate|lru|writeback|debug_mark count=1 mapping=00000000d367da24 mt=1 val=-1 ret=0
page_ref_set: pfn=0x116209 flags=waiters|uptodate|lru|writeback|debug_mark count=2 mapping=00000000d367da24 mt=1 val=623
page_ref_mod_and_test: pfn=0x116209 flags=uptodate|lru|debug_mark count=2 mapping=00000000d367da24 mt=1 val=-1 ret=0
page_ref_mod_and_test: pfn=0x116209 flags=uptodate|lru|debug_mark count=1 mapping=00000000d367da24 mt=1 val=-1 ret=0

pfn=0x10b70b is a page pulled out of the pipe, pfn=0x10642f is splice_tmp (has
count=101) and pfn=0x116209 is presumably the page the data is to be written
to.

And when it looks off:

page_ref_set: pfn=0x106fca flags=debug_mark count=1 mapping=0000000000000000 mt=0 val=777
page_ref_set: pfn=0x106fca flags=debug_mark count=1 mapping=0000000000000000 mt=0 val=666
page_ref_set: pfn=0x106fca flags=debug_mark count=1 mapping=0000000000000000 mt=0 val=888
page_ref_set: pfn=0x11e6bc flags=debug_mark count=101 mapping=0000000000000000 mt=0 val=887
page_ref_mod_and_test: pfn=0x11399a flags=uptodate|dirty|debug_mark count=2 mapping=000000003bd7df47 mt=1 val=-1 ret=0
page_ref_mod_and_test: pfn=0x106fca flags=debug_mark count=0 mapping=0000000000000000 mt=0 val=-1 ret=1
page_ref_set: pfn=0x11399a flags=locked|uptodate|debug_mark count=3 mapping=000000003bd7df47 mt=1 val=666
page_ref_mod_and_test: pfn=0x11399a flags=uptodate|lru|writeback|debug_mark count=2 mapping=000000003bd7df47 mt=1 val=-1 ret=0
page_ref_mod_and_test: pfn=0x11399a flags=uptodate|lru|writeback|debug_mark count=1 mapping=000000003bd7df47 mt=1 val=-1 ret=0
page_ref_mod_and_test: pfn=0x11399a flags=uptodate|lru|debug_mark count=1 mapping=000000003bd7df47 mt=1 val=-1 ret=0

pfn=0x106fca is the page from the pipe, pfn=0x11e6bc is splice_tmp and
pfn=0x11399a is the destination.

In the first trace, there's a val=980 line (page released from bio), but not
in the second trace.

David

David Howells

unread,
Feb 3, 2023, 11:24:08 AM2/3/23
to John Hubbard, dhow...@redhat.com, David Hildenbrand, syzbot, h...@lst.de, linux-...@vger.kernel.org, syzkall...@googlegroups.com
David Howells <dhow...@redhat.com> wrote:

> I think I have managed to isolate the bug to the read side of sendfile() or
> the pipe in the middle by the following:

I did something similar in iov_iter_extract_pipe_pages(), allocating a
permanent page there:

+ mutex_lock(&extract_tmp_lock);
+ if (!extract_tmp) {
+ pr_notice("alloc extract_tmp\n");
+ extract_tmp = alloc_page(GFP_USER);
+ if (extract_tmp) {
+ SetPageDebugMark(extract_tmp);
+ page_ref_add(extract_tmp, 200);
+ }
+ }
+ mutex_unlock(&extract_tmp_lock);
+ if (!extract_tmp)
+ return -ENOMEM;

and then subbing that for the returned page:

chunk = min_t(size_t, left, PAGE_SIZE - offset);
left -= chunk;
- *p++ = page;
+ //*p++ = page;
+ *p++ = extract_tmp;

That makes the oopses stop happening. Pages are still being added to the pipe
at one end and being removed at the other.

So I'm guessing a DMA happens to the destination buffer for the DIO read after
it has been released.

David

David Howells

unread,
Feb 3, 2023, 11:27:18 AM2/3/23
to Christoph Hellwig, dhow...@redhat.com, John Hubbard, David Hildenbrand, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hi Christoph,

What does ftruncate() do if there's a conflicting DIO read happening on the
same file? Does it cancel the bio?

David

David Howells

unread,
Feb 3, 2023, 11:30:27 AM2/3/23
to John Hubbard, dhow...@redhat.com, David Hildenbrand, syzbot, h...@lst.de, linux-...@vger.kernel.org, syzkall...@googlegroups.com
David Howells <dhow...@redhat.com> wrote:

> I think I have managed to isolate the bug to the read side of sendfile() or
> the pipe in the middle by the following:
>
> In iter_file_splice_write(), I allocate a permanent page:
> ...
> and then stick it into the BVEC iter to be handed over to vfs_iter_write()
> instead of buf->page:
>
> - array[n].bv_page = buf->page;
> + array[n].bv_page = splice_tmp;
> + trace_page_ref_set(splice_tmp, 887);
> array[n].bv_len = this_len;
> array[n].bv_offset = buf->offset;
>
> that prevents vfs_iter_write() from ever seeing the pages from the pipe - but
> the crash still happens even with this change.

With the DIO output isolation in iov_iter_extract_pipe_pages(), this change
can be removed without causing oopses to happen.

David

Christoph Hellwig

unread,
Feb 3, 2023, 11:31:12 AM2/3/23
to David Howells, Christoph Hellwig, John Hubbard, David Hildenbrand, syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
There is no way to cancel a bio. ->setattr is supposed to call
inode_dio_wait to wait for pending I/O, although some file systems
like btrfs have their own hand crafted and more complicated version
of that.

Hillf Danton

unread,
Feb 7, 2023, 4:58:22 AM2/7/23
to syzbot, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On Wed, 01 Feb 2023 02:04:41 -0800
> syzbot found the following issue on:
>
> HEAD commit: 80bd9028feca Add linux-next specific files for 20230131
> git tree: linux-next
> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=11259a79480000

See if it is due to ref leak.

#syz test https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git 4fafd96910ad

--- x/mm/mmu_gather.c
+++ y/mm/mmu_gather.c
@@ -26,7 +26,11 @@ static bool tlb_next_batch(struct mmu_ga
batch = tlb->active;
if (batch->next) {
tlb->active = batch->next;
- return true;
+ batch = tlb->active;
+ if (batch->nr < batch->max)
+ return true;
+ else
+ return false;
}

if (tlb->batch_count == MAX_GATHER_BATCH_COUNT)
--- x/block/bio.c
+++ y/block/bio.c
@@ -1253,6 +1253,7 @@ static int __bio_iov_iter_get_pages(stru
unsigned len, i = 0;
size_t offset, trim;
int ret = 0;
+ int setref = 0;

/*
* Move page array up in the allocated memory for the bio vecs as far as
@@ -1302,13 +1303,16 @@ static int __bio_iov_iter_get_pages(stru
bio_iov_add_page(bio, page, len, offset);

offset = 0;
+ get_page(page);
}
+ setref = i;

- iov_iter_revert(iter, left);
out:
while (i < nr_pages)
bio_release_page(bio, pages[i++]);

+ if (setref)
+ bio_set_flag(bio, BIO_PAGE_REFFED);
return ret;
}

@@ -1342,8 +1346,6 @@ int bio_iov_iter_get_pages(struct bio *b
return 0;
}

- if (iov_iter_extract_will_pin(iter))
- bio_set_flag(bio, BIO_PAGE_PINNED);
do {
ret = __bio_iov_iter_get_pages(bio, iter);
} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
--

syzbot

unread,
Feb 7, 2023, 5:24:18 AM2/7/23
to hda...@sina.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-and-tested-by: syzbot+a44034...@syzkaller.appspotmail.com

Tested on:

commit: 4fafd969 Add linux-next specific files for 20230203
git tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
console output: https://syzkaller.appspot.com/x/log.txt?x=158599df480000
kernel config: https://syzkaller.appspot.com/x/.config?x=8448640b98955cb4
dashboard link: https://syzkaller.appspot.com/bug?extid=a440341a59e3b7142895
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
patch: https://syzkaller.appspot.com/x/patch.diff?x=17d9b237480000

David Howells

unread,
Feb 7, 2023, 6:22:40 AM2/7/23
to syzbot, dhow...@redhat.com, da...@davemloft.net, edum...@google.com, h...@lst.de, jhub...@nvidia.com, joha...@sipsolutions.net, ku...@kernel.org, linux-...@vger.kernel.org, linux-w...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com

syzbot

unread,
Feb 7, 2023, 7:29:18 AM2/7/23
to da...@davemloft.net, dhow...@redhat.com, edum...@google.com, h...@lst.de, jhub...@nvidia.com, joha...@sipsolutions.net, ku...@kernel.org, linux-...@vger.kernel.org, linux-w...@vger.kernel.org, net...@vger.kernel.org, pab...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot has tested the proposed patch and the reproducer did not trigger any issue:

Reported-and-tested-by: syzbot+a44034...@syzkaller.appspotmail.com

Tested on:

commit: e7b5bbb5 iov_iter: Kill ITER_PIPE
git tree: https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/ iov-fixes
console output: https://syzkaller.appspot.com/x/log.txt?x=129d541f480000
kernel config: https://syzkaller.appspot.com/x/.config?x=318265a8e536ca3f
dashboard link: https://syzkaller.appspot.com/bug?extid=a440341a59e3b7142895
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2

Note: no patches were applied.
Reply all
Reply to author
Forward
0 new messages