INFO: task hung in pipe_release (2)

syzbot

Aug 2, 2020, 2:26:20 PM
to linux-...@vger.kernel.org, linux-...@vger.kernel.org, syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk
Hello,

syzbot found the following issue on:

HEAD commit: 6ba1b005 Merge tag 'asm-generic-fixes-5.8' of git://git.ke..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16a5a3b0900000
kernel config: https://syzkaller.appspot.com/x/.config?x=84f076779e989e69
dashboard link: https://syzkaller.appspot.com/bug?extid=61acc40a49a3e46e25ea
compiler: gcc (GCC) 10.1.0-syz 20200507
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=142ae224900000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+61acc4...@syzkaller.appspotmail.com

INFO: task syz-executor.0:7075 blocked for more than 143 seconds.
Not tainted 5.8.0-rc7-syzkaller #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor.0 D28288 7075 6828 0x00000004
Call Trace:
context_switch kernel/sched/core.c:3458 [inline]
__schedule+0x8ea/0x2210 kernel/sched/core.c:4219
schedule+0xd0/0x2a0 kernel/sched/core.c:4294
schedule_preempt_disabled+0xf/0x20 kernel/sched/core.c:4353
__mutex_lock_common kernel/locking/mutex.c:1033 [inline]
__mutex_lock+0x3e2/0x10d0 kernel/locking/mutex.c:1103
__pipe_lock fs/pipe.c:87 [inline]
pipe_release+0x49/0x320 fs/pipe.c:722
__fput+0x33c/0x880 fs/file_table.c:281
task_work_run+0xdd/0x190 kernel/task_work.c:135
tracehook_notify_resume include/linux/tracehook.h:188 [inline]
exit_to_usermode_loop arch/x86/entry/common.c:239 [inline]
__prepare_exit_to_usermode+0x1e9/0x1f0 arch/x86/entry/common.c:269
do_syscall_64+0x6c/0xe0 arch/x86/entry/common.c:393
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x415ee1
Code: Bad RIP value.
RSP: 002b:00007ffd249bc0c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000415ee1
RDX: 0000000000000000 RSI: 00000000007904d0 RDI: 0000000000000006
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 00007ffd249bc1b0 R11: 0000000000000293 R12: 00000000007905a8
R13: 00000000000ffc79 R14: ffffffffffffffff R15: 000000000078bfac

Showing all locks held in the system:
1 lock held by khungtaskd/1153:
#0: ffffffff89bc11c0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x53/0x260 kernel/locking/lockdep.c:5779
1 lock held by in:imklog/6695:
#0: ffff8880979b6670 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0xe9/0x100 fs/file.c:826
1 lock held by syz-executor.0/7075:
#0: ffff8880a8e02068 (&pipe->mutex/1){+.+.}-{3:3}, at: __pipe_lock fs/pipe.c:87 [inline]
#0: ffff8880a8e02068 (&pipe->mutex/1){+.+.}-{3:3}, at: pipe_release+0x49/0x320 fs/pipe.c:722
2 locks held by syz-executor.0/7077:

=============================================

NMI backtrace for cpu 0
CPU: 0 PID: 1153 Comm: khungtaskd Not tainted 5.8.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x18f/0x20d lib/dump_stack.c:118
nmi_cpu_backtrace.cold+0x70/0xb1 lib/nmi_backtrace.c:101
nmi_trigger_cpumask_backtrace+0x1b3/0x223 lib/nmi_backtrace.c:62
trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
check_hung_uninterruptible_tasks kernel/hung_task.c:209 [inline]
watchdog+0xd7d/0x1000 kernel/hung_task.c:295
kthread+0x3b5/0x4a0 kernel/kthread.c:291
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 7077 Comm: syz-executor.0 Not tainted 5.8.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:iov_iter_alignment+0x39e/0x850 lib/iov_iter.c:1236
Code: 48 8b 44 24 18 80 38 00 0f 85 f4 03 00 00 4d 8b 7d 18 44 89 e3 48 89 d8 48 c1 e0 04 4c 01 f8 48 8d 78 0c 48 89 fa 48 c1 ea 03 <42> 0f b6 0c 32 48 89 fa 83 e2 07 83 c2 03 38 ca 7c 08 84 c9 0f 85
RSP: 0018:ffffc9000206f868 EFLAGS: 00000a07
RAX: ffff8880a8124c00 RBX: 0000000000000000 RCX: ffffffff8395bf6f
RDX: 1ffff11015024981 RSI: ffffffff8395be14 RDI: ffff8880a8124c0c
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff8880994e63c7
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffc9000206fc68 R14: dffffc0000000000 R15: ffff8880a8124c00
FS: 00007fba306d0700(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f9573761000 CR3: 0000000091a0d000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
ext4_unaligned_io fs/ext4/file.c:175 [inline]
ext4_dio_write_iter fs/ext4/file.c:465 [inline]
ext4_file_write_iter+0x345/0x13d0 fs/ext4/file.c:653
call_write_iter include/linux/fs.h:1908 [inline]
do_iter_readv_writev+0x567/0x780 fs/read_write.c:713
do_iter_write+0x188/0x5f0 fs/read_write.c:1018
vfs_iter_write+0x70/0xa0 fs/read_write.c:1059
iter_file_splice_write+0x721/0xbe0 fs/splice.c:750
do_splice_from fs/splice.c:846 [inline]
do_splice+0xbcd/0x1820 fs/splice.c:1144
__do_sys_splice fs/splice.c:1419 [inline]
__se_sys_splice fs/splice.c:1401 [inline]
__x64_sys_splice+0x198/0x250 fs/splice.c:1401
do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:384
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45c369
Code: 8d b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 5b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fba306cfc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
RAX: ffffffffffffffda RBX: 0000000000032240 RCX: 000000000045c369
RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000005
RBP: 000000000078bff8 R08: 000000000000ffe0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000078bfac
R13: 00007ffd249bc04f R14: 00007fba306d09c0 R15: 000000000078bfac


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches

Tetsuo Handa

Aug 7, 2020, 6:35:20 AM
to syzkall...@googlegroups.com, vi...@zeniv.linux.org.uk, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org
syzbot is reporting a hung task at pipe_release() [1]: for_each_bvec(), called
from iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
call_write_iter() from do_iter_readv_writev() from do_iter_write() from
vfs_iter_write() from iter_file_splice_write(), falls into an infinite busy
loop with pipe->mutex held.

The busy loop happens because iter_file_splice_write() for some reason
generates a "struct bio_vec" entry with .bv_len=0 and .bv_offset=0, and
for_each_bvec() cannot handle .bv_len == 0. To see what
iter_file_splice_write() passes down, I added the following debug output:

--- a/fs/splice.c
+++ b/fs/splice.c
@@ -747,6 +747,14 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 		}
 
 		iov_iter_bvec(&from, WRITE, array, n, sd.total_len - left);
+		if (!strncmp(current->comm, "syz-executor", 12)) {
+			int i;
+			printk("Starting vfs_write_iter from.type=%d from.iov_offset=%zu from.count=%zu n=%u sd.total_len=%zu left=%zu\n",
+			       from.type, from.iov_offset, from.count, n, sd.total_len, left);
+			for (i = 0; i < n; i++)
+				printk(" array[%u]: bv_page=%px bv_len=%u bv_offset=%u\n",
+				       i, array[i].bv_page, array[i].bv_len, array[i].bv_offset);
+		}
 		ret = vfs_iter_write(out, &from, &sd.pos, 0);
 		if (ret <= 0)
 			break;

When splice() from the pipe to the file works:

[ 31.704915][ T6552] Starting vfs_write_iter from.type=17 from.iov_offset=0 from.count=4096 n=1 sd.total_len=65504 left=61408
[ 31.709098][ T6552] array[0]: bv_page=ffffea000870a7c0 bv_len=4096 bv_offset=0

When splice() from the pipe to the file falls into the infinite busy loop:

[ 31.717178][ T6553] Starting vfs_write_iter from.type=17 from.iov_offset=0 from.count=4096 n=2 sd.total_len=65504 left=61408
[ 31.720983][ T6553] array[0]: bv_page=ffffea0008706680 bv_len=0 bv_offset=0
[ 31.723565][ T6553] array[1]: bv_page=ffffea00086f4e80 bv_len=4096 bv_offset=0
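
So array[0] has bv_len == 0 and for_each_bvec() never advances past it. The
following standalone program is a minimal model of the iterator logic
(simplified from include/linux/bvec.h; the names mirror the kernel's, but this
is only an illustration, not the kernel code) and shows the stall in userspace.

----------
#include <stdio.h>

struct bio_vec { unsigned int bv_len; };
struct bvec_iter { unsigned int bi_size, bi_idx, bi_bvec_done; };

static void bvec_iter_advance(const struct bio_vec *bv,
			      struct bvec_iter *iter, unsigned int bytes)
{
	iter->bi_size -= bytes;
	bytes += iter->bi_bvec_done;
	/* This skip loop never runs when bytes == 0, i.e. when the
	 * current segment has bv_len == 0. */
	while (bytes && bytes >= bv[iter->bi_idx].bv_len) {
		bytes -= bv[iter->bi_idx].bv_len;
		iter->bi_idx++;
	}
	iter->bi_bvec_done = bytes;
}

int main(void)
{
	/* vec[0] is the empty entry seen in the output above. */
	struct bio_vec vec[2] = { { 0 }, { 4096 } };
	struct bvec_iter iter = { .bi_size = 4096 };
	int step;

	/* Shape of for_each_bvec(): fetch the current segment, then
	 * advance by its length.  bv_len stays 0, so bi_size never
	 * shrinks; the loop is capped here so the demo terminates. */
	for (step = 0; iter.bi_size && step < 3; step++) {
		unsigned int len = vec[iter.bi_idx].bv_len;
		printf("step %d: bi_idx=%u bi_size=%u bv_len=%u\n",
		       step, iter.bi_idx, iter.bi_size, len);
		bvec_iter_advance(vec, &iter, len);
	}
	return 0;
}
----------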

Is it normal behavior that an empty page gets linked into the pipe's array?
If yes, don't we need to skip empty pages when iter_file_splice_write() fills in "struct bio_vec *array"?

[1] https://syzkaller.appspot.com/bug?id=2ccac875e85dc852911a0b5b788ada82dc0a081e

Al Viro

Aug 7, 2020, 8:27:33 AM
to Tetsuo Handa, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Ming Lei
On Fri, Aug 07, 2020 at 07:35:08PM +0900, Tetsuo Handa wrote:
> syzbot is reporting a hung task at pipe_release() [1]: for_each_bvec(), called
> from iterate_bvec() from iterate_all_kinds() from iov_iter_alignment() from
> ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter() from
> call_write_iter() from do_iter_readv_writev() from do_iter_write() from
> vfs_iter_write() from iter_file_splice_write(), falls into an infinite busy
> loop with pipe->mutex held.
>
> The busy loop happens because iter_file_splice_write() for some reason
> generates a "struct bio_vec" entry with .bv_len=0 and .bv_offset=0, and
> for_each_bvec() cannot handle .bv_len == 0.

broken in 1bdc76aea115 "iov_iter: use bvec iterator to implement iterate_bvec()",
unless I'm misreading it...

Zero-length segments are not disallowed; it's not all that hard to filter them
out in iter_file_splice_write(), but the intent had always been to have
iterate_all_kinds() et al. able to cope with those.
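
For reference, a sketch of what that filtering could look like, against the
v5.8 vector-building loop in iter_file_splice_write() (an untested
illustration, not a proposed patch; pipe_buf_confirm() and error handling
elided):

	for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++) {
		struct pipe_buffer *buf = &pipe->bufs[tail & mask];
		size_t this_len = buf->len;

		/* don't emit a bio_vec with bv_len == 0 for an empty buffer */
		if (!this_len)
			continue;

		if (this_len > left)
			this_len = left;

		array[n].bv_page = buf->page;
		array[n].bv_len = this_len;
		array[n].bv_offset = buf->offset;
		left -= this_len;
		n++;
	}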

How are these pipe_buffers with ->len == 0 generated in that reproducer, BTW?
There might be something else fishy going on...

Tetsuo Handa

Aug 7, 2020, 8:34:22 AM
to Al Viro, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Ming Lei
OK. Indeed, a write to an empty pipe that returns -EFAULT leaves an empty
page linked into the pipe's array.

Now, I've just found a simple reproducer, and confirmed that this bug is a
local lockup DoS exploitable by an unprivileged user. Please fix.

----------
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	static char buffer[4096];
	const int fd = open("/tmp/testfile", O_WRONLY | O_CREAT, 0600);
	int pipe_fd[2] = { EOF, EOF };

	pipe(pipe_fd);
	/* Faults with -EFAULT, but leaves an empty page linked into the
	 * pipe's buffer array. */
	write(pipe_fd[1], NULL, 4096);
	/* Queue a normal page behind the empty one. */
	write(pipe_fd[1], buffer, 4096);
	/* iter_file_splice_write() builds a bio_vec array whose first
	 * entry has bv_len == 0, and for_each_bvec() spins on it forever. */
	splice(pipe_fd[0], NULL, fd, NULL, 65536, 0);
	return 0;
}
----------
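
As far as I can tell, the empty page is left behind because of the order of
operations in the v5.8 pipe_write() in fs/pipe.c. Paraphrased from my reading
of that function (an approximation, not a verbatim excerpt):

	/* pipe_write() publishes the slot before copying the user data */
	buf = &pipe->bufs[head & mask];
	buf->page = page;
	buf->offset = 0;
	buf->len = 0;			/* slot becomes visible with len == 0 */
	pipe->head = head + 1;

	copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
	if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
		if (!ret)
			ret = -EFAULT;	/* the write(NULL, ...) faults here */
		break;			/* ...and the empty slot stays queued */
	}
	buf->len = copied;		/* only a successful copy sets the length */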

[ 125.598898][ C0] rcu: INFO: rcu_sched self-detected stall on CPU
[ 125.601072][ C0] rcu: 0-....: (20171 ticks this GP) idle=526/1/0x4000000000000000 softirq=7918/7918 fqs=5136
[ 125.604874][ C0] (t=21006 jiffies g=9341 q=30)
[ 125.606512][ C0] NMI backtrace for cpu 0
[ 125.607931][ C0] CPU: 0 PID: 2792 Comm: a.out Not tainted 5.8.0+ #793
[ 125.610948][ C0] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
[ 125.614938][ C0] Call Trace:
[ 125.616049][ C0] <IRQ>
[ 125.617010][ C0] dump_stack+0x5e/0x7a
[ 125.618370][ C0] nmi_cpu_backtrace.cold.7+0x14/0x52
[ 125.620148][ C0] ? lapic_can_unplug_cpu.cold.39+0x3a/0x3a
[ 125.622074][ C0] nmi_trigger_cpumask_backtrace+0x92/0x9f
[ 125.624154][ C0] arch_trigger_cpumask_backtrace+0x14/0x20
[ 125.626102][ C0] rcu_dump_cpu_stacks+0xa0/0xd0
[ 125.627919][ C0] rcu_sched_clock_irq.cold.95+0x121/0x39c
[ 125.629833][ C0] ? acct_account_cputime+0x17/0x20
[ 125.631534][ C0] ? account_system_index_time+0x8a/0xa0
[ 125.633422][ C0] update_process_times+0x23/0x60
[ 125.635070][ C0] tick_sched_handle.isra.22+0x20/0x60
[ 125.636870][ C0] tick_sched_timer+0x68/0x80
[ 125.638403][ C0] ? tick_sched_handle.isra.22+0x60/0x60
[ 125.640588][ C0] __hrtimer_run_queues+0xf9/0x1a0
[ 125.642591][ C0] hrtimer_interrupt+0xfc/0x210
[ 125.645033][ C0] __sysvec_apic_timer_interrupt+0x4c/0x60
[ 125.647292][ C0] asm_call_on_stack+0xf/0x20
[ 125.649192][ C0] </IRQ>
[ 125.650501][ C0] sysvec_apic_timer_interrupt+0x75/0x80
[ 125.652900][ C0] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 125.655487][ C0] RIP: 0010:iov_iter_copy_from_user_atomic+0x19b/0x350
[ 125.658124][ C0] Code: 89 45 d0 48 c1 e6 06 48 03 37 4d 8d 3c 09 4c 89 cf e8 d9 e5 ff ff 48 8b 45 d0 45 39 eb 0f 87 35 01 00 00 49 8b 4a 18 4d 89 f9 <45> 29 dd 45 01 d8 75 12 eb 19 41 83 c4 01 41 29 c0 74 10 44 89 e0
[ 125.666132][ C0] RSP: 0018:ffffa6cdc1237aa8 EFLAGS: 00000246
[ 125.668557][ C0] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff945035a25100
[ 125.671576][ C0] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff945035a25100
[ 125.674851][ C0] RBP: ffffa6cdc1237ad8 R08: 0000000000000000 R09: ffff945028a80000
[ 125.677989][ C0] R10: ffffa6cdc1237de0 R11: 0000000000000000 R12: 0000000000000000
[ 125.680990][ C0] R13: 0000000000001000 R14: 0000000000001000 R15: 0000000000001000
[ 125.684006][ C0] iomap_write_actor+0xbe/0x190
[ 125.685982][ C0] ? iomap_write_begin+0x460/0x460
[ 125.688031][ C0] iomap_apply+0xf4/0x1a0
[ 125.689810][ C0] ? iomap_write_begin+0x460/0x460
[ 125.691826][ C0] iomap_file_buffered_write+0x69/0x90
[ 125.698598][ C0] ? iomap_write_begin+0x460/0x460
[ 125.705341][ C0] xfs_file_buffered_aio_write+0xc2/0x2c0
[ 125.707780][ C0] xfs_file_write_iter+0xa3/0xc0
[ 125.709802][ C0] do_iter_readv_writev+0x15b/0x1c0
[ 125.712496][ C0] do_iter_write+0x81/0x190
[ 125.715245][ C0] vfs_iter_write+0x14/0x20
[ 125.717221][ C0] iter_file_splice_write+0x288/0x3e0
[ 125.719340][ C0] do_splice_from+0x1a/0x40
[ 125.721175][ C0] do_splice+0x2e5/0x620
[ 125.722950][ C0] __x64_sys_splice+0x97/0x100
[ 125.724937][ C0] do_syscall_64+0x31/0x40
[ 125.726766][ C0] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 125.729388][ C0] RIP: 0033:0x7f7515dd91c3
[ 125.731246][ C0] Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 83 3d fd dd 2c 00 00 75 13 49 89 ca b8 13 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 2b d4 00 00 48 89 04 24
[ 125.749632][ C0] RSP: 002b:00007ffd553cde18 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
[ 125.753713][ C0] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f7515dd91c3
[ 125.756867][ C0] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000004
[ 125.759872][ C0] RBP: 0000000000000000 R08: 0000000000010000 R09: 0000000000000000
[ 125.763023][ C0] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004005a6
[ 125.766225][ C0] R13: 00007ffd553cdf10 R14: 0000000000000000 R15: 0000000000000000

Al Viro

Aug 7, 2020, 8:38:59 AM
to Tetsuo Handa, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Ming Lei
FWIW, my preference would be to have for_each_bvec() advance past zero-length
segments; I'll need to go through its uses elsewhere in the tree first, though
(after I grab some sleep).

Ming Lei

Aug 7, 2020, 9:41:52 AM
to Al Viro, Tetsuo Handa, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org
Usually the block layer doesn't allow/support zero-length bvecs; however, we
can make for_each_bvec() alone support them.

Tetsuo, can you try the following patch?

diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index ac0c7299d5b8..b03c793dd28d 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -117,11 +117,19 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
 	return true;
 }
 
+static inline void bvec_iter_skip_zero_vec(const struct bio_vec *bv,
+					   struct bvec_iter *iter)
+{
+	iter->bi_idx++;
+	iter->bi_bvec_done = 0;
+}
+
 #define for_each_bvec(bvl, bio_vec, iter, start) \
 	for (iter = (start); \
 	     (iter).bi_size && \
-		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1); \
-	     bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len))
+		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1); \
+	     (bvl).bv_len ? bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len) : \
+			    bvec_iter_skip_zero_vec((bio_vec), &(iter)))
 
 /* for iterating one bio from start to end */
 #define BVEC_ITER_ALL_INIT (struct bvec_iter) \
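
(The ternary in the loop's third expression ensures every iteration makes
progress: a zero-length segment steps bi_idx past itself and resets
bi_bvec_done instead of advancing by zero bytes.)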

Thanks,
Ming

Matthew Wilcox

Aug 7, 2020, 10:12:04 AM
to Ming Lei, Al Viro, Tetsuo Handa, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org
Uhm, bvec_iter_advance() already skips over zero length bio_vecs.

	while (bytes && bytes >= bv[idx].bv_len) {
		bytes -= bv[idx].bv_len;
		idx++;
	}

The problem is when the _first_ bio_vec is zero length. Maybe something more
like this (which doesn't even compile, but hopefully makes my point):

@@ -86,12 +86,24 @@ struct bvec_iter_all {
 	(mp_bvec_iter_page((bvec), (iter)) + \
 	 mp_bvec_iter_page_idx((bvec), (iter)))
 
-#define bvec_iter_bvec(bvec, iter) \
-((struct bio_vec) { \
-	.bv_page = bvec_iter_page((bvec), (iter)), \
-	.bv_len = bvec_iter_len((bvec), (iter)), \
-	.bv_offset = bvec_iter_offset((bvec), (iter)), \
-})
+static inline bool bvec_iter_bvec(struct bio_vec *bv, struct bio_vec *bvec,
+				  struct bvec_iter *iter)
+{
+	unsigned int idx = iter->bi_idx;
+
+	if (!iter->bi_size)
+		return false;
+
+	while (!bvec[idx].bv_len)
+		idx++;
+	iter->bi_idx = idx;
+
+	bv->bv_page = bvec_iter_page(bvec, *iter);
+	bv->bv_len = bvec_iter_len(bvec, *iter);
+	bv->bv_offset = bvec_iter_offset(bvec, *iter);
+
+	return true;
+}
 
 static inline bool bvec_iter_advance(const struct bio_vec *bv,
 				     struct bvec_iter *iter, unsigned bytes)
@@ -119,8 +131,7 @@ static inline bool bvec_iter_advance(const struct bio_vec *bv,
 
 #define for_each_bvec(bvl, bio_vec, iter, start) \
 	for (iter = (start); \
-	     (iter).bi_size && \
-		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1); \
+	     bvec_iter_bvec(&(bvl), (bio_vec), &(iter)); \
 	     bvec_iter_advance((bio_vec), &(iter), (bvl).bv_len))
 
 /* for iterating one bio from start to end */

(I find the whole bvec handling a mess of confusing macros and would
welcome more of it being inline functions, in general).

Tetsuo Handa

Aug 7, 2020, 10:17:42 AM
to Ming Lei, Al Viro, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org
On 2020/08/07 22:41, Ming Lei wrote:
>> FWIW, my preference would be to have for_each_bvec() advance past zero-length
>> segments; I'll need to go through its uses elsewhere in the tree first, though
>> (after I grab some sleep).
>
> Usually the block layer doesn't allow/support zero-length bvecs; however, we
> can make for_each_bvec() alone support them.
>
> Tetsuo, can you try the following patch?

Yes, this patch solves the lockup. Thank you.

Tetsuo Handa

Aug 7, 2020, 11:11:51 AM
to Matthew Wilcox, Ming Lei, Al Viro, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Dmitry Vyukov
On 2020/08/07 23:11, Matthew Wilcox wrote:
> (I find the whole bvec handling a mess of confusing macros and would
> welcome more of it being inline functions, in general).

Indeed. Inline functions would be more useful than macros when syzbot
calculates the source-code location from an address for reporting.
I spent a lot of time figuring out where

RIP: 0010:iov_iter_alignment+0x39e/0x850 lib/iov_iter.c:1236

inside the complicated macros actually is. If inlined line numbers were
available, I could have narrowed down the location of the infinite loop faster...

Ming Lei

Aug 8, 2020, 10:31:45 PM
to Matthew Wilcox, Al Viro, Tetsuo Handa, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org
The issue is that a zero (bvl).bv_len is passed to bvec_iter_advance(), so
the iterator can't move on.

And I tried to avoid changing bvec_iter_advance(), since this exact issue
only exists in for_each_bvec(), and the block layer won't support/allow
zero-length bvecs.

>
> The problem is when the _first_ bio_vec is zero length.

It can be any zero-length bvec hit during the iteration.
The above change may bring more code duplication. Meanwhile, it can't work,
because (bvl).bv_len isn't taken into account in bvec_iter_bvec(), so how
can the iterator advance?


Thanks,
Ming

Ming Lei

Aug 8, 2020, 10:50:15 PM
to Matthew Wilcox, Al Viro, Tetsuo Handa, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org
Oops, it looks like you change bvec_iter_bvec() only to skip zero-length
bvecs, and that way might work (though ->bi_bvec_done still isn't reset).
However, the change is ugly, because the iterator is not supposed to be
updated in bvec_iter_bvec(). Also, the block layer code doesn't require
such a change.

BTW, I agree on switching to inline functions if performance isn't affected.


Thanks,
Ming

Qian Cai

Aug 31, 2020, 8:51:40 PM
to Tetsuo Handa, Al Viro, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Ming Lei, pau...@kernel.org
On Fri, Aug 07, 2020 at 09:34:08PM +0900, Tetsuo Handa wrote:
> On 2020/08/07 21:27, Al Viro wrote:
> > On Fri, Aug 07, 2020 at 07:35:08PM +0900, Tetsuo Handa wrote:
> >> syzbot is reporting a hung task at pipe_release() [1]: for_each_bvec(),
> >> called from iterate_bvec() from iterate_all_kinds() from iov_iter_alignment()
> >> from ext4_unaligned_io() from ext4_dio_write_iter() from ext4_file_write_iter()
> >> from call_write_iter() from do_iter_readv_writev() from do_iter_write() from
> >> vfs_iter_write() from iter_file_splice_write(), falls into an infinite busy
> >> loop with pipe->mutex held.
> >>
> >> The busy loop happens because iter_file_splice_write() for some reason
> >> generates a "struct bio_vec" entry with .bv_len=0 and .bv_offset=0, and
> >> for_each_bvec() cannot handle .bv_len == 0.
> >
> > broken in 1bdc76aea115 "iov_iter: use bvec iterator to implement iterate_bvec()",
> > unless I'm misreading it...

I have been chasing something similar for a while, as in:

https://lore.kernel.org/linux-fsdevel/89F418A9-EB20-48CB...@lca.pw/

In my case, it seems the endless loop happens in iterate_iovec() instead;
here is the debug patch I used:

--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -33,6 +33,7 @@
 		if (unlikely(!__v.iov_len)) \
 			continue; \
 		__v.iov_base = __p->iov_base; \
+		printk_ratelimited("ITER_IOVEC left = %zu, n = %zu\n", left, n); \
 		left = (STEP); \
 		__v.iov_len -= left; \
 		skip = __v.iov_len; \

and I end up seeing overflows ("n" is supposed to be less than PAGE_SIZE)
before the soft lockups and a dead system:

[ 4300.249180][T470195] ITER_IOVEC left = 0, n = 48566423

Thoughts?

Al Viro

Aug 31, 2020, 9:09:34 PM
to Qian Cai, Tetsuo Handa, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Ming Lei, pau...@kernel.org
Er... Where does that size come from? If that's generic_perform_write(),
I'd like to see pos, offset and bytes at the time of call... ->iov_offset would
also be interesting to see (along with the entire iovec array, really).

Ming Lei

Aug 31, 2020, 9:10:17 PM
to Qian Cai, Tetsuo Handa, Al Viro, syzkall...@googlegroups.com, syzbot, Linux FS Devel, Linux Kernel Mailing List, Ming Lei, pau...@kernel.org
Does the following patch make a difference for you?

https://lore.kernel.org/linux-block/20200817100055.2...@redhat.com/

thanks,
Ming Lei

Qian Cai

Aug 31, 2020, 9:48:05 PM
to Al Viro, Tetsuo Handa, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org, Ming Lei, pau...@kernel.org
Yes, generic_perform_write(). I'll see if I can capture more information.

[ 2867.463013][T217919] ITER_IOVEC left = 0, n = 2209
[ 2867.466154][T217971] ITER_IOVEC left = 0, n = 2093279
[ 2867.903689][T217971] ITER_IOVEC left = 0, n = 2093259
[ 2867.928514][T217971] ITER_IOVEC left = 0, n = 2093239
[ 2867.952450][T217971] ITER_IOVEC left = 0, n = 2090980
[ 2867.976585][T217971] ITER_IOVEC left = 0, n = 2090960
[ 2869.219459][T217774] futex_wake_op: trinity-c61 tries to shift op by -1; fix this program
[ 2870.005178][T218110] futex_wake_op: trinity-c9 tries to shift op by -1; fix this program
[ 2870.297607][T218213] futex_wake_op: trinity-c4 tries to shift op by -836; fix this program
[ 2870.338118][T218213] futex_wake_op: trinity-c4 tries to shift op by -836; fix this program
[-- MARK -- Mon Aug 31 18:50:00 2020]
[ 2893.870387][ C62] watchdog: BUG: soft lockup - CPU#62 stuck for 23s! [trinity-c5:218155]
[ 2893.912341][ C62] Modules linked in: nls_ascii nls_cp437 vfat fat kvm_intel kvm irqbypass efivars ip_tables x_tables sd_mod ahci bnx2x libahci mdio libata firmware_class dm_mirror dm_region_hash dm_log dm_mod efivarfs
[ 2894.003787][ C62] irq event stamp: 55378
[ 2894.022398][ C62] hardirqs last enabled at (55377): [<ffffffff95a00c42>] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 2894.070770][ C62] hardirqs last disabled at (55378): [<ffffffff959b111d>] irqentry_enter+0x1d/0x50
[ 2894.112794][ C62] softirqs last enabled at (50602): [<ffffffff95c0070f>] __do_softirq+0x70f/0xa9f
[ 2894.154580][ C62] softirqs last disabled at (49393): [<ffffffff95a00ec2>] asm_call_on_stack+0x12/0x20
[ 2894.197654][ C62] CPU: 62 PID: 218155 Comm: trinity-c5 Not tainted 5.9.0-rc2-next-20200828+ #4
[ 2894.239807][ C62] Hardware name: HP Synergy 480 Gen9/Synergy 480 Gen9 Compute Module, BIOS I37 10/21/2019
[ 2894.284894][ C62] RIP: 0010:iov_iter_copy_from_user_atomic+0x598/0xab0
[ 2894.316037][ C62] Code: 42 0f b6 14 08 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 7c 03 00 00 48 8d 7e 08 8b 46 0c 48 89 fa 48 c1 ea 03 01 e8 <42> 0f b6 14 0a 84 d2 74 09 80 fa 03 0f 8e 0e 03 00 00 44 8b 6e 08
[ 2894.408763][ C62] RSP: 0018:ffffc9000fb7f848 EFLAGS: 00000246
[ 2894.440034][ C62] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff888fce295100
[ 2894.480312][ C62] RDX: 1ffff111f9c52a21 RSI: ffff888fce295100 RDI: ffff888fce295108
[ 2894.517551][ C62] RBP: 0000000000000000 R08: fffff52001f6ffa1 R09: dffffc0000000000
[ 2894.556460][ C62] R10: ffff889055c1f000 R11: 0000000000000400 R12: ffffc9000fb7fcf0
[ 2894.593928][ C62] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000400
[ 2894.630979][ C62] FS: 00007ff89140a740(0000) GS:ffff88905fd80000(0000) knlGS:0000000000000000
[ 2894.673104][ C62] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2894.702914][ C62] CR2: 00007ff8906242fc CR3: 000000101bb76004 CR4: 00000000003706e0
[ 2894.740628][ C62] DR0: 00007ff891250000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2894.778016][ C62] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
[ 2894.815393][ C62] Call Trace:
[ 2894.830531][ C62] ? shmem_write_begin+0xc4/0x1b0
[ 2894.853643][ C62] generic_perform_write+0x2ac/0x410
[ 2894.878266][ C62] ? filemap_check_errors+0xe0/0xe0
[ 2894.901557][ C62] ? file_update_time+0x215/0x380
[ 2894.925234][ C62] ? update_time+0xa0/0xa0
[ 2894.947683][ C62] ? down_write+0xdb/0x150
[ 2894.969843][ C62] __generic_file_write_iter+0x2fe/0x4f0
[ 2894.995527][ C62] ? rcu_read_unlock+0x50/0x50
[ 2895.017683][ C62] generic_file_write_iter+0x2ee/0x520
[ 2895.042981][ C62] ? __generic_file_write_iter+0x4f0/0x4f0
[ 2895.069612][ C62] ? __mutex_lock+0x4af/0x1390
[ 2895.092240][ C62] do_iter_readv_writev+0x388/0x6f0
[ 2895.115794][ C62] ? lockdep_hardirqs_on_prepare+0x33e/0x4e0
[ 2895.143303][ C62] ? default_llseek+0x240/0x240
[ 2895.165522][ C62] ? rcu_read_lock_bh_held+0xc0/0xc0
[ 2895.190122][ C62] do_iter_write+0x130/0x5f0
[ 2895.211226][ C62] iter_file_splice_write+0x54c/0xa40
[ 2895.235917][ C62] ? page_cache_pipe_buf_try_steal+0x1e0/0x1e0
[ 2895.264415][ C62] ? rcu_read_lock_any_held+0xdb/0x100
[ 2895.290249][ C62] do_splice+0x86c/0x1440
[ 2895.310155][ C62] ? syscall_enter_from_user_mode+0x1b/0x230
[ 2895.337531][ C62] ? direct_splice_actor+0x100/0x100
[ 2895.362348][ C62] __x64_sys_splice+0x151/0x200
[ 2895.384677][ C62] do_syscall_64+0x33/0x40
[ 2895.405223][ C62] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 2895.433800][ C62] RIP: 0033:0x7ff890d246ed
[ 2895.456812][ C62] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6b 57 2c 00 f7 d8 64 89 01 48
[ 2895.553070][ C62] RSP: 002b:00007fff8a6acfb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000113
[ 2895.592859][ C62] RAX: ffffffffffffffda RBX: 0000000000000113 RCX: 00007ff890d246ed
[ 2895.630744][ C62] RDX: 000000000000015b RSI: 0000000000000000 RDI: 000000000000016b
[ 2895.668870][ C62] RBP: 0000000000000113 R08: 0000000000000400 R09: 0000000000000000
[ 2895.707052][ C62] R10: 00007ff88f30a000 R11: 0000000000000246 R12: 0000000000000002
[ 2895.744716][ C62] R13: 00007ff8913ee058 R14: 00007ff89140a6c0 R15: 00007ff8913ee000

Qian Cai

Aug 31, 2020, 11:32:37 PM
to Al Viro, Tetsuo Handa, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org, tom.l...@gmail.com, pau...@kernel.org
On Tue, Sep 01, 2020 at 02:09:28AM +0100, Al Viro wrote:
I used a new debug patch, but I am not sure how to capture this without
printk_ratelimited(), because the call sites produce so much output:

diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index d812eef23a32..214b93c3d8a8 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -33,6 +33,7 @@
 		if (unlikely(!__v.iov_len)) \
 			continue; \
 		__v.iov_base = __p->iov_base; \
+		printk_ratelimited("ITER_IOVEC left = %zu, n = %zu, iov_offset = %zu, iov_base = %px, iov_len = %lu\n", left, n, i->iov_offset, __p->iov_base, __p->iov_len); \
 		left = (STEP); \
 		__v.iov_len -= left; \
 		skip = __v.iov_len; \
diff --git a/mm/filemap.c b/mm/filemap.c
index 1aaea26556cc..202b0ab28adf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3408,6 +3408,7 @@ ssize_t generic_perform_write(struct file *file,
 		if (mapping_writably_mapped(mapping))
 			flush_dcache_page(page);
 
+		printk_ratelimited("pos = %lld, offset = %ld, bytes = %ld\n", pos, offset, bytes);
 		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
 		flush_dcache_page(page);

Al, does it help? If not, can you cook a debug patch instead?

[ 587.765400][T21348] pos = 468152746240, offset = 256, bytes = 3840
[ 587.765424][T21348] pos = 468152750080, offset = 0, bytes = 4096
[ 587.765439][T21348] pos = 468152754176, offset = 0, bytes = 4096
[ 587.765455][T21348] pos = 468152758272, offset = 0, bytes = 4096
[ 591.235409][T22038] ITER_IOVEC left = 0, n = 22476968, iov_offset = 0, iov_base = 00007ff12570c000, iov_len = 20
[ 591.313456][T22038] ITER_IOVEC left = 0, n = 22476948, iov_offset = 0, iov_base = 00007ff123250000, iov_len = 20
[ 591.363679][T22038] ITER_IOVEC left = 0, n = 22476928, iov_offset = 0, iov_base = 00007ff123650000, iov_len = 6370
[ 591.413217][T22038] ITER_IOVEC left = 0, n = 22470558, iov_offset = 0, iov_base = 00007ff123850000, iov_len = 20
[ 591.459754][T22038] ITER_IOVEC left = 0, n = 22470538, iov_offset = 0, iov_base = 00007ff12570c000, iov_len = 376
[ 591.507748][T22038] ITER_IOVEC left = 0, n = 22470162, iov_offset = 0, iov_base = 00007ff12570c000, iov_len = 3473
[ 591.557395][T22038] ITER_IOVEC left = 0, n = 22466689, iov_offset = 0, iov_base = 00007ff123250000, iov_len = 20
[ 591.605295][T22038] ITER_IOVEC left = 0, n = 22466669, iov_offset = 0, iov_base = 00007ff12570a000, iov_len = 3392
[ 591.653045][T22038] ITER_IOVEC left = 0, n = 22463277, iov_offset = 0, iov_base = 00007ff123250000, iov_len = 1329038
[ 591.705324][T22038] ITER_IOVEC left = 0, n = 21134239, iov_offset = 0, iov_base = 00007ff123850000, iov_len = 20

Al Viro

Sep 1, 2020, 12:08:01 AM
to Qian Cai, Tetsuo Handa, syzkall...@googlegroups.com, syzbot, linux-...@vger.kernel.org, linux-...@vger.kernel.org, tom.l...@gmail.com, pau...@kernel.org
On Mon, Aug 31, 2020 at 11:32:28PM -0400, Qian Cai wrote:

> I used a new debug patch, but I am not sure how to capture this without
> printk_ratelimited(), because the call sites produce so much output:

	if (!strcmp(current->comm, "bugger"))
		printk(KERN_ERR....
and call the binary you are running ./bugger... And I'd slap such
printk into the beginning of iterate_iovec() as well, if not into
the entry of iov_iter_copy_from_user_atomic(). That BS value of
n must've come from somewhere; it should expand to 'bytes'.
What we have in the beginning is

	const struct iovec *iov;
	struct iovec v;
	size_t skip = i->iov_offset;
	size_t left;
	size_t wanted = bytes;

	iov = i->iov;

	__v.iov_len = min(bytes, iov->iov_len - skip);
	if (likely(__v.iov_len)) {
		__v.iov_base = iov->iov_base + skip;
		left = copyin((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len);
		__v.iov_len -= left;
		skip += __v.iov_len;
		bytes -= __v.iov_len;
	} else {
		left = 0;
	}

and something leaves you with bytes bumped to 22476968. What was in that first
iovec? Incidentally, what's in 'wanted'? And... the timestamps don't look like
that crap came from generic_perform_write() - it's about 4 seconds later.

While we are at it, there are other users of iterate_all_kinds(), and some of
them can very well get large sizes; they are not copying anything
(iov_iter_advance(), for starters). There, that kind of value would be just
fine; are you sure those printks came from iov_iter_copy_from_user_atomic()?

Qian Cai

Sep 1, 2020, 10:23:02 AM
to Ming Lei, Tetsuo Handa, Al Viro, syzkall...@googlegroups.com, syzbot, Linux FS Devel, Linux Kernel Mailing List, Ming Lei, pau...@kernel.org
Yes, it does. I am no longer able to reproduce the soft lockups with the
patch applied.