[syzbot] [xfs?] INFO: task hung in xfs_ail_push_all_sync


syzbot

Sep 21, 2024, 8:30:26 AM
to chanda...@oracle.com, djw...@kernel.org, linux-...@vger.kernel.org, linu...@vger.kernel.org, syzkall...@googlegroups.com
Hello,

syzbot found the following issue on:

HEAD commit: a430d95c5efa Merge tag 'lsm-pr-20240911' of git://git.kern..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=15569d00580000
kernel config: https://syzkaller.appspot.com/x/.config?x=d9ab5893ec5191eb
dashboard link: https://syzkaller.appspot.com/bug?extid=bf0143df12877cb1a21b
compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=109e6a77980000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/f8549392ace4/disk-a430d95c.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/9b866ef0c06c/vmlinux-a430d95c.xz
kernel image: https://storage.googleapis.com/syzbot-assets/9d6691641029/bzImage-a430d95c.xz
mounted in repro: https://storage.googleapis.com/syzbot-assets/dd50a033ca89/mount_2.gz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+bf0143...@syzkaller.appspotmail.com

INFO: task syz.0.75:6468 blocked for more than 143 seconds.
Not tainted 6.11.0-syzkaller-02574-ga430d95c5efa #0
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:syz.0.75 state:D stack:24864 pid:6468 tgid:6446 ppid:6378 flags:0x00004004
Call Trace:
<TASK>
context_switch kernel/sched/core.c:5188 [inline]
__schedule+0x17ae/0x4a10 kernel/sched/core.c:6529
__schedule_loop kernel/sched/core.c:6606 [inline]
schedule+0x14b/0x320 kernel/sched/core.c:6621
xfs_ail_push_all_sync+0x236/0x310 fs/xfs/xfs_trans_ail.c:726
xfs_log_quiesce+0xdf/0x5b0 fs/xfs/xfs_log.c:1018
xfs_fs_freeze+0x8d/0x1a0 fs/xfs/xfs_super.c:936
freeze_super+0x81b/0xee0 fs/super.c:2107
fs_bdev_freeze+0x1ac/0x320 fs/super.c:1484
bdev_freeze+0xd6/0x220 block/bdev.c:257
xfs_fs_goingdown+0xa9/0x160 fs/xfs/xfs_fsops.c:446
xfs_file_ioctl+0x12d4/0x19e0 fs/xfs/xfs_ioctl.c:1473
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:907 [inline]
__se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fc862d7def9
RSP: 002b:00007fc863b4a038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fc862f36058 RCX: 00007fc862d7def9
RDX: 0000000020000080 RSI: 000000008004587d RDI: 0000000000000006
RBP: 00007fc862df0b76 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 00007fc862f36058 R15: 00007ffd1aceae18
</TASK>

Showing all locks held in the system:
1 lock held by khungtaskd/30:
#0: ffffffff8e738660 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:326 [inline]
#0: ffffffff8e738660 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
#0: ffffffff8e738660 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x55/0x2a0 kernel/locking/lockdep.c:6626
2 locks held by getty/4972:
#0: ffff888034b890a0 (&tty->ldisc_sem){++++}-{0:0}, at: tty_ldisc_ref_wait+0x25/0x70 drivers/tty/tty_ldisc.c:243
#1: ffffc9000311b2f0 (&ldata->atomic_read_lock){+.+.}-{3:3}, at: n_tty_read+0x6a6/0x1e00 drivers/tty/n_tty.c:2211
5 locks held by syz.0.75/6468:
#0: ffff888022c9c6b0 (&bdev->bd_fsfreeze_mutex){+.+.}-{3:3}, at: bdev_freeze+0x2a/0x220 block/bdev.c:248
#1: ffff888031aac420 (sb_writers#12){++++}-{0:0}, at: sb_wait_write fs/super.c:1896 [inline]
#1: ffff888031aac420 (sb_writers#12){++++}-{0:0}, at: freeze_super+0x4e9/0xee0 fs/super.c:2085
#2: ffff888031aac0e0 (&type->s_umount_key#53){+.+.}-{3:3}, at: __super_lock fs/super.c:56 [inline]
#2: ffff888031aac0e0 (&type->s_umount_key#53){+.+.}-{3:3}, at: __super_lock_excl fs/super.c:71 [inline]
#2: ffff888031aac0e0 (&type->s_umount_key#53){+.+.}-{3:3}, at: freeze_super+0x4f1/0xee0 fs/super.c:2086
#3: ffff888031aac518 (sb_pagefaults#2){+.+.}-{0:0}, at: sb_wait_write fs/super.c:1896 [inline]
#3: ffff888031aac518 (sb_pagefaults#2){+.+.}-{0:0}, at: freeze_super+0x519/0xee0 fs/super.c:2090
#4: ffff888031aac610 (sb_internal#2){++++}-{0:0}, at: sb_wait_write fs/super.c:1896 [inline]
#4: ffff888031aac610 (sb_internal#2){++++}-{0:0}, at: freeze_super+0x7cc/0xee0 fs/super.c:2104
1 lock held by syz-executor/6470:
#0: ffffffff8e73da38 (rcu_state.exp_mutex){+.+.}-{3:3}, at: exp_funnel_lock kernel/rcu/tree_exp.h:296 [inline]
#0: ffffffff8e73da38 (rcu_state.exp_mutex){+.+.}-{3:3}, at: synchronize_rcu_expedited+0x381/0x830 kernel/rcu/tree_exp.h:958

=============================================

NMI backtrace for cpu 0
CPU: 0 UID: 0 PID: 30 Comm: khungtaskd Not tainted 6.11.0-syzkaller-02574-ga430d95c5efa #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:93 [inline]
dump_stack_lvl+0x241/0x360 lib/dump_stack.c:119
nmi_cpu_backtrace+0x49c/0x4d0 lib/nmi_backtrace.c:113
nmi_trigger_cpumask_backtrace+0x198/0x320 lib/nmi_backtrace.c:62
trigger_all_cpu_backtrace include/linux/nmi.h:162 [inline]
check_hung_uninterruptible_tasks kernel/hung_task.c:223 [inline]
watchdog+0xff4/0x1040 kernel/hung_task.c:379
kthread+0x2f0/0x390 kernel/kthread.c:389
ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 UID: 0 PID: 4661 Comm: klogd Not tainted 6.11.0-syzkaller-02574-ga430d95c5efa #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/06/2024
RIP: 0010:filter_irq_stacks+0x4f/0xa0 kernel/stacktrace.c:397
Code: f0 84 bd 8b 48 89 fb eb 0c 49 ff c7 48 83 c3 08 4d 39 fc 74 4f 48 89 d8 48 c1 e8 03 48 b9 00 00 00 00 00 fc ff df 80 3c 08 00 <74> 08 48 89 df e8 e7 6a 78 00 4b 8b 04 fe 48 39 e8 72 0c 48 c7 c1
RSP: 0018:ffffc90003137418 EFLAGS: 00000246
RAX: 1ffff92000626e96 RBX: ffffc900031374b0 RCX: dffffc0000000000
RDX: 0000000000400cc0 RSI: 000000000000000c RDI: ffffc900031374b0
RBP: ffffffff8bc00230 R08: ffffffff81378892 R09: 1ffffffff1ff501d
R10: dffffc0000000000 R11: fffffbfff1ff501e R12: 000000000000000c
R13: ffffffff8bbd84f0 R14: ffffc900031374b0 R15: 0000000000000000
FS: 00007f80f557f380(0000) GS:ffff8880b8900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37b7714000 CR3: 0000000063a2e000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<NMI>
</NMI>
<TASK>
stack_depot_save_flags+0x29/0x830 lib/stackdepot.c:609
kasan_save_stack mm/kasan/common.c:48 [inline]
kasan_save_track+0x51/0x80 mm/kasan/common.c:68
unpoison_slab_object mm/kasan/common.c:312 [inline]
__kasan_slab_alloc+0x66/0x80 mm/kasan/common.c:338
kasan_slab_alloc include/linux/kasan.h:201 [inline]
slab_post_alloc_hook mm/slub.c:3989 [inline]
slab_alloc_node mm/slub.c:4038 [inline]
kmem_cache_alloc_node_noprof+0x16b/0x320 mm/slub.c:4081
__alloc_skb+0x1c3/0x440 net/core/skbuff.c:668
alloc_skb include/linux/skbuff.h:1322 [inline]
alloc_skb_with_frags+0xc3/0x820 net/core/skbuff.c:6612
sock_alloc_send_pskb+0x91a/0xa60 net/core/sock.c:2883
unix_dgram_sendmsg+0x6d3/0x1f80 net/unix/af_unix.c:2027
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg+0x221/0x270 net/socket.c:745
__sys_sendto+0x398/0x4f0 net/socket.c:2210
__do_sys_sendto net/socket.c:2222 [inline]
__se_sys_sendto net/socket.c:2218 [inline]
__x64_sys_sendto+0xde/0x100 net/socket.c:2218
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f80f56e19b5
Code: 8b 44 24 08 48 83 c4 28 48 98 c3 48 98 c3 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 26 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 76 7a 48 8b 15 44 c4 0c 00 f7 d8 64 89 02 48 83
RSP: 002b:00007ffe65f0c128 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f80f56e19b5
RDX: 000000000000004f RSI: 000055b41fbcd6b0 RDI: 0000000000000003
RBP: 000055b41fbc8910 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000004000 R11: 0000000000000246 R12: 0000000000000013
R13: 00007f80f586f212 R14: 00007ffe65f0c228 R15: 0000000000000000
</TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

Dave Chinner

Sep 22, 2024, 7:15:24 PM
to syzbot, chanda...@oracle.com, djw...@kernel.org, linux-...@vger.kernel.org, linu...@vger.kernel.org, syzkall...@googlegroups.com
Not a bug.

The reproducer is mounting the filesystem, then some other thread
is immediately resizing the block device underneath the filesystem
to zero length.
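For readers following along, the sequence described above can be sketched roughly as the following shell commands (file names, sizes, and the mount point are made up for illustration and are not from the syz reproducer; requires root and loop device support):

```shell
# Hypothetical sketch of what the reproducer effectively does.
truncate -s 16M xfs.img                  # backing file for the loop device
mkfs.xfs -q xfs.img                      # put an XFS filesystem on it
LOOP=$(losetup --find --show xfs.img)    # attach it, e.g. /dev/loop0
mount "$LOOP" /mnt

# Meanwhile, another thread resizes the backing file to zero length
# underneath the mounted filesystem and tells the loop driver about it:
truncate -s 0 xfs.img
losetup --set-capacity "$LOOP"           # "detected capacity change from 32768 to 0"

# Any subsequent IO from the filesystem now fails with
# "attempt to access beyond end of device", as in the logs below.
```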

In the cases where the test "passes", this happens:

[ 215.851387][ T6299] XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
[ 215.895695][ T6299] XFS (loop0): Ending clean mount
[ 215.953257][ T6299] loop0: detected capacity change from 32768 to 0
[ 215.963786][ T6299] syz.0.70: attempt to access beyond end of device
[ 215.963786][ T6299] loop0: rw=432129, sector=768, nr_sectors = 128 limit=0
[ 215.978206][ T45] XFS (loop0): log I/O error -5
[ 215.984613][ T45] XFS (loop0): Filesystem has been shut down due to log error (0x2).
[ 215.992720][ T45] XFS (loop0): Please unmount the filesystem and rectify the problem(s).
[ 216.011609][ T6230] XFS (loop0): Unmounting Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791

The loop device change of size happens about 600ms after the mount
finishes, and the next IO that is issued is a log IO. That gets
errored out and the filesystem shuts itself down.

In the case where it has hung:

[ 231.577098][ T6447] loop0: detected capacity change from 0 to 32768
[ 231.590389][ T6447] XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
[ 231.642021][ T6447] XFS (loop0): Ending clean mount
[ 231.713195][ T6447] loop0: detected capacity change from 32768 to 0
[ 231.727969][ T6467] xfsaild/loop0: attempt to access beyond end of device
[ 231.727969][ T6467] loop0: rw=4097, sector=2, nr_sectors = 1 limit=0
[ 231.742236][ T6467] xfsaild/loop0: attempt to access beyond end of device
[ 231.742236][ T6467] loop0: rw=4097, sector=12, nr_sectors = 4 limit=0
....

The loop dev changes size about 700ms after the mount, and the log IO
has already completed. Hence the next IOs are from the user-driven
shutdown (not sure where that is coming from - nothing in the syzbot
logs, and no C reproducer was provided) freezing the device before
shutting down the filesystem. These error out, but don't cause a
shutdown because the default XFS config for async metadata writeback
failures is "retry forever":

.....
[ 231.849797][ T52] kworker/1:1: attempt to access beyond end of device
[ 231.849797][ T52] loop0: rw=4097, sector=18560, nr_sectors = 64 limit=0
[ 231.933061][ T6467] XFS (loop0): Failing async write on buffer block 0x2. Retrying async write.
[ 231.942187][ T6467] XFS (loop0): Failing async write on buffer block 0x10. Retrying async write.
[ 231.951196][ T6467] XFS (loop0): Failing async write on buffer block 0xc. Retrying async write.
[ 231.960108][ T6467] xfsaild/loop0: attempt to access beyond end of device
[ 231.960108][ T6467] loop0: rw=4097, sector=2, nr_sectors = 1 limit=0
[ 231.973536][ T6467] xfsaild/loop0: attempt to access beyond end of device
[ 231.973536][ T6467] loop0: rw=4097, sector=12, nr_sectors = 4 limit=0
[ 232.053094][ T6467] XFS (loop0): Failing async write on buffer block 0x2. Retrying async write.
[ 232.061978][ T6467] XFS (loop0): Failing async write on buffer block 0x10. Retrying async write.
[ 232.070960][ T6467] XFS (loop0): Failing async write on buffer block 0xc. Retrying async write.
[ 232.143152][ T6467] XFS (loop0): Failing async write on buffer block 0x2. Retrying async write.
[ 232.152027][ T6467] XFS (loop0): Failing async write on buffer block 0x10. Retrying async write.
.....

And so the user driven shutdown hangs at this point waiting for
metadata to flush.

This is expected behaviour. If the user wants metadata writes to
fail immediately in this situation, then they need to do:

# echo 1 > /sys/fs/xfs/<dev>/error/metadata/EIO/max_retries

So on IO error the writeback code will only retry once before
failing and shutting down the filesystem immediately.
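As a sketch of the knobs involved (device name loop0 assumed for illustration; the exact sysfs layout depends on the kernel config, so treat this as an example rather than a recipe):

```shell
# Per-device XFS metadata error configuration lives under /sys/fs/xfs.
# A negative max_retries means "retry forever" (the default behaviour
# described above).
cat /sys/fs/xfs/loop0/error/metadata/EIO/max_retries

# Fail after a single retry, shutting the filesystem down on persistent EIO:
echo 1 > /sys/fs/xfs/loop0/error/metadata/EIO/max_retries

# Alternatively, bound the retries by time instead of count:
echo 5 > /sys/fs/xfs/loop0/error/metadata/EIO/retry_timeout_seconds
```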

So, no bug here, just syzbot not understanding how metadata IO
errors are handled by different filesystems. Hence:

#syz invalid

-Dave.
--
Dave Chinner
da...@fromorbit.com