[BUG] nbd: NBD_CLEAR_SOCK races with buffered writes and triggers mark_buffer_dirty() warning

10 views
Skip to first unread message

Peiyang He

unread,
Jun 28, 2026, 1:51:47 AM (yesterday) Jun 28
to jo...@toxicpanda.com, ax...@kernel.dk, linux-...@vger.kernel.org, n...@other.debian.org, linux...@vger.kernel.org, syzk...@googlegroups.com, linux-...@vger.kernel.org, vi...@zeniv.linux.org.uk, bra...@kernel.org, ja...@suse.cz
Hello Linux kernel developers and maintainers,

We found a WARNING in mark_buffer_dirty() in fs/buffer.c when fuzzing drivers/block/nbd.c with our modified Syzkaller.

The warning means a buffer_head was being dirtied after losing the BH_Uptodate bit,
because NBD_CLEAR_SOCK RACED with buffered block-device writes and removed the BH_Uptodate bit.

Kernel version: commit 8cd9520d35a6c38db6567e97dd93b1f11f185dc6 (tag v7.1).
And the bug is also possible in the current mainline.

Relevant kernel config: (the complete config is included in the attachments.)

CONFIG_BLOCK=y
CONFIG_BLK_DEV_NBD=y
CONFIG_FS_IOMAP=y
CONFIG_BUFFER_HEAD=y
CONFIG_BUG=y

=============================
The original Syzkaller report
=============================

------------[ cut here ]------------
WARNING: fs/buffer.c:1087 at mark_buffer_dirty+0x273/0x4c0 fs/buffer.c:1087, CPU#1: syz.1.1229/17574
Modules linked in:
CPU: 1 UID: 0 PID: 17574 Comm: syz.1.1229 Not tainted 7.1.0 #6 PREEMPT(lazy)
Hardware name: QEMU Ubuntu 24.04 PC v2 (i440FX + PIIX, arch_caps fix, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:mark_buffer_dirty+0x273/0x4c0 fs/buffer.c:1087
Code: 4c 8b 33 48 89 df e8 6c a5 c3 ff 48 83 38 00 0f 85 d6 00 00 00 4c 89 f7 be 40 00 00 00 e8 b5 c5 f8 ff eb 24 e8 1e 63 17 ff 90 <0f> 0b 90 e9 ec fd ff ff 44 89 e7 e8 4d b0 c3 ff 4d 85 ff 0f 84 95
RSP: 0018:ffff88807efbb800 EFLAGS: 00010287
RAX: ffffffff82d88f62 RBX: ffff888074bdad58 RCX: 0000000000080000
RDX: ffffc900089cc000 RSI: 000000000000c6b0 RDI: 000000000000c6b1
RBP: ffff88807efbb838 R08: ffffea000000000f R09: 0000000000000000
R10: ffff8880466f4068 R11: 0000000000000002 R12: 0000000000000001
R13: 0000000000000000 R14: ffff888046ef40d0 R15: 0000000000000000
FS: 00007f6c4c0c96c0(0000) GS:ffff8881aa7fc000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb615f29ff8 CR3: 000000006df0f000 CR4: 0000000000350ef0
Call Trace:
<TASK>
block_commit_write fs/buffer.c:2115 [inline]
block_write_end+0x393/0x6f0 fs/buffer.c:2191
iomap_write_end+0x4d4/0xc50 fs/iomap/buffered-io.c:1091
iomap_write_iter fs/iomap/buffered-io.c:1159 [inline]
iomap_file_buffered_write+0xbfd/0x1d20 fs/iomap/buffered-io.c:1225
blkdev_buffered_write block/fops.c:735 [inline]
blkdev_write_iter+0x92c/0xd10 block/fops.c:801
new_sync_write fs/read_write.c:595 [inline]
vfs_write+0xb5c/0x1550 fs/read_write.c:688
ksys_write+0x23c/0x490 fs/read_write.c:740
__do_sys_write fs/read_write.c:751 [inline]
__se_sys_write fs/read_write.c:748 [inline]
__x64_sys_write+0x97/0xf0 fs/read_write.c:748
x64_sys_call+0x2ff0/0x3ea0 arch/x86/include/generated/asm/syscalls_64.h:2
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15c/0x3c0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f6c4b1a788d
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f6c4c0c9018 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00007f6c4b435fa0 RCX: 00007f6c4b1a788d
RDX: 00000000fffffc53 RSI: 0000200000001100 RDI: 0000000000000009
RBP: 00007f6c4b24e9cf R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f6c4b436038 R14: 00007f6c4b435fa0 R15: 00007ffc2d3d7490
</TASK>

<<<<<<<<<<<<<<< tail report >>>>>>>>>>>>>>>

==========
Root cause
==========

The relevant locking domains do not match. blkdev_write_iter() runs the
buffered write path under inode_lock_shared(bd_inode), while NBD_CLEAR_SOCK is
handled under nbd->config_lock. These two locks do not exclude each other, so
NBD_CLEAR_SOCK can complete an in-flight NBD write request with BLK_STS_IOERR
and clear BH_Uptodate on the same bh while block_commit_write() is about to
dirty that bh.

Race timeline:

CPU0 CPU1
==== ====

write(/dev/nbdX)
--> blkdev_write_iter
--> bd_inode = bdev_file_inode(file)
--> inode_lock_shared(bd_inode)
--> blkdev_buffered_write
--> iomap_file_buffered_write
--> iomap_write_end
--> block_write_end
--> block_commit_write
--> bh = head = folio_buffers(folio)
--> set_buffer_uptodate(bh) /* BH_Uptodate is set here */

ioctl(/dev/nbdX, NBD_CLEAR_SOCK)
--> nbd_ioctl
--> mutex_lock(&nbd->config_lock)
--> __nbd_ioctl
--> nbd_clear_sock_ioctl
--> nbd_clear_sock
--> nbd_clear_que
--> nbd_clear_req(req, NULL)
--> cmd->status = BLK_STS_IOERR
--> blk_mq_complete_request(req)
--> nbd_complete_rq
--> blk_mq_end_request(req, cmd->status)
--> bio_endio(bio)
--> end_bio_bh_io_sync
--> end_buffer_async_write(bh, uptodate = 0)
--> clear_buffer_uptodate(bh) /* BUGGY: BH_Uptodate is lost here */
--> mutex_unlock(&nbd->config_lock)

CPU0 continues block_commit_write() with the same bh
--> mark_buffer_dirty(bh)
--> WARN_ON_ONCE(!buffer_uptodate(bh)) /* BUGGY here */


===
PoC
===

The kernel instrumentation patch, C PoC and helper script are included in the attachments.

Usage:

1. apply the patch to the v7.1 kernel, enable relevant kernel config (see above) and compile the kernel.
2. Override the KERNEL, IMAGE, SSH_KEY environments with local paths and just run run_warning_repro.sh.
The script will automatically compile the C PoC, boot the kernel with QEMU, run the PoC in the guest and
check for the WARNING message.

Details:

Kernel instrumentation patch: the tested kernel should be instrumented to enlarge the race window.
The patch adds a boot parameter, nbd_block_commit_delay_ms=, and makes block_commit_write()
sleep for the requested time only when the current buffer_head belongs to an NBD block device.
The delay is inserted immediately after set_buffer_uptodate(bh) and immediately before mark_buffer_dirty(bh),
so NBD_CLEAR_SOCK has a stable window to complete an in-flight NBD write with BLK_STS_IOERR and clear
BH_Uptodate before mark_buffer_dirty() checks it.

C PoC: the C reproducer configures /dev/nbd0 through the legacy NBD ioctl interface
and uses a socketpair-backed userspace NBD server.
The userspace backend replies to READ requests with zeroes, but intentionally drains and stalls the
first WRITE request without sending an NBD reply.
The main thread first writes block 0 and starts fsync(), which leaves that first NBD WRITE in flight.
After the backend confirms that the WRITE is stalled, the PoC starts a second buffered pwrite()
to the same block and then issues NBD_CLEAR_SOCK from another NBD file descriptor.
NBD_CLEAR_SOCK completes the stalled writeback request with BLK_STS_IOERR
while the second buffered write is inside the instrumented block_commit_write() window,
which makes mark_buffer_dirty() observe the lost BH_Uptodate bit and trigger the warning.


Best,
Peiyang
poc_warning_block_write_end.c
.config
nbd_block_write_end_race_window.patch
run_warning_repro.sh
Reply all
Reply to author
Forward
0 new messages