We found a WARNING in mark_buffer_dirty() in fs/buffer.c while fuzzing drivers/block/nbd.c with our modified Syzkaller.
The warning means that a buffer_head was dirtied after it had lost the BH_Uptodate bit:
NBD_CLEAR_SOCK raced with a buffered block-device write and cleared BH_Uptodate on the same buffer_head.
Kernel version: commit 8cd9520d35a6c38db6567e97dd93b1f11f185dc6 (tag v7.1).
The bug is also possible on current mainline.
Relevant kernel config: (the complete config is included in the attachments.)
The relevant locking domains do not match. blkdev_write_iter() runs the
buffered write path under inode_lock_shared(bd_inode), while NBD_CLEAR_SOCK is
handled under nbd->config_lock. These two locks do not exclude each other, so
NBD_CLEAR_SOCK can complete an in-flight NBD write request with BLK_STS_IOERR
and clear BH_Uptodate on the same bh while block_commit_write() is about to
dirty that bh.
Race timeline:

CPU0                                            CPU1
====                                            ====
write(/dev/nbdX)
--> blkdev_write_iter
  --> bd_inode = bdev_file_inode(file)
  --> inode_lock_shared(bd_inode)
  --> blkdev_buffered_write
    --> iomap_file_buffered_write
      --> iomap_write_end
        --> block_write_end
          --> block_commit_write
            --> bh = head = folio_buffers(folio)
            --> set_buffer_uptodate(bh) /* BH_Uptodate is set here */
                                                ioctl(NBD_CLEAR_SOCK), under nbd->config_lock
                                                --> in-flight NBD write completes with BLK_STS_IOERR
                                                --> BH_Uptodate is cleared on the same bh
CPU0 continues block_commit_write() with the same bh
            --> mark_buffer_dirty(bh)
              --> WARN_ON_ONCE(!buffer_uptodate(bh)) /* WARNING fires here */
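
To illustrate why the two locks do not help, the interleaving can be replayed in a
small userspace model. This is only a sketch under stated assumptions: struct model_bh,
the two mutexes and replay_race() are hypothetical stand-ins and not kernel code, plain
mutexes approximate inode_lock_shared() and nbd->config_lock, and the semaphores force
the ordering in the same way the instrumented delay does.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a buffer_head; not kernel code. */
struct model_bh {
    bool uptodate;
    bool warned;
};

static struct model_bh bh;
/* Stand-ins for inode_lock_shared(bd_inode) and nbd->config_lock. */
static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t config_lock = PTHREAD_MUTEX_INITIALIZER;
/* The semaphores force the interleaving, like the instrumented delay. */
static sem_t window_open, bit_cleared;

/* CPU0: buffered write path, models block_commit_write(). */
static void *writer_path(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&inode_lock);
    bh.uptodate = true;         /* set_buffer_uptodate(bh) */
    sem_post(&window_open);     /* race window opens */
    sem_wait(&bit_cleared);     /* models the injected sleep */
    if (!bh.uptodate)           /* check made by mark_buffer_dirty(bh) */
        bh.warned = true;
    pthread_mutex_unlock(&inode_lock);
    return NULL;
}

/* CPU1: NBD_CLEAR_SOCK path failing the in-flight write. */
static void *clear_sock_path(void *arg)
{
    (void)arg;
    sem_wait(&window_open);
    pthread_mutex_lock(&config_lock);   /* succeeds while inode_lock is held */
    bh.uptodate = false;                /* I/O error clears BH_Uptodate */
    pthread_mutex_unlock(&config_lock);
    sem_post(&bit_cleared);
    return NULL;
}

/* Returns 1 if the modeled warning fired, 0 otherwise. */
int replay_race(void)
{
    pthread_t w, c;
    int rc;

    bh.uptodate = false;
    bh.warned = false;
    rc = sem_init(&window_open, 0, 0);
    rc |= sem_init(&bit_cleared, 0, 0);
    rc |= pthread_create(&w, NULL, writer_path, NULL);
    rc |= pthread_create(&c, NULL, clear_sock_path, NULL);
    assert(rc == 0);    /* setup failures are not expected in this sketch */
    (void)rc;

    pthread_join(w, NULL);
    pthread_join(c, NULL);
    sem_destroy(&window_open);
    sem_destroy(&bit_cleared);
    return bh.warned ? 1 : 0;
}
```

Calling replay_race() from a small main() (built with -pthread) shows the point of the
model: the clearing thread takes its own lock without ever waiting for the writer, so
holding inode_lock gives the writer no protection for the bit it has just set.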
===
PoC
===
The kernel instrumentation patch, C PoC and helper script are included in the attachments.
Usage:
1. Apply the patch to the v7.1 kernel, enable the relevant kernel config (see above) and compile the kernel.
2. Set the KERNEL, IMAGE and SSH_KEY environment variables to local paths and run run_warning_repro.sh.
The script automatically compiles the C PoC, boots the kernel with QEMU, runs the PoC in the guest and
checks for the WARNING message.
Details:
Kernel instrumentation patch: the tested kernel should be instrumented to enlarge the race window.
The patch adds a boot parameter, nbd_block_commit_delay_ms=, and makes block_commit_write()
sleep for the requested time only when the current buffer_head belongs to an NBD block device.
The delay is inserted immediately after set_buffer_uptodate(bh) and immediately before mark_buffer_dirty(bh),
so NBD_CLEAR_SOCK has a stable window to complete an in-flight NBD write with BLK_STS_IOERR and clear
BH_Uptodate before mark_buffer_dirty() checks it.
C PoC: the C reproducer configures /dev/nbd0 through the legacy NBD ioctl interface
and uses a socketpair-backed userspace NBD server.
The userspace backend replies to READ requests with zeroes, but intentionally drains and stalls the
first WRITE request without sending an NBD reply.
The main thread first writes block 0 and starts fsync(), which leaves that first NBD WRITE in flight.
After the backend confirms that the WRITE is stalled, the PoC starts a second buffered pwrite()
to the same block and then issues NBD_CLEAR_SOCK from another NBD file descriptor.
NBD_CLEAR_SOCK completes the stalled writeback request with BLK_STS_IOERR
while the second buffered write is inside the instrumented block_commit_write() window,
which makes mark_buffer_dirty() observe the lost BH_Uptodate bit and trigger the warning.
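
For readers without the attachments, the backend behaviour described above (answer READ
with zeroes, drain a WRITE and never reply) can be sketched as follows. This is not the
attached PoC: the sk_ names, backend_handle_one() and backend_selfcheck() are hypothetical,
the structs only mirror the NBD request/reply header layout, and the self-check plays the
kernel's role over a socketpair instead of attaching /dev/nbd0.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Values mirror the NBD wire protocol; the sk_ names are hypothetical. */
#define SK_REQUEST_MAGIC 0x25609513u
#define SK_REPLY_MAGIC   0x67446698u
#define SK_CMD_READ      0u
#define SK_CMD_WRITE     1u

struct sk_request {
    uint32_t magic, type;       /* big-endian on the wire */
    char handle[8];
    uint64_t from;
    uint32_t len;
} __attribute__((packed));

struct sk_reply {
    uint32_t magic, error;
    char handle[8];
} __attribute__((packed));

static_assert(sizeof(struct sk_request) == 28, "NBD request header is 28 bytes");

/* Read exactly n bytes; 0 on success, -1 on error or EOF. */
static int read_full(int fd, void *buf, size_t n)
{
    char *p = buf;
    while (n > 0) {
        ssize_t r = read(fd, p, n);
        if (r <= 0)
            return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Write exactly n bytes; 0 on success, -1 on error. */
static int write_full(int fd, const void *buf, size_t n)
{
    const char *p = buf;
    while (n > 0) {
        ssize_t r = write(fd, p, n);
        if (r <= 0)
            return -1;
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Handle one request: 0 = READ answered, 1 = WRITE drained and stalled, -1 = error. */
int backend_handle_one(int fd)
{
    struct sk_request req;
    struct sk_reply rep;
    char buf[4096];
    uint32_t cmd, len;

    if (read_full(fd, &req, sizeof(req)) < 0 || ntohl(req.magic) != SK_REQUEST_MAGIC)
        return -1;
    cmd = ntohl(req.type) & 0xffffu;    /* upper 16 bits carry command flags */
    len = ntohl(req.len);
    if (cmd != SK_CMD_READ && cmd != SK_CMD_WRITE)
        return -1;

    rep.magic = htonl(SK_REPLY_MAGIC);
    rep.error = 0;
    memcpy(rep.handle, req.handle, sizeof(rep.handle));
    if (cmd == SK_CMD_READ && write_full(fd, &rep, sizeof(rep)) < 0)
        return -1;
    memset(buf, 0, sizeof(buf));
    while (len > 0) {
        size_t chunk = len < sizeof(buf) ? len : sizeof(buf);
        /* READ: send zeroes. WRITE: drain the payload and send nothing back. */
        if ((cmd == SK_CMD_READ ? write_full(fd, buf, chunk)
                                : read_full(fd, buf, chunk)) < 0)
            return -1;
        len -= (uint32_t)chunk;
    }
    return cmd == SK_CMD_WRITE ? 1 : 0;
}

/* Self-check over a socketpair; returns 0 if the backend behaves as described. */
int backend_selfcheck(void)
{
    struct sk_request req = { .magic = htonl(SK_REQUEST_MAGIC), .len = htonl(512) };
    struct sk_reply rep;
    char data[512];
    int sv[2], ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    memset(data, 0xaa, sizeof(data));
    req.type = htonl(SK_CMD_READ);
    ok = write_full(sv[0], &req, sizeof(req)) == 0 && backend_handle_one(sv[1]) == 0 &&
         read_full(sv[0], &rep, sizeof(rep)) == 0 && ntohl(rep.magic) == SK_REPLY_MAGIC &&
         read_full(sv[0], data, sizeof(data)) == 0 && data[0] == 0 && data[511] == 0;
    req.type = htonl(SK_CMD_WRITE);
    ok = ok && write_full(sv[0], &req, sizeof(req)) == 0 &&
         write_full(sv[0], data, sizeof(data)) == 0 && backend_handle_one(sv[1]) == 1 &&
         recv(sv[0], &rep, sizeof(rep), MSG_DONTWAIT) < 0;  /* no reply is pending */
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

In a real reproducer the server end of the socketpair would be served by a loop around a
function like backend_handle_one() in its own thread, with the other end handed to the
kernel through the legacy NBD ioctl interface; a return value of 1 is the point at which
the backend would signal the main thread that the first WRITE is stalled.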