[syzbot] KASAN: use-after-free Read in srcu_invoke_callbacks


syzbot

Jan 11, 2022, 6:03:19 AM
to and...@kernel.org, a...@kernel.org, ax...@kernel.dk, b...@vger.kernel.org, dan...@iogearbox.net, john.fa...@gmail.com, ka...@fb.com, kps...@kernel.org, linux...@vger.kernel.org, linux-...@vger.kernel.org, net...@vger.kernel.org, songliu...@fb.com, syzkall...@googlegroups.com, y...@fb.com
Hello,

syzbot found the following issue on:

HEAD commit: 3770333b3f8c Add linux-next specific files for 20220106
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=171aa4e3b00000
kernel config: https://syzkaller.appspot.com/x/.config?x=f9eb40d9f910b474
dashboard link: https://syzkaller.appspot.com/bug?extid=4f789823c1abc5accf13
compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12b08f53b00000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+4f7898...@syzkaller.appspotmail.com

==================================================================
BUG: KASAN: use-after-free in rcu_seq_snap kernel/rcu/rcu.h:91 [inline]
BUG: KASAN: use-after-free in srcu_invoke_callbacks+0x391/0x3c0 kernel/rcu/srcutree.c:1283
Read of size 8 at addr ffff8880189b6968 by task kworker/0:1/7

CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted 5.16.0-rc8-next-20220106-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: rcu_gp srcu_invoke_callbacks
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0xa5/0x3ed mm/kasan/report.c:255
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
rcu_seq_snap kernel/rcu/rcu.h:91 [inline]
srcu_invoke_callbacks+0x391/0x3c0 kernel/rcu/srcutree.c:1283
process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
worker_thread+0x657/0x1110 kernel/workqueue.c:2454
kthread+0x2e9/0x3a0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
</TASK>

Allocated by task 19830:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
kasan_set_track mm/kasan/common.c:45 [inline]
set_alloc_info mm/kasan/common.c:436 [inline]
__kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:469
kasan_slab_alloc include/linux/kasan.h:260 [inline]
slab_post_alloc_hook mm/slab.h:738 [inline]
slab_alloc_node mm/slub.c:3230 [inline]
kmem_cache_alloc_node+0x255/0x3e0 mm/slub.c:3266
blk_alloc_queue+0x5ad/0x870 block/blk-core.c:446
blk_mq_init_queue_data block/blk-mq.c:3878 [inline]
__blk_mq_alloc_disk+0x8c/0x1c0 block/blk-mq.c:3902
nbd_dev_add+0x3b2/0xcd0 drivers/block/nbd.c:1765
nbd_genl_connect+0x11f3/0x1930 drivers/block/nbd.c:1948
genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1343
netlink_sendmsg+0x904/0xdf0 net/netlink/af_netlink.c:1919
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:725
____sys_sendmsg+0x6e8/0x810 net/socket.c:2410
___sys_sendmsg+0xf3/0x170 net/socket.c:2464
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2493
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Freed by task 13:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
kasan_set_track+0x21/0x30 mm/kasan/common.c:45
kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
____kasan_slab_free mm/kasan/common.c:366 [inline]
____kasan_slab_free+0x166/0x1a0 mm/kasan/common.c:328
kasan_slab_free include/linux/kasan.h:236 [inline]
slab_free_hook mm/slub.c:1728 [inline]
slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1754
slab_free mm/slub.c:3509 [inline]
kmem_cache_free+0xdb/0x3b0 mm/slub.c:3526
rcu_do_batch kernel/rcu/tree.c:2536 [inline]
rcu_core+0x7b8/0x1540 kernel/rcu/tree.c:2787
__do_softirq+0x29b/0x9c2 kernel/softirq.c:558

Last potentially related work creation:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
__kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348
call_rcu+0x99/0x730 kernel/rcu/tree.c:3072
kobject_cleanup lib/kobject.c:705 [inline]
kobject_release lib/kobject.c:736 [inline]
kref_put include/linux/kref.h:65 [inline]
kobject_put+0x1c8/0x540 lib/kobject.c:753
disk_release+0x19a/0x260 block/genhd.c:1116
device_release+0x9f/0x240 drivers/base/core.c:2229
kobject_cleanup lib/kobject.c:705 [inline]
kobject_release lib/kobject.c:736 [inline]
kref_put include/linux/kref.h:65 [inline]
kobject_put+0x1c8/0x540 lib/kobject.c:753
put_device+0x1b/0x30 drivers/base/core.c:3512
put_disk block/genhd.c:1368 [inline]
blk_cleanup_disk+0x6b/0x80 block/genhd.c:1384
nbd_dev_remove+0x44/0xf0 drivers/block/nbd.c:253
process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
worker_thread+0x657/0x1110 kernel/workqueue.c:2454
kthread+0x2e9/0x3a0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

Second to last potentially related work creation:
kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
__kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348
insert_work+0x48/0x370 kernel/workqueue.c:1368
__queue_work+0x5ca/0xf30 kernel/workqueue.c:1534
__queue_delayed_work+0x1c8/0x270 kernel/workqueue.c:1682
queue_delayed_work_on+0x105/0x120 kernel/workqueue.c:1718
process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
worker_thread+0x657/0x1110 kernel/workqueue.c:2454
kthread+0x2e9/0x3a0 kernel/kthread.c:377
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295

The buggy address belongs to the object at ffff8880189b5c70
which belongs to the cache request_queue_srcu of size 3816
The buggy address is located 3320 bytes inside of
3816-byte region [ffff8880189b5c70, ffff8880189b6b58)
The buggy address belongs to the page:
page:ffffea0000626c00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x189b0
head:ffffea0000626c00 order:3 compound_mapcount:0 compound_pincount:0
flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
raw: 00fff00000010200 0000000000000000 dead000000000001 ffff888012787140
raw: 0000000000000000 0000000080080008 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 3, migratetype Unmovable, gfp_mask 0x1d20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL), pid 14963, ts 196854875504, free_ts 196839673140
prep_new_page mm/page_alloc.c:2434 [inline]
get_page_from_freelist+0xa72/0x2f40 mm/page_alloc.c:4165
__alloc_pages+0x1b2/0x500 mm/page_alloc.c:5381
alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
alloc_slab_page mm/slub.c:1799 [inline]
allocate_slab mm/slub.c:1944 [inline]
new_slab+0x28d/0x380 mm/slub.c:2004
___slab_alloc+0x6be/0xd60 mm/slub.c:3018
__slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
slab_alloc_node mm/slub.c:3196 [inline]
kmem_cache_alloc_node+0x122/0x3e0 mm/slub.c:3266
blk_alloc_queue+0x5ad/0x870 block/blk-core.c:446
blk_mq_init_queue_data block/blk-mq.c:3878 [inline]
__blk_mq_alloc_disk+0x8c/0x1c0 block/blk-mq.c:3902
nbd_dev_add+0x3b2/0xcd0 drivers/block/nbd.c:1765
nbd_genl_connect+0x11f3/0x1930 drivers/block/nbd.c:1948
genl_family_rcv_msg_doit+0x228/0x320 net/netlink/genetlink.c:731
genl_family_rcv_msg net/netlink/genetlink.c:775 [inline]
genl_rcv_msg+0x328/0x580 net/netlink/genetlink.c:792
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2494
genl_rcv+0x24/0x40 net/netlink/genetlink.c:803
netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1343
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1352 [inline]
free_pcp_prepare+0x414/0xb60 mm/page_alloc.c:1404
free_unref_page_prepare mm/page_alloc.c:3325 [inline]
free_unref_page+0x19/0x690 mm/page_alloc.c:3404
__unfreeze_partials+0x17c/0x1a0 mm/slub.c:2536
qlink_free mm/kasan/quarantine.c:157 [inline]
qlist_free_all+0x6a/0x170 mm/kasan/quarantine.c:176
kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:283
__kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:446
kasan_slab_alloc include/linux/kasan.h:260 [inline]
slab_post_alloc_hook mm/slab.h:738 [inline]
slab_alloc_node mm/slub.c:3230 [inline]
slab_alloc mm/slub.c:3238 [inline]
__kmalloc+0x1e7/0x340 mm/slub.c:4420
kmalloc include/linux/slab.h:586 [inline]
tomoyo_realpath_from_path+0xc3/0x620 security/tomoyo/realpath.c:254
tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
tomoyo_path_perm+0x21b/0x400 security/tomoyo/file.c:822
security_inode_getattr+0xcf/0x140 security/security.c:1326
vfs_getattr fs/stat.c:157 [inline]
vfs_statx+0x164/0x390 fs/stat.c:225
vfs_fstatat fs/stat.c:243 [inline]
__do_sys_newfstatat+0x96/0x120 fs/stat.c:412
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae

Memory state around the buggy address:
ffff8880189b6800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880189b6880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff8880189b6900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff8880189b6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff8880189b6a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches

Hillf Danton

Jan 11, 2022, 8:43:42 AM
to syzbot, Joel Fernandes, Paul E . McKenney, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On Tue, 11 Jan 2022 03:03:17 -0800
To fix the UAF, add a completion to sdp and, while cleaning up the ssp,
wait for sdp->work to complete even after the work has been flushed.

Only for thoughts now.

Hillf

+++ y/kernel/rcu/srcutree.c
@@ -74,6 +74,12 @@ do { \
#define spin_unlock_irqrestore_rcu_node(p, flags) \
spin_unlock_irqrestore(&ACCESS_PRIVATE(p, lock), flags) \

+static inline void init_completion_done(struct completion *x)
+{
+ x->done = 1;
+ init_swait_queue_head(&x->wait);
+}
+
/*
* Initialize SRCU combining tree. Note that statically allocated
* srcu_struct structures might already have srcu_read_lock() and
@@ -150,6 +156,7 @@ static void init_srcu_struct_nodes(struc
INIT_WORK(&sdp->work, srcu_invoke_callbacks);
timer_setup(&sdp->delay_work, srcu_delay_timer, 0);
sdp->ssp = ssp;
+ init_completion_done(&sdp->work_done);
sdp->grpmask = 1 << (cpu - sdp->mynode->grplo);
}
}
@@ -378,6 +385,7 @@ void cleanup_srcu_struct(struct srcu_str

del_timer_sync(&sdp->delay_work);
flush_work(&sdp->work);
+ wait_for_completion(&sdp->work_done);
if (WARN_ON(rcu_segcblist_n_cbs(&sdp->srcu_cblist)))
return; /* Forgot srcu_barrier(), so just leak it! */
}
@@ -1248,6 +1256,7 @@ static void srcu_invoke_callbacks(struct
struct srcu_struct *ssp;

sdp = container_of(work, struct srcu_data, work);
+ reinit_completion(&sdp->work_done);

ssp = sdp->ssp;
rcu_cblist_init(&ready_cbs);
@@ -1257,7 +1266,7 @@ static void srcu_invoke_callbacks(struct
if (sdp->srcu_cblist_invoking ||
!rcu_segcblist_ready_cbs(&sdp->srcu_cblist)) {
spin_unlock_irq_rcu_node(sdp);
- return; /* Someone else on the job or nothing to do. */
+ goto out; /* Someone else on the job or nothing to do. */
}

/* We are on the job! Extract and invoke ready callbacks. */
@@ -1287,6 +1296,9 @@ static void srcu_invoke_callbacks(struct
spin_unlock_irq_rcu_node(sdp);
if (more)
srcu_schedule_cbs_sdp(sdp, 0);
+ else
+out:
+ complete(&sdp->work_done);
}

/*
+++ y/include/linux/srcutree.h
@@ -40,6 +40,7 @@ struct srcu_data {
/* ->srcu_data_have_cbs[]. */
int cpu;
struct srcu_struct *ssp;
+ struct completion work_done;
};

/*

Paul E. McKenney

Jan 11, 2022, 2:05:04 PM
to Hillf Danton, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, boqun...@gmail.com
This can happen when a given srcu_struct was passed to cleanup_srcu_struct
before all callbacks were invoked.

> > srcu_invoke_callbacks+0x391/0x3c0 kernel/rcu/srcutree.c:1283
> > process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
> > worker_thread+0x657/0x1110 kernel/workqueue.c:2454
> > kthread+0x2e9/0x3a0 kernel/kthread.c:377
> > ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
> > </TASK>
> >
> > Allocated by task 19830:
> > kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
> > kasan_set_track mm/kasan/common.c:45 [inline]
> > set_alloc_info mm/kasan/common.c:436 [inline]
> > __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:469
> > kasan_slab_alloc include/linux/kasan.h:260 [inline]
> > slab_post_alloc_hook mm/slab.h:738 [inline]
> > slab_alloc_node mm/slub.c:3230 [inline]
> > kmem_cache_alloc_node+0x255/0x3e0 mm/slub.c:3266

The memory in question was allocated by blk_alloc_queue().
This is an RCU (not SRCU!) callback.

> > rcu_do_batch kernel/rcu/tree.c:2536 [inline]
> > rcu_core+0x7b8/0x1540 kernel/rcu/tree.c:2787
> > __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
> >
> > Last potentially related work creation:
> > kasan_save_stack+0x1e/0x40 mm/kasan/common.c:38
> > __kasan_record_aux_stack+0xbe/0xd0 mm/kasan/generic.c:348

This matches the RCU callback.
Again blk_alloc_queue().
Good catch!!!

First, restating the usage rules to help explain the failure scenario:

------------------------------------------------------------------------

You are supposed to wait until all SRCU callbacks for a given srcu_struct
have been invoked before invoking cleanup_srcu_struct(). There is an
srcu_barrier() to help with this, but for it to do you any good you have
to have made sure that you stopped calling call_srcu() before invoking
srcu_barrier().

Failing to follow this rule, that is, invoking cleanup_srcu_struct()
before all the corresponding SRCU callbacks have been invoked, could
indeed result in the mess above, and much worse.

There are checks in cleanup_srcu_struct() for this sort of thing,
namely:

if (WARN_ON(rcu_segcblist_n_cbs(&sdp->srcu_cblist)))

However, such checks cannot be perfect. For example, the
cleanup_srcu_struct() might be invoked just a bit too early,
after the last callback was invoked (hence the above WARN_ON() not
triggering) and just before the call to rcu_seq_snap() at the end of
srcu_invoke_callbacks() (hence KASAN being unhappy with that call to
srcu_seq_snap()).

So:

1. Make sure that you stop invoking call_srcu() before starting the
process of freeing that srcu_struct. In addition, you of course
need to make sure that you have stopped passing the srcu_struct
in question to anything else.

2. Once all CPUs agree that you have stopped invoking call_srcu(),
invoke srcu_barrier() to ensure that all callbacks have been
invoked. Keep in mind that srcu_barrier() is only guaranteed
to wait for callbacks queued by calls to call_srcu() that happen
-before- that call to srcu_barrier(). This is one of those cases
where being almost correct has nothing to do with being correct.

3. Then, and only then, is it safe to invoke cleanup_srcu_struct().

Of course, if you never use call_srcu() on a given srcu_struct,
you don't need to invoke srcu_barrier().
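In code, the rules above amount to a strict three-step ordering. A non-runnable pseudocode sketch (`my_srcu` and the quiescing helper are hypothetical stand-ins for the caller's own logic, not real kernel symbols):

```c
/* Sketch of the required teardown order, per the rules above. */
static struct srcu_struct my_srcu;        /* hypothetical example instance */

static void my_teardown(void)
{
	/* 1. Guarantee that no new call_srcu(&my_srcu, ...) can start,
	 *    for example by unregistering whatever queues callbacks.  */
	stop_queueing_srcu_callbacks();       /* hypothetical helper */

	/* 2. Wait for every already-queued callback to be invoked. */
	srcu_barrier(&my_srcu);

	/* 3. Only now is it safe to tear down the srcu_struct. */
	cleanup_srcu_struct(&my_srcu);
}
```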

------------------------------------------------------------------------

Except that srcu_invoke_callbacks() accesses the various srcu_struct
fields after the last srcu_barrier() callback was invoked, which
means that the following sequence of events can happen:

o Task A invokes the last call_srcu().

o Later, Task B invokes srcu_barrier().

o The last srcu_barrier() callback is invoked by CPU 0 from
within srcu_invoke_callbacks(), which awakens Task B.

o Task B invokes cleanup_srcu_struct(), which is perfectly legal
given that Task B followed the rules called out above.

o But now CPU 0 continues executing srcu_invoke_callbacks(),
which accesses srcu_struct fields and related structures
that were freed by cleanup_srcu_struct().

As both you and syzbot say, use after free bug.

> Only for thoughts now.
>
> Hillf
>
> +++ y/kernel/rcu/srcutree.c
> @@ -74,6 +74,12 @@ do { \
> #define spin_unlock_irqrestore_rcu_node(p, flags) \
> spin_unlock_irqrestore(&ACCESS_PRIVATE(p, lock), flags) \
>
> +static inline void init_completion_done(struct completion *x)
> +{
> + x->done = 1;
> + init_swait_queue_head(&x->wait);

This is small enough that it should be inlined.

But it turns out that we don't really need it.

> +}
> +
> /*
> * Initialize SRCU combining tree. Note that statically allocated
> * srcu_struct structures might already have srcu_read_lock() and
> @@ -150,6 +156,7 @@ static void init_srcu_struct_nodes(struc
> INIT_WORK(&sdp->work, srcu_invoke_callbacks);
> timer_setup(&sdp->delay_work, srcu_delay_timer, 0);
> sdp->ssp = ssp;
> + init_completion_done(&sdp->work_done);
> sdp->grpmask = 1 << (cpu - sdp->mynode->grplo);
> }
> }
> @@ -378,6 +385,7 @@ void cleanup_srcu_struct(struct srcu_str
>
> del_timer_sync(&sdp->delay_work);
> flush_work(&sdp->work);
> + wait_for_completion(&sdp->work_done);

With the changes described below, this happens once rather than once
per CPU. The trick is to make use of the ordering that is already
provided by the atomic_dec_and_test(&ssp->srcu_barrier_cpu_cnt) in the
srcu_barrier() function's callback function, that is, srcu_barrier_cb(),
but by moving this to the end of srcu_invoke_callbacks().


> if (WARN_ON(rcu_segcblist_n_cbs(&sdp->srcu_cblist)))
> return; /* Forgot srcu_barrier(), so just leak it! */
> }
> @@ -1248,6 +1256,7 @@ static void srcu_invoke_callbacks(struct
> struct srcu_struct *ssp;
>
> sdp = container_of(work, struct srcu_data, work);
> + reinit_completion(&sdp->work_done);

We can avoid this by reusing the existing ->srcu_barrier_completion.

> ssp = sdp->ssp;
> rcu_cblist_init(&ready_cbs);
> @@ -1257,7 +1266,7 @@ static void srcu_invoke_callbacks(struct
> if (sdp->srcu_cblist_invoking ||
> !rcu_segcblist_ready_cbs(&sdp->srcu_cblist)) {
> spin_unlock_irq_rcu_node(sdp);
> - return; /* Someone else on the job or nothing to do. */
> + goto out; /* Someone else on the job or nothing to do. */
> }
>
> /* We are on the job! Extract and invoke ready callbacks. */
> @@ -1287,6 +1296,9 @@ static void srcu_invoke_callbacks(struct
> spin_unlock_irq_rcu_node(sdp);
> if (more)
> srcu_schedule_cbs_sdp(sdp, 0);
> + else
> +out:
> + complete(&sdp->work_done);

This is too ugly to live. ;-)

And it should also be executed in srcu_barrier(). This can be done using
a flag that is cleared early on in srcu_barrier() and then set by the last
srcu_barrier() callback (the same one that does the complete() invocation).

Plus we really do want to execute complete() in the case where there are
callbacks in order to increase the probability of cleanup_srcu_struct()
finding SRCU-abuse bugs.

> }
>
> /*
> +++ y/include/linux/srcutree.h
> @@ -40,6 +40,7 @@ struct srcu_data {
> /* ->srcu_data_have_cbs[]. */
> int cpu;
> struct srcu_struct *ssp;
> + struct completion work_done;

All we really need to add here is a flag that allows srcu_barrier_cb()
to signal srcu_invoke_callbacks().

OK, I didn't do a really good job of describing what I am after, so here
is a patch with your Reported-by and that of syzbot.

Thoughts?

Thanx, Paul

------------------------------------------------------------------------

commit b01a1d25f93633c89f782ec5d3830beeb3892030
Author: Paul E. McKenney <pau...@kernel.org>
Date: Tue Jan 11 10:30:07 2022 -0800

srcu: Prevent srcu_barrier()->cleanup_srcu_struct() use-after-free

Currently, srcu_barrier() will return immediately after its last
srcu_barrier_cb() has been invoked, thus guaranteeing that all previous
callbacks have been completely invoked, as required. Unfortunately,
this is insufficient in the case where the call to srcu_barrier()
is immediately followed by a call to cleanup_srcu_struct(). In this
case, srcu_invoke_callbacks() will access the srcu_struct and related
structures after invoking the callbacks, and these structures might
already have been freed by cleanup_srcu_struct().

This can happen as follows:

o Task A invokes the last call_srcu().

o Later, Task B invokes srcu_barrier().

o The last srcu_barrier() callback is invoked by CPU 0 from
within srcu_invoke_callbacks(), which awakens Task B.

o Task B invokes cleanup_srcu_struct(), which is perfectly legal
assuming that Task B waited until the last call_srcu()
completed before invoking srcu_barrier().

o But now CPU 0 continues executing srcu_invoke_callbacks(),
which accesses srcu_struct fields and related structures
that were freed by cleanup_srcu_struct().

This commit fixes this by moving the complete() call that awakens
srcu_barrier() from the srcu_barrier_cb() callback to the end of
srcu_invoke_callbacks(). This requires introducing a per-srcu_data
structure ->srcu_barrier_cb_done flag that allows srcu_barrier_cb()
to tell srcu_invoke_callbacks() that a wakeup might be required.

Link: https://lore.kernel.org/all/00000000000081...@google.com/
Reported-by: syzbot+4f7898...@syzkaller.appspotmail.com
Reported-by: Hillf Danton <hda...@sina.com>
Signed-off-by: Paul E. McKenney <pau...@kernel.org>

diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h
index cb1f4351e8baa..339dd55fbea5c 100644
--- a/include/linux/srcutree.h
+++ b/include/linux/srcutree.h
@@ -32,6 +32,7 @@ struct srcu_data {
unsigned long srcu_gp_seq_needed; /* Furthest future GP needed. */
unsigned long srcu_gp_seq_needed_exp; /* Furthest future exp GP. */
bool srcu_cblist_invoking; /* Invoking these CBs? */
+ bool srcu_barrier_cb_done; /* This CPU's srcu_barrier_cb() done? */
struct timer_list delay_work; /* Delay for CB invoking */
struct work_struct work; /* Context for CB invoking. */
struct rcu_head srcu_barrier_head; /* For srcu_barrier() use. */
diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 6833d88871816..128034f858897 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -1088,13 +1088,9 @@ EXPORT_SYMBOL_GPL(poll_state_synchronize_srcu);
*/
static void srcu_barrier_cb(struct rcu_head *rhp)
{
- struct srcu_data *sdp;
- struct srcu_struct *ssp;
+ struct srcu_data *sdp = container_of(rhp, struct srcu_data, srcu_barrier_head);

- sdp = container_of(rhp, struct srcu_data, srcu_barrier_head);
- ssp = sdp->ssp;
- if (atomic_dec_and_test(&ssp->srcu_barrier_cpu_cnt))
- complete(&ssp->srcu_barrier_completion);
+ sdp->srcu_barrier_cb_done = true;
}

/**
@@ -1287,6 +1283,15 @@ static void srcu_invoke_callbacks(struct work_struct *work)
spin_unlock_irq_rcu_node(sdp);
if (more)
srcu_schedule_cbs_sdp(sdp, 0);
+
+ // Do complete() to wake up srcu_barrier() here to ensure no
+ // use-after-free should cleanup_srcu_struct() be invoked immediately
+ // after the return from srcu_barrier().
+ if (sdp->srcu_barrier_cb_done) {
+ sdp->srcu_barrier_cb_done = false;
+ if (atomic_dec_and_test(&ssp->srcu_barrier_cpu_cnt))
+ complete(&ssp->srcu_barrier_completion);
+ }
}

/*

Hillf Danton

Jan 12, 2022, 4:44:05 AM
to Paul E. McKenney, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, Hillf Danton, boqun...@gmail.com
Hi Paul,

Thank you for taking a look at it.
Currently no waiter is woken up without running the last callback, given
atomic_dec_and_test().

> }
>
> /**
> @@ -1287,6 +1283,15 @@ static void srcu_invoke_callbacks(struct work_struct *work)
> spin_unlock_irq_rcu_node(sdp);
> if (more)
> srcu_schedule_cbs_sdp(sdp, 0);
> +
> + // Do complete() to wake up srcu_barrier() here to ensure no
> + // use-after-free should cleanup_srcu_struct() be invoked immediately
> + // after the return from srcu_barrier().
> + if (sdp->srcu_barrier_cb_done) {
> + sdp->srcu_barrier_cb_done = false;
> + if (atomic_dec_and_test(&ssp->srcu_barrier_cpu_cnt))
> + complete(&ssp->srcu_barrier_completion);
> + }
> }

After this patch, no waiter is woken up without running the last callback,
given atomic_dec_and_test(), while the UAF says it is not safe to dereference
ssp in the workqueue work callback after flushing the work. I mean that
moving the wakeup to the end of the WQ work callback does not do more than
flush_work().

cleanup_srcu_struct()
flush_work(&sdp->work)

Hillf
>
> /*
>

Paul E. McKenney

Jan 12, 2022, 12:12:51 PM
to Hillf Danton, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, boqun...@gmail.com
Heh!!!

It is worse than that. There are no callers of srcu_barrier() outside
of RCU and rcutorture. So neither of our patches can possibly have any
effect whatsoever!

Given that you have shown that the flush_work() covers the race window I
thought I was seeing, I will revert mine. Or can you provide a detailed
sequence of events in which no SRCU API calls for a given srcu_struct
fail to happen before the cleanup_srcu_struct(), but which still
results in a use-after-free error?

And here I was thinking that that COVID booster a few days ago hadn't
impaired my mental abilities. :-/

Given that there are no srcu_barrier() invocations, any use of call_srcu()
in an unloadable module is suspect. From what I can see, the only
invocations of call_srcu() from a module are in fs/dlm/lowcomms.c.
So can fs/dlm be unloaded? Might it have been unloaded in this test?
For that matter, why does fs/dlm need a per-instance srcu_struct?
At first glance, it looks like a single global srcu_struct would
serve just as well. This would not make srcu_barrier() unnecessary at
module-unload time, but it would make it so that only one srcu_barrier()
was required at that time.
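A rough sketch of that suggestion (non-runnable kernel pseudocode; the exit hook and quiescing helper are hypothetical, and whether fs/dlm can actually share one srcu_struct across instances is exactly the open question here):

```c
/* One module-wide srcu_struct instead of one per instance. */
DEFINE_STATIC_SRCU(dlm_srcu);            /* hypothetical shared instance */

static void __exit dlm_exit(void)
{
	/* First make sure nothing can still call_srcu(&dlm_srcu, ...). */
	stop_all_srcu_callback_sources();    /* hypothetical helper */

	/* A single barrier then covers every user of the shared struct,
	 * so only one srcu_barrier() is needed at module-unload time.  */
	srcu_barrier(&dlm_srcu);
}
```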

Another possibility is that something is invoking cleanup_srcu_struct()
too early. If any other use of that same srcu_struct could happen
concurrently with that cleanup_srcu_struct(), then we could see this
sort of problem.

Another possibility, this time within SRCU, is if some set of race
conditions can result in SRCU's internals doing an extra unnecessary
grace period. If this race happened for the last synchronize_srcu()
prior to the cleanup_srcu_struct(), that could explain this issue.

Thoughts?

Thanx, Paul

Paul E. McKenney

Jan 12, 2022, 1:05:35 PM
to Hillf Danton, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, boqun...@gmail.com
And in the meantime, please see below for a patch designed to catch
extra-SRCU-grace-period issues.

Again, thoughts?

Thanx, Paul

------------------------------------------------------------------------

commit d730d54e543e5cf7376ff61e3ad407490f08c85e
Author: Paul E. McKenney <pau...@kernel.org>
Date: Wed Jan 12 09:52:44 2022 -0800

srcu: Tighten cleanup_srcu_struct() GP checks

Currently, cleanup_srcu_struct() checks for a grace period in progress,
but it does not check for a grace period that has not yet started but
which might start at any time. Such a situation could result in a
use-after-free bug, so this commit adds a check for a grace period that
is needed but not yet started to cleanup_srcu_struct().

Signed-off-by: Paul E. McKenney <pau...@kernel.org>

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 128034f858897..c78391a5e20dc 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -382,9 +382,11 @@ void cleanup_srcu_struct(struct srcu_struct *ssp)
return; /* Forgot srcu_barrier(), so just leak it! */
}
if (WARN_ON(rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)) != SRCU_STATE_IDLE) ||
+ WARN_ON(rcu_seq_current(&ssp->srcu_gp_seq) != ssp->srcu_gp_seq_needed) ||
WARN_ON(srcu_readers_active(ssp))) {
- pr_info("%s: Active srcu_struct %p state: %d\n",
- __func__, ssp, rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)));
+ pr_info("%s: Active srcu_struct %p read state: %d gp state: %lu/%lu\n",
+ __func__, ssp, rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)),
+ rcu_seq_current(&ssp->srcu_gp_seq), ssp->srcu_gp_seq_needed);
return; /* Caller forgot to stop doing call_srcu()? */
}
free_percpu(ssp->sda);

Ming Lei

Jan 12, 2022, 10:20:05 PM
to syzbot, and...@kernel.org, a...@kernel.org, Jens Axboe, b...@vger.kernel.org, dan...@iogearbox.net, john.fa...@gmail.com, ka...@fb.com, kps...@kernel.org, linux-block, linux-...@vger.kernel.org, net...@vger.kernel.org, songliu...@fb.com, syzkall...@googlegroups.com, y...@fb.com
On Tue, Jan 11, 2022 at 7:03 PM syzbot
<syzbot+4f7898...@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: 3770333b3f8c Add linux-next specific files for 20220106
> git tree: linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=171aa4e3b00000
> kernel config: https://syzkaller.appspot.com/x/.config?x=f9eb40d9f910b474
> dashboard link: https://syzkaller.appspot.com/bug?extid=4f789823c1abc5accf13
> compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=12b08f53b00000
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+4f7898...@syzkaller.appspotmail.com

BTW, the report should be addressed by the patch:

https://lore.kernel.org/linux-block/20220111123401.5...@redhat.com/T/#u

Thanks,
Ming

Hillf Danton

Jan 12, 2022, 11:49:55 PM
to Paul E. McKenney, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, boqun...@gmail.com
Without the help of a COVID booster, what I can do for the flush_work()
in cleanup_srcu_struct() is to go another round of invoking SRCU callbacks
directly, without bothering to queue the work again.

With the minor change, we are sure work is idle after flush.

Hillf

+++ b/kernel/rcu/srcutree.c
@@ -1232,12 +1232,6 @@ static void srcu_advance_state(struct sr
}
}

-/*
- * Invoke a limited number of SRCU callbacks that have passed through
- * their grace period. If there are more to do, SRCU will reschedule
- * the workqueue. Note that needed memory barriers have been executed
- * in this task's context by srcu_readers_active_idx_check().
- */
static void srcu_invoke_callbacks(struct work_struct *work)
{
long len;
@@ -1249,6 +1243,7 @@ static void srcu_invoke_callbacks(struct

sdp = container_of(work, struct srcu_data, work);

+again:
ssp = sdp->ssp;
rcu_cblist_init(&ready_cbs);
spin_lock_irq_rcu_node(sdp);
@@ -1276,7 +1271,7 @@ static void srcu_invoke_callbacks(struct

/*
* Update counts, accelerate new callbacks, and if needed,
- * schedule another round of callback invocation.
+ * go another round of callback invocation.
*/
spin_lock_irq_rcu_node(sdp);
rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
@@ -1286,7 +1281,7 @@ static void srcu_invoke_callbacks(struct
more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
spin_unlock_irq_rcu_node(sdp);
if (more)
- srcu_schedule_cbs_sdp(sdp, 0);
+ goto again;
}

/*

Paul E. McKenney

Jan 13, 2022, 12:15:03 AM
to Hillf Danton, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, boqun...@gmail.com
First, exactly what sequence of events would this change protect against?
Second, in non-shutdown conditions, what prevents an infinite loop inside
of srcu_invoke_callbacks()?

Thanx, Paul

Hillf Danton

Jan 13, 2022, 8:13:15 AM
to Paul E. McKenney, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, boqun...@gmail.com
On Wed, 12 Jan 2022 21:14:59 -0800 Paul E. McKenney wrote:
> On Thu, Jan 13, 2022 at 12:49:38PM +0800, Hillf Danton wrote:
...
It makes sure that the work is idle after the flush only within
cleanup_srcu_struct(), regardless of whatever happened prior to the cleanup.

That tight loop can be avoided by adding a flushing flag to srcu_data,
which is set before the flush in cleanup_srcu_struct() and checked at the
end of srcu_invoke_callbacks().

Just some thoughts for now.

Hillf

+++ b/kernel/rcu/srcutree.c
@@ -376,6 +376,9 @@ void cleanup_srcu_struct(struct srcu_str
for_each_possible_cpu(cpu) {
struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu);

+ spin_lock_irq_rcu_node(sdp);
+ sdp->flushing = true;
+ spin_unlock_irq_rcu_node(sdp);
del_timer_sync(&sdp->delay_work);
flush_work(&sdp->work);
if (WARN_ON(rcu_segcblist_n_cbs(&sdp->srcu_cblist)))
@@ -1232,16 +1235,11 @@ static void srcu_advance_state(struct sr
}
}

-/*
- * Invoke a limited number of SRCU callbacks that have passed through
- * their grace period. If there are more to do, SRCU will reschedule
- * the workqueue. Note that needed memory barriers have been executed
- * in this task's context by srcu_readers_active_idx_check().
- */
static void srcu_invoke_callbacks(struct work_struct *work)
{
long len;
bool more;
+ bool flushing;
struct rcu_cblist ready_cbs;
struct rcu_head *rhp;
struct srcu_data *sdp;
@@ -1249,6 +1247,7 @@ static void srcu_invoke_callbacks(struct

sdp = container_of(work, struct srcu_data, work);

+again:
ssp = sdp->ssp;
rcu_cblist_init(&ready_cbs);
spin_lock_irq_rcu_node(sdp);
@@ -1276,17 +1275,21 @@ static void srcu_invoke_callbacks(struct

/*
* Update counts, accelerate new callbacks, and if needed,
- * schedule another round of callback invocation.
+ * go another round of callback invocation.
*/
spin_lock_irq_rcu_node(sdp);
rcu_segcblist_add_len(&sdp->srcu_cblist, -len);
(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
rcu_seq_snap(&ssp->srcu_gp_seq));
sdp->srcu_cblist_invoking = false;
+ flushing = sdp->flushing;
more = rcu_segcblist_ready_cbs(&sdp->srcu_cblist);
spin_unlock_irq_rcu_node(sdp);
if (more)
- srcu_schedule_cbs_sdp(sdp, 0);
+ if (flushing)
+ goto again;
+ else
+ srcu_schedule_cbs_sdp(sdp, 0);
}

/*

Paul E. McKenney

Jan 13, 2022, 9:59:04 AM
to Hillf Danton, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, boqun...@gmail.com
Nice try, but that answer does not match the question. ;-)

I need to know a definite sequence of events corresponding to your
"whatever prior".

Otherwise, we are just as likely to be breaking something as we are to be
fixing something, and the breakage might well be worse than the problem.
Is the syzbot report reproducible?

Thanx, Paul

Boqun Feng

Jan 14, 2022, 9:38:56 AM
to Paul E. McKenney, Hillf Danton, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, Jens Axboe, Ming Lei
Hi,

On Tue, Jan 11, 2022 at 11:05:00AM -0800, Paul E. McKenney wrote:
[...]
> > > The buggy address belongs to the object at ffff8880189b5c70
> > > which belongs to the cache request_queue_srcu of size 3816

This cache name drew my attention when I was trying to look into this,
because I couldn't find it in v5.16; later on I realized the UAF was
found in linux-next, and the commit introducing the cache was merged into
mainline in the 5.17 merge window:

704b914f15fb blk-mq: move srcu from blk_mq_hw_ctx to request_queue

I think the UAF is actually a bug introduced by that commit: it moved the
srcu structure from blk_mq_hw_ctx to request_queue, so the
cleanup_srcu_struct() call should have been moved from
blk_mq_hw_sysfs_release() to blk_release_queue(). However, the commit
only deleted the call in blk_mq_hw_sysfs_release() without adding a new
one in blk_release_queue(). As a result, when a request queue is freed,
its srcu structure is not fully cleaned up, hence the UAF.

IOW, something like the (untested) patch below should fix this. Copying
the author and the block maintainers.

Regards,
Boqun

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index e20eadfcf5c8..10f59416376e 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -812,6 +812,10 @@ static void blk_release_queue(struct kobject *kobj)
bioset_exit(&q->bio_split);

ida_simple_remove(&blk_queue_ida, q->id);
+
+ if (test_bit(QUEUE_FLAG_HAS_SRCU, &q->queue_flags)) {
+ cleanup_srcu_struct(q->srcu);
+ }
call_rcu(&q->rcu_head, blk_free_queue_rcu);

Paul E. McKenney

Jan 14, 2022, 10:27:55 AM
to Boqun Feng, Hillf Danton, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, Jens Axboe, Ming Lei
On Fri, Jan 14, 2022 at 10:38:42PM +0800, Boqun Feng wrote:
> Hi,
>
> On Tue, Jan 11, 2022 at 11:05:00AM -0800, Paul E. McKenney wrote:
> [...]
> > > > The buggy address belongs to the object at ffff8880189b5c70
> > > > which belongs to the cache request_queue_srcu of size 3816
>
> This cache name drew my attention when I was trying to look into this,
> because I couldn't find it in v5.16, later on I realized the UAF was
> found in linux-next and the commit introducing the cache was merged into
> mainline if 5.17 merge window:
>
> 704b914f15fb blk-mq: move srcu from blk_mq_hw_ctx to request_queue
>
> I think the UAF is actually a bug introduced by the commit, because in
> that commit srcu structure was moved from blk_mq_hw_ctx to
> request_queue, and therefore the cleanup_srcu_struct() should be moved
> from blk_mq_hw_sysfs_release() to blk_release_queue(), however the above
> commit only deleted the one in blk_mq_hw_sysfs_release() but didn't add
> a new one in blk_release_queue(). As a result when a request queue is
> freed, the srcu structure is not fully clean up, therefore the UAF.
>
> IOW, something like below (untested) should fix this. Copy the auther
> and block maintainers.

One question for the author and block maintainers... Why not simply have
a single srcu_struct for all of the queues? Or is there some situation
where you need one queue's reader to avoid blocking other queues' SRCU
grace periods?

Thanx, Paul

Ming Lei

Jan 14, 2022, 10:55:10 AM
to Boqun Feng, Paul E. McKenney, Hillf Danton, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, Jens Axboe
On Fri, Jan 14, 2022 at 10:38:42PM +0800, Boqun Feng wrote:
Yes, this fix was posted days ago:

https://lore.kernel.org/linux-block/20220111123401.5...@redhat.com/T/#u


Thanks,
Ming

Ming Lei

Jan 14, 2022, 10:58:45 AM
to Paul E. McKenney, Boqun Feng, Hillf Danton, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, Jens Axboe
Because srcu_struct is too fat, only a few drivers need it, and most
block drivers don't need it at all.


Thanks,
Ming

Paul E. McKenney

Jan 14, 2022, 11:11:41 AM
to Ming Lei, Boqun Feng, Hillf Danton, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, Jens Axboe
Fair points.

But would it make sense to dynamically allocate a single srcu_struct
when the first need arose, and only remove it when the last need passed?

Thanx, Paul

Ming Lei

Jan 14, 2022, 11:16:06 AM
to Paul E. McKenney, Boqun Feng, Hillf Danton, syzbot, Joel Fernandes, Lai Jiangshan, linux-...@vger.kernel.org, syzkall...@googlegroups.com, quic_n...@quicinc.com, fred...@kernel.org, ure...@gmail.com, Jens Axboe
q->srcu is referenced in the fast IO path, and we want to avoid adding an
indirect reference there, so we are not allocating it dynamically.


Thanks,
Ming
