[syzbot] [kernel?] KCSAN: assert: race in dequeue_entities


syzbot

Sep 27, 2024, 10:57:26 AM
to au...@vger.kernel.org, epa...@redhat.com, linux-...@vger.kernel.org, pa...@paul-moore.com, syzkall...@googlegroups.com
Hello,

syzbot found the following issue on:

HEAD commit: 075dbe9f6e3c Merge tag 'soc-ep93xx-dt-6.12' of git://git.k..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=15f07a80580000
kernel config: https://syzkaller.appspot.com/x/.config?x=86254f9e0a8f2c98
dashboard link: https://syzkaller.appspot.com/bug?extid=0ec1e96c2cdf5c0e512a
compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 2.40

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/1be80941df60/disk-075dbe9f.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/494a9ac89c09/vmlinux-075dbe9f.xz
kernel image: https://storage.googleapis.com/syzbot-assets/919788d8c731/bzImage-075dbe9f.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+0ec1e9...@syzkaller.appspotmail.com

audit: type=1400 audit(6022412112.141:243086): avc: denied { read } for pid=2950 comm="syslogd" name="log" dev="sda1" ino=1915 scontext=system_u:system_r:syslogd_t tcontext=system_u:object_r:var_t tclass=lnk_file permissive=0
audit: type=1400 audit(6022412112.181:243087): avc: denied { read } for pid=2950 comm="syslogd" name="log" dev="sda1" ino=1915 scontext=system_u:system_r:syslogd_t tcontext=system_u:object_r:var_t tclass=lnk_file permissive=0
==================================================================
BUG: KCSAN: assert: race in __block_task kernel/sched/sched.h:2770 [inline]
BUG: KCSAN: assert: race in dequeue_entities+0x6df/0x760 kernel/sched/fair.c:7177

race at unknown origin, with assert no writes to 0xffff888101764268 of 4 bytes by task 29 on cpu 1:
__block_task kernel/sched/sched.h:2770 [inline]
dequeue_entities+0x6df/0x760 kernel/sched/fair.c:7177
pick_next_entity kernel/sched/fair.c:5627 [inline]
pick_task_fair kernel/sched/fair.c:8856 [inline]
pick_next_task_fair+0x7d/0x410 kernel/sched/fair.c:8876
__pick_next_task kernel/sched/core.c:5955 [inline]
pick_next_task kernel/sched/core.c:6477 [inline]
__schedule+0x284/0x940 kernel/sched/core.c:6629
__schedule_loop kernel/sched/core.c:6752 [inline]
schedule+0x55/0xc0 kernel/sched/core.c:6767
kauditd_thread+0x56b/0x650 kernel/audit.c:911
kthread+0x1d1/0x210 kernel/kthread.c:389
ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

value changed: 0x00000000 -> 0x00000001

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 29 Comm: kauditd Not tainted 6.11.0-syzkaller-11558-g075dbe9f6e3c #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/13/2024
==================================================================


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

Alexander Potapenko

Oct 22, 2024, 4:07:04 AM
to syzbot, Peter Zijlstra, au...@vger.kernel.org, epa...@redhat.com, linux-...@vger.kernel.org, pa...@paul-moore.com, syzkall...@googlegroups.com
+PeterZ, who added the KCSAN assertion.

Peter, I am also hitting the same assert on ARM while fuzzing KVM. It
is tricky to reproduce though (we probably need to stress the
scheduler hard enough):

==================================================================
BUG: KCSAN: assert: race in dequeue_entities+0x6d8/0x794
kernel/sched/sched.h:2773

race at unknown origin, with assert no writes to 0xffffff80039f91e8 of
4 bytes by task 27858 on cpu 2:
dequeue_entities+0x6d8/0x794 kernel/sched/sched.h:2773
pick_next_task_fair+0x84/0x394 kernel/sched/fair.c:5627
__schedule+0x218/0x888 kernel/sched/core.c:5968
preempt_schedule_irq+0x3c/0x68 kernel/sched/core.c:7012
arm64_preempt_schedule_irq+0x2c/0x40 arch/arm64/kernel/entry-common.c:301
__el1_irq arch/arm64/kernel/entry-common.c:539 [inline]
el1_interrupt+0x3c/0x54 arch/arm64/kernel/entry-common.c:551
el1h_64_irq_handler+0x18/0x24 arch/arm64/kernel/entry-common.c:556
el1h_64_irq+0x64/0x68 arch/arm64/kernel/entry.S:594
flush_ptrace_hw_breakpoint+0x4f8/0x6b0 arch/arm64/kernel/ptrace.c:221
flush_thread+0x70/0xf4 arch/arm64/kernel/process.c:287
begin_new_exec+0x9ac/0xcdc fs/exec.c:1307
load_elf_binary+0x4d0/0x1330 fs/binfmt_elf.c:996
bprm_execve+0x3d8/0x958 fs/exec.c:1752
kernel_execve+0x550/0x57c fs/exec.c:2012
call_usermodehelper_exec_async+0x18c/0x270 kernel/umh.c:110
ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860

value changed: 0x00000000 -> 0x00000001

Reported by Kernel Concurrency Sanitizer on:
CPU: 2 UID: 0 PID: 27858 Comm: kworker/u16:3 Not tainted
6.12.0-rc4-00001-g517096192323-dirty #84
Hardware name: linux,dummy-virt (DT)
==================================================================

Peter Zijlstra

Oct 22, 2024, 7:31:36 AM
to Alexander Potapenko, syzbot, au...@vger.kernel.org, epa...@redhat.com, linux-...@vger.kernel.org, pa...@paul-moore.com, syzkall...@googlegroups.com
Well, PaulMck did in d6111cf45c57 ("sched: Use WRITE_ONCE() for
p->on_rq"), I just moved it around in e8901061ca0c ("sched: Split
DEQUEUE_SLEEP from deactivate_task()").

I'm not at all sure I have any inkling as to what the annotation does
nor what KCSAN is trying to tell us above.

Can someone please translate?


Marco Elver

Oct 22, 2024, 9:41:02 AM
to Peter Zijlstra, pau...@kernel.org, Alexander Potapenko, syzbot, au...@vger.kernel.org, epa...@redhat.com, linux-...@vger.kernel.org, pa...@paul-moore.com, syzkall...@googlegroups.com
On Tue, Oct 22, 2024 at 01:31PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 22, 2024 at 10:06:23AM +0200, Alexander Potapenko wrote:
> > On Fri, Sep 27, 2024 at 4:57 PM syzbot
> > <syzbot+0ec1e9...@syzkaller.appspotmail.com> wrote:
[...]
> > +PeterZ, who added the KCSAN assertion.
>
> Well, PaulMck did in d6111cf45c57 ("sched: Use WRITE_ONCE() for
> p->on_rq"), I just moved it around in e8901061ca0c ("sched: Split
> DEQUEUE_SLEEP from deactivate_task()").
>
> I'm not at all sure I have any inkling as to what the annotation does
> nor what KCSAN is trying to tell us above.

ASSERT_EXCLUSIVE_WRITER(var) asserts that there should be no
concurrent writes to var; concurrent readers are allowed. If KCSAN is
enabled, it then reports any violations of that assertion. The main
use case is for already-marked accesses where concurrent accesses are
_not_ data races, but the algorithm nevertheless does not assume
concurrent writers.

In this case it seems that Paul was trying to say that there should be
no concurrent writers to this variable. But KCSAN disproved that.

> Can someone please translate?

We can get the 2nd stack trace with:

--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -10,8 +10,8 @@ KCOV_INSTRUMENT := n

# Disable KCSAN to avoid excessive noise and performance degradation. To avoid
# false positives ensure barriers implied by sched functions are instrumented.
-KCSAN_SANITIZE := n
-KCSAN_INSTRUMENT_BARRIERS := y
+#KCSAN_SANITIZE := n
+#KCSAN_INSTRUMENT_BARRIERS := y

Which gives us:

| ==================================================================
| BUG: KCSAN: assert: race in dequeue_entities / ttwu_do_activate
|
| write (marked) to 0xffff9e100329c628 of 4 bytes by interrupt on cpu 0:
| activate_task kernel/sched/core.c:2064 [inline]

This is this one:

void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
if (task_on_rq_migrating(p))
flags |= ENQUEUE_MIGRATED;
if (flags & ENQUEUE_MIGRATED)
sched_mm_cid_migrate_to(rq, p);

enqueue_task(rq, p, flags);

WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
ASSERT_EXCLUSIVE_WRITER(p->on_rq);
}

| ttwu_do_activate+0x153/0x3e0 kernel/sched/core.c:3671
| ttwu_queue kernel/sched/core.c:3944 [inline]
| try_to_wake_up+0x60f/0xaf0 kernel/sched/core.c:4270
| default_wake_function+0x25/0x30 kernel/sched/core.c:7009
| __pollwake fs/select.c:205 [inline]
| pollwake+0xc0/0x100 fs/select.c:215
| __wake_up_common kernel/sched/wait.c:89 [inline]
| __wake_up_common_lock kernel/sched/wait.c:106 [inline]
| __wake_up_sync_key+0x85/0xc0 kernel/sched/wait.c:173
| sock_def_readable+0x6f/0x180 net/core/sock.c:3442
| tcp_data_ready+0x194/0x230 net/ipv4/tcp_input.c:5193
| tcp_data_queue+0x1052/0x2710 net/ipv4/tcp_input.c:5283
| tcp_rcv_established+0x7e3/0xd60 net/ipv4/tcp_input.c:6237
| tcp_v4_do_rcv+0x545/0x600 net/ipv4/tcp_ipv4.c:1915
| tcp_v4_rcv+0x159c/0x1890 net/ipv4/tcp_ipv4.c:2350
| ip_protocol_deliver_rcu+0x2d8/0x620 net/ipv4/ip_input.c:205
| ip_local_deliver_finish+0x11a/0x150 net/ipv4/ip_input.c:233
| NF_HOOK include/linux/netfilter.h:314 [inline]
| ip_local_deliver+0xce/0x1a0 net/ipv4/ip_input.c:254
| dst_input include/net/dst.h:460 [inline]
| ip_sublist_rcv_finish net/ipv4/ip_input.c:580 [inline]
| ip_list_rcv_finish net/ipv4/ip_input.c:630 [inline]
| ip_sublist_rcv+0x43d/0x520 net/ipv4/ip_input.c:638
| ip_list_rcv+0x262/0x2a0 net/ipv4/ip_input.c:672
| __netif_receive_skb_list_ptype net/core/dev.c:5709 [inline]
| __netif_receive_skb_list_core+0x4fc/0x520 net/core/dev.c:5756
| __netif_receive_skb_list net/core/dev.c:5808 [inline]
| netif_receive_skb_list_internal+0x46d/0x5e0 net/core/dev.c:5899
| gro_normal_list include/net/gro.h:515 [inline]
| napi_complete_done+0x161/0x3a0 net/core/dev.c:6250
| e1000_clean+0x7c7/0x1a70 drivers/net/ethernet/intel/e1000/e1000_main.c:3808
| __napi_poll+0x66/0x360 net/core/dev.c:6775
| napi_poll net/core/dev.c:6844 [inline]
| net_rx_action+0x3d9/0x820 net/core/dev.c:6966
| handle_softirqs+0xe6/0x2d0 kernel/softirq.c:554
| __do_softirq kernel/softirq.c:588 [inline]
| invoke_softirq kernel/softirq.c:428 [inline]
| __irq_exit_rcu+0x45/0xc0 kernel/softirq.c:637
| common_interrupt+0x4f/0xc0 arch/x86/kernel/irq.c:278
| asm_common_interrupt+0x26/0x40 arch/x86/include/asm/idtentry.h:693
|
| assert no writes to 0xffff9e100329c628 of 4 bytes by task 10571 on cpu 3:
| __block_task kernel/sched/sched.h:2770 [inline]

And that's:

static inline void __block_task(struct rq *rq, struct task_struct *p)
{
WRITE_ONCE(p->on_rq, 0);
ASSERT_EXCLUSIVE_WRITER(p->on_rq);
if (p->sched_contributes_to_load)
rq->nr_uninterruptible++;

| dequeue_entities+0xd83/0xe70 kernel/sched/fair.c:7177
| pick_next_entity kernel/sched/fair.c:5627 [inline]
| pick_task_fair kernel/sched/fair.c:8856 [inline]
| pick_next_task_fair+0xaf/0x710 kernel/sched/fair.c:8876
| __pick_next_task kernel/sched/core.c:5955 [inline]
| pick_next_task kernel/sched/core.c:6477 [inline]
| __schedule+0x47a/0x1130 kernel/sched/core.c:6629
| __schedule_loop kernel/sched/core.c:6752 [inline]
| schedule+0x7b/0x130 kernel/sched/core.c:6767
| do_nanosleep+0xdb/0x310 kernel/time/hrtimer.c:2032
| hrtimer_nanosleep+0xa0/0x180 kernel/time/hrtimer.c:2080
| common_nsleep+0x52/0x70 kernel/time/posix-timers.c:1365
| __do_sys_clock_nanosleep kernel/time/posix-timers.c:1411 [inline]
| __se_sys_clock_nanosleep+0x1b2/0x1f0 kernel/time/posix-timers.c:1388
| __x64_sys_clock_nanosleep+0x55/0x70 kernel/time/posix-timers.c:1388
| x64_sys_call+0x2612/0x2f00 arch/x86/include/generated/asm/syscalls_64.h:231
| do_syscall_x64 arch/x86/entry/common.c:52 [inline]
| do_syscall_64+0xd0/0x1a0 arch/x86/entry/common.c:83
| entry_SYSCALL_64_after_hwframe+0x77/0x7f
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 3 UID: 0 PID: 10571 Comm: syz.3.1083 Not tainted 6.12.0-rc2-00003-g44423ac48780-dirty #7
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
| ==================================================================

Paul E. McKenney

Oct 22, 2024, 10:31:32 AM
to Marco Elver, Peter Zijlstra, Alexander Potapenko, syzbot, au...@vger.kernel.org, epa...@redhat.com, linux-...@vger.kernel.org, pa...@paul-moore.com, syzkall...@googlegroups.com
Just confirming that this was my intent.

And for all I know, maybe it is now OK to have concurrent writers to
that variable, but if so, would we please have an explanatory comment
(or a reference to one)?

Thanx, Paul

Peter Zijlstra

Oct 22, 2024, 3:12:51 PM
to Marco Elver, pau...@kernel.org, Alexander Potapenko, syzbot, au...@vger.kernel.org, epa...@redhat.com, linux-...@vger.kernel.org, pa...@paul-moore.com, syzkall...@googlegroups.com
On Tue, Oct 22, 2024 at 03:40:52PM +0200, Marco Elver wrote:

> Which gives us:
>
> | ==================================================================
> | BUG: KCSAN: assert: race in dequeue_entities / ttwu_do_activate
> |
> | write (marked) to 0xffff9e100329c628 of 4 bytes by interrupt on cpu 0:
> | activate_task kernel/sched/core.c:2064 [inline]
>
> This is this one:
>
> void activate_task(struct rq *rq, struct task_struct *p, int flags)
> {
> if (task_on_rq_migrating(p))
> flags |= ENQUEUE_MIGRATED;
> if (flags & ENQUEUE_MIGRATED)
> sched_mm_cid_migrate_to(rq, p);
>
> enqueue_task(rq, p, flags);
>
> WRITE_ONCE(p->on_rq, TASK_ON_RQ_QUEUED);
> ASSERT_EXCLUSIVE_WRITER(p->on_rq);
> }
>
> | ttwu_do_activate+0x153/0x3e0 kernel/sched/core.c:3671
> | ttwu_queue kernel/sched/core.c:3944 [inline]
> | try_to_wake_up+0x60f/0xaf0 kernel/sched/core.c:4270

> | assert no writes to 0xffff9e100329c628 of 4 bytes by task 10571 on cpu 3:
> | __block_task kernel/sched/sched.h:2770 [inline]
>
> And that's:
>
> static inline void __block_task(struct rq *rq, struct task_struct *p)
> {
> WRITE_ONCE(p->on_rq, 0);
> ASSERT_EXCLUSIVE_WRITER(p->on_rq);
> if (p->sched_contributes_to_load)
> rq->nr_uninterruptible++;
>
> | dequeue_entities+0xd83/0xe70 kernel/sched/fair.c:7177
> | pick_next_entity kernel/sched/fair.c:5627 [inline]
> | pick_task_fair kernel/sched/fair.c:8856 [inline]
> | pick_next_task_fair+0xaf/0x710 kernel/sched/fair.c:8876
> | __pick_next_task kernel/sched/core.c:5955 [inline]
> | pick_next_task kernel/sched/core.c:6477 [inline]
> | __schedule+0x47a/0x1130 kernel/sched/core.c:6629
> | __schedule_loop kernel/sched/core.c:6752 [inline]
> | schedule+0x7b/0x130 kernel/sched/core.c:6767


So KCSAN is trying to tell me these two paths run concurrently on the
same 'p'?!? That would be a horrible bug -- both these call chains
should be holding rq->__lock (for task_rq(p)).

Marco Elver

Oct 22, 2024, 3:58:08 PM
to Peter Zijlstra, pau...@kernel.org, Alexander Potapenko, syzbot, au...@vger.kernel.org, epa...@redhat.com, linux-...@vger.kernel.org, pa...@paul-moore.com, syzkall...@googlegroups.com
Yes, correct.

And just to confirm this is no false positive, the way KCSAN works
_requires_ the race to actually happen before it reports anything;
this can also be seen in Alexander's report with just 1 stack trace
where it saw the value transition from 0 to 1 (TASK_ON_RQ_QUEUED) but
didn't know who did the write because kernel/sched was uninstrumented.

Marco Elver

Oct 23, 2024, 4:54:22 AM
to Peter Zijlstra, pau...@kernel.org, Alexander Potapenko, syzbot, au...@vger.kernel.org, epa...@redhat.com, linux-...@vger.kernel.org, pa...@paul-moore.com, syzkall...@googlegroups.com
On Tue, Oct 22, 2024 at 09:57PM +0200, Marco Elver wrote:
> On Tue, 22 Oct 2024 at 21:12, Peter Zijlstra <pet...@infradead.org> wrote:
[...]
> > So KCSAn is trying to tell me these two paths run concurrently on the
> > same 'p' ?!? That would be a horrible bug -- both these call chains
> > should be holding rq->__lock (for task_rq(p)).
>
> Yes correct.
>
> And just to confirm this is no false positive, the way KCSAN works
> _requires_ the race to actually happen before it reports anything;
> this can also be seen in Alexander's report with just 1 stack trace
> where it saw the value transition from 0 to 1 (TASK_ON_RQ_QUEUED) but
> didn't know who did the write because kernel/sched was uninstrumented.

Got another version of the splat with CONFIG_KCSAN_VERBOSE=y. Lockdep seems to
think that both threads here are holding rq->__lock.

| ==================================================================
| BUG: KCSAN: assert: race in activate_task / dequeue_entities
|
| write (marked) to 0xffffa2e506240068 of 4 bytes by interrupt on cpu 0:
| activate_task+0xfb/0x130 kernel/sched/core.c:2064
| ttwu_do_activate+0xee/0x2b0 kernel/sched/core.c:3671
| ttwu_queue kernel/sched/core.c:3944 [inline]
| try_to_wake_up+0x509/0x930 kernel/sched/core.c:4270
| default_wake_function+0x25/0x30 kernel/sched/core.c:7009
| __pollwake fs/select.c:205 [inline]
| pollwake+0xc0/0x100 fs/select.c:215
| __wake_up_common kernel/sched/wait.c:89 [inline]
| __wake_up_common_lock+0xc0/0x110 kernel/sched/wait.c:106
| __wake_up_sync_key+0x1b/0x30 kernel/sched/wait.c:173
| sock_def_readable+0x19a/0x460 net/core/sock.c:3442
| tcp_data_ready+0x194/0x230 net/ipv4/tcp_input.c:5193
| tcp_data_queue+0xfd8/0x26b0 net/ipv4/tcp_input.c:5283
| tcp_rcv_established+0x8e5/0xeb0 net/ipv4/tcp_input.c:6237
| tcp_v4_do_rcv+0x606/0x620 net/ipv4/tcp_ipv4.c:1915
| tcp_v4_rcv+0x16e9/0x1890 net/ipv4/tcp_ipv4.c:2350
| ip_protocol_deliver_rcu+0x382/0x6a0 net/ipv4/ip_input.c:205
| ip_local_deliver_finish+0x1bf/0x290 net/ipv4/ip_input.c:233
| NF_HOOK+0x274/0x2f0 include/linux/netfilter.h:314
| ip_local_deliver net/ipv4/ip_input.c:254 [inline]
| dst_input include/net/dst.h:460 [inline]
| ip_sublist_rcv_finish+0x22a/0x2c0 net/ipv4/ip_input.c:580
| ip_list_rcv_finish net/ipv4/ip_input.c:630 [inline]
| ip_sublist_rcv+0x2e3/0x560 net/ipv4/ip_input.c:638
| ip_list_rcv+0x275/0x2b0 net/ipv4/ip_input.c:672
| __netif_receive_skb_list_ptype net/core/dev.c:5709 [inline]
| __netif_receive_skb_list_core+0x4fc/0x520 net/core/dev.c:5756
| __netif_receive_skb_list net/core/dev.c:5808 [inline]
| netif_receive_skb_list_internal+0x516/0x720 net/core/dev.c:5899
| gro_normal_list include/net/gro.h:515 [inline]
| napi_complete_done+0x161/0x3c0 net/core/dev.c:6250
| e1000_clean+0x7c7/0x1a70 drivers/net/ethernet/intel/e1000/e1000_main.c:3808
| __napi_poll+0x6c/0x2c0 net/core/dev.c:6775
| napi_poll net/core/dev.c:6844 [inline]
| net_rx_action+0x433/0x8e0 net/core/dev.c:6966
| handle_softirqs+0x15c/0x450 kernel/softirq.c:554
| __do_softirq kernel/softirq.c:588 [inline]
| invoke_softirq kernel/softirq.c:428 [inline]
| __irq_exit_rcu+0x72/0x110 kernel/softirq.c:637
| irq_exit_rcu+0x9/0x30 kernel/softirq.c:649
| common_interrupt+0xa7/0xc0 arch/x86/kernel/irq.c:278
| asm_common_interrupt+0x26/0x40 arch/x86/include/asm/idtentry.h:693
| finish_task_switch+0x11c/0x3e0 kernel/sched/core.c:5189
| context_switch kernel/sched/core.c:5318 [inline]
| __schedule+0xf23/0x14a0 kernel/sched/core.c:6675
| preempt_schedule_irq+0x9a/0xf0 kernel/sched/core.c:6997
| raw_irqentry_exit_cond_resched+0x52/0x60 kernel/entry/common.c:311
| irqentry_exit+0x6a/0xa0 kernel/entry/common.c:354
| sysvec_apic_timer_interrupt+0x50/0xb0 arch/x86/kernel/apic/apic.c:1037
| asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:702
| __seqprop_sequence include/linux/seqlock.h:211 [inline]
| get_counters net/ipv6/netfilter/ip6_tables.c:774 [inline]
| alloc_counters net/ipv6/netfilter/ip6_tables.c:820 [inline]
| copy_entries_to_user net/ipv6/netfilter/ip6_tables.c:837 [inline]
| get_entries net/ipv6/netfilter/ip6_tables.c:1039 [inline]
| do_ip6t_get_ctl+0x665/0x9d0 net/ipv6/netfilter/ip6_tables.c:1677
| nf_getsockopt+0x14e/0x170 net/netfilter/nf_sockopt.c:116
| ipv6_getsockopt+0x13c/0x1b0 net/ipv6/ipv6_sockglue.c:1493
| tcp_getsockopt+0x9e/0xd0 net/ipv4/tcp.c:4670
| sock_common_getsockopt+0x5d/0x70 net/core/sock.c:3776
| do_sock_getsockopt+0x18f/0x220 net/socket.c:2391
| __sys_getsockopt+0xf1/0x170 net/socket.c:2420
| __do_sys_getsockopt net/socket.c:2430 [inline]
| __se_sys_getsockopt net/socket.c:2427 [inline]
| __x64_sys_getsockopt+0x64/0x80 net/socket.c:2427
| x64_sys_call+0x120e/0x2f00 arch/x86/include/generated/asm/syscalls_64.h:56
| do_syscall_x64 arch/x86/entry/common.c:52 [inline]
| do_syscall_64+0xec/0x1d0 arch/x86/entry/common.c:83
| entry_SYSCALL_64_after_hwframe+0x77/0x7f
|
| 8 locks held by syz-executor/24825:
| #0: ffffa2e5031a2f48 (&xt[i].mutex){+.+.}-{3:3}, at: xt_find_table_lock+0x5e/0x230 net/netfilter/x_tables.c:1243
| #1: ffffffff9a357d18 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
| #1: ffffffff9a357d18 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
| #1: ffffffff9a357d18 (rcu_read_lock){....}-{1:2}, at: netif_receive_skb_list_internal+0x24e/0x720 net/core/dev.c:5884
| #2: ffffffff9a357d18 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
| #2: ffffffff9a357d18 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
| #2: ffffffff9a357d18 (rcu_read_lock){....}-{1:2}, at: ip_local_deliver_finish+0xe1/0x290 net/ipv4/ip_input.c:232
| #3: ffffa2e5072201d8 (slock-AF_INET/1){+.-.}-{2:2}, at: tcp_v4_rcv+0x1677/0x1890 net/ipv4/tcp_ipv4.c:2346
| #4: ffffffff9a357d18 (rcu_read_lock){....}-{1:2}, at: rcu_lock_acquire include/linux/rcupdate.h:337 [inline]
| #4: ffffffff9a357d18 (rcu_read_lock){....}-{1:2}, at: rcu_read_lock include/linux/rcupdate.h:849 [inline]
| #4: ffffffff9a357d18 (rcu_read_lock){....}-{1:2}, at: sock_def_readable+0x91/0x460 net/core/sock.c:3439
| #5: ffffa2e503a65858 (&ei->socket.wq.wait){..-.}-{2:2}, at: __wake_up_common_lock+0x26/0x110 kernel/sched/wait.c:105
| #6: ffffa2e506240930 (&p->pi_lock){-.-.}-{2:2}, at: class_raw_spinlock_irqsave_constructor include/linux/spinlock.h:551 [inline]
| #6: ffffa2e506240930 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x50/0x930 kernel/sched/core.c:4154
| #7: ffffa2e57dc2f398 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested kernel/sched/core.c:593 [inline]
| #7: ffffa2e57dc2f398 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock kernel/sched/sched.h:1505 [inline]
| #7: ffffa2e57dc2f398 (&rq->__lock){-.-.}-{2:2}, at: rq_lock kernel/sched/sched.h:1804 [inline]
| #7: ffffa2e57dc2f398 (&rq->__lock){-.-.}-{2:2}, at: ttwu_queue kernel/sched/core.c:3942 [inline]
| #7: ffffa2e57dc2f398 (&rq->__lock){-.-.}-{2:2}, at: try_to_wake_up+0x4ce/0x930 kernel/sched/core.c:4270
| irq event stamp: 89470
| hardirqs last enabled at (89469): [<ffffffff978623b3>] seqcount_lockdep_reader_access include/linux/seqlock.h:74 [inline]
| hardirqs last enabled at (89469): [<ffffffff978623b3>] ktime_get+0xf3/0x2c0 kernel/time/timekeeping.c:848
| hardirqs last disabled at (89470): [<ffffffff9971605e>] __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline]
| hardirqs last disabled at (89470): [<ffffffff9971605e>] _raw_spin_lock_irqsave+0x4e/0xb0 kernel/locking/spinlock.c:162
| softirqs last enabled at (89413): [<ffffffff98f80605>] spin_unlock_bh include/linux/spinlock.h:396 [inline]
| softirqs last enabled at (89413): [<ffffffff98f80605>] release_sock+0x105/0x130 net/core/sock.c:3635
| softirqs last disabled at (89462): [<ffffffff976fb3f2>] __do_softirq kernel/softirq.c:588 [inline]
| softirqs last disabled at (89462): [<ffffffff976fb3f2>] invoke_softirq kernel/softirq.c:428 [inline]
| softirqs last disabled at (89462): [<ffffffff976fb3f2>] __irq_exit_rcu+0x72/0x110 kernel/softirq.c:637
|
| assert no writes to 0xffffa2e506240068 of 4 bytes by task 16 on cpu 2:
| __block_task kernel/sched/sched.h:2770 [inline]
| dequeue_entities+0xde4/0xed0 kernel/sched/fair.c:7177
| pick_next_entity kernel/sched/fair.c:5627 [inline]
| pick_task_fair kernel/sched/fair.c:8856 [inline]
| pick_next_task_fair+0xaf/0x820 kernel/sched/fair.c:8876
| __pick_next_task kernel/sched/core.c:5955 [inline]
| pick_next_task kernel/sched/core.c:6477 [inline]
| __schedule+0x5e0/0x14a0 kernel/sched/core.c:6629
| __schedule_loop kernel/sched/core.c:6752 [inline]
| schedule+0xe7/0x1a0 kernel/sched/core.c:6767
| schedule_timeout+0xaf/0x160 kernel/time/timer.c:2615
| rcu_gp_fqs_loop+0x2d8/0xd40 kernel/rcu/tree.c:2045
| rcu_gp_kthread+0x28/0x2f0 kernel/rcu/tree.c:2247
| kthread+0x1dd/0x220 kernel/kthread.c:389
| ret_from_fork+0x4b/0x60 arch/x86/kernel/process.c:147
| ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
|
| 1 lock held by rcu_preempt/16:
| #0: ffffa2e57dd2f398 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock_nested kernel/sched/core.c:593 [inline]
| #0: ffffa2e57dd2f398 (&rq->__lock){-.-.}-{2:2}, at: raw_spin_rq_lock kernel/sched/sched.h:1505 [inline]
| #0: ffffa2e57dd2f398 (&rq->__lock){-.-.}-{2:2}, at: rq_lock kernel/sched/sched.h:1804 [inline]
| #0: ffffa2e57dd2f398 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x1f9/0x14a0 kernel/sched/core.c:6575
| irq event stamp: 791056
| hardirqs last enabled at (791055): [<ffffffff99716270>] __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:151 [inline]
| hardirqs last enabled at (791055): [<ffffffff99716270>] _raw_spin_unlock_irqrestore+0x30/0x60 kernel/locking/spinlock.c:194
| hardirqs last disabled at (791056): [<ffffffff99706ad7>] __schedule+0x1d7/0x14a0 kernel/sched/core.c:6555
| softirqs last enabled at (788540): [<ffffffff976fb3f2>] __do_softirq kernel/softirq.c:588 [inline]
| softirqs last enabled at (788540): [<ffffffff976fb3f2>] invoke_softirq kernel/softirq.c:428 [inline]
| softirqs last enabled at (788540): [<ffffffff976fb3f2>] __irq_exit_rcu+0x72/0x110 kernel/softirq.c:637
| softirqs last disabled at (788529): [<ffffffff976fb3f2>] __do_softirq kernel/softirq.c:588 [inline]
| softirqs last disabled at (788529): [<ffffffff976fb3f2>] invoke_softirq kernel/softirq.c:428 [inline]
| softirqs last disabled at (788529): [<ffffffff976fb3f2>] __irq_exit_rcu+0x72/0x110 kernel/softirq.c:637
|
| Reported by Kernel Concurrency Sanitizer on:
| CPU: 2 UID: 0 PID: 16 Comm: rcu_preempt Not tainted 6.12.0-rc2-00003-g44423ac48780-dirty #8

Marco Elver

Oct 23, 2024, 5:03:51 AM
to Peter Zijlstra, pau...@kernel.org, Alexander Potapenko, syzbot, au...@vger.kernel.org, epa...@redhat.com, linux-...@vger.kernel.org, pa...@paul-moore.com, syzkall...@googlegroups.com
On Wed, 23 Oct 2024 at 10:54, Marco Elver <el...@google.com> wrote:
>
> On Tue, Oct 22, 2024 at 09:57PM +0200, Marco Elver wrote:
> > On Tue, 22 Oct 2024 at 21:12, Peter Zijlstra <pet...@infradead.org> wrote:
> [...]
> > > So KCSAn is trying to tell me these two paths run concurrently on the
> > > same 'p' ?!? That would be a horrible bug -- both these call chains
> > > should be holding rq->__lock (for task_rq(p)).
> >
> > Yes correct.
> >
> > And just to confirm this is no false positive, the way KCSAN works
> > _requires_ the race to actually happen before it reports anything;
> > this can also be seen in Alexander's report with just 1 stack trace
> > where it saw the value transition from 0 to 1 (TASK_ON_RQ_QUEUED) but
> > didn't know who did the write because kernel/sched was uninstrumented.
>
> Got another version of the splat with CONFIG_KCSAN_VERBOSE=y. Lockdep seems to
> think that both threads here are holding rq->__lock.

Gotta read more carefully: one instance is ffffa2e57dc2f398, the other
is ffffa2e57dd2f398. If I read it right, they're not actually the same
lock.

Peter Zijlstra

Oct 23, 2024, 5:36:47 AM
to Marco Elver, pau...@kernel.org, Alexander Potapenko, syzbot, au...@vger.kernel.org, epa...@redhat.com, linux-...@vger.kernel.org, pa...@paul-moore.com, syzkall...@googlegroups.com, kent.ov...@linux.dev
On Wed, Oct 23, 2024 at 11:03:11AM +0200, Marco Elver wrote:
> On Wed, 23 Oct 2024 at 10:54, Marco Elver <el...@google.com> wrote:
> >
> > On Tue, Oct 22, 2024 at 09:57PM +0200, Marco Elver wrote:
> > > On Tue, 22 Oct 2024 at 21:12, Peter Zijlstra <pet...@infradead.org> wrote:
> > [...]
> >
> > Got another version of the splat with CONFIG_KCSAN_VERBOSE=y. Lockdep seems to
> > think that both threads here are holding rq->__lock.
>
> Gotta read more carefully, one instance is ffffa2e57dc2f398 another is
> ffffa2e57dd2f398. If I read it right, then they're not actually the
> same lock.

Yeah, as explained in the diagram below, the moment the ->on_rq = 0
store goes through, we no longer own the task. And since
ASSERT_EXCLUSIVE_WRITER is after that, we go splat.

The below patch changes this order and switches to using
smp_store_release() and ensures to not reference the task after it.

I've boot tested it, but not much else.

Could you please give this a go (on top of -rc3)?

This also explains the SCHED_WARN_ON() Kent saw, which is subject to
the same race.

---
kernel/sched/fair.c | 21 ++++++++++++++-------
kernel/sched/sched.h | 34 ++++++++++++++++++++++++++++++++--
2 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6512258dc71f..8edac978edb2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5625,8 +5625,9 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
struct sched_entity *se = pick_eevdf(cfs_rq);
if (se->sched_delayed) {
dequeue_entities(rq, se, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
- SCHED_WARN_ON(se->sched_delayed);
- SCHED_WARN_ON(se->on_rq);
+ /*
+ * Must not reference @se again, see __block_task().
+ */
return NULL;
}
return se;
@@ -7170,7 +7171,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
/* Fix-up what dequeue_task_fair() skipped */
hrtick_update(rq);

- /* Fix-up what block_task() skipped. */
+ /*
+ * Fix-up what block_task() skipped.
+ *
+ * Must be last, @p might not be valid after this.
+ */
__block_task(rq, p);
}

@@ -7187,12 +7192,14 @@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
util_est_dequeue(&rq->cfs, p);

- if (dequeue_entities(rq, &p->se, flags) < 0) {
- util_est_update(&rq->cfs, p, DEQUEUE_SLEEP);
+ util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
+ if (dequeue_entities(rq, &p->se, flags) < 0)
return false;
- }

- util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
+ /*
+ * Must not reference @p after dequeue_entities(DEQUEUE_DELAYED).
+ */
+
hrtick_update(rq);
return true;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7b139016cbd9..32e9c41b7ec0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2791,8 +2791,6 @@ static inline void sub_nr_running(struct rq *rq, unsigned count)

static inline void __block_task(struct rq *rq, struct task_struct *p)
{
- WRITE_ONCE(p->on_rq, 0);
- ASSERT_EXCLUSIVE_WRITER(p->on_rq);
if (p->sched_contributes_to_load)
rq->nr_uninterruptible++;

@@ -2800,6 +2798,38 @@ static inline void __block_task(struct rq *rq, struct task_struct *p)
atomic_inc(&rq->nr_iowait);
delayacct_blkio_start();
}
+
+ ASSERT_EXCLUSIVE_WRITER(p->on_rq);
+
+ /*
+ * The moment this write goes through, ttwu() can swoop in and migrate
+ * this task, rendering our rq->__lock ineffective.
+ *
+ * __schedule() try_to_wake_up()
+ * LOCK rq->__lock LOCK p->pi_lock
+ * pick_next_task()
+ * pick_next_task_fair()
+ * pick_next_entity()
+ * dequeue_entities()
+ * __block_task()
+ * RELEASE p->on_rq = 0; if (p->on_rq && ...)
+ * break;
+ *
+ * ACQUIRE (after ctrl-dep)
+ *
+ * cpu = select_task_rq();
+ * set_task_cpu(p, cpu);
+ * ttwu_queue()
+ * ttwu_do_activate()
+ * LOCK rq->__lock
+ * activate_task()
+ * STORE p->on_rq = 1
+ * UNLOCK rq->__lock
+ *
+ * Callers must ensure to not reference @p after this -- we no longer
+ * own it.
+ */
+ smp_store_release(&p->on_rq, 0);
}

extern void activate_task(struct rq *rq, struct task_struct *p, int flags);

Marco Elver

Oct 23, 2024, 9:19:35 AM
to Peter Zijlstra, pau...@kernel.org, Alexander Potapenko, syzbot, au...@vger.kernel.org, epa...@redhat.com, linux-...@vger.kernel.org, pa...@paul-moore.com, syzkall...@googlegroups.com, kent.ov...@linux.dev
[...]

Tested-by: Marco Elver <el...@google.com>

Previously syzkaller would give us that report within ~1h of fuzzing.
Have been fuzzing with your patch applied for 3h now, and this report
has not resurfaced.

Paul E. McKenney

Oct 23, 2024, 10:13:32 AM
to Marco Elver, Peter Zijlstra, Alexander Potapenko, syzbot, au...@vger.kernel.org, epa...@redhat.com, linux-...@vger.kernel.org, pa...@paul-moore.com, syzkall...@googlegroups.com
Except that it looks to me that Peter added this one, not me. ;-)