KCSAN: data-race in exit_signals / prepare_signal


syzbot

Oct 21, 2019, 6:34:08 AM
to ak...@linux-foundation.org, ar...@arndb.de, chri...@brauner.io, deepa....@gmail.com, ebie...@xmission.com, el...@google.com, gu...@fb.com, linux-...@vger.kernel.org, ol...@redhat.com, syzkall...@googlegroups.com
Hello,

syzbot found the following crash on:

HEAD commit: d724f94f x86, kcsan: Enable KCSAN for x86
git tree: https://github.com/google/ktsan.git kcsan
console output: https://syzkaller.appspot.com/x/log.txt?x=13eab79f600000
kernel config: https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80
dashboard link: https://syzkaller.appspot.com/bug?extid=492a4acccd8fc75ddfd0
compiler: gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+492a4a...@syzkaller.appspotmail.com

==================================================================
BUG: KCSAN: data-race in exit_signals / prepare_signal

read to 0xffff888103566064 of 4 bytes by interrupt on cpu 0:
sig_task_ignored kernel/signal.c:94 [inline]
sig_ignored kernel/signal.c:119 [inline]
prepare_signal+0x1f5/0x790 kernel/signal.c:956
send_sigqueue+0xc1/0x4b0 kernel/signal.c:1859
posix_timer_event kernel/time/posix-timers.c:328 [inline]
posix_timer_fn+0x10d/0x230 kernel/time/posix-timers.c:354
__run_hrtimer kernel/time/hrtimer.c:1389 [inline]
__hrtimer_run_queues+0x288/0x600 kernel/time/hrtimer.c:1451
hrtimer_interrupt+0x22a/0x480 kernel/time/hrtimer.c:1509
local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1110 [inline]
smp_apic_timer_interrupt+0xdc/0x280 arch/x86/kernel/apic/apic.c:1135
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
arch_local_irq_enable arch/x86/include/asm/paravirt.h:778 [inline]
__raw_spin_unlock_irq include/linux/spinlock_api_smp.h:168 [inline]
_raw_spin_unlock_irq+0x4e/0x80 kernel/locking/spinlock.c:199
spin_unlock_irq include/linux/spinlock.h:388 [inline]
get_signal+0x1f4/0x1320 kernel/signal.c:2707
do_signal+0x3b/0xc00 arch/x86/kernel/signal.c:815
exit_to_usermode_loop+0x250/0x2c0 arch/x86/entry/common.c:159
prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
do_syscall_64+0x2d7/0x2f0 arch/x86/entry/common.c:299
entry_SYSCALL_64_after_hwframe+0x44/0xa9

write to 0xffff888103566064 of 4 bytes by task 7604 on cpu 1:
exit_signals+0x13b/0x490 kernel/signal.c:2822
do_exit+0x1af/0x18e0 kernel/exit.c:825
do_group_exit+0xb4/0x1c0 kernel/exit.c:983
__do_sys_exit_group kernel/exit.c:994 [inline]
__se_sys_exit_group kernel/exit.c:992 [inline]
__x64_sys_exit_group+0x2e/0x30 kernel/exit.c:992
do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 7604 Comm: syz-executor.4 Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

Christian Brauner

Oct 21, 2019, 7:19:33 AM
to syzbot, ol...@redhat.com, ak...@linux-foundation.org, ar...@arndb.de, chri...@brauner.io, deepa....@gmail.com, ebie...@xmission.com, el...@google.com, gu...@fb.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com, wi...@kernel.org
[+Cc Will]

On Mon, Oct 21, 2019 at 03:34:07AM -0700, syzbot wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit: d724f94f x86, kcsan: Enable KCSAN for x86
> git tree: https://github.com/google/ktsan.git kcsan
> console output: https://syzkaller.appspot.com/x/log.txt?x=13eab79f600000
> kernel config: https://syzkaller.appspot.com/x/.config?x=c0906aa620713d80
> dashboard link: https://syzkaller.appspot.com/bug?extid=492a4acccd8fc75ddfd0
> compiler: gcc (GCC) 9.0.0 20181231 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+492a4a...@syzkaller.appspotmail.com
>
> ==================================================================
> BUG: KCSAN: data-race in exit_signals / prepare_signal

This traces back to Oleg fixing a race between a group stop and a thread
exiting before it notices that it has a pending signal or is already in the
middle of do_exit(), which caused group stop to misbehave.
The original commit to fix this race is
commit d12619b5ff56 ("fix group stop with exit race") which took sighand
lock before setting PF_EXITING on the thread.

Later on in
commit 5dee1707dfbf ("move the related code from exit_notify() to exit_signals()")
an improvement was made for the single-threaded case and the
case where the group stop is already in progress. This removed the
sighand lock around the PF_EXITING assignment.

If the race really matters, then given how tsk->flags is currently accessed
everywhere, the simple fix for now might be:

diff --git a/kernel/signal.c b/kernel/signal.c
index c4da1ef56fdf..cf61e044c4cc 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2819,7 +2819,9 @@ void exit_signals(struct task_struct *tsk)
cgroup_threadgroup_change_begin(tsk);

if (thread_group_empty(tsk) || signal_group_exit(tsk->signal)) {
+ spin_lock_irq(&tsk->sighand->siglock);
tsk->flags |= PF_EXITING;
+ spin_unlock_irq(&tsk->sighand->siglock);
cgroup_threadgroup_change_end(tsk);
return;
}

Christian

Oleg Nesterov

Oct 21, 2019, 8:00:40 AM
to Christian Brauner, syzbot, ak...@linux-foundation.org, ar...@arndb.de, chri...@brauner.io, deepa....@gmail.com, ebie...@xmission.com, el...@google.com, gu...@fb.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com, wi...@kernel.org
On 10/21, Christian Brauner wrote:
>
> This traces back to Oleg fixing a race between a group stop and a thread
> exiting before it notices that it has a pending signal or is in the middle of
> do_exit() already, causing group stop to get wacky.
> The original commit to fix this race is
> commit d12619b5ff56 ("fix group stop with exit race") which took sighand
> lock before setting PF_EXITING on the thread.

Not really... sig_task_ignored() didn't check task->flags until the recent
33da8e7c81 ("signal: Allow cifs and drbd to receive their terminating signals").
But I think this doesn't matter, see below.

> If the race really matters and given how tsk->flags is currently accessed
> everywhere the simple fix for now might be:
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index c4da1ef56fdf..cf61e044c4cc 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2819,7 +2819,9 @@ void exit_signals(struct task_struct *tsk)
> cgroup_threadgroup_change_begin(tsk);
>
> if (thread_group_empty(tsk) || signal_group_exit(tsk->signal)) {
> + spin_lock_irq(&tsk->sighand->siglock);
> tsk->flags |= PF_EXITING;
> + spin_unlock_irq(&tsk->sighand->siglock);

Well, exit_signals() tries to avoid ->siglock in this case....

But this doesn't matter. IIUC the problem is not that exit_signals() sets
PF_EXITING, the problem is that it writes to tsk->flags and kcsan detects
the data race.

For example, freezable_schedule() which sets PF_FREEZER_SKIP can equally
"race" with sig_task_ignored() or with ANY other code which checks this
task's flags.
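
Roughly, paraphrasing the freezer helpers (not quoting them exactly),
that path boils down to:

	/* freezable_schedule(), schematically: */
	current->flags |= PF_FREEZER_SKIP;	/* plain write, no siglock */
	schedule();
	current->flags &= ~PF_FREEZER_SKIP;	/* plain RMW again */

so any remote reader of task->flags "races" with it in exactly the
same way.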

I think this is WONTFIX.

Oleg.

Marco Elver

Oct 21, 2019, 8:15:28 AM
to Oleg Nesterov, Christian Brauner, syzbot, Andrew Morton, Arnd Bergmann, chri...@brauner.io, deepa....@gmail.com, ebie...@xmission.com, gu...@fb.com, LKML, syzkall...@googlegroups.com, Will Deacon
If taking the spinlock is unnecessary (which AFAIK it probably is) and
there are no other writers to this flag, you will still need a
WRITE_ONCE(tsk->flags, tsk->flags | PF_EXITING) to avoid the
data-race.

However, if it is possible that there are concurrent writers setting
other bits in flags, you need to ensure that the bits are set
atomically (unless it's ok to lose some bits here). This can only be
done by 1) taking siglock, or 2) via e.g. atomic_or(...); however,
flags is not atomic_t ...
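
For concreteness, a minimal sketch of the marked-write variant in
exit_signals() (a sketch only, assuming tsk == current and that nothing
else writes tsk->flags concurrently):

	if (thread_group_empty(tsk) || signal_group_exit(tsk->signal)) {
		/*
		 * Marked write, so the racing read in sig_task_ignored()
		 * is no longer a plain/plain data race.  The plain read
		 * on the right-hand side is fine under the assumption
		 * that only the task itself writes its ->flags.
		 */
		WRITE_ONCE(tsk->flags, tsk->flags | PF_EXITING);
		cgroup_threadgroup_change_end(tsk);
		return;
	}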

-- Marco

Christian Brauner

Oct 21, 2019, 8:51:14 AM
to Oleg Nesterov, syzbot, ak...@linux-foundation.org, ar...@arndb.de, chri...@brauner.io, deepa....@gmail.com, ebie...@xmission.com, el...@google.com, gu...@fb.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com, wi...@kernel.org
Right, that's the reason I said "If the race really matters". I thought
that other writers/readers always take sighand lock. So the easy fix
would have been to take sighand lock too.
The alternative is that we need to fiddle with task_struct itself and
replace flags with an atomic_t or something similar, which is more invasive
and probably more controversial.

Christian

Oleg Nesterov

Oct 21, 2019, 9:47:09 AM
to Marco Elver, Christian Brauner, syzbot, Andrew Morton, Arnd Bergmann, chri...@brauner.io, deepa....@gmail.com, ebie...@xmission.com, gu...@fb.com, LKML, syzkall...@googlegroups.com, Will Deacon
On 10/21, Marco Elver wrote:
>
> On Mon, 21 Oct 2019 at 14:00, Oleg Nesterov <ol...@redhat.com> wrote:
> >
> > I think this is WONTFIX.
>
> If taking the spinlock is unnecessary (which AFAIK it probably is) and
> there are no other writers to this flag, you will still need a
> WRITE_ONCE(tsk->flags, tsk->flags | PF_EXITING) to avoid the
> data-race.

Or even WRITE_ONCE(tsk->flags, READ_ONCE(tsk->flags) | PF_EXITING) in
theory. But in practice, I do not think the compiler can turn

	current->flags |= PF_EXITING;

into something which temporarily clears another flag, say, PF_KTHREAD.
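
The (purely hypothetical) transformation being ruled out would be
something along these lines; it is allowed for plain accesses in
theory, but no real compiler is known to emit it:

	/* "current->flags |= PF_EXITING;" rewritten by a hypothetical compiler: */
	unsigned int tmp = current->flags;
	current->flags = PF_EXITING;		/* other flags momentarily gone */
	current->flags = tmp | PF_EXITING;	/* restored by the final store  */

A reader on another CPU could then observe the intermediate value with,
say, PF_KTHREAD cleared.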

> However, if it is possible that there are concurrent writers setting
> other bits in flags,

No, only the current task should change its ->flags.

Oleg.

Oleg Nesterov

Oct 21, 2019, 10:21:25 AM
to Tejun Heo, Li Zefan, Johannes Weiner, ak...@linux-foundation.org, ar...@arndb.de, chri...@brauner.io, deepa....@gmail.com, ebie...@xmission.com, el...@google.com, gu...@fb.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com
could you explain the usage of siglock/PF_EXITING in
cgroup_enable_task_cg_lists() ?

PF_EXITING is protected by cgroup_threadgroup_rwsem, not by
sighand->siglock.

Oleg.

Tejun Heo

Oct 24, 2019, 1:54:31 PM
to Oleg Nesterov, Li Zefan, Johannes Weiner, ak...@linux-foundation.org, ar...@arndb.de, chri...@brauner.io, deepa....@gmail.com, ebie...@xmission.com, el...@google.com, gu...@fb.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com
Hello, Oleg.
Yeah, the optimization was added a really long time ago and I'm not
sure it was ever correct. I'm removing it. If this ever becomes a
problem (pretty unlikely), I think the right thing to do is adding a
boot param instead of trying to do this dynamically.

Thanks.

--
tejun

Tejun Heo

Oct 24, 2019, 3:03:55 PM
to Oleg Nesterov, Li Zefan, Johannes Weiner, ak...@linux-foundation.org, ar...@arndb.de, chri...@brauner.io, deepa....@gmail.com, ebie...@xmission.com, el...@google.com, gu...@fb.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com, cgr...@vger.kernel.org, kerne...@fb.com
cgroup_enable_task_cg_lists() is used to lazily initialize task
cgroup associations on the first use to reduce fork / exit overheads
on systems which don't use cgroup. Unfortunately, locking around it
has never been actually correct and its value is dubious given how the
vast majority of systems use cgroup right away from boot.

This patch removes the optimization. For now, replace the cg_list
based branches with WARN_ON_ONCE()'s to be on the safe side. We can
simplify the logic further in the future.

Signed-off-by: Tejun Heo <t...@kernel.org>
Reported-by: Oleg Nesterov <ol...@redhat.com>
---
include/linux/cgroup.h | 1
kernel/cgroup/cgroup.c | 184 ++++++++++---------------------------------------
kernel/cgroup/cpuset.c | 2
3 files changed, 39 insertions(+), 148 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3ba3e6da13a6..f6b048902d6c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -150,7 +150,6 @@ struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
struct cgroup_subsys_state **dst_cssp);

-void cgroup_enable_task_cg_lists(void);
void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
struct css_task_iter *it);
struct task_struct *css_task_iter_next(struct css_task_iter *it);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 8b1c4fd47a7a..cf32c0c7a45d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1883,65 +1883,6 @@ static int cgroup_reconfigure(struct fs_context *fc)
return 0;
}

-/*
- * To reduce the fork() overhead for systems that are not actually using
- * their cgroups capability, we don't maintain the lists running through
- * each css_set to its tasks until we see the list actually used - in other
- * words after the first mount.
- */
-static bool use_task_css_set_links __read_mostly;
-
-void cgroup_enable_task_cg_lists(void)
-{
- struct task_struct *p, *g;
-
- /*
- * We need tasklist_lock because RCU is not safe against
- * while_each_thread(). Besides, a forking task that has passed
- * cgroup_post_fork() without seeing use_task_css_set_links = 1
- * is not guaranteed to have its child immediately visible in the
- * tasklist if we walk through it with RCU.
- */
- read_lock(&tasklist_lock);
- spin_lock_irq(&css_set_lock);
-
- if (use_task_css_set_links)
- goto out_unlock;
-
- use_task_css_set_links = true;
-
- do_each_thread(g, p) {
- WARN_ON_ONCE(!list_empty(&p->cg_list) ||
- task_css_set(p) != &init_css_set);
-
- /*
- * We should check if the process is exiting, otherwise
- * it will race with cgroup_exit() in that the list
- * entry won't be deleted though the process has exited.
- * Do it while holding siglock so that we don't end up
- * racing against cgroup_exit().
- *
- * Interrupts were already disabled while acquiring
- * the css_set_lock, so we do not need to disable it
- * again when acquiring the sighand->siglock here.
- */
- spin_lock(&p->sighand->siglock);
- if (!(p->flags & PF_EXITING)) {
- struct css_set *cset = task_css_set(p);
-
- if (!css_set_populated(cset))
- css_set_update_populated(cset, true);
- list_add_tail(&p->cg_list, &cset->tasks);
- get_css_set(cset);
- cset->nr_tasks++;
- }
- spin_unlock(&p->sighand->siglock);
- } while_each_thread(g, p);
-out_unlock:
- spin_unlock_irq(&css_set_lock);
- read_unlock(&tasklist_lock);
-}
-
static void init_cgroup_housekeeping(struct cgroup *cgrp)
{
struct cgroup_subsys *ss;
@@ -2187,13 +2128,6 @@ static int cgroup_init_fs_context(struct fs_context *fc)
if (!ctx)
return -ENOMEM;

- /*
- * The first time anyone tries to mount a cgroup, enable the list
- * linking each css_set to its tasks and fix up all existing tasks.
- */
- if (!use_task_css_set_links)
- cgroup_enable_task_cg_lists();
-
ctx->ns = current->nsproxy->cgroup_ns;
get_cgroup_ns(ctx->ns);
fc->fs_private = &ctx->kfc;
@@ -2371,9 +2305,8 @@ static void cgroup_migrate_add_task(struct task_struct *task,
if (task->flags & PF_EXITING)
return;

- /* leave @task alone if post_fork() hasn't linked it yet */
- if (list_empty(&task->cg_list))
- return;
+ /* cgroup_threadgroup_rwsem protects racing against forks */
+ WARN_ON_ONCE(list_empty(&task->cg_list));

cset = task_css_set(task);
if (!cset->mg_src_cgrp)
@@ -4586,9 +4519,6 @@ static void css_task_iter_advance(struct css_task_iter *it)
void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
struct css_task_iter *it)
{
- /* no one should try to iterate before mounting cgroups */
- WARN_ON_ONCE(!use_task_css_set_links);
-
memset(it, 0, sizeof(*it));

spin_lock_irq(&css_set_lock);
@@ -6022,62 +5952,38 @@ void cgroup_cancel_fork(struct task_struct *child)
void cgroup_post_fork(struct task_struct *child)
{
struct cgroup_subsys *ss;
+ struct css_set *cset;
int i;

+ spin_lock_irq(&css_set_lock);
+
+ WARN_ON_ONCE(!list_empty(&child->cg_list));
+ cset = task_css_set(current); /* current is @child's parent */
+ get_css_set(cset);
+ cset->nr_tasks++;
+ css_set_move_task(child, NULL, cset, false);
+
/*
- * This may race against cgroup_enable_task_cg_lists(). As that
- * function sets use_task_css_set_links before grabbing
- * tasklist_lock and we just went through tasklist_lock to add
- * @child, it's guaranteed that either we see the set
- * use_task_css_set_links or cgroup_enable_task_cg_lists() sees
- * @child during its iteration.
- *
- * If we won the race, @child is associated with %current's
- * css_set. Grabbing css_set_lock guarantees both that the
- * association is stable, and, on completion of the parent's
- * migration, @child is visible in the source of migration or
- * already in the destination cgroup. This guarantee is necessary
- * when implementing operations which need to migrate all tasks of
- * a cgroup to another.
- *
- * Note that if we lose to cgroup_enable_task_cg_lists(), @child
- * will remain in init_css_set. This is safe because all tasks are
- * in the init_css_set before cg_links is enabled and there's no
- * operation which transfers all tasks out of init_css_set.
+ * If the cgroup has to be frozen, the new task has too. Let's set
+ * the JOBCTL_TRAP_FREEZE jobctl bit to get the task into the
+ * frozen state.
*/
- if (use_task_css_set_links) {
- struct css_set *cset;
-
- spin_lock_irq(&css_set_lock);
- cset = task_css_set(current); /* current is @child's parent */
- if (list_empty(&child->cg_list)) {
- get_css_set(cset);
- cset->nr_tasks++;
- css_set_move_task(child, NULL, cset, false);
- }
+ if (unlikely(cgroup_task_freeze(child))) {
+ spin_lock(&child->sighand->siglock);
+ WARN_ON_ONCE(child->frozen);
+ child->jobctl |= JOBCTL_TRAP_FREEZE;
+ spin_unlock(&child->sighand->siglock);

/*
- * If the cgroup has to be frozen, the new task has too.
- * Let's set the JOBCTL_TRAP_FREEZE jobctl bit to get
- * the task into the frozen state.
+ * Calling cgroup_update_frozen() isn't required here,
+ * because it will be called anyway a bit later from
+ * do_freezer_trap(). So we avoid cgroup's transient switch
+ * from the frozen state and back.
*/
- if (unlikely(cgroup_task_freeze(child))) {
- spin_lock(&child->sighand->siglock);
- WARN_ON_ONCE(child->frozen);
- child->jobctl |= JOBCTL_TRAP_FREEZE;
- spin_unlock(&child->sighand->siglock);
-
- /*
- * Calling cgroup_update_frozen() isn't required here,
- * because it will be called anyway a bit later
- * from do_freezer_trap(). So we avoid cgroup's
- * transient switch from the frozen state and back.
- */
- }
-
- spin_unlock_irq(&css_set_lock);
}

+ spin_unlock_irq(&css_set_lock);
+
/*
* Call ss->fork(). This must happen after @child is linked on
* css_set; otherwise, @child might change state between ->fork()
@@ -6101,29 +6007,19 @@ void cgroup_exit(struct task_struct *tsk)
struct css_set *cset;
int i;

- /*
- * Unlink from @tsk from its css_set. As migration path can't race
- * with us (thanks to cgroup_threadgroup_rwsem), we can check css_set
- * and cg_list without synchronization.
- */
- cset = task_css_set(tsk);
+ spin_lock_irq(&css_set_lock);

- if (!list_empty(&tsk->cg_list)) {
- spin_lock_irq(&css_set_lock);
- css_set_move_task(tsk, cset, NULL, false);
- list_add_tail(&tsk->cg_list, &cset->dying_tasks);
- cset->nr_tasks--;
+ WARN_ON_ONCE(list_empty(&tsk->cg_list));
+ cset = task_css_set(tsk);
+ css_set_move_task(tsk, cset, NULL, false);
+ list_add_tail(&tsk->cg_list, &cset->dying_tasks);
+ cset->nr_tasks--;

- WARN_ON_ONCE(cgroup_task_frozen(tsk));
- if (unlikely(cgroup_task_freeze(tsk)))
- cgroup_update_frozen(task_dfl_cgroup(tsk));
+ WARN_ON_ONCE(cgroup_task_frozen(tsk));
+ if (unlikely(cgroup_task_freeze(tsk)))
+ cgroup_update_frozen(task_dfl_cgroup(tsk));

- spin_unlock_irq(&css_set_lock);
- } else {
- /* Take reference to avoid freeing init_css_set in cgroup_free,
- * see cgroup_fork(). */
- get_css_set(cset);
- }
+ spin_unlock_irq(&css_set_lock);

/* see cgroup_post_fork() for details */
do_each_subsys_mask(ss, i, have_exit_callback) {
@@ -6140,12 +6036,10 @@ void cgroup_release(struct task_struct *task)
ss->release(task);
} while_each_subsys_mask();

- if (use_task_css_set_links) {
- spin_lock_irq(&css_set_lock);
- css_set_skip_task_iters(task_css_set(task), task);
- list_del_init(&task->cg_list);
- spin_unlock_irq(&css_set_lock);
- }
+ spin_lock_irq(&css_set_lock);
+ css_set_skip_task_iters(task_css_set(task), task);
+ list_del_init(&task->cg_list);
+ spin_unlock_irq(&css_set_lock);
}

void cgroup_free(struct task_struct *task)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c52bc91f882b..faff8f99e8f2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -928,8 +928,6 @@ static void rebuild_root_domains(void)
lockdep_assert_cpus_held();
lockdep_assert_held(&sched_domains_mutex);

- cgroup_enable_task_cg_lists();
-
rcu_read_lock();

/*

Tejun Heo

Oct 25, 2019, 8:56:11 AM
to Oleg Nesterov, Li Zefan, Johannes Weiner, ak...@linux-foundation.org, ar...@arndb.de, chri...@brauner.io, deepa....@gmail.com, ebie...@xmission.com, el...@google.com, gu...@fb.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com, cgr...@vger.kernel.org, kerne...@fb.com
On Thu, Oct 24, 2019 at 12:03:51PM -0700, Tejun Heo wrote:
> cgroup_enable_task_cg_lists() is used to lazily initialize task
> cgroup associations on the first use to reduce fork / exit overheads
> on systems which don't use cgroup. Unfortunately, locking around it
> has never been actually correct and its value is dubious given how the
> vast majority of systems use cgroup right away from boot.
>
> This patch removes the optimization. For now, replace the cg_list
> based branches with WARN_ON_ONCE()'s to be on the safe side. We can
> simplify the logic further in the future.
>
> Signed-off-by: Tejun Heo <t...@kernel.org>
> Reported-by: Oleg Nesterov <ol...@redhat.com>

Applying to cgroup/for-5.5.

Thanks.

--
tejun

Christian Brauner

Oct 25, 2019, 9:34:25 AM
to Tejun Heo, dvy...@google.com, Oleg Nesterov, Li Zefan, Johannes Weiner, ak...@linux-foundation.org, ar...@arndb.de, deepa....@gmail.com, ebie...@xmission.com, el...@google.com, gu...@fb.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com, cgr...@vger.kernel.org, kerne...@fb.com
[+Dmitry]
The code you removed was the only place where task->flags was set from
!current. So I think this fixes the syzbot data-race report in:
https://lore.kernel.org/r/0000000000003b...@google.com

Link: syzbot+492a4a...@syzkaller.appspotmail.com

Thanks!
Christian

Oleg Nesterov

Oct 25, 2019, 10:13:39 AM
to Christian Brauner, Tejun Heo, dvy...@google.com, Li Zefan, Johannes Weiner, ak...@linux-foundation.org, ar...@arndb.de, deepa....@gmail.com, ebie...@xmission.com, el...@google.com, gu...@fb.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com, cgr...@vger.kernel.org, kerne...@fb.com
On 10/25, Christian Brauner wrote:
>
> [+Dmitry]
>
> On Fri, Oct 25, 2019 at 05:56:06AM -0700, Tejun Heo wrote:
> > On Thu, Oct 24, 2019 at 12:03:51PM -0700, Tejun Heo wrote:
> > > cgroup_enable_task_cg_lists() is used to lazily initialize task
> > > cgroup associations on the first use to reduce fork / exit overheads
> > > on systems which don't use cgroup. Unfortunately, locking around it
> > > has never been actually correct and its value is dubious given how the
> > > vast majority of systems use cgroup right away from boot.
> > >
> > > This patch removes the optimization. For now, replace the cg_list
> > > based branches with WARN_ON_ONCE()'s to be on the safe side. We can
> > > simplify the logic further in the future.
> > >
> > > Signed-off-by: Tejun Heo <t...@kernel.org>
> > > Reported-by: Oleg Nesterov <ol...@redhat.com>
> >
> > Applying to cgroup/for-5.5.
>
> The code you removed was the only place where task->flags was set from
> !current.

No, that code doesn't modify task->flags. It checks PF_EXITING under siglock
but this makes no sense and can't avoid the race with cgroup_exit().

> So I think this fixes the syzbot data-race report in:
> https://lore.kernel.org/r/0000000000003b...@google.com

No.

Almost every usage of task->flags (load or store) can be reported as "data race".

Say, you do

if (task->flags & PF_KTHREAD)

while this task does

current->flags |= PF_FREEZER_SKIP;
schedule().

this is data race.

Oleg.

Christian Brauner

Oct 25, 2019, 10:32:46 AM
to Oleg Nesterov, Tejun Heo, dvy...@google.com, Li Zefan, Johannes Weiner, ak...@linux-foundation.org, ar...@arndb.de, deepa....@gmail.com, ebie...@xmission.com, el...@google.com, gu...@fb.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com, cgr...@vger.kernel.org, kerne...@fb.com
On Fri, Oct 25, 2019 at 04:13:25PM +0200, Oleg Nesterov wrote:
> On 10/25, Christian Brauner wrote:
> >
> > [+Dmitry]
> >
> > On Fri, Oct 25, 2019 at 05:56:06AM -0700, Tejun Heo wrote:
> > > On Thu, Oct 24, 2019 at 12:03:51PM -0700, Tejun Heo wrote:
> > > > cgroup_enable_task_cg_lists() is used to lazily initialize task
> > > > cgroup associations on the first use to reduce fork / exit overheads
> > > > on systems which don't use cgroup. Unfortunately, locking around it
> > > > has never been actually correct and its value is dubious given how the
> > > > vast majority of systems use cgroup right away from boot.
> > > >
> > > > This patch removes the optimization. For now, replace the cg_list
> > > > based branches with WARN_ON_ONCE()'s to be on the safe side. We can
> > > > simplify the logic further in the future.
> > > >
> > > > Signed-off-by: Tejun Heo <t...@kernel.org>
> > > > Reported-by: Oleg Nesterov <ol...@redhat.com>
> > >
> > > Applying to cgroup/for-5.5.
> >
> > The code you removed was the only place where task->flags was set from
> > !current.
>
> No, that code doesn't modify task->flags. It checks PF_EXITING under siglock
> but this makes no sense and can't avoid the race with cgroup_exit().

Sorry, you are right, I misread that. I misremembered this from the prior
thread where we discussed where ->flags is set from [1].

>
> > So I think this fixes the syzbot data-race report in:
> > https://lore.kernel.org/r/0000000000003b...@google.com
>
> No.
>
> Almost every usage of task->flags (load or sore) can be reported as "data race".
>
> Say, you do
>
> if (task->flags & PF_KTHREAD)
>
> while this task does
>
> current->flags |= PF_FREEZER_SKIP;
> schedule().
>
> this is data race.

Right, but I thought we agreed on WONTFIX in those scenarios?
The alternative is to READ_ONCE()/WRITE_ONCE() all of these.

[1]: https://lore.kernel.org/r/2019102113...@redhat.com

Anyway, accidental noise on my part.
Christian

Oleg Nesterov

Oct 25, 2019, 11:52:38 AM
to Christian Brauner, Tejun Heo, dvy...@google.com, Li Zefan, Johannes Weiner, ak...@linux-foundation.org, ar...@arndb.de, deepa....@gmail.com, ebie...@xmission.com, el...@google.com, gu...@fb.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com, cgr...@vger.kernel.org, kerne...@fb.com
Well, in my opinion this is WONTFIX, but I won't argue if someone
adds _ONCE to all of these. Same for task->state, exit_state, and
more.

Oleg.

Christian Brauner

Oct 25, 2019, 1:05:37 PM
to Oleg Nesterov, dvy...@google.com, ebie...@xmission.com, el...@google.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com
[Removing a few people from Cc to avoid spamming the whole world]
Well, I honestly think that state and exit_state would make sense.
There already were issues that got fixed for example in 3245d6acab98
("exit: fix race between wait_consider_task() and wait_task_zombie()")
and as far as I understand this would also help kcsan to better detect
races.

Christian

Oleg Nesterov

Oct 28, 2019, 12:49:04 PM
to Christian Brauner, dvy...@google.com, ebie...@xmission.com, el...@google.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On 10/25, Christian Brauner wrote:
>
> On Fri, Oct 25, 2019 at 05:52:25PM +0200, Oleg Nesterov wrote:
> > On 10/25, Christian Brauner wrote:
> > >
> > > On Fri, Oct 25, 2019 at 04:13:25PM +0200, Oleg Nesterov wrote:
> > > > Almost every usage of task->flags (load or store) can be reported as "data race".
> > > >
> > > > Say, you do
> > > >
> > > > if (task->flags & PF_KTHREAD)
> > > >
> > > > while this task does
> > > >
> > > > current->flags |= PF_FREEZER_SKIP;
> > > > schedule().
> > > >
> > > > this is data race.
> > >
> > > Right, but I thought we agreed on WONTFIX in those scenarios?
> > > The alternative is to READ_ONCE()/WRITE_ONCE() all of these.
> >
> > Well, in my opinion this is WONTFIX, but I won't argue if someone
> > adds _ONCE to all of these. Same for task->state, exit_state, and
> > more.
>
> Well, I honestly think that state and exit_state would make sense.

Heh. Again, I am not arguing, but...

OK, let's suppose we blindly add READ_ONCE() to every access of
task->state/exit_state.

Yes, this won't hurt and possibly can fix some bugs we are not aware of.

However,

> There already were issues that got fixed for example in 3245d6acab98
> ("exit: fix race between wait_consider_task() and wait_task_zombie()")

The change above can't fix the problem like this.

It is not that this code lacked READ_ONCE(). I am sure others and I
understood that this code can read ->exit_state more than once; it is just
that nobody noticed that in this case this is really wrong.

IOW, if we simply change the code before 3245d6acab98 to use READ_ONCE()
the code will be equally wrong, and

> and as far as I understand this would also help kcsan to better detect
> races.

this change will simply hide the problem from kcsan.
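
Schematically, and reconstructed only for illustration (not the actual
pre-3245d6acab98 code), the problematic pattern looks like:

	/*
	 * ->exit_state is loaded twice and can change in between, so the
	 * two checks can disagree.  Marking each load does not help:
	 */
	if (READ_ONCE(p->exit_state) == EXIT_DEAD)	/* first load  */
		return 0;
	...
	if (READ_ONCE(p->exit_state) == EXIT_ZOMBIE)	/* second load */
		return wait_task_zombie(wo, p);

The fix boils down to loading the value once and using that snapshot for
every check, which no amount of blind _ONCE annotation gives you.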

Oleg.

Christian Brauner

Oct 28, 2019, 1:30:35 PM
to Oleg Nesterov, dvy...@google.com, ebie...@xmission.com, el...@google.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com
On Mon, Oct 28, 2019 at 05:48:52PM +0100, Oleg Nesterov wrote:
> On 10/25, Christian Brauner wrote:
> >
> > On Fri, Oct 25, 2019 at 05:52:25PM +0200, Oleg Nesterov wrote:
> > > On 10/25, Christian Brauner wrote:
> > > >
> > > > On Fri, Oct 25, 2019 at 04:13:25PM +0200, Oleg Nesterov wrote:
> > > > > Almost every usage of task->flags (load or store) can be reported as "data race".
> > > > >
> > > > > Say, you do
> > > > >
> > > > > if (task->flags & PF_KTHREAD)
> > > > >
> > > > > while this task does
> > > > >
> > > > > current->flags |= PF_FREEZER_SKIP;
> > > > > schedule().
> > > > >
> > > > > this is data race.
> > > >
> > > > Right, but I thought we agreed on WONTFIX in those scenarios?
> > > > The alternative is to READ_ONCE()/WRITE_ONCE() all of these.
> > >
> > > Well, in my opinion this is WONTFIX, but I won't argue if someone
> > > adds _ONCE to all of these. Same for task->state, exit_state, and
> > > more.
> >
> > Well, I honestly think that state and exit_state would make sense.
>
> Heh. Again, I am not arguing, but...
>
> OK, lets suppose we blindly add READ_ONCE() to every access of
> task->state/exit_state.
>
> Yes, this won't hurt and possibly can fix some bugs we are not aware of.

I wasn't planning or working on adding *_ONCE everywhere. ;)
I just think it makes sense as a preemptive strike since they are shared
(though mostly protected by locks anyway).

>
> However,
>
> > There already were issues that got fixed for example in 3245d6acab98
> > ("exit: fix race between wait_consider_task() and wait_task_zombie()")
>
> The change above can't fix the problem like this.

No argument about the code we discussed right here, for sure!

>
> It is not that this code lacked READ_ONCE(). I am sure me and others
> understood that this code can read ->exit_state more than once, just
> nobody noticed that in this case this is really wrong.
>
> IOW, if we simply change the code before 3245d6acab98 to use READ_ONCE()
> the code will be equally wrong, and
>
> > and as far as I understand this would also help kcsan to better detect
> > races.
>
> this change will simply hide the problem from kcsan.

I can't speak to that; my understanding of the claim was that READ_ONCE()
helps them even if it's not really doing anything. But maybe I misunderstood
the k{c,t}san manpage.

Christian

Marco Elver

Oct 28, 2019, 1:46:57 PM
to Christian Brauner, Oleg Nesterov, dvy...@google.com, ebie...@xmission.com, linux-...@vger.kernel.org, syzkall...@googlegroups.com
What Oleg said is right: marking things _ONCE will make these accesses
no longer be data races, and KCSAN would then no longer report them as
data races, even if the code is still buggy due to a race condition. In
that case, the race condition would also stop being a data race, and
we'd no longer find it.
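
E.g. something like

	if (READ_ONCE(task->exit_state) == EXIT_ZOMBIE) {
		/* ->exit_state may still change right after the check */
	}

is no longer a data race, but whether it is still a race condition
depends entirely on whether the code can tolerate the value changing
after the check.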

Thanks,
-- Marco

syzbot

Sep 30, 2020, 9:52:14 PM
to syzkall...@googlegroups.com
Auto-closing this bug as obsolete.
Crashes did not happen for a while, no reproducer and no activity.