Re: [PATCH RFC v2] pidfs: convert rb-tree to rhashtable

1 view
Skip to first unread message

Guillaume Tucker

unread,
Feb 20, 2026, 10:11:48 AMFeb 20
to Christian Brauner, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow
Hi Christian et al,

On 20/01/2026 15:52, Christian Brauner wrote:
> Mateusz reported performance penalties [1] during task creation because
> pidfs uses pidmap_lock to add elements into the rbtree. Switch to an
> rhashtable to have separate fine-grained locking and to decouple from
> pidmap_lock moving all heavy manipulations outside of it.
>
> Convert the pidfs inode-to-pid mapping from an rb-tree with seqcount
> protection to an rhashtable. This removes the global pidmap_lock
> contention from pidfs_ino_get_pid() lookups and allows the hashtable
> insert to happen outside the pidmap_lock.
>
> pidfs_add_pid() is split. pidfs_prepare_pid() allocates inode number and
> initializes pid fields and is called inside pidmap_lock. pidfs_add_pid()
> inserts pid into rhashtable and is called outside pidmap_lock. Insertion
> into the rhashtable can fail and memory allocation may happen so we need
> to drop the spinlock.
>
> To guard against accidently opening an already reaped task
> pidfs_ino_get_pid() uses additional checks beyond pid_vnr(). If
> pid->attr is PIDFS_PID_DEAD or NULL the pid either never had a pidfd or
> it already went through pidfs_exit() aka the process as already reaped.
> If pid->attr is valid check PIDFS_ATTR_BIT_EXIT to figure out whether
> the task has exited.
>
> This slightly changes visibility semantics: pidfd creation is denied
> after pidfs_exit() runs, which is just before the pid number is removed
> from the via free_pid(). That should not be an issue though.
>
> Link: https://lore.kernel.org/20251206131955....@gmail.com [1]
> Signed-off-by: Christian Brauner <bra...@kernel.org>
> ---
> Changes in v2:
> - Ensure that pid is removed before call_rcu() from pidfs.
> - Don't drop and reacquire spinlock.
> - Link to v1: https://patch.msgid.link/20260119-work-pidfs-rhas...@kernel.org
> ---
> fs/pidfs.c | 81 +++++++++++++++++++++------------------------------
> include/linux/pid.h | 4 +--
> include/linux/pidfs.h | 3 +-
> kernel/pid.c | 13 ++++++---
> 4 files changed, 46 insertions(+), 55 deletions(-)

[...]

> diff --git a/kernel/pid.c b/kernel/pid.c
> index ad4400a9f15f..6077da774652 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -43,7 +43,6 @@
> #include <linux/sched/task.h>
> #include <linux/idr.h>
> #include <linux/pidfs.h>
> -#include <linux/seqlock.h>
> #include <net/sock.h>
> #include <uapi/linux/pidfd.h>
>
> @@ -85,7 +84,6 @@ struct pid_namespace init_pid_ns = {
> EXPORT_SYMBOL_GPL(init_pid_ns);
>
> static __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
> -seqcount_spinlock_t pidmap_lock_seq = SEQCNT_SPINLOCK_ZERO(pidmap_lock_seq, &pidmap_lock);
>
> void put_pid(struct pid *pid)
> {
> @@ -141,9 +139,9 @@ void free_pid(struct pid *pid)
>
> idr_remove(&ns->idr, upid->nr);
> }
> - pidfs_remove_pid(pid);
> spin_unlock(&pidmap_lock);
>
> + pidfs_remove_pid(pid);
> call_rcu(&pid->rcu, delayed_put_pid);
> }

There appears to be a reproducible panic in rcu since next-20260216
at least while running KUnit. After running a bisection I found that
it was visible at a merge commit adding this patch 44e59e62b2a2
("Merge branch 'kernel-7.0.misc' into vfs.all"). I then narrowed it
down further on a test branch by rebasing the pidfs series on top of
the last known working commit:

https://gitlab.com/gtucker/linux/-/commits/kunit-rcu-debug-rebased

I also did some initial investigation with basic printk debugging and
haven't found anything obviously wrong in this patch itself, although
I'm no expert in pidfs... One symptom is that the kernel panic
always happens because the function pointer to delayed_put_pid()
becomes corrupt. As a quick hack, if I just call put_pid() in
free_pid() rather than go through rcu then there's no panic - see the
last commit on the test branch from the link above. The issue is
still in next-20260219 as far as I can tell.

Here's how to reproduce this, using the new container script and a
plain container image to run KUnit vith QEMU on x86:

scripts/container -s -i docker.io/gtucker/korg-gcc:kunit -- \
tools/testing/kunit/kunit.py \
run \
--arch=x86_64 \
--cross_compile=x86_64-linux-

The panic can be seen in .kunit/test.log:

[gtucker] rcu_do_batch:2609 count=7 func=ffffffff99026d40
Oops: invalid opcode: 0000 [#2] SMP NOPTI
CPU: 0 UID: 0 PID: 197 Comm: kunit_try_catch Tainted: G D N 6.19.0-09950-gc33cbc7ffae4 #77 PREEMPT(lazy)
Tainted: [D]=DIE, [N]=TEST
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.1 11/11/2019
RIP: 0010:0xffffffff99026d42

Looking at the last rcu callbacks that were enqueued with my extra
printk messages:

$ grep call_rcu .kunit/test.log | tail -n16
[gtucker] call_rcu include/linux/sched/task.h:159 put_task_struct ffffffff98887ae0
[gtucker] call_rcu include/linux/sched/task.h:159 put_task_struct ffffffff98887ae0
[gtucker] call_rcu include/linux/sched/task.h:159 put_task_struct ffffffff98887ae0
[gtucker] call_rcu lib/radix-tree.c:310 radix_tree_node_free ffffffff98ccc1a0
[gtucker] call_rcu lib/radix-tree.c:310 radix_tree_node_free ffffffff98ccc1a0
[gtucker] call_rcu include/linux/sched/task.h:159 put_task_struct ffffffff98887ae0
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0
[gtucker] call_rcu kernel/pid.c:148 free_pid ffffffff988adaf0
[gtucker] call_rcu kernel/exit.c:237 put_task_struct_rcu_user ffffffff9888e440
[gtucker] call_rcu lib/radix-tree.c:310 radix_tree_node_free ffffffff98ccc1a0
[gtucker] call_rcu lib/radix-tree.c:310 radix_tree_node_free ffffffff98ccc1a0
[gtucker] call_rcu kernel/pid.c:148 free_pid ffffffff988adaf0
[gtucker] call_rcu kernel/exit.c:237 put_task_struct_rcu_user ffffffff9888e440
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0

and then the ones that were called:

$ grep rcu_do_batch .kunit/test.log | tail
[gtucker] rcu_do_batch:2609 count=7 func=ffffffff98887ae0
[gtucker] rcu_do_batch:2609 count=8 func=ffffffff98887ae0
[gtucker] rcu_do_batch:2609 count=9 func=ffffffff98887ae0
[gtucker] rcu_do_batch:2609 count=1 func=ffffffff98ccc1a0
[gtucker] rcu_do_batch:2609 count=2 func=ffffffff98887ae0
[gtucker] rcu_do_batch:2609 count=3 func=ffffffff988b7cd0
[gtucker] rcu_do_batch:2609 count=4 func=ffffffff988b7cd0
[gtucker] rcu_do_batch:2609 count=5 func=ffffffff988b7cd0
[gtucker] rcu_do_batch:2609 count=6 func=ffffffff98ccc1a0
[gtucker] rcu_do_batch:2609 count=7 func=ffffffff99026d40

we can see that the last pointer ffffffff99026d40 was never enqueued,
and the one from free_pid() ffffffff988adaf0 was never dequeued.
This is where I stopped investigating as it looked legit and someone
else might have more clues as to what's going on here. I've only
seen the problem with this callback but again, KUnit is a very narrow
kind of workload so the root cause may well be lying elsewhere.

Please let me know if you need any more debugging details or if I can
help test a fix. Hope this helps!

Cheers,
Guillaume

Christian Brauner

unread,
Feb 24, 2026, 8:22:40 AMFeb 24
to Guillaume Tucker, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow
Thanks for the report. I have so far no idea how that can happen:

* Is this reproducible with multiple compilers?
* Is this reproducible on v7.0-rc1?

Fwiw, we're missing one check currently:

diff --git a/kernel/pid.c b/kernel/pid.c
index 3b96571d0fe6..dc11acd445ae 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -326,7 +326,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,

retval = pidfs_add_pid(pid);
if (unlikely(retval)) {
- free_pid(pid);
+ if (pid != &init_struct_pid)
+ free_pid(pid);
pid = ERR_PTR(-ENOMEM);
}

But it seems unlikely that pidfs_add_pid() fails here.

Christian Brauner

unread,
Feb 24, 2026, 11:25:29 AMFeb 24
to Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow, Linus Torvalds
Ugh, yuck squared.

IIUC, the bug is a UAF in free_kthread_struct(). It wasn't easy
detectable until the pidfs rhashtable conversion changed struct pid's
size and field layout.

The rhashtable conversion replaced struct pid's struct rb_node of 24
bytes with struct rhash_head of 8 bytes, shrinking struct pid by 16
bytes bringing it to 144 bytes. This means it's now the same size as
struct kthread (without CONFIG_BLK_CGROUP). Both structs use
SLAB_HWCACHE_ALIGN bringing it to 192. KUnit sets
CONFIG_SLAB_MERGE_DEFAULT=y. So now struct pid and struct kthread share
slab caches. First part of the puzzle.

struct pid.rcu.func is at offset 0x78 and struct kthread.affinity node
at offset 0x78. I'm I'm right then we can already see where this is
going.

So free_kthread_struct() calls kfree(kthread) without checking whether
the kthread's affinity_node is still linked in kthread_affinity_list.

If a kthread exits via a path that bypasses kthread_exit() (e.g.,
make_task_dead() after an oops -- which calls do_exit() directly),
the affinity_node remains in the global kthread_affinity_list. When
free_kthread_struct() later frees the kthread struct, the linked list
still references the freed memory. Any subsequent list_del() by another
kthread in kthread_exit() writes to the freed memory:

list_del(&kthread->affinity_node):
entry->prev->next = entry->next // writes to freed predecessor's offset 0x78

With cache merging, this freed kthread memory may have been reused for a
struct pid. The write corrupts pid.rcu.func at the same offset 0x78,
replacing delayed_put_pid with &kthread_affinity_list. The next RCU
callback invocation is fscked.

* Raising SLAB_NO_MERGE makes the bug go away.
* Turn on
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_INLINE=y

BUG: KASAN: slab-use-after-free in kthread_exit+0x255/0x280
Write of size 8 at addr ffff888003238f78 by task kunit_try_catch/183

Allocated by task 169:
set_kthread_struct+0x31/0x150
copy_process+0x203d/0x4c00

<snip>

Freed by task 0:
free_task+0x147/0x3a0
rcu_core+0x58e/0x1c50

<snip>

Object at ffff888003238f00 belongs to kmalloc-192 cache.
Allocated by
set_kthread_struct()
-> kzalloc(sizeof(*kthread)).
Freed by
free_task()
-> free_kthread_struct()
-> kfree().

Written to at offset 0x78 (affinity_node.next) by another kthread's
list_del in kthread_exit().

Fix should be something like

void free_kthread_struct(struct task_struct *k)
{
struct kthread *kthread;

kthread = to_kthread(k);
if (!kthread)
return;

+ if (!list_empty(&kthread->affinity_node)) {
+ mutex_lock(&kthread_affinity_lock);
+ list_del(&kthread->affinity_node);
+ mutex_unlock(&kthread_affinity_lock);
+ }
+ if (kthread->preferred_affinity)
+ kfree(kthread->preferred_affinity);

#ifdef CONFIG_BLK_CGROUP
WARN_ON_ONCE(kthread->blkcg_css);
#endif
k->worker_private = NULL;
kfree(kthread->full_name);
kfree(kthread);
}

The normal kthread_exit() path already unlinks the node. After
list_del(), the node's pointers are set to LIST_POISON1/LIST_POISON2, so
list_empty() returns false. To avoid a double-unlink, kthread_exit()
should use list_del_init() instead of list_del(), so that
free_kthread_struct()'s list_empty() check correctly detects the
already-unlinked state.

I had claude reproduce this under various conditions because I'm a lazy fsck.

Mark Brown

unread,
Feb 24, 2026, 11:37:13 AMFeb 24
to Christian Brauner, Guillaume Tucker, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, kuni...@googlegroups.com, David Gow
On Tue, Feb 24, 2026 at 02:22:31PM +0100, Christian Brauner wrote:
FWIW I've been seeing RCU stalls and crashes in KUnit while building
-next which I believe are the same as Guillaume is reporting, these have
reproduced with multiple arches using the Debian GCC 14 and with LLVM
for me. Example of my command line:

./tools/testing/kunit/kunit.py run --arch arm64 --cross_compile=aarch64-linux-gnu-

> * Is this reproducible on v7.0-rc1?

The issues I'm seeing certainly are. I just tried testing on this
specific commit (802182490445f6bcf5de0e0518fb967c2afb6da1) and managed
to have KUnit complete though...

> Fwiw, we're missing one check currently:

...

> But it seems unlikely that pidfs_add_pid() fails here.

That doesn't seem to have any impact for me. I suspect that there may
be multiple interacting changes triggering the issue (or making it more
likely to occur) which isn't helping anything :/ See also:

https://lore.kernel.org/r/59e8c352-7bc8-4358...@davidgow.net
signature.asc

Mark Brown

unread,
Feb 24, 2026, 12:04:13 PMFeb 24
to Christian Brauner, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, kuni...@googlegroups.com, David Gow, Linus Torvalds
On Tue, Feb 24, 2026 at 05:25:21PM +0100, Christian Brauner wrote:

> Ugh, yuck squared.

> IIUC, the bug is a UAF in free_kthread_struct(). It wasn't easy
> detectable until the pidfs rhashtable conversion changed struct pid's
> size and field layout.

Eeew, indeed. Thanks for figuring this out!
Confirmed that the above patch plus the list_del_init() change seems to
fix the issue for me, the full patch I tested with is below to confirm.
Feel free to add:

Tested-by: Mark Brown <bro...@kernel.org>

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 20451b624b67..3778fcbc56db 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -147,6 +147,13 @@ void free_kthread_struct(struct task_struct *k)
kthread = to_kthread(k);
if (!kthread)
return;
+ if (!list_empty(&kthread->affinity_node)) {
+ mutex_lock(&kthread_affinity_lock);
+ list_del(&kthread->affinity_node);
+ mutex_unlock(&kthread_affinity_lock);
+ }
+ if (kthread->preferred_affinity)
+ kfree(kthread->preferred_affinity);

#ifdef CONFIG_BLK_CGROUP
WARN_ON_ONCE(kthread->blkcg_css);
@@ -325,7 +332,7 @@ void __noreturn kthread_exit(long result)
kthread->result = result;
if (!list_empty(&kthread->affinity_node)) {
mutex_lock(&kthread_affinity_lock);
- list_del(&kthread->affinity_node);
+ list_del_init(&kthread->affinity_node);
mutex_unlock(&kthread_affinity_lock);

if (kthread->preferred_affinity) {
signature.asc

Guillaume Tucker

unread,
Feb 24, 2026, 1:27:18 PMFeb 24
to Mark Brown, Christian Brauner, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, kuni...@googlegroups.com, David Gow, Linus Torvalds
I second Mark's feedback. I only used GCC 14.2 on x86_64 (see my
original email with a container image). I confirm it is reproducible
on v7.0-rc1 and that the patch below fixes the issue. So for what
it's worth, feel free to add:

Tested-by: Guillaume Tucker <gtu...@gtucker.io>

Cheers,
Guillaume


PS: It also happens to be a great first result for my new automated
bisection tool :) I'll add some post-processing with debug configs
turned on to automate typical sanity checks e.g. KASAN and builds
with alternative compilers and architectures.

Linus Torvalds

unread,
Feb 24, 2026, 2:31:19 PMFeb 24
to Christian Brauner, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow
On Tue, 24 Feb 2026 at 08:25, Christian Brauner <bra...@kernel.org> wrote:
>
> If a kthread exits via a path that bypasses kthread_exit() (e.g.,
> make_task_dead() after an oops -- which calls do_exit() directly),
> the affinity_node remains in the global kthread_affinity_list. When
> free_kthread_struct() later frees the kthread struct, the linked list
> still references the freed memory. Any subsequent list_del() by another
> kthread in kthread_exit() writes to the freed memory:

Ugh.

So this is nasty, but I really detest the suggested fix. It just
smells wrong to have that affinity_node cleanup done in two different
places depending on how the exit is done.

IOW, I think the proper fix would be to just make sure that
kthread_exit() isn't actually ever bypassed.

Because looking at this, there are other issues with do_exit() killing
a kthread - it currently also means that kthread->result randomly
doesn't get set, for example, so kthread_stop() would appear to
basically return garbage.

No, nobody likely cares about the kthread_stop() return value for that
case, but it's an example of the same kind of "two different exit
paths, inconsistent data structures" issue.

How about something like the attached, in other words?

NOTE NOTE NOTE! This is *entirely* untested. It might do unspeakable
things to your pets, so please check it. I'm sending this patch out as
a "I really would prefer this kind of approach" example, not as
anything more than that.

Because I really think the core fundamental problem was that there
were two different exit paths that did different things, and we
shouldn't try to fix the symptoms of that problem, but instead really
fix the core issue.

Hmm?

Side note: while writing this suggested patch, I do note that this
comment is wrong:

* When "(p->flags & PF_KTHREAD)" is set the task is a kthread and will
* always remain a kthread. For kthreads p->worker_private always
* points to a struct kthread. For tasks that are not kthreads
* p->worker_private is used to point to other things.

because 'init_task' is marked as PF_KTHREAD, but does *not* have a
p->worker_private.

Anyway, that doesn't affect this particular code, but it might be
worth thinking about.

Linus
patch.diff

David Gow

unread,
Feb 25, 2026, 11:09:03 PMFeb 25
to Christian Brauner, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, Linus Torvalds
Le 25/02/2026 à 12:25 AM, Christian Brauner a écrit :
> On Tue, Feb 24, 2026 at 02:22:31PM +0100, Christian Brauner wrote:

(...snip...)
Aha... that makes sense. What a mess!

>
> Fix should be something like
>
> void free_kthread_struct(struct task_struct *k)
> {
> struct kthread *kthread;
>
> kthread = to_kthread(k);
> if (!kthread)
> return;
>
> + if (!list_empty(&kthread->affinity_node)) {
> + mutex_lock(&kthread_affinity_lock);
> + list_del(&kthread->affinity_node);
> + mutex_unlock(&kthread_affinity_lock);
> + }
> + if (kthread->preferred_affinity)
> + kfree(kthread->preferred_affinity);
>
> #ifdef CONFIG_BLK_CGROUP
> WARN_ON_ONCE(kthread->blkcg_css);
> #endif
> k->worker_private = NULL;
> kfree(kthread->full_name);
> kfree(kthread);
> }
>

This fixes the KUnit issues for me, too, so:

Tested-by: David Gow <da...@davidgow.net>

(But so does Linus' patch, which is probably the nicer long-term solution.)

Cheers,
-- David

David Gow

unread,
Feb 25, 2026, 11:09:15 PMFeb 25
to Linus Torvalds, Christian Brauner, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com
FWIW, the attached patch fixes the issue for me, and my machines have
been been stable running it this morning (+reverting the firewire merge
due to [1]), so the idea seems pretty sound.

Tested-by: David Gow <da...@davidgow.net>

Cheers,
-- David

[1]:
https://lore.kernel.org/all/CAHk-=wgUmxkjwsWzX1rTo=DTnaouwY-VT8Bj...@mail.gmail.com/

Christian Brauner

unread,
Feb 26, 2026, 4:48:02 AMFeb 26
to Linus Torvalds, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow
Oh nice.
I was kinda hoping Tejun would jump on this one and so just pointed to
one potential way to fix it but didn't really spend time on it.

Anyway, let's just take what you proposed and slap a commit message on
it. Fwiw, init_task does have ->worker_private it just gets set later
during sched_init():

/*
* The idle task doesn't need the kthread struct to function, but it
* is dressed up as a per-CPU kthread and thus needs to play the part
* if we want to avoid special-casing it in code that deals with per-CPU
* kthreads.
*/
WARN_ON(!set_kthread_struct(current));

I think that @current here is misleading. When sched_init() runs it
should be single-threaded still and current == &init_task. So that
set_kthread_struct(current) call sets @init_task's worker_private iiuc.

Patch appended. I'll stuff it into vfs.fixes.
0001-kthread-consolidate-kthread-exit-paths-to-prevent-us.patch

Linus Torvalds

unread,
Feb 27, 2026, 11:55:03 AMFeb 27
to Christian Brauner, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow
On Thu, 26 Feb 2026 at 01:48, Christian Brauner <bra...@kernel.org> wrote:
>
> Anyway, let's just take what you proposed and slap a commit message on
> it.

I will look forward to getting this patch through the normal channels
and I have removed it from my "misc tree of random patches".

> Fwiw, init_task does have ->worker_private it just gets set later
> during sched_init():

Ahh. That was not very obvious indeed.

Linus

Alice Ryhl

unread,
Mar 3, 2026, 5:03:34 AMMar 3
to bra...@kernel.org, bro...@kernel.org, davi...@google.com, gtu...@gtucker.io, ja...@suse.cz, kuni...@googlegroups.com, linux-...@vger.kernel.org, mjg...@gmail.com, t...@kernel.org, torv...@linux-foundation.org, vi...@zeniv.linux.org.uk, Alice Ryhl
Christian Brauner <bra...@kernel.org> writes:
> Oh nice.
> I was kinda hoping Tejun would jump on this one and so just pointed to
> one potential way to fix it but didn't really spend time on it.
>
> Anyway, let's just take what you proposed and slap a commit message on
> it. Fwiw, init_task does have ->worker_private it just gets set later
> during sched_init():
>
> /*
> * The idle task doesn't need the kthread struct to function, but it
> * is dressed up as a per-CPU kthread and thus needs to play the part
> * if we want to avoid special-casing it in code that deals with per-CPU
> * kthreads.
> */
> WARN_ON(!set_kthread_struct(current));
>
> I think that @current here is misleading. When sched_init() runs it
> should be single-threaded still and current == &init_task. So that
> set_kthread_struct(current) call sets @init_task's worker_private iiuc.
>
> Patch appended. I'll stuff it into vfs.fixes.

welp, I guess you get another one here:

Tested-by: Alice Ryhl <alic...@google.com>

from:
https://lore.kernel.org/all/aaaxK5gF...@google.com/

Alice

Christian Loehle

unread,
Mar 6, 2026, 6:05:59 AMMar 6
to Christian Brauner, Linus Torvalds, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow
FWIW this leaves the stale BTF reference:

------8<------
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 401d6c4960ec..8db79e593156 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -25261,7 +25261,6 @@ BTF_ID(func, __x64_sys_exit_group)
BTF_ID(func, do_exit)
BTF_ID(func, do_group_exit)
BTF_ID(func, kthread_complete_and_exit)
-BTF_ID(func, kthread_exit)
BTF_ID(func, make_task_dead)
BTF_SET_END(noreturn_deny)

0001-bpf-drop-kthread_exit-from-noreturn_deny.patch

Christian Brauner

unread,
Mar 6, 2026, 9:08:37 AMMar 6
to linux-...@vger.kernel.org, linux-...@vger.kernel.org, linu...@vger.kernel.org, b...@vger.kernel.org, linux-k...@vger.kernel.org, kuni...@googlegroups.com, Luis Chamberlain, Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Brendan Higgins, David Gow, Rae Moar, Christian Loehle, Linus Torvalds, Christian Brauner
In 28aaa9c39945 ("kthread: consolidate kthread exit paths to prevent use-after-free")
we folded kthread_exit() into do_exit() when we fixed a nasty UAF bug.
We left kthread_exit() around as an alias to do_exit(). Remove it
completely.

Reported-by: Christian Loehle <christia...@arm.com>
Link: https://lore.kernel.org/1ff1bce2-8bb4-463c...@arm.com
Signed-off-by: Christian Brauner <bra...@kernel.org>
---
include/linux/kthread.h | 1 -
include/linux/module.h | 2 +-
include/linux/sunrpc/svc.h | 2 +-
kernel/bpf/verifier.c | 1 -
kernel/kthread.c | 8 ++++----
kernel/module/main.c | 2 +-
lib/kunit/try-catch.c | 2 +-
7 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index a01a474719a7..37982eca94f1 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -116,7 +116,6 @@ void *kthread_probe_data(struct task_struct *k);
int kthread_park(struct task_struct *k);
void kthread_unpark(struct task_struct *k);
void kthread_parkme(void);
-#define kthread_exit(result) do_exit(result)
void kthread_complete_and_exit(struct completion *, long) __noreturn;
int kthreads_update_housekeeping(void);
void kthread_do_exit(struct kthread *, long);
diff --git a/include/linux/module.h b/include/linux/module.h
index 14f391b186c6..79ac4a700b39 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -855,7 +855,7 @@ static inline int unregister_module_notifier(struct notifier_block *nb)
return 0;
}

-#define module_put_and_kthread_exit(code) kthread_exit(code)
+#define module_put_and_kthread_exit(code) do_exit(code)

static inline void print_modules(void)
{
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 4dc14c7a711b..c86fc8a87eae 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -338,7 +338,7 @@ static inline void svc_thread_init_status(struct svc_rqst *rqstp, int err)
{
store_release_wake_up(&rqstp->rq_err, err);
if (err)
- kthread_exit(1);
+ do_exit(1);
}

struct svc_deferred_req {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 401d6c4960ec..8db79e593156 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -25261,7 +25261,6 @@ BTF_ID(func, __x64_sys_exit_group)
BTF_ID(func, do_exit)
BTF_ID(func, do_group_exit)
BTF_ID(func, kthread_complete_and_exit)
-BTF_ID(func, kthread_exit)
BTF_ID(func, make_task_dead)
BTF_SET_END(noreturn_deny)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 791210daf8b4..1447c14c8540 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -323,7 +323,7 @@ void __noreturn kthread_complete_and_exit(struct completion *comp, long code)
if (comp)
complete(comp);

- kthread_exit(code);
+ do_exit(code);
}
EXPORT_SYMBOL(kthread_complete_and_exit);

@@ -395,7 +395,7 @@ static int kthread(void *_create)
if (!done) {
kfree(create->full_name);
kfree(create);
- kthread_exit(-EINTR);
+ do_exit(-EINTR);
}

self->full_name = create->full_name;
@@ -435,7 +435,7 @@ static int kthread(void *_create)
__kthread_parkme(self);
ret = threadfn(data);
}
- kthread_exit(ret);
+ do_exit(ret);
}

/* called from kernel_clone() to get node information for about to be created task */
@@ -738,7 +738,7 @@ EXPORT_SYMBOL_GPL(kthread_park);
* instead of calling wake_up_process(): the thread will exit without
* calling threadfn().
*
- * If threadfn() may call kthread_exit() itself, the caller must ensure
+ * If threadfn() may call do_exit() itself, the caller must ensure
* task_struct can't go away.
*
* Returns the result of threadfn(), or %-EINTR if wake_up_process()
diff --git a/kernel/module/main.c b/kernel/module/main.c
index c3ce106c70af..340b4dc5c692 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -228,7 +228,7 @@ static int mod_strncmp(const char *str_a, const char *str_b, size_t n)
void __noreturn __module_put_and_kthread_exit(struct module *mod, long code)
{
module_put(mod);
- kthread_exit(code);
+ do_exit(code);
}
EXPORT_SYMBOL(__module_put_and_kthread_exit);

diff --git a/lib/kunit/try-catch.c b/lib/kunit/try-catch.c
index d84a879f0a78..99d9603a2cfd 100644
--- a/lib/kunit/try-catch.c
+++ b/lib/kunit/try-catch.c
@@ -18,7 +18,7 @@
void __noreturn kunit_try_catch_throw(struct kunit_try_catch *try_catch)
{
try_catch->try_result = -EFAULT;
- kthread_exit(0);
+ do_exit(0);
}
EXPORT_SYMBOL_GPL(kunit_try_catch_throw);


---
base-commit: c107785c7e8dbabd1c18301a1c362544b5786282
change-id: 20260306-work-kernel-exit-2e8fd9e2c2a1

Christoph Hellwig

unread,
Mar 6, 2026, 9:44:26 AMMar 6
to Christian Brauner, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linu...@vger.kernel.org, b...@vger.kernel.org, linux-k...@vger.kernel.org, kuni...@googlegroups.com, Luis Chamberlain, Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Brendan Higgins, David Gow, Rae Moar, Christian Loehle, Linus Torvalds
More a comment on the previous patch, but I think exporting do_exit
which can work on any task is a really bad idea. I'd much rather re-add
a kthread_exit wrapper that checks PF_KTHREAD first and only export
that, which would obsolete this patch.

Linus Torvalds

unread,
Mar 6, 2026, 11:28:46 AMMar 6
to Christian Loehle, Christian Brauner, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow
On Fri, 6 Mar 2026 at 03:06, Christian Loehle <christia...@arm.com> wrote:
>
> FWIW this leaves the stale BTF reference:

Thanks. Applied.

Linus

Linus Torvalds

unread,
Mar 6, 2026, 1:27:48 PMMar 6
to Christoph Hellwig, Christian Brauner, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linu...@vger.kernel.org, b...@vger.kernel.org, linux-k...@vger.kernel.org, kuni...@googlegroups.com, Luis Chamberlain, Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Brendan Higgins, David Gow, Rae Moar, Christian Loehle
On Fri, 6 Mar 2026 at 06:44, Christoph Hellwig <h...@infradead.org> wrote:
>
> More a comment on the previous patch, but I think exporting do_exit
> which can work on any task is a really bad idea.

What is the advantage of a module only being able to do
kthread_exit(), and not able to do a regular do_exit()?

I think the only real advantage of having a "kthread_exit()" is that
it's a nicer name.

Because if that's the main issue, then I agree that "do_exit()" is
really not a great name, and it matches a very traditional "this is an
internal function" naming convention, and not typically a "this is
what random modules should use".

So kthread_exit() is a much better name, but it basically *has* to act
exactly like do_exit(), and adding a limitation to only work on
kthreads doesn't actually seem like an improvement.

Why make a function that is intentionally limited with no real
technical upside? It's not like there's any real reason why a module
couldn't call exit - we may not have exported it before, but we do
have code that looks like it *could* be a module that calls do_exit()
today.

For example, I'm looking at kernel/vhost_task.c, and the only users
are things that *are* modules, and it's not hugely obvious that
there's a big advantage to saying "that task handling has to be
built-in for those modules".

So my reaction is that "no, do_exit() is not a great name, but there's
no real technical upside to havign a separate kthread_exit()"
function.

If it's just about naming, maybe we could just unify it all and call
it "task_exit()" or something?

Linus

David Gow

unread,
Mar 8, 2026, 4:41:43 AM (13 days ago) Mar 8
to Christian Brauner, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linu...@vger.kernel.org, b...@vger.kernel.org, linux-k...@vger.kernel.org, kuni...@googlegroups.com, Luis Chamberlain, Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Brendan Higgins, David Gow, Rae Moar, Christian Loehle, Linus Torvalds
Le 06/03/2026 à 10:07 PM, 'Christian Brauner' via KUnit Development a
écrit :
> In 28aaa9c39945 ("kthread: consolidate kthread exit paths to prevent use-after-free")
> we folded kthread_exit() into do_exit() when we fixed a nasty UAF bug.
> We left kthread_exit() around as an alias to do_exit(). Remove it
> completely.
>
> Reported-by: Christian Loehle <christia...@arm.com>
> Link: https://lore.kernel.org/1ff1bce2-8bb4-463c...@arm.com
> Signed-off-by: Christian Brauner <bra...@kernel.org>
> ---

The KUnit bits of this look fine, and it all works for me.

(That being said, I agree that do_exit() isn't as nice a name.)

Acked-by: David Gow <da...@davidgow.net>

Christian Brauner

unread,
Mar 9, 2026, 5:37:04 AM (12 days ago) Mar 9
to Linus Torvalds, Christoph Hellwig, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linu...@vger.kernel.org, b...@vger.kernel.org, linux-k...@vger.kernel.org, kuni...@googlegroups.com, Luis Chamberlain, Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Brendan Higgins, David Gow, Rae Moar, Christian Loehle
I have that as a follow-up series... But I didn't want to muddle the two
patches as this one was meant as a clean-up for the kthread_exit()
leftover. Now that you've applied the fixup for the BTF thing directly I
can send both at the same time. Give me a few minutes.

And no, we should definitely not keep kthread_exit() as a separate thing
around. That just seems weird and clearly that has led to bugs before.

The vhost example is good for another reason: the line between a task
acting as a proper kthread and something that is aking to a kthread is
very very blurry by now. Let's ignore usermodehelpers for a minute
(ugh). There's io workers or more generalized - like vhost - user
workers which are hybrid workers that aren't clearly completely
userspace conceptually but also aren't clearly kthreads so I'd not want
to have special-cases for any of them. task_exit() just should to the
right thing no matter what exactly exits. We also have way to many
cleanup hooks that need to be called depending on what precisely exists
to split this over different helpers.

Christoph Hellwig

unread,
Mar 9, 2026, 11:43:50 AM (12 days ago) Mar 9
to Linus Torvalds, Christoph Hellwig, Christian Brauner, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linu...@vger.kernel.org, b...@vger.kernel.org, linux-k...@vger.kernel.org, kuni...@googlegroups.com, Luis Chamberlain, Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Brendan Higgins, David Gow, Rae Moar, Christian Loehle
On Fri, Mar 06, 2026 at 10:27:26AM -0800, Linus Torvalds wrote:
> On Fri, 6 Mar 2026 at 06:44, Christoph Hellwig <h...@infradead.org> wrote:
> >
> > More a comment on the previous patch, but I think exporting do_exit
> > which can work on any task is a really bad idea.
>
> What is the advantage of a module only being able to do
> kthread_exit(), and not able to do a regular do_exit()?

Because it can't fuck up the state of random tasks.

> I think the only real advantage of having a "kthread_exit()" is that
> it's a nicer name.

That's another one, but not the main point.

> For example, I'm looking at kernel/vhost_task.c, and the only users
> are things that *are* modules, and it's not hugely obvious that
> there's a big advantage to saying "that task handling has to be
> built-in for those modules".

That's always built in for various good reasons. As are other
random do_exit callers.

Linus Torvalds

unread,
Mar 9, 2026, 12:43:42 PM (12 days ago) Mar 9
to Christoph Hellwig, Christian Brauner, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linu...@vger.kernel.org, b...@vger.kernel.org, linux-k...@vger.kernel.org, kuni...@googlegroups.com, Luis Chamberlain, Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Brendan Higgins, David Gow, Rae Moar, Christian Loehle
On Mon, 9 Mar 2026 at 08:43, Christoph Hellwig <h...@infradead.org> wrote:
>
> Because it can't f&*^ up the state of random tasks.

Christoph, you make no sense.

"do_exit()" cannot mess up random tasks. It can only exit the
*current* task. The task that is running right now in that module.

And exiting a task is about the least effed up you can do to a task
when you are in kernel mode.

Compared to everything else you can do by mistake - like just
corrupting task state randomly - it's a very benign operation, *and*
it is obvious both in source code and in behavior. It's not like it is
some subtle operation.

I'd be *much* more worried about actual subtle bugs, not somebody
explicitly calling exit.

So what is the actual problem? No more random rants. Explain yourself
without making wild handwaving gestures.

Now, there are real exports in this area that are actually strange and
should be removed: for some historical reason we export 'free_task()'
which makes no sense to me at all (but probably did at some point).

Now *that* is a strange export that can mess up another task in major ways.

[ Ok, I was intrigued and I went and dug into history: we used to do
it in the oprofile driver many many moons ago. ]

And since I looked at the history of this all due to that odd export,
that also made it clear that historically we used to export
complete_and_exit(), which was this beauty:

NORET_TYPE void complete_and_exit(struct completion *comp, long code)
{
if (comp)
complete(comp);

do_exit(code);
}

so you could always do "do_exit()" by just doing
"complete_and_exit(NULL, code)".

And yes, that function was exported since at least 2003 (it was
exported even before that, under the name 'up_and_exit()', and that's
the point where I couldn't be bothered any more because it predates
even the old BK history).

Yes, it was indeed renamed to kthread_complete_and_exit() back in
2021, but that wasn't due to any fundamental "it has to work only on
kthreads". It was simply because nothing but kthreads used it - and
because that was also the time when kthread_exit() started doing
*extra* things over and beyond just the regular do_exit().

So it was a practical thing brought on by kthread changes, not some
kind of "exit is evil" thing.

And that "ktrhead_exit() does extra things" was the actual bug that
needed fixing and caused nasty memory corruption due to subtle lack of
cleanup when it was bypassed.

End result: we've never historically felt that exit() was somehow bad,
and when we limited it to kthreads and made it special, it caused
actual bugs.

Linus

Linus Torvalds

unread,
Mar 9, 2026, 1:35:35 PM (12 days ago) Mar 9
to Christoph Hellwig, Christian Brauner, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linu...@vger.kernel.org, b...@vger.kernel.org, linux-k...@vger.kernel.org, kuni...@googlegroups.com, Luis Chamberlain, Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Brendan Higgins, David Gow, Rae Moar, Christian Loehle
On Mon, 9 Mar 2026 at 09:37, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> Now, there are real exports in this area that are actually strange and
> should be removed: for some historical reason we export 'free_task()'
> which makes no sense to me at all (but probably did at some point).
>
> Now *that* is a strange export that can mess up another task in major ways.
>
> [ Ok, I was intrigued and I went and dug into history: we used to do
> it in the oprofile driver many many moons ago. ]

It looks like it should not only no longer be exported, it should
actually be static to kernel/fork.c. As far as I can tell, that
historic oprofile use was the only reason ever this was exposed in any
way.

IOW, a patch like the attached is probably a good idea.

Somebody should probably remind me next merge window (I'm not going to
make 7.0 bigger, and I find examples of people using
kretprobe/free_task, which *should* still work just fine with a static
function, but for all I know if the compiler inlines things it will
cause issues).

Or just add this patch to one of the trees scheduled for next merge
window. I'm not currently carrying a linux-next branch (I actively
deleted it because it caused problems due to messing up other peoples
linux-next branches, so if I ever introduce one I will have to come up
with a better name anyway)

Linus
patch.diff

Christoph Hellwig

unread,
Mar 10, 2026, 9:11:22 AM (11 days ago) Mar 10
to Linus Torvalds, Christoph Hellwig, Christian Brauner, linux-...@vger.kernel.org, linux-...@vger.kernel.org, linu...@vger.kernel.org, b...@vger.kernel.org, linux-k...@vger.kernel.org, kuni...@googlegroups.com, Luis Chamberlain, Petr Pavlu, Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Chuck Lever, Jeff Layton, NeilBrown, Olga Kornievskaia, Dai Ngo, Tom Talpey, Trond Myklebust, Anna Schumaker, Alexei Starovoitov, Daniel Borkmann, John Fastabend, Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Brendan Higgins, David Gow, Rae Moar, Christian Loehle
On Mon, Mar 09, 2026 at 09:37:26AM -0700, Linus Torvalds wrote:
> On Mon, 9 Mar 2026 at 08:43, Christoph Hellwig <h...@infradead.org> wrote:
> >
> > Because it can't f&*^ up the state of random tasks.
>
> Christoph, you make no sense.

Thanks, I always appreciated rational and detailed answers.

>
> "do_exit()" cannot mess up random tasks. It can only exit the
> *current* task. The task that is running right now in that module.

Yes. And random code should not be able to end arbitrary tasks.

> So what is the actual problem? No more random rants. Explain yourself
> without making wild handwaving gestures.

Wow, you're a bit aggressive, aren't you? Maybe take your meds in the
morning to stay a bit calmer. As said before I don't think allowing
random code (and module is a proxy for that) do exit a task. They
really should not be exiting random kthreads either, but certainly
not more than that.

> Now, there are real exports in this area that are actually strange and
> should be removed: for some historical reason we export 'free_task()'
> which makes no sense to me at all (but probably did at some point).
>
> Now *that* is a strange export that can mess up another task in major ways.

No argument about that, but that's not the point here.

Reply all
Reply to author
Forward
0 new messages