Re: [PATCH RFC v2] pidfs: convert rb-tree to rhashtable


Guillaume Tucker

Feb 20, 2026, 10:11:48 AM
to Christian Brauner, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow
Hi Christian et al,

On 20/01/2026 15:52, Christian Brauner wrote:
> Mateusz reported performance penalties [1] during task creation because
> pidfs uses pidmap_lock to add elements into the rbtree. Switch to an
> rhashtable to get separate fine-grained locking and to decouple from
> pidmap_lock, moving all heavy manipulations outside of it.
>
> Convert the pidfs inode-to-pid mapping from an rb-tree with seqcount
> protection to an rhashtable. This removes the global pidmap_lock
> contention from pidfs_ino_get_pid() lookups and allows the hashtable
> insert to happen outside the pidmap_lock.
>
> pidfs_add_pid() is split: pidfs_prepare_pid() allocates the inode
> number and initializes pid fields and is called inside pidmap_lock;
> pidfs_add_pid() inserts the pid into the rhashtable and is called
> outside pidmap_lock. Insertion into the rhashtable can fail and memory
> allocation may happen, so we need to drop the spinlock.
>
> To guard against accidentally opening an already reaped task
> pidfs_ino_get_pid() uses additional checks beyond pid_vnr(). If
> pid->attr is PIDFS_PID_DEAD or NULL the pid either never had a pidfd or
> it already went through pidfs_exit(), i.e. the process has already been
> reaped. If pid->attr is valid, check PIDFS_ATTR_BIT_EXIT to figure out
> whether the task has exited.
>
> This slightly changes visibility semantics: pidfd creation is denied
> after pidfs_exit() runs, which is just before the pid number is removed
> via free_pid(). That should not be an issue though.
>
> Link: https://lore.kernel.org/20251206131955....@gmail.com [1]
> Signed-off-by: Christian Brauner <bra...@kernel.org>
> ---
> Changes in v2:
> - Ensure that pid is removed before call_rcu() from pidfs.
> - Don't drop and reacquire spinlock.
> - Link to v1: https://patch.msgid.link/20260119-work-pidfs-rhas...@kernel.org
> ---
> fs/pidfs.c | 81 +++++++++++++++++++++------------------------------
> include/linux/pid.h | 4 +--
> include/linux/pidfs.h | 3 +-
> kernel/pid.c | 13 ++++++---
> 4 files changed, 46 insertions(+), 55 deletions(-)

[...]

> diff --git a/kernel/pid.c b/kernel/pid.c
> index ad4400a9f15f..6077da774652 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -43,7 +43,6 @@
>  #include <linux/sched/task.h>
>  #include <linux/idr.h>
>  #include <linux/pidfs.h>
> -#include <linux/seqlock.h>
>  #include <net/sock.h>
>  #include <uapi/linux/pidfd.h>
>
> @@ -85,7 +84,6 @@ struct pid_namespace init_pid_ns = {
>  EXPORT_SYMBOL_GPL(init_pid_ns);
>
>  static __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
> -seqcount_spinlock_t pidmap_lock_seq = SEQCNT_SPINLOCK_ZERO(pidmap_lock_seq, &pidmap_lock);
>
>  void put_pid(struct pid *pid)
>  {
> @@ -141,9 +139,9 @@ void free_pid(struct pid *pid)
>
>  		idr_remove(&ns->idr, upid->nr);
>  	}
> -	pidfs_remove_pid(pid);
>  	spin_unlock(&pidmap_lock);
>
> +	pidfs_remove_pid(pid);
>  	call_rcu(&pid->rcu, delayed_put_pid);
>  }

There appears to be a reproducible panic in rcu since at least
next-20260216 while running KUnit. After running a bisection I found
that it became visible at merge commit 44e59e62b2a2 ("Merge branch
'kernel-7.0.misc' into vfs.all"), which brings in this patch. I then
narrowed it down further on a test branch by rebasing the pidfs series
on top of the last known working commit:

https://gitlab.com/gtucker/linux/-/commits/kunit-rcu-debug-rebased

I also did some initial investigation with basic printk debugging and
haven't found anything obviously wrong in this patch itself, although
I'm no expert in pidfs... One symptom is that the kernel panic
always happens because the function pointer to delayed_put_pid()
becomes corrupt. As a quick hack, if I just call put_pid() in
free_pid() rather than going through RCU then there's no panic - see
the last commit on the test branch from the link above. The issue is
still present in next-20260219 as far as I can tell.

Here's how to reproduce this, using the new container script and a
plain container image to run KUnit with QEMU on x86:

scripts/container -s -i docker.io/gtucker/korg-gcc:kunit -- \
tools/testing/kunit/kunit.py \
run \
--arch=x86_64 \
--cross_compile=x86_64-linux-

The panic can be seen in .kunit/test.log:

[gtucker] rcu_do_batch:2609 count=7 func=ffffffff99026d40
Oops: invalid opcode: 0000 [#2] SMP NOPTI
CPU: 0 UID: 0 PID: 197 Comm: kunit_try_catch Tainted: G D N 6.19.0-09950-gc33cbc7ffae4 #77 PREEMPT(lazy)
Tainted: [D]=DIE, [N]=TEST
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.1 11/11/2019
RIP: 0010:0xffffffff99026d42

Looking at the last rcu callbacks that were enqueued with my extra
printk messages:

$ grep call_rcu .kunit/test.log | tail -n16
[gtucker] call_rcu include/linux/sched/task.h:159 put_task_struct ffffffff98887ae0
[gtucker] call_rcu include/linux/sched/task.h:159 put_task_struct ffffffff98887ae0
[gtucker] call_rcu include/linux/sched/task.h:159 put_task_struct ffffffff98887ae0
[gtucker] call_rcu lib/radix-tree.c:310 radix_tree_node_free ffffffff98ccc1a0
[gtucker] call_rcu lib/radix-tree.c:310 radix_tree_node_free ffffffff98ccc1a0
[gtucker] call_rcu include/linux/sched/task.h:159 put_task_struct ffffffff98887ae0
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0
[gtucker] call_rcu kernel/pid.c:148 free_pid ffffffff988adaf0
[gtucker] call_rcu kernel/exit.c:237 put_task_struct_rcu_user ffffffff9888e440
[gtucker] call_rcu lib/radix-tree.c:310 radix_tree_node_free ffffffff98ccc1a0
[gtucker] call_rcu lib/radix-tree.c:310 radix_tree_node_free ffffffff98ccc1a0
[gtucker] call_rcu kernel/pid.c:148 free_pid ffffffff988adaf0
[gtucker] call_rcu kernel/exit.c:237 put_task_struct_rcu_user ffffffff9888e440
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0

and then the ones that were called:

$ grep rcu_do_batch .kunit/test.log | tail
[gtucker] rcu_do_batch:2609 count=7 func=ffffffff98887ae0
[gtucker] rcu_do_batch:2609 count=8 func=ffffffff98887ae0
[gtucker] rcu_do_batch:2609 count=9 func=ffffffff98887ae0
[gtucker] rcu_do_batch:2609 count=1 func=ffffffff98ccc1a0
[gtucker] rcu_do_batch:2609 count=2 func=ffffffff98887ae0
[gtucker] rcu_do_batch:2609 count=3 func=ffffffff988b7cd0
[gtucker] rcu_do_batch:2609 count=4 func=ffffffff988b7cd0
[gtucker] rcu_do_batch:2609 count=5 func=ffffffff988b7cd0
[gtucker] rcu_do_batch:2609 count=6 func=ffffffff98ccc1a0
[gtucker] rcu_do_batch:2609 count=7 func=ffffffff99026d40

we can see that the last pointer ffffffff99026d40 was never enqueued,
and the one from free_pid() ffffffff988adaf0 was never dequeued.
This is where I stopped investigating as it looked legit and someone
else might have more clues as to what's going on here. I've only
seen the problem with this callback but again, KUnit is a very narrow
kind of workload so the root cause may well be lying elsewhere.

Please let me know if you need any more debugging details or if I can
help test a fix. Hope this helps!

Cheers,
Guillaume

Christian Brauner

Feb 24, 2026, 8:22:40 AM
to Guillaume Tucker, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow
Thanks for the report. So far I have no idea how that can happen:

* Is this reproducible with multiple compilers?
* Is this reproducible on v7.0-rc1?

Fwiw, we're missing one check currently:

diff --git a/kernel/pid.c b/kernel/pid.c
index 3b96571d0fe6..dc11acd445ae 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -326,7 +326,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *arg_set_tid,
 
 	retval = pidfs_add_pid(pid);
 	if (unlikely(retval)) {
-		free_pid(pid);
+		if (pid != &init_struct_pid)
+			free_pid(pid);
 		pid = ERR_PTR(-ENOMEM);
 	}

But it seems unlikely that pidfs_add_pid() fails here.

Christian Brauner

Feb 24, 2026, 11:25:29 AM
to Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow, Linus Torvalds
Ugh, yuck squared.

IIUC, the bug is a UAF in free_kthread_struct(). It wasn't easily
detectable until the pidfs rhashtable conversion changed struct pid's
size and field layout.

The rhashtable conversion replaced struct pid's struct rb_node of 24
bytes with struct rhash_head of 8 bytes, shrinking struct pid by 16
bytes to 144 bytes. This makes it the same size as struct kthread
(without CONFIG_BLK_CGROUP). Both structs use SLAB_HWCACHE_ALIGN,
bringing both to 192 bytes. KUnit sets CONFIG_SLAB_MERGE_DEFAULT=y.
So now struct pid and struct kthread share a slab cache. First part
of the puzzle.

struct pid.rcu.func is at offset 0x78 and struct kthread.affinity_node
at offset 0x78. If I'm right then we can already see where this is
going.

So free_kthread_struct() calls kfree(kthread) without checking whether
the kthread's affinity_node is still linked in kthread_affinity_list.

If a kthread exits via a path that bypasses kthread_exit() (e.g.,
make_task_dead() after an oops -- which calls do_exit() directly),
the affinity_node remains in the global kthread_affinity_list. When
free_kthread_struct() later frees the kthread struct, the linked list
still references the freed memory. Any subsequent list_del() by another
kthread in kthread_exit() writes to the freed memory:

    list_del(&kthread->affinity_node):
        entry->prev->next = entry->next // writes to the freed predecessor's offset 0x78

With cache merging, this freed kthread memory may have been reused for a
struct pid. The write corrupts pid.rcu.func at the same offset 0x78,
replacing delayed_put_pid with &kthread_affinity_list. The next RCU
callback invocation is fscked.

* Setting SLAB_NO_MERGE makes the bug go away.
* Turning on
      CONFIG_KASAN=y
      CONFIG_KASAN_GENERIC=y
      CONFIG_KASAN_INLINE=y
  yields:
BUG: KASAN: slab-use-after-free in kthread_exit+0x255/0x280
Write of size 8 at addr ffff888003238f78 by task kunit_try_catch/183

Allocated by task 169:
set_kthread_struct+0x31/0x150
copy_process+0x203d/0x4c00

<snip>

Freed by task 0:
free_task+0x147/0x3a0
rcu_core+0x58e/0x1c50

<snip>

Object at ffff888003238f00 belongs to the kmalloc-192 cache.
Allocated by set_kthread_struct() -> kzalloc(sizeof(*kthread)).
Freed by free_task() -> free_kthread_struct() -> kfree().

Written to at offset 0x78 (affinity_node.next) by another kthread's
list_del in kthread_exit().

Fix should be something like:

void free_kthread_struct(struct task_struct *k)
{
	struct kthread *kthread;

	kthread = to_kthread(k);
	if (!kthread)
		return;

+	if (!list_empty(&kthread->affinity_node)) {
+		mutex_lock(&kthread_affinity_lock);
+		list_del(&kthread->affinity_node);
+		mutex_unlock(&kthread_affinity_lock);
+	}
+	if (kthread->preferred_affinity)
+		kfree(kthread->preferred_affinity);

#ifdef CONFIG_BLK_CGROUP
	WARN_ON_ONCE(kthread->blkcg_css);
#endif
	k->worker_private = NULL;
	kfree(kthread->full_name);
	kfree(kthread);
}

The normal kthread_exit() path already unlinks the node. After
list_del(), the node's pointers are set to LIST_POISON1/LIST_POISON2, so
list_empty() returns false. To avoid a double-unlink, kthread_exit()
should use list_del_init() instead of list_del(), so that
free_kthread_struct()'s list_empty() check correctly detects the
already-unlinked state.

I had claude reproduce this under various conditions because I'm a lazy fsck.

Mark Brown

Feb 24, 2026, 11:37:13 AM
to Christian Brauner, Guillaume Tucker, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, kuni...@googlegroups.com, David Gow
On Tue, Feb 24, 2026 at 02:22:31PM +0100, Christian Brauner wrote:
FWIW I've been seeing RCU stalls and crashes in KUnit while building
-next which I believe are the same as Guillaume is reporting; these
have reproduced on multiple arches using Debian GCC 14 and with LLVM
for me. Example of my command line:

./tools/testing/kunit/kunit.py run --arch arm64 --cross_compile=aarch64-linux-gnu-

> * Is this reproducible on v7.0-rc1?

The issues I'm seeing certainly are. I just tried testing on this
specific commit (802182490445f6bcf5de0e0518fb967c2afb6da1) and managed
to have KUnit complete though...

> Fwiw, we're missing one check currently:

...

> But it seems unlikely that pidfs_add_pid() fails here.

That doesn't seem to have any impact for me. I suspect that there may
be multiple interacting changes triggering the issue (or making it more
likely to occur) which isn't helping anything :/ See also:

https://lore.kernel.org/r/59e8c352-7bc8-4358...@davidgow.net

Mark Brown

Feb 24, 2026, 12:04:13 PM
to Christian Brauner, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, kuni...@googlegroups.com, David Gow, Linus Torvalds
On Tue, Feb 24, 2026 at 05:25:21PM +0100, Christian Brauner wrote:

> Ugh, yuck squared.

> IIUC, the bug is a UAF in free_kthread_struct(). It wasn't easy
> detectable until the pidfs rhashtable conversion changed struct pid's
> size and field layout.

Eeew, indeed. Thanks for figuring this out!
Confirmed that the above patch plus the list_del_init() change seems to
fix the issue for me; the full patch I tested with is below.
Feel free to add:

Tested-by: Mark Brown <bro...@kernel.org>

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 20451b624b67..3778fcbc56db 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -147,6 +147,13 @@ void free_kthread_struct(struct task_struct *k)
 	kthread = to_kthread(k);
 	if (!kthread)
 		return;
+	if (!list_empty(&kthread->affinity_node)) {
+		mutex_lock(&kthread_affinity_lock);
+		list_del(&kthread->affinity_node);
+		mutex_unlock(&kthread_affinity_lock);
+	}
+	if (kthread->preferred_affinity)
+		kfree(kthread->preferred_affinity);
 
 #ifdef CONFIG_BLK_CGROUP
 	WARN_ON_ONCE(kthread->blkcg_css);
@@ -325,7 +332,7 @@ void __noreturn kthread_exit(long result)
 	kthread->result = result;
 	if (!list_empty(&kthread->affinity_node)) {
 		mutex_lock(&kthread_affinity_lock);
-		list_del(&kthread->affinity_node);
+		list_del_init(&kthread->affinity_node);
 		mutex_unlock(&kthread_affinity_lock);
 
 		if (kthread->preferred_affinity) {

Guillaume Tucker

Feb 24, 2026, 1:27:18 PM
to Mark Brown, Christian Brauner, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, kuni...@googlegroups.com, David Gow, Linus Torvalds
I second Mark's feedback. I only used GCC 14.2 on x86_64 (see my
original email with a container image). I confirm it is reproducible
on v7.0-rc1 and that the patch below fixes the issue. So for what
it's worth, feel free to add:

Tested-by: Guillaume Tucker <gtu...@gtucker.io>

Cheers,
Guillaume


PS: It also happens to be a great first result for my new automated
bisection tool :) I'll add some post-processing with debug configs
turned on to automate typical sanity checks e.g. KASAN and builds
with alternative compilers and architectures.

Linus Torvalds

Feb 24, 2026, 2:31:19 PM
to Christian Brauner, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow
On Tue, 24 Feb 2026 at 08:25, Christian Brauner <bra...@kernel.org> wrote:
>
> If a kthread exits via a path that bypasses kthread_exit() (e.g.,
> make_task_dead() after an oops -- which calls do_exit() directly),
> the affinity_node remains in the global kthread_affinity_list. When
> free_kthread_struct() later frees the kthread struct, the linked list
> still references the freed memory. Any subsequent list_del() by another
> kthread in kthread_exit() writes to the freed memory:

Ugh.

So this is nasty, but I really detest the suggested fix. It just
smells wrong to have that affinity_node cleanup done in two different
places depending on how the exit is done.

IOW, I think the proper fix would be to just make sure that
kthread_exit() isn't actually ever bypassed.

Because looking at this, there are other issues with do_exit() killing
a kthread - it currently also means that kthread->result randomly
doesn't get set, for example, so kthread_stop() would appear to
basically return garbage.

No, nobody likely cares about the kthread_stop() return value for that
case, but it's an example of the same kind of "two different exit
paths, inconsistent data structures" issue.

How about something like the attached, in other words?

NOTE NOTE NOTE! This is *entirely* untested. It might do unspeakable
things to your pets, so please check it. I'm sending this patch out as
a "I really would prefer this kind of approach" example, not as
anything more than that.

Because I really think the core fundamental problem was that there
were two different exit paths that did different things, and we
shouldn't try to fix the symptoms of that problem, but instead really
fix the core issue.

Hmm?

Side note: while writing this suggested patch, I do note that this
comment is wrong:

* When "(p->flags & PF_KTHREAD)" is set the task is a kthread and will
* always remain a kthread. For kthreads p->worker_private always
* points to a struct kthread. For tasks that are not kthreads
* p->worker_private is used to point to other things.

because 'init_task' is marked as PF_KTHREAD, but does *not* have a
p->worker_private.

Anyway, that doesn't affect this particular code, but it might be
worth thinking about.

Linus
patch.diff

David Gow

Feb 25, 2026, 11:09:03 PM
to Christian Brauner, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, Linus Torvalds
On 25/02/2026 at 12:25 AM, Christian Brauner wrote:
> On Tue, Feb 24, 2026 at 02:22:31PM +0100, Christian Brauner wrote:

(...snip...)
Aha... that makes sense. What a mess!

>
> Fix should be something like:
>
> void free_kthread_struct(struct task_struct *k)
> {
> 	struct kthread *kthread;
>
> 	kthread = to_kthread(k);
> 	if (!kthread)
> 		return;
>
> +	if (!list_empty(&kthread->affinity_node)) {
> +		mutex_lock(&kthread_affinity_lock);
> +		list_del(&kthread->affinity_node);
> +		mutex_unlock(&kthread_affinity_lock);
> +	}
> +	if (kthread->preferred_affinity)
> +		kfree(kthread->preferred_affinity);
>
> #ifdef CONFIG_BLK_CGROUP
> 	WARN_ON_ONCE(kthread->blkcg_css);
> #endif
> 	k->worker_private = NULL;
> 	kfree(kthread->full_name);
> 	kfree(kthread);
> }
>
>

This fixes the KUnit issues for me, too, so:

Tested-by: David Gow <da...@davidgow.net>

(But so does Linus' patch, which is probably the nicer long-term solution.)

Cheers,
-- David

David Gow

Feb 25, 2026, 11:09:15 PM
to Linus Torvalds, Christian Brauner, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com
FWIW, the attached patch fixes the issue for me, and my machines have
been stable running it this morning (+ reverting the firewire merge
due to [1]), so the idea seems pretty sound.

Tested-by: David Gow <da...@davidgow.net>

Cheers,
-- David

[1]:
https://lore.kernel.org/all/CAHk-=wgUmxkjwsWzX1rTo=DTnaouwY-VT8Bj...@mail.gmail.com/

Christian Brauner

Feb 26, 2026, 4:48:02 AM
to Linus Torvalds, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow
Oh nice.
I was kinda hoping Tejun would jump on this one and so just pointed to
one potential way to fix it but didn't really spend time on it.

Anyway, let's just take what you proposed and slap a commit message on
it. Fwiw, init_task does have ->worker_private; it just gets set later
during sched_init():

/*
* The idle task doesn't need the kthread struct to function, but it
* is dressed up as a per-CPU kthread and thus needs to play the part
* if we want to avoid special-casing it in code that deals with per-CPU
* kthreads.
*/
WARN_ON(!set_kthread_struct(current));

I think that @current here is misleading. When sched_init() runs it
should still be single-threaded and current == &init_task. So that
set_kthread_struct(current) call sets @init_task's worker_private iiuc.

Patch appended. I'll stuff it into vfs.fixes.
0001-kthread-consolidate-kthread-exit-paths-to-prevent-us.patch

Linus Torvalds

11:55 AM
to Christian Brauner, Guillaume Tucker, Tejun Heo, Mateusz Guzik, Alexander Viro, Jan Kara, linux-...@vger.kernel.org, Mark Brown, kuni...@googlegroups.com, David Gow
On Thu, 26 Feb 2026 at 01:48, Christian Brauner <bra...@kernel.org> wrote:
>
> Anyway, let's just take what you proposed and slap a commit message on
> it.

I will look forward to getting this patch through the normal channels
and I have removed it from my "misc tree of random patches".

> Fwiw, init_task does have ->worker_private it just gets set later
> during sched_init():

Ahh. That was not very obvious indeed.

Linus