Hi Christian et al,
On 20/01/2026 15:52, Christian Brauner wrote:
> Mateusz reported performance penalties [1] during task creation because
> pidfs uses pidmap_lock to add elements into the rbtree. Switch to an
> rhashtable to have separate fine-grained locking and to decouple from
> pidmap_lock moving all heavy manipulations outside of it.
>
> Convert the pidfs inode-to-pid mapping from an rb-tree with seqcount
> protection to an rhashtable. This removes the global pidmap_lock
> contention from pidfs_ino_get_pid() lookups and allows the hashtable
> insert to happen outside the pidmap_lock.
>
> pidfs_add_pid() is split in two. pidfs_prepare_pid() allocates the inode
> number and initializes the pid fields and is called inside pidmap_lock.
> pidfs_add_pid() inserts the pid into the rhashtable and is called outside
> pidmap_lock. Insertion into the rhashtable can fail and may allocate
> memory, so it has to happen outside the spinlock.
>
> To guard against accidentally opening an already reaped task,
> pidfs_ino_get_pid() uses additional checks beyond pid_vnr(). If
> pid->attr is PIDFS_PID_DEAD or NULL, the pid either never had a pidfd
> or it already went through pidfs_exit(), i.e. the process has already
> been reaped. If pid->attr is valid, check PIDFS_ATTR_BIT_EXIT to figure
> out whether the task has exited.
>
> This slightly changes visibility semantics: pidfd creation is denied
> after pidfs_exit() runs, which is just before the pid number is removed
> via free_pid(). That should not be an issue, though.
>
> Link: https://lore.kernel.org/20251206131955....@gmail.com [1]
> Signed-off-by: Christian Brauner <bra...@kernel.org>
> ---
> Changes in v2:
> - Ensure that pid is removed before call_rcu() from pidfs.
> - Don't drop and reacquire spinlock.
> - Link to v1: https://patch.msgid.link/20260119-work-pidfs-rhas...@kernel.org
> ---
> fs/pidfs.c | 81 +++++++++++++++++++++------------------------------
> include/linux/pid.h | 4 +--
> include/linux/pidfs.h | 3 +-
> kernel/pid.c | 13 ++++++---
> 4 files changed, 46 insertions(+), 55 deletions(-)
[...]
> diff --git a/kernel/pid.c b/kernel/pid.c
> index ad4400a9f15f..6077da774652 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -43,7 +43,6 @@
> #include <linux/sched/task.h>
> #include <linux/idr.h>
> #include <linux/pidfs.h>
> -#include <linux/seqlock.h>
> #include <net/sock.h>
> #include <uapi/linux/pidfd.h>
>
> @@ -85,7 +84,6 @@ struct pid_namespace init_pid_ns = {
> EXPORT_SYMBOL_GPL(init_pid_ns);
>
> static __cacheline_aligned_in_smp DEFINE_SPINLOCK(pidmap_lock);
> -seqcount_spinlock_t pidmap_lock_seq = SEQCNT_SPINLOCK_ZERO(pidmap_lock_seq, &pidmap_lock);
>
> void put_pid(struct pid *pid)
> {
> @@ -141,9 +139,9 @@ void free_pid(struct pid *pid)
>
> idr_remove(&ns->idr, upid->nr);
> }
> - pidfs_remove_pid(pid);
> spin_unlock(&pidmap_lock);
>
> + pidfs_remove_pid(pid);
> call_rcu(&pid->rcu, delayed_put_pid);
> }
There appears to be a reproducible panic in RCU since next-20260216,
at least while running KUnit. After running a bisection, I found that
it first became visible at the merge commit that added this patch,
44e59e62b2a2 ("Merge branch 'kernel-7.0.misc' into vfs.all"). I then
narrowed it down further on a test branch by rebasing the pidfs series
on top of the last known working commit:
https://gitlab.com/gtucker/linux/-/commits/kunit-rcu-debug-rebased
I also did some initial investigation with basic printk debugging and
haven't found anything obviously wrong in this patch itself, although
I'm no expert in pidfs... One consistent symptom is that the kernel
panics because the function pointer to delayed_put_pid() becomes
corrupt. As a quick hack, if I just call put_pid() in free_pid() rather
than going through RCU, there's no panic - see the last commit on the
test branch linked above. The issue is still present in next-20260219
as far as I can tell.
Here's how to reproduce this, using the new container script and a
plain container image to run KUnit with QEMU on x86:
scripts/container -s -i docker.io/gtucker/korg-gcc:kunit -- \
    tools/testing/kunit/kunit.py run \
    --arch=x86_64 \
    --cross_compile=x86_64-linux-
The panic can be seen in .kunit/test.log:
[gtucker] rcu_do_batch:2609 count=7 func=ffffffff99026d40
Oops: invalid opcode: 0000 [#2] SMP NOPTI
CPU: 0 UID: 0 PID: 197 Comm: kunit_try_catch Tainted: G D N 6.19.0-09950-gc33cbc7ffae4 #77 PREEMPT(lazy)
Tainted: [D]=DIE, [N]=TEST
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.1 11/11/2019
RIP: 0010:0xffffffff99026d42
Looking at the last rcu callbacks that were enqueued with my extra
printk messages:
$ grep call_rcu .kunit/test.log | tail -n16
[gtucker] call_rcu include/linux/sched/task.h:159 put_task_struct ffffffff98887ae0
[gtucker] call_rcu include/linux/sched/task.h:159 put_task_struct ffffffff98887ae0
[gtucker] call_rcu include/linux/sched/task.h:159 put_task_struct ffffffff98887ae0
[gtucker] call_rcu lib/radix-tree.c:310 radix_tree_node_free ffffffff98ccc1a0
[gtucker] call_rcu lib/radix-tree.c:310 radix_tree_node_free ffffffff98ccc1a0
[gtucker] call_rcu include/linux/sched/task.h:159 put_task_struct ffffffff98887ae0
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0
[gtucker] call_rcu kernel/pid.c:148 free_pid ffffffff988adaf0
[gtucker] call_rcu kernel/exit.c:237 put_task_struct_rcu_user ffffffff9888e440
[gtucker] call_rcu lib/radix-tree.c:310 radix_tree_node_free ffffffff98ccc1a0
[gtucker] call_rcu lib/radix-tree.c:310 radix_tree_node_free ffffffff98ccc1a0
[gtucker] call_rcu kernel/pid.c:148 free_pid ffffffff988adaf0
[gtucker] call_rcu kernel/exit.c:237 put_task_struct_rcu_user ffffffff9888e440
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0
and then the ones that were called:
$ grep rcu_do_batch .kunit/test.log | tail
[gtucker] rcu_do_batch:2609 count=7 func=ffffffff98887ae0
[gtucker] rcu_do_batch:2609 count=8 func=ffffffff98887ae0
[gtucker] rcu_do_batch:2609 count=9 func=ffffffff98887ae0
[gtucker] rcu_do_batch:2609 count=1 func=ffffffff98ccc1a0
[gtucker] rcu_do_batch:2609 count=2 func=ffffffff98887ae0
[gtucker] rcu_do_batch:2609 count=3 func=ffffffff988b7cd0
[gtucker] rcu_do_batch:2609 count=4 func=ffffffff988b7cd0
[gtucker] rcu_do_batch:2609 count=5 func=ffffffff988b7cd0
[gtucker] rcu_do_batch:2609 count=6 func=ffffffff98ccc1a0
[gtucker] rcu_do_batch:2609 count=7 func=ffffffff99026d40
we can see that the last pointer ffffffff99026d40 was never enqueued,
and the one from free_pid() ffffffff988adaf0 was never dequeued.
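For what it's worth, that pairing can be cross-checked mechanically.
A throwaway script along these lines (just a sketch: it assumes the
printk format shown above, with the callback pointer as the last hex
token on each line, and inlines sample lines instead of reading
.kunit/test.log) does the set difference between enqueued and executed
callback pointers:

```python
import re

# Sample lines in the format of the printk output above; in practice
# these would come from grepping .kunit/test.log.
enqueued_log = """
[gtucker] call_rcu kernel/cred.c:83 __put_cred ffffffff988b7cd0
[gtucker] call_rcu kernel/pid.c:148 free_pid ffffffff988adaf0
"""

executed_log = """
[gtucker] rcu_do_batch:2609 count=5 func=ffffffff988b7cd0
[gtucker] rcu_do_batch:2609 count=7 func=ffffffff99026d40
"""

def ptrs(log):
    # The callback pointer is the trailing 16-digit hex token on each line.
    return set(re.findall(r"([0-9a-f]{16})\s*$", log, re.MULTILINE))

never_run = ptrs(enqueued_log) - ptrs(executed_log)       # enqueued, never dequeued
never_enqueued = ptrs(executed_log) - ptrs(enqueued_log)  # ran, never enqueued

print("never run:", never_run)            # free_pid's ffffffff988adaf0
print("never enqueued:", never_enqueued)  # the corrupt ffffffff99026d40
```

With the full log as input, this reports free_pid's ffffffff988adaf0
as enqueued-but-never-run and ffffffff99026d40 as run-but-never-enqueued,
matching the manual comparison above.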
This is where I stopped investigating, as the data looked legit and
someone else might have more clues as to what's going on here. I've
only seen the problem with this callback but, again, KUnit is a very
narrow kind of workload, so the root cause may well lie elsewhere.
Please let me know if you need any more debugging details or if I can
help test a fix. Hope this helps!
Cheers,
Guillaume