Hi Nam,
I've become very interested in using RV (Runtime Verification) to
proactively detect "sleep in atomic" scenarios on PREEMPT_RT kernels.
Specifically, I'm looking for ways to find cases where sleeping
spinlocks or memory allocations are used within preemption-disabled or
irq-disabled contexts, and while searching for solutions I came across
the RV subsystem.
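To make the target concrete, here is a minimal, hypothetical sketch of
the kind of pattern I'm hoping to catch (the names are made up; this is
not taken from any of the reports below):

#include <linux/slab.h>
#include <linux/spinlock.h>

/*
 * On PREEMPT_RT, spin_lock() becomes a sleeping rtmutex and GFP_KERNEL
 * allocations may sleep, so calling either one inside a raw_spinlock_t
 * or preempt/irq-disabled section is a "sleeping function called from
 * invalid context" bug.
 */
static DEFINE_RAW_SPINLOCK(hw_lock);  /* stays a true spinning lock on RT */
static DEFINE_SPINLOCK(obj_lock);     /* sleeping rtmutex on RT */

static void broken_on_rt(void)
{
        unsigned long flags;
        void *buf;

        raw_spin_lock_irqsave(&hw_lock, flags);  /* atomic context from here */

        spin_lock(&obj_lock);            /* may sleep on RT -> bug */
        buf = kmalloc(64, GFP_KERNEL);   /* may sleep -> bug as well */
        kfree(buf);
        spin_unlock(&obj_lock);

        raw_spin_unlock_irqrestore(&hw_lock, flags);
}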
I tested it as follows, and I have a few questions.
# cat /sys/kernel/tracing/rv/available_monitors
wwnr
rtapp
rtapp:sleep
# cat /sys/kernel/tracing/rv/available_reactors
nop
printk
panic
# echo printk > /sys/kernel/tracing/rv/monitors/rtapp/sleep/reactors
# cat /sys/kernel/tracing/rv/monitors/rtapp/sleep/enable
1
# echo rtapp:sleep > /sys/kernel/tracing/rv/enabled_monitors
> [192735.309072] [ T6957] rv: sleep: multipathd[6957]: violation detected
# echo panic > /sys/kernel/tracing/rv/monitors/rtapp/sleep/reactors
> [ T6957] Kernel panic - not syncing: rv: sleep: multipathd[6957]: violation detected
> [193521.768666][ T6957] CPU: 4 UID: 0 PID: 6957 Comm: multipathd Not tainted 6.17.0-rc3-g39f90c196721 #1 PREEMPT_{RT,(full)}
> [193521.771727][ T6957] Hardware name: QEMU KVM Virtual Machine, BIOS 2025.02-8ubuntu3 10/08/2025
> [193521.774126][ T6957] Call trace:
> [193521.774998][ T6957] show_stack+0x2c/0x3c (C)
> [193521.776281][ T6957] __dump_stack+0x30/0x40
> [193521.777523][ T6957] dump_stack_lvl+0x34/0x2bc
> [193521.778797][ T6957] dump_stack+0x1c/0x48
> [193521.779984][ T6957] vpanic+0x220/0x618
> [193521.781211][ T6957] oom_killer_enable+0x0/0x30
> [193521.782512][ T6957] ltl_validate+0x7ac/0xb1c
> [193521.783870][ T6957] ltl_atom_update+0xd0/0x32c
> [193521.785198][ T6957] handle_sched_set_state+0xb8/0x12c
> [193521.786773][ T6957] __trace_set_current_state+0x128/0x174
> [193521.788450][ T6957] do_nanosleep+0x128/0x2a4
> [193521.789731][ T6957] hrtimer_nanosleep+0xb4/0x160
> [193521.791167][ T6957] common_nsleep+0x6c/0x84
> [193521.792404][ T6957] __arm64_sys_clock_nanosleep+0x1a8/0x1f0
> [193521.794031][ T6957] invoke_syscall+0x64/0x168
> [193521.795353][ T6957] el0_svc_common+0x134/0x164
> [193521.796707][ T6957] do_el0_svc+0x2c/0x3c
> [193521.797897][ T6957] el0_svc+0x58/0x184
> [193521.799048][ T6957] el0t_64_sync_handler+0x84/0x12c
> [193521.800514][ T6957] el0t_64_sync+0x1b8/0x1bc
> [193521.801818][ T6957] SMP: stopping secondary CPUs
> [193521.803320][ T6957] Dumping ftrace buffer:
> [193521.804510][ T6957] (ftrace buffer empty)
> [193521.805848][ T6957] Kernel Offset: disabled
> [193521.807084][ T6957] CPU features: 0xc0000,00007800,149a3161,357ff667
> [193521.808941][ T6957] Memory Limit: none
> [193522.655297][ T6957] Rebooting in 86400 seconds..
Here are my questions:
1. Does the rtapp:sleep monitor proactively detect scenarios that
could lead to sleeping in atomic context, perhaps before
CONFIG_DEBUG_ATOMIC_SLEEP (which I have enabled) would trigger at the
actual point of sleeping?
2. Is there a way to enable this monitor (e.g., rtapp:sleep) as soon
as the RV subsystem is initialized during boot? (In other words, how
can I make it enabled by default?)
3. When a "violation detected" message occurs at runtime, is it
possible to get the call stack of the code path that triggered the
violation? The panic reactor provides a full stack trace, but I'm
wondering whether this is also possible with the printk reactor.
Here is some background on why I'm so interested in this topic:
Recently, I was fuzzing the PREEMPT_RT kernel with syzkaller, but ran
into issues where fuzzing would not proceed smoothly. It turned out to
be a problem in the kcov USB API, which was fixed by Sebastian's patch
after I reported it:
[PATCH] kcov, usb: Don't disable interrupts in kcov_remote_start_usb_softirq()
-
https://lore.kernel.org/all/20250811082...@linutronix.de/
After this fix, syzkaller fuzzing ran well and was able to detect several
runtime "sleep in atomic context" bugs:
[PATCH] USB: gadget: dummy-hcd: Fix locking bug in RT-enabled kernels
-
https://lore.kernel.org/all/bb192ae2-4eee-48ee...@rowland.harvard.edu/
[BUG] usbip: vhci: Sleeping function called from invalid context in
vhci_urb_enqueue on PREEMPT_RT
-
https://lore.kernel.org/all/c6c17f0d-b71d-4a44...@kzalloc.com/
This led me to research ways of finding these issues proactively
through static analysis, and I wrote some regex and Coccinelle scripts
to detect them:
[BUG] gfs2: sleeping lock in gfs2_quota_init() with preempt disabled
on PREEMPT_RT
-
https://lore.kernel.org/all/20250812103...@linutronix.de/t/#u
[PATCH] md/raid5-ppl: Fix invalid context sleep in
ppl_io_unit_finished() on PREEMPT_RT
-
https://lore.kernel.org/all/f2dbf110-e2a7-4101...@kernel.org/t/#u
Tomas, the author of the rtlockscope project, also gave me some
valuable insights into this static analysis approach:
Re: [WIP] coccinelle: rt: Add coccicheck on sleep in atomic context on
PREEMPT_RT
-
https://lore.kernel.org/all/CAP4=nvTOE9W+6UtVZ5-5gAoYeEQ...@mail.gmail.com/
Thank you!
Best regards,
Yunseong Kim