[Bug 209219] New: KSHAKER: scheduling/execution timing perturbations


bugzill...@bugzilla.kernel.org

Sep 10, 2020, 4:32:41 AM
to kasa...@googlegroups.com
https://bugzilla.kernel.org/show_bug.cgi?id=209219

Bug ID: 209219
Summary: KSHAKER: scheduling/execution timing perturbations
Product: Memory Management
Version: 2.5
Kernel Version: ALL
Hardware: All
OS: Linux
Tree: Mainline
Status: NEW
Severity: enhancement
Priority: P1
Component: Sanitizers
Assignee: mm_san...@kernel-bugs.kernel.org
Reporter: dvy...@google.com
CC: kasa...@googlegroups.com
Regression: No

Two recent bug examples:
https://lore.kernel.org/netdev/20200908104025.4...@google.com/
https://lore.kernel.org/netdev/20200908145911.4...@google.com/
In both cases the race window is extremely narrow. In the first case I suspect
it's not just the narrow window: the typical scheduling of events is such that
the UAF (use-after-free) won't happen at all. Namely here:

- ieee802154_xmit_complete(&local->hw, skb, false);
-
dev->stats.tx_packets++;
dev->stats.tx_bytes += skb->len;

+ ieee802154_xmit_complete(&local->hw, skb, false);
+

The dev is _usually_ not freed by the call to ieee802154_xmit_complete. But the
bug is very straightforward (we literally free the object and use it after
that); it was introduced in 2014(!) and has gone unnoticed since.
The other one was present in the initial WireGuard implementation and has gone
unnoticed since then as well.
There are surely many more examples like this: most of the bugs that happen
only a few times and don't have reproducers.

The proposal is to introduce artificial random delays into execution and/or
some atypical scheduler perturbations. There are some sound approaches for
systematic enumeration of all possible executions (or specific subsets of
executions), but that's probably not feasible for the kernel. Just some random
(maybe somewhat intelligently random) perturbations should be good enough for
starters.

For race-free programs it's enough to introduce delays only before
synchronization actions (atomic/lock operations). Any delays between local
actions can't lead to observable behavior differences. Now the kernel is not
race-free, so it does not have this nice property. But we probably still want
to start with introducing delays only before synchronization actions; that's
still a good oracle.
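
A minimal user-space sketch of "delay before a synchronization action" could
look like the following. It is only an illustration: struct kshaker_state,
kshaker_sync_point() and shaken_fetch_add() are hypothetical names, not
existing kernel interfaces, and the busy loop stands in for whatever delay
primitive a real implementation would pick.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-task state; none of these names exist in the kernel. */
struct kshaker_state {
	uint64_t rng;       /* xorshift64 state, must be non-zero */
	uint32_t one_in;    /* delay roughly once per one_in sync points, 0 = off */
	uint32_t max_spins; /* upper bound on the length of one delay */
	uint64_t delays;    /* number of delays injected so far */
};

/* xorshift64 step: cheap pseudo-random numbers from a caller-chosen seed. */
static uint64_t kshaker_next(struct kshaker_state *s)
{
	uint64_t x = s->rng;

	x ^= x << 13;
	x ^= x >> 7;
	x ^= x << 17;
	s->rng = x;
	return x;
}

/*
 * Called right before a synchronization action.
 * Returns the number of spins spent delaying (0 if no delay was injected).
 */
static uint32_t kshaker_sync_point(struct kshaker_state *s)
{
	uint32_t spins;

	if (s->one_in == 0 || s->max_spins == 0)
		return 0;
	if (kshaker_next(s) % s->one_in != 0)
		return 0;

	spins = 1 + (uint32_t)(kshaker_next(s) % s->max_spins);
	for (volatile uint32_t i = 0; i < spins; i++)
		; /* stand-in for a real delay or a voluntary reschedule */
	s->delays++;
	return spins;
}

/* Example wrapper: perturb timing first, then do the atomic operation. */
static int shaken_fetch_add(struct kshaker_state *s, _Atomic int *v, int n)
{
	kshaker_sync_point(s);
	return atomic_fetch_add(v, n);
}
```

Local (non-synchronizing) code is deliberately left untouched here, which
matches the race-free argument above; covering racy accesses would need more
hook points.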

We already have some instrumentation hooks in atomic ops. Not sure about locks
(maybe something like might_sleep() will do?).

This should be useful for any automated/manual testing/fuzzing.
The proposed name: KSHAKER.


bugzill...@bugzilla.kernel.org

Sep 10, 2020, 4:50:57 AM
to kasa...@googlegroups.com
https://bugzilla.kernel.org/show_bug.cgi?id=209219

--- Comment #1 from Marco Elver (el...@google.com) ---
We had discussed a potential implementation, specifically "NMI injection" as
the means to inject such delays. It's summarized here:
https://github.com/google/syzkaller/issues/1891

Having an interface from userspace to inject NMIs that simply add a delay would
enable e.g. syzkaller to generate programs that include such injected delays.

There may be alternative designs as you propose as well, but using NMIs gives
us arbitrary delay-injection points.

bugzill...@bugzilla.kernel.org

Sep 10, 2020, 5:05:02 AM
to kasa...@googlegroups.com
https://bugzilla.kernel.org/show_bug.cgi?id=209219

--- Comment #2 from Dmitry Vyukov (dvy...@google.com) ---
Right.
I am not sure it's possible to inject NMIs anywhere besides qemu/kvm and
special dev boards (i.e. not syzbot). But even there it's controlled from
outside of the machine, while we want to control this from inside.
Even if we expose a special kernel interface inside of the machine, it won't be
possible to achieve the right granularity. E.g. on a machine with 1 CPU,
user-space can't issue the request until the executing kernel code is preempted
for other reasons at an uncontrollable point. And at that point it's too late
to preempt it: it's already preempted.

Having this in the kernel in a cooperative way seems to provide much better
portability, precision and effectiveness.

bugzill...@bugzilla.kernel.org

Sep 10, 2020, 7:20:52 AM
to kasa...@googlegroups.com
https://bugzilla.kernel.org/show_bug.cgi?id=209219

--- Comment #3 from Marco Elver (el...@google.com) ---
Another idea: if the places where we would be interested in inserting delays
are limited, we could use kprobes.

bugzill...@bugzilla.kernel.org

Sep 10, 2020, 8:09:21 AM
to kasa...@googlegroups.com
https://bugzilla.kernel.org/show_bug.cgi?id=209219

--- Comment #4 from Dmitry Vyukov (dvy...@google.com) ---
I think one type of systematic testing is feasible as well. Namely,
one-factor enumeration like we do for fault injection: delay the first point in
a syscall, then the second, then the 3rd and so on until we enumerate all of
them. This will require some debugfs interface to arm this per-task and query
if the delay was injected or not.
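
The enumeration loop can be sketched in plain C as below. This is a toy model
under stated assumptions: struct delay_nth and its helpers are hypothetical,
and the inner loop merely stands in for re-running the same syscall. The
"injected" flag plays the role of the query described above: the first run in
which nothing was injected means all points have been enumerated.

```c
#include <stdbool.h>

/* Hypothetical per-task arming state (illustrative only). */
struct delay_nth {
	unsigned int armed_nth; /* 1-based index of the point to delay, 0 = disarmed */
	unsigned int seen;      /* delay points hit since arming */
	bool injected;          /* set once the armed delay has fired */
};

/* Arm a delay at the nth delay point of the next syscall. */
static void delay_nth_arm(struct delay_nth *d, unsigned int nth)
{
	d->armed_nth = nth;
	d->seen = 0;
	d->injected = false;
}

/* Called at every candidate delay point; true means "delay here". */
static bool delay_nth_should_delay(struct delay_nth *d)
{
	if (d->armed_nth == 0)
		return false;
	if (++d->seen != d->armed_nth)
		return false;
	d->armed_nth = 0; /* one-shot: disarm after firing */
	d->injected = true;
	return true;
}

/*
 * Driver: re-run the "syscall" with nth = 1, 2, 3, ... and stop at the
 * first run where no delay was injected. Returns the number of runs
 * that did inject a delay.
 */
static unsigned int enumerate_points(struct delay_nth *d,
				     unsigned int points_in_syscall)
{
	unsigned int runs = 0;

	for (unsigned int nth = 1; ; nth++) {
		delay_nth_arm(d, nth);
		/* Stand-in for executing the syscall under test. */
		for (unsigned int p = 0; p < points_in_syscall; p++)
			delay_nth_should_delay(d);
		if (!d->injected)
			break; /* nth is past the last delay point */
		runs++;
	}
	return runs;
}
```

For a syscall that passes three delay points, the driver performs three
injecting runs followed by one run that injects nothing and ends the loop.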

bugzill...@bugzilla.kernel.org

Sep 10, 2020, 8:14:54 AM
to kasa...@googlegroups.com
https://bugzilla.kernel.org/show_bug.cgi?id=209219

--- Comment #5 from Alexander Potapenko (gli...@google.com) ---
Could these UAFs be detected by KCSAN? Maybe we could bundle the two, as KCSAN
already instruments the code?

bugzill...@bugzilla.kernel.org

Sep 10, 2020, 8:31:16 AM
to kasa...@googlegroups.com
https://bugzilla.kernel.org/show_bug.cgi?id=209219

--- Comment #6 from Marco Elver (el...@google.com) ---
> Could these UAFs be detected by KCSAN?

KCSAN already instruments kfree() and will detect races between usage and
kfree(). But we know that KASAN is still the better tool to detect UAFs, due to
quarantine etc.

> Maybe we could bundle the two, as KCSAN already instruments the code?

KCSAN instruments memory accesses, and I think that's overkill/too
fine-grained.

From what I gather, we want to insert delays into strategic locations, such as
synchronization or special functions, to enumerate interesting schedules. This
will require (as suggested by Dmitry) a cooperative approach, inserting delay
functions either directly or via means of kprobes etc.

The other requirement seems to be that we want something that could be applied
to all sanitizers, not just KCSAN.


On the whole, one direction this reminds me of is stateless model checking,
which can be applied to real code to perturb schedules in a systematic way. One
popular paper I'm aware of is the CHESS paper:
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/pldi08-FairStatelessModelChecking.pdf

bugzill...@kernel.org

Sep 12, 2024, 3:37:39 AM
to kasa...@googlegroups.com
https://bugzilla.kernel.org/show_bug.cgi?id=209219

--- Comment #7 from Dmitry Vyukov (dvy...@google.com) ---
Here is a simple patch based on KCOV that does it:
https://github.com/dvyukov/linux/commit/3ca715d1f7e1fbd592097149966d9034805e338a

It proved to trigger more bugs in local tests. Remaining work:
1. Figure out how to properly check that a task can sleep. Should this check be
moved to kernel/sched* code?
2. Abstract away/remove dependency on rdtsc.
3. Abstract away smap code (x86-specific).
4. Remove all hardcoded policy decisions and allow user-space to control them.
A large question is whether randomization should be done for all tasks, or only
for tasks that have a KCOV descriptor. I think the most flexible option would
be to add an ioctl that allows enabling delays globally or for a given KCOV
descriptor, and controlling all parameters (frequency/scale of delays). The
per-KCOV setting should take precedence over the global one. This allows
exploring all possible policies (either enable globally, or enable for each
KCOV descriptor separately). The ioctl should also allow setting the random
seed; this will be useful for snapshot-based mode and will remove the
dependency on rdtsc.
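
A rough sketch of the precedence rule and the seeded randomness, written as
plain user-space C. Everything here is an assumption for illustration: struct
shaker_params and the shaker_*() helpers are hypothetical and do not describe
the linked patch or any existing KCOV ioctl. The generator is splitmix64,
chosen only because it is small and fully determined by its seed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block such an ioctl might carry. */
struct shaker_params {
	bool     enabled;
	uint32_t one_in;       /* frequency: delay once per one_in points */
	uint32_t max_delay_us; /* scale: upper bound for a single delay */
	uint64_t seed;         /* initial PRNG state */
};

/* Per-descriptor settings win over the global ones; NULL means "not set". */
static const struct shaker_params *
shaker_effective(const struct shaker_params *per_kcov,
		 const struct shaker_params *global)
{
	return per_kcov ? per_kcov : global;
}

/* splitmix64: deterministic for a given seed, no timestamp counter needed. */
static uint64_t shaker_rand(uint64_t *state)
{
	uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);

	z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
	z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
	return z ^ (z >> 31);
}

/* Returns the delay in microseconds for this point, or 0 for no delay. */
static uint32_t shaker_pick_delay(const struct shaker_params *p,
				  uint64_t *state)
{
	if (!p || !p->enabled || p->one_in == 0 || p->max_delay_us == 0)
		return 0;
	if (shaker_rand(state) % p->one_in != 0)
		return 0;
	return 1 + (uint32_t)(shaker_rand(state) % p->max_delay_us);
}
```

Because the delay sequence depends only on the seed, a run restored from a
snapshot with the same seed sees the same delays at the same points.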


bugzill...@kernel.org

Dec 16, 2025, 3:13:32 AM
to kasa...@googlegroups.com
https://bugzilla.kernel.org/show_bug.cgi?id=209219

--- Comment #8 from Dmitry Vyukov (dvy...@google.com) ---
What we discussed with Steven Rostedt:

- market this as a might_sleep debugging facility
- split might_sleep so that it has a function that returns a bool, or add a
helper that accepts a bool flag saying whether to WARN or not
- use this predicate in KCOV
- improve the might_sleep predicate to include the SMAP check etc. (anything
that's missing)
- the hypothesis is that might_sleep is actually buggy (and there are no means
to test it well); this would stress the might_sleep predicate well
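
The proposed split can be modelled in user space roughly as follows. This is
a toy model, not kernel code: struct ctx_model and the *_model() functions are
hypothetical, and the three fields only stand in for the checks a real
predicate would perform (preemption state, interrupt state, and the SMAP/user
access state mentioned above).

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy stand-in for the execution context a real predicate would inspect. */
struct ctx_model {
	int  preempt_depth;    /* > 0 means preemption is disabled */
	bool irqs_off;         /* interrupts disabled */
	bool user_access_open; /* models the SMAP-related state */
};

static int warn_count; /* number of warnings emitted by the model */

/* The bool-returning predicate; "warn" selects WARN-like reporting. */
static bool can_sleep_model(const struct ctx_model *c, bool warn)
{
	bool ok = c->preempt_depth == 0 && !c->irqs_off && !c->user_access_open;

	if (!ok && warn) {
		warn_count++;
		fprintf(stderr, "sleeping function called from invalid context\n");
	}
	return ok;
}

/* might_sleep()-style annotation: same predicate, warning enabled. */
static void might_sleep_model(const struct ctx_model *c)
{
	(void)can_sleep_model(c, true);
}

/*
 * KCOV-style caller: inject a sleep only where the predicate allows it,
 * and stay silent otherwise. Returns true if a sleep would be injected.
 */
static bool maybe_inject_sleep(const struct ctx_model *c)
{
	if (!can_sleep_model(c, false))
		return false;
	/* A real implementation would sleep or reschedule here. */
	return true;
}
```

Sharing one predicate between the annotation and the delay injector is what
would let injected sleeps stress the predicate itself: any context it wrongly
reports as sleepable becomes a place where a sleep actually happens.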

