Well, it's not Clang itself that is lacking testing: Clang builds with
all those debug configs are quite reliable.
The problem here is that KMSAN is doing pretty elaborate stuff on
every instrumented memory access, and when we enable any of the debug
configs on top of that, we often end up calling debug code from the
instrumentation code, and that debug code is instrumented, so there's
infinite recursion.
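
To make the recursion concrete, here is a minimal user-space sketch of the
call chain. It is not kernel code: every name in it (runtime_hook(),
debug_check(), instrumented_load(), MAX_DEMO_DEPTH) is a hypothetical
stand-in, and the depth cap exists only so the demo terminates instead of
overflowing the stack.

```c
#include <assert.h>
#include <stdbool.h>

/* All names below are hypothetical stand-ins, not kernel APIs. */

#define MAX_DEMO_DEPTH 8	/* cap so the demo stops instead of recursing forever */

static int hook_depth;		/* current nesting level of the runtime hook */
static int max_depth_seen;	/* deepest nesting observed during one demo run */

static void instrumented_load(const int *p, bool debug_enabled);

/*
 * Stand-in for a check added by a debug config. It is itself instrumented,
 * so its own memory access goes through the runtime hook again.
 */
static bool debug_check(bool debug_enabled)
{
	static int debug_state = 1;

	instrumented_load(&debug_state, debug_enabled);
	return debug_state != 0;
}

/* Stand-in for the runtime code that runs on every instrumented access. */
static void runtime_hook(const int *p, bool debug_enabled)
{
	(void)p;
	hook_depth++;
	if (hook_depth > max_depth_seen)
		max_depth_seen = hook_depth;
	/* With the debug config on, the hook calls into instrumented debug code. */
	if (debug_enabled && hook_depth < MAX_DEMO_DEPTH)
		debug_check(debug_enabled);
	hook_depth--;
}

/* Stand-in for a compiler-instrumented load: hook first, then the access. */
static void instrumented_load(const int *p, bool debug_enabled)
{
	runtime_hook(p, debug_enabled);
	(void)*p;
}

/* Performs one instrumented load and reports how deep the hook nested. */
int demo_max_depth(bool debug_enabled)
{
	int x = 0;

	hook_depth = 0;
	max_depth_seen = 0;
	instrumented_load(&x, debug_enabled);
	assert(hook_depth == 0);
	return max_depth_seen;
}
```

Without the debug check the hook nests once; with it, nesting only stops
at the artificial cap, which is the unbounded recursion described above.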
Because KMSAN is primarily used by syzbot, which already runs these
debug configs with faster tools (KASAN), the configs are indeed
undertested in combination with KMSAN, and people keep running into
incompatibilities like those Kirill reported.
There are several possible approaches to addressing this, each having
its benefits and drawbacks.
1. Find all code that could potentially be called from
kmsan_virt_addr_valid() and exclude it from instrumentation (noinstr,
KMSAN_SANITIZE:=n, __no_sanitize_memory, etc.).
+ Debug configs still work
- If this code is called from somewhere else, it won't be
instrumented, so we can miss KMSAN errors in it.
- Future debug configs may introduce more instrumented code.
2. Provide simplified versions of primitives needed by KMSAN without
debug checks (e.g. preempt_disable(), pfn_valid(), phys_addr()) that
won't be instrumented.
+ Covers all existing and future debug configs.
- Code duplication is bad: we'll need to keep both implementations in
sync. (We could refactor the existing primitives, though, so that there
is a single version for which checks can be disabled.)
3. Disable low-level debugging configs under KMSAN.
+ No more issues with known problematic configs.
- Some people may actually want to have these configs enabled.
- New debug configs may still break in the absence of testing.
4. Use reentrancy counters to stop KMSAN functions from recursively
calling each other.
+ Is config-agnostic.
- This is brittle: there are situations in which we want instrumented
code called from KMSAN runtime to correctly initialize the metadata,
not just bail out (e.g. when allocating heap storage for stack traces
saved by KMSAN in the stackdepot, we'd better not ignore the stores to
freelist pointers to avoid false positives later on).
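
Approach 4 and its drawback can be modelled in user space as below. This
is only a sketch under stated assumptions: all names (in_runtime,
runtime_enter(), hook_store(), runtime_operation(), the shadow array) are
hypothetical and do not correspond to the actual KMSAN runtime, and a
real implementation would likely need per-task or per-CPU state rather
than a single global counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this models the idea, not the KMSAN runtime. */

#define NSLOTS 4

static bool shadow_initialized[NSLOTS];	/* toy metadata: true = known initialized */
static int in_runtime;			/* reentrancy counter (one global for the demo) */

/* Returns false if we are already inside the runtime, i.e. on reentry. */
static bool runtime_enter(void)
{
	if (in_runtime > 0)
		return false;
	in_runtime++;
	return true;
}

static void runtime_leave(void)
{
	assert(in_runtime > 0);
	in_runtime--;
}

/* Hook for an instrumented store to slot idx: marks the slot initialized. */
void hook_store(size_t idx)
{
	if (!runtime_enter())
		return;		/* reentry: the metadata update is skipped */
	shadow_initialized[idx] = true;
	runtime_leave();
}

/*
 * Instrumented helper that the runtime itself depends on (think of an
 * allocator writing a freelist pointer).
 */
static void instrumented_helper(size_t idx)
{
	hook_store(idx);
}

/* A runtime operation that calls the instrumented helper. */
void runtime_operation(size_t helper_slot)
{
	if (!runtime_enter())
		return;
	instrumented_helper(helper_slot);
	runtime_leave();
}

bool slot_is_initialized(size_t idx)
{
	return shadow_initialized[idx];
}
```

A store made from ordinary code updates the metadata, but the same store
made while the counter is held is silently dropped: after
runtime_operation(1), slot 1 is still marked uninitialized even though it
was written. The counter does stop the recursion, but the stale metadata
is exactly what would surface later as a false positive.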