[syzbot] [mm?] KCSAN: data-race in mas_wr_store_entry / mtree_range_walk (2)


syzbot

Apr 17, 2026, 5:12:23 AM
to Liam.H...@oracle.com, ak...@linux-foundation.org, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, l...@kernel.org, pfal...@suse.de, syzkall...@googlegroups.com, vba...@kernel.org
Hello,

syzbot found the following issue on:

HEAD commit: 1d51b370a0f8 Merge tag 'jfs-7.1' of github.com:kleikamp/li..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=117dc4ce580000
kernel config: https://syzkaller.appspot.com/x/.config?x=7f207c4b1fbf85a3
dashboard link: https://syzkaller.appspot.com/bug?extid=38a879f4a73497f2dfef
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/e08ff8d2b0e5/disk-1d51b370.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/c11d4b098bbf/vmlinux-1d51b370.xz
kernel image: https://storage.googleapis.com/syzbot-assets/6a4691f32e3d/bzImage-1d51b370.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+38a879...@syzkaller.appspotmail.com

==================================================================
BUG: KCSAN: data-race in mas_wr_store_entry / mtree_range_walk

write to 0xffff888104f71d08 of 8 bytes by task 4757 on cpu 0:
mas_wr_slot_store lib/maple_tree.c:3232 [inline]
mas_wr_store_entry+0x3405/0x5ad0 lib/maple_tree.c:3528
mas_store_prealloc+0x43e/0x690 lib/maple_tree.c:4936
vma_iter_store_overwrite mm/vma.h:616 [inline]
commit_merge+0x6a1/0x720 mm/vma.c:766
vma_expand+0x301/0x460 mm/vma.c:1219
vma_merge_new_range+0x29c/0x320 mm/vma.c:1112
__mmap_region mm/vma.c:2766 [inline]
mmap_region+0x1073/0x2110 mm/vma.c:2856
do_mmap+0x9b2/0xbd0 mm/mmap.c:560
vm_mmap_pgoff+0x183/0x2d0 mm/util.c:581
ksys_mmap_pgoff+0xc1/0x310 mm/mmap.c:606
x64_sys_call+0x14df/0x3020 arch/x86/include/generated/asm/syscalls_64.h:10
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x12c/0x3b0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff888104f71d08 of 8 bytes by task 4759 on cpu 1:
mtree_range_walk+0x1a6/0x490 lib/maple_tree.c:2032
mas_state_walk lib/maple_tree.c:2952 [inline]
mas_walk+0x1cc/0x370 lib/maple_tree.c:4366
lock_vma_under_rcu+0xc9/0x210 mm/mmap_lock.c:304
do_user_addr_fault+0x232/0x1050 arch/x86/mm/fault.c:1325
handle_page_fault arch/x86/mm/fault.c:1474 [inline]
exc_page_fault+0x62/0xa0 arch/x86/mm/fault.c:1527
asm_exc_page_fault+0x26/0x30 arch/x86/include/asm/idtentry.h:618

value changed: 0x00007f68dc2a5fff -> 0x00007f68dc284fff

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 UID: 0 PID: 4759 Comm: syz.5.348 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/18/2026
==================================================================
netlink: 64 bytes leftover after parsing attributes in process `syz.5.348'.


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite the report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

Liam R. Howlett

Apr 17, 2026, 7:51:19 PM
to syzbot, ak...@linux-foundation.org, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, l...@kernel.org, pfal...@suse.de, syzkall...@googlegroups.com, vba...@kernel.org
* syzbot <syzbot+38a879...@syzkaller.appspotmail.com> [260417 05:12]:
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: 1d51b370a0f8 Merge tag 'jfs-7.1' of github.com:kleikamp/li..
> git tree: upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=117dc4ce580000
> kernel config: https://syzkaller.appspot.com/x/.config?x=7f207c4b1fbf85a3
> dashboard link: https://syzkaller.appspot.com/bug?extid=38a879f4a73497f2dfef
> compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
>
> Unfortunately, I don't have any reproducer for this issue yet.

... and you won't. This will work unless we tear aligned unsigned long
writes/reads.

I'm debating marking these as data_race(), or marking all the reads as
READ_ONCE() and this one write as WRITE_ONCE(). It seems like overkill
for something that won't happen.

Alternatively, I can move the slot store fast path to need an
allocation, but that's worse.

Marco Elver

Apr 17, 2026, 8:26:13 PM
to Liam R. Howlett, syzbot, ak...@linux-foundation.org, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, l...@kernel.org, pfal...@suse.de, syzkall...@googlegroups.com, vba...@kernel.org
On Sat, 18 Apr 2026 at 01:51, 'Liam R. Howlett' via syzkaller-bugs
<syzkall...@googlegroups.com> wrote:
>
> * syzbot <syzbot+38a879...@syzkaller.appspotmail.com> [260417 05:12]:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit: 1d51b370a0f8 Merge tag 'jfs-7.1' of github.com:kleikamp/li..
> > git tree: upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=117dc4ce580000
> > kernel config: https://syzkaller.appspot.com/x/.config?x=7f207c4b1fbf85a3
> > dashboard link: https://syzkaller.appspot.com/bug?extid=38a879f4a73497f2dfef
> > compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
>
> ... and you won't. This will work unless we tear aligned unsigned long
> writes/reads.
>
> I'm debating marking these as data_race(), or marking all the reads as
> READ_ONCE() and this one write as WRITE_ONCE(). It seems like overkill
> for something that won't happen.
>
> Alternatively, I can move the slot store fast path to need an
> allocation, but that's worse.

The writer:

> rcu_assign_pointer(slots[offset + 1], wr_mas->entry);
> wr_mas->pivots[offset] = mas->index - 1; // <-- stores pivots[offset]

The reader races here:

> if (pivots[offset] >= mas->index) { // <-- load pivots[offset]
> max = pivots[offset]; // <-- load pivots[offset] again
> break;
> }

The compiler is free to reload them as written. What if there's a
concurrent update between the first and second load?

Vlastimil Babka (SUSE)

Apr 20, 2026, 4:36:26 AM
to Liam R. Howlett, syzbot, ak...@linux-foundation.org, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, l...@kernel.org, pfal...@suse.de, syzkall...@googlegroups.com
On 4/18/26 01:50, Liam R. Howlett wrote:
> * syzbot <syzbot+38a879...@syzkaller.appspotmail.com> [260417 05:12]:
>> Hello,
>>
>> syzbot found the following issue on:
>>
>> HEAD commit: 1d51b370a0f8 Merge tag 'jfs-7.1' of github.com:kleikamp/li..
>> git tree: upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=117dc4ce580000
>> kernel config: https://syzkaller.appspot.com/x/.config?x=7f207c4b1fbf85a3
>> dashboard link: https://syzkaller.appspot.com/bug?extid=38a879f4a73497f2dfef
>> compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
>>
>> Unfortunately, I don't have any reproducer for this issue yet.
>
> ... and you won't. This will work unless we tear aligned unsigned long
> writes/reads.

Note I think the reproducer here means code that will trigger that KCSAN
alert deterministically (as opposed to fuzzing), not something that would
depend on whether the alert is pointing to a "real" problem.


Liam R. Howlett

Apr 20, 2026, 3:29:30 PM
to Marco Elver, syzbot, ak...@linux-foundation.org, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, l...@kernel.org, pfal...@suse.de, syzkall...@googlegroups.com, vba...@kernel.org
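* Marco Elver <el...@google.com> [260417 20:26]:
> The compiler is free to reload them as written. What if there's a
> concurrent update between the first and second load?

Then the benign race has happened.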

Looking at [1], we see that care has been taken to limit the slot store
code to only !rcu mode, except for a subset of cases. Digging through
the information in git will eventually lead you to this note Peng wrote:

commit 64891ba3e51fb841b0af70db029038eb93bd5a43
Author: Peng Zhang <zhangp...@bytedance.com>
Date: Wed Jun 28 15:36:57 2023 +0800

maple_tree: add a fast path case in mas_wr_slot_store()

When expanding a range in two directions, only partially overwriting the
previous and next ranges, the number of entries will not be increased, so
we can just update the pivots as a fast path. However, it may introduce
potential risks in RCU mode, because it updates two pivots. We only
enable it in non-RCU mode.

Link: https://lkml.kernel.org/r/20230628073657.75...@bytedance.com
Signed-off-by: Peng Zhang <zhangp...@bytedance.com>
Reviewed-by: Liam R. Howlett <Liam.H...@oracle.com>
Signed-off-by: Andrew Morton <ak...@linux-foundation.org>

So, you can see that the author of the initial code did look at race
conditions. I wanted to read the link for more information but that
link isn't working right now (403 error).

-----

Or, we can ask an LLM about it:

In mas_wr_store_type(), we only allow wr_slot_store in RCU mode for the narrow
case where wr_mas->offset_end - mas->offset == 1. That condition means the
update touches only one boundary between two adjacent ranges, so the in-place
mutation in mas_wr_slot_store() stays limited to a single slot/pivot boundary
update and is considered safe for lockless readers.

If the span is wider than that, we do not use in-place slot-store under RCU.
The broader in-place path in mas_wr_slot_store() is explicitly guarded with
WARN_ON_ONCE(mt_in_rcu(...)), and the store-type logic instead falls back to
node-replacement paths (wr_node_store/split/rebalance), which preserve RCU
reader safety by publishing a new node rather than mutating too much in place.

In non-RCU mode (!mt_in_rcu()), we allow the wider in-place cases because
readers are expected to be synchronized by locking, so the stricter
lockless-reader constraints do not apply.

-----

I am sorry, but I don't have time to work through the scenarios as this
is not an issue and I no longer have the time budget for mailing lists
as I once did.

If you can come up with a problem (and ideally a reproducer), then
please let me know.

[1]. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/maple_tree.c?id=c1f49dea2b8f335813d3b348fd39117fb8efb428#n3696

Andrew Morton

Apr 23, 2026, 12:34:25 PM
to Liam R. Howlett, Marco Elver, syzbot, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, l...@kernel.org, pfal...@suse.de, syzkall...@googlegroups.com, vba...@kernel.org
On Mon, 20 Apr 2026 15:29:00 -0400 "Liam R. Howlett" <Liam.H...@oracle.com> wrote:

> Link: https://lkml.kernel.org/r/20230628073657.75...@bytedance.com
> Signed-off-by: Peng Zhang <zhangp...@bytedance.com>
> Reviewed-by: Liam R. Howlett <Liam.H...@oracle.com>
> Signed-off-by: Andrew Morton <ak...@linux-foundation.org>
>
> So, you can see that the author of the initial code did look at race
> conditions. I wanted to read the link for more information but that
> link isn't working right now (403 error).

lkml links are temporarily broken.

Replacing "lkml.kernel.org/r" with "lore.kernel.org" fixes it.

https://lore.kernel.org/all/20230628073657.75...@bytedance.com/

Marco Elver

Jun 10, 2026, 6:00:08 PM
to Jay Vadayath, ak...@linux-foundation.org, Liam.H...@oracle.com, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, l...@kernel.org, pfal...@suse.de, syzbot+38a879...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, vba...@kernel.org
On Wed, 10 Jun 2026 at 23:21, Jay Vadayath <jkrsh...@gmail.com> wrote:
>
> Hello,
>
> I am a security researcher and was able to independently reproduce this
> data race on an x86-64 host using a KCSAN-enabled build of the affected
> kernel version (3cd8b194bf3428dfa53120fee47e827a7c495815).
> The crash can be reliably triggered using the provided PoC.

It's not a crash, it's a data-race report. Data races are more subtle:
while technically UB, they are tolerated in some cases in the kernel;
please see: https://lwn.net/Articles/816850/

> However, I am not aware of a method that can take advantage of this bug
> to achieve a practical exploit.

An analysis of its severity would be appreciated (e.g. corrupts data, etc.).

We've been trying to do this with syzbot. This report was also caught
by syzbot, and we have some AI-based workflows that attempt to analyze
benign'ness/harmfulness (take it with a bucket of salt):

> An unprivileged user can easily trigger this data race by performing concurrent
> `mmap()` operations and memory accesses (page faults). However, the race only
> results in a spurious fallback to the slow path of the page fault handler. It
> cannot lead to memory corruption, privilege escalation, or denial of service.

Sounds plausible. Source:
https://syzbot.org/bug?id=3ce3ffd8320398fb3336831c3cb1e7b3ba60d64a

Either way, marking the accesses appropriately is typically encouraged
so that optimizing compilers don't become too clever and break the
code in future:
https://docs.kernel.org/dev-tools/lkmm/docs/access-marking.html

Jay Vadayath

9:14 AM
to ak...@linux-foundation.org, Liam.H...@oracle.com, el...@google.com, ja...@google.com, linux-...@vger.kernel.org, linu...@kvack.org, l...@kernel.org, pfal...@suse.de, syzbot+38a879...@syzkaller.appspotmail.com, syzkall...@googlegroups.com, vba...@kernel.org
Hello,

I am a security researcher and was able to independently reproduce this
data race on an x86-64 host using a KCSAN-enabled build of the affected
kernel version (3cd8b194bf3428dfa53120fee47e827a7c495815).
The crash can be reliably triggered using the provided PoC.
However, I am not aware of a method that can take advantage of this bug
to achieve a practical exploit.

Reproduction environment:
Kernel: linux kernel commit 3cd8b194bf3428dfa53120fee47e827a7c495815
(KCSAN-enabled)
QEMU: qemu-system-x86_64, TCG/KVM, 2 vCPUs, 2 GiB RAM
Host: x86-64, Linux

Expected KCSAN report:
[ 9.814310][ T2967] BUG: KCSAN: data-race in mas_wr_slot_store / mtree_range_walk
[ 9.814320][ T2967]
[ 9.814323][ T2967] write to 0xffff88800c4d0f30 of 8 bytes by task 2966 on cpu 0:
[ 9.814328][ T2967] mas_wr_slot_store+0x33e/0x360
[ 9.814335][ T2967] mas_wr_store_entry+0x172/0x1d0
[ 9.814343][ T2967] mas_store_prealloc+0xc3/0x170
[ 9.814351][ T2967] commit_merge+0x247/0x450
[ 9.814360][ T2967] vma_expand+0x195/0x440
[ 9.814368][ T2967] vma_merge_new_range+0x153/0x3a0
[ 9.814376][ T2967] vma_merge_extend+0xe5/0x110
[ 9.814384][ T2967] do_mremap+0xbd8/0xc40
[ 9.814394][ T2967] __do_sys_mremap+0x99/0xd0
[ 9.814403][ T2967] do_syscall_64+0xf9/0x540
[ 9.814411][ T2967] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 9.814417][ T2967]
[ 9.814419][ T2967] read to 0xffff88800c4d0f30 of 8 bytes by task 2967 on cpu 1:
[ 9.814424][ T2967] mtree_range_walk+0x234/0x440
[ 9.814431][ T2967] mas_state_walk+0x5c/0x80
[ 9.814437][ T2967] mas_walk+0x75/0x190
[ 9.814444][ T2967] lock_vma_under_rcu+0x73/0x500
[ 9.814450][ T2967] madvise_do_behavior+0x29f/0x7b0
[ 9.814457][ T2967] do_madvise+0x12f/0x1b0
[ 9.814463][ T2967] __x64_sys_madvise+0x2c/0x40
[ 9.814469][ T2967] do_syscall_64+0xf9/0x540
[ 9.814476][ T2967] entry_SYSCALL_64_after_hwframe+0x77/0x7f

The following patch was applied to scope KCSAN to the target
subsystems:
--- kcsan.patch ---
diff --git a/certs/Makefile b/certs/Makefile
index 3ee1960f9f4a..a14f4393963b 100644
--- a/certs/Makefile
+++ b/certs/Makefile
@@ -1,3 +1,4 @@
+KCSAN_SANITIZE := n
# SPDX-License-Identifier: GPL-2.0
#
# Makefile for the linux kernel signature checking certificates.
diff --git a/io_uring/Makefile b/io_uring/Makefile
index c54e328d1410..dea4f90bec8d 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -1,3 +1,4 @@
+KCSAN_SANITIZE := n
# SPDX-License-Identifier: GPL-2.0
#
# Makefile for io_uring
diff --git a/kernel/kcsan/Makefile b/kernel/kcsan/Makefile
index 824f30c93252..03a0e89b52f6 100644
--- a/kernel/kcsan/Makefile
+++ b/kernel/kcsan/Makefile
@@ -1,7 +1,6 @@
# SPDX-License-Identifier: GPL-2.0
CONTEXT_ANALYSIS := y

-KCSAN_SANITIZE := n
KCOV_INSTRUMENT := n
UBSAN_SANITIZE := n

@@ -9,7 +8,7 @@ CFLAGS_REMOVE_core.o = $(CC_FLAGS_FTRACE)
CFLAGS_REMOVE_debugfs.o = $(CC_FLAGS_FTRACE)
CFLAGS_REMOVE_report.o = $(CC_FLAGS_FTRACE)

-CFLAGS_core.o := $(call cc-option,-fno-conserve-stack) \
+CFLAGS_core.o := $(call cc-option,-fno-conserve-stack) -Wno-builtin-declaration-mismatch -Wno-missing-attributes -Wno-builtin-declaration-mismatch -Wno-missing-attributes \
$(call cc-option,-mno-outline-atomics) \
-fno-stack-protector -DDISABLE_BRANCH_PROFILING

diff --git a/lib/Makefile b/lib/Makefile
index 72c90fca6fef..6771e3d7ed58 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -337,3 +337,5 @@ CONTEXT_ANALYSIS_test_context-analysis.o := y
obj-$(CONFIG_CONTEXT_ANALYSIS_TEST) += test_context-analysis.o

subdir-$(CONFIG_FORTIFY_SOURCE) += test_fortify
+KCSAN_SANITIZE_maple_tree.o := y
+KCSAN_SANITIZE_maple_tree.o := y
diff --git a/rust/Makefile b/rust/Makefile
index b361bfedfdf0..e19ba2714d22 100644
--- a/rust/Makefile
+++ b/rust/Makefile
@@ -1,3 +1,4 @@
+KCSAN_SANITIZE := n
# SPDX-License-Identifier: GPL-2.0

# Where to place rustdoc generated documentation
diff --git a/samples/Makefile b/samples/Makefile
index 07641e177bd8..9048409de6d0 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -1,3 +1,4 @@
+KCSAN_SANITIZE := n
# SPDX-License-Identifier: GPL-2.0
# Makefile for Linux samples code

diff --git a/scripts/Makefile b/scripts/Makefile
index 3434a82a119f..95bddac1d1af 100644
--- a/scripts/Makefile
+++ b/scripts/Makefile
@@ -1,3 +1,4 @@
+KCSAN_SANITIZE := n
# SPDX-License-Identifier: GPL-2.0
###
# scripts contains sources for various helper programs used throughout
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index 0718e39cedda..cdcff805ae16 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -96,7 +96,7 @@ endif
#
ifeq ($(CONFIG_KCSAN),y)
_c_flags += $(if $(patsubst n%,, \
- $(KCSAN_SANITIZE_$(target-stem).o)$(KCSAN_SANITIZE)$(is-kernel-object)), \
+ $(KCSAN_SANITIZE_$(target-stem).o)$(KCSAN_SANITIZE)n), \
$(CFLAGS_KCSAN))
# Some uninstrumented files provide implied barriers required to avoid false
# positives: set KCSAN_INSTRUMENT_BARRIERS for barrier instrumentation only.
diff --git a/usr/Makefile b/usr/Makefile
index e8f42478a0b7..4a519330041e 100644
--- a/usr/Makefile
+++ b/usr/Makefile
@@ -1,3 +1,4 @@
+KCSAN_SANITIZE := n
# SPDX-License-Identifier: GPL-2.0
#
# kbuild file for usr/ - including initramfs image
--- end kcsan.patch ---

Once the patch has been applied, the following scripts can be used
to reproduce this issue locally.

--- build.sh ---
#!/usr/bin/env bash
# build.sh — Build KCSAN kernel + initramfs + PoC for maple-tree data-race reproduction
#
# Usage: ./build.sh <linux-source-tree> [kernel-config]

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO="${1:?Usage: $0 <linux-source-tree> [kernel-config]}"
REPO="$(cd "$REPO" && pwd)"
CONFIG_SRC="${2:-$SCRIPT_DIR/38a879f4a73497f2dfef.config}"

NPROC=$(nproc)
JOBS=$((NPROC > 16 ? NPROC / 2 : NPROC))
BUILD_DIR="/tmp/kernel-builds/kcsan"
LINK_DIR="$SCRIPT_DIR/builds/kcsan"
NORMAL_USER="vscode"

# ── Install dependencies ──────────────────────────────────────────────────────
if ! command -v busybox &>/dev/null; then
    command -v apt-get &>/dev/null \
        && sudo apt-get install -y --no-install-recommends busybox-static >/dev/null \
        || { echo "busybox not found. Install busybox-static manually." >&2; exit 1; }
fi

# ── Build PoC binary ──────────────────────────────────────────────────────────
gcc -static -O2 -Wall -Wextra -pthread "$SCRIPT_DIR/poc.c" -o "$SCRIPT_DIR/poc"

# ── Patch lib/Makefile for KCSAN symbol visibility ────────────────────────────
if ! grep -q "CFLAGS_maple_tree.o.*fno-inline" "$REPO/lib/Makefile"; then
    echo 'CFLAGS_maple_tree.o += -fno-inline' >> "$REPO/lib/Makefile"
fi

# ── Build KCSAN kernel ────────────────────────────────────────────────────────
rm -rf "$BUILD_DIR" && mkdir -p "$BUILD_DIR"
mkdir -p "$(dirname "$LINK_DIR")" && rm -rf "$LINK_DIR" && ln -sfn "$BUILD_DIR" "$LINK_DIR"

cp "$CONFIG_SRC" "$BUILD_DIR/.config"
cat >> "$BUILD_DIR/.config" <<'KCSAN_CONFIG'
CONFIG_KCSAN_EARLY_ENABLE=y
CONFIG_KCSAN_NUM_WATCHPOINTS=128
CONFIG_KCSAN_SKIP_WATCH=1000
CONFIG_KCSAN_REPORT_ONCE_IN_MS=1000
CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=y
# CONFIG_KCSAN_PERMISSIVE is not set
# CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC is not set
CONFIG_KCSAN_IGNORE_ATOMICS=y
CONFIG_DEBUG_INFO=y
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y
KCSAN_CONFIG

make -C "$REPO" O="$BUILD_DIR" olddefconfig 2>&1 | tail -5
make -C "$REPO" O="$BUILD_DIR" -j"$JOBS" bzImage 2>&1 | tail -10

# ── Build initramfs ───────────────────────────────────────────────────────────
BUSYBOX_BIN="$(command -v busybox)"

WORKDIR="$(mktemp -d /tmp/initramfs_build.XXXXXX)"
trap 'rm -rf "$WORKDIR"' EXIT
ROOT="$WORKDIR/rootfs"

mkdir -p "$ROOT"/{bin,sbin,usr/bin,usr/sbin,dev,proc,sys,tmp,run,etc} \
    "$ROOT"/home/"$NORMAL_USER"

cp "$BUSYBOX_BIN" "$ROOT/bin/busybox" && chmod +x "$ROOT/bin/busybox"
for applet in $("$BUSYBOX_BIN" --list); do
    case "$applet" in
        ifconfig|ip|route|udhcpc|modprobe|insmod|rmmod|halt|poweroff|reboot|mdev)
            [[ -e "$ROOT/sbin/$applet" ]] || ln -s /bin/busybox "$ROOT/sbin/$applet" ;;
        *)  [[ -e "$ROOT/bin/$applet" ]] || ln -s /bin/busybox "$ROOT/bin/$applet" ;;
    esac
done

cat > "$ROOT/etc/passwd" <<EOF
root:x:0:0:root:/root:/bin/sh
$NORMAL_USER:x:1000:1000:$NORMAL_USER:/home/$NORMAL_USER:/bin/sh
EOF
printf 'root:*:19000:0:99999:7:::\n%s:*:19000:0:99999:7:::\n' "$NORMAL_USER" > "$ROOT/etc/shadow"
chmod 640 "$ROOT/etc/shadow"
printf 'root:x:0:\n%s:x:1000:\n' "$NORMAL_USER" > "$ROOT/etc/group"

cp "$SCRIPT_DIR/poc" "$ROOT/poc" && chmod +x "$ROOT/poc"

cat > "$ROOT/init" <<'INIT_SCRIPT'
#!/bin/sh
export PATH=/bin:/sbin:/usr/bin:/usr/sbin
mount -t proc proc /proc
mount -t sysfs sysfs /sys
mount -t devtmpfs dev /dev 2>/dev/null || mdev -s
mount -t tmpfs tmpfs /tmp
mount -t tmpfs tmpfs /run
mount -t debugfs debugfs /sys/kernel/debug 2>/dev/null || true
echo 0 > /proc/sys/kernel/printk 2>/dev/null || true
[ -f /sys/kernel/debug/kcsan ] && echo on > /sys/kernel/debug/kcsan
if [ -x /poc ]; then
    cp /poc /run/poc && chmod 0755 /run/poc
    FOUND=0; i=1
    while [ $i -le 10 ]; do
        echo "[init] === poc iteration $i/10 ==="
        su vscode -s /bin/sh -c /run/poc 2>&1 || /run/poc 2>&1
        if dmesg | grep -q "mas_wr_slot_store.*mtree_range_walk\|mtree_range_walk.*mas_wr_slot_store"; then
            echo "[init] *** RACE REPRODUCED on iteration $i ***"
            FOUND=1; break
        fi
        i=$((i+1))
    done
    echo "[init] poc loop finished, FOUND=$FOUND"
    echo "=== DMESG START ==="; dmesg; echo "=== DMESG END ==="
    poweroff -f
fi
exec /bin/sh
INIT_SCRIPT
chmod +x "$ROOT/init"

( cd "$ROOT"; find . | sort | cpio -o -H newc 2>/dev/null | gzip -9 > "$BUILD_DIR/initramfs.cpio.gz" )

echo "[build.sh] Kernel: $BUILD_DIR/arch/x86/boot/bzImage"
echo "[build.sh] Initramfs: $BUILD_DIR/initramfs.cpio.gz"
--- end build.sh ---

--- poc.sh ---
#!/usr/bin/env bash
# poc.sh — Run the KCSAN maple-tree data-race PoC built by build.sh
#
# Usage: ./poc.sh
# Prereq: ./build.sh <linux-source-tree> must have been run first.

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
KCSAN_BUILD="$SCRIPT_DIR/builds/kcsan"
KERNEL="$KCSAN_BUILD/arch/x86/boot/bzImage"
INITRAMFS="$KCSAN_BUILD/initramfs.cpio.gz"
LOGS_DIR="$SCRIPT_DIR/logs"

if [ ! -f "$KERNEL" ]; then
echo "ERROR: kernel not found at $KERNEL — run ./build.sh first" >&2; exit 1
fi
if [ ! -f "$INITRAMFS" ]; then
echo "ERROR: initramfs not found at $INITRAMFS — run ./build.sh first" >&2; exit 1
fi

rm -rf "$LOGS_DIR" && mkdir -p "$LOGS_DIR"

if [ -e /dev/kvm ]; then ACCEL="-accel kvm"; TMO=180; else ACCEL="-accel tcg,thread=multi"; TMO=1000; fi
KCSAN_ARGS="kcsan.skip_watch=15 kcsan.udelay_task=300 kcsan.udelay_interrupt=150"

MAX_ATTEMPTS=3
FOUND=0
for attempt in $(seq 1 $MAX_ATTEMPTS); do
    SERIAL_LOG="$LOGS_DIR/serial_attempt_${attempt}.log"
    echo "[poc.sh] === Attempt $attempt/$MAX_ATTEMPTS ==="
    timeout "$TMO" qemu-system-x86_64 \
        $ACCEL -m 2048 -smp 2 \
        -kernel "$KERNEL" -initrd "$INITRAMFS" \
        -append "console=ttyS0 root=/dev/ram rdinit=/init quiet loglevel=3 printk.devkmsg=off $KCSAN_ARGS" \
        -serial file:"$SERIAL_LOG" -display none -no-reboot 2>/dev/null || true
    if grep -q "mas_wr_slot_store.*mtree_range_walk\|mtree_range_walk.*mas_wr_slot_store" "$SERIAL_LOG" 2>/dev/null; then
        echo "[poc.sh] *** RACE REPRODUCED on attempt $attempt ***"
        grep -m1 -A 24 "BUG: KCSAN.*mas_wr_slot_store" "$SERIAL_LOG" | sed 's/\r//'
        FOUND=1; break
    fi
    grep -q "poc loop finished" "$SERIAL_LOG" 2>/dev/null \
        && echo "[poc.sh] PoC loop completed, race not triggered" \
        || echo "[poc.sh] WARNING: boot incomplete"
done

echo "============================================================"
[ $FOUND -eq 1 ] && echo "[poc.sh] SUCCESS" || echo "[poc.sh] NOT reproduced in $MAX_ATTEMPTS attempts"
echo "============================================================"
--- end poc.sh ---

--- poc.c ---
/*
* poc.c - KCSAN data-race in mas_wr_slot_store / mtree_range_walk
* (the *exact* syzbot 38a879f4a73497f2dfef race — the genuine double-fetch)
*
* Writer: mremap()-grow/shrink a VMA against a larger adjacent gap
* -> vma_expand() -> commit_merge() -> mas_store_prealloc()
* -> mas_wr_store_entry() -> mas_wr_slot_store()
* writes wr_mas->pivots[offset] in place (boundary shift, entry count
* unchanged) -- this is the reported writer.
* Reader: page fault -> lock_vma_under_rcu() -> mas_walk() -> mtree_range_walk()
* reads pivots[offset] locklessly (the double-fetched pivot).
*
* Why mremap (not mmap MAP_FIXED + mprotect like poc.c): the slot-store fast path
* (mas_wr_store_type() -> wr_slot_store) is only taken when a store shifts a single
* boundary WITHOUT changing the node's entry count, which in RCU mode additionally
* requires wr_mas->offset_end - mas->offset == 1. Growing/shrinking a VMA into an
* adjacent gap that *survives* (residual gap => entry count unchanged) hits exactly
* that. MAP_FIXED/mprotect split the VMA (entry count changes) and take the
* node-store/replace path instead, which is what poc.c trips.
*
* Build: gcc -static -O2 -Wall -pthread poc.c -o poc
* Run: ./poc (needs a KCSAN-enabled kernel; ~30s)
*/
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <setjmp.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>

#define PAGE 4096UL
#define LANE_PAGES 64UL /* address span reserved per lane */
#define LANE_SIZE (LANE_PAGES * PAGE)
#define NLANES 32UL /* independent anchor/gap lanes */
#define BASE_PAGES 2UL /* anchor VMA baseline size (pages) */

static atomic_int g_stop;
static atomic_int g_ready;

/* SIGSEGV/SIGBUS recovery: touching a gap (or a region the writer just shrank
 * away) faults; we still want the kernel to have run lock_vma_under_rcu first. */
static __thread sigjmp_buf tl_jmp;
static __thread volatile sig_atomic_t tl_jmp_active;

static void segv_handler(int sig, siginfo_t *si, void *uctx)
{
    (void)sig; (void)si; (void)uctx;
    if (tl_jmp_active) {
        tl_jmp_active = 0;
        siglongjmp(tl_jmp, 1);
    }
    _exit(111);
}

struct ctx {
    uint8_t *base; /* base of the reserved arena */
    int seconds;
    int cpu;
};

static void pin_cpu(int cpu)
{
    if (cpu < 0)
        return;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    (void)sched_setaffinity(0, sizeof(set), &set);
}

static uint64_t nsec_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static inline uint8_t *lane_base(uint8_t *base, unsigned lane)
{
    return base + (size_t)lane * LANE_SIZE;
}

/*
 * Writer: for each lane, repeatedly mremap()-grow the small anchor into the
 * adjacent gap and then shrink it back. Each grow/shrink shifts the single
 * boundary between the anchor VMA and the surviving gap -> mas_wr_slot_store()
 * updates pivots[offset] in place.
 */
static void *writer_fn(void *arg)
{
    struct ctx *c = (struct ctx *)arg;
    pin_cpu(c->cpu);

    atomic_fetch_add(&g_ready, 1);
    while (atomic_load(&g_ready) < 2)
        sched_yield();

    uint64_t end = nsec_now() + (uint64_t)c->seconds * 1000000000ull;
    unsigned iter = 0;

    while (!atomic_load(&g_stop) && nsec_now() < end) {
        unsigned lane = iter % NLANES;
        uint8_t *a = lane_base(c->base, lane);

        /* Oscillate the anchor between BASE_PAGES and a larger size, both > 1 page
         * and both leaving a residual gap (grow_pages <= LANE_PAGES-2). The entry
         * count (anchor VMA + surviving gap) never changes and the store always
         * shifts exactly one boundary => mas_wr_store_type() picks wr_slot_store
         * every time, instead of the node-store/rebalance paths that a shrink to
         * a single page can trigger. */
        unsigned long grow_pages = 3 + (iter % (LANE_PAGES - 4)); /* 3..LANE_PAGES-2 */
        size_t big = grow_pages * PAGE;
        size_t base = BASE_PAGES * PAGE;

        /* Grow base -> big (boundary shifts right), then shrink big -> base
         * (boundary shifts left). Both are in-place pivot updates. */
        void *p = mremap(a, base, big, 0);
        if (p == MAP_FAILED) {
            iter++;
            continue;
        }
        p = mremap(a, big, base, 0);
        (void)p;

        iter++;
    }

    atomic_store(&g_stop, 1);
    return NULL;
}

/*
 * Reader: fault addresses across each lane (anchor + gap) so the lockless VMA
 * lookup (lock_vma_under_rcu -> mas_walk -> mtree_range_walk) walks the leaf
 * holding the boundary the writer is shifting, reading pivots[offset].
 */
static void *reader_fn(void *arg)
{
    struct ctx *c = (struct ctx *)arg;
    pin_cpu(c->cpu);

    atomic_fetch_add(&g_ready, 1);
    while (atomic_load(&g_ready) < 2)
        sched_yield();

    uint64_t end = nsec_now() + (uint64_t)c->seconds * 1000000000ull;
    unsigned iter = 0;

    while (!atomic_load(&g_stop) && nsec_now() < end) {
        unsigned lane = iter % NLANES;
        uint8_t *a = lane_base(c->base, lane);

        /* Drop the anchor page so the next touch faults through the lockless path. */
        madvise(a, PAGE, MADV_DONTNEED);

        /* Touch several offsets within the lane: the anchor page plus addresses
         * deeper in the (possibly grown / possibly gap) region. Each is a fault
         * that runs mtree_range_walk; gap touches segfault and are recovered. */
        for (unsigned long off = 0; off < LANE_PAGES; off += 4) {
            volatile uint8_t *q = (volatile uint8_t *)(a + off * PAGE);
            if (sigsetjmp(tl_jmp, 1) == 0) {
                tl_jmp_active = 1;
                (void)*q; /* read fault -> lock_vma_under_rcu */
                tl_jmp_active = 0;
            }
            if (atomic_load(&g_stop))
                break;
        }
        iter++;
    }

    atomic_store(&g_stop, 1);
    return NULL;
}

int main(void)
{
    int seconds = 30;

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO | SA_NODEFER;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
    sigaction(SIGBUS, &sa, NULL);

    size_t arena = NLANES * LANE_SIZE;

    /* Reserve the whole arena, then punch each lane down to a BASE_PAGES anchor
     * followed by a (LANE_PAGES-BASE_PAGES)-page gap. */
    uint8_t *base = mmap(NULL, arena, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        perror("mmap(arena)");
        return 1;
    }
    for (unsigned lane = 0; lane < NLANES; lane++) {
        uint8_t *a = lane_base(base, lane);
        /* keep [a, a+BASE_PAGES*PAGE) mapped as the anchor; unmap the rest as a gap */
        if (munmap(a + BASE_PAGES * PAGE, LANE_SIZE - BASE_PAGES * PAGE) != 0)
            perror("munmap(gap)");
    }

    fprintf(stderr, "[poc] base=%p lanes=%lu lane_size=%luKB seconds=%d\n",
            (void *)base, NLANES, LANE_SIZE / 1024, seconds);

    struct ctx cw = { .base = base, .seconds = seconds, .cpu = 0 };
    struct ctx cr = { .base = base, .seconds = seconds, .cpu = 1 };

    atomic_store(&g_stop, 0);
    atomic_store(&g_ready, 0);

    pthread_t tw, tr;
    if (pthread_create(&tw, NULL, writer_fn, &cw) != 0) {
        perror("pthread_create(writer)");
        return 1;
    }
    if (pthread_create(&tr, NULL, reader_fn, &cr) != 0) {
        perror("pthread_create(reader)");
        return 1;
    }

    pthread_join(tw, NULL);
    pthread_join(tr, NULL);

    fprintf(stderr, "[poc] done\n");
    return 0;
}
--- end poc.c ---


Thanks,
Jay Vadayath