[syzbot] [kvm?] WARNING in clear_dirty_gfn_range

10 views
Skip to first unread message

syzbot

unread,
Mar 12, 2024, 7:34:19 PMMar 12
to b...@alien8.de, dave....@linux.intel.com, h...@zytor.com, k...@vger.kernel.org, linux-...@vger.kernel.org, mi...@redhat.com, pbon...@redhat.com, sea...@google.com, syzkall...@googlegroups.com, tg...@linutronix.de, x...@kernel.org
Hello,

syzbot found the following issue on:

HEAD commit: 855684c7d938 Merge tag 'x86_tdx_for_6.9' of git://git.kern..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=11776f71180000
kernel config: https://syzkaller.appspot.com/x/.config?x=9b384ef2b2d70c33
dashboard link: https://syzkaller.appspot.com/bug?extid=900d58a45dcaab9e4821
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1536da66180000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=14c5078e180000

Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-855684c7.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/a13a9aaebd09/vmlinux-855684c7.xz
kernel image: https://storage.googleapis.com/syzbot-assets/acac43529544/bzImage-855684c7.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+900d58...@syzkaller.appspotmail.com

------------[ cut here ]------------
WARNING: CPU: 1 PID: 5165 at arch/x86/kvm/mmu/tdp_mmu.c:1526 clear_dirty_gfn_range+0x3d6/0x540 arch/x86/kvm/mmu/tdp_mmu.c:1526
Modules linked in:
CPU: 1 PID: 5165 Comm: syz-executor417 Not tainted 6.8.0-syzkaller-01185-g855684c7d938 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:clear_dirty_gfn_range+0x3d6/0x540 arch/x86/kvm/mmu/tdp_mmu.c:1526
Code: 00 31 ff 48 b8 00 00 00 00 00 00 30 00 48 21 d8 49 89 c5 48 89 c6 e8 e9 9c 6c 00 4d 85 ed 0f 84 b8 fe ff ff e8 cb a1 6c 00 90 <0f> 0b 90 e9 aa fe ff ff e8 bd a1 6c 00 e8 a8 39 53 00 31 ff 89 c6
RSP: 0018:ffffc900039a7570 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 06200000310acb77 RCX: ffffffff81204937
RDX: ffff888022a8c880 RSI: ffffffff81204945 RDI: 0000000000000007
RBP: 0000000000000001 R08: 0000000000000007 R09: 0000000000000000
R10: 0020000000000000 R11: 0000000000000002 R12: ffffc900039a75c8
R13: 0020000000000000 R14: 0000000000000200 R15: 0000000000000001
FS: 000055556ed44380(0000) GS:ffff88806b300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000027762000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
kvm_tdp_mmu_clear_dirty_slot+0x24f/0x2e0 arch/x86/kvm/mmu/tdp_mmu.c:1557
kvm_mmu_slot_leaf_clear_dirty+0x38b/0x490 arch/x86/kvm/mmu/mmu.c:6783
kvm_mmu_slot_apply_flags arch/x86/kvm/x86.c:12962 [inline]
kvm_arch_commit_memory_region+0x299/0x490 arch/x86/kvm/x86.c:13031
kvm_commit_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1751 [inline]
kvm_set_memslot+0x4d3/0x13e0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1994
__kvm_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:2129 [inline]
__kvm_set_memory_region+0xdbc/0x1520 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2020
kvm_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:2150 [inline]
kvm_vm_ioctl_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:2162 [inline]
kvm_vm_ioctl+0x151c/0x3e20 arch/x86/kvm/../../../virt/kvm/kvm_main.c:5152
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl fs/ioctl.c:890 [inline]
__x64_sys_ioctl+0x193/0x220 fs/ioctl.c:890
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xd2/0x260 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x6d/0x75
RIP: 0033:0x7f4e1b1860f9
Code: 48 83 c4 28 c3 e8 37 17 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffdd21061f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffdd21063c8 RCX: 00007f4e1b1860f9
RDX: 0000000020000180 RSI: 000000004020ae46 RDI: 0000000000000004
RBP: 00007f4e1b1f9610 R08: 00007ffdd21063c8 R09: 00007ffdd21063c8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
R13: 00007ffdd21063b8 R14: 0000000000000001 R15: 0000000000000001
</TASK>


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzk...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want syzbot to run the reproducer, reply with:
#syz test: git://repo/address.git branch-or-commit-hash
If you attach or paste a git patch, syzbot will apply it before testing.

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup

David Matlack

unread,
Mar 15, 2024, 2:09:43 PMMar 15
to syzbot, b...@alien8.de, dave....@linux.intel.com, h...@zytor.com, k...@vger.kernel.org, linux-...@vger.kernel.org, mi...@redhat.com, pbon...@redhat.com, sea...@google.com, syzkall...@googlegroups.com, tg...@linutronix.de, x...@kernel.org
On Tue, Mar 12, 2024 at 4:34 PM syzbot
<syzbot+900d58...@syzkaller.appspotmail.com> wrote:
>
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 5165 at arch/x86/kvm/mmu/tdp_mmu.c:1526 clear_dirty_gfn_range+0x3d6/0x540 arch/x86/kvm/mmu/tdp_mmu.c:1526
> Modules linked in:
> CPU: 1 PID: 5165 Comm: syz-executor417 Not tainted 6.8.0-syzkaller-01185-g855684c7d938 #0
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> RIP: 0010:clear_dirty_gfn_range+0x3d6/0x540 arch/x86/kvm/mmu/tdp_mmu.c:1526
> Call Trace:
> <TASK>
> kvm_tdp_mmu_clear_dirty_slot+0x24f/0x2e0 arch/x86/kvm/mmu/tdp_mmu.c:1557
> kvm_mmu_slot_leaf_clear_dirty+0x38b/0x490 arch/x86/kvm/mmu/mmu.c:6783
> kvm_mmu_slot_apply_flags arch/x86/kvm/x86.c:12962 [inline]
> kvm_arch_commit_memory_region+0x299/0x490 arch/x86/kvm/x86.c:13031
> kvm_commit_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:1751 [inline]
> kvm_set_memslot+0x4d3/0x13e0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1994
> __kvm_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:2129 [inline]
> __kvm_set_memory_region+0xdbc/0x1520 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2020
> kvm_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:2150 [inline]
> kvm_vm_ioctl_set_memory_region arch/x86/kvm/../../../virt/kvm/kvm_main.c:2162 [inline]
> kvm_vm_ioctl+0x151c/0x3e20 arch/x86/kvm/../../../virt/kvm/kvm_main.c:5152

The reproducer uses nested virtualization to launch an L2 with EPT
disabled. This creates a TDP MMU root with role.guest_mode=1, which
triggers the WARN_ON() in kvm_tdp_mmu_clear_dirty_slot() because
kvm_mmu_page_ad_need_write_protect() returns false whenever PML is
enabled and the shadow page role.guest_mode=1.

If I'm reading prepare_vmcs02_constant_state() correctly, we always
disable PML when running in L2. So when enable_pml=1 and L2 runs with
EPT disabled we are blind to dirty tracking L2 accesses.

David Matlack

unread,
Mar 15, 2024, 2:29:37 PMMar 15
to syzbot, b...@alien8.de, dave....@linux.intel.com, h...@zytor.com, k...@vger.kernel.org, linux-...@vger.kernel.org, mi...@redhat.com, pbon...@redhat.com, sea...@google.com, syzkall...@googlegroups.com, tg...@linutronix.de, x...@kernel.org, vip...@google.com
+Vipin

I believe this was introduced by 6.4 commit 5982a5392663 ("KVM: x86/mmu:
Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprot").

I see two options to fix:

1. Allow PML to be enabled when L2 is running with EPT is disabled.
2. Fix the TDP MMU logic to write-protect role.guest_mode=1 SPTEs.

(1.) sounds more complicated and will require more testing. (2.) is
quite simple since an entire TDP MMU tree is either guest_mode=0 or
guest_mode=1.

Example fix (fixes syzbot repro but otherwise untested):

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6ae19b4ee5b1..eb6fb8d9c00c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1498,6 +1498,24 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
}
}

+static inline u64 tdp_mmu_dirty_bit(struct kvm_mmu_page *root, bool force_wrprot)
+{
+ if (force_wrprot || kvm_mmu_page_ad_need_write_protect(root) || !kvm_ad_enabled())
+ return PT_WRITABLE_MASK;
+
+ return shadow_dirty_mask;
+}
+
+static inline bool tdp_mmu_dirty_bit_invalid_for_spte(struct tdp_iter *iter, u64 dbit)
+{
+ /*
+ * The decision of whether to clear the D-bit or W-bit is made based on
+ * the root, as all TDP MMU SPTEs within a root should require the same
+ * modifications. This check ensures that is actually the case.
+ */
+ return dbit == shadow_dirty_mask && spte_ad_need_write_protect(iter->old_spte);
+}
+
/*
* Clear the dirty status of all the SPTEs mapping GFNs in the memslot. If
* AD bits are enabled, this will involve clearing the dirty bit on each SPTE.
@@ -1508,7 +1526,7 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t start, gfn_t end)
{
- u64 dbit = kvm_ad_enabled() ? shadow_dirty_mask : PT_WRITABLE_MASK;
+ u64 dbit = tdp_mmu_dirty_bit(root, false);
struct tdp_iter iter;
bool spte_set = false;

@@ -1523,8 +1541,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
continue;

- KVM_MMU_WARN_ON(kvm_ad_enabled() &&
- spte_ad_need_write_protect(iter.old_spte));
+ KVM_MMU_WARN_ON(tdp_mmu_dirty_bit_invalid_for_spte(&iter, dbit));

if (!(iter.old_spte & dbit))
continue;
@@ -1570,8 +1587,7 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
gfn_t gfn, unsigned long mask, bool wrprot)
{
- u64 dbit = (wrprot || !kvm_ad_enabled()) ? PT_WRITABLE_MASK :
- shadow_dirty_mask;
+ u64 dbit = tdp_mmu_dirty_bit(root, wrprot);
struct tdp_iter iter;

lockdep_assert_held_write(&kvm->mmu_lock);
@@ -1583,8 +1599,7 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
if (!mask)
break;

- KVM_MMU_WARN_ON(kvm_ad_enabled() &&
- spte_ad_need_write_protect(iter.old_spte));
+ KVM_MMU_WARN_ON(tdp_mmu_dirty_bit_invalid_for_spte(&iter, dbit));

if (iter.level > PG_LEVEL_4K ||
!(mask & (1UL << (iter.gfn - gfn))))

Sean Christopherson

unread,
Mar 15, 2024, 3:06:08 PMMar 15
to David Matlack, syzbot, b...@alien8.de, dave....@linux.intel.com, h...@zytor.com, k...@vger.kernel.org, linux-...@vger.kernel.org, mi...@redhat.com, pbon...@redhat.com, syzkall...@googlegroups.com, tg...@linutronix.de, x...@kernel.org, vip...@google.com
Nit, kvm_mmu_page_ad_need_write_protect() returns %true, not %false. Your analysis
is correct, I think you just mistyped.

> >
> > If I'm reading prepare_vmcs02_constant_state() correctly, we always
> > disable PML when running in L2.

Ya.

> So when enable_pml=1 and L2 runs with EPT disabled we are blind to dirty
> tracking L2 accesses.

Ow.

> +Vipin
>
> I believe this was introduced by 6.4 commit 5982a5392663 ("KVM: x86/mmu:
> Use kvm_ad_enabled() to determine if TDP MMU SPTEs need wrprot").
>
> I see two options to fix:
>
> 1. Allow PML to be enabled when L2 is running with EPT is disabled.
> 2. Fix the TDP MMU logic to write-protect role.guest_mode=1 SPTEs.
>
> (1.) sounds more complicated and will require more testing.

It's not terribly complicated, but I 100% agree it's too complicated for a bugfix
that is destined for stable. And long term, I don't even know that it's something
we'd want to bother actively supporting, as any amount of complexity beyond "PML
is disabled for L2" is probably dead weight since not using TDP is quite uncommon
these days.

> (2.) is quite simple since an entire TDP MMU tree is either guest_mode=0 or
> guest_mode=1.
>
> Example fix (fixes syzbot repro but otherwise untested):
>
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 6ae19b4ee5b1..eb6fb8d9c00c 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1498,6 +1498,24 @@ void kvm_tdp_mmu_try_split_huge_pages(struct kvm *kvm,
> }
> }
>
> +static inline u64 tdp_mmu_dirty_bit(struct kvm_mmu_page *root, bool force_wrprot)

No "inline" on local statics, formletter incoming...

Do not use "inline" for functions that are visible only to the local compilation
unit. "inline" is just a hint, and modern compilers are smart enough to inline
functions when appropriate without a hint.

A longer explanation/rant here: https://lore.kernel.org/all/ZAdfX+S3...@google.com
Hrm, I like common helpers, but I dislike "blind" booleans even more. My vote
would be to do this:

u64 dbit = wrprot ? PT_WRITABLE_MASK : tdp_mmu_dirty_bit(root);

if we want to hide the PT_WRITABLE_MASK vs. shadow_dirty_mask. Though I'm tempted
to vote for open coding the bit selection, e.g.

u64 dbit = (wrprot || tdp_mmu_force_wrprot(root)) ? PT_WRITABLE_MASK :
shadow_dirty_mask;

and

u64 dbit = tdp_mmu_force_wrprot(root) ? PT_WRITABLE_MASK :
shadow_dirty_mask;

For tdp_mmu_force_wrprot(), the "what" is pretty clear, i.e. readers only need
to look at the implementation if they care _why_ KVM forces write-protection.

But with tdp_mmu_dirty_bit(), even the "what" isn't clear, i.e. it's not obvious
that the options are hardware's D-bit and the writable bit.

> struct tdp_iter iter;
> bool spte_set = false;
>
> @@ -1523,8 +1541,7 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> if (tdp_mmu_iter_cond_resched(kvm, &iter, false, true))
> continue;
>
> - KVM_MMU_WARN_ON(kvm_ad_enabled() &&
> - spte_ad_need_write_protect(iter.old_spte));
> + KVM_MMU_WARN_ON(tdp_mmu_dirty_bit_invalid_for_spte(&iter, dbit));
>
> if (!(iter.old_spte & dbit))
> continue;
> @@ -1570,8 +1587,7 @@ bool kvm_tdp_mmu_clear_dirty_slot(struct kvm *kvm,
> static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
> gfn_t gfn, unsigned long mask, bool wrprot)
> {
> - u64 dbit = (wrprot || !kvm_ad_enabled()) ? PT_WRITABLE_MASK :
> - shadow_dirty_mask;
> + u64 dbit = tdp_mmu_dirty_bit(root, wrprot);
> struct tdp_iter iter;
>
> lockdep_assert_held_write(&kvm->mmu_lock);
> @@ -1583,8 +1599,7 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
> if (!mask)
> break;
>
> - KVM_MMU_WARN_ON(kvm_ad_enabled() &&
> - spte_ad_need_write_protect(iter.old_spte));
> + KVM_MMU_WARN_ON(tdp_mmu_dirty_bit_invalid_for_spte(&iter, dbit));

I would definitely prefer to keep this open coded as:

KVM_MMU_WARN_ON(dbit != PT_WRITABLE_MASK &&
spte_ad_need_write_protect(iter.old_spte));

IMO, that's self-explanatory, i.e. doesn't need a comment, and doesn't require
hunting down the tdp_mmu_dirty_bit_invalid_for_spte() to see what the fuss is
about.
Reply all
Reply to author
Forward
0 new messages