[PATCH] x86/kfence: Avoid writing L1TF-vulnerable PTEs


Andrew Cooper

Jan 6, 2026, 1:04:36 PM
to LKML, Andrew Cooper, Marco Elver, Alexander Potapenko, Dmitry Vyukov, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x...@kernel.org, H. Peter Anvin, Andrew Morton, Jann Horn, kasa...@googlegroups.com
For native, the choice of PTE value is fine; there's real memory backing
the non-present PTE. However, for XenPV guests, Xen complains:

(XEN) d1 L1TF-vulnerable L1e 8010000018200066 - Shadowing

To explain, some background on XenPV pagetables:

Xen PV guests control their own pagetables; they choose the new PTE
value and use hypercalls to make changes, so that Xen can audit them for
safety.

In addition to a regular reference count, Xen also maintains a type
reference count, e.g. SegDesc (referenced by the vGDT/vLDT),
Writable (referenced with _PAGE_RW) or L{1..4} (referenced by vCR3 or a
lower pagetable level). This is in order to prevent e.g. a page being
inserted into the pagetables while the guest still has a writable
mapping of it.
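
As a rough sketch of the idea (simplified, with illustrative names; not
Xen's real get_page_type() implementation): a frame holds exactly one
type at a time, so a claim for a conflicting type fails:

enum pg_type { PGT_none, PGT_writable, PGT_segdesc, PGT_l1, PGT_l2, PGT_l3, PGT_l4 };

struct page_info {
	unsigned long type_count;	/* references held under the current type */
	enum pg_type type;
};

/* Take a typed reference; fail if the frame already has a different type. */
static bool get_page_type(struct page_info *pg, enum pg_type t)
{
	if (pg->type_count && pg->type != t)
		return false;	/* e.g. a writable-mapped page can't become an L1 */
	pg->type = t;
	pg->type_count++;
	return true;
}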

For non-present mappings, all other bits become software accessible, and
typically contain metadata rather than a real frame address. There is nothing
that a reference count could sensibly be tied to. As such, even if Xen
could recognise the address as currently safe, nothing would prevent that
frame from changing owner to another VM in the future.
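
For illustration, a minimal sketch of the classification Xen is making
when it emits the message above (hypothetical names and a simplified
check; the real logic also accounts for the host's physical address
width):

#include <stdbool.h>
#include <stdint.h>

#define _PAGE_PRESENT	(1ULL << 0)
#define PTE_PFN_MASK	0x000ffffffffff000ULL	/* address bits 12-51 */

/*
 * A not-present PTE is dangerous when its (architecturally ignored)
 * address bits still land inside cacheable host memory, as L1TF lets a
 * guest speculatively read the L1D contents for that physical address.
 * The logged PTE 8010000018200066 has P (bit 0) clear but address bits
 * naming frame 0x18200, hence the complaint.
 */
static bool pte_is_l1tf_vulnerable(uint64_t pte, uint64_t host_addr_limit)
{
	if (pte & _PAGE_PRESENT)
		return false;	/* Present PTEs are audited via the type system. */

	return (pte & PTE_PFN_MASK) < host_addr_limit;
}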

When Xen detects a PV guest writing an L1TF-vulnerable PTE, it responds by
activating shadow paging. This is normally only used for the live phase of
migration, and comes with a non-trivial overhead.

KFENCE only cares about getting #PF to catch wild accesses; it doesn't care
about the PTE value for non-present mappings. Use a fully inverted PTE to
avoid hitting the slow path when running under Xen.
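
To illustrate with a made-up value: inverting a present PTE clears P and
sets the high address bits, pushing the (ignored) address far above host
memory. Inverting again restores the original PTE exactly, which is why
a single set_pte(pte, __pte(~val)) below serves both directions:

	uint64_t val = 0x8010000018200067ULL;	/* hypothetical present PTE; P (bit 0) set */
	uint64_t inv = ~val;			/* 0x7fefffffe7dfff98; P clear, high bits set */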

While adjusting the logic, take the opportunity to skip all actions if the
PTE is already in the right state, halve the number of PVOps callouts, and
skip TLB maintenance on a !P -> P transition, which benefits non-Xen cases too.

Fixes: 1dc0da6e9ec0 ("x86, kfence: enable KFENCE for x86")
Tested-by: Marco Elver <el...@google.com>
Signed-off-by: Andrew Cooper <andrew....@citrix.com>
---
CC: Alexander Potapenko <gli...@google.com>
CC: Marco Elver <el...@google.com>
CC: Dmitry Vyukov <dvy...@google.com>
CC: Thomas Gleixner <tg...@linutronix.de>
CC: Ingo Molnar <mi...@redhat.com>
CC: Borislav Petkov <b...@alien8.de>
CC: Dave Hansen <dave....@linux.intel.com>
CC: x...@kernel.org
CC: "H. Peter Anvin" <h...@zytor.com>
CC: Andrew Morton <ak...@linux-foundation.org>
CC: Jann Horn <ja...@google.com>
CC: kasa...@googlegroups.com
CC: linux-...@vger.kernel.org

v1:
* First public posting. This went to security@ first just in case, and
then I got distracted with other things ahead of public posting.
---
arch/x86/include/asm/kfence.h | 29 ++++++++++++++++++++++++-----
1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kfence.h b/arch/x86/include/asm/kfence.h
index ff5c7134a37a..acf9ffa1a171 100644
--- a/arch/x86/include/asm/kfence.h
+++ b/arch/x86/include/asm/kfence.h
@@ -42,10 +42,34 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
 {
        unsigned int level;
        pte_t *pte = lookup_address(addr, &level);
+       pteval_t val;
 
        if (WARN_ON(!pte || level != PG_LEVEL_4K))
                return false;
 
+       val = pte_val(*pte);
+
+       /*
+        * protect requires making the page not-present.  If the PTE is
+        * already in the right state, there's nothing to do.
+        */
+       if (protect != !!(val & _PAGE_PRESENT))
+               return true;
+
+       /*
+        * Otherwise, invert the entire PTE.  This avoids writing out an
+        * L1TF-vulnerable PTE (not present, without the high address bits
+        * set).
+        */
+       set_pte(pte, __pte(~val));
+
+       /*
+        * If the page was protected (non-present) and we're making it
+        * present, there is no need to flush the TLB at all.
+        */
+       if (!protect)
+               return true;
+
        /*
         * We need to avoid IPIs, as we may get KFENCE allocations or faults
         * with interrupts disabled. Therefore, the below is best-effort, and
@@ -53,11 +77,6 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
         * lazy fault handling takes care of faults after the page is PRESENT.
         */
 
-       if (protect)
-               set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
-       else
-               set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
-
        /*
         * Flush this CPU's TLB, assuming whoever did the allocation/free is
         * likely to continue running on this CPU.

base-commit: 7f98ab9da046865d57c102fd3ca9669a29845f67
--
2.39.5

Alexander Potapenko

Jan 7, 2026, 6:32:01 AM
to Andrew Cooper, LKML, Marco Elver, Dmitry Vyukov, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x...@kernel.org, H. Peter Anvin, Andrew Morton, Jann Horn, kasa...@googlegroups.com
Reviewed-by: Alexander Potapenko <gli...@google.com>

> /*
> * We need to avoid IPIs, as we may get KFENCE allocations or faults
> * with interrupts disabled. Therefore, the below is best-effort, and
> @@ -53,11 +77,6 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
> * lazy fault handling takes care of faults after the page is PRESENT.
> */
Nit: should this comment be moved above set_pte(), or merged with
the following comment block?

Andrew Cooper

Jan 7, 2026, 7:02:08 AM
to Alexander Potapenko, Andrew Cooper, LKML, Marco Elver, Dmitry Vyukov, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x...@kernel.org, H. Peter Anvin, Andrew Morton, Jann Horn, kasa...@googlegroups.com
Thanks.

>
>> /*
>> * We need to avoid IPIs, as we may get KFENCE allocations or faults
>> * with interrupts disabled. Therefore, the below is best-effort, and
>> @@ -53,11 +77,6 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
>> * lazy fault handling takes care of faults after the page is PRESENT.
>> */
> Nit: should this comment be moved above set_pte(), or merged with
> the following comment block?

Hmm, probably merged, as they're both about the TLB maintenance.  But the
end result is a far messier diff:

@@ -42,23 +42,40 @@ static inline bool kfence_protect_page(unsigned long addr, bool protect)
 {
        unsigned int level;
        pte_t *pte = lookup_address(addr, &level);
+       pteval_t val;
 
        if (WARN_ON(!pte || level != PG_LEVEL_4K))
                return false;
 
+       val = pte_val(*pte);
+
        /*
-        * We need to avoid IPIs, as we may get KFENCE allocations or faults
-        * with interrupts disabled. Therefore, the below is best-effort, and
-        * does not flush TLBs on all CPUs. We can tolerate some inaccuracy;
-        * lazy fault handling takes care of faults after the page is PRESENT.
+        * protect requires making the page not-present.  If the PTE is
+        * already in the right state, there's nothing to do.
+        */
+       if (protect != !!(val & _PAGE_PRESENT))
+               return true;
+
+       /*
+        * Otherwise, invert the entire PTE.  This avoids writing out an
+        * L1TF-vulnerable PTE (not present, without the high address bits
+        * set).
         */
+       set_pte(pte, __pte(~val));
 
-       if (protect)
-               set_pte(pte, __pte(pte_val(*pte) & ~_PAGE_PRESENT));
-       else
-               set_pte(pte, __pte(pte_val(*pte) | _PAGE_PRESENT));
+       /*
+        * If the page was protected (non-present) and we're making it
+        * present, there is no need to flush the TLB at all.
+        */
+       if (!protect)
+               return true;
 
        /*
+        * We need to avoid IPIs, as we may get KFENCE allocations or faults
+        * with interrupts disabled. Therefore, the below is best-effort, and
+        * does not flush TLBs on all CPUs. We can tolerate some inaccuracy;
+        * lazy fault handling takes care of faults after the page is PRESENT.
+        *
         * Flush this CPU's TLB, assuming whoever did the allocation/free is
         * likely to continue running on this CPU.
         */



I need to resubmit anyway, because I've spotted one silly error in the
commit message.

I could submit two patches, with the second one stated as "to make the
previous patch legible".

Thoughts?

~Andrew

Andrew Morton

Jan 7, 2026, 6:17:03 PM
to Andrew Cooper, LKML, Marco Elver, Alexander Potapenko, Dmitry Vyukov, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x...@kernel.org, H. Peter Anvin, Jann Horn, kasa...@googlegroups.com
Seems that I sent 1dc0da6e9ec0 upstream so thanks, I'll grab this. If
an x86 person chooses to handle it then I'll drop the mm.git version.

I'll add a cc:stable to the mm.git copy, just to be sure.

> Tested-by: Marco Elver <el...@google.com>
> Signed-off-by: Andrew Cooper <andrew....@citrix.com>
> ---

That "^---$" tells tooling "changelog stops here".

> CC: Alexander Potapenko <gli...@google.com>
> CC: Marco Elver <el...@google.com>
> CC: Dmitry Vyukov <dvy...@google.com>
> CC: Thomas Gleixner <tg...@linutronix.de>
> CC: Ingo Molnar <mi...@redhat.com>
> CC: Borislav Petkov <b...@alien8.de>
> CC: Dave Hansen <dave....@linux.intel.com>
> CC: x...@kernel.org
> CC: "H. Peter Anvin" <h...@zytor.com>
> CC: Andrew Morton <ak...@linux-foundation.org>
> CC: Jann Horn <ja...@google.com>
> CC: kasa...@googlegroups.com
> CC: linux-...@vger.kernel.org
>
> v1:
> * First public posting. This went to security@ first just in case, and
> then I got distracted with other things ahead of public posting.
> ---

That "^---$" would be better placed above the versioning info.

>
> ...
>