
[RFC 00/13] x86/mm: PCID and INVPCID


Andy Lutomirski

Jan 8, 2016, 6:19:02 PM1/8/16
to x...@kernel.org, linux-...@vger.kernel.org, Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov, linu...@kvack.org, Andy Lutomirski
Here's my PCID and INVPCID work-in-progress. It seems to work well
enough to play with it. (That is, I'm not aware of anything wrong
with it, although it may eat your data.)

PCID and INVPCID use are orthogonal here. INVPCID is a
straightforward speedup for global TLB flushes. Other than that, I
don't use INVPCID at all, since it seems slower than just
manipulating CR3 carefully, at least on my Skylake laptop.

Please play around and suggest (and run?) good benchmarks. It seems
to save around 100ns on cross-process context switches for me.
Unfortunately, we suck at context switches in general, so this is,
at best, a little over a 10% speedup. Most of the time is spent in
the scheduler, not in arch code.

Andy Lutomirski (13):
x86/paravirt: Turn KASAN off for paravirt.o
x86/mm: Add INVPCID helpers
x86/mm: Add a noinvpcid option to turn off INVPCID
x86/mm: If INVPCID is available, use it to flush global mappings
x86/mm: Add barriers and document switch_mm-vs-flush synchronization
x86/mm: Disable PCID on 32-bit kernels
x86/mm: Add nopcid to turn off PCID
x86/mm: Teach CR3 readers about PCID
x86/mm: Disable interrupts when flushing the TLB using CR3
x86/mm: Factor out remote TLB flushing
x86/mm: Build arch/x86/mm/tlb.c even on !SMP
x86/mm: Uninline switch_mm
x86/mm: Try to preserve old TLB entries using PCID

Documentation/kernel-parameters.txt | 4 +
arch/x86/include/asm/disabled-features.h | 4 +-
arch/x86/include/asm/mmu.h | 7 +-
arch/x86/include/asm/mmu_context.h | 62 +-----
arch/x86/include/asm/tlbflush.h | 86 ++++++++
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/cpu/bugs.c | 6 +
arch/x86/kernel/cpu/common.c | 38 ++++
arch/x86/kernel/head64.c | 3 +-
arch/x86/kernel/ldt.c | 2 +
arch/x86/kernel/process_64.c | 2 +
arch/x86/mm/Makefile | 3 +-
arch/x86/mm/fault.c | 8 +-
arch/x86/mm/tlb.c | 324 +++++++++++++++++++++++++++++--
14 files changed, 467 insertions(+), 83 deletions(-)

--
2.5.0

Linus Torvalds

Jan 8, 2016, 6:31:13 PM1/8/16
to Andy Lutomirski, the arch/x86 maintainers, Linux Kernel Mailing List, Borislav Petkov, Brian Gerst, Dave Hansen, Oleg Nesterov, linu...@kvack.org
On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski <lu...@kernel.org> wrote:
>
> Please play around and suggest (and run?) good benchmarks. It seems
> to save around 100ns on cross-process context switches for me.

Interesting. There was reportedly (I never saw it) a test-patch to use
pcids inside of Intel a couple of years ago, and it never got outside
because it didn't make a difference.

Either things have changed (newer hardware with more pcids perhaps?)
or you did a better job at it.

Linus

Andy Lutomirski

Jan 8, 2016, 6:36:41 PM1/8/16
to Linus Torvalds, Andy Lutomirski, the arch/x86 maintainers, Linux Kernel Mailing List, Borislav Petkov, Brian Gerst, Dave Hansen, Oleg Nesterov, linu...@kvack.org
On Fri, Jan 8, 2016 at 3:31 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski <lu...@kernel.org> wrote:
>>
>> Please play around and suggest (and run?) good benchmarks. It seems
>> to save around 100ns on cross-process context switches for me.
>
> Interesting. There was reportedly (I never saw it) a test-patch to use
> pcids inside of Intel a couple of years ago, and it never got outside
> because it didn't make a difference.

I have a copy of that patch, and my code works very differently. I
only use 3 bits of PCID in this series. I could probably reduce that
to 2 with little loss. 4 or more would be a waste.

>
> Either things have changed (newer hardware with more pcids perhaps?)
> or you did a better job at it.

On my Skylake laptop, all of the PCID bits appear to have at least
some effect. Whether this means it gets hashed or whether this means
that all of the bits are real, I don't know. I'll fiddle with it on
an older machine.

--Andy

Linus Torvalds

Jan 8, 2016, 6:41:19 PM1/8/16
to Andy Lutomirski, the arch/x86 maintainers, Linux Kernel Mailing List, Borislav Petkov, Brian Gerst, Dave Hansen, Oleg Nesterov, linu...@kvack.org
On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski <lu...@kernel.org> wrote:
> + /*
> + * We mustn't be preempted or handle an IPI while reading and
> + * writing CR3. Preemption could switch mms and switch back, and
> + * an IPI could call leave_mm. Either of those could cause our
> + * PCID to change asynchronously.
> + */
> + raw_local_irq_save(flags);
> native_write_cr3(native_read_cr3());
> + raw_local_irq_restore(flags);

This seems sad for two reasons:

- it adds unnecessary overhead on non-pcid setups (32-bit being an
example of that)

- on pcid setups, wouldn't invpcid_flush_single_context() be better?

So on the whole I hate it.

Why isn't this something like

if (static_cpu_has_safe(X86_FEATURE_INVPCID)) {
invpcid_flush_single_context();
return;
}
native_write_cr3(native_read_cr3());

*without* any flag saving crud?

And yes, that means that we'd require X86_FEATURE_INVPCID in order to
use X86_FEATURE_PCID, but that seems fine.

Or is there some reason you wanted the odd flags version? If so, that
should be documented.

Linus

Linus Torvalds

Jan 8, 2016, 6:42:41 PM1/8/16
to Andy Lutomirski, Andy Lutomirski, the arch/x86 maintainers, Linux Kernel Mailing List, Borislav Petkov, Brian Gerst, Dave Hansen, Oleg Nesterov, linu...@kvack.org
On Fri, Jan 8, 2016 at 3:36 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>>
>> Either things have changed (newer hardware with more pcids perhaps?)
>> or you did a better job at it.
>
> On my Skylake laptop, all of the PCID bits appear to have at least
> some effect. Whether this means it gets hashed or whether this means
> that all of the bits are real, I don't know.

They have always gotten hashed, and no the bits aren't real - hardware
doesn't actually have as many bits in the pcid as there are in cr3.

Linus

Dave Hansen

Jan 8, 2016, 6:56:01 PM1/8/16
to Andy Lutomirski, x...@kernel.org, linux-...@vger.kernel.org, Borislav Petkov, Brian Gerst, Linus Torvalds, Oleg Nesterov, linu...@kvack.org
On 01/08/2016 03:15 PM, Andy Lutomirski wrote:
> @@ -352,3 +354,5 @@ static int __init create_tlb_single_page_flush_ceiling(void)
> return 0;
> }
> late_initcall(create_tlb_single_page_flush_ceiling);
> +
> +#endif /* CONFIG_SMP */

Heh, I was about to complain that you #ifdef'd out my lovely INVLPG
tunable. But I guess on UP you just get flush_tlb_mm_range() from:

> static inline void flush_tlb_mm_range(struct mm_struct *mm,
> unsigned long start, unsigned long end, unsigned long vmflag)
> {
> if (mm == current->active_mm)
> __flush_tlb_up();
> }

which doesn't even do INVLPG. How sad. Poor UP.

Andy Lutomirski

Jan 8, 2016, 7:18:36 PM1/8/16
to Linus Torvalds, Oleg Nesterov, X86 ML, Dave Hansen, Borislav Petkov, Linux Kernel Mailing List, linu...@kvack.org, Brian Gerst
On Jan 8, 2016 3:41 PM, "Linus Torvalds" <torv...@linux-foundation.org> wrote:
>
> On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski <lu...@kernel.org> wrote:
> > + /*
> > + * We mustn't be preempted or handle an IPI while reading and
> > + * writing CR3. Preemption could switch mms and switch back, and
> > + * an IPI could call leave_mm. Either of those could cause our
> > + * PCID to change asynchronously.
> > + */
> > + raw_local_irq_save(flags);
> > native_write_cr3(native_read_cr3());
> > + raw_local_irq_restore(flags);
>
> This seems sad for two reasons:
>
> - it adds unnecessary overhead on non-pcid setups (32-bit being an
> example of that)

I can certainly skip the flag saving on !PCID.

>
> - on pcid setups, wouldn't invpcid_flush_single_context() be better?
>

I played with that and it was slower. I don't pretend that makes any sense.

> So on the whole I hate it.
>
> Why isn't this something like
>
> if (static_cpu_has_safe(X86_FEATURE_INVPCID)) {
> invpcid_flush_single_context();
> return;
> }
> native_write_cr3(native_read_cr3());
>
> *without* any flag saving crud?
>
> And yes, that means that we'd require X86_FEATURE_INVPCID in order to
> use X86_FEATURE_PCID, but that seems fine.

I have an SNB "Extreme" with PCID but not INVPCID, and there could be
a whole generation of servers like that. I think we should fully
support them.

We might be able to get away with just disabling preemption instead of
IRQs, at least if mm == active_mm.

>
> Or is there some reason you wanted the odd flags version? If so, that
> should be documented.

What do you mean "odd"?

--Andy

Dave Hansen

Jan 8, 2016, 7:28:03 PM1/8/16
to Andy Lutomirski, x...@kernel.org, linux-...@vger.kernel.org, Borislav Petkov, Brian Gerst, Linus Torvalds, Oleg Nesterov, linu...@kvack.org
On 01/08/2016 03:15 PM, Andy Lutomirski wrote:
> + * The guiding principle of this code is that TLB entries that have
> + * survived more than a small number of context switches are mostly
> + * useless, so we don't try very hard not to evict them.

Big ack on that. The original approach tried to keep track of the full
4k worth of possible PCIDs; it also needed an additional cpumask (which
it dynamically allocated) for where the PCID was active, in addition to
the normal "where has this mm been" mask.

That's a lot of extra data to mangle, and I can definitely see your
approach as being nicer, *IF* the hardware isn't doing something useful
with the other 9 bits of PCID that you're throwing away. ;)

Andy Lutomirski

Jan 8, 2016, 9:20:27 PM1/8/16
to Dave Hansen, Andy Lutomirski, X86 ML, linux-...@vger.kernel.org, Borislav Petkov, Brian Gerst, Linus Torvalds, Oleg Nesterov, linu...@kvack.org
On Fri, Jan 8, 2016 at 4:27 PM, Dave Hansen <dave....@linux.intel.com> wrote:
> On 01/08/2016 03:15 PM, Andy Lutomirski wrote:
>> + * The guiding principle of this code is that TLB entries that have
>> + * survived more than a small number of context switches are mostly
>> + * useless, so we don't try very hard not to evict them.
>
> Big ack on that. The original approach tried to keep track of the full
> 4k worth of possible PCIDs, it also needed an additional cpumask (which
> it dynamically allocated) for where the PCID was active in addition to
> the normal "where has this mm been" mask.

My patch has a similar extra cpumask, but at least I didn't
dynamically allocate it. I did it because I need a 100% reliable way
to tell whether a given mm has a valid PCID in a cpu's PCID LRU list,
as opposed to just matching due to struct mm reuse or similar. I also
need the ability to blow away old mappings, which I can do by clearing
the cpumask. This happens in init_new_context and in
propagate_tlb_flush.

The other way to do it would be to store some kind of generation
counter in the per-cpu list. I could use a global 64-bit atomic
counter to allocate never-reused mm ids (it's highly unlikely that a
system will run long enough for such a counter to overflow -- it could
only ever be incremented every few ns, giving hundreds of years of
safety), but that's kind of expensive. I could use a per-cpu
allocator, but 54 bits per cpu is uncomfortably small unless we have
wraparound handling. We could do 64 bits per cpu for very cheap
counter allocation, but then the "zap the pcid" logic gets much
nastier in that neither the percpu entries nor the per-mm generation
counter entries fit in a word any more. Maybe that's fine.

What we can't do easily is have a per-mm generation counter, because
freeing an mm and reallocating it in the same place needs to reliably
zap the pcid on all CPUs.

Anyway, this problem is clearly solvable, but I haven't thought of a
straightforward solution that doesn't involve rarely-executed code
paths, and that makes me a bit nervous.
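The global never-reused counter option sketched above can be made concrete like this; the names (ctx_id, the 8-slots-per-CPU layout from 3 PCID bits) are my assumptions for illustration, not the patch's actual data structures:

```c
/*
 * Sketch of the "global 64-bit never-reused mm id" option: each CPU
 * records which ctx_id owns each of its 8 PCID slots (3 PCID bits).
 * A stale slot can never falsely match a recycled mm, because a new
 * mm always gets a fresh id.  All names here are illustrative.
 */
#include <stdint.h>

#define NR_CPUS   4
#define NR_SLOTS  8	/* 3 bits of PCID */

static uint64_t next_ctx_id = 1;	/* would be atomic64 in the kernel */
static uint64_t percpu_slot_owner[NR_CPUS][NR_SLOTS];

/* Give a freshly created mm an id that will never be reused. */
static uint64_t alloc_ctx_id(void)
{
	return next_ctx_id++;
}

/* Record that @cpu's PCID slot @slot now holds @ctx_id's mappings. */
static void claim_slot(int cpu, int slot, uint64_t ctx_id)
{
	percpu_slot_owner[cpu][slot] = ctx_id;
}

/* Return the slot with a live PCID for @ctx_id on @cpu, or -1. */
static int find_live_slot(int cpu, uint64_t ctx_id)
{
	for (int s = 0; s < NR_SLOTS; s++)
		if (percpu_slot_owner[cpu][s] == ctx_id)
			return s;
	return -1;
}
```

Freeing an mm needs no cross-CPU work in this scheme: the next mm allocated at the same address gets a new ctx_id, so stale per-cpu slots simply never match again, which is exactly the property the cpumask approach needs rarely-executed zap paths for.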

--Andy

Linus Torvalds

Jan 8, 2016, 9:20:32 PM1/8/16
to Andy Lutomirski, Oleg Nesterov, X86 ML, Dave Hansen, Borislav Petkov, Linux Kernel Mailing List, linu...@kvack.org, Brian Gerst
On Fri, Jan 8, 2016 at 4:18 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>>
>> - on pcid setups, wouldn't invpcid_flush_single_context() be better?
>
> I played with that and it was slower. I don't pretend that makes any sense.

Ugh. I guess reading and writing cr3 has been optimized.

>> And yes, that means that we'd require X86_FEATURE_INVPCID in order to
>> use X86_FEATURE_PCID, but that seems fine.
>
> I have an SNB "Extreme" with PCID but not INVPCID, and there could be
> a whole generation of servers like that. I think we should fully
> support them.

Can you check the timings? IOW, is it a win on SNB?

I think originally Intel only had two actual bits of process context
ID in the TLB, and it was meant to be used for virtualization or
something. Together with the hashing (to make it always appear as 12
bits to software - a nice idea but also means that the hardware ends
up invalidating more than software really expects), it may not work
all that well.

That _could_ explain why the original patch from intel didn't work.

> We might be able to get away with just disabling preemption instead of
> IRQs, at least if mm == active_mm.

I'm not convinced it is all that much faster. Of course, it's nicer on
non-preempt, but nobody seems to run things that way.

>> Or is there some reason you wanted the odd flags version? If so, that
>> should be documented.
>
> What do you mean "odd"?

It's odd because it makes no sense for non-pcid (christ, I wish Intel
had just called it "asid" instead, "pcid" always makes me react to
"pci"), and I think it would make more sense to pair up the pcid case
with the invpcid rather than have those preemption rules here.

Linus

Borislav Petkov

Jan 10, 2016, 2:00:06 PM1/10/16
to Andy Lutomirski, x...@kernel.org, linux-...@vger.kernel.org, Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov, linu...@kvack.org, Andrey Ryabinin
+ Andrey.

On Fri, Jan 08, 2016 at 03:15:19PM -0800, Andy Lutomirski wrote:
> Otherwise terrible things happen if some of the callbacks end up
> calling into KASAN in unexpected places.
>
> This has no obvious symptoms yet, but adding a memory reference to
> native_flush_tlb_global without this blows up on KASAN kernels.
>
> Signed-off-by: Andy Lutomirski <lu...@kernel.org>
> ---
> arch/x86/kernel/Makefile | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
> index b1b78ffe01d0..b7cd5bdf314b 100644
> --- a/arch/x86/kernel/Makefile
> +++ b/arch/x86/kernel/Makefile
> @@ -19,6 +19,7 @@ endif
> KASAN_SANITIZE_head$(BITS).o := n
> KASAN_SANITIZE_dumpstack.o := n
> KASAN_SANITIZE_dumpstack_$(BITS).o := n
> +KASAN_SANITIZE_paravirt.o := n
>
> CFLAGS_irq.o := -I$(src)/../include/asm/trace

Shouldn't we take this one irrespective of what happens to the rest of
the patchset?

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Ingo Molnar

Jan 11, 2016, 5:51:17 AM1/11/16
to Linus Torvalds, Andy Lutomirski, Oleg Nesterov, X86 ML, Dave Hansen, Borislav Petkov, Linux Kernel Mailing List, linu...@kvack.org, Brian Gerst

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> >> Or is there some reason you wanted the odd flags version? If so, that
> >> should be documented.
> >
> > What do you mean "odd"?
>
> It's odd because it makes no sense for non-pcid (christ, I wish Intel had just
> called it "asid" instead, "pcid" always makes me react to "pci"), and I think it
> would make more sense to pair up the pcid case with the invpcid rather than have
> those preemption rules here.

The naming is really painful, so a trivial suggestion: could we just name all the
Linux side bits 'asid' or 'ctx_id' (even in x86 arch code) and only use 'PCID'
nomenclature in the very lowest level code?

I.e. rename pcid_live_cpus et al and most functions to an asid or
ctx_id naming scheme or so. That would hide most of the naming ugliness.

Thanks,

Ingo

Andrey Ryabinin

Jan 11, 2016, 7:50:52 AM1/11/16
to Borislav Petkov, Andy Lutomirski, x...@kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov, linu...@kvack.org, linux-...@vger.kernel.org, Andrey Ryabinin
I don't think that this patch is the right way to solve the problem.
The follow-up patch "x86/kasan: clear kasan_zero_page after TLB flush" should fix Andy's problem.

Andrey Ryabinin

Jan 11, 2016, 7:50:59 AM1/11/16
to Borislav Petkov, Andy Lutomirski, x...@kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov, linu...@kvack.org, linux-...@vger.kernel.org, Andrey Ryabinin
Currently we clear kasan_zero_page before __flush_tlb_all(). This
works with the current implementation of native_flush_tlb[_global]()
because it doesn't do any writes to kasan shadow memory.
But any subtle change made in native_flush_tlb*() could break this.
Also, the current code doesn't seem to work for paravirt guests (lguest).

Only after the TLB flush can we be sure that kasan_zero_page is no
longer used as early shadow (instrumented code will not write to it).
So it should be cleared only after the TLB flush.

Signed-off-by: Andrey Ryabinin <arya...@virtuozzo.com>
---
arch/x86/mm/kasan_init_64.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index d470cf2..303e470 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -120,11 +120,16 @@ void __init kasan_init(void)
kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
(void *)KASAN_SHADOW_END);

- memset(kasan_zero_page, 0, PAGE_SIZE);
-
load_cr3(init_level4_pgt);
__flush_tlb_all();
- init_task.kasan_depth = 0;

+ /*
+ * kasan_zero_page has been used as early shadow memory, thus it may
+ * contain some garbage. Now we can clear it, since after the TLB flush
+ * no one should write to it.
+ */
+ memset(kasan_zero_page, 0, PAGE_SIZE);
+
+ init_task.kasan_depth = 0;
pr_info("KernelAddressSanitizer initialized\n");
}
--
2.4.10

Andrey Ryabinin

Jan 11, 2016, 7:51:08 AM1/11/16
to Borislav Petkov, Andy Lutomirski, x...@kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov, linu...@kvack.org, linux-...@vger.kernel.org, Andrey Ryabinin
After kasan_init() executed, no one is allowed to write to kasan_zero_page,
so write protect it.

Signed-off-by: Andrey Ryabinin <arya...@virtuozzo.com>
---
arch/x86/mm/kasan_init_64.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 303e470..1b1110f 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -125,10 +125,16 @@ void __init kasan_init(void)

/*
* kasan_zero_page has been used as early shadow memory, thus it may
- * contain some garbage. Now we can clear it, since after the TLB flush
- * no one should write to it.
+ * contain some garbage. Now we can clear and write protect it, since
+ * after the TLB flush no one should write to it.
*/
memset(kasan_zero_page, 0, PAGE_SIZE);
+ for (i = 0; i < PTRS_PER_PTE; i++) {
+ pte_t pte = __pte(__pa(kasan_zero_page) | __PAGE_KERNEL_RO);
+ set_pte(&kasan_zero_pte[i], pte);
+ }
+ /* Flush TLBs again to be sure that write protection applied. */
+ __flush_tlb_all();

Andy Lutomirski

Jan 13, 2016, 6:34:08 PM1/13/16
to Ingo Molnar, Linus Torvalds, Oleg Nesterov, X86 ML, Dave Hansen, Borislav Petkov, Linux Kernel Mailing List, linu...@kvack.org, Brian Gerst
On Mon, Jan 11, 2016 at 2:51 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> * Linus Torvalds <torv...@linux-foundation.org> wrote:
>
>> >> Or is there some reason you wanted the odd flags version? If so, that
>> >> should be documented.
>> >
>> > What do you mean "odd"?
>>
>> It's odd because it makes no sense for non-pcid (christ, I wish Intel had just
>> called it "asid" instead, "pcid" always makes me react to "pci"), and I think it
>> would make more sense to pair up the pcid case with the invpcid rather than have
>> those preemption rules here.
>
> The naming is really painful, so a trivial suggestion: could we just name all the
> Linux side bits 'asid' or 'ctx_id' (even in x86 arch code) and only use 'PCID'
> nomenclature in the very lowest level code?

I'd be okay with "pctx_id" or "pctxid" for this, I think. I'd like to
at least make it somewhat obvious how it maps back to hardware.

FWIW, I'd guess that Intel deviated from convention because their
actual address space id is (vpid, pcid), and calling it (vpid, asid)
might have been slightly confusing. Or not.

--Andy

Andy Lutomirski

Jan 13, 2016, 6:36:15 PM1/13/16
to Linus Torvalds, Oleg Nesterov, X86 ML, Dave Hansen, Borislav Petkov, Linux Kernel Mailing List, linu...@kvack.org, Brian Gerst
On Fri, Jan 8, 2016 at 6:20 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Fri, Jan 8, 2016 at 4:18 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>>>
>>> - on pcid setups, wouldn't invpcid_flush_single_context() be better?
>>
>> I played with that and it was slower. I don't pretend that makes any sense.
>
> Ugh. I guess reading and writing cr3 has been optimized.
>
>>> And yes, that means that we'd require X86_FEATURE_INVPCID in order to
>>> use X86_FEATURE_PCID, but that seems fine.
>>
>> I have an SNB "Extreme" with PCID but not INVPCID, and there could be
>> a whole generation of servers like that. I think we should fully
>> support them.
>
> Can you check the timings? IOW, is it a win on SNB?

~80ns gain on SNB. It's actually quite impressive on SNB: it knocks
the penalty for mm switches down to 20ns or so, which I find to be
fairly amazing. (This is at 3.8GHz or thereabouts.)

>
> I think originally Intel only had two actual bits of process context
> ID in the TLB, and it was meant to be used for virtualization or
> something. Together with the hashing (to make it always appear as 12
> bits to software - a nice idea but also means that the hardware ends
> up invalidating more than software really expects), it may not work
> all that well.
>
> That _could_ explain why the original patch from intel didn't work.
>
>> We might be able to get away with just disabling preemption instead of
>> IRQs, at least if mm == active_mm.
>
> I'm not convinced it is all that much faster. Of course, it's nicer on
> non-preempt, but nobody seems to run things that way.

My current testing version has three different code paths now. If
INVPCID and PCID are both available, then it uses INVPCID. If PCID is
available but INVPCID is not, it does raw_local_irqsave. If PCID is
not available, it just does the CR3 read/write.

Yeah, it's ugly, and it's a big blob of code to do something trivial,
but it seems to work and it should be the right thing to do in most
cases.

Can anyone here ask a hardware or microcode person what's going on
with CR3 writes possibly being faster than INVPCID? Is there some
trick to it?

--Andy

Dave Hansen

Jan 13, 2016, 6:44:28 PM1/13/16
to Andy Lutomirski, Linus Torvalds, Oleg Nesterov, X86 ML, Borislav Petkov, Linux Kernel Mailing List, linu...@kvack.org, Brian Gerst
On 01/13/2016 03:35 PM, Andy Lutomirski wrote:
> Can anyone here ask a hardware or microcode person what's going on
> with CR3 writes possibly being faster than INVPCID? Is there some
> trick to it?

I just went and measured it myself this morning. "INVPCID Type 3" (all
contexts no global) on a Skylake system was 15% slower than a CR3 write.

Is that in the same ballpark from what you've observed?


Andy Lutomirski

Jan 13, 2016, 6:52:02 PM1/13/16
to Dave Hansen, Linus Torvalds, Oleg Nesterov, X86 ML, Borislav Petkov, Linux Kernel Mailing List, linu...@kvack.org, Brian Gerst
It's similar, except that I was comparing "INVPCID Type 1" (single
context no globals) to a CR3 write.

Type 2, at least, is dramatically faster than the pair of CR4 writes
it replaces.
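For readers following the type numbers being compared here, the four INVPCID invalidation types take a 16-byte descriptor. This is a hedged sketch of the layout as described in the SDM; the field and enum names are mine:

```c
/*
 * The four INVPCID invalidation types and the 16-byte descriptor
 * they take (PCID in bits 0-11 of the first qword, linear address
 * in the second qword).  Names are illustrative, not kernel ones.
 */
#include <stdint.h>

enum invpcid_type {
	INVPCID_ADDR            = 0,	/* individual address, one PCID */
	INVPCID_SINGLE_CTX      = 1,	/* one PCID, non-global entries */
	INVPCID_ALL_INCL_GLOBAL = 2,	/* all PCIDs, including globals */
	INVPCID_ALL_NON_GLOBAL  = 3,	/* all PCIDs, excluding globals */
};

struct invpcid_desc {
	uint64_t pcid : 12;
	uint64_t reserved : 52;
	uint64_t addr;
};

/* Build a descriptor; returns 0 on success, -1 if pcid won't fit. */
static int make_invpcid_desc(struct invpcid_desc *d,
			     unsigned int pcid, uint64_t addr)
{
	if (pcid > 0xfff)
		return -1;
	d->pcid = pcid;
	d->reserved = 0;
	d->addr = addr;
	return 0;
}
```

Type 3 is Dave's "all contexts, no globals" measurement and type 1 is the "single context" case compared against the CR3 write above; the real flush would hand this descriptor to the invpcid instruction.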

--Andy

Dave Hansen

Jan 13, 2016, 6:56:29 PM1/13/16
to Andy Lutomirski, Linus Torvalds, Oleg Nesterov, X86 ML, Borislav Petkov, Linux Kernel Mailing List, linu...@kvack.org, Brian Gerst
On 01/13/2016 03:51 PM, Andy Lutomirski wrote:
> On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen
> <dave....@linux.intel.com> wrote:
>> On 01/13/2016 03:35 PM, Andy Lutomirski wrote:
>>> Can anyone here ask a hardware or microcode person what's going on
>>> with CR3 writes possibly being faster than INVPCID? Is there some
>>> trick to it?
>>
>> I just went and measured it myself this morning. "INVPCID Type 3" (all
>> contexts no global) on a Skylake system was 15% slower than a CR3 write.
>>
>> Is that in the same ballpark from what you've observed?
>
> It's similar, except that I was comparing "INVPCID Type 1" (single
> context no globals) to a CR3 write.

Ahh, because you're using PCID... That one I saw as being ~1.85x the
number of cycles that a CR3 write was.

> Type 2, at least, is dramatically faster than the pair of CR4 writes
> it replaces.

Yeah, I saw the same thing. Type 2 was ~2.4x faster than the CR4 writes.

Andy Lutomirski

Jan 13, 2016, 7:35:05 PM1/13/16
to Dave Hansen, Linus Torvalds, Oleg Nesterov, X86 ML, Borislav Petkov, Linux Kernel Mailing List, linu...@kvack.org, Brian Gerst
On Wed, Jan 13, 2016 at 3:56 PM, Dave Hansen
<dave....@linux.intel.com> wrote:
> On 01/13/2016 03:51 PM, Andy Lutomirski wrote:
>> On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen
>> <dave....@linux.intel.com> wrote:
>>> On 01/13/2016 03:35 PM, Andy Lutomirski wrote:
>>>> Can anyone here ask a hardware or microcode person what's going on
>>>> with CR3 writes possibly being faster than INVPCID? Is there some
>>>> trick to it?
>>>
>>> I just went and measured it myself this morning. "INVPCID Type 3" (all
>>> contexts no global) on a Skylake system was 15% slower than a CR3 write.
>>>
>>> Is that in the same ballpark from what you've observed?
>>
>> It's similar, except that I was comparing "INVPCID Type 1" (single
>> context no globals) to a CR3 write.
>
> Ahh, because you're using PCID... That one I saw as being ~1.85x the
> number of cycles that a CR3 write was.
>

I think that settles it, then:

if (static_cpu_has_safe(X86_FEATURE_PCID)) {
	raw_local_irq_save(flags);
	native_write_cr3(native_read_cr3());
	raw_local_irq_restore(flags);
} else {
	native_write_cr3(native_read_cr3());
}

I don't think it's worth hacking more complexity into switch_mm to
make that annoyance go away.
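The settled dispatch above can be sketched as a self-contained toy, with stub "flush" primitives standing in for the real CR3 accessors so the path selection is visible; nothing here is the kernel's actual code:

```c
/*
 * Stand-in sketch of the settled flush dispatch: on PCID-capable
 * hardware the CR3 rewrite must not race with preemption or a
 * leave_mm IPI (the PCID could change under us), so interrupts are
 * disabled around it; without PCID the plain rewrite is safe.  The
 * feature flag and counters below are stubs for illustration.
 */
#include <stdbool.h>

static int cr3_writes_irqs_off, cr3_writes_plain;

static void write_cr3_stub(bool irqs_off)
{
	if (irqs_off)
		cr3_writes_irqs_off++;	/* CR3 rewritten with IRQs masked */
	else
		cr3_writes_plain++;	/* plain CR3 rewrite */
}

static void flush_tlb_current(bool cpu_has_pcid)
{
	if (cpu_has_pcid) {
		/* raw_local_irq_save(flags); */
		write_cr3_stub(true);
		/* raw_local_irq_restore(flags); */
	} else {
		write_cr3_stub(false);
	}
}
```

This keeps the IRQ-masking cost confined to PCID hardware, which was Linus's objection to the unconditional flag-saving version.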

--Andy

Andy Lutomirski

Jan 18, 2016, 5:24:36 PM1/18/16
to Andrey Ryabinin, Borislav Petkov, Andy Lutomirski, X86 ML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov, linu...@kvack.org, linux-...@vger.kernel.org
On Mon, Jan 11, 2016 at 4:51 AM, Andrey Ryabinin
<arya...@virtuozzo.com> wrote:
> Currently we clear kasan_zero_page before __flush_tlb_all(). This
> works with the current implementation of native_flush_tlb[_global]()
> because it doesn't do any writes to kasan shadow memory.
> But any subtle change made in native_flush_tlb*() could break this.
> Also, the current code doesn't seem to work for paravirt guests (lguest).
>
> Only after the TLB flush can we be sure that kasan_zero_page is no
> longer used as early shadow (instrumented code will not write to it).
> So it should be cleared only after the TLB flush.

This seems to fix the issue with my patch set. Thanks.

Tested-by: Andy Lutomirski <lu...@kernel.org>

--Andy

Andy Lutomirski

Jan 18, 2016, 5:25:01 PM1/18/16
to Andrey Ryabinin, Borislav Petkov, Andy Lutomirski, X86 ML, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov, linu...@kvack.org, linux-...@vger.kernel.org
On Mon, Jan 11, 2016 at 4:51 AM, Andrey Ryabinin
<arya...@virtuozzo.com> wrote:
> After kasan_init() executed, no one is allowed to write to kasan_zero_page,
> so write protect it.

This seems to work for me.

--Andy

Borislav Petkov

Jan 29, 2016, 5:35:56 AM1/29/16
to Andrey Ryabinin, Andy Lutomirski, x...@kernel.org, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov, linu...@kvack.org, linux-...@vger.kernel.org
On Mon, Jan 11, 2016 at 03:51:17PM +0300, Andrey Ryabinin wrote:
> I don't think that this patch is the right way to solve the problem.
> The follow-up patch "x86/kasan: clear kasan_zero_page after TLB flush"
> should fix Andy's problem.

Both applied, thanks.

tip-bot for Andrey Ryabinin

Feb 9, 2016, 11:08:21 AM2/9/16
to linux-ti...@vger.kernel.org, ol...@redhat.com, h...@zytor.com, pet...@infradead.org, lu...@kernel.org, tg...@linutronix.de, dave....@linux.intel.com, dvla...@redhat.com, b...@alien8.de, mcg...@suse.com, brg...@gmail.com, b...@suse.de, mi...@kernel.org, ak...@linux-foundation.org, linux-...@vger.kernel.org, lu...@amacapital.net, toshi...@hp.com, arya...@virtuozzo.com, torv...@linux-foundation.org
Commit-ID: 69e0210fd01ff157d332102219aaf5c26ca8069b
Gitweb: http://git.kernel.org/tip/69e0210fd01ff157d332102219aaf5c26ca8069b
Author: Andrey Ryabinin <arya...@virtuozzo.com>
AuthorDate: Mon, 11 Jan 2016 15:51:18 +0300
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Tue, 9 Feb 2016 13:33:14 +0100

x86/kasan: Clear kasan_zero_page after TLB flush

Currently we clear kasan_zero_page before __flush_tlb_all(). This
works with the current implementation of native_flush_tlb[_global]()
because it doesn't do any writes to kasan shadow memory.
But any subtle change made in native_flush_tlb*() could break this.
Also, the current code doesn't seem to work for paravirt guests (lguest).

Only after the TLB flush can we be sure that kasan_zero_page is no
longer used as early shadow (instrumented code will not write to it).
So it should be cleared only after the TLB flush.

Signed-off-by: Andrey Ryabinin <arya...@virtuozzo.com>
Reviewed-by: Borislav Petkov <b...@suse.de>
Cc: Andrew Morton <ak...@linux-foundation.org>
Cc: Andy Lutomirski <lu...@amacapital.net>
Cc: Andy Lutomirski <lu...@kernel.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Brian Gerst <brg...@gmail.com>
Cc: Dave Hansen <dave....@linux.intel.com>
Cc: Denys Vlasenko <dvla...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Luis R. Rodriguez <mcg...@suse.com>
Cc: Oleg Nesterov <ol...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Cc: Toshi Kani <toshi...@hp.com>
Cc: linu...@kvack.org
Link: http://lkml.kernel.org/r/1452516679-32040-2-gi...@virtuozzo.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>

tip-bot for Andrey Ryabinin

Feb 9, 2016, 11:15:52 AM2/9/16
to linux-ti...@vger.kernel.org, b...@suse.de, torv...@linux-foundation.org, toshi...@hp.com, dave....@linux.intel.com, b...@alien8.de, mcg...@suse.com, dvla...@redhat.com, ak...@linux-foundation.org, linux-...@vger.kernel.org, lu...@kernel.org, arya...@virtuozzo.com, ol...@redhat.com, pet...@infradead.org, h...@zytor.com, brg...@gmail.com, lu...@amacapital.net, mi...@kernel.org, tg...@linutronix.de
Commit-ID: 063fb3e56f6dd29b2633b678b837e1d904200e6f
Gitweb: http://git.kernel.org/tip/063fb3e56f6dd29b2633b678b837e1d904200e6f
Author: Andrey Ryabinin <arya...@virtuozzo.com>
AuthorDate: Mon, 11 Jan 2016 15:51:19 +0300
Committer: Ingo Molnar <mi...@kernel.org>
CommitDate: Tue, 9 Feb 2016 13:33:14 +0100

x86/kasan: Write protect kasan zero shadow

After kasan_init() executed, no one is allowed to write to kasan_zero_page,
so write protect it.

Signed-off-by: Andrey Ryabinin <arya...@virtuozzo.com>
Reviewed-by: Borislav Petkov <b...@suse.de>
Cc: Andrew Morton <ak...@linux-foundation.org>
Cc: Andy Lutomirski <lu...@amacapital.net>
Cc: Andy Lutomirski <lu...@kernel.org>
Cc: Borislav Petkov <b...@alien8.de>
Cc: Brian Gerst <brg...@gmail.com>
Cc: Dave Hansen <dave....@linux.intel.com>
Cc: Denys Vlasenko <dvla...@redhat.com>
Cc: H. Peter Anvin <h...@zytor.com>
Cc: Linus Torvalds <torv...@linux-foundation.org>
Cc: Luis R. Rodriguez <mcg...@suse.com>
Cc: Oleg Nesterov <ol...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Thomas Gleixner <tg...@linutronix.de>
Cc: Toshi Kani <toshi...@hp.com>
Cc: linu...@kvack.org
Link: http://lkml.kernel.org/r/1452516679-32040-3-gi...@virtuozzo.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>

Nadav Amit

Jun 3, 2016, 1:43:04 PM
to Andy Lutomirski, x...@kernel.org, LKML, Borislav Petkov, Brian Gerst, Dave Hansen, Oleg Nesterov, linu...@kvack.org
Following this patch, if (current->active_mm != mm), flush_tlb_page() still
doesn’t call smp_mb() before checking mm_cpumask(mm).

In contrast, flush_tlb_mm_range() does call smp_mb().

Is there a reason for this discrepancy?

Thanks,
Nadav

Andy Lutomirski <lu...@kernel.org> wrote:

> When switch_mm activates a new pgd, it also sets a bit that tells
> other CPUs that the pgd is in use so that tlb flush IPIs will be
> sent. In order for that to work correctly, the bit needs to be
> visible prior to loading the pgd and therefore starting to fill the
> local TLB.
>
> Document all the barriers that make this work correctly and add a
> couple that were missing.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Andy Lutomirski <lu...@kernel.org>
> ---
> arch/x86/include/asm/mmu_context.h | 33 ++++++++++++++++++++++++++++++++-
> arch/x86/mm/tlb.c | 29 ++++++++++++++++++++++++++---
> 2 files changed, 58 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index 379cd3658799..1edc9cd198b8 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -116,8 +116,34 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> #endif
> cpumask_set_cpu(cpu, mm_cpumask(next));
>
> - /* Re-load page tables */
> + /*
> + * Re-load page tables.
> + *
> + * This logic has an ordering constraint:
> + *
> + * CPU 0: Write to a PTE for 'next'
> + * CPU 0: load bit 1 in mm_cpumask. if nonzero, send IPI.
> + * CPU 1: set bit 1 in next's mm_cpumask
> + * CPU 1: load from the PTE that CPU 0 writes (implicit)
> + *
> + * We need to prevent an outcome in which CPU 1 observes
> + * the new PTE value and CPU 0 observes bit 1 clear in
> + * mm_cpumask. (If that occurs, then the IPI will never
> + * be sent, and CPU 0's TLB will contain a stale entry.)
> + *
> + * The bad outcome can occur if either CPU's load is
> + * reordered before that CPU's store, so both CPUs must
> + * execute full barriers to prevent this from happening.
> + *
> + * Thus, switch_mm needs a full barrier between the
> + * store to mm_cpumask and any operation that could load
> + * from next->pgd. This barrier synchronizes with
> + * remote TLB flushers. Fortunately, load_cr3 is
> + * serializing and thus acts as a full barrier.
> + *
> + */
> load_cr3(next->pgd);
> +
> trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
>
> /* Stop flush ipis for the previous mm */
> @@ -156,10 +182,15 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> * schedule, protecting us from simultaneous changes.
> */
> cpumask_set_cpu(cpu, mm_cpumask(next));
> +
> /*
> * We were in lazy tlb mode and leave_mm disabled
> * tlb flush IPI delivery. We must reload CR3
> * to make sure to use no freed page tables.
> + *
> + * As above, this is a barrier that forces
> + * TLB repopulation to be ordered after the
> + * store to mm_cpumask.
> */
> load_cr3(next->pgd);
> trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 8ddb5d0d66fb..8f4cc3dfac32 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -161,7 +161,10 @@ void flush_tlb_current_task(void)
> preempt_disable();
>
> count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> +
> + /* This is an implicit full barrier that synchronizes with switch_mm. */
> local_flush_tlb();
> +
> trace_tlb_flush(TLB_LOCAL_SHOOTDOWN, TLB_FLUSH_ALL);
> if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
> flush_tlb_others(mm_cpumask(mm), mm, 0UL, TLB_FLUSH_ALL);
> @@ -188,17 +191,29 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> unsigned long base_pages_to_flush = TLB_FLUSH_ALL;
>
> preempt_disable();
> - if (current->active_mm != mm)
> + if (current->active_mm != mm) {
> + /* Synchronize with switch_mm. */
> + smp_mb();
> +
> goto out;
> + }
>
> if (!current->mm) {
> leave_mm(smp_processor_id());
> +
> + /* Synchronize with switch_mm. */
> + smp_mb();
> +
> goto out;
> }
>
> if ((end != TLB_FLUSH_ALL) && !(vmflag & VM_HUGETLB))
> base_pages_to_flush = (end - start) >> PAGE_SHIFT;
>
> + /*
> + * Both branches below are implicit full barriers (MOV to CR or
> + * INVLPG) that synchronize with switch_mm.
> + */
> if (base_pages_to_flush > tlb_single_page_flush_ceiling) {
> base_pages_to_flush = TLB_FLUSH_ALL;
> count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ALL);
> @@ -228,10 +243,18 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long start)
> preempt_disable();
>
> if (current->active_mm == mm) {
> - if (current->mm)
> + if (current->mm) {
> + /*
> + * Implicit full barrier (INVLPG) that synchronizes
> + * with switch_mm.
> + */
> __flush_tlb_one(start);
> - else
> + } else {
> leave_mm(smp_processor_id());
> +
> + /* Synchronize with switch_mm. */
> + smp_mb();
> + }
> }
>
> if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)
> --
> 2.5.0
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majo...@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: em...@kvack.org


Andy Lutomirski

Jun 9, 2016, 1:24:50 PM
to Nadav Amit, Andy Lutomirski, X86 ML, LKML, Borislav Petkov, Brian Gerst, Dave Hansen, Oleg Nesterov, linu...@kvack.org
On Fri, Jun 3, 2016 at 10:42 AM, Nadav Amit <nadav...@gmail.com> wrote:
> Following this patch, if (current->active_mm != mm), flush_tlb_page() still
> doesn’t call smp_mb() before checking mm_cpumask(mm).
>
> In contrast, flush_tlb_mm_range() does call smp_mb().
>
> Is there a reason for this discrepancy?

Not that I can remember. Is the remote flush case likely to be racy?

--Andy

Nadav Amit

Jun 9, 2016, 3:46:06 PM
to Andy Lutomirski, Andy Lutomirski, X86 ML, LKML, Borislav Petkov, Brian Gerst, Dave Hansen, Oleg Nesterov, linu...@kvack.org
You replied separately on another email that included a patch to fix
this case. It turns out smp_mb is not needed on flush_tlb_page, since
the PTE is always updated using an atomic operation. Yet, a compiler
barrier is still needed, so I added smp_mb__after_atomic instead.

Nadav

Wanpeng Li

Sep 5, 2016, 9:22:41 PM
to Andy Lutomirski, the arch/x86 maintainers, linux-...@vger.kernel.org, Borislav Petkov, Brian Gerst, Dave Hansen, Linus Torvalds, Oleg Nesterov, linu...@kvack.org, sta...@vger.kernel.org
Hi Andy,
I don't follow this comment. CPU 0 writes to a PTE for 'next' and
observes bit 1 clear in mm_cpumask, so CPU 0 won't send an IPI to
CPU 1. Why is it CPU 0's TLB that ends up with a stale entry, rather
than CPU 1's?

Regards,
Wanpeng Li