[RFC 09/13] x86/mm: Disable interrupts when flushing the TLB using CR3

Andy Lutomirski

unread,

Jan 8, 2016, 6:20:07 PM1/8/16

to

Signed-off-by: Andy Lutomirski <lu...@kernel.org>
---
arch/x86/include/asm/tlbflush.h | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3d905f12cda9..32e3d8769a22 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -135,7 +135,17 @@ static inline void cr4_set_bits_and_update_boot(unsigned long mask)

static inline void __native_flush_tlb(void)
{
+ unsigned long flags;
+
+ /*
+ * We mustn't be preempted or handle an IPI while reading and
+ * writing CR3. Preemption could switch mms and switch back, and
+ * an IPI could call leave_mm. Either of those could cause our
+ * PCID to change asynchronously.
+ */
+ raw_local_irq_save(flags);
native_write_cr3(native_read_cr3());
+ raw_local_irq_restore(flags);
}

static inline void __native_flush_tlb_global_irq_disabled(void)
--
2.5.0

Linus Torvalds

unread,

Jan 8, 2016, 6:50:06 PM1/8/16

to

On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski <lu...@kernel.org> wrote:
> + /*
> + * We mustn't be preempted or handle an IPI while reading and
> + * writing CR3. Preemption could switch mms and switch back, and
> + * an IPI could call leave_mm. Either of those could cause our
> + * PCID to change asynchronously.
> + */
> + raw_local_irq_save(flags);
> native_write_cr3(native_read_cr3());
> + raw_local_irq_restore(flags);

This seems sad for two reasons:

- it adds unnecessary overhead on non-pcid setups (32-bit being an
example of that)

- on pcid setups, wouldn't invpcid_flush_single_context() be better?

So on the whole I hate it.

Why isn't this something like

if (static_cpu_has_safe(X86_FEATURE_INVPCID)) {
invpcid_flush_single_context();
return;
}
native_write_cr3(native_read_cr3());

*without* any flag saving crud?

And yes, that means that we'd require X86_FEATURE_INVPCID in order to
use X86_FEATURE_PCID, but that seems fine.

Or is there some reason you wanted the odd flags version? If so, that
should be documented.

Linus

Andy Lutomirski

unread,

Jan 8, 2016, 7:20:06 PM1/8/16

to

On Jan 8, 2016 3:41 PM, "Linus Torvalds" <torv...@linux-foundation.org> wrote:
>
> On Fri, Jan 8, 2016 at 3:15 PM, Andy Lutomirski <lu...@kernel.org> wrote:
> > + /*
> > + * We mustn't be preempted or handle an IPI while reading and
> > + * writing CR3. Preemption could switch mms and switch back, and
> > + * an IPI could call leave_mm. Either of those could cause our
> > + * PCID to change asynchronously.
> > + */
> > + raw_local_irq_save(flags);
> > native_write_cr3(native_read_cr3());
> > + raw_local_irq_restore(flags);
>
> This seems sad for two reasons:
>
> - it adds unnecessary overhead on non-pcid setups (32-bit being an
> example of that)

I can certainly skip the flag saving on !PCID.

>
> - on pcid setups, wouldn't invpcid_flush_single_context() be better?
>

I played with that and it was slower. I don't pretend that makes any sense.

> So on the whole I hate it.
>
> Why isn't this something like
>
> if (static_cpu_has_safe(X86_FEATURE_INVPCID)) {
> invpcid_flush_single_context();
> return;
> }
> native_write_cr3(native_read_cr3());
>
> *without* any flag saving crud?
>
> And yes, that means that we'd require X86_FEATURE_INVPCID in order to
> use X86_FEATURE_PCID, but that seems fine.

I have an SNB "Extreme" with PCID but not INVPCID, and there could be
a whole generation of servers like that. I think we should fully
support them.

We might be able to get away with just disabling preemption instead of
IRQs, at least if mm == active_mm.

>
> Or is there some reason you wanted the odd flags version? If so, that
> should be documented.

What do you mean "odd"?

--Andy

Linus Torvalds

unread,

Jan 8, 2016, 9:30:06 PM1/8/16

to

On Fri, Jan 8, 2016 at 4:18 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>>
>> - on pcid setups, wouldn't invpcid_flush_single_context() be better?
>
> I played with that and it was slower. I don't pretend that makes any sense.

Ugh. I guess reading and writing cr3 has been optimized.

>> And yes, that means that we'd require X86_FEATURE_INVPCID in order to
>> use X86_FEATURE_PCID, but that seems fine.
>
> I have an SNB "Extreme" with PCID but not INVPCID, and there could be
> a whole generation of servers like that. I think we should fully
> support them.

Can you check the timings? IOW, is it a win on SNB?

I think originally Intel only had two actual bits of process context
ID in the TLB, and it was meant to be used for virtualization or
something. Together with the hashing (to make it always appear as 12
bits to software - a nice idea but also means that the hardware ends
up invalidating more than software really expects), it may not work
all that well.

That _could_ explain why the original patch from intel didn't work.

> We might be able to get away with just disabling preemption instead of
> IRQs, at least if mm == active_mm.

I'm not convinced it is all that much faster. Of course, it's nicer on
non-preempt, but nobody seems to run things that way.

>> Or is there some reason you wanted the odd flags version? If so, that
>> should be documented.
>
> What do you mean "odd"?

It's odd because it makes no sense for non-pcid (christ, I wish Intel
had just called it "asid" instead, "pcid" always makes me react to
"pci"), and I think it would make more sense to pair up the pcid case
with the invpcid rather than have those preemption rules here.

Linus

Ingo Molnar

unread,

Jan 11, 2016, 6:00:08 AM1/11/16

to

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> >> Or is there some reason you wanted the odd flags version? If so, that
> >> should be documented.
> >
> > What do you mean "odd"?
>
> It's odd because it makes no sense for non-pcid (christ, I wish Intel had just
> called it "asid" instead, "pcid" always makes me react to "pci"), and I think it
> would make more sense to pair up the pcid case with the invpcid rather than have
> those preemption rules here.

The naming is really painful, so a trivial suggestion: could we just name all the
Linux side bits 'asid' or 'ctx_id' (even in x86 arch code) and only use 'PCID'
nomenclature in the very lowest level code?

I.e. rename pcid_live_cpus et al and most functions to the asid or ctx_id or asid
naming scheme or so. That would hide most of the naming ugliness.

Thanks,

Ingo

Andy Lutomirski

unread,

Jan 13, 2016, 6:40:06 PM1/13/16

to

On Mon, Jan 11, 2016 at 2:51 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> * Linus Torvalds <torv...@linux-foundation.org> wrote:
>
>> >> Or is there some reason you wanted the odd flags version? If so, that
>> >> should be documented.
>> >
>> > What do you mean "odd"?
>>
>> It's odd because it makes no sense for non-pcid (christ, I wish Intel had just
>> called it "asid" instead, "pcid" always makes me react to "pci"), and I think it
>> would make more sense to pair up the pcid case with the invpcid rather than have
>> those preemption rules here.
>
> The naming is really painful, so a trivial suggestion: could we just name all the
> Linux side bits 'asid' or 'ctx_id' (even in x86 arch code) and only use 'PCID'
> nomenclature in the very lowest level code?

I'd be okay with "pctx_id" or "pctxid" for this, I think. I'd like to
at least make it somewhat obvious how it maps back to hardware.

FWIW, I'd guess that Intel deviated from convention because their
actual address space id is (vpid, pcid), and calling it (vpid, asid)
might have been slightly confusing. Or not.

--Andy

Andy Lutomirski

unread,

Jan 13, 2016, 6:40:07 PM1/13/16

to

On Fri, Jan 8, 2016 at 6:20 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Fri, Jan 8, 2016 at 4:18 PM, Andy Lutomirski <lu...@amacapital.net> wrote:
>>>
>>> - on pcid setups, wouldn't invpcid_flush_single_context() be better?
>>
>> I played with that and it was slower. I don't pretend that makes any sense.
>
> Ugh. I guess reading and writing cr3 has been optimized.
>
>>> And yes, that means that we'd require X86_FEATURE_INVPCID in order to
>>> use X86_FEATURE_PCID, but that seems fine.
>>
>> I have an SNB "Extreme" with PCID but not INVPCID, and there could be
>> a whole generation of servers like that. I think we should fully
>> support them.
>
> Can you check the timings? IOW, is it a win on SNB?

~80ns gain on SNB. It's actually quite impressive on SNB: it knocks
the penalty for mm switches down to 20ns or so, which I find to be
fairly amazing. (This is at 3.8GHz or thereabouts.)

>
> I think originally Intel only had two actual bits of process context
> ID in the TLB, and it was meant to be used for virtualization or
> something. Together with the hashing (to make it always appear as 12
> bits to software - a nice idea but also means that the hardware ends
> up invalidating more than software really expects), it may not work
> all that well.
>
> That _could_ explain why the original patch from intel didn't work.
>
>> We might be able to get away with just disabling preemption instead of
>> IRQs, at least if mm == active_mm.
>
> I'm not convinced it is all that much faster. Of course, it's nicer on
> non-preempt, but nobody seems to run things that way.

My current testing version has three different code paths now. If
INVPCID and PCID are both available, then it uses INVPCID. If PCID is
available but INVPCID is not, it does raw_local_irqsave. If PCID is
not available, it just does the CR3 read/write.

Yeah, it's ugly, and it's a big blob of code to do something trivial,
but it seems to work and it should be the right thing to do in most
cases.

Can anyone here ask a hardware or microcode person what's going on
with CR3 writes possibly being faster than INVPCID? Is there some
trick to it?

--Andy

Dave Hansen

unread,

Jan 13, 2016, 6:50:08 PM1/13/16

to

On 01/13/2016 03:35 PM, Andy Lutomirski wrote:
> Can anyone here ask a hardware or microcode person what's going on
> with CR3 writes possibly being faster than INVPCID? Is there some
> trick to it?

I just went and measured it myself this morning. "INVPCID Type 3" (all
contexts no global) on a Skylake system was 15% slower than a CR3 write.

Is that in the same ballpark from what you've observed?

Andy Lutomirski

unread,

Jan 13, 2016, 7:00:06 PM1/13/16

to

It's similar, except that I was comparing "INVPCID Type 1" (single
context no globals) to a CR3 write.

Type 2, at least, is dramatically faster than the pair of CR4 writes
it replaces.

--Andy

Dave Hansen

unread,

Jan 13, 2016, 7:00:07 PM1/13/16

to

On 01/13/2016 03:51 PM, Andy Lutomirski wrote:
> On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen
> <dave....@linux.intel.com> wrote:
>> On 01/13/2016 03:35 PM, Andy Lutomirski wrote:
>>> Can anyone here ask a hardware or microcode person what's going on
>>> with CR3 writes possibly being faster than INVPCID? Is there some
>>> trick to it?
>>
>> I just went and measured it myself this morning. "INVPCID Type 3" (all
>> contexts no global) on a Skylake system was 15% slower than a CR3 write.
>>
>> Is that in the same ballpark from what you've observed?
>
> It's similar, except that I was comparing "INVPCID Type 1" (single
> context no globals) to a CR3 write.

Ahh, because you're using PCID... That one I saw as being ~1.85x the
number of cycles that a CR3 write was.

> Type 2, at least, is dramatically faster than the pair of CR4 writes
> it replaces.

Yeah, I saw the same thing. Type 2 was ~2.4x faster than the CR4 writes.

Andy Lutomirski

unread,

Jan 13, 2016, 7:40:07 PM1/13/16

to

On Wed, Jan 13, 2016 at 3:56 PM, Dave Hansen

<dave....@linux.intel.com> wrote:
> On 01/13/2016 03:51 PM, Andy Lutomirski wrote:
>> On Wed, Jan 13, 2016 at 3:43 PM, Dave Hansen
>> <dave....@linux.intel.com> wrote:
>>> On 01/13/2016 03:35 PM, Andy Lutomirski wrote:
>>>> Can anyone here ask a hardware or microcode person what's going on
>>>> with CR3 writes possibly being faster than INVPCID? Is there some
>>>> trick to it?
>>>
>>> I just went and measured it myself this morning. "INVPCID Type 3" (all
>>> contexts no global) on a Skylake system was 15% slower than a CR3 write.
>>>
>>> Is that in the same ballpark from what you've observed?
>>
>> It's similar, except that I was comparing "INVPCID Type 1" (single
>> context no globals) to a CR3 write.
>
> Ahh, because you're using PCID... That one I saw as being ~1.85x the
> number of cycles that a CR3 write was.
>

I think that settles it, then:

if (static_cpu_has_safe(X86_FEATURE_PCID)) {
raw_local_irqsave();
native_write_cr3(native_read_cr3());
raw_local_irqrestore();
} else {
native_write_cr3(native_read_cr3());
}

I don't think it's worth hacking more complexity into switch_mm to
make that annoyance go away.

--Andy