
[GIT PULL] x86/mm changes for v4.14: PCID support, 5-level paging support, Secure Memory Encryption support

Ingo Molnar

Sep 4, 2017, 5:40:07 AM
Linus,

Please pull the latest x86-mm-for-linus git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-mm-for-linus

# HEAD: 9e52fc2b50de3a1c08b44f94c610fbe998c0031a x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)

[ NOTE: this tree depends on you having merged x86-boot-for-linus successfully.
If that tree could not be merged for whatever reason then please disregard this
pull request. ]

The main changes in this cycle are support for three new, complex hardware
features of x86 CPUs:

- Add 5-level paging support, which is a new hardware feature on upcoming Intel
CPUs allowing up to 128 PB of virtual address space and 4 PB of physical RAM
space - a 256-fold increase over the old limits. (Supercomputers of the future
forecasting hurricanes on an ever warming planet can certainly make good
use of more RAM.)

Many of the necessary changes went upstream in previous cycles; v4.14 is the
first kernel that can enable 5-level paging.

This feature is activated via CONFIG_X86_5LEVEL=y - disabled by default.

(By Kirill A. Shutemov)

- Add 'encrypted memory' support, which is a new hardware feature on upcoming AMD
CPUs ('Secure Memory Encryption', SME) allowing system RAM to be encrypted and
decrypted (mostly) transparently by the CPU, with a little help from the kernel
to transition to/from encrypted RAM. Such RAM should be more secure against
various attacks like RAM access via the memory bus and should make the radio
signature of memory bus traffic harder to intercept (and decrypt) as well.

This feature is activated via CONFIG_AMD_MEM_ENCRYPT=y - disabled by default.

(By Tom Lendacky)

- Enable PCID optimized TLB flushing on newer Intel CPUs: PCID is a hardware
feature that attaches an address space tag to TLB entries and thus allows to
skip TLB flushing in many cases, even if we switch mm's.

(By Andy Lutomirski)

All three of these features were in the works for a long time, and it's a
coincidence of the three independent development paths that they are all
enabled in v4.14 at once.

out-of-topic modifications in x86-mm-for-linus:
-------------------------------------------------
arch/ia64/include/asm/acpi.h # 43858b4f25cf: x86/mm: Stop calling leave_m
arch/ia64/kernel/efi.c # f99afd08a45f: efi: Update efi_mem_type() t
drivers/acpi/processor_idle.c # 43858b4f25cf: x86/mm: Stop calling leave_m
drivers/firmware/dmi-sysfs.c # f7750a795687: x86, mpparse, x86/acpi, x86/
drivers/firmware/efi/efi.c # a19d66c56af1: efi: Add an EFI table addres
drivers/firmware/pcdp.c # f7750a795687: x86, mpparse, x86/acpi, x86/
drivers/gpu/drm/drm_gem.c # 95cf9264d5f3: x86, drm, fbdev: Do not spec
drivers/gpu/drm/drm_vm.c # 95cf9264d5f3: x86, drm, fbdev: Do not spec
drivers/gpu/drm/ttm/ttm_bo_vm.c # 95cf9264d5f3: x86, drm, fbdev: Do not spec
drivers/gpu/drm/udl/udl_fb.c # 95cf9264d5f3: x86, drm, fbdev: Do not spec
drivers/idle/intel_idle.c # 43858b4f25cf: x86/mm: Stop calling leave_m
drivers/iommu/amd_iommu.c # 2543a786aa25: iommu/amd: Allow the AMD IOM
drivers/iommu/amd_iommu_init.c # 2543a786aa25: iommu/amd: Allow the AMD IOM
drivers/iommu/amd_iommu_proto.h # 2543a786aa25: iommu/amd: Allow the AMD IOM
drivers/iommu/amd_iommu_types.h # 2543a786aa25: iommu/amd: Allow the AMD IOM
drivers/sfi/sfi_core.c # 693bf0aa01b7: x86/boot: Fix memremap() rel
# f7750a795687: x86, mpparse, x86/acpi, x86/
drivers/video/fbdev/core/fbmem.c # 95cf9264d5f3: x86, drm, fbdev: Do not spec
include/asm-generic/early_ioremap.h# f88a68facd9a: x86/mm: Extend early_memrema
include/asm-generic/pgtable.h # 21729f81ce8a: x86/mm: Provide general kern
include/linux/compiler.h # 7375ae3a0b79: compiler-gcc.h: Introduce __
include/linux/dma-mapping.h # 648babb7078c: swiotlb: Add warnings for us
include/linux/io.h # 8f716c9b5feb: x86/mm: Add support to acces
include/linux/kexec.h # bba4ed011a52: x86/mm, kexec: Allow kexec t
include/linux/mem_encrypt.h # 21729f81ce8a: x86/mm: Provide general kern
# 5868f3651fa0: x86/mm: Add support to enabl
# 7744ccdbc16f: x86/mm: Add Secure Memory En
include/linux/mm_inline.h # ce0fa3e56ad2: x86/mm, mm/hwpoison: Clear P
include/linux/swiotlb.h # c7753208a94c: x86, swiotlb: Add memory enc
kernel/kexec_core.c # bba4ed011a52: x86/mm, kexec: Allow kexec t
kernel/memremap.c # 8f716c9b5feb: x86/mm: Add support to acces
lib/swiotlb.c # 648babb7078c: swiotlb: Add warnings for us
# c7753208a94c: x86, swiotlb: Add memory enc
mm/early_ioremap.c # 8f716c9b5feb: x86/mm: Add support to acces
# f88a68facd9a: x86/mm: Extend early_memrema
mm/memory-failure.c # ce0fa3e56ad2: x86/mm, mm/hwpoison: Clear P

Thanks,

Ingo

------------------>
Andrey Ryabinin (1):
x86/mm/dump_pagetables: Speed up page tables dump for CONFIG_KASAN=y

Andy Lutomirski (8):
x86/mm: Give each mm TLB flush generation a unique ID
x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
x86/mm: Rework lazy TLB mode and TLB freshness tracking
x86/mm: Stop calling leave_mm() in idle code
x86/mm: Disable PCID on 32-bit kernels
x86/mm: Add the 'nopcid' boot option to turn off PCID
x86/mm: Enable CR4.PCIDE on supported systems
x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID

Baoquan He (3):
x86/boot/KASLR: Wrap e820 entries walking code into new function process_e820_entries()
x86/boot/KASLR: Switch to pass struct mem_vector to process_e820_entry()
x86/boot/KASLR: Rename process_e820_entry() into process_mem_region()

Borislav Petkov (2):
x86/CPU: Align CR3 defines
x86/mm: Fix SME encryption stack ptr handling

Brijesh Singh (1):
kvm/x86: Avoid clearing the C-bit in rsvd_bits()

Ingo Molnar (1):
x86/boot: Fix memremap() related build failure

Jan Beulich (1):
x86/mm: Use pr_cont() in dump_pagetable()

Kirill A. Shutemov (8):
x86/mm/dump_pagetables: Generalize address normalization
x86/mm/dump_pagetables: Fix printout of p4d level
x86/xen: Redefine XEN_ELFNOTE_INIT_P2M using PUD_SIZE * PTRS_PER_PUD
x86/mm: Rename tasksize_32bit/64bit to task_size_32bit/64bit()
x86/mpx: Do not allow MPX if we have mappings above 47-bit
x86/mm: Prepare to expose larger address space to userspace
x86/mm: Allow userspace have mappings above 47-bit
x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y

Tom Lendacky (40):
x86/cpu/AMD: Document AMD Secure Memory Encryption (SME)
x86/mm/pat: Set write-protect cache mode for full PAT support
x86, mpparse, x86/acpi, x86/PCI, x86/dmi, SFI: Use memremap() for RAM mappings
x86/cpu/AMD: Add the Secure Memory Encryption CPU feature
x86/cpu/AMD: Handle SME reduction in physical address size
x86/mm: Add Secure Memory Encryption (SME) support
x86/mm: Remove phys_to_virt() usage in ioremap()
x86/mm: Add support to enable SME in early boot processing
x86/mm: Simplify p[g4um]d_page() macros
x86/mm: Provide general kernel support for memory encryption
x86/mm: Add SME support for read_cr3_pa()
x86/mm: Extend early_memremap() support with additional attrs
x86/mm: Add support for early encryption/decryption of memory
x86/mm: Insure that boot memory areas are mapped properly
x86/boot/e820: Add support to determine the E820 type of an address
efi: Add an EFI table address match function
efi: Update efi_mem_type() to return an error rather than 0
x86/efi: Update EFI pagetable creation to work with SME
x86/mm: Add support to access boot related data in the clear
x86/boot: Use memremap() to map the MPF and MPC data
x86/mm: Add support to access persistent memory in the clear
x86/mm: Add support for changing the memory encryption attribute
x86/realmode: Decrypt trampoline area if memory encryption is active
x86, swiotlb: Add memory encryption support
swiotlb: Add warnings for use of bounce buffers with SME
x86/cpu/AMD: Make the microcode level available earlier in the boot
iommu/amd: Allow the AMD IOMMU to work with memory encryption
x86/boot/realmode: Check for memory encryption on the APs
x86, drm, fbdev: Do not specify encrypted memory for video mappings
kvm/x86/svm: Support Secure Memory Encryption within KVM
x86/mm, kexec: Allow kexec to be used with SME
xen/x86: Remove SME feature in PV guests
x86/mm: Use proper encryption attributes with /dev/mem
x86/mm: Create native_make_p4d() for PGTABLE_LEVELS <= 4
x86/mm: Add support to encrypt the kernel in-place
x86/boot: Add early cmdline parsing for options with arguments
compiler-gcc.h: Introduce __nostackprotector function attribute
x86/mm: Add support to make use of Secure Memory Encryption
x86/mm, kexec: Fix memory corruption with SME on successive kexecs
acpi, x86/mm: Remove encryption mask from ACPI page protection type

Tony Luck (1):
x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages

Vitaly Kuznetsov (1):
x86/mm: Enable RCU based page table freeing (CONFIG_HAVE_RCU_TABLE_FREE=y)

Wang Kai (1):
x86/mm/pkeys: Fix typo in Documentation/x86/protection-keys.txt


Documentation/admin-guide/kernel-parameters.txt | 13 +
Documentation/x86/amd-memory-encryption.txt | 68 +++
Documentation/x86/protection-keys.txt | 6 +-
Documentation/x86/x86_64/5level-paging.txt | 64 +++
arch/ia64/include/asm/acpi.h | 2 -
arch/ia64/kernel/efi.c | 4 +-
arch/x86/Kconfig | 49 ++
arch/x86/boot/compressed/kaslr.c | 63 +--
arch/x86/boot/compressed/pagetable.c | 7 +
arch/x86/include/asm/acpi.h | 13 +-
arch/x86/include/asm/cmdline.h | 2 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 4 +-
arch/x86/include/asm/dma-mapping.h | 5 +-
arch/x86/include/asm/dmi.h | 8 +-
arch/x86/include/asm/e820/api.h | 2 +
arch/x86/include/asm/elf.h | 4 +-
arch/x86/include/asm/fixmap.h | 20 +
arch/x86/include/asm/init.h | 1 +
arch/x86/include/asm/io.h | 8 +
arch/x86/include/asm/kexec.h | 11 +-
arch/x86/include/asm/kvm_host.h | 2 +-
arch/x86/include/asm/mem_encrypt.h | 80 ++++
arch/x86/include/asm/mmu.h | 25 +-
arch/x86/include/asm/mmu_context.h | 15 +-
arch/x86/include/asm/mpx.h | 9 +
arch/x86/include/asm/msr-index.h | 2 +
arch/x86/include/asm/page_64.h | 4 +
arch/x86/include/asm/page_types.h | 3 +-
arch/x86/include/asm/pgtable.h | 28 +-
arch/x86/include/asm/pgtable_types.h | 58 ++-
arch/x86/include/asm/processor-flags.h | 13 +-
arch/x86/include/asm/processor.h | 20 +-
arch/x86/include/asm/realmode.h | 12 +
arch/x86/include/asm/set_memory.h | 3 +
arch/x86/include/asm/tlb.h | 14 +
arch/x86/include/asm/tlbflush.h | 87 +++-
arch/x86/include/asm/vga.h | 14 +-
arch/x86/kernel/acpi/boot.c | 6 +-
arch/x86/kernel/cpu/amd.c | 29 +-
arch/x86/kernel/cpu/bugs.c | 8 +
arch/x86/kernel/cpu/common.c | 40 ++
arch/x86/kernel/cpu/mcheck/mce.c | 43 ++
arch/x86/kernel/cpu/scattered.c | 1 +
arch/x86/kernel/e820.c | 26 +-
arch/x86/kernel/espfix_64.c | 2 +-
arch/x86/kernel/head64.c | 95 +++-
arch/x86/kernel/head_64.S | 40 +-
arch/x86/kernel/kdebugfs.c | 34 +-
arch/x86/kernel/ksysfs.c | 28 +-
arch/x86/kernel/machine_kexec_64.c | 25 +-
arch/x86/kernel/mpparse.c | 108 +++--
arch/x86/kernel/pci-dma.c | 11 +-
arch/x86/kernel/pci-nommu.c | 2 +-
arch/x86/kernel/pci-swiotlb.c | 15 +-
arch/x86/kernel/process.c | 17 +-
arch/x86/kernel/relocate_kernel_64.S | 14 +
arch/x86/kernel/setup.c | 9 +
arch/x86/kernel/sys_x86_64.c | 30 +-
arch/x86/kvm/mmu.c | 41 +-
arch/x86/kvm/svm.c | 35 +-
arch/x86/kvm/vmx.c | 2 +-
arch/x86/kvm/x86.c | 3 +-
arch/x86/lib/cmdline.c | 105 +++++
arch/x86/mm/Makefile | 2 +
arch/x86/mm/dump_pagetables.c | 93 ++--
arch/x86/mm/fault.c | 26 +-
arch/x86/mm/hugetlbpage.c | 27 +-
arch/x86/mm/ident_map.c | 12 +-
arch/x86/mm/init.c | 2 +-
arch/x86/mm/ioremap.c | 287 +++++++++++-
arch/x86/mm/kasan_init_64.c | 6 +-
arch/x86/mm/mem_encrypt.c | 593 ++++++++++++++++++++++++
arch/x86/mm/mem_encrypt_boot.S | 149 ++++++
arch/x86/mm/mmap.c | 12 +-
arch/x86/mm/mpx.c | 33 +-
arch/x86/mm/pageattr.c | 67 +++
arch/x86/mm/pat.c | 9 +-
arch/x86/mm/pgtable.c | 8 +-
arch/x86/mm/tlb.c | 331 +++++++++----
arch/x86/pci/common.c | 4 +-
arch/x86/platform/efi/efi.c | 6 +-
arch/x86/platform/efi/efi_64.c | 15 +-
arch/x86/realmode/init.c | 12 +
arch/x86/realmode/rm/trampoline_64.S | 24 +
arch/x86/xen/Kconfig | 5 +
arch/x86/xen/enlighten_pv.c | 7 +
arch/x86/xen/mmu_pv.c | 5 +-
arch/x86/xen/xen-head.S | 2 +-
drivers/acpi/processor_idle.c | 2 -
drivers/firmware/dmi-sysfs.c | 5 +-
drivers/firmware/efi/efi.c | 33 ++
drivers/firmware/pcdp.c | 4 +-
drivers/gpu/drm/drm_gem.c | 2 +
drivers/gpu/drm/drm_vm.c | 4 +
drivers/gpu/drm/ttm/ttm_bo_vm.c | 7 +-
drivers/gpu/drm/udl/udl_fb.c | 4 +
drivers/idle/intel_idle.c | 9 +-
drivers/iommu/amd_iommu.c | 30 +-
drivers/iommu/amd_iommu_init.c | 34 +-
drivers/iommu/amd_iommu_proto.h | 10 +
drivers/iommu/amd_iommu_types.h | 2 +-
drivers/sfi/sfi_core.c | 23 +-
drivers/video/fbdev/core/fbmem.c | 12 +
include/asm-generic/early_ioremap.h | 2 +
include/asm-generic/pgtable.h | 12 +
include/linux/compiler-gcc.h | 2 +
include/linux/compiler.h | 4 +
include/linux/dma-mapping.h | 13 +
include/linux/efi.h | 9 +-
include/linux/io.h | 2 +
include/linux/kexec.h | 8 +
include/linux/mem_encrypt.h | 48 ++
include/linux/mm_inline.h | 6 +
include/linux/swiotlb.h | 1 +
init/main.c | 10 +
kernel/kexec_core.c | 12 +-
kernel/memremap.c | 20 +-
lib/swiotlb.c | 57 ++-
mm/early_ioremap.c | 28 +-
mm/memory-failure.c | 2 +
121 files changed, 3169 insertions(+), 498 deletions(-)
create mode 100644 Documentation/x86/amd-memory-encryption.txt
create mode 100644 Documentation/x86/x86_64/5level-paging.txt
create mode 100644 arch/x86/include/asm/mem_encrypt.h
create mode 100644 arch/x86/mm/mem_encrypt.c
create mode 100644 arch/x86/mm/mem_encrypt_boot.S
create mode 100644 include/linux/mem_encrypt.h

Kirill A. Shutemov

Sep 4, 2017, 8:30:07 AM
On Mon, Sep 04, 2017 at 11:31:58AM +0200, Ingo Molnar wrote:
> The main changes in this cycle are support for three new, complex hardware
> features of x86 CPUs:
>
> - Add 5-level paging support, which is a new hardware feature on upcoming Intel
> CPUs allowing up to 128 PB of virtual address space and 4 PB of physical RAM
> space - a 256-fold increase over the old limits.

Minor nitpick: I don't see where "256-fold" comes from.

Virtual address space increased from 256 TB to 128 PB -- 512 times.
Physical: 64 TB -> 4 PB -- 64 times.

--
Kirill A. Shutemov

Thomas Gleixner

Sep 4, 2017, 9:10:09 AM
See arch/x86/kernel/tsc.c:

-john...@us.ibm.com "math is hard, lets go shopping!"

Ingo Molnar

Sep 4, 2017, 10:00:06 AM
Yeah, I mis-remembered the increase in number of bits, should have double checked ...

/me goes shopping

Thanks,

Ingo

Linus Torvalds

Sep 5, 2017, 5:20:08 PM
On Mon, Sep 4, 2017 at 2:31 AM, Ingo Molnar <mi...@kernel.org> wrote:
>
> Please pull the latest ... git tree from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git ...

Hmm. My laptop (XPS 13) doesn't resume any more. It suspends, but
doesn't come back from resume.

I immediately assumed it was the power management pulls I just did,
but then I started bisecting, and now it's actually pointing into the
various x86 pulls I did yesterday instead.

Now, I'm reasonably early in my bisection (so literally "somewhere
between the 'docs-next' and the 'x86-mm-for-linus' pull"), and maybe
the problem isn't even entirely repeatable and my bisection has
already gone off the rails, but I thought I'd give at least an early
heads-up about this thing.

I'll have more as it bisects deeper into the merge window, but it
might be a while.

Linus

Andy Lutomirski

Sep 5, 2017, 5:40:07 PM


> On Sep 5, 2017, at 2:17 PM, Linus Torvalds <torv...@linux-foundation.org> wrote:
>
>> On Mon, Sep 4, 2017 at 2:31 AM, Ingo Molnar <mi...@kernel.org> wrote:
>>
>> Please pull the latest ... git tree from:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git ...
>
> Hmm. My laptop (XPS 13) doesn't resume any more. It suspends, but
> doesn't come back from resume.
>

You could try booting with nopcid to rule out CR4 issues. I can also imagine SME's very clever CR3 masking causing a problem.

Unfortunately, if it's a PCID problem involving wrong ordering of CR4 initialization, you might get lucky if you suspend with ASID 0 active, causing unfortunate bisection results.

I will test on my XPS 13 after lunch.

Ingo Molnar

Sep 5, 2017, 5:50:07 PM
Hm, just as background, there are no regression reports I'm aware of
against any of these trees, plus most of the dangerous commits have
been in linux-next for at least two weeks - the majority of them even
longer. The last 2-4 commits of x86/mm are fresher.

But it could be a regression in an older commit in x86/mm just as well,
triggered by you for the first time :-/

Does your laptop CPU have PCID support by any chance? The CPU coming
out of resume with PCID disabled and us not properly re-enabling it
might be a possible failure mode.

Thanks,

Ingo

Borislav Petkov

Sep 5, 2017, 5:50:07 PM
On Tue, Sep 05, 2017 at 02:37:41PM -0700, Andy Lutomirski wrote:
> I can also imagine SME's very clever CR3 masking causing a problem.

So that mask should be 0 on an !SME system. One of the main goals of the
SME code was to have no effect when SME is disabled/not available.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

Linus Torvalds

Sep 5, 2017, 5:50:07 PM
On Tue, Sep 5, 2017 at 2:40 PM, Ingo Molnar <mi...@kernel.org> wrote:
>
> Does your laptop CPU have PCID support by any chance? The CPU coming
> out of resume with PCID disabled and us not properly re-enabling it
> might be a possible failure mode.

It does have pcid. It's an i7-6560U in the prev-gen XPS 13 (aka "9350").

I have bisected a bit deeper, and the pcid code is definitely in that
bisection window still. But so are a fair number of other commits (about
160 right now).

Linus

Linus Torvalds

Sep 5, 2017, 6:00:07 PM
On Tue, Sep 5, 2017 at 2:40 PM, Ingo Molnar <mi...@kernel.org> wrote:
>
> Hm, just as background, there are no regression reports I'm aware of
> against any of these trees, plus most of the dangerous commits have
> been in linux-next for at least two weeks - the majority of them even
> longer. The last 2-4 commits of x86/mm are fresher.

Side note: I do not believe a lot of people actually run linux-next on
laptops, so suspend/resume likely doesn't get a lot of testing in
next.

I think most people who run linux-next tend to be automation things on farms.

Don't get me wrong - I love linux-next and your tip testing, but I
think linux-next is best for finding build errors etc big integration
issues, with some very rudimentary actual boot checking.

Maybe I'm wrong.

I _wish_ I am wrong.

But honestly, I see problems on my machines most merge windows. Last
release was actually unusually calm, in that I don't think I had to
bisect anything at all.

Which really says to me: "very few people actually _run_ those next trees".

Linus

Andy Lutomirski

Sep 5, 2017, 6:40:09 PM
On Tue, Sep 5, 2017 at 3:33 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
> On Tue, Sep 5, 2017 at 2:45 PM, Linus Torvalds
> <torv...@linux-foundation.org> wrote:
>>
>> I have bisected a bit deeper, and the pcid code is definitely in that
>> bisection window still. But so are a fair number of other commits (about
>> 160 right now).
>
> Now down to 18.
>
> And one of those 18 is commit 10af6235e0d3 ("x86/mm: Implement PCID
> based optimization: try to preserve old TLB entries using PCID"),
> which I guess is where the problem might actually start showing up if
> it is pcid.
>
> I'll continue to bisect rather than just test "nopcid", because I want
> to get that bisection result regardless.

The thing that would surprise me about this is that I dogfooded my
PCID tree on my XPS 13 9250 for quite a while and it worked just fine.
Maybe something got merged oddly or had an unpleasant interaction
somewhere?

Anyway, all two cores of that fancy Skylake are compiling your tree as
we speak :). It'll finish eventually.

--Andy

Linus Torvalds

Sep 5, 2017, 6:40:09 PM
On Tue, Sep 5, 2017 at 2:45 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> I have bisected a bit deeper, and the pcid code is definitely in that
> bisection window still. But so are a fair number of other commits (about
> 160 right now).

Now down to 18.

And one of those 18 is commit 10af6235e0d3 ("x86/mm: Implement PCID
based optimization: try to preserve old TLB entries using PCID"),
which I guess is where the problem might actually start showing up if
it is pcid.

I'll continue to bisect rather than just test "nopcid", because I want
to get that bisection result regardless.

Linus

Linus Torvalds

Sep 5, 2017, 7:10:09 PM
On Tue, Sep 5, 2017 at 3:33 PM, Linus Torvalds
<torv...@linux-foundation.org> wrote:
>
> And one of those 18 is commit 10af6235e0d3 ("x86/mm: Implement PCID
> based optimization: try to preserve old TLB entries using PCID"),
> which I guess is where the problem might actually start showing up if
> it is pcid.

Yup, that's what it bisected down to in the end.

And then rebooting once more into that kernel, but with "nopcid" on
the command line, and it all works.

I'll go back to top-of-tree just to verify that 'nopcid' thing there
too, but it does seem pretty clear-cut.

Linus

Jiri Kosina

Sep 6, 2017, 5:00:12 PM
On Tue, 5 Sep 2017, Linus Torvalds wrote:

> > And one of those 18 is commit 10af6235e0d3 ("x86/mm: Implement PCID
> > based optimization: try to preserve old TLB entries using PCID"),
> > which I guess is where the problem might actually start showing up if
> > it is pcid.
>
> Yup, that's what it bisected down to in the end.
>
> And then rebooting once more into that kernel, but with "nopcid" on
> the command line, and it all works.
>
> I'll go back to top-of-tree just to verify that 'nopcid' thing there
> too, but it does seem pretty clear-cut.

This is a "me too", observed on my Lenovo thinkpad x270 (so it's not
specific to that XPS 13 system at all).

The symptom I observe is that an attempt to resume from hibernation
proceeds up to reading 100% of the hibernation image, and then reboot
happens (IOW looks like triple fault).

nopcid cures it, I haven't tried to revert 10af6235e0d3 yet, but looks
like it's the same thing.

--
Jiri Kosina
SUSE Labs

Jiri Kosina

Sep 6, 2017, 5:20:07 PM
On Wed, 6 Sep 2017, Jiri Kosina wrote:

> This is a "me too", observed on my Lenovo thinkpad x270 (so it's not
> specific to that XPS 13 system at all).
>
> The symptom I observe is that an attempt to resume from hibernation
> proceeds up to reading 100% of the hibernation image, and then reboot
> happens (IOW looks like triple fault).
>
> nopcid cures it, I haven't tried to revert 10af6235e0d3 yet, but looks
> like it's the same thing.

[ reposting the information again with LKML re-introduced to CC ]

As suggested by Andy off-list, I tested with this change to always force
ASID 0

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5ca71d1..c3b0811 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -35,7 +35,7 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 {
 	u16 asid;
 
-	if (!static_cpu_has(X86_FEATURE_PCID)) {
+	if (true || !static_cpu_has(X86_FEATURE_PCID)) {
 		*new_asid = 0;
 		*need_flush = true;
 		return;

and that fixes the issue on my system.

Andy Lutomirski

Sep 6, 2017, 6:30:06 PM
I got Linus' config to boot. The problem was that I ended up with a
root-owned file (not sure which) in my tree that caused an incorrect
build but didn't generate errors. I don't know how this happened, but
an ill-timed sudo make -j4 modules_install install was probably
involved. git clean -ffxxxd did *not* fix it or even notice it in
any obvious way.

Anyway, the problem appears to depend on kernel config because it's
dying here on resume on secondary cpus:

VM_BUG_ON(__read_cr3() != (__sme_pa(real_prev->pgd) | prev_asid));

in switch_mm_irqs_off().

What seems to be going on is that the wakeup CPU is exactly restoring
original state. All other CPUs are restoring swapper_pg_dir but are
failing to restore the PCID tag bits, which trips the assertion w.p.
5/6 per non-boot CPU. So, if you have that debug option set, you die
w.p. 1 - (1/6)^(cpus - 1), which is pretty large.

I'll come up with a clean fix this evening, I hope.

Ingo Molnar

Sep 7, 2017, 4:10:07 AM

* Linus Torvalds <torv...@linux-foundation.org> wrote:

> On Tue, Sep 5, 2017 at 2:40 PM, Ingo Molnar <mi...@kernel.org> wrote:
> >
> > Hm, just as background, there are no regression reports I'm aware of
> > against any of these trees, plus most of the dangerous commits have
> > been in linux-next for at least two weeks - the majority of them even
> > longer. The last 2-4 commits of x86/mm are fresher.
>
> Side note: I do not believe a lot of people actually run linux-next on
> laptops, so suspend/resume likely doesn't get a lot of testing in
> next.
>
> I think most people who run linux-next tend to be automation things on farms.

Yeah, so 10af6235e0d3 was in linux-next for over a month, yet no-one reported the
bug.

> Don't get me wrong - I love linux-next and your tip testing, but I
> think linux-next is best for finding build errors etc big integration
> issues, with some very rudimentary actual boot checking.
>
> Maybe I'm wrong.

I don't think you are wrong - most boot tests don't involve laptops. linux-next is
mostly server oriented - and servers are often more debuggable than laptops. (Have
actual serial ports or physical network connections with serial emulation, etc.)

I tried to maintain a laptop testbox in -tip testing with netconsole for a time -
but it was quite a bit of pain so I eventually dropped it. (Not that the simple
boot + kernel build test that -tip does would have uncovered this particular bug.)

Maybe a tester or two saw the 'dead on resume' bug and didn't bother reporting it,
because it's a very difficult category of bug to debug short of a full bisection?

Thanks,

Ingo

Ingo Molnar

Sep 7, 2017, 4:20:09 AM

* Ingo Molnar <mi...@kernel.org> wrote:

>
> * Linus Torvalds <torv...@linux-foundation.org> wrote:
>
> > On Tue, Sep 5, 2017 at 2:40 PM, Ingo Molnar <mi...@kernel.org> wrote:
> > >
> > > Hm, just as background, there are no regression reports I'm aware of
> > > against any of these trees, plus most of the dangerous commits have
> > > been in linux-next for at least two weeks - the majority of them even
> > > longer. The last 2-4 commits of x86/mm are fresher.
> >
> > Side note: I do not believe a lot of people actually run linux-next on
> > laptops, so suspend/resume likely doesn't get a lot of testing in
> > next.
> >
> > I think most people who run linux-next tend to be automation things on farms.
>
> Yeah, so 10af6235e0d3 was in linux-next for over a month, yet no-one reported the
> bug.

That was also smack in the middle of the vacation season on the northern
hemisphere, which didn't help testing coverage either I suspect ...

In hindsight it was perhaps not the smartest thing from me to send three major
hw-enablement features to you - although only PCID was the one that should have
real widespread effects, and I did stage those changes pretty conservatively over
several months. Hindsight is 20/20 ...

Thanks,

Ingo

Pavel Machek

Sep 15, 2017, 7:30:09 AM
On Thu 2017-09-07 10:07:09, Ingo Molnar wrote:
>
> * Linus Torvalds <torv...@linux-foundation.org> wrote:
>
> > On Tue, Sep 5, 2017 at 2:40 PM, Ingo Molnar <mi...@kernel.org> wrote:
> > >
> > > Hm, just as background, there are no regression reports I'm aware of
> > > against any of these trees, plus most of the dangerous commits have
> > > been in linux-next for at least two weeks - the majority of them even
> > > longer. The last 2-4 commits of x86/mm are fresher.
> >
> > Side note: I do not believe a lot of people actually run linux-next on
> > laptops, so suspend/resume likely doesn't get a lot of testing in
> > next.
> >
> > I think most people who run linux-next tend to be automation things on farms.
>
> Yeah, so 10af6235e0d3 was in linux-next for over a month, yet no-one reported the
> bug.
>
> > Don't get me wrong - I love linux-next and your tip testing, but I
> > think linux-next is best for finding build errors etc big integration
> > issues, with some very rudimentary actual boot checking.
> >
> > Maybe I'm wrong.
>
> I don't think you are wrong - most boot tests don't involve laptops. linux-next is
> mostly server oriented - and servers are often more debuggable than laptops. (Have
> actual serial ports or physical network connections with serial emulation, etc.)
>
> I tried to maintain a laptop testbox in -tip testing with netconsole for a time -
> but it was quite a bit of pain so I eventually dropped it. (Not that the simple
> boot + kernel build test that -tip does would have uncovered this particular bug.)

Some time ago, Tony Lindgren asked me to test Nokia N900 on -next from
time to time. I do, and it uncovers problems from time to time.

Perhaps if Linus or you asks for a volunteer, they can get one? We
have people submitting patches due to various challenges, perhaps "run
-next for a month" would be suitable challenge?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html