
[PATCH 0/18][RFC] Nested Paging support for Nested SVM (aka NPT-Virtualization)


Joerg Roedel, Mar 3, 2010, 2:20:03 PM
Hi,

here are the patches that implement nested paging support for nested
svm. They are somewhat intrusive to the soft-mmu so I post them as RFC
in the first round to get feedback about the general direction of the
changes. Nevertheless I am proud to report that with these patches the
famous kernel-compile benchmark runs only 4% slower in the l2 guest than
in the l1 guest when l2 is single-processor. With SMP guests the
situation is very different: the more vcpus the guest has, the bigger
the performance drop from l1 to l2.
Anyway, this post is to get feedback about the overall concept of these
patches. Please review and give feedback :-)

Thanks,

Joerg

Diffstat:

arch/x86/include/asm/kvm_host.h | 21 ++++++
arch/x86/kvm/mmu.c | 152 ++++++++++++++++++++++++++++++---------
arch/x86/kvm/mmu.h | 2 +
arch/x86/kvm/paging_tmpl.h | 81 ++++++++++++++++++---
arch/x86/kvm/svm.c | 126 +++++++++++++++++++++++++++-----
arch/x86/kvm/vmx.c | 9 +++
arch/x86/kvm/x86.c | 19 +++++-
include/linux/kvm.h | 1 +
include/linux/kvm_host.h | 5 ++
9 files changed, 354 insertions(+), 62 deletions(-)

Shortlog:

Joerg Roedel (18):
KVM: MMU: Check for root_level instead of long mode
KVM: MMU: Make tdp_enabled a mmu-context parameter
KVM: MMU: Make set_cr3 a function pointer in kvm_mmu
KVM: X86: Introduce a tdp_set_cr3 function
KVM: MMU: Introduce get_cr3 function pointer
KVM: MMU: Introduce inject_page_fault function pointer
KVM: SVM: Implement MMU helper functions for Nested Nested Paging
KVM: MMU: Change init_kvm_softmmu to take a context as parameter
KVM: MMU: Let is_rsvd_bits_set take mmu context instead of vcpu
KVM: MMU: Introduce generic walk_addr function
KVM: MMU: Add infrastructure for two-level page walker
KVM: MMU: Implement nested gva_to_gpa functions
KVM: MMU: Introduce Nested MMU context
KVM: SVM: Initialize Nested Nested MMU context on VMRUN
KVM: MMU: Propagate the right fault back to the guest after gva_to_gpa
KVM: X86: Add callback to let modules decide over some supported cpuid bits
KVM: SVM: Report Nested Paging support to userspace
KVM: X86: Add KVM_CAP_SVM_CPUID_FIXED


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Joerg Roedel, Mar 3, 2010, 2:20:02 PM
This patch introduces a second MMU context which will hold
the paging information for the l2 guest.

Signed-off-by: Joerg Roedel <joerg....@amd.com>
---
arch/x86/include/asm/kvm_host.h | 8 ++++++
arch/x86/kvm/mmu.c | 48 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 20dd1ce..66a698e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -264,6 +264,13 @@ struct kvm_mmu {

u64 *pae_root;
u64 rsvd_bits_mask[2][4];
+
+ /*
+ * If true the mmu runs in two-level mode.
+ * vcpu->arch.nested_mmu needs to contain meaningful
+ * values then.
+ */
+ bool nested;
};

struct kvm_vcpu_arch {
@@ -296,6 +303,7 @@ struct kvm_vcpu_arch {

struct kvm_mmu mmu;

+ /* This will hold the mmu context of the second level guest */
struct kvm_mmu nested_mmu;

/* only needed in kvm_pv_mmu_op() path, but it's hot so
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index c831955..ccaf6b1 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2154,6 +2154,18 @@ static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 *error)
return gpa;
}

+static gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 *error)
+{
+ u32 access;
+
+ BUG_ON(!vcpu->arch.mmu.nested);
+
+ /* NPT walks are treated as user writes */
+ access = PFERR_WRITE_MASK | PFERR_USER_MASK;
+
+ return vcpu->arch.nested_mmu.gva_to_gpa(vcpu, gpa, access, error);
+}
+
static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr,
u32 access, u32 *error)
{
@@ -2476,11 +2488,45 @@ static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
return r;
}

+static int init_kvm_nested_mmu(struct kvm_vcpu *vcpu)
+{
+ struct kvm_mmu *g_context = &vcpu->arch.nested_mmu;
+ struct kvm_mmu *h_context = &vcpu->arch.mmu;
+
+ g_context->get_cr3 = get_cr3;
+ g_context->translate_gpa = translate_nested_gpa;
+ g_context->inject_page_fault = kvm_inject_page_fault;
+
+ /*
+ * Note that arch.mmu.gva_to_gpa translates l2_gva to l1_gpa. The
+ * translation of l2_gpa to l1_gpa addresses is done using the
+ * arch.nested_mmu.gva_to_gpa function. Basically the gva_to_gpa
+ * functions between mmu and nested_mmu are swapped.
+ */
+ if (!is_paging(vcpu)) {
+ g_context->root_level = 0;
+ h_context->gva_to_gpa = nonpaging_gva_to_gpa_nested;
+ } else if (is_long_mode(vcpu)) {
+ g_context->root_level = PT64_ROOT_LEVEL;
+ h_context->gva_to_gpa = paging64_gva_to_gpa_nested;
+ } else if (is_pae(vcpu)) {
+ g_context->root_level = PT32E_ROOT_LEVEL;
+ h_context->gva_to_gpa = paging64_gva_to_gpa_nested;
+ } else {
+ g_context->root_level = PT32_ROOT_LEVEL;
+ h_context->gva_to_gpa = paging32_gva_to_gpa_nested;
+ }
+
+ return 0;
+}
+
static int init_kvm_mmu(struct kvm_vcpu *vcpu)
{
vcpu->arch.update_pte.pfn = bad_pfn;

- if (tdp_enabled)
+ if (vcpu->arch.mmu.nested)
+ return init_kvm_nested_mmu(vcpu);
+ else if (tdp_enabled)
return init_kvm_tdp_mmu(vcpu);
else
return init_kvm_softmmu(vcpu);
--
1.7.0

Joerg Roedel, Mar 3, 2010, 2:20:03 PM
This patch changes the tdp_enabled flag from its global
meaning to the mmu-context. This is necessary for Nested SVM
with emulation of Nested Paging where we need an extra MMU
context to shadow the Nested Nested Page Table.

Signed-off-by: Joerg Roedel <joerg....@amd.com>
---

arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu.c | 8 +++++---
2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ec891a2..e7bef19 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -254,6 +254,7 @@ struct kvm_mmu {
int root_level;
int shadow_root_level;
union kvm_mmu_page_role base_role;
+ bool tdp_enabled;

u64 *pae_root;
u64 rsvd_bits_mask[2][4];

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 741373e..5c66c99 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1812,7 +1812,7 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
spte |= shadow_user_mask;
if (level > PT_PAGE_TABLE_LEVEL)
spte |= PT_PAGE_SIZE_MASK;
- if (tdp_enabled)
+ if (vcpu->arch.mmu.tdp_enabled)
spte |= kvm_x86_ops->get_mt_mask(vcpu, gfn,
kvm_is_mmio_pfn(pfn));

@@ -2077,7 +2077,7 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
hpa_t root = vcpu->arch.mmu.root_hpa;

ASSERT(!VALID_PAGE(root));
- if (tdp_enabled)
+ if (vcpu->arch.mmu.tdp_enabled)
direct = 1;
if (mmu_check_root(vcpu, root_gfn))
return 1;
@@ -2090,7 +2090,7 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
return 0;
}
direct = !is_paging(vcpu);
- if (tdp_enabled)
+ if (vcpu->arch.mmu.tdp_enabled)
direct = 1;
for (i = 0; i < 4; ++i) {
hpa_t root = vcpu->arch.mmu.pae_root[i];
@@ -2397,6 +2397,7 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
context->invlpg = nonpaging_invlpg;
context->shadow_root_level = kvm_x86_ops->get_tdp_level();
context->root_hpa = INVALID_PAGE;
+ vcpu->arch.mmu.tdp_enabled = true;

if (!is_paging(vcpu)) {
context->gva_to_gpa = nonpaging_gva_to_gpa;
@@ -2435,6 +2436,7 @@ static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
r = paging32_init_context(vcpu);

vcpu->arch.mmu.base_role.glevels = vcpu->arch.mmu.root_level;
+ vcpu->arch.mmu.tdp_enabled = false;

return r;
}
--
1.7.0

Joerg Roedel, Mar 3, 2010, 2:20:03 PM
This patch changes is_rsvd_bits_set() function prototype to
take only a kvm_mmu context instead of a full vcpu.

Signed-off-by: Joerg Roedel <joerg....@amd.com>
---

arch/x86/kvm/mmu.c | 4 ++--
arch/x86/kvm/paging_tmpl.h | 3 ++-
2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 560ecb6..647353d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2270,12 +2270,12 @@ static void paging_free(struct kvm_vcpu *vcpu)
nonpaging_free(vcpu);
}

-static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level)
+static bool is_rsvd_bits_set(struct kvm_mmu *mmu, u64 gpte, int level)
{
int bit7;

bit7 = (gpte >> 7) & 1;
- return (gpte & vcpu->arch.mmu.rsvd_bits_mask[bit7][level-1]) != 0;
+ return (gpte & mmu->rsvd_bits_mask[bit7][level-1]) != 0;
}

#define PTTYPE 64
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 1149daa..8608439 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -158,7 +158,8 @@ walk:
if (!is_present_gpte(pte))
goto not_present;

- rsvd_fault = is_rsvd_bits_set(vcpu, pte, walker->level);
+ rsvd_fault = is_rsvd_bits_set(&vcpu->arch.mmu, pte,
+ walker->level);
if (rsvd_fault)
goto access_error;

--
1.7.0

Joerg Roedel, Mar 3, 2010, 2:20:03 PM
This patch introduces a mmu-callback to translate gpa
addresses in the walk_addr code. This is later used to
translate l2_gpa addresses into l1_gpa addresses.

Signed-off-by: Joerg Roedel <joerg....@amd.com>
---

arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu.c | 7 +++++++
arch/x86/kvm/paging_tmpl.h | 19 +++++++++++++++++++
include/linux/kvm_host.h | 5 +++++
4 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c0b5576..76c8b5f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -250,6 +250,7 @@ struct kvm_mmu {
void (*free)(struct kvm_vcpu *vcpu);
gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
u32 *error);
+ gpa_t (*translate_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 *error);
void (*prefetch_page)(struct kvm_vcpu *vcpu,
struct kvm_mmu_page *page);
int (*sync_page)(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 647353d..ec3830c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2149,6 +2149,11 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
spin_unlock(&vcpu->kvm->mmu_lock);
}

+static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 *error)
+{
+ return gpa;
+}
+
static gpa_t nonpaging_gva_to_gpa(struct kvm_vcpu *vcpu, gva_t vaddr,
u32 access, u32 *error)
{

@@ -2399,6 +2404,7 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
context->new_cr3 = nonpaging_new_cr3;
context->page_fault = tdp_page_fault;
context->free = nonpaging_free;
+ context->translate_gpa = translate_gpa;
context->prefetch_page = nonpaging_prefetch_page;
context->sync_page = nonpaging_sync_page;
context->invlpg = nonpaging_invlpg;
@@ -2443,6 +2449,7 @@ int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
else
r = paging32_init_context(vcpu, context);

+ vcpu->arch.mmu.translate_gpa = translate_gpa;


vcpu->arch.mmu.base_role.glevels = vcpu->arch.mmu.root_level;

vcpu->arch.mmu.tdp_enabled = false;

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 6c55a31..a72d5ea 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -122,6 +122,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
unsigned index, pt_access, pte_access;
gpa_t pte_gpa;
int rsvd_fault = 0;
+ u32 error;

trace_kvm_mmu_pagetable_walk(addr, write_fault, user_fault,
fetch_fault);
@@ -150,6 +151,15 @@ walk:
table_gfn = gpte_to_gfn(pte);
pte_gpa = gfn_to_gpa(table_gfn);
pte_gpa += index * sizeof(pt_element_t);
+
+ pte_gpa = mmu->translate_gpa(vcpu, pte_gpa, &error);
+ if (pte_gpa == UNMAPPED_GVA) {
+ walker->error_code = error;
+ return 0;
+ }
+ /* pte_gpa might have changed - recalculate table_gfn */
+ table_gfn = gpa_to_gfn(pte_gpa);
+
walker->table_gfn[walker->level - 1] = table_gfn;
walker->pte_gpa[walker->level - 1] = pte_gpa;

@@ -209,6 +219,15 @@ walk:
is_cpuid_PSE36())
walker->gfn += pse36_gfn_delta(pte);

+ /* Do the final translation */
+ pte_gpa = gfn_to_gpa(walker->gfn);
+ pte_gpa = mmu->translate_gpa(vcpu, pte_gpa, &error);
+ if (pte_gpa == UNMAPPED_GVA) {
+ walker->error_code = error;
+ return 0;
+ }
+ walker->gfn = gpa_to_gfn(pte_gpa);
+
break;
}

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a3fd0f9..ef2e81a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -503,6 +503,11 @@ static inline gpa_t gfn_to_gpa(gfn_t gfn)
return (gpa_t)gfn << PAGE_SHIFT;
}

+static inline gfn_t gpa_to_gfn(gpa_t gpa)
+{
+ return (gfn_t)gpa >> PAGE_SHIFT;
+}
+
static inline hpa_t pfn_to_hpa(pfn_t pfn)
{
return (hpa_t)pfn << PAGE_SHIFT;
--
1.7.0

Joerg Roedel, Mar 3, 2010, 2:20:04 PM
The walk_addr function checks for !is_long_mode in its 64
bit version. But what is meant here is a check for pae
paging. Change the condition to really check for pae paging
so that it also works with nested nested paging.

Signed-off-by: Joerg Roedel <joerg....@amd.com>
---

arch/x86/kvm/paging_tmpl.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 81eab9a..92b6bb5 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -128,7 +128,7 @@ walk:
walker->level = vcpu->arch.mmu.root_level;
pte = vcpu->arch.cr3;
#if PTTYPE == 64
- if (!is_long_mode(vcpu)) {
+ if (vcpu->arch.mmu.root_level == PT32E_ROOT_LEVEL) {
pte = kvm_pdptr_read(vcpu, (addr >> 30) & 3);
trace_kvm_mmu_paging_element(pte, walker->level);
if (!is_present_gpte(pte))
@@ -194,7 +194,7 @@ walk:
(PTTYPE == 64 || is_pse(vcpu))) ||
((walker->level == PT_PDPE_LEVEL) &&
(pte & PT_PAGE_SIZE_MASK) &&
- is_long_mode(vcpu))) {
+ vcpu->arch.mmu.root_level == PT64_ROOT_LEVEL)) {
int lvl = walker->level;

walker->gfn = gpte_to_gfn_lvl(pte, lvl);
--
1.7.0

Joerg Roedel, Mar 3, 2010, 2:20:04 PM
This patch introduces an inject_page_fault function pointer
into struct kvm_mmu which will be used to inject page
faults. It will be needed later when Nested Nested Paging is
implemented.

Signed-off-by: Joerg Roedel <joerg....@amd.com>
---

arch/x86/include/asm/kvm_host.h | 3 +++
arch/x86/kvm/mmu.c | 3 ++-
2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 37d0145..c0b5576 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -244,6 +244,9 @@ struct kvm_mmu {
void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err);
+ void (*inject_page_fault)(struct kvm_vcpu *vcpu,
+ unsigned long addr,
+ u32 error_code);
void (*free)(struct kvm_vcpu *vcpu);
gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
u32 *error);

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 189c68d..8f835f1 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2263,7 +2263,7 @@ static void inject_page_fault(struct kvm_vcpu *vcpu,
u64 addr,
u32 err_code)
{
- kvm_inject_page_fault(vcpu, addr, err_code);
+ vcpu->arch.mmu.inject_page_fault(vcpu, addr, err_code);
}

static void paging_free(struct kvm_vcpu *vcpu)

@@ -2446,6 +2446,7 @@ static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
vcpu->arch.mmu.tdp_enabled = false;
vcpu->arch.mmu.set_cr3 = kvm_x86_ops->set_cr3;
vcpu->arch.mmu.get_cr3 = get_cr3;
+ vcpu->arch.mmu.inject_page_fault = kvm_inject_page_fault;

return r;
}
--
1.7.0

Joerg Roedel, Mar 3, 2010, 2:20:02 PM
This patch adds the set_supported_cpuid callback to
kvm_x86_ops. It will be used in do_cpuid_ent to delegate the
decision about some supported cpuid bits to the
architecture modules.

Signed-off-by: Joerg Roedel <joerg....@amd.com>
---

arch/x86/include/asm/kvm_host.h | 2 ++
arch/x86/kvm/svm.c | 6 ++++++
arch/x86/kvm/vmx.c | 6 ++++++
arch/x86/kvm/x86.c | 3 +++
4 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 66a698e..7d649f9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -555,6 +555,8 @@ struct kvm_x86_ops {
bool (*rdtscp_supported)(void);
void (*set_tdp_cr3)(struct kvm_vcpu *vcpu, unsigned long cr3);

+ void (*set_supported_cpuid)(u32 func, struct kvm_cpuid_entry2 *entry);
+
const struct trace_print_flags *exit_reasons_str;
};

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index bce10fe..fe1398e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3287,6 +3287,10 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
{
}

+static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
+{
+}
+
static const struct trace_print_flags svm_exit_reasons_str[] = {
{ SVM_EXIT_READ_CR0, "read_cr0" },
{ SVM_EXIT_READ_CR3, "read_cr3" },
@@ -3435,6 +3439,8 @@ static struct kvm_x86_ops svm_x86_ops = {
.rdtscp_supported = svm_rdtscp_supported,

.set_tdp_cr3 = set_tdp_cr3,
+
+ .set_supported_cpuid = svm_set_supported_cpuid,
};

static int __init svm_init(void)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 530d14d..9216867 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4146,6 +4146,10 @@ static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
}
}

+static void vmx_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
+{
+}
+
static struct kvm_x86_ops vmx_x86_ops = {
.cpu_has_kvm_support = cpu_has_kvm_support,
.disabled_by_bios = vmx_disabled_by_bios,
@@ -4220,6 +4224,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
.rdtscp_supported = vmx_rdtscp_supported,

.set_tdp_cr3 = vmx_set_cr3,
+
+ .set_supported_cpuid = vmx_set_supported_cpuid,
};

static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9f8b02d..53360de 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1921,6 +1921,9 @@ static void do_cpuid_ent(struct kvm_cpuid_entry2 *entry, u32 function,
entry->ecx &= kvm_supported_word6_x86_features;
break;
}
+
+ kvm_x86_ops->set_supported_cpuid(function, entry);
+
put_cpu();
}

--
1.7.0

Joerg Roedel, Mar 3, 2010, 2:20:04 PM
This function pointer in the MMU context is required to
implement Nested Nested Paging.

Signed-off-by: Joerg Roedel <joerg....@amd.com>
---

arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/mmu.c | 9 ++++++++-
arch/x86/kvm/paging_tmpl.h | 4 ++--
3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1bf8501..37d0145 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -242,6 +242,7 @@ struct kvm_pio_request {
struct kvm_mmu {
void (*new_cr3)(struct kvm_vcpu *vcpu);
void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
+ unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err);
void (*free)(struct kvm_vcpu *vcpu);
gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 84e3209..189c68d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2071,7 +2071,7 @@ static int mmu_alloc_roots(struct kvm_vcpu *vcpu)
int direct = 0;
u64 pdptr;

- root_gfn = vcpu->arch.cr3 >> PAGE_SHIFT;
+ root_gfn = vcpu->arch.mmu.get_cr3(vcpu) >> PAGE_SHIFT;

if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
hpa_t root = vcpu->arch.mmu.root_hpa;

@@ -2254,6 +2254,11 @@ static void paging_new_cr3(struct kvm_vcpu *vcpu)
mmu_free_roots(vcpu);
}

+static unsigned long get_cr3(struct kvm_vcpu *vcpu)
+{
+ return vcpu->arch.cr3;
+}
+


static void inject_page_fault(struct kvm_vcpu *vcpu,
u64 addr,
u32 err_code)

@@ -2399,6 +2404,7 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)

context->root_hpa = INVALID_PAGE;
vcpu->arch.mmu.tdp_enabled = true;
vcpu->arch.mmu.set_cr3 = kvm_x86_ops->set_tdp_cr3;
+ vcpu->arch.mmu.get_cr3 = get_cr3;

if (!is_paging(vcpu)) {
context->gva_to_gpa = nonpaging_gva_to_gpa;

@@ -2439,6 +2445,7 @@ static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
vcpu->arch.mmu.base_role.glevels = vcpu->arch.mmu.root_level;
vcpu->arch.mmu.tdp_enabled = false;
vcpu->arch.mmu.set_cr3 = kvm_x86_ops->set_cr3;
+ vcpu->arch.mmu.get_cr3 = get_cr3;

return r;
}
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 92b6bb5..1149daa 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -126,7 +126,7 @@ static int FNAME(walk_addr)(struct guest_walker *walker,
fetch_fault);
walk:
walker->level = vcpu->arch.mmu.root_level;
- pte = vcpu->arch.cr3;
+ pte = vcpu->arch.mmu.get_cr3(vcpu);
#if PTTYPE == 64
if (vcpu->arch.mmu.root_level == PT32E_ROOT_LEVEL) {
pte = kvm_pdptr_read(vcpu, (addr >> 30) & 3);

@@ -137,7 +137,7 @@ walk:
}
#endif
ASSERT((!is_long_mode(vcpu) && is_pae(vcpu)) ||
- (vcpu->arch.cr3 & CR3_NONPAE_RESERVED_BITS) == 0);
+ (vcpu->arch.mmu.get_cr3(vcpu) & CR3_NONPAE_RESERVED_BITS) == 0);

pt_access = ACC_ALL;

--
1.7.0

Joerg Roedel, Mar 3, 2010, 2:20:03 PM
This patch implements the reporting of the nested paging
feature support to userspace.

Signed-off-by: Joerg Roedel <joerg....@amd.com>
---

arch/x86/kvm/svm.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index fe1398e..ce71023 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3289,6 +3289,16 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)



static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
{
+ switch (func) {
+ case 0x8000000A:
+ if (!npt_enabled)
+ break;
+
+ /* NPT feature is supported by Nested SVM */
+ entry->edx = SVM_FEATURE_NPT;
+
+ break;
+ }
}

static const struct trace_print_flags svm_exit_reasons_str[] = {

--
1.7.0

Joerg Roedel, Mar 3, 2010, 2:20:03 PM
Some logic of this function is required to build the Nested
Nested Paging context. So factor the required logic into a
separate function and export it.
Also make the whole init path suitable for more than one mmu
context.

Signed-off-by: Joerg Roedel <joerg....@amd.com>
---

arch/x86/kvm/mmu.c | 60 ++++++++++++++++++++++++++++++---------------------
arch/x86/kvm/mmu.h | 1 +
2 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 8f835f1..560ecb6 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2225,10 +2225,9 @@ static void nonpaging_free(struct kvm_vcpu *vcpu)
mmu_free_roots(vcpu);
}

-static int nonpaging_init_context(struct kvm_vcpu *vcpu)
+static int nonpaging_init_context(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context)
{
- struct kvm_mmu *context = &vcpu->arch.mmu;
-
context->new_cr3 = nonpaging_new_cr3;
context->page_fault = nonpaging_page_fault;
context->gva_to_gpa = nonpaging_gva_to_gpa;
@@ -2287,9 +2286,10 @@ static bool is_rsvd_bits_set(struct kvm_vcpu *vcpu, u64 gpte, int level)
#include "paging_tmpl.h"
#undef PTTYPE

-static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
+static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context,
+ int level)
{
- struct kvm_mmu *context = &vcpu->arch.mmu;
int maxphyaddr = cpuid_maxphyaddr(vcpu);
u64 exb_bit_rsvd = 0;

@@ -2342,9 +2342,11 @@ static void reset_rsvds_bits_mask(struct kvm_vcpu *vcpu, int level)
}
}

-static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level)
+static int paging64_init_context_common(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context,
+ int level)
{
- struct kvm_mmu *context = &vcpu->arch.mmu;
+ reset_rsvds_bits_mask(vcpu, context, level);

ASSERT(is_pae(vcpu));
context->new_cr3 = paging_new_cr3;
@@ -2360,17 +2362,17 @@ static int paging64_init_context_common(struct kvm_vcpu *vcpu, int level)
return 0;
}

-static int paging64_init_context(struct kvm_vcpu *vcpu)
+static int paging64_init_context(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context)
{
- reset_rsvds_bits_mask(vcpu, PT64_ROOT_LEVEL);
- return paging64_init_context_common(vcpu, PT64_ROOT_LEVEL);
+ return paging64_init_context_common(vcpu, context, PT64_ROOT_LEVEL);
}

-static int paging32_init_context(struct kvm_vcpu *vcpu)
+static int paging32_init_context(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context)
{
- struct kvm_mmu *context = &vcpu->arch.mmu;
+ reset_rsvds_bits_mask(vcpu, context, PT32_ROOT_LEVEL);

- reset_rsvds_bits_mask(vcpu, PT32_ROOT_LEVEL);
context->new_cr3 = paging_new_cr3;
context->page_fault = paging32_page_fault;
context->gva_to_gpa = paging32_gva_to_gpa;
@@ -2384,10 +2386,10 @@ static int paging32_init_context(struct kvm_vcpu *vcpu)
return 0;
}

-static int paging32E_init_context(struct kvm_vcpu *vcpu)
+static int paging32E_init_context(struct kvm_vcpu *vcpu,
+ struct kvm_mmu *context)
{
- reset_rsvds_bits_mask(vcpu, PT32E_ROOT_LEVEL);
- return paging64_init_context_common(vcpu, PT32E_ROOT_LEVEL);
+ return paging64_init_context_common(vcpu, context, PT32E_ROOT_LEVEL);
}

static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)

@@ -2410,15 +2412,15 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
context->gva_to_gpa = nonpaging_gva_to_gpa;
context->root_level = 0;
} else if (is_long_mode(vcpu)) {
- reset_rsvds_bits_mask(vcpu, PT64_ROOT_LEVEL);
+ reset_rsvds_bits_mask(vcpu, context, PT64_ROOT_LEVEL);
context->gva_to_gpa = paging64_gva_to_gpa;
context->root_level = PT64_ROOT_LEVEL;
} else if (is_pae(vcpu)) {
- reset_rsvds_bits_mask(vcpu, PT32E_ROOT_LEVEL);
+ reset_rsvds_bits_mask(vcpu, context, PT32E_ROOT_LEVEL);
context->gva_to_gpa = paging64_gva_to_gpa;
context->root_level = PT32E_ROOT_LEVEL;
} else {
- reset_rsvds_bits_mask(vcpu, PT32_ROOT_LEVEL);
+ reset_rsvds_bits_mask(vcpu, context, PT32_ROOT_LEVEL);
context->gva_to_gpa = paging32_gva_to_gpa;
context->root_level = PT32_ROOT_LEVEL;
}
@@ -2426,24 +2428,32 @@ static int init_kvm_tdp_mmu(struct kvm_vcpu *vcpu)
return 0;
}

-static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
+int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context)
{
int r;
-
ASSERT(vcpu);
ASSERT(!VALID_PAGE(vcpu->arch.mmu.root_hpa));

if (!is_paging(vcpu))
- r = nonpaging_init_context(vcpu);
+ r = nonpaging_init_context(vcpu, context);
else if (is_long_mode(vcpu))
- r = paging64_init_context(vcpu);
+ r = paging64_init_context(vcpu, context);
else if (is_pae(vcpu))
- r = paging32E_init_context(vcpu);
+ r = paging32E_init_context(vcpu, context);
else
- r = paging32_init_context(vcpu);
+ r = paging32_init_context(vcpu, context);

vcpu->arch.mmu.base_role.glevels = vcpu->arch.mmu.root_level;
vcpu->arch.mmu.tdp_enabled = false;
+
+ return r;
+}
+EXPORT_SYMBOL_GPL(kvm_init_shadow_mmu);
+
+static int init_kvm_softmmu(struct kvm_vcpu *vcpu)
+{
+ int r = kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);
+
vcpu->arch.mmu.set_cr3 = kvm_x86_ops->set_cr3;
vcpu->arch.mmu.get_cr3 = get_cr3;
vcpu->arch.mmu.inject_page_fault = kvm_inject_page_fault;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index be66759..64f619b 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -49,6 +49,7 @@
#define PFERR_FETCH_MASK (1U << 4)

int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
+int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);

static inline void kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
{
--
1.7.0

Jan Kiszka, Mar 3, 2010, 6:20:02 PM
Joerg Roedel wrote:
> Hi,
>
> here are the patches that implement nested paging support for nested
> svm. They are somewhat intrusive to the soft-mmu so I post them as RFC
> in the first round to get feedback about the general direction of the
> changes. Nevertheless I am proud to report that with these patches the
> famous kernel-compile benchmark runs only 4% slower in the l2 guest as
> in the l1 guest when l2 is single-processor. With SMP guests the

Wow!

Jan


Alexander Graf, Mar 3, 2010, 6:40:02 PM

On 03.03.2010, at 20:12, Joerg Roedel wrote:

> This patch implements the reporting of the nested paging
> feature support to userspace.
>
> Signed-off-by: Joerg Roedel <joerg....@amd.com>
> ---
> arch/x86/kvm/svm.c | 10 ++++++++++
> 1 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index fe1398e..ce71023 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -3289,6 +3289,16 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
>
> static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
> {
> + switch (func) {
> + case 0x8000000A:
> + if (!npt_enabled)
> + break;

if (!nested)
break;

Alex

Alexander Graf, Mar 3, 2010, 6:50:02 PM

On 03.03.2010, at 20:12, Joerg Roedel wrote:

> Hi,
>
> here are the patches that implement nested paging support for nested
> svm. They are somewhat intrusive to the soft-mmu so I post them as RFC
> in the first round to get feedback about the general direction of the
> changes. Nevertheless I am proud to report that with these patches the
> famous kernel-compile benchmark runs only 4% slower in the l2 guest as
> in the l1 guest when l2 is single-processor. With SMP guests the
> situation is very different. The more vcpus the guest has the more is
> the performance drop from l1 to l2.
> Anyway, this post is to get feedback about the overall concept of these
> patches. Please review and give feedback :-)

Nice job! It's great to see you finally got around to it :-).

Have you tracked what slows down SMP l2 guests yet? So far I've been assuming that IPIs just completely kill the performance, but I guess it shouldn't be that bad, especially now that you have sped up the #VMEXIT path that much.


Alex

Anthony Liguori, Mar 3, 2010, 7:40:02 PM
On 03/03/2010 01:12 PM, Joerg Roedel wrote:
> Hi,
>
> here are the patches that implement nested paging support for nested
> svm. They are somewhat intrusive to the soft-mmu so I post them as RFC
> in the first round to get feedback about the general direction of the
> changes. Nevertheless I am proud to report that with these patches the
> famous kernel-compile benchmark runs only 4% slower in the l2 guest as
> in the l1 guest when l2 is single-processor.

That's an awesome result.

Regards,

Anthony Liguori

Joerg Roedel, Mar 4, 2010, 6:30:02 AM
On Thu, Mar 04, 2010 at 12:44:48AM +0100, Alexander Graf wrote:
>
> On 03.03.2010, at 20:12, Joerg Roedel wrote:
>
> > Hi,
> >
> > here are the patches that implement nested paging support for nested
> > svm. They are somewhat intrusive to the soft-mmu so I post them as RFC
> > in the first round to get feedback about the general direction of the
> > changes. Nevertheless I am proud to report that with these patches the
> > famous kernel-compile benchmark runs only 4% slower in the l2 guest as
> > in the l1 guest when l2 is single-processor. With SMP guests the
> > situation is very different. The more vcpus the guest has the more is
> > the performance drop from l1 to l2.
> > Anyway, this post is to get feedback about the overall concept of these
> > patches. Please review and give feedback :-)
>
> Nice job! It's great to see you finally got around to it :-).
>
> Have you tracked what slows down SMP l2 guests yet? So far I've been
> assuming that IPIs just completely kill the performance, but I guess
> it shouldn't be that bad, especially now where you have sped up the
> #VMEXIT path that much.

I have not yet looked deeper into this issue. I also suspect lock-holder
preemption to be the cause for this. I did the test with a
populated nested page table too and the slowdown is still there. But
that's all guessing; I need to do some research to find the exact reasons.

Joerg

Joerg Roedel, Mar 4, 2010, 6:30:02 AM
On Thu, Mar 04, 2010 at 12:37:42AM +0100, Alexander Graf wrote:
>
> On 03.03.2010, at 20:12, Joerg Roedel wrote:
>
> > This patch implements the reporting of the nested paging
> > feature support to userspace.
> >
> > Signed-off-by: Joerg Roedel <joerg....@amd.com>
> > ---
> > arch/x86/kvm/svm.c | 10 ++++++++++
> > 1 files changed, 10 insertions(+), 0 deletions(-)
> >
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index fe1398e..ce71023 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -3289,6 +3289,16 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
> >
> > static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
> > {
> > + switch (func) {
> > + case 0x8000000A:
> > + if (!npt_enabled)
> > + break;
>
> if (!nested)
> break;

True, but shouldn't matter much because if nested is off the guest will
not see the svm bit. It would only see that the processor could do
nested paging if it had svm support ;-)

Joerg

Marcelo Tosatti, Mar 4, 2010, 9:50:02 AM
On Wed, Mar 03, 2010 at 08:12:03PM +0100, Joerg Roedel wrote:
> Hi,
>
> here are the patches that implement nested paging support for nested
> svm. They are somewhat intrusive to the soft-mmu so I post them as RFC
> in the first round to get feedback about the general direction of the
> changes. Nevertheless I am proud to report that with these patches the
> famous kernel-compile benchmark runs only 4% slower in the l2 guest as
> in the l1 guest when l2 is single-processor. With SMP guests the
> situation is very different. The more vcpus the guest has the more is
> the performance drop from l1 to l2.
> Anyway, this post is to get feedback about the overall concept of these
> patches. Please review and give feedback :-)

Joerg,

What perf gain does this bring? (I'm not aware of the current
overhead.)

Overall comments:

Can't you translate l2_gpa -> l1_gpa walking the current l1 nested
pagetable, and pass that to the kvm tdp fault path (with the correct
context setup)?

You probably need to include a flag in base_role to differentiate
between l1 / l2 shadow tables (say if they use the same cr3 value).

Joerg Roedel, Mar 4, 2010, 11:00:01 AM
On Thu, Mar 04, 2010 at 11:42:55AM -0300, Marcelo Tosatti wrote:
> On Wed, Mar 03, 2010 at 08:12:03PM +0100, Joerg Roedel wrote:
> > Hi,
> >
> > here are the patches that implement nested paging support for nested
> > svm. They are somewhat intrusive to the soft-mmu so I post them as RFC
> > in the first round to get feedback about the general direction of the
> > changes. Nevertheless I am proud to report that with these patches the
> > famous kernel-compile benchmark runs only 4% slower in the l2 guest as
> > in the l1 guest when l2 is single-processor. With SMP guests the
> > situation is very different. The more vcpus the guest has the more is
> > the performance drop from l1 to l2.
> > Anyway, this post is to get feedback about the overall concept of these
> > patches. Please review and give feedback :-)
>
> Joerg,
>
> What perf gain does this bring ? (i'm not aware of the current
> overhead).

The benchmark was an allnoconfig kernel compile in tmpfs. With the same
guest image it took:

as l1-guest with npt:

2m23s

as l2-guest with l1(nested)-l2(shadow):

around 8-9 minutes

as l2-guest with l1(nested)-l2(shadow) without the recent msrpm
optimization:

around 19 minutes

as l2-guest with l1(nested)-l2(nested) [this patchset]:

2m25s-2m30s

> Overall comments:
>
> Can't you translate l2_gpa -> l1_gpa walking the current l1 nested
> pagetable, and pass that to the kvm tdp fault path (with the correct
> context setup)?

If I understand your suggestion correctly, I think that's exactly what's
done in the patches. Some words about the design:

For nested-nested we need to shadow the l1-nested-ptable on the host.
This is done using the vcpu->arch.mmu context which holds the l1 paging
modes while the l2 is running. On an npt-fault from the l2 we just
instrument the shadow-ptable code. This is the common case because it
happens all the time while the l2 is running.

The other thing is that vcpu->arch.mmu.gva_to_gpa is expected to still
work and translate virtual addresses of the l2 into physical addresses
of the l1 (so it can be accessed with kvm functions).

To do this we need to be aware of the L2 paging mode. It is stored in
vcpu->arch.nested_mmu context. This context is only used for gva_to_gpa
translations. It is not used to build shadow page tables or anything
else. That's the reason only the parts necessary for gva_to_gpa
translations of the nested_mmu context are initialized.

We can not use mmu.gva_to_gpa to translate between l2_gpa and l1_gpa,
because other parts of kvm require this function to translate l2_gva to
l1_gpa. The function which does this translation is therefore moved to
nested_mmu.gva_to_gpa. So basically the gva_to_gpa function pointers
are swapped between mmu and nested_mmu.

The nested_mmu.gva_to_gpa function is used in translate_gpa_nested which
is assigned to the newly introduced translate_gpa callback of nested_mmu
context.

This callback is used in the walk_addr function to translate every
l2_gpa address we read from cr3 or the guest ptes into l1_gpa to read
the next step from the guest memory.

In the old non-nested case the translate_gpa callback points to a
function which just returns the gpa passed to it unmodified. The
walk_addr function is generalized and now there are basically two
versions of it:

* walk_addr which translates using the vcpu->arch.mmu context
* walk_addr_nested which translates using the vcpu->arch.nested_mmu
  context

That's pretty much how these patches work.
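
To make that concrete, here is a minimal sketch (not the actual patch
code; read_guest_pte is a made-up helper for illustration) of what one
step of the generalized walker looks like once every table access goes
through the translate_gpa callback:

/* Sketch only: error handling and access checks omitted. */
static gpa_t read_guest_pte(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
                            gfn_t table_gfn, unsigned index,
                            pt_element_t *pte)
{
        u32 error;
        gpa_t pte_gpa = gfn_to_gpa(table_gfn) + index * sizeof(pt_element_t);

        /*
         * For walk_addr_nested this walks the l1 nested page table and
         * turns an l2_gpa into an l1_gpa; for the plain walk_addr the
         * callback just returns the address unmodified.
         */
        pte_gpa = mmu->translate_gpa(vcpu, pte_gpa, &error);
        if (pte_gpa == UNMAPPED_GVA)
                return UNMAPPED_GVA;

        /* the pte can now be read from l1 guest memory */
        kvm_read_guest(vcpu->kvm, pte_gpa, pte, sizeof(*pte));

        return pte_gpa;
}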

> You probably need to include a flag in base_role to differentiate
> between l1 / l2 shadow tables (say if they use the same cr3 value).

Not sure if this is necessary. It may be necessary when large pages come
into play. Otherwise the host npt pages are distinguished from the shadow
npt pages by the direct-flag.

Joerg

Avi Kivity, Mar 8, 2010, 4:20:02 AM
On 03/03/2010 09:12 PM, Joerg Roedel wrote:
> This patch changes the tdp_enabled flag from its global
> meaning to the mmu-context. This is necessary for Nested SVM
> with emulation of Nested Paging where we need an extra MMU
> context to shadow the Nested Nested Page Table.
>
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index ec891a2..e7bef19 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -254,6 +254,7 @@ struct kvm_mmu {
> int root_level;
> int shadow_root_level;
> union kvm_mmu_page_role base_role;
> + bool tdp_enabled;
>
>

This needs a different name, since the old one is still around. Perhaps
we could call it parent_mmu and make it a kvm_mmu pointer.

--
error compiling committee.c: too many arguments to function

Avi Kivity, Mar 8, 2010, 4:40:03 AM
On 03/03/2010 09:12 PM, Joerg Roedel wrote:
> This patch introduces a mmu-callback to translate gpa
> addresses in the walk_addr code. This is later used to
> translate l2_gpa addresses into l1_gpa addresses.
>
> Signed-off-by: Joerg Roedel<joerg....@amd.com>
> ---
> arch/x86/include/asm/kvm_host.h | 1 +
> arch/x86/kvm/mmu.c | 7 +++++++
> arch/x86/kvm/paging_tmpl.h | 19 +++++++++++++++++++
> include/linux/kvm_host.h | 5 +++++
> 4 files changed, 32 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index c0b5576..76c8b5f 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -250,6 +250,7 @@ struct kvm_mmu {
> void (*free)(struct kvm_vcpu *vcpu);
> gpa_t (*gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t gva, u32 access,
> u32 *error);
> + gpa_t (*translate_gpa)(struct kvm_vcpu *vcpu, gpa_t gpa, u32 *error);
> void (*prefetch_page)(struct kvm_vcpu *vcpu,
> struct kvm_mmu_page *page);
> int (*sync_page)(struct kvm_vcpu *vcpu,
>

I think placing this here means we will miss a few translations, namely
when we do a physical access (say, reading PDPTEs or similar).

We need to do this on the level of kvm_read_guest() so we capture
physical accesses:

kvm_read_guest_virt
-> walk_addr
-> kvm_read_guest_tdp
-> kvm_read_guest_virt
-> walk_addr
-> kvm_read_guest_tdp
-> kvm_read_guest

Of course, not all accesses will use kvm_read_guest_tdp; for example
kvmclock accesses should still go untranslated.
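
Something along these lines (a rough sketch only; kvm_read_guest_tdp is
a hypothetical name, not an existing function):

/* Rough sketch: translate l2_gpa -> l1_gpa before touching guest
 * memory, but only when a nested context is active. */
int kvm_read_guest_tdp(struct kvm_vcpu *vcpu, gpa_t gpa,
                       void *data, unsigned long len)
{
        u32 error;

        if (vcpu->arch.mmu.nested) {
                gpa = vcpu->arch.nested_mmu.gva_to_gpa(vcpu, gpa,
                                                       PFERR_USER_MASK |
                                                       PFERR_WRITE_MASK,
                                                       &error);
                if (gpa == UNMAPPED_GVA)
                        return -EFAULT;
        }

        return kvm_read_guest(vcpu->kvm, gpa, data, len);
}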

--
error compiling committee.c: too many arguments to function


Avi Kivity, Mar 8, 2010, 4:50:03 AM
On 03/03/2010 09:12 PM, Joerg Roedel wrote:

Okay, this looks excellent overall, it's nice to see how well this fits
with the existing mmu infrastructure (only ~300 lines added). The
performance results are impressive.

--
error compiling committee.c: too many arguments to function


Joerg Roedel, Mar 10, 2010, 9:50:02 AM
On Mon, Mar 08, 2010 at 11:17:41AM +0200, Avi Kivity wrote:
> On 03/03/2010 09:12 PM, Joerg Roedel wrote:
> >This patch changes the tdp_enabled flag from its global
> >meaning to the mmu-context. This is necessary for Nested SVM
> >with emulation of Nested Paging where we need an extra MMU
> >context to shadow the Nested Nested Page Table.
> >
> >
> >diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >index ec891a2..e7bef19 100644
> >--- a/arch/x86/include/asm/kvm_host.h
> >+++ b/arch/x86/include/asm/kvm_host.h
> >@@ -254,6 +254,7 @@ struct kvm_mmu {
> > int root_level;
> > int shadow_root_level;
> > union kvm_mmu_page_role base_role;
> >+ bool tdp_enabled;
> >
>
> This needs a different name, since the old one is still around.
> Perhaps we could call it parent_mmu and make it a kvm_mmu pointer.

Hmm, how about renaming the global tdp_enabled variable to tdp_usable?
The global variable indicates if tdp is _usable_ and we can _enable_ it
for a mmu context.

Joerg

Joerg Roedel, Mar 10, 2010, 9:50:02 AM

Ok, doing the translation in kvm_read_guest is certainly the more
generic approach. I already fixed a bug related to loading the l2
pdptrs. Doing the translation in kvm_read_guest makes the code a lot
nicer.

Joerg

Avi Kivity, Mar 10, 2010, 10:00:02 AM
On 03/10/2010 04:44 PM, Joerg Roedel wrote:
> On Mon, Mar 08, 2010 at 11:17:41AM +0200, Avi Kivity wrote:
>
>> On 03/03/2010 09:12 PM, Joerg Roedel wrote:
>>
>>> This patch changes the tdp_enabled flag from its global
>>> meaning to the mmu-context. This is necessary for Nested SVM
>>> with emulation of Nested Paging where we need an extra MMU
>>> context to shadow the Nested Nested Page Table.
>>>
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index ec891a2..e7bef19 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -254,6 +254,7 @@ struct kvm_mmu {
>>> int root_level;
>>> int shadow_root_level;
>>> union kvm_mmu_page_role base_role;
>>> + bool tdp_enabled;
>>>
>>>
>> This needs a different name, since the old one is still around.
>> Perhaps we could call it parent_mmu and make it a kvm_mmu pointer.
>>
> Hmm, how about renaming the global tdp_enabled variable to tdp_usable?
> The global variable indicates if tdp is _usable_ and we can _enable_ it
> for a mmu context.
>

I think of the global flags as host tdp, and the mmu as guest tdp (but
maybe this is wrong?). If that makes sense, the naming should reflect that.

--
error compiling committee.c: too many arguments to function


Joerg Roedel, Mar 10, 2010, 10:30:03 AM

The basic flow of the mmu state with npt-npt is:

1. As long as the L1 is running the arch.mmu context is in tdp
mode and builds a direct-mapped page table.

2. When vmrun is emulated and the nested vmcb enables nested
paging, arch.mmu is switched to a shadow-mmu mode which now
shadows the l1 nested page table.
So when the l2-guest runs with nested paging the
arch.mmu.tdp_enabled variable on the host is false.

3. On a vmexit emulation the mmu is switched back to tdp
handling state.

So the mmu.tdp_enabled parameter is about tdp being enabled for the
mmu context (so mmu.tdp_enabled means that we build an l1-direct-mapped
page table when true or shadow an l1-page-table when false). That's why I
think the 'tdp_enabled' name makes sense in the mmu-context.
The global flag only shows if an mmu-context could be in tdp-state. So
tdp_usable may be a good name for it.
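
For illustration, a minimal sketch of the switch in steps 2 and 3 above
(the nested_svm_* names are assumptions for this sketch, not
necessarily what the final patches will use):

/* Sketch only: switch the mmu around VMRUN / #VMEXIT emulation. */
static void nested_svm_init_mmu_context(struct kvm_vcpu *vcpu)
{
        /* step 2: shadow the l1 nested page table while l2 runs */
        kvm_init_shadow_mmu(vcpu, &vcpu->arch.mmu);

        vcpu->arch.mmu.set_cr3           = nested_svm_set_tdp_cr3;
        vcpu->arch.mmu.get_cr3           = nested_svm_get_tdp_cr3;
        vcpu->arch.mmu.inject_page_fault = nested_svm_inject_npf_exit;
        vcpu->arch.mmu.shadow_root_level = kvm_x86_ops->get_tdp_level();

        /* init_kvm_mmu() will now also set up arch.nested_mmu for gva_to_gpa */
        vcpu->arch.mmu.nested = true;
}

static void nested_svm_uninit_mmu_context(struct kvm_vcpu *vcpu)
{
        /* step 3: back to ordinary tdp handling after #VMEXIT */
        vcpu->arch.mmu.nested = false;
        kvm_mmu_reset_context(vcpu);
}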

Joerg

Avi Kivity, Mar 11, 2010, 1:50:01 AM

tdp is still used in both cases, so that name is confusing. We could
call it mmu.direct_map (and set it for real mode?) or mmu.virtual_map
(with the opposite sense). Or something.

--
error compiling committee.c: too many arguments to function


Joerg Roedel, Mar 11, 2010, 5:40:01 AM
On Thu, Mar 11, 2010 at 08:47:21AM +0200, Avi Kivity wrote:
>
> tdp is still used in both cases, so that name is confusing. We
> could call it mmu.direct_map (and set it for real mode?) or
> mmu.virtual_map (with the opposite sense). Or something.

I like the mmu.direct_map name. Its a good term too, I will change it in
the patch.

Joerg

Marcelo Tosatti, Mar 11, 2010, 4:10:02 PM

OK, makes sense now, I was missing the fact that the l1-nested-ptable
needs to be shadowed and l1 translations to it must be write protected.

You should disable out of sync shadow so that l1 guest writes to
l1-nested-ptables always trap. And in the trap case, you'd have to
invalidate l2 shadow pagetable entries that used the (now obsolete)
l1-nested-ptable entry. Does that happen automatically?

Avi Kivity, Mar 12, 2010, 2:40:02 AM
On 03/11/2010 10:58 PM, Marcelo Tosatti wrote:
>
>>> Can't you translate l2_gpa -> l1_gpa walking the current l1 nested
>>> pagetable, and pass that to the kvm tdp fault path (with the correct
>>> context setup)?
>>>
>> If I understand your suggestion correctly, I think thats exactly whats
>> done in the patches. Some words about the design:
>>
>> For nested-nested we need to shadow the l1-nested-ptable on the host.
>> This is done using the vcpu->arch.mmu context which holds the l1 paging
>> modes while the l2 is running. On a npt-fault from the l2 we just
>> instrument the shadow-ptable code. This is the common case. because it
>> happens all the time while the l2 is running.
>>
> OK, makes sense now, I was missing the fact that the l1-nested-ptable
> needs to be shadowed and l1 translations to it must be write protected.
>

Shadow converts (gva -> gpa -> hpa) to (gva -> hpa) or (ngpa -> gpa ->
hpa) to (ngpa -> hpa) equally well. In the second case npt still does
(ngva -> ngpa).

> You should disable out of sync shadow so that l1 guest writes to
> l1-nested-ptables always trap.

Why? The guest is under obligation to flush the tlb if it writes to a
page table, and we will resync on that tlb flush.

Unsync makes just as much sense for nnpt. Think of khugepaged in the
guest eating a page table and spitting out a PDE.

> And in the trap case, you'd have to
> invalidate l2 shadow pagetable entries that used the (now obsolete)
> l1-nested-ptable entry. Does that happen automatically?
>

What do you mean by 'l2 shadow ptable entries'? There are the guest's
page tables (ordinary direct mapped, unless the guest's guest is also
running an npt-enabled hypervisor), and the host page tables. When the
guest writes to each page table, we invalidate the shadows.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

Avi Kivity, Mar 12, 2010, 2:50:01 AM
On 03/04/2010 05:58 PM, Joerg Roedel wrote:
>> You probably need to include a flag in base_role to differentiate
>> between l1 / l2 shadow tables (say if they use the same cr3 value).
>>
> Not sure if this is necessary. It may be necessary when large pages come
> into play. Otherwise the host npt pages are distinguished by the shadow
> npt pages by the direct-flag.
>

Hm, I think that direct maps for the same gfn can be shared.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


Marcelo Tosatti, Mar 15, 2010, 2:30:02 AM
On Fri, Mar 12, 2010 at 09:36:41AM +0200, Avi Kivity wrote:
> On 03/11/2010 10:58 PM, Marcelo Tosatti wrote:
> >
> >>>Can't you translate l2_gpa -> l1_gpa walking the current l1 nested
> >>>pagetable, and pass that to the kvm tdp fault path (with the correct
> >>>context setup)?
> >>If I understand your suggestion correctly, I think thats exactly whats
> >>done in the patches. Some words about the design:
> >>
> >>For nested-nested we need to shadow the l1-nested-ptable on the host.
> >>This is done using the vcpu->arch.mmu context which holds the l1 paging
> >>modes while the l2 is running. On a npt-fault from the l2 we just
> >>instrument the shadow-ptable code. This is the common case. because it
> >>happens all the time while the l2 is running.
> >OK, makes sense now, I was missing the fact that the l1-nested-ptable
> >needs to be shadowed and l1 translations to it must be write protected.
>
> Shadow converts (gva -> gpa -> hpa) to (gva -> hpa) or (ngpa -> gpa
> -> hpa) to (ngpa -> hpa) equally well. In the second case npt still
> does (ngva -> ngpa).
>
> >You should disable out of sync shadow so that l1 guest writes to
> >l1-nested-ptables always trap.
>
> Why? The guest is under obligation to flush the tlb if it writes to
> a page table, and we will resync on that tlb flush.

The guest's hypervisor will not flush the tlb with invlpg for updates of
its NPT pagetables. It'll create a new ASID, and KVM will not trap
that.

> Unsync makes just as much sense for nnpt. Think of khugepaged in
> the guest eating a page table and spitting out a PDE.
>
> >And in the trap case, you'd have to
> >invalidate l2 shadow pagetable entries that used the (now obsolete)
> >l1-nested-ptable entry. Does that happen automatically?
>
> What do you mean by 'l2 shadow ptable entries'? There are the
> guest's page tables (ordinary direct mapped, unless the guest's
> guest is also running an npt-enabled hypervisor), and the host page
> tables. When the guest writes to each page table, we invalidate the
> shadows.

With 'l2 shadow ptable entries' I mean the shadow pagetables that
translate GPA-L2 -> HPA.

Avi Kivity, Mar 15, 2010, 3:40:02 AM
On 03/15/2010 08:27 AM, Marcelo Tosatti wrote:
>>
>>> You should disable out of sync shadow so that l1 guest writes to
>>> l1-nested-ptables always trap.
>>>
>> Why? The guest is under obligation to flush the tlb if it writes to
>> a page table, and we will resync on that tlb flush.
>>
> The guests hypervisor will not flush the tlb with invlpg for updates of
> its NPT pagetables. It'll create a new ASID, and KVM will not trap
> that.
>

We'll get a kvm_set_cr3() on the next vmrun.

>>> And in the trap case, you'd have to
>>> invalidate l2 shadow pagetable entries that used the (now obsolete)
>>> l1-nested-ptable entry. Does that happen automatically?
>>>
>> What do you mean by 'l2 shadow ptable entries'? There are the
>> guest's page tables (ordinary direct mapped, unless the guest's
>> guest is also running an npt-enabled hypervisor), and the host page
>> tables. When the guest writes to each page table, we invalidate the
>> shadows.
>>
> With 'l2 shadow ptable entries' i mean the shadow pagetables that
> translate GPA-L2 -> HPA.
>

kvm_mmu_pte_write() will invalidate those sptes and will also install
new translations if possible.

Beautiful, isn't it?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

