WE HAVE A SHELL


gan...@gmail.com

Oct 28, 2015, 9:01:28 PM
to Akaros
This is in our Linux guest running on top of Akaros!
WORKING.png

Davide Libenzi

Oct 28, 2015, 9:11:38 PM
to aka...@googlegroups.com
That's great Gan!

On Oct 28, 2015, at 18:01, gan...@gmail.com wrote:

This is in our linux guest running on top of akaros!

--
You received this message because you are subscribed to the Google Groups "Akaros" group.
To unsubscribe from this group and stop receiving emails from it, send an email to akaros+un...@googlegroups.com.
To post to this group, send email to aka...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
<WORKING.png>

ron minnich

Oct 28, 2015, 9:28:18 PM
to aka...@googlegroups.com
Some stats.

The VMM in this case is 200 lines. The libraries are 2600 lines. The kernel code specialized to VMs is 2154 lines. All reported by sloccount.

This is not a lot of code and we're going to shrink it.

It's also pretty portable and I'm going to propose to the Harvey guys that they take a look. But it does require
some changes to the kernel. Part of the reason this is so compact is that some key features for VMs are integrated
into Akaros, such as the way we changed the page tables so page table pages are 8k, not 4k, directly coupling EPT with page tables. You don't turn a change this intrusive off. It's wired in. But it allows you to add VM support in a very compact manner. 

Also, we require processors with Virtual APIC support. So, there's lots of hardware this doesn't run on, but as time goes on, that's going to be less of a problem.

ron

Davide Libenzi

Oct 29, 2015, 12:29:50 AM
to aka...@googlegroups.com
How do you get 8KB pages?
Or are you using 4KB page pairs and assuming every system page always has a sibling allocated, because it might be handed over to the VMM?

ron minnich

Oct 29, 2015, 12:48:10 AM
to aka...@googlegroups.com
On Wed, Oct 28, 2015 at 9:29 PM 'Davide Libenzi' via Akaros <aka...@googlegroups.com> wrote:
How do you get 8KB pages?
Or are you using 4KB page pairs and assuming every system page always has a sibling allocated, because it might be handed over to the VMM?

Bingo.

Page table pages in Akaros are now 8KB. They are aligned to 8KB boundaries.

The low ("even") 4KB is the page table page for the host. The odd page is for the EPT. So the EPT and the host page table have almost the same root, separated by 4KB.

If you get a fault, say, in the VM, you walk the host page tables, and once you have the PTE, you add 4KB to it to get the corresponding EPTE.

One lock for both tables, so walking is easier. It's easy, given a PTE, to set the EPTE. There are lots of things that just get simpler. And, finally, it's easy to set up the VM address space as a subset of the host process address space which manages the VM. All kinds of stuff just falls into place.
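The layout described above can be sketched in user-space C. This is a toy model under assumed names: `alloc_pt_page_pair` and the typedefs are invented for illustration, and `posix_memalign` stands in for the kernel's order-1 page allocator. Only `kpte_to_epte` reflects the real scheme discussed in this thread.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PGSIZE 4096

/* Toy model: a page table page is an 8KB-aligned pair. The low ("even")
 * 4KB holds the host KPTEs; the high ("odd") 4KB holds the EPTEs at the
 * same offsets. */
typedef uint64_t kpte_t;
typedef uint64_t epte_t;

/* Allocate one 8KB-aligned pair; posix_memalign stands in for the
 * kernel's order-1 page allocator. */
static kpte_t *alloc_pt_page_pair(void)
{
    void *p = NULL;

    if (posix_memalign(&p, 2 * PGSIZE, 2 * PGSIZE))
        return NULL;
    memset(p, 0, 2 * PGSIZE);
    return (kpte_t *)p;
}

/* Given a KPTE in the even page, the sibling EPTE is a constant 4KB away. */
static epte_t *kpte_to_epte(kpte_t *kpte)
{
    return (epte_t *)((uintptr_t)kpte + PGSIZE);
}
```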

ron

Davide Libenzi

Oct 29, 2015, 12:55:29 AM
to aka...@googlegroups.com
This assumes that we have 50% of memory always reserved for the VMM, though.
We can have the same efficiency if we store the EPTE sibling in the struct page (or whatever it is called in Akaros).
So you walk the PTEs, you get the host PFN, you index the struct page array, and you get the sibling EPTE.
This way there is no a priori reservation for the VMM.
No?

ron minnich

Oct 29, 2015, 1:00:49 AM
to aka...@googlegroups.com
On Wed, Oct 28, 2015 at 9:55 PM 'Davide Libenzi' via Akaros <aka...@googlegroups.com> wrote:
This assumes that we have 50% of memory always reserved for the VMM though. 

I don't see how that follows.

but maybe we can sketch it out tomorrow

ron

Davide Libenzi

Oct 29, 2015, 1:05:05 AM
to aka...@googlegroups.com
Tomorrow I will be at platforms labs tour from 1000 to 1200. Kevin will be there as well.
So I'll see you after lunch ...

Davide Libenzi

Oct 29, 2015, 6:35:02 AM
to Akaros
Given that I am awfully early 😑, let me explain better.
If I understood correctly, we need to have 8KB pages, and the host will use, say, the lower 4KB while the VMM will use the upper.
Right?
This is because you want a fast mapping from a host page to its EPT page, right?
My 50% was based on the observation that in order for a pair to be available, the host should never allocate the 4KB part which is reserved for the VMM.

The forced 8KB page size is a bit weird, but I think, if I understood correctly what you are trying to achieve, we can get there without the 8KB page constraint.
Akaros (like pretty much every VM implementation I am aware of) has something like:

struct page {
    ...
};
struct page pages[MAX_PHYS_PAGES];

Now we can have:

struct page {
    ...
#ifdef CONFIG_VMX
    uintptr_t paired_pfn; // Or: struct page *paired_page
#endif
};

So say you walked your host page table and found the host PFN (Page Frame Number); to get the EPT PFN you would simply do:

pages[PFN].paired_pfn

Vice versa would be the same.
Of course, helper inline APIs would have to be added to pmap.h, but the impact is much smaller.
At that point you can have (modulo error checking):

#ifdef CONFIG_VMX

void alloc_page_pair(struct page **pages)
{
    kpage_alloc(&pages[0]);
    kpage_alloc(&pages[1]);
    pages[0]->paired_pfn = page2ppn(pages[1]);
    pages[1]->paired_pfn = page2ppn(pages[0]);
}

#else
...
#endif

But you can also have an alloc_cont_pages_pair() which will be able to allocate contiguous phys page sets on both host and VM.
No?
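The proposal above can be fleshed out as a runnable toy. The struct page array, the bump allocator, and the helper names below are illustrative stand-ins, not actual Akaros pmap code:

```c
#include <stdint.h>

#define MAX_PHYS_PAGES 64

/* Toy model of the pairing scheme: pages[] mirrors physical memory, and
 * each page used for a KPT/EPT pair records its sibling's PFN. */
struct page {
    uintptr_t paired_pfn;
};

static struct page pages[MAX_PHYS_PAGES];
static int next_free_pfn;   /* trivial bump allocator stand-in, no bounds check */

static uintptr_t toy_page_alloc(void)
{
    return (uintptr_t)next_free_pfn++;
}

/* Pair two (not necessarily contiguous) pages for KPT/EPT use. */
static void alloc_page_pair(uintptr_t pfns[2])
{
    pfns[0] = toy_page_alloc();
    pfns[1] = toy_page_alloc();
    pages[pfns[0]].paired_pfn = pfns[1];
    pages[pfns[1]].paired_pfn = pfns[0];
}

/* The lookup described above: host PFN -> sibling EPT PFN (and back). */
static uintptr_t sibling_pfn(uintptr_t pfn)
{
    return pages[pfn].paired_pfn;
}
```

Note the two pages in a pair need not be contiguous or specially aligned; the cost is one extra pointer-sized field per struct page.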


Barret Rhoden

Oct 29, 2015, 10:07:04 AM
to aka...@googlegroups.com
On 2015-10-28 at 18:01 gan...@gmail.com wrote:
> This is in our linux guest running on top of akaros!

Awesome!

Next up, compile tests/draw_nanwan.c for Linux and run it from within
the VM! =)

Barret Rhoden

Oct 29, 2015, 10:37:08 AM
to aka...@googlegroups.com
On 2015-10-29 at 03:35 "'Davide Libenzi' via Akaros"
<aka...@googlegroups.com> wrote:
> Given that I am awfully early 😑 , let me explain better.
> If I understood correctly, we need to have 8KB pages, and the host
> will use, say, that lower 4KB while the VMM will user the upper.
> Right?
> This because you want to have a fast mapping from host page to EPT
> page, right?

There's a little more to it than that.

The main thing is that there is an invariant: the address space
exposed to a process is the same whether it is in Ring 3 or in "Ring
V" (Ring 0 of VMX non-root operation, i.e. the guest VM's OS).

In x86, this invariant translates to: the normal page table (called the
KPT, for kernel page table) and the EPT have identical mappings for
(almost) all of the user's address space. The KPT and EPT are just
different windows into the same Process Virtual -> Host Physical
mapping. In the case of the EPT, "Process Virtual == Guest Physical".

That is the essence of how page tables and virtual machines work in
Akaros.
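The invariant can be sketched as a toy model (illustrative flat arrays, not real multi-level page tables): the KPT maps process-virtual to host-physical, the EPT maps guest-physical to host-physical, and since process virtual == guest physical, every user mapping appears identically in both.

```c
#include <stdint.h>

#define NPAGES 16

/* Toy model of the invariant: both "tables" are flat arrays indexed by
 * page number, mapping to a host physical page number (0 = unmapped). */
static uintptr_t kpt[NPAGES];   /* process virtual page -> host phys page */
static uintptr_t ept[NPAGES];   /* guest physical page -> host phys page */

/* Installing a user mapping updates both views at once, so the process
 * sees the same address space in Ring 3 and in "Ring V". */
static void map_user_page(uintptr_t vpn, uintptr_t host_ppn)
{
    kpt[vpn] = host_ppn;    /* visible to the process in Ring 3 */
    ept[vpn] = host_ppn;    /* visible to the guest: gpn == vpn */
}

/* Check that the two windows agree on every mapping. */
static int invariant_holds(void)
{
    for (uintptr_t vpn = 0; vpn < NPAGES; vpn++)
        if (kpt[vpn] != ept[vpn])
            return 0;
    return 1;
}
```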

The x86-specific bit is that there is an EPT at all. Architectures
that are designed for virtualization should use the same format for the
KPT and the EPT, such that only the KPT is needed. x86 isn't like
that, but the arch-independent parts of Akaros won't cater to x86.

> My 50% was based on the observation that in order for a pair to be
> available, the host should never allocate the 4KB part which is
> reserved for the VMM.

In general, we just do an "order 1" allocation for a page table page on
x86 (2 contiguous pages). That's a little more stress on the page
allocator. So yes, every page table page for every process on x86
costs 8KB instead of 4KB. We're okay with that.


> The 8KB forced page size is a bit weird, but I think, if I understood
> correctly what you are trying to achieve, we can get there w/out 8KB
> pages constraint.
> Akaros (like pretty much every VM implementation I am aware of), has
> something like:
>
> struct page {
> ...
> };
> struct page pages[MAX_PHYS_PAGES];
>
> Now we can have:
>
> struct page {
> ...
> #ifdef CONFIG_VMX
> uintptr_t paired_pfn; // Or: struct page *paired_page
> #endif

As a side note, that would need to be CONFIG_X86, since the need for an
EPT is an x86 thing. Additionally, we won't have CONFIG_VMX, since we
want our VMM support to be always on for Akaros, compared to the "bag
on the side" approach.



> };
>
> So say you walked you host page table and you found host PFN (Page
> Frame Number), to get EPT PFN you would simply:
>
> pages[PFN].paired_pfn

That would work, but it's a lot less convenient than simply adding a
constant to get from a KPTE to an EPTE:

static inline epte_t *kpte_to_epte(kpte_t *kpte)
{
    return (epte_t*)(((uintptr_t)kpte) + PGSIZE);
}

No dereferences or anything, and it's super simple.

Also, I'm reluctant to add things to the page struct. Adding 8 bytes
for the uintptr_t is a tax of 8 bytes / page. Adding an extra page to
a page table is 4KB per page table. The ratio in cost there is 512:1.
What's the ratio of page table pages to regular pages? A fully
populated PML1 has 512 entries, but there are the intermediate
PML4-2s. Then again, there are the jumbo PTEs for the KERNBASE
mapping. Anyway, the memory savings isn't clear.
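A rough back-of-the-envelope for that ratio, ignoring the PML4-2 pages and jumbo mappings (illustrative arithmetic only, not Akaros code):

```c
#include <stdint.h>

#define KB (1024ULL)
#define MB (1024 * KB)
#define GB (1024 * MB)
#define PGSIZE (4 * KB)

/* Scheme A: 8KB page table pages. Extra cost = one 4KB sibling per
 * PML1, and one PML1 covers 512 * 4KB = 2MB of mappings. */
static uint64_t paired_pt_overhead(uint64_t mapped_bytes)
{
    uint64_t pml1s = mapped_bytes / (512 * PGSIZE);
    return pml1s * PGSIZE;
}

/* Scheme B: 8 bytes of paired_pfn in struct page, paid for every
 * physical page frame, whether or not it backs a page table. */
static uint64_t struct_page_overhead(uint64_t phys_bytes)
{
    return (phys_bytes / PGSIZE) * 8;
}
```

Fully mapping 1GB with 4KB pages needs 512 PML1s, so the pairing costs 2MB; 8 bytes per frame on a 1GB machine is also 2MB. The two overheads scale with different quantities (mapped memory per process vs. total physical memory), which is one way to see why the savings aren't clear.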

> void alloc_page_pair(struct page **pages)
> {
> kpage_alloc(&pages[0]);
> kpage_alloc(&pages[1]);

Agreed that the nice thing about this is you don't need contiguous
pages.

Barret

Davide Libenzi

Oct 29, 2015, 10:56:06 AM
to Akaros
On Thu, Oct 29, 2015 at 7:37 AM, Barret Rhoden <br...@cs.berkeley.edu> wrote:
On 2015-10-29 at 03:35 "'Davide Libenzi' via Akaros"
> My 50% was based on the observation that in order for a pair to be

> available, the host should never allocate the 4KB part which is
> reserved for the VMM.

In general, we just do an "order 1" allocation for a page table from
x86 (2 contig pages).  That's a little more stress on the page
allocator.  So yes, every page table page for every process in x86
costs 8KB instead of 4KB.  We're okay with that.

We are OK with a 50% tax on non-VM pages?

> #ifdef CONFIG_VMX
>     uintptr_t paired_pfn; // Or: struct page *paired_page
> #endif

As a side note, that would need to be CONFIG_X86, since the need for an
EPT is an x86 thing.  Additionally, we won't have CONFIG_VMX, since we
want our VMM support to be always on for Akaros, compared to the "bag
on the side" approach.

Yeah, that code was not meant to be a 1:1 merge. I was just showing that such a tax could be compiled out for anybody who doesn't care about VMMs.


> So say you walked you host page table and you found host PFN (Page
> Frame Number), to get EPT PFN you would simply:
>
> pages[PFN].paired_pfn

that would work, but it's a lot less convenient than simply adding a
constant to get from a KPTE to an EPTE

static inline epte_t *kpte_to_epte(kpte_t *kpte)
{
    return (epte_t*)(((uintptr_t)kpte) + PGSIZE);
}

No dereferences or anything, and it's super simple.

I agree that it's faster, but pages[PFN].paired_pfn is no slouch either: it can be compiled out by config, and it does not impose a 50% tax on non-VM memory.


Also, I'm reluctant to add things to the page struct.  Adding 8 bytes
for the uintptr_t is a tax of 8 bytes / page.  Adding an extra page to
a page table is 4KB per page table.  The ratio in cost there is 512:1.
What's the ratio of page table pages to regular pages?  A fully
populated PML1 has 512 entries, but there are the intermediate
PML4-2s.  Then again, there are the jumbo PTEs for the KERNBASE
mapping.  Anyway, the memory savings isn't clear.

It's 1/512 on all memory (assuming it's compiled in; 0 if compiled out), vs. 1/2 of non-VM memory.
For machines where the Linux VMs are there as simple SRE agents, the percentage of non-VM memory is going to be pretty big.
We could have VM zones, where we do the 8KB trick, and zones where we don't, but IMHO the pairing is just simpler on the overall infrastructure.

ron minnich

Oct 29, 2015, 11:11:26 AM
to aka...@googlegroups.com
Let's talk when I'm back from travel and sickness. I'm in the cough-up-a-lung stage of this cold, and I doubt you want me to pass it on ;-)

Your idea would work fine IF the PTE formats were compatible between the EPT and the host. But the EPT format is completely incompatible (it's Itanium PTEs, evidently), so the guest needs an entirely parallel EPT page table structure, starting at the root, alongside the host page table structure (AMD did NOT do this with their nested page tables ... good on them!).

Anyway, more when we can talk in person. I've done so much typing the last few days that I'm slow today :-)

ron

ron minnich

Oct 29, 2015, 11:19:10 AM
to Akaros
So we currently pay a 4KB cost for every 2MB of memory a process uses, so yes, 1/512. I'm OK with a 50% tax on "page table pages" for the simple reason that I want to get to 2MB pages as soon as I can, and at that point the extra 4KB cost per 1GB is insignificant, at least to me.

In general, trading memory for simpler code has always been a reasonable trade in my view, since memory always grows. Further, if we can be smart and start using 2MB pages, it's just not going to matter.

In NIX we *only* had 2MB pages, and we had transparent 1GB pages, which to me is the desired end state. 4KB pages might have worked when I had a 4MB machine, but I'm aghast that we are still using them on a 512GB machine.
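The arithmetic here can be checked with a quick sketch (illustrative only, not Akaros code):

```c
#include <stdint.h>

#define PGSIZE 4096ULL
/* With 2MB pages, the leaf level is the PML2: one 4KB PML2 page holds
 * 512 entries, each mapping 2MB, so a single PML2 page covers 1GB. */
#define PML2_SPAN (512 * 2 * 1024 * 1024ULL)

/* The pairing tax with 2MB pages: one extra 4KB sibling page per 1GB
 * of mappings. */
static uint64_t pairing_tax_2mb(uint64_t mapped_bytes)
{
    return (mapped_bytes / PML2_SPAN) * PGSIZE;
}
```

So a process mapping 1GB with 2MB pages pays one extra 4KB page, and even 512GB of mappings costs only 2MB of siblings.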

ron

Barret Rhoden

Oct 29, 2015, 1:29:57 PM
to aka...@googlegroups.com
On 2015-10-29 at 07:56 "'Davide Libenzi' via Akaros"
> > In general, we just do an "order 1" allocation for a page table from
> > x86 (2 contig pages). That's a little more stress on the page
> > allocator. So yes, every page table page for every process in x86
> > costs 8KB instead of 4KB. We're okay with that.
> >
>
> We are OK with a 50% tax on non VM pages?

I don't understand how it's a 50% tax for anything other than a page
table page. Maybe I'm missing something, or our current scheme is
unclear. Most page allocations (e.g. for user memory) will not need to
be an 8KB pair. It is only for the pages that are used as part of the
page table, e.g. the PML4 / env_pgdir, the set of PML3s, 2s, and 1s for
a process, etc.

Oh, when you say non-VM, VM probably == virtual machine, not virtual
memory. In akaros, we don't want to make a distinction between
processes that can be VMMs and those that choose not to. It is
possible for us to do the whole "bring in the EPT" steps only on the
first call to SYS_setup_vmm (or whatever), but we opted for having it
always on.

> > > #ifdef CONFIG_VMX
> > > uintptr_t paired_pfn; // Or: struct page *paired_page
> > > #endif
> >
> > As a side note, that would need to be CONFIG_X86, since the need
> > for an EPT is an x86 thing. Additionally, we won't have
> > CONFIG_VMX, since we want our VMM support to be always on for
> > Akaros, compared to the "bag on the side" approach.
> >
>
> Yeah, that code was not meant to be a 1:1 merge. I was just showing
> that such a tax could be compiled out for anybody who doesn't care
> about VMMs.

Agreed that it's not literally code to be merged, but I wanted to
express the idea that VMM support is something hardwired into the entire
system. Ideally the tax for that is small, but I didn't want to get
into an #ifdef VIRTUAL_MACHINE situation.

Barret

ron minnich

Oct 29, 2015, 1:34:07 PM
to aka...@googlegroups.com
We care about VMMs. And one very important target application of Akaros is effective support of VMs. Further, every config option just makes the code that much harder to reason about. The fewer build-time options, the better.

We're not planning to go into phones :-)

ron

Davide Libenzi

Oct 29, 2015, 3:46:56 PM
to Akaros
OK, maybe I have not understood how this 8KB page scheme plugs into Akaros.
Is *every* 4KB page allocated on the system taxed with its sibling 4KB, or only pages allocated for page table structures?

Davide Libenzi

Oct 29, 2015, 3:57:22 PM
to Akaros
Never mind, I found the culprit 😀

... we changed the page tables so page table pages are 8k, not 4k, ...

In that case it does not really matter, agreed.

ron minnich

Oct 29, 2015, 4:13:38 PM
to Akaros
I'm happy to hear you say that :-)

Thanks for pushing to the answer!

ron