MitchAlsup wrote:
> On Monday, April 26, 2021 at 5:02:25 PM UTC-5, EricP wrote:
>> I have two ways which eliminate the N^2 table walk cost.
>>
>> In my MMU the top 4 address bits are an index into a hardware table
>> whose entries specify the method addresses in that range are translated.
> <
> Could these bits be placed in the Root Pointer ?
>
> But I also got to thinking that my top level table only uses 7-bits of the
> virtual address, indexing only 128 entries where there are 512 doublewords
> present. I "could" convert accesses into this level page from PTP/PTE
> into a porthole descriptor using 2 of the additional doublewords, so we
> would then have a base/bounds values and a PTP/PTE to the next level
> in the table. The 3rd added doubleword would hold things like the ASID,
> various flags for the OS/HV to use, and other stuff. <see below>
> <
I think you might have slightly misunderstood me.
This is all about enhancing x64 CR3 so that it allows new mapping methods.
In a normal non-virtual-machine context there is one CR3.
In a hypervisor context there are two, one controlled by the guest OS,
and a nested CR3 controlled by HV.
Currently x64 has CR3 containing the physical address of the page table
root frame. The table has 4 levels covering a 48-bit virtual space.
In that root frame, PTEs [0:255] cover the lower 2^47 of the virtual
address range and are typically associated with the current process user
space, and PTEs [256:511] cover the upper 2^47 of the range and
are typically associated with the OS kernel.
There is nothing in the hardware that enforces those associations,
it is just convention.
Viewing addresses as unsigned integers, the above layout puts the
kernel "system space" at high addresses, the process "user space"
at the low addresses, and has a giant dead zone between them.
Each core in the SMP system has its own CR3 which points at its own
private page table root frame. All cores map OS system space in common,
so the root PTEs [256:511] are the same on every core.
However each core can have a different process mapped, so they
have different root PTEs [0:255].
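The two halves fall directly out of the root-index arithmetic: with a
48-bit VA, bits [47:39] select one of 512 root PTEs. A minimal C sketch
(function name is mine):

```c
#include <stdint.h>

/* With 4-level x64 paging and a 48-bit virtual address, bits [47:39]
 * select one of 512 root PTEs; each root PTE covers 2^39 bytes, so
 * entries 0..255 cover the lower 2^47 and 256..511 the upper 2^47. */
static unsigned root_index(uint64_t va)
{
    return (unsigned)((va >> 39) & 0x1FF);
}
```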
When the OS wants to map a new process so it can run its threads,
it copies up to 256 process PTEs from the process header into its
private page table root frame. This is of course optimized to copy only
the range that is actually valid. If any process root PTEs change,
the updates are written to the process header, and then an
Inter-Processor Interrupt (IPI) tells all other cores to
re-copy process PTEs [0:255] into their private root frames.
One issue I want to eliminate is the above PTE copying on process switch.
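The per-switch copy being eliminated looks roughly like this (the
structures and names are my invention, a sketch only):

```c
#include <stdint.h>
#include <string.h>

#define USER_ROOT_PTES 256

typedef uint64_t pte_t;

/* Hypothetical per-process state: the process half of a root frame
 * plus a count of how many entries are actually valid. */
struct process {
    pte_t    root_ptes[USER_ROOT_PTES];
    unsigned valid_count;
};

/* On a process switch the OS copies the valid prefix of the process's
 * root PTEs into this core's private root frame and clears the rest. */
static void map_process(pte_t *core_root_frame, const struct process *p)
{
    memcpy(core_root_frame, p->root_ptes,
           p->valid_count * sizeof(pte_t));
    memset(core_root_frame + p->valid_count, 0,
           (USER_ROOT_PTES - p->valid_count) * sizeof(pte_t));
}
```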
============================================
In my first cut at improving this, I make CR3 a 2-entry array,
indexed by the virtual address msb. That would make CR3[0] map the
lower process user space, and CR3[1] map the upper system space.
There are now 2 page tables, one for the current process address space
and one for the system address space.
To switch processes, the OS just changes CR3[0] to point to a new process
root frame, leaving CR3[1] alone, so none of this PTE copying is needed.
To optimize page table walks in the above, the CR3 root pointers
as well as interior PTEs allow skipping levels.
For a small page table, CR3[0] can point directly to a tree rooted at level 3.
And this still works with bottom-up translate.
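The 2-entry root selection can be sketched in C (structure and field
names are my invention):

```c
#include <stdint.h>

/* Sketch of a 2-entry CR3, indexed by the VA's most significant bit:
 * cr3[0] maps process user space, cr3[1] maps system space. */
struct cr3_entry {
    uint64_t root_pa;     /* physical address of the table root frame */
    unsigned start_level; /* level skip: where the walk begins */
};

static struct cr3_entry *select_root(struct cr3_entry cr3[2], uint64_t va)
{
    return &cr3[va >> 63]; /* msb = 0 -> user, 1 -> system */
}
```

A process switch is then a single store: overwrite cr3[0] with the new
process's root pointer, leaving cr3[1] untouched.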
Later I wanted to eliminate the TLB lookups for the large linear
allocations of virtual space. To do that, I want some form of
Block Address Translate (BAT) by arithmetic relocation.
To support BAT, CR3 becomes an array of 16 entries indexed by
virtual address bits [63:60], with each entry specifying which
translation method, page table or BAT, to use for that area.
If a CR3-Area entry specifies a page translate method,
then it has details for it, root frame physical address and level #.
If a CR3-Area entry specifies a block address translate method,
then it has details for the BAT such as offset to physical address,
size in bytes, protection, cache control.
CR3-Area[0] can continue to point to a page table mapping user space,
CR3-Area[15] can continue to point to a page table mapping system space.
Additionally it can now add a BAT entry, say CR3-Area[1], mapping the
graphics physical memory into virtual memory,
but now requiring no TLB lookups for that memory area.
To switch processes the OS just switches CR3-Area[0] to point
to that process page table, like before.
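The 16-way dispatch could look like this in C (a sketch; all type and
field names are my invention):

```c
#include <stdint.h>

/* Sketch of the 16-entry CR3-Area array, indexed by VA bits [63:60].
 * Each entry tags the translation method for its 2^60-byte area. */
enum xlate_method { XM_INVALID, XM_PAGE_TABLE, XM_DBT, XM_IBT };

struct cr3_area {
    enum xlate_method method;
    union {
        struct { uint64_t root_pa; unsigned level; } pt;  /* page table */
        struct { uint64_t offset, size, prot; } dbt;      /* direct BAT */
        struct { uint64_t pte_base, size, prot; } ibt;    /* indirect   */
    } u;
};

static struct cr3_area *select_area(struct cr3_area cr3[16], uint64_t va)
{
    return &cr3[(va >> 60) & 0xF];
}
```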
Then later when we started talking about hypervisors
I thought of a new kind of BAT which allowed paging.
It makes use of the fact that the guest OS has already mapped all its
scattered virtual addresses into a compact linear range of GAs
(guest addresses).
So we don't need a multi-level translate tree, just 1 level.
Thus was born the Indirect Block Translate.
>> Method-1 is Page Table Translate (PTT) and it specifies a table root
>> physical address. Address translates on this table can be optimized
>> with level skip and bottom-up.
>>
>> Method-2 is Direct Block Translate (DBT) which specifies an area size,
>> protection, and a 64-bit base address to add to the Effective Address (EA)
>> to produce the physical address. This performs an arithmetic relocation
>> of a contiguous range of EA to a contiguous range of PA directly
>> without any memory reads.
>> This requires that the relocated block be physically contiguous.
>>
>> Method-3 is Indirect Block Translate (IBT), a mixture of 1 & 2.
>> The entry specifies an area size, protection, and 64-bit
>> base address to add to the EA giving an area offset.
>> The page number is extracted from offset bits [63:12] and used as an
>> index into a 1 level physically contiguous PTE vector whose base
>> physical address is specified in the entry. The net result is a
>> single memory access to load a PTE to translate the VA->PA
>> while retaining the ability to page fault in that memory area.
>>
>> For the majority of OSes, the fixed-size kernel space can use 3 DBT
>> areas to relocate the OS code, read-only data, and read-write data.
> <
> <from above>
> So if I use several of these top level descriptors, I could map contiguous
> chunks of memory (with base and bounds limits) for those things which
> are not paged.
Yes. A Direct Block Translate, after range and access checks,
takes the 64-bit effective address and adds a 64-bit offset.
If this is a Guest-OS then that is a GA to pass to HV for its translation.
If this is a HV or bare metal then that is a PA.
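A DBT translation as described is just a bounds check plus an add; a
hedged C sketch (names are mine):

```c
#include <stdbool.h>
#include <stdint.h>

/* Direct Block Translate sketch: after range and access checks, the
 * translation is a single 64-bit add, with no memory reads. On bare
 * metal or in the HV the result is a PA; in a guest it is a GA handed
 * to the HV for its own translation. */
struct dbt {
    uint64_t size;   /* area size in bytes */
    uint64_t offset; /* added to the effective address */
};

static bool dbt_translate(const struct dbt *d, uint64_t ea_in_area,
                          uint64_t *out)
{
    if (ea_in_area >= d->size)
        return false;              /* out of range: fault */
    *out = ea_in_area + d->offset; /* arithmetic relocation */
    return true;
}
```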
> What I do not know is whether HVs page pages that the OS thinks are never
> paged ??
The guest has its own CR3 and the hypervisor has its own CR3, as before.
It's just that each CR3 now specifies up to 16 maps.
The guest OS manages its page tables and BAT entries to map multiple
virtual spaces into the guest "physical" address space,
then the HV takes those GAs and uses whichever method it wants to map
the mostly contiguous GAs to physical memory.
> So, these things might look really good for mapping HV areas that are
> actually never paged. Or, I could give the OS the illusion that certain OS
> pages are contiguous, and let the HV page them as it desires.
Yes. The first is the Direct Block Translate, the second is Indirect.
The Indirect Block Translate does look up addresses in the TLB;
on a TLB miss it reads a PTE and can page fault, but there is no
tree to walk - essentially a 1-level direct map.
Both translations make use of the fact that there are large
lumps of contiguous address ranges that don't need page table trees.
Those contiguous lumps exist either because contiguity is inherent in
the memory, like a graphics card, or because the guest OS has already
spent its time and energy collecting them together.
So there is no reason for the HV to uselessly repeat that work.
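The whole IBT walk fits in a few lines of C; a sketch under my own
assumed PTE layout (bit 0 = present, frame in bits [63:12], all names
hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/* Indirect Block Translate sketch: base + EA gives an area offset; the
 * offset's page number indexes a flat, physically contiguous PTE
 * vector. One memory read on a TLB miss, no tree walk, and a clear
 * PTE can still page fault. */
struct ibt {
    uint64_t size;        /* area size in bytes */
    uint64_t base;        /* added to the EA to form the area offset */
    const uint64_t *ptev; /* base of the one-level PTE vector */
};

static bool ibt_translate(const struct ibt *t, uint64_t ea, uint64_t *pa)
{
    uint64_t off = ea + t->base;
    if (off >= t->size)
        return false;                        /* out of range: fault */
    uint64_t pte = t->ptev[off >> 12];       /* the single memory read */
    if (!(pte & 1))
        return false;                        /* not present: page fault */
    *pa = (pte & ~0xFFFULL) | (off & 0xFFF); /* frame + page offset */
    return true;
}
```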
> Either way, the paging overhead goes down if the top descriptor porthole
> is a PTE rather than a PTP--one access to memory with base and bounds
> limits (where that 1 access is 1/2 a cache line.)
>
> Needs more thought
> <
>
> <snip>
>
> Thanks for the ideas.
You are welcome.