On 5/23/2023 11:22 AM, Dan Cross wrote:
> In article <u4h6lu$2e7an$1...@dont-email.me>, BGB <cr8...@gmail.com> wrote:
>> On 5/22/2023 3:10 PM, Dan Cross wrote:
>> [snip]
>>> L2PT's like the EPT and NPT are wins here; even in the nested
>>> VM case, where we have to resort to shadow paging techniques, we
>>> can handle L2 page faults in the top-level hypervisor.
>>>
>>
>> But, if one uses SW TLB, then NPT (as a concept) has no reason to need
>> to exist...
>
> Yes, at great expense.
>
Doesn't seem all that expensive.
In terms of LUTs, a soft TLB uses far less than a page walker.
And, the TLB doesn't need to have a mechanism to send memory requests
and handle memory responses, ...
It uses some Block RAMs for the TLB, but those aren't too expensive.
In terms of performance, it is generally around 1.5 kilocycles per TLB
miss (*1), but as-is these misses typically only happen roughly 50 to
100 times per second.
On a 50 MHz core, only about 0.2% of the CPU time is going into handling
TLB misses.
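As a quick sanity check on that number (just plugging in the figures
above, not measured code):

  /* Rough overhead estimate: cycles spent in TLB-miss handling per
   * second, divided by the total cycle budget of the core. */
  double tlb_miss_overhead(double cyc_per_miss, double miss_per_sec,
                           double clk_hz)
  {
      return (cyc_per_miss * miss_per_sec) / clk_hz;
  }
  /* tlb_miss_overhead(1500.0, 100.0, 50e6) -> 0.003   (~0.3%)
   * tlb_miss_overhead(1500.0,  50.0, 50e6) -> 0.0015  (~0.15%) */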
Note that a page-fault (saving a memory page to an SD card and loading a
different page) is around 1 megacycle.
*1: Much of this goes into the cost of saving and restoring all the
GPRs, where my ISA has 64x 64-bit GPRs. The per-interrupt cost could be
reduced significantly via register banking, but then one pays a lot more
for registers which are only ever used during interrupt handling.
>>> There's a reason soft-TLBs have basically disappeared. :-)
>>
>> Probably depends some on how the software-managed TLB is implemented.
>
> Not really; the design issues and the impact are both
> well-known. Think through how a nested guest (note, not a
> nested page table, but a recursive instance of a hypervisor)
> would be handled.
>
The emulators for my ISA use a SW TLB, and I don't imagine a hypervisor
would be that much different, except that it would likely use TLB ->
TLB remapping rather than abstracting the whole memory subsystem.
One could also have the guest OS use page-tables FWIW.
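Roughly, the TLB -> TLB remapping idea would look something like this
on the hypervisor side (only a sketch of how I'd imagine it, not
actual hypervisor code; guest_pa_to_host_pa(), hw_ldtlb(), and the
TLBE layout here are placeholders):

  #include <stdint.h>

  #define PA_MASK 0x0000FFFFFFFFF000ULL  /* placeholder PA field */

  extern uint64_t guest_pa_to_host_pa(uint64_t guest_pa); /* hypervisor's
                                                              own mapping */
  extern void     hw_ldtlb(uint64_t tlbe_hi, uint64_t tlbe_lo);

  /* A hardware TLB miss while the guest runs gets reflected to the
   * guest's own miss handler; when that handler executes LDTLB, the
   * hypervisor traps it and rewrites the physical side before loading
   * the real TLB, so the hardware TLB ends up holding guest-VA ->
   * host-PA translations. */
  void on_guest_ldtlb_trap(uint64_t tlbe_hi, uint64_t tlbe_lo)
  {
      uint64_t guest_pa = tlbe_lo & PA_MASK;
      uint64_t host_pa  = guest_pa_to_host_pa(guest_pa);
      tlbe_lo = (tlbe_lo & ~PA_MASK) | (host_pa & PA_MASK);
      hw_ldtlb(tlbe_hi, tlbe_lo);
  }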
I had originally intended to use a firmware-managed TLB with the OS
using page tables, but switched to a plain software TLB mostly because
I ran out of space in the 32K Boot ROM (mostly due to things like
boot-time CPU sanity testing, *).
*: Idea being that during boot, the CPU tests many of the core ISA
features to verify they are working as intended (say, to detect things
like if a change to the Verilog broke the ALU or similar, ...).
Besides the sanity testing, the Boot ROM also contains a FAT filesystem
interface and a PE/COFF / PEL4 loader (well, technically also an ELF
loader, but I am mostly using PEL4).
Where PEL4 is:
  PE/COFF, but without the MZ stub;
  With most of the image compressed using LZ4.
Decompressing LZ4 being faster than reading in more data.
The LZ4 compression seems to work well with binary code vs my own RP2
compression (which works better for general data, but not as well for
machine-code). Both formats being byte-oriented LZ variants (but they
differ in terms of how LZ matches are encoded and similar).
Have observed that LZ4 decompression tends to be slightly faster on
conventional machines (like x86-64), but on my ISA, RP2 is a little faster.
Note that Deflate can give slightly better compression, but is around an
order of magnitude slower.
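For reference, this is part of why LZ4 decoding is cheap: a decoder
for a raw LZ4 block is only a couple dozen lines (sketch of the
standard LZ4 block format, with no bounds checking; any PEL4-specific
container/framing is not shown):

  #include <stdint.h>
  #include <stddef.h>

  /* Decode one raw LZ4 block into dst; returns the decompressed size. */
  size_t lz4_decode_block(const uint8_t *src, size_t srclen, uint8_t *dst)
  {
      const uint8_t *ip = src, *iend = src + srclen;
      uint8_t *op = dst;

      while (ip < iend) {
          int token = *ip++;
          int len = token >> 4;               /* literal run length   */
          if (len == 15) {                    /* length extension     */
              int b;
              do { b = *ip++; len += b; } while (b == 255);
          }
          while (len--) *op++ = *ip++;        /* copy literals        */

          if (ip >= iend) break;              /* last sequence ends
                                                 with literals only   */

          int offset = ip[0] | (ip[1] << 8);  /* 16-bit LE offset     */
          ip += 2;
          len = (token & 15) + 4;             /* match len (min = 4)  */
          if ((token & 15) == 15) {
              int b;
              do { b = *ip++; len += b; } while (b == 255);
          }
          const uint8_t *mp = op - offset;    /* copy from already-   */
          while (len--) *op++ = *mp++;        /* decoded output; may
                                                 overlap (RLE case)   */
      }
      return (size_t)(op - dst);
  }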
Generally, in PEL4, the file headers are left in an uncompressed state,
but all of the section data and similar is LZ compressed.
Where, header magic:
  PE\0\0: Uncompressed
  PEL0: Also uncompressed (similar to PE\0\0)
  PEL3: RP2 compression (not generally used)
  PEL4: LZ4 compression
  PEL6: LZ4LLB (modified LZ4, length-limited encoding)
If the header is 'MZ', it checks for an offset to the start of the PE
header, but then assumes normal (uncompressed) PE/COFF.
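In loader terms, the front-end dispatch would look something like this
(a sketch; the enum names are hypothetical, the magic values are as
listed above):

  #include <stdint.h>
  #include <string.h>

  enum { PEL_UNKNOWN, PEL_RAW, PEL_RP2, PEL_LZ4, PEL_LZ4LLB, PEL_MZ };

  /* Check the 4 bytes at the start of the (possibly compressed) image. */
  int pel_identify(const uint8_t *hdr)
  {
      if (!memcmp(hdr, "PE\0\0", 4)) return PEL_RAW;    /* plain PE/COFF */
      if (!memcmp(hdr, "PEL0",   4)) return PEL_RAW;    /* uncompressed  */
      if (!memcmp(hdr, "PEL3",   4)) return PEL_RP2;    /* RP2           */
      if (!memcmp(hdr, "PEL4",   4)) return PEL_LZ4;    /* LZ4           */
      if (!memcmp(hdr, "PEL6",   4)) return PEL_LZ4LLB; /* modified LZ4  */
      if (hdr[0] == 'M' && hdr[1] == 'Z')
          return PEL_MZ;   /* follow e_lfanew to the PE header, assume
                              normal uncompressed PE/COFF */
      return PEL_UNKNOWN;
  }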
Also PEL4 uses a different checksum algorithm from normal PE/COFF, as
the original checksum algorithm sucked and could not detect some of the
main types of corruption that result from LZ screw-ups.
The "linear sum with carry-folding" was instead replaced with a "linear
sum and sum-of-linear-sums with carry-folding XORed together". It is
significantly faster than something like Adler32 (or CRC32), while still
providing many of the same benefits (namely, better error detection than
the original checksums).
Checksum is verified after the whole image is loaded/decompressed into RAM.
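I'd guess the structure works out to something along these lines
(purely a sketch of the description above, not the actual PEL4
algorithm; the word size, seeds, and final fold are my own
assumptions):

  #include <stdint.h>
  #include <stddef.h>

  /* Sketch: linear sum plus sum-of-linear-sums, each carry-folded back
   * into 32 bits, XORed together.  Fletcher-like, but cheaper than
   * Adler32/CRC32 since there is no modulo or table lookup in the
   * inner loop. */
  uint32_t sketch_checksum(const uint32_t *buf, size_t nwords)
  {
      uint64_t s1 = 0, s2 = 0;
      for (size_t i = 0; i < nwords; i++) {
          s1 += buf[i];   /* linear sum         */
          s2 += s1;       /* sum of linear sums */
      }
      /* carry folding: add the high half back into the low half
       * (twice, to absorb the carry from the first fold) */
      s1 = (s1 & 0xFFFFFFFFU) + (s1 >> 32);
      s1 = (s1 & 0xFFFFFFFFU) + (s1 >> 32);
      s2 = (s2 & 0xFFFFFFFFU) + (s2 >> 32);
      s2 = (s2 & 0xFFFFFFFFU) + (s2 >> 32);
      return (uint32_t)(s1 ^ s2);
  }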
For my ABI, the "Global Pointer" entry in the Data directory was
repurposed to handle a floating "data section", which may be loaded at
a different address from ".text" and friends (so multiple program or
DLL instances can share the same copy of ".text" and similar), with the
base-relocation table being internally split in this area.
There is a GBR register which points to the start of ".data", which in
turn points to a table that the program or DLLs can use to reload their
own corresponding data section into GBR; for "simple case" images, this
is simply a self-pointer.
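Very roughly, the reload sequence would look something like this (a
sketch of my reading of the above; __get_gbr()/__set_gbr() stand in
for whatever intrinsics or inline asm actually touch GBR, and the
indexing is simplified):

  /* GBR points at the current ".data"; the word at the start of
   * ".data" points at a per-process table of data-section base
   * pointers, and each image uses its own slot in that table to
   * recover its own ".data" base.  For a simple single-image process
   * the slot can just point back at the data section itself (the
   * "self-pointer" case). */
  extern void *__get_gbr(void);        /* hypothetical intrinsics */
  extern void  __set_gbr(void *base);

  void reload_data_section(int image_index)
  {
      void **table = *(void ***)__get_gbr();  /* word at GBR+0 -> table   */
      __set_gbr(table[image_index]);          /* GBR = this image's .data */
  }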
Some sections, like the resource section, were effectively replaced
(the resource section now uses a format resembling the Quake "WAD2"
format, just with a different header and with the offsets given as
RVAs). Things like "resource lumps" can then be identified with a
16-character name (typically uncompressed, apart from any compression
due to the PEL4 compression itself, with bitmap images typically stored
in the DIB/BMP format, audio using RIFF/WAVE, ...).
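For reference, a WAD2-style directory entry looks roughly like the
following (this is the classic Quake layout; presumably the PEL4
resource section is similar, just with its own header and with the
position given as an RVA):

  #include <stdint.h>

  typedef struct {
      int32_t filepos;      /* position of the lump (an RVA here)     */
      int32_t disksize;     /* size as stored                         */
      int32_t size;         /* uncompressed size                      */
      int8_t  type;         /* lump type (picture, sound, ...)        */
      int8_t  compression;  /* per-lump compression (0 = none)        */
      int8_t  pad1, pad2;
      char    name[16];     /* 16-character lump name                 */
  } wad2_lumpinfo_t;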
Otherwise, the format is mostly similar to normal PE/COFF.
>> In my case, TLB miss triggers an interrupt, and there is an "LDTLB"
>> instruction which basically means "Take the TLBE from these two
>> registers and shove it into the TLB at the appropriate place".
>
> That's pretty much the way they all work, yes.
>
I think there were some that exposed the TLB as MMIO or similar, where
the ISR handler would then be expected to write the new TLBE into an
MMIO array.
The SH-4 ISA also had something like this (in addition to the LDTLB
instruction), but I didn't keep this feature, and from what I could
tell, the existing OS's (such as the Linux kernel) didn't appear to use
it...
They also used a fully-associative TLB, which is absurdly expensive, so
I dropped to a 4-way set-associative TLB (while also making the TLB a
bit larger).
They had used a 64-entry fully-associative array; I ended up switching
to 256 sets x 4-way, for a total of 1024 TLBEs.
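In rough terms, the lookup side of a 256-set, 4-way TLB amounts to
something like this (a sketch; the page size and TLBE layout here are
placeholders, not the actual encoding):

  #include <stdint.h>

  #define TLB_SETS 256
  #define TLB_WAYS 4

  typedef struct { uint64_t vpn; uint64_t ppn_flags; } tlbe_t;
  static tlbe_t tlb[TLB_SETS][TLB_WAYS];

  /* Returns 1 on hit; on a miss the hardware raises a TLB-miss
   * interrupt and the ISR supplies the new TLBE via LDTLB. */
  int tlb_lookup(uint64_t va, uint64_t *ppn_flags)
  {
      uint64_t vpn = va >> 14;                /* assuming 16K pages      */
      int set = (int)(vpn & (TLB_SETS - 1));  /* low VPN bits pick a set */
      for (int way = 0; way < TLB_WAYS; way++) {
          if (tlb[set][way].vpn == vpn) {
              *ppn_flags = tlb[set][way].ppn_flags;
              return 1;
          }
      }
      return 0;
  }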
So, in this case, the main TLB ends up as roughly half the size of an
L1 cache (in terms of Block RAM), but uses fewer LUTs than an L1 cache.
As-is, a 16K L1 needs roughly 32K of Block RAM (roughly half the space
being eaten by tagging metadata with a 16-byte line size; while a
larger cache-line size would use the BRAMs more efficiently, it would
also result in a significant increase in LUT cost).