On 1/22/2024 10:09 AM, Scott Lurndal wrote:
>
> mitch...@aol.com (MitchAlsup1) writes:
>> BGB wrote:
>>
>
>>> Much like with MMU:
>>> Only the base level needs to actually handle TLB miss events, and
>>> everything else (nested translation, etc), can be left to software
>>> emulation.
>>
>> Name a single ISA that fakes the TLB ?? (and has an MMU)
>
> MIPS?
>
Hmm...
In my case, the use of a Soft TLB is not strictly required, as the OS may
opt in to using a hardware page-walker "if it exists", with TLB Miss
interrupts mostly happening if no hardware page-walker exists (or if
there is no valid page in the page table).
This allows the option of implementing a nested page-translation
mechanism in the top-level TLB Miss handler (with a guest able to opt
out of the hardware page-walking if it wants to run its own VM, in which
case it will need to recursively emulate the TLB Miss ISRs and LDTLB
handling).
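As a rough sketch of the base case (the two-level table layout and the
helper names like ldtlb() are illustrative assumptions here, not the
actual BJX2 interface):

  #include <stdint.h>

  #define PAGE_PRESENT 1                          /* illustrative */
  extern uint64_t cur_pagedir_phys;               /* hypothetical */
  extern void ldtlb(uint64_t va, uint64_t pte);   /* hypothetical */
  extern void raise_page_fault(uint64_t va);      /* hypothetical, does not return */

  /* Top-level TLB Miss handler doing a two-level software page walk.
     Runs with physical addressing (as noted further below). */
  void tlb_miss_isr(uint64_t fault_va)
  {
      uint64_t *pdir = (uint64_t *)cur_pagedir_phys;
      uint64_t pde = pdir[(fault_va >> 22) & 0x3FF];
      if (!(pde & PAGE_PRESENT))
          raise_page_fault(fault_va);   /* no valid page */

      uint64_t *ptab = (uint64_t *)(pde & ~0xFFFULL);
      uint64_t pte = ptab[(fault_va >> 12) & 0x3FF];
      if (!(pte & PAGE_PRESENT))
          raise_page_fault(fault_va);

      ldtlb(fault_va, pte);             /* install the translation */
  }

The nested case would then rerun a walk like this once per translation
level before installing the final result into the TLB.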
Well, or come up with a convention where the top level can see the VM
state of each guest recursively, so that the top-level ISR can
(directly) handle N levels of nested page-tables (rather than needing to
nest the TLB Miss ISR).
Though, the most likely option for this would be to make the nested VMs
express their VM state using the same context structure as normal
threads/processes, effectively canonizing these parts of the structure
as part of the ISA/ABI spec (with a guest deviating from this structure
coming at a potentially significant performance cost).
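As a minimal sketch of what the canonized parts might look like (field
names and layout are made up for illustration, not an actual ABI):

  #include <stdint.h>

  /* Canonized subset of the per-thread / per-VM context structure. */
  struct vm_context {
      uint64_t gpr_save[64];       /* saved general-purpose registers */
      uint64_t pc, sr, sp;         /* saved PC / status / stack pointer */
      uint64_t pagedir_phys;       /* root of this context's page table */
      struct vm_context *host;     /* enclosing VM level, NULL at the top */
  };

With a host/parent link like this, the top-level ISR could chase the
chain of contexts and walk each level's page table in turn.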
May also make sense to add specific interrupts for specific privileged
instructions, such that common cases like accessing a CR or using an
LDTLB instruction can be trapped more efficiently (IOW: without needing
to disassemble the offending instruction to figure out what to do).
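Say (with made-up vector and helper names), the trap handler could then
dispatch directly on the cause:

  /* All names here are hypothetical. */
  enum { TRAP_CR_ACCESS = 1, TRAP_LDTLB = 2 };
  struct trap_frame;
  extern void emulate_cr_access(struct trap_frame *);
  extern void emulate_ldtlb(struct trap_frame *);
  extern void decode_and_emulate(struct trap_frame *);

  /* The trap cause already identifies the instruction class, so the
     fast paths skip the fetch-and-disassemble step entirely. */
  void priv_trap_isr(struct trap_frame *frame, int trap_cause)
  {
      switch (trap_cause) {
      case TRAP_CR_ACCESS: emulate_cr_access(frame);  break;
      case TRAP_LDTLB:     emulate_ldtlb(frame);      break;
      default:             decode_and_emulate(frame); break;  /* slow path */
      }
  }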
>
>>> Presumably, each core gets its own ISR stack, which should not have any
>>> reason to need to interact with each other.
>>
>> I presume an interrupt can be serviced by any number of cores.
>
> Or restricted to a specific set of cores (i.e. those currently
> owned by the target guest).
>
> The guest OS will generally specify the target virtual core (or set of cores)
> for a specific interrupt. The Hypervisor and/or hardware needs
> to deal with the case where the interrupt arrives while the target
> guest core isn't currently scheduled on a physical core (and poke
> the kernel to schedule the guest optionally). Such as recording
> the pending interrupt and optionally notifying the hypervisor that
> there is a pending guest interrupt so it can schedule the guest
> core(s) on physical cores to handle the interrupt.
>
I am guessing that my assumed approach, of always routing all of the
external hardware interrupts to a specific core, is not typical then?...
Say, only Core=0 or Core=1 will get the interrupts (*).
*: Here, 0 vs 1 is ambiguous partly because '0' was left as "this core",
with the other cores numbered 1-15.
This scheme works directly with < 15 cores, with trickery for 16 cores,
but would require nesting trickery for more cores.
>> I presume that there are a vast number of devices. Each device assigned
>> to a few GuestOSs.
>
> Or, with SR-IOV, virtual functions are assigned to specific guests
> and all interrupts are MSI-X messages from the device to the
> interrupt controller (LAPIC, GIC, etc).
>
> Dealing with inter-processor interrupts in a multicore guest can also
> be tricky; either trapped by the hypervisor or there must be hardware
> support in the interrupt controller to notify the hypervisor that a pending
> guest IPI interrupt has arrived. ARM started with the former behavior, but
> added a mechanism to handle direct injection of interprocessor interrupts
> by the guest, without hypervisor intervention (assuming the guest core
> is currently scheduled on a physical core, otherwise the hypervisor gets
> notified that there is a pending interrupt for a non-scheduled guest
> core).
>
Yeah.
Admittedly, I hadn't really thought about or looked into these parts...
>> I presume the core that services the interrupt (ISR) is running the same
>> GuestOS under the same HyperVisor that initiated the device.
>
> Generally a safe assumption. Note that the guest core may not be
> resident on any physical core when the guest interrupt arrives.
>
Trying to route actual HW interrupts into virtual guest OSs seems like
a pain.
In any case, the interrupt still needs to be delivered to whichever
(possibly non-resident) guest core it targets.
>> I presume the core that services the interrupt was of the lowest priority
>> of all the cores then running that GuestOS.
>> I presume the core that services the interrupt wasted no time in doing so.
>>
>> And the GuestOS decides on how its ISR stack is {formatted, allocated, used,
>> serviced, ...} which can be different for each GuestOS.
>
> To a certain extent, the format of the ISR stack is hardware defined,
> and the rest is completely up to the guest. ARM for example,
> saves the current PC into a system register (ELR_ELx) and switches
> the stack pointer. Everything else is up to the software interrupt
> handler to save/restore. I see little benefit in hardware doing
> any state saving other than that.
>
Mostly agreed.
If ARM goes minimal here, and pretty much nowhere else, this seems
telling...
As I see it, the main limiting factor for interrupt performance is not
the instructions to save and restore the registers, but rather the L1
misses that result from doing so (e.g., saving 64 eight-byte registers
touches 512 bytes, or 8 lines of a 64-byte-line L1, each a potential
miss).
Short of having special core-local SRAM or similar, this cost is
unavoidable.
Currently there is an SRAM region, but it is shared and on the L2 ring,
so while it will not see L2 misses, it has higher access latency than if
it were on the L1 ring.
But, it is debatable whether it actually matters, and there are probably
reasons not to have core-local memory regions.
Still, compared with the RISC-V solution of keeping N copies of the
register file, a core-local SRAM for the ISR stack would be cheap.
But, yeah:
Save PC;
Save any CPU flags/state;
Swap the stacks;
Set CPU state to a supervisor+ISR mode;
Branch to ISR entry point (to an offset in a vector table).
This does work, and seems pretty close to the minimum requirement; I
couldn't really come up with a good way to trim it down much further, at
least not without adding a bunch of extra wonk.
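As a rough C-level model of that sequence (register and flag names here
are illustrative, not the actual BJX2 names):

  #include <stdint.h>

  #define SR_SUPER (1u << 30)       /* illustrative bit positions */
  #define SR_ISR   (1u << 29)

  typedef struct {
      uint64_t pc, sr, sp;          /* live state */
      uint64_t spc, ssr, ssp;       /* shadow copies for ISR entry/exit */
      uint64_t vbr;                 /* vector table base */
  } cpu_state;

  /* What the hardware does on interrupt entry, expressed as C. */
  void interrupt_entry(cpu_state *cpu, int vector)
  {
      cpu->spc = cpu->pc;                /* save PC */
      cpu->ssr = cpu->sr;                /* save CPU flags/state */
      uint64_t t = cpu->sp;              /* swap the stacks */
      cpu->sp  = cpu->ssp;
      cpu->ssp = t;
      cpu->sr |= SR_SUPER | SR_ISR;      /* supervisor+ISR mode */
      cpu->pc  = cpu->vbr + vector * 8;  /* branch via vector-table offset */
  }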
In HW, there are effectively two stack-pointer registers, which swap
places on ISR entry/exit (currently by renumbering the registers in the
decoder).
Can't really get rid of the stack-swap without adding considerably more
wonk to the ISR handling mechanism (if the ISR entry point has 0 free
registers, and no usable stack pointer, well then, we have a bit more of
a puzzle...).
So, a mechanism to swap a pair of stack-pointer registers seemed like a
necessary evil.
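A sketch of the decoder-side renumbering (register numbers are made up
for illustration):

  enum { REG_SP = 15, PREG_SP = 15, PREG_SSP = 64 };  /* hypothetical */

  /* In ISR mode, the architectural SP selects the alternate physical
     register; the "swap" is just a renumbering, so no data moves. */
  int map_sp_reg(int reg, int isr_mode)
  {
      if (reg == REG_SP)
          return isr_mode ? PREG_SSP : PREG_SP;
      return reg;
  }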
With a Soft-TLB, it is also basically required to fall back to physical
addressing for ISRs (and with HW page-walking, if virtual memory were
allowed inside ISRs, it would likely be necessary to switch over to a
different set of page tables from the usermode program).
>
>>
>> If the interrupt occurs often enough to matter, its instructions, data,
>> and translations will be in the cache hierarchy.
>
> Although there has been a great deal of work mitigating the
> number of interrupts (setting interrupt thresholds, RSS,
> polling (DPDK, ODP), etc)
>
> I don't see any advantages to all the fancy hardware interrupt
> proposals from either of you.
?...
In my case, I had not been arguing for any fancy interrupt handling in
hardware...
The most fancy part of my interrupt mechanism is that one can encode the
ID of a core into the value passed to a "TRAPA", and it will redirect
the interrupt to that specific core.
But, this mechanism currently has the limitation of a 4-bit field, so
going beyond ~15 cores is going to require a nesting scheme and bouncing
IPIs across multiple cores.
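As a sketch of the sort of encoding meant here (the field layout is a
guess for illustration, not the actual TRAPA format):

  /* Pack a target core ID and interrupt number into a TRAPA operand:
     low 8 bits = interrupt number, next 4 bits = core (0 = "this core"). */
  #define TRAPA_ARG(core, irq)  ((((core) & 0xF) << 8) | ((irq) & 0xFF))

  /* e.g., send IPI number 3 to core 2: TRAPA #TRAPA_ARG(2, 3) */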
Though, if needed, I could tweak the format slightly in this case, and
maybe expand the Core-ID for IPIs to 8 bits, albeit limiting it to 16
unique IPI interrupt types.
Or, an intermediate option would be a 6-bit field, which would then
require nesting for more than 63 cores.
Doesn't matter much for an FPGA, as with the BJX2 Core I am mostly
limited to 1 or 2 cores on "consumer grade" FPGAs (as for all of the
FPGAs that could fit more than two cores, I can no longer use the free
version of Vivado).
In theory, I could fit a quad-core on a Kintex-325T that I got off
AliExpress (and probably run it at a higher clock speed as well), but I
can't exactly use this FPGA in the free version of Vivado (and the
open-source tools both didn't work for me and put up some "major red
flags" regarding their reverse-engineering strategies; so even if the
tools did work, using them to generate bitstreams for a Kintex-325T or
similar would be legally suspect).
...
Though, apparently, some people are getting higher clock speeds by just
letting the design fail timing and running it anyway (say, if the design
passes timing at 50MHz, it can in theory be pushed up to around 75-100
MHz before it starts glitching out).
I was playing it safe here though.
...